
Conversation

@walterddr (Contributor) commented Jan 21, 2021

This is a follow-up on #49869.

Previously, CUDA early termination only happened for generic test classes that extend from DeviceTypeTestBase. JIT test cases, which extend from common_utils.TestCase, could not benefit from the early termination.

This change moves the early termination logic into the common_utils.TestCase class:

  • All tests extending common_utils.TestCase now terminate early if a CUDA assert occurs.
  • TestCases that extend common_device_type.DeviceTypeTestBase still only call torch.cuda.synchronize() when a RuntimeError is thrown.
  • TestCases that extend common_utils.TestCase always synchronize CUDA after each test, regardless of whether the test uses the GPU, as long as torch.cuda.is_initialized() returns true (see the sketch after this list).
  • This behavior is disabled in common_distributed.py.
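
For illustration, a minimal sketch of what the check added to common_utils.TestCase amounts to (the hook, the flag name, and the exact early-termination mechanism here are simplified assumptions, not the literal implementation):

```python
import unittest

import torch


class TestCase(unittest.TestCase):
    # Illustrative flag: subclasses that must not synchronize (e.g. the
    # distributed tests) could flip this off. The name is hypothetical.
    _check_cuda_assert_on_teardown = True

    def tearDown(self):
        # Only synchronize when a CUDA context already exists, so CPU-only
        # runs and tests that never touch the GPU are unaffected.
        if self._check_cuda_assert_on_teardown and torch.cuda.is_initialized():
            try:
                torch.cuda.synchronize()
            except RuntimeError:
                # A device-side assert poisons the CUDA context; running the
                # remaining CUDA tests in this process is pointless, so the
                # real logic stops the test run early at this point.
                raise
```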

@facebook-github-bot added the cla signed and oncall: distributed labels on Jan 21, 2021
@facebook-github-bot (Contributor) commented Jan 21, 2021

💊 CI failures summary and remediations

As of commit ecaec43 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

@walterddr force-pushed the early_terminate_cuda_jit branch 2 times, most recently from f9fa8cd to cc7205b on January 21, 2021 22:02
codecov bot commented Jan 22, 2021

Codecov Report

Merging #50914 (5ac57b4) into master (d5a2429) will increase coverage by 0.13%.
The diff coverage is 82.35%.

@@            Coverage Diff             @@
##           master   #50914      +/-   ##
==========================================
+ Coverage   80.77%   80.90%   +0.13%     
==========================================
  Files        1952     1924      -28     
  Lines      213967   210016    -3951     
==========================================
- Hits       172827   169921    -2906     
+ Misses      41140    40095    -1045     

@walterddr marked this pull request as ready for review on January 22, 2021 15:33
@facebook-github-bot (Contributor) left a comment

@walterddr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@walterddr (Contributor, Author) commented Jan 22, 2021

This change shouldn't affect the behavior of device-generic test cases, so

PYTORCH_TEST_WITH_SLOW=1 python test/test_testing.py -k test_cuda_assert_should_stop -v

should still pass.

@walterddr force-pushed the early_terminate_cuda_jit branch 3 times, most recently from a749d9e to d9275ac on January 22, 2021 22:13
@walterddr marked this pull request as draft on January 23, 2021 15:02
@walterddr force-pushed the early_terminate_cuda_jit branch from f673a9e to 612da61 on January 24, 2021 00:43
@mruberry (Collaborator)

> For TestCases that extend common_utils.TestCase, regardless of whether a test case uses the GPU, CUDA will always be synchronized as long as torch.cuda.is_available() returns true.

How much does this increase test time?

@mruberry (Collaborator) left a comment

Overall looks like a smart generalization of the previous approach. I have a question and a few small inline comments about the draft.

@walterddr (Contributor, Author)

Thanks @mruberry for the suggestion. I will modify the comments.

I converted it back to a draft because I am still trying to avoid blindly running torch.cuda.synchronize() for every test. I think a Python descriptor might be a good way to dynamically determine when to run the CUDA sync. If there's a better solution, please comment and let me know :-)
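
As a rough illustration of the descriptor idea (all names here are hypothetical; this is not code from the PR):

```python
import unittest

import torch


class CudaSyncNeeded:
    """Hypothetical non-data descriptor: the decision to synchronize is made
    lazily, at attribute-access time, so tests that never touch CUDA pay
    nothing for it."""

    def __get__(self, instance, owner):
        return torch.cuda.is_available() and torch.cuda.is_initialized()


class ExampleTestCase(unittest.TestCase):
    _sync_cuda_on_teardown = CudaSyncNeeded()

    def tearDown(self):
        # Evaluates the descriptor at teardown time for this instance.
        if self._sync_cuda_on_teardown:
            torch.cuda.synchronize()
```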

@mruberry (Collaborator)

> Thanks @mruberry for the suggestion. I will modify the comments.
>
> I converted it back to a draft because I am still trying to avoid blindly running torch.cuda.synchronize() for every test. I think a Python descriptor might be a good way to dynamically determine when to run the CUDA sync. If there's a better solution, please comment and let me know :-)

I'm not sure there is a good solution short of converting the test suite to use the device generic test framework properly and inheriting the previous fix.

@walterddr (Contributor, Author) commented Jan 25, 2021

I see. In that case I will do one final round of profiling to check the overhead in test time, and if it looks good I will enable this first and then figure out how to make it more generic.
Since we are mostly on the device-generic test framework now, the only exceptions I saw are JIT and distributed, which both have their own common_*.py utilities. It might be easier to alter JITCommonTestCase and MultiProcessTestCase to do a smarter CUDA sync.

@ngimel (Collaborator) commented Jan 26, 2021

You can use torch.cuda.is_initialized() instead of torch.cuda.is_available(), hopefully that won't always be true.
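
For example, a minimal sketch of the difference between the two guards:

```python
import torch

# is_available() is True on any machine with a usable GPU, even if the test
# process never touched CUDA; is_initialized() only becomes True once a CUDA
# context has actually been created, so it is the cheaper, more precise guard.
if torch.cuda.is_initialized():
    torch.cuda.synchronize()
```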

@walterddr force-pushed the early_terminate_cuda_jit branch from 5cda284 to 6bc9dc8 on January 26, 2021 17:26
@facebook-github-bot (Contributor) left a comment

@walterddr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@walterddr marked this pull request as ready for review on January 28, 2021 21:50
@walterddr changed the title from "Early terminate CUDA on common TestCases as well" to "Early terminate CUDA on common_utils TestCases" on Jan 28, 2021
@mruberry (Collaborator)

How much of a perf impact does this have on test builds with CUDA?

@walterddr (Contributor, Author)

Based on a quick eyeball of CircleCI, I would say < 2% on CI jobs compared with master.

@mruberry (Collaborator)

> Based on a quick eyeball of CircleCI, I would say < 2% on CI jobs compared with master.

OK. The fix looks correct to me. Whether it's worth the perf impact is for @malfet to decide, though.

A Contributor commented:

Hmm, an additional torch.cuda.synchronize() after every test would be quite expensive.
Can you measure the slowdown?
Also, have you checked whether exposing cudaGetLastError to the Python runtime would achieve the same thing but be much faster?
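
For reference, a hypothetical sketch of what querying cudaGetLastError directly could look like, done here with ctypes (the library name, and whether this catches an asynchronous device assert without a sync, are assumptions; this is not an API torch exposes):

```python
import ctypes

# Hypothetical: load the CUDA runtime directly. The exact library name
# (e.g. libcudart.so.11.0) depends on the installed CUDA toolkit.
libcudart = ctypes.CDLL("libcudart.so")


def cuda_last_error() -> int:
    # cudaGetLastError() returns (and clears) the last error recorded by the
    # CUDA runtime as an integer error code; 0 means cudaSuccess.
    return libcudart.cudaGetLastError()


# Caveat: kernel launches are asynchronous, so a device-side assert may only
# be reported after the failing kernel has actually completed; without a
# synchronize this check can race with the kernel.
if cuda_last_error() != 0:
    print("CUDA device-side error detected, terminating test run early")
```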

A Collaborator commented:

torch.cuda.synchronize, if nothing is running on the GPU, takes ~1 us, pretty much like any other PyTorch operation called by the test. CUDA tests are expected to do their own synchronization before this call anyway, in order to check their results.

The PR author commented:

Yeah, I actually ran some Scuba queries to measure the slowdown (this PR against the base master commit). The result is not significant on the cudnn7 CI test jobs, and compared against master commits around that time it is actually sometimes faster.

It would be handy to have @samestep's test-time reporting tool here.

@walterddr (Contributor, Author) commented Feb 3, 2021

Ping on this -- it looks like the latest master failure can be fixed by this PR: https://app.circleci.com/pipelines/github/pytorch/pytorch/268772/workflows/0d84bcf6-8228-4e94-825e-8420270b8409/jobs/10635515/tests#failed-test-0 (test_optim.py does not use the device-generic test case class).

I will try out Sam's #50171 and report the test time increase here

@walterddr (Contributor, Author) commented Feb 8, 2021

Rebased on Sam's reporting diff and the result is promising (#51876).
I guess it is consistent with #49023 (comment):

> • All CUDA tests should already be designed to have an implicit synchronization at the end, when CUDA tensors are copied to the host to be compared with CPU tensors or printed.

The test jobs running on GPU machines showed minimal impact (<2%):
https://app.circleci.com/pipelines/github/pytorch/pytorch/270960/workflows/79c6c620-7789-4fcd-b25f-9c22b3f636d7/jobs/10759657 (-20.47s)
https://app.circleci.com/pipelines/github/pytorch/pytorch/270960/workflows/79c6c620-7789-4fcd-b25f-9c22b3f636d7/jobs/10759659 (+3.63s)
https://app.circleci.com/pipelines/github/pytorch/pytorch/270960/workflows/79c6c620-7789-4fcd-b25f-9c22b3f636d7/jobs/10759660 (+20.97s)
https://app.circleci.com/pipelines/github/pytorch/pytorch/270960/workflows/79c6c620-7789-4fcd-b25f-9c22b3f636d7/jobs/10759661 (+7.18s)
https://app.circleci.com/pipelines/github/pytorch/pytorch/270960/workflows/79c6c620-7789-4fcd-b25f-9c22b3f636d7/jobs/10759662 (+0.12s)
https://app.circleci.com/pipelines/github/pytorch/pytorch/270960/workflows/79c6c620-7789-4fcd-b25f-9c22b3f636d7/jobs/10759663 (+10.35s)

Differences in the following are relatively large (2%-10%), but these jobs also have relatively large variance, and their hosts don't actually have a GPU, so the differences are most likely not related to this PR.
https://app.circleci.com/pipelines/github/pytorch/pytorch/270960/workflows/79c6c620-7789-4fcd-b25f-9c22b3f636d7/jobs/10759656 (-109.77s)
https://app.circleci.com/pipelines/github/pytorch/pytorch/270960/workflows/79c6c620-7789-4fcd-b25f-9c22b3f636d7/jobs/10759658 (+584.90s)
https://app.circleci.com/pipelines/github/pytorch/pytorch/270960/workflows/79c6c620-7789-4fcd-b25f-9c22b3f636d7/jobs/10759664 (+533.23s)
https://app.circleci.com/pipelines/github/pytorch/pytorch/270960/workflows/79c6c620-7789-4fcd-b25f-9c22b3f636d7/jobs/10759665 (+225.92s)

@facebook-github-bot (Contributor) left a comment

@walterddr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@walterddr force-pushed the early_terminate_cuda_jit branch from 5ac57b4 to fa7836d on February 8, 2021 23:10
@facebook-github-bot (Contributor) left a comment

@walterddr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Rong Rong and others added 5 commits February 9, 2021 10:58
@walterddr force-pushed the early_terminate_cuda_jit branch from fa7836d to ecaec43 on February 9, 2021 18:59
@facebook-github-bot (Contributor) left a comment

@walterddr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor)

This pull request has been reverted by 9f1f563.

@walterddr (Contributor, Author)

> This change shouldn't affect the behavior of device-generic test cases, so
>
> PYTORCH_TEST_WITH_SLOW=1 python test/test_testing.py -k test_cuda_assert_should_stop -v
>
> should still pass.

The culprit test that failed was actually named test_cuda_assert_should_not_stop.

@facebook-github-bot (Contributor)

@walterddr merged this pull request in c1b7ca8.

facebook-github-bot pushed a commit that referenced this pull request Feb 12, 2021
Summary:
Take 2 of #50914
This change moves the early termination logic into common_utils.TestCase class.

Pull Request resolved: #52126

Test Plan: CI with ci-all tag

Reviewed By: malfet

Differential Revision: D26391762

Pulled By: walterddr

fbshipit-source-id: a149ecc47ccda7f2795e107fb95915506ae060b4
xsacha pushed a commit to xsacha/pytorch that referenced this pull request Mar 31, 2021
Summary:
This is a follow-up on pytorch#49869.

Previously, CUDA early termination only happened for generic test classes that extend from `DeviceTypeTestBase`. JIT test cases, which extend from common_utils.TestCase, could not benefit from the early termination.

This change moves the early termination logic into the common_utils.TestCase class.
- All tests extending common_utils.TestCase now terminate early if a CUDA assert occurs.
- TestCases that extend common_device_type.DeviceTypeTestBase still only call torch.cuda.synchronize() when a RuntimeError is thrown.
- TestCases that extend common_utils.TestCase always synchronize CUDA after each test, regardless of whether the test uses the GPU, as long as `torch.cuda.is_initialized()` returns true.
- This behavior is disabled in common_distributed.py.

Pull Request resolved: pytorch#50914

Reviewed By: malfet

Differential Revision: D26019289

Pulled By: walterddr

fbshipit-source-id: ddc7c1c0d00db4d073a6c8bc5b7733637a7e77d1
xsacha pushed a commit to xsacha/pytorch that referenced this pull request Mar 31, 2021
Summary:
Take 2 of pytorch#50914
This change moves the early termination logic into common_utils.TestCase class.

Pull Request resolved: pytorch#52126

Test Plan: CI with ci-all tag

Reviewed By: malfet

Differential Revision: D26391762

Pulled By: walterddr

fbshipit-source-id: a149ecc47ccda7f2795e107fb95915506ae060b4

Labels

cla signed, Merged, oncall: distributed, Reverted
