Early terminate CUDA on common_utils TestCases #50914
Conversation
💊 CI failures summary and remediations — as of commit ecaec43 (more details on the Dr. CI page): 💚 Looks good so far! There are no failures yet. 💚 (This comment was automatically generated by Dr. CI.)
Force-pushed from f9fa8cd to cc7205b.
Codecov Report
@@            Coverage Diff             @@
##           master   #50914      +/-   ##
==========================================
+ Coverage   80.77%   80.90%   +0.13%
==========================================
  Files        1952     1924      -28
  Lines      213967   210016    -3951
==========================================
- Hits       172827   169921    -2906
+ Misses      41140    40095    -1045
@walterddr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
This change shouldn't affect the behavior of device-generic test cases, so they should still pass.
Force-pushed from a749d9e to d9275ac.
Force-pushed from f673a9e to 612da61.
How much does this increase test time?
Overall looks like a smart generalization of the previous approach. I have a question and a few small inline comments about the draft.
Thanks @mruberry for the suggestion. I will modify the comments. I converted it back to a draft because I am still trying to avoid blindly running …
I'm not sure there is a good solution short of converting the test suite to use the device-generic test framework properly and inheriting the previous fix.
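For readers unfamiliar with it, the device-generic framework mentioned here instantiates a generic test class once per available device. Below is a minimal sketch of the usage pattern based on torch.testing._internal; the exact module paths and generated class names may differ across PyTorch versions, and the test class and test body are illustrative only.

```python
import torch
from torch.testing._internal.common_utils import TestCase, run_tests
from torch.testing._internal.common_device_type import instantiate_device_type_tests


class TestArithmetic(TestCase):
    # The framework passes a `device` string ("cpu", "cuda:0", ...) to each test.
    def test_add(self, device):
        x = torch.ones(3, device=device)
        self.assertEqual((x + x).sum().item(), 6.0)


# Generates per-device classes (e.g. TestArithmeticCPU, TestArithmeticCUDA)
# in this module's namespace so unittest can discover them.
instantiate_device_type_tests(TestArithmetic, globals())

if __name__ == "__main__":
    run_tests()
```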
I see. In this case I will do one final round of profiling to check the overhead in test time, and if it looks good I will enable this first, then figure out how to make it more generic.
You can use …
Force-pushed from 5cda284 to 6bc9dc8.
@walterddr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
How much of a perf impact does this have on test builds with CUDA?
Based on a quick eyeball of CircleCI, I would say < 2% on CI jobs compared with master.
OK. The fix looks correct to me. Whether it's worth the perf impact is for @malfet to decide, though.
Hmm, an additional torch.cuda.synchronize() after every test would be quite expensive.
Can you measure the slowdown?
Also, have you checked whether exposing cudaGetLastError to the Python runtime would achieve the same result but be much faster?
torch.cuda.synchronize, if nothing is running on the GPU, takes ~1 µs, pretty much like any other PyTorch operation called by the test. CUDA tests are expected to do their own synchronization before this call anyway, to check their results.
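For reference, a quick way to sanity-check that claim on a machine with a GPU; this is a rough micro-benchmark sketch, not part of the PR.

```python
import timeit

import torch

if torch.cuda.is_available():
    torch.cuda.init()
    torch.cuda.synchronize()  # flush any pending work first
    n = 10_000
    # With no kernels in flight, each synchronize() is expected to take
    # on the order of a microsecond, i.e. negligible per test.
    elapsed = timeit.timeit(torch.cuda.synchronize, number=n)
    print(f"torch.cuda.synchronize(): {elapsed / n * 1e6:.2f} us per call on an idle GPU")
```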
Yeah. I actually ran some Scuba queries to measure the slowdown (this PR against the base master commit).
The result is not significant on the cudnn7 CI jobs, and when comparing against master commits from around that time, it is actually sometimes faster.
It would be handy to have @samestep's test time reporting tool here.
Ping on this -- it looks like the latest master failure can be fixed by this PR: https://app.circleci.com/pipelines/github/pytorch/pytorch/268772/workflows/0d84bcf6-8228-4e94-825e-8420270b8409/jobs/10635515/tests#failed-test-0 (test_optim.py is not using the device-generic test case class). I will try out Sam's #50171 and report the test time increase here.
@walterddr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Force-pushed from 5ac57b4 to fa7836d.
@walterddr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
This could slow down non-GPU tests, since it runs torch.cuda.synchronize() whenever CUDA is available, not only when CUDA is the target device.
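As the summary notes, the synchronization is gated on CUDA having been initialized in the process rather than on the test's target device. A minimal sketch of that guard follows; the class name and the choice of tearDown as the hook are illustrative assumptions, not the actual common_utils.TestCase implementation.

```python
import unittest

import torch


class CudaSyncingTestCase(unittest.TestCase):
    def tearDown(self):
        # Sync only if some test in this process has already touched CUDA.
        # Note: once any test initializes CUDA, later CPU-only tests in the
        # same process still pay for this call, which is the concern above.
        if torch.cuda.is_available() and torch.cuda.is_initialized():
            torch.cuda.synchronize()
```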
Force-pushed from fa7836d to ecaec43.
@walterddr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
This pull request has been reverted by 9f1f563.
Culprit test: the failing one was named test_cuda_assert_should_not_stop.
@walterddr merged this pull request in c1b7ca8.
Summary: Take 2 of #50914. This change moves the early termination logic into the common_utils.TestCase class.
Pull Request resolved: #52126
Test Plan: CI with the ci-all tag
Reviewed By: malfet
Differential Revision: D26391762
Pulled By: walterddr
fbshipit-source-id: a149ecc47ccda7f2795e107fb95915506ae060b4
Summary: This is a follow-up on pytorch#49869. Previously, CUDA early termination only happened for generic test classes that extend from `DeviceTypeTestBase`; JIT test cases, which extend from common_utils.TestCase, could not benefit from it. This change moves the early termination logic into the common_utils.TestCase class:
- All tests extending common_utils.TestCase now early terminate if a CUDA assert occurs.
- TestCases extending common_device_type.DeviceTypeTestBase still only call torch.cuda.synchronize() when a RuntimeError is thrown.
- TestCases extending common_utils.TestCase always synchronize CUDA, regardless of whether a test uses the GPU, as long as `torch.cuda.is_initialized()` returns true.
- This is disabled in common_distributed.py.
Pull Request resolved: pytorch#50914
Reviewed By: malfet
Differential Revision: D26019289
Pulled By: walterddr
fbshipit-source-id: ddc7c1c0d00db4d073a6c8bc5b7733637a7e77d1
This is a follow-up on #49869.
Previously, CUDA early termination only happened for generic test classes that extend from DeviceTypeTestBase. JIT test cases, which extend from common_utils.TestCase, could not benefit from the early termination. This change moves the early termination logic into the common_utils.TestCase class:
- All tests extending common_utils.TestCase now early terminate if a CUDA assert occurs.
- TestCases extending common_device_type.DeviceTypeTestBase still only call torch.cuda.synchronize() when a RuntimeError is thrown.
- TestCases extending common_utils.TestCase always synchronize CUDA, regardless of whether a test uses the GPU, as long as torch.cuda.is_initialized() returns true.
- This is disabled in common_distributed.py.
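A minimal sketch of the early-termination idea described above, written against plain unittest; the flag name, the skip message, and the use of setUp/tearDown as hooks are illustrative assumptions, not the actual common_utils.TestCase implementation.

```python
import unittest

import torch


class EarlyTerminatingTestCase(unittest.TestCase):
    # Class-level flag: once a CUDA device-side assert poisons the context,
    # the remaining tests in this class are skipped instead of failing noisily.
    _cuda_asserted = False

    def setUp(self):
        if type(self)._cuda_asserted:
            self.skipTest("an earlier test triggered a CUDA device-side assert")

    def tearDown(self):
        # Flush outstanding CUDA work so a device-side assert surfaces in the
        # test that caused it rather than in an unrelated later test.
        if torch.cuda.is_available() and torch.cuda.is_initialized():
            try:
                torch.cuda.synchronize()
            except RuntimeError:
                type(self)._cuda_asserted = True
                raise
```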