Enable test_distributed for ROCm but only with nccl backend [REDUX] #32551
Conversation
💊 CircleCI build failures summary and remediations

As of commit 1e35a0d:

Detailed failure analysis: the probable reasons each build failed can be explored interactively on the Dr. CI website.

🕵️ 1 new failure recognized by patterns. The following build failures do not appear to be due to upstream breakage.
Successful CI Run 1: 2 failing checks: clang-tidy and pytorch_macos_10_13_py3_test. Neither failure seems to be related to the PR changes.
@pytorchbot retest this please
The test is failing: https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py3.6-clang7-rocmdeb-ubuntu16.04-trigger/18270
Yes, thank you for putting the details here. I'm running the tests locally to see if I can reproduce the failure.
…due to known comgr issue (flaky?)
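The exact test and skip condition referenced in the commit above are not shown in this thread. As a rough, hedged illustration of how a ROCm-specific skip is usually expressed in PyTorch tests, a sketch follows; the class and test names are hypothetical, and the TEST_WITH_ROCM import path may differ between PyTorch versions.

```python
# Minimal sketch only: the real skip lives in the PR's test changes and is not
# reproduced here. TEST_WITH_ROCM is the usual PyTorch test flag; the import
# path, class name, and test name below are illustrative assumptions.
import unittest

from torch.testing._internal.common_utils import TEST_WITH_ROCM


class ExampleDistributedTest(unittest.TestCase):
    @unittest.skipIf(TEST_WITH_ROCM, "known comgr issue on ROCm (possibly flaky)")
    def test_example_collective(self):
        # Hypothetical placeholder body; the real test exercises a distributed op.
        self.assertTrue(True)


if __name__ == "__main__":
    unittest.main()
```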
Successful CI Run 2: 2 failing checks: pytorch_linux_xenial_py3_6_gcc5_4_build and pytorch_macos_10_13_py3_test. Neither failure seems to be related to the PR changes. Testing yet again for good measure. @pytorchbot retest this please.
Successful CI Run 3: 2 failing checks: pytorch_linux_xenial_py3_6_gcc5_4_build and pytorch_macos_10_13_py3_test. Neither failure seems to be related to the PR changes. @bddppq Seems ready for review and merge.
@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@ezyang I notice there are some trivial conflicts in test/run_test.py. Would you like me to resolve them here, or will they be resolved while importing into Phabricator?
If you could resolve them, that would be great; sorry about the delay.
Nice, thanks!
All 10 CI runs above passed.
Thanks!
@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
…ytorch#32551)

Summary:
This is a redux of the original PR pytorch#28814, which was reverted in PR pytorch#29736 because test_DistributedDataParallel was suspected of being flaky. Further investigation revealed it wasn't flakiness but a bug in the PyTorch source code, which has since been fixed in PR pytorch#32356. This PR is another attempt at enabling the test_distributed unit test suite, but only for the nccl backend.

Pull Request resolved: pytorch#32551
Differential Revision: D19729966
Pulled By: bddppq
fbshipit-source-id: 12a0d850991a903cc7723d63693b6157071d7115
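The actual diff (test/run_test.py and the distributed test harness) is not reproduced in this thread. As a hedged sketch of the pattern the summary describes, restricting a distributed test run to the nccl backend under ROCm might look like the following. BACKEND and TEST_WITH_ROCM follow common PyTorch test conventions, but the exact names, import paths, and the example test class are assumptions, not the PR's code.

```python
# Illustrative sketch only, not the actual PR diff: skip every backend except
# nccl when the suite runs under ROCm. BACKEND is conventionally passed via the
# environment by the test runner; TEST_WITH_ROCM comes from the PyTorch test
# utilities (import path may differ by version).
import os
import sys
import unittest

from torch.testing._internal.common_utils import TEST_WITH_ROCM

BACKEND = os.environ.get("BACKEND", "nccl")

if TEST_WITH_ROCM and BACKEND != "nccl":
    # On ROCm only the nccl (RCCL) backend is exercised; gloo and mpi runs of
    # this suite are skipped entirely.
    print(f"Skipping distributed tests for backend '{BACKEND}' on ROCm")
    sys.exit(0)


class ExampleNcclOnlyTest(unittest.TestCase):
    def test_backend_gate(self):
        # Hypothetical check: on ROCm we only ever reach this point with nccl.
        if TEST_WITH_ROCM:
            self.assertEqual(BACKEND, "nccl")


if __name__ == "__main__":
    unittest.main()
```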