Conversation

@jithunnair-amd
Collaborator

This is a redux of the original PR #28814, which was reverted in PR #29736 because test_DistributedDataParallel was suspected of being flaky. Further investigation revealed it wasn't flakiness but a bug in the PyTorch source code, which has now been fixed in PR #32356. This PR is another attempt at enabling the test_distributed unit test suite, only for the nccl backend.
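For readers unfamiliar with how backend gating is usually done, here is a minimal sketch of what "enable a distributed suite only for the nccl backend" can look like in a unittest-based harness. The class name and test name below are hypothetical and the BACKEND environment-variable convention is an assumption for illustration; this is not the PR's actual diff.

```python
# Hedged sketch: skip the whole suite unless the backend selected via the BACKEND
# environment variable is nccl. Names here are hypothetical.
import os
import unittest

BACKEND = os.environ.get("BACKEND", "")


@unittest.skipUnless(BACKEND == "nccl", "suite enabled only for the nccl backend")
class TestDistributedNcclOnly(unittest.TestCase):
    def test_backend_is_nccl(self):
        # A real suite would exercise torch.distributed collectives; this stub
        # only demonstrates the backend gating.
        self.assertEqual(BACKEND, "nccl")


if __name__ == "__main__":
    unittest.main()
```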

@kostmo
Member

kostmo commented Jan 23, 2020

💊 CircleCI build failures summary and remediations

As of commit 1e35a0d:

  • 1/1 failures introduced in this PR

Detailed failure analysis

One may explore the probable reasons each build failed interactively on the Dr. CI website.

🕵️ 1 new failure recognized by patterns

The following build failures do not appear to be due to upstream breakage:

See CircleCI build pytorch_linux_xenial_py2_7_9_test (1/1)

Step: "Test" (full log | pattern match details)

Feb 06 23:11:08 RuntimeError: test_dataloader failed!
Feb 06 23:11:08 Ran 62 tests in 84.445s 
Feb 06 23:11:08  
Feb 06 23:11:08 FAILED (failures=1, skipped=8) 
Feb 06 23:11:08  
Feb 06 23:11:08 Generating XML reports... 
Feb 06 23:11:08 Traceback (most recent call last): 
Feb 06 23:11:08   File "test/run_test.py", line 486, in <module> 
Feb 06 23:11:08     main() 
Feb 06 23:11:08   File "test/run_test.py", line 479, in main 
Feb 06 23:11:08     raise RuntimeError(message) 
Feb 06 23:11:08 RuntimeError: test_dataloader failed! 
Feb 06 23:11:08 + cleanup 
Feb 06 23:11:08 + retcode=1 
Feb 06 23:11:08 + set +x 
Feb 06 23:11:08 =================== sccache compilation log =================== 
Feb 06 23:11:08 ERROR:sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/tmp/torch_extensions/test_compilation_error_formatting/main.cpp: In function \'int main()\':\n/tmp/torch_extensions/test_compilation_error_formatting/main.cpp:2:23: error: expected \';\' before \'}\' token\n int main() { return 0 }\n                       ^\n" } 
Feb 06 23:11:08  
Feb 06 23:11:08 =========== If your build fails, please take a look at the log above for possible reasons =========== 
Feb 06 23:11:08 Compile requests                 52 
Feb 06 23:11:08 Compile requests executed        30 
Feb 06 23:11:08 Cache hits                       23 
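
The traceback above comes from the test runner rather than from the data-loader tests themselves: test/run_test.py launches each test module and raises RuntimeError("<module> failed!") when the module exits non-zero. (The sccache "Compilation failed" entry appears to come from test_compilation_error_formatting, which deliberately compiles invalid code, so it is not itself the failure.) Below is a simplified sketch of that runner pattern, with a made-up helper name; it mirrors the behaviour, not the implementation, of run_test.py.

```python
# Hedged sketch of the failure path in the traceback above: run one test module
# in a subprocess and raise RuntimeError if it reports failures.
import subprocess
import sys


def run_test_module(name: str) -> None:
    ret = subprocess.call([sys.executable, f"test/{name}.py"])
    if ret != 0:
        raise RuntimeError(f"{name} failed!")


if __name__ == "__main__":
    run_test_module("test_dataloader")
```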

This comment was automatically generated by Dr. CI. Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker.

This comment has been revised 5 times.

@jithunnair-amd
Collaborator Author

jithunnair-amd commented Jan 24, 2020

Successful CI Run 1: two failing checks (clang-tidy and pytorch_macos_10_13_py3_test), neither of which seems related to the PR changes.

@jithunnair-amd
Collaborator Author

@pytorchbot retest this please

@ngimel ngimel added the oncall: distributed (Add this issue/PR to distributed oncall triage queue) label Jan 25, 2020
@ngimel ngimel requested a review from bddppq January 25, 2020 23:27
@ngimel ngimel added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Jan 25, 2020
@bddppq
Contributor

bddppq commented Jan 28, 2020

Test is failing https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py3.6-clang7-rocmdeb-ubuntu16.04-trigger/18270

17:51:01 ======================================================================
17:51:01 ERROR: test_broadcast_coalesced_nccl (__main__.CommTest)
17:51:01 ----------------------------------------------------------------------
17:51:01 Traceback (most recent call last):
17:51:01   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 175, in wrapper
17:51:01     self._join_processes(fn)
17:51:01   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 256, in _join_processes
17:51:01     self._check_return_codes(elapsed_time)
17:51:01   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 275, in _check_return_codes
17:51:01     raise RuntimeError('Process {} terminated or timed out after {} seconds'.format(i, elapsed_time))
17:51:01 RuntimeError: Process 1 terminated or timed out after 156.1109459400177 seconds
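
For reference, the traceback bottoms out in the multiprocess test harness: each rank runs in its own process, and the base class joins them against a deadline and raises if any rank crashed or hung. Below is a simplified sketch of that mechanism; it is an illustration only, not the actual torch.testing._internal.common_distributed code, and the timeout and world size are assumed values.

```python
# Hedged sketch of the join/timeout check behind the error above.
import multiprocessing as mp
import time

JOIN_TIMEOUT = 300  # seconds (assumed)


def _worker(rank: int) -> None:
    # A real test would run the collective under test here (e.g. a broadcast);
    # a rank that crashes or hangs here is what produces the error above.
    time.sleep(1)


def run_ranks(world_size: int = 2) -> None:
    start = time.time()
    procs = [mp.Process(target=_worker, args=(rank,)) for rank in range(world_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join(timeout=JOIN_TIMEOUT)
    elapsed = time.time() - start
    for i, p in enumerate(procs):
        if p.is_alive() or p.exitcode != 0:
            raise RuntimeError(
                "Process {} terminated or timed out after {} seconds".format(i, elapsed)
            )


if __name__ == "__main__":
    run_ranks()
```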

@jithunnair-amd
Collaborator Author

Test is failing https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py3.6-clang7-rocmdeb-ubuntu16.04-trigger/18270 […]

Yes, thank you for putting the details here. I'm running the tests locally to see if I can reproduce the failure.
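
For what it's worth, the traceback shows the case running as __main__.CommTest, i.e. the test file executed directly as a script, so one way to reproduce locally is to run just that case. The file path below is an assumption about where CommTest lived at the time; it is not stated anywhere in this thread.

```python
# Hedged local-repro sketch; "test/test_c10d.py" is an assumed location for CommTest.
import subprocess
import sys

subprocess.run(
    [sys.executable, "test/test_c10d.py", "CommTest.test_broadcast_coalesced_nccl"],
    check=True,
)
```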

@jithunnair-amd
Collaborator Author

Successful CI Run 2: two failing checks (pytorch_linux_xenial_py3_6_gcc5_4_build and pytorch_macos_10_13_py3_test), neither of which seems related to the PR changes.

Testing yet again for good measure.

@pytorchbot retest this please.

@jithunnair-amd
Collaborator Author

Successful CI Run 3: two failing checks (pytorch_linux_xenial_py3_6_gcc5_4_build and pytorch_macos_10_13_py3_test), neither of which seems related to the PR changes.

@bddppq This seems ready for review and merge.

@jithunnair-amd jithunnair-amd requested a review from ezyang February 3, 2020 17:46
Contributor

@facebook-github-bot facebook-github-bot left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@jithunnair-amd
Collaborator Author

@ezyang I notice there are some trivial conflicts in test/run_test.py. Would you like me to resolve them here, or will they be resolved while importing into Phabricator?

@ezyang
Contributor

ezyang commented Feb 6, 2020

If you could resolve them, that would be great; sorry about the delay.

@jithunnair-amd
Collaborator Author

Nice, thanks!

@bddppq bddppq requested a review from xw285cornell February 6, 2020 23:45
@jithunnair-amd
Collaborator Author

All 10 CI runs above passed.

Contributor

@bddppq bddppq left a comment

Thanks!

@bddppq bddppq added the module: rocm (AMD GPU support for PyTorch) label Feb 7, 2020
Contributor

@facebook-github-bot facebook-github-bot left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@bddppq merged this pull request in 3c4cec5.

ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020
…ytorch#32551)

Summary:
This is a redux of the original PR pytorch#28814, which was reverted in PR pytorch#29736 because test_DistributedDataParallel was suspected of being flaky. Further investigation revealed it wasn't flakiness but a bug in the PyTorch source code, which has now been fixed in PR pytorch#32356. This PR is another attempt at enabling the test_distributed unit test suite, only for the nccl backend.
Pull Request resolved: pytorch#32551

Differential Revision: D19729966

Pulled By: bddppq

fbshipit-source-id: 12a0d850991a903cc7723d63693b6157071d7115

Labels

Merged, module: rocm (AMD GPU support for PyTorch), oncall: distributed (Add this issue/PR to distributed oncall triage queue), open source, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
