@mcarilli
Collaborator

@mcarilli mcarilli commented Jun 19, 2020

supporting #40358

@mcarilli mcarilli changed the title [DO NOT MERGE] Trying #40178 DDP's tests on master without grad layout diffs [DO NOT REVIEW] Trying #40178 DDP's tests on master without grad layout diffs Jun 19, 2020
@dr-ci

dr-ci bot commented Jun 19, 2020

💊 CI failures summary and remediations

As of commit b21cfa1 (more details on the Dr. CI page):



❄️ 1 failure tentatively classified as flaky

but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_linux_xenial_cuda10_1_cudnn7_py3_multigpu_test (1/1)

Step: "Run tests" ❄️

Jun 22 22:58:34 RuntimeError: Process 0 terminated or timed out after 100.07569599151611 seconds
Jun 22 22:58:34 ====================================================================== 
Jun 22 22:58:34 ERROR [100.087s]: test_grad_layout_1devicemodule_2replicaperprocess (__main__.DistributedDataParallelTest) 
Jun 22 22:58:34 ---------------------------------------------------------------------- 
Jun 22 22:58:34 Traceback (most recent call last): 
Jun 22 22:58:34   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 204, in wrapper 
Jun 22 22:58:34     self._join_processes(fn) 
Jun 22 22:58:34   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 306, in _join_processes 
Jun 22 22:58:34     self._check_return_codes(elapsed_time) 
Jun 22 22:58:34   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 344, in _check_return_codes 
Jun 22 22:58:34     raise RuntimeError('Process {} terminated or timed out after {} seconds'.format(i, elapsed_time)) 
Jun 22 22:58:34 RuntimeError: Process 0 terminated or timed out after 100.07569599151611 seconds 
Jun 22 22:58:34  
Jun 22 22:58:34 ====================================================================== 
Jun 22 22:58:34 ERROR [1.918s]: test_param_layout_mismatch_error (__main__.DistributedDataParallelTest) 
Jun 22 22:58:34 ---------------------------------------------------------------------- 
Jun 22 22:58:34 Traceback (most recent call last): 
Jun 22 22:58:34   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 204, in wrapper 
Jun 22 22:58:34     self._join_processes(fn) 
Jun 22 22:58:34   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 306, in _join_processes 
Jun 22 22:58:34     self._check_return_codes(elapsed_time) 
Jun 22 22:58:34   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 339, in _check_return_codes 

Extra GitHub checks: 2 failed


This comment was automatically generated by Dr. CI.

This comment has been revised 46 times.
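
For readers unfamiliar with the error in the log above, here is a minimal sketch of the kind of check that produces the "terminated or timed out" message. It is not the code in `common_distributed.py`: the function name `check_return_codes` and its list-of-exit-codes argument are hypothetical simplifications, and the assumption is that a worker whose exit code is still `None` after the join deadline is reported as terminated or timed out. Only the message format is taken from the traceback.

```python
def check_return_codes(exit_codes, elapsed_time):
    """Raise if any worker process never reported an exit code.

    exit_codes: one entry per worker process; None means the process had
    not exited by the time the harness stopped waiting (an assumption made
    for this sketch, not a statement about the real harness).
    elapsed_time: seconds the harness spent waiting for the workers.
    """
    for i, code in enumerate(exit_codes):
        if code is None:
            # Message format copied from the traceback in the CI log.
            raise RuntimeError(
                'Process {} terminated or timed out after {} seconds'.format(
                    i, elapsed_time
                )
            )


if __name__ == "__main__":
    # Both workers exited, so nothing is raised.
    check_return_codes([0, 0], 1.9)

    # Worker 0 never exited, so the harness reports a hang.
    try:
        check_return_codes([None, 0], 100.0)
    except RuntimeError as err:
        print(err)
```

Under that assumption, the 100-second figure in the log would reflect the harness's wait limit being reached rather than an assertion failing inside the test body, which is consistent with the hang described elsewhere in this thread.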

@mcarilli mcarilli changed the title [DO NOT REVIEW] Trying #40178 DDP's tests on master without grad layout diffs [DO NOT REVIEW] Trying #40358 DDP tests on master without grad layout diffs Jun 21, 2020
facebook-github-bot pushed a commit that referenced this pull request Jun 23, 2020
… memory layout (#40358)

Summary:
#40129 fixed the error responsible for the first revert, but exposed another error in the same test.

This PR is intended as the "master copy" for merge, and it runs on full CI.
Two other PRs (restricted to run on a small subset of CI) support debugging DDP failures/hangs with multiple devices per process (`test_c10d.py:DistributedDataParallelTest.test_grad_layout_1devicemodule_2replicaperprocess`):
- #40290 tries the test with purely row-major contiguous params on an untouched master. In other words, #40290 contains none of this PR's diffs aside from the test itself.
- #40178, for comparison, tries the test with this PR's diffs.

Both fail the same way, indicating that the failure is unrelated to this PR's other diffs.
Pull Request resolved: #40358

Differential Revision: D22165785

Pulled By: albanD

fbshipit-source-id: ac7cdd79af5c080ab74341671392dca8e717554e
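
As background for the "row-major contiguous" wording in the commit summary above, the following sketch computes element strides for a 4-D shape under two memory layouts. It is pure Python and does not call PyTorch. The helper names are hypothetical, and treating channels-last (NHWC) as the other layout of interest is an assumption inferred from the branch name `nhwc_accumulate_grad_testsonly`.

```python
def row_major_strides(shape):
    """Strides (in elements) for a row-major contiguous tensor."""
    strides = []
    step = 1
    # The last dimension varies fastest, so walk the shape right to left.
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def channels_last_strides(shape):
    """Strides (in elements) for an (N, C, H, W) tensor stored as NHWC."""
    n, c, h, w = shape
    # Channels vary fastest, then width, then height, then batch.
    return (h * w * c, 1, w * c, c)


def offset(index, strides):
    """Linear storage offset of a multi-dimensional index."""
    return sum(i * s for i, s in zip(index, strides))


if __name__ == "__main__":
    shape = (2, 3, 4, 5)  # N, C, H, W
    print(row_major_strides(shape))      # (60, 20, 5, 1)
    print(channels_last_strides(shape))  # (60, 1, 15, 3)
    # The same logical element lands at different storage offsets.
    print(offset((0, 1, 0, 0), row_major_strides(shape)))      # 20
    print(offset((0, 1, 0, 0), channels_last_strides(shape)))  # 1
```

The two layouts describe the same logical tensor but place elements at different storage offsets. That difference is why running the test with purely row-major params on an untouched master is a useful control.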
@ngimel
Collaborator

ngimel commented Jul 7, 2020

Closing, reopen if needed.

@ngimel ngimel closed this Jul 7, 2020
@facebook-github-bot facebook-github-bot deleted the ci-all/nhwc_accumulate_grad_testsonly branch January 27, 2021 18:26