
[wip][ci-all] fix processgroupnccl profiling #48664

Closed
wants to merge 11 commits

Conversation

rohan-varma (Member)

Differential Revision: D25250227

[ghstack-poisoned]

Fixes #{issue number}

@dr-ci

dr-ci bot commented Dec 1, 2020

💊 CI failures summary and remediations

As of commit bfc4f77 (more details on the Dr. CI page):



🕵️ 5 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_test (1/5)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Dec 07 23:16:34 [E request_callback_no_python.cpp:636] Received error while processing request type 258: RuntimeError: Can not pickle torch.futures.Future
Dec 07 23:16:34 At: 
Dec 07 23:16:34   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(120): serialize 
Dec 07 23:16:34   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(172): serialize 
Dec 07 23:16:34  
Dec 07 23:16:34 [E request_callback_no_python.cpp:636] Received error while processing request type 258: RuntimeError: Can not pickle torch.futures.Future 
Dec 07 23:16:34  
Dec 07 23:16:34 At: 
Dec 07 23:16:34   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(120): serialize 
Dec 07 23:16:34   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(172): serialize 
Dec 07 23:16:34  
Dec 07 23:16:34 [E request_callback_no_python.cpp:636] Received error while processing request type 258: RuntimeError: Can not pickle torch.futures.Future 
Dec 07 23:16:34  
Dec 07 23:16:34 At: 
Dec 07 23:16:34   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(120): serialize 
Dec 07 23:16:34   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(172): serialize 
Dec 07 23:16:34  
Dec 07 23:16:34 [W tensorpipe_agent.cpp:547] RPC agent for worker3 encountered error when reading incoming request from worker0: EOF: end of file (this is expected to happen during shutdown) 
Dec 07 23:16:35 ok (1.636s) 
Dec 07 23:16:36   test_return_future_remote (__main__.TensorPipeRpcTestWithSpawn) ... [W tensorpipe_agent.cpp:547] RPC agent for worker1 encountered error when reading incoming request from worker0: EOF: end of file (this is expected to happen during shutdown) 
Dec 07 23:16:36 [W tensorpipe_agent.cpp:547] RPC agent for worker3 encountered error when reading incoming request from worker2: EOF: end of file (this is expected to happen during shutdown) 
Dec 07 23:16:36 [W tensorpipe_agent.cpp:547] RPC agent for worker0 encountered error when reading incoming request from worker2: EOF: end of file (this is expected to happen during shutdown) 
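
The repeated `Can not pickle torch.futures.Future` errors above are raised while the RPC layer serializes a response whose value is a `torch.futures.Future`; the surrounding `test_return_future_remote` lines suggest these are expected errors from negative tests rather than new breakage. A minimal sketch of the pattern that triggers this serialization error (hypothetical worker names and setup, not the actual test code):

```python
# Hypothetical minimal repro of the serialization error (not the actual
# test code): returning a torch.futures.Future from an RPC target forces
# the RPC layer to pickle it when building the response, which raises
# "RuntimeError: Can not pickle torch.futures.Future" on the callee.
import os
import torch
import torch.distributed.rpc as rpc

def returns_future():
    fut = torch.futures.Future()
    fut.set_result(torch.ones(2))
    return fut  # the Future itself becomes the response payload

def run(rank, world_size=2):
    # Launch this on two processes, e.g. with torch.multiprocessing.spawn.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
    if rank == 0:
        try:
            rpc.rpc_sync("worker1", returns_future)
        except RuntimeError as err:
            print("remote serialization failed:", err)
    rpc.shutdown()
```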

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_nogpu_NO_AVX_test (2/5)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Dec 07 23:44:58 [E request_callback_no_python.cpp:636] Received error while processing request type 258: RuntimeError: Can not pickle torch.futures.Future
Dec 07 23:44:58 At: 
Dec 07 23:44:58   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(120): serialize 
Dec 07 23:44:58   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(172): serialize 
Dec 07 23:44:58  
Dec 07 23:44:58 [E request_callback_no_python.cpp:636] Received error while processing request type 258: RuntimeError: Can not pickle torch.futures.Future 
Dec 07 23:44:58  
Dec 07 23:44:58 At: 
Dec 07 23:44:58   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(120): serialize 
Dec 07 23:44:58   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(172): serialize 
Dec 07 23:44:58  
Dec 07 23:44:58 [E request_callback_no_python.cpp:636] Received error while processing request type 258: RuntimeError: Can not pickle torch.futures.Future 
Dec 07 23:44:58  
Dec 07 23:44:58 At: 
Dec 07 23:44:58   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(120): serialize 
Dec 07 23:44:58   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(172): serialize 
Dec 07 23:44:58  
Dec 07 23:44:58 [W tensorpipe_agent.cpp:547] RPC agent for worker3 encountered error when reading incoming request from worker0: EOF: end of file (this is expected to happen during shutdown) 
Dec 07 23:44:58 [W tensorpipe_agent.cpp:547] RPC agent for worker2 encountered error when reading incoming request from worker1: EOF: end of file (this is expected to happen during shutdown) 
Dec 07 23:44:58 /opt/conda/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /var/lib/jenkins/workspace/c10/cuda/CUDAFunctions.cpp:104.) 
Dec 07 23:44:58   return torch._C._cuda_getDeviceCount() > 0 
Dec 07 23:44:58 [W tensorpipe_agent.cpp:547] RPC agent for worker1 encountered error when reading incoming request from worker0: EOF: end of file (this is expected to happen during shutdown) 
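
The `CUDA initialization: Found no NVIDIA driver` warning above is emitted the first time `torch.cuda.is_available()` (via `torch._C._cuda_getDeviceCount()`) runs on the nogpu CI image. A small illustrative guard for GPU-only test paths on such a machine (not the CI's actual mechanism):

```python
# Illustrative guard for GPU-only tests on a machine without an NVIDIA
# driver: torch.cuda.is_available() returns False there (after emitting
# the warning above) and the test is skipped instead of failing.
import unittest
import torch

class MaybeCudaTest(unittest.TestCase):
    @unittest.skipUnless(torch.cuda.is_available(), "needs a CUDA device and driver")
    def test_on_gpu(self):
        x = torch.ones(4, device="cuda")
        self.assertEqual(x.sum().item(), 4.0)

if __name__ == "__main__":
    unittest.main()
```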

See CircleCI build pytorch_parallelnative_linux_xenial_py3_6_gcc5_4_test (3/5)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Dec 07 23:20:32 [E request_callback_no_python.cpp:636] Received error while processing request type 258: RuntimeError: Can not pickle torch.futures.Future
Dec 07 23:20:32 At: 
Dec 07 23:20:32   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(120): serialize 
Dec 07 23:20:32   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(172): serialize 
Dec 07 23:20:32  
Dec 07 23:20:32 [E request_callback_no_python.cpp:636] Received error while processing request type 258: RuntimeError: Can not pickle torch.futures.Future 
Dec 07 23:20:32  
Dec 07 23:20:32 At: 
Dec 07 23:20:32   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(120): serialize 
Dec 07 23:20:32   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(172): serialize 
Dec 07 23:20:32  
Dec 07 23:20:32 [E request_callback_no_python.cpp:636] Received error while processing request type 258: RuntimeError: Can not pickle torch.futures.Future 
Dec 07 23:20:32  
Dec 07 23:20:32 At: 
Dec 07 23:20:32   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(120): serialize 
Dec 07 23:20:32   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(172): serialize 
Dec 07 23:20:32  
Dec 07 23:20:32 [W tensorpipe_agent.cpp:547] RPC agent for worker2 encountered error when reading incoming request from worker1: EOF: end of file (this is expected to happen during shutdown) 
Dec 07 23:20:32 [W tensorpipe_agent.cpp:547] RPC agent for worker0 encountered error when reading incoming request from worker1: EOF: end of file (this is expected to happen during shutdown) 
Dec 07 23:20:32 ok (1.836s) 
Dec 07 23:20:34   test_return_future_remote (__main__.TensorPipeRpcTestWithSpawn) ... [W tensorpipe_agent.cpp:547] RPC agent for worker1 encountered error when reading incoming request from worker0: EOF: end of file (this is expected to happen during shutdown) 
Dec 07 23:20:34 ok (1.836s) 

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_nogpu_NO_AVX2_test (4/5)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Dec 07 23:42:07 [E request_callback_no_python.cpp:636] Received error while processing request type 258: RuntimeError: Can not pickle torch.futures.Future
Dec 07 23:42:07 At: 
Dec 07 23:42:07   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(120): serialize 
Dec 07 23:42:07   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(172): serialize 
Dec 07 23:42:07  
Dec 07 23:42:07 [E request_callback_no_python.cpp:636] Received error while processing request type 258: RuntimeError: Can not pickle torch.futures.Future 
Dec 07 23:42:07  
Dec 07 23:42:07 At: 
Dec 07 23:42:07   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(120): serialize 
Dec 07 23:42:07   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(172): serialize 
Dec 07 23:42:07  
Dec 07 23:42:07 [E request_callback_no_python.cpp:636] Received error while processing request type 258: RuntimeError: Can not pickle torch.futures.Future 
Dec 07 23:42:07  
Dec 07 23:42:07 At: 
Dec 07 23:42:07   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(120): serialize 
Dec 07 23:42:07   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(172): serialize 
Dec 07 23:42:07  
Dec 07 23:42:07 [W tensorpipe_agent.cpp:547] RPC agent for worker2 encountered error when reading incoming request from worker1: EOF: end of file (this is expected to happen during shutdown) 
Dec 07 23:42:07 [W tensorpipe_agent.cpp:547] RPC agent for worker3 encountered error when reading incoming request from worker0: EOF: end of file (this is expected to happen during shutdown) 
Dec 07 23:42:07 /opt/conda/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /var/lib/jenkins/workspace/c10/cuda/CUDAFunctions.cpp:104.) 
Dec 07 23:42:07   return torch._C._cuda_getDeviceCount() > 0 
Dec 07 23:42:07 /opt/conda/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /var/lib/jenkins/workspace/c10/cuda/CUDAFunctions.cpp:104.) 

See CircleCI build pytorch_paralleltbb_linux_xenial_py3_6_gcc5_4_test (5/5)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Dec 07 23:28:26 [E request_callback_no_python.cpp:636] Received error while processing request type 258: RuntimeError: Can not pickle torch.futures.Future
Dec 07 23:28:26 At: 
Dec 07 23:28:26   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(120): serialize 
Dec 07 23:28:26   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(172): serialize 
Dec 07 23:28:26  
Dec 07 23:28:26 [E request_callback_no_python.cpp:636] Received error while processing request type 258: RuntimeError: Can not pickle torch.futures.Future 
Dec 07 23:28:26  
Dec 07 23:28:26 At: 
Dec 07 23:28:26   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(120): serialize 
Dec 07 23:28:26   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(172): serialize 
Dec 07 23:28:26  
Dec 07 23:28:26 [E request_callback_no_python.cpp:636] Received error while processing request type 258: RuntimeError: Can not pickle torch.futures.Future 
Dec 07 23:28:26  
Dec 07 23:28:26 At: 
Dec 07 23:28:26   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(120): serialize 
Dec 07 23:28:26   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(172): serialize 
Dec 07 23:28:26  
Dec 07 23:28:26 [W tensorpipe_agent.cpp:547] RPC agent for worker1 encountered error when reading incoming request from worker0: EOF: end of file (this is expected to happen during shutdown) 
Dec 07 23:28:26 [W tensorpipe_agent.cpp:547] RPC agent for worker2 encountered error when reading incoming request from worker0: EOF: end of file (this is expected to happen during shutdown) 
Dec 07 23:28:26 ok (1.735s) 
Dec 07 23:28:28   test_return_future_remote (__main__.TensorPipeRpcTestWithSpawn) ... [W tensorpipe_agent.cpp:547] RPC agent for worker2 encountered error when reading incoming request from worker0: EOF: end of file (this is expected to happen during shutdown) 
Dec 07 23:28:28 [W tensorpipe_agent.cpp:547] RPC agent for worker0 encountered error when reading incoming request from worker1: EOF: end of file (this is expected to happen during shutdown) 

1 job timed out:

  • pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test

❄️ 5 failures tentatively classified as flaky, but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test2 (1/5)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

Dec 07 23:36:02 RuntimeError: Process 0 terminated or timed out after 500.07820081710815 seconds
Dec 07 23:36:02 ====================================================================== 
Dec 07 23:36:02 ERROR [500.098s]: test_grad_layout_1devicemodule_1replicaperprocess (__main__.DistributedDataParallelTest) 
Dec 07 23:36:02 ---------------------------------------------------------------------- 
Dec 07 23:36:02 Traceback (most recent call last): 
Dec 07 23:36:02   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 278, in wrapper 
Dec 07 23:36:02     self._join_processes(fn) 
Dec 07 23:36:02   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 395, in _join_processes 
Dec 07 23:36:02     self._check_return_codes(elapsed_time) 
Dec 07 23:36:02   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 436, in _check_return_codes 
Dec 07 23:36:02     raise RuntimeError('Process {} terminated or timed out after {} seconds'.format(i, elapsed_time)) 
Dec 07 23:36:02 RuntimeError: Process 0 terminated or timed out after 500.07820081710815 seconds 
Dec 07 23:36:02  
Dec 07 23:36:02 ====================================================================== 
Dec 07 23:36:02 FAIL [5.002s]: test_default_store_timeout_nccl (__main__.TimeoutTest) 
Dec 07 23:36:02 ---------------------------------------------------------------------- 
Dec 07 23:36:02 Traceback (most recent call last): 
Dec 07 23:36:02   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1404, in wrapper 
Dec 07 23:36:02     return func(*args, **kwargs) 
Dec 07 23:36:02   File "distributed/test_c10d.py", line 635, in test_default_store_timeout_nccl 
Dec 07 23:36:02     self._test_default_store_timeout("nccl") 
Dec 07 23:36:02   File "distributed/test_c10d.py", line 620, in _test_default_store_timeout 

See CircleCI build pytorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_test (2/5)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

Dec 08 01:32:19 RuntimeError: Process 0 terminated or timed out after 400.03903794288635 seconds
Dec 08 01:32:19 ====================================================================== 
Dec 08 01:32:19 ERROR [400.065s]: test_ddp_uneven_inputs (__main__.TestDistBackendWithSpawn) 
Dec 08 01:32:19 ---------------------------------------------------------------------- 
Dec 08 01:32:19 Traceback (most recent call last): 
Dec 08 01:32:19   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 278, in wrapper 
Dec 08 01:32:19     self._join_processes(fn) 
Dec 08 01:32:19   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 395, in _join_processes 
Dec 08 01:32:19     self._check_return_codes(elapsed_time) 
Dec 08 01:32:19   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 436, in _check_return_codes 
Dec 08 01:32:19     raise RuntimeError('Process {} terminated or timed out after {} seconds'.format(i, elapsed_time)) 
Dec 08 01:32:19 RuntimeError: Process 0 terminated or timed out after 400.03903794288635 seconds 
Dec 08 01:32:19  
Dec 08 01:32:19 ---------------------------------------------------------------------- 
Dec 08 01:32:19 Ran 163 tests in 6403.538s 
Dec 08 01:32:19  
Dec 08 01:32:19 FAILED (errors=4, skipped=105) 
Dec 08 01:32:19  
Dec 08 01:32:19 Generating XML reports... 
Dec 08 01:32:19 Generated XML report: test-reports/dist-nccl/TEST-TestDistBackendWithSpawn-20201207234535.xml 
Dec 08 01:32:19 Traceback (most recent call last): 
Dec 08 01:32:19   File "test/run_test.py", line 874, in <module> 

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test1 (3/5)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

Dec 08 01:31:38 RuntimeError: Process 0 terminated or timed out after 400.0543022155762 seconds
Dec 08 01:31:38 ====================================================================== 
Dec 08 01:31:38 ERROR [400.080s]: test_ddp_uneven_inputs (__main__.TestDistBackendWithSpawn) 
Dec 08 01:31:38 ---------------------------------------------------------------------- 
Dec 08 01:31:38 Traceback (most recent call last): 
Dec 08 01:31:38   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 278, in wrapper 
Dec 08 01:31:38     self._join_processes(fn) 
Dec 08 01:31:38   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 395, in _join_processes 
Dec 08 01:31:38     self._check_return_codes(elapsed_time) 
Dec 08 01:31:38   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 436, in _check_return_codes 
Dec 08 01:31:38     raise RuntimeError('Process {} terminated or timed out after {} seconds'.format(i, elapsed_time)) 
Dec 08 01:31:38 RuntimeError: Process 0 terminated or timed out after 400.0543022155762 seconds 
Dec 08 01:31:38  
Dec 08 01:31:38 ---------------------------------------------------------------------- 
Dec 08 01:31:38 Ran 163 tests in 6395.893s 
Dec 08 01:31:38  
Dec 08 01:31:38 FAILED (errors=4, skipped=105) 
Dec 08 01:31:38  
Dec 08 01:31:38 Generating XML reports... 
Dec 08 01:31:38 Generated XML report: test-reports/dist-nccl/TEST-TestDistBackendWithSpawn-20201207234502.xml 
Dec 08 01:31:38 Traceback (most recent call last): 
Dec 08 01:31:38   File "test/run_test.py", line 874, in <module> 

See CircleCI build pytorch_linux_xenial_cuda9_2_cudnn7_py3_gcc7_test (4/5)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

Dec 08 01:27:33 RuntimeError: Process 0 terminated or timed out after 400.05966448783875 seconds
Dec 08 01:27:33 ====================================================================== 
Dec 08 01:27:33 ERROR [400.085s]: test_ddp_uneven_inputs (__main__.TestDistBackendWithSpawn) 
Dec 08 01:27:33 ---------------------------------------------------------------------- 
Dec 08 01:27:33 Traceback (most recent call last): 
Dec 08 01:27:33   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 278, in wrapper 
Dec 08 01:27:33     self._join_processes(fn) 
Dec 08 01:27:33   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 395, in _join_processes 
Dec 08 01:27:33     self._check_return_codes(elapsed_time) 
Dec 08 01:27:33   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 436, in _check_return_codes 
Dec 08 01:27:33     raise RuntimeError('Process {} terminated or timed out after {} seconds'.format(i, elapsed_time)) 
Dec 08 01:27:33 RuntimeError: Process 0 terminated or timed out after 400.05966448783875 seconds 
Dec 08 01:27:33  
Dec 08 01:27:33 ---------------------------------------------------------------------- 
Dec 08 01:27:33 Ran 163 tests in 6406.710s 
Dec 08 01:27:33  
Dec 08 01:27:33 FAILED (errors=4, skipped=105) 
Dec 08 01:27:33  
Dec 08 01:27:33 Generating XML reports... 
Dec 08 01:27:33 Generated XML report: test-reports/dist-nccl/TEST-TestDistBackendWithSpawn-20201207234046.xml 
Dec 08 01:27:33 Traceback (most recent call last): 
Dec 08 01:27:33   File "test/run_test.py", line 874, in <module> 

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_slow_test (5/5)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

Dec 08 02:12:08 unknown file: Failure
Dec 08 02:12:00 done blocking all streams 
Dec 08 02:12:00 Starting to sleep 
Dec 08 02:12:00 done sleeping 
Dec 08 02:12:08 Rank 0 done with sleep  
Dec 08 02:12:08 RANK 0 calling syncStreams  
Dec 08 02:12:08 0 is int  
Dec 08 02:12:08  blocking stream for device 0 
Dec 08 02:12:08 1 is int  
Dec 08 02:12:08  blocking stream for device 1 
Dec 08 02:12:08 done blocking all streams 
Dec 08 02:12:08 unknown file: Failure 
Dec 08 02:12:08 C++ exception with description "NCCL communicator was aborted." thrown in the test body. 
Dec 08 02:12:08 [  FAILED  ] ProcessGroupNCCLErrorsTest.testNCCLErrorsBlocking (22881 ms) 
Dec 08 02:12:08 [ RUN      ] ProcessGroupNCCLErrorsTest.testNCCLTimedoutErrorsBlocking 
Dec 08 02:12:16 Rank 0 done with sleep  
Dec 08 02:12:16 RANK 0 calling syncStreams  
Dec 08 02:12:16 0 is int  
Dec 08 02:12:16  blocking stream for device 0 
Dec 08 02:12:16 1 is int  
Dec 08 02:12:16  blocking stream for device 1 
Dec 08 02:12:16 done blocking all streams 
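
The `NCCL communicator was aborted` failure above comes from the C++ `ProcessGroupNCCLErrorsTest` suite, which exercises blocking-wait error handling. A rough Python-side sketch of the same error path (hypothetical setup, not the actual gtest code): with `NCCL_BLOCKING_WAIT=1`, a collective that the peer never joins fails inside `wait()` once the process group timeout elapses instead of hanging the rank forever.

```python
# Hypothetical Python-side sketch of the blocking-wait NCCL error path
# (the failing test itself is C++ gtest code): with NCCL_BLOCKING_WAIT=1,
# a collective the peer never joins fails inside wait() after the
# process group timeout, rather than blocking indefinitely.
import os
from datetime import timedelta
import torch
import torch.distributed as dist

def run(rank, world_size=2):
    # Launch one process per GPU, e.g. with torch.multiprocessing.spawn;
    # MASTER_ADDR/MASTER_PORT are placeholder rendezvous settings.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29501")
    os.environ["NCCL_BLOCKING_WAIT"] = "1"
    dist.init_process_group(
        "nccl", rank=rank, world_size=world_size, timeout=timedelta(seconds=10)
    )
    tensor = torch.ones(1, device=f"cuda:{rank}")
    if rank == 0:
        # Only rank 0 issues the collective, so it can never complete;
        # blocking wait surfaces the timeout/abort as a RuntimeError here.
        work = dist.all_reduce(tensor, async_op=True)
        work.wait()
```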

Extra GitHub checks: 1 failed


ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI (expand for details). Follow this link to opt out of these comments for your pull requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 69 times.

@facebook-github-bot (Contributor) left a comment


@rohan-varma has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@rohan-varma rohan-varma removed the request for review from albanD December 4, 2020 21:58
@rohan-varma rohan-varma closed this Dec 9, 2020
@facebook-github-bot facebook-github-bot deleted the ci-all/rohan/nccl_prof_fix branch January 27, 2021 18:26
Labels: cla signed, oncall: distributed