
[NCCL] use cudaEventQuery instead of cudaStreamAddCallback to catch NCCL errors #43232

Closed
osalpekar wants to merge 9 commits

Conversation

osalpekar
Member

@osalpekar osalpekar commented Aug 18, 2020

Stack from ghstack:

**This Commit:**
Here we introduce some experimental lock optimizations. This essentially combines the check for workNCCL completion with the `handleNCCLGuard` error handling so that the lock is acquired once instead of twice. The performance implications of this are still being measured. The first 5 PRs in this stack work perfectly well without this optimization.

**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.

Differential Revision: D22929042
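
As a rough illustration of the single-lock idea (not the PR's actual code; `WorkState` below is a standalone stand-in for the relevant workNCCL members):

```cpp
#include <exception>
#include <mutex>

struct WorkState {
  std::mutex mutex;
  bool completed = false;        // set once the recorded CUDA events have fired
  std::exception_ptr exception;  // set by the NCCL error watchdog thread
};

// Returns true once the work has finished, throws if the watchdog stored an
// error, and returns false while the collective is still running on the GPU.
// The mutex is acquired exactly once per call, instead of once for the
// completion check and again for a separate handleNCCLGuard()-style call.
bool isCompletedAndThrowException(WorkState& work) {
  std::lock_guard<std::mutex> lock(work.mutex);
  if (work.exception) {
    std::rethrow_exception(work.exception);
  }
  return work.completed;
}
```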

…CL errors

This method avoids the expensive serialization from adding a callback
and instead polls the CUDA event to check for completion. It then performs the
same error handling to throw an exception from the workCleanupThread.

Differential Revision: [D22929042](https://our.internmc.facebook.com/intern/diff/D22929042/)

[ghstack-poisoned]
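For illustration, a minimal sketch of the polling approach described above, assuming a hypothetical `eventCompleted` helper and a simplified `workCleanupLoop` standing in for the real workCleanupThread:

```cpp
#include <cuda_runtime.h>

#include <chrono>
#include <stdexcept>
#include <thread>

// Returns true once the event has fired, false while it is still pending, and
// throws if cudaEventQuery reports a real CUDA error.
bool eventCompleted(cudaEvent_t event) {
  cudaError_t err = cudaEventQuery(event);
  if (err == cudaErrorNotReady) {
    return false;
  }
  if (err != cudaSuccess) {
    throw std::runtime_error(cudaGetErrorString(err));
  }
  return true;
}

// Simplified stand-in for the cleanup thread: poll instead of registering a
// host callback with cudaStreamAddCallback, so nothing is serialized onto the
// CUDA stream and the thread can also check watchdog-reported NCCL errors and
// timeouts between queries.
void workCleanupLoop(cudaEvent_t event) {
  while (!eventCompleted(event)) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  // ...here the real code removes the completed work object and rethrows any
  // exception recorded for it...
}
```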
osalpekar added a commit that referenced this pull request Aug 18, 2020
@dr-ci

dr-ci bot commented Aug 18, 2020

💊 CI failures summary and remediations

As of commit 1d0a139 (more details on the Dr. CI page):



❄️ 1 failure tentatively classified as flaky, but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

Sep 03 05:27:03 RuntimeError: Process 0 terminated or timed out after 100.08590054512024 seconds
Sep 03 05:27:03 ====================================================================== 
Sep 03 05:27:03 ERROR [100.110s]: test_failure_recovery (__main__.DistributedDataParallelTest) 
Sep 03 05:27:03 ---------------------------------------------------------------------- 
Sep 03 05:27:03 Traceback (most recent call last): 
Sep 03 05:27:03   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 224, in wrapper 
Sep 03 05:27:03     self._join_processes(fn) 
Sep 03 05:27:03   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 337, in _join_processes 
Sep 03 05:27:03     self._check_return_codes(elapsed_time) 
Sep 03 05:27:03   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 375, in _check_return_codes 
Sep 03 05:27:03     raise RuntimeError('Process {} terminated or timed out after {} seconds'.format(i, elapsed_time)) 
Sep 03 05:27:03 RuntimeError: Process 0 terminated or timed out after 100.08590054512024 seconds 
Sep 03 05:27:03  
Sep 03 05:27:03 ---------------------------------------------------------------------- 
Sep 03 05:27:03 Ran 120 tests in 229.037s 
Sep 03 05:27:03  
Sep 03 05:27:03 FAILED (errors=1, skipped=9) 
Sep 03 05:27:03  
Sep 03 05:27:03 Generating XML reports... 
Sep 03 05:27:03 Generated XML report: test-reports/python-unittest/TEST-CommTest-20200903052314.xml 
Sep 03 05:27:03 Generated XML report: test-reports/python-unittest/TEST-ComputeBucketAssignmentTest-20200903052314.xml 
Sep 03 05:27:03 Generated XML report: test-reports/python-unittest/TEST-DistributedDataParallelTest-20200903052314.xml 

ci.pytorch.org: 2 failed



@osalpekar osalpekar changed the title [WIP] use cudaEventQuery instead of cudaStreamAddCallback to catch NCCL errors [NCCL] use cudaEventQuery instead of cudaStreamAddCallback to catch NCCL errors Aug 18, 2020
osalpekar added a commit that referenced this pull request Aug 19, 2020
osalpekar added a commit that referenced this pull request Aug 20, 2020
osalpekar added a commit that referenced this pull request Aug 20, 2020
osalpekar added a commit that referenced this pull request Aug 27, 2020
osalpekar added a commit that referenced this pull request Sep 3, 2020
osalpekar added a commit that referenced this pull request Sep 3, 2020
Contributor

@pritamdamania87 pritamdamania87 left a comment


Looks good overall, have a few minor comments inline.

Can we also have another PR on top of this to remove the busy waiting in wait() and replace it with cudaEventSynchronize?
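
For reference, such a blocking `wait()` could be sketched roughly as follows (hypothetical helper name; not code from this PR):

```cpp
#include <cuda_runtime.h>

#include <stdexcept>

// Hypothetical blocking wait: cudaEventSynchronize parks the calling host
// thread until the recorded event has completed, so no spin/sleep loop is
// needed.
void waitForWork(cudaEvent_t event) {
  cudaError_t err = cudaEventSynchronize(event);
  if (err != cudaSuccess) {
    throw std::runtime_error(cudaGetErrorString(err));
  }
}
```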

}
return false;
}

void ProcessGroupNCCL::WorkNCCL::handleNCCLGuard() {
Contributor


We no longer need this function?

work->handleNCCLGuard();
// Handle Exceptions on failed GPU operations and remove completed
// workNCCL objects from work vector.
if (work->isCompletedAndThrowException()) {
Contributor


It would be nice to add a LOG(ERROR) before we throw an exception mentioning the following: "Some NCCL operations have timed out/failed and due to the async nature of CUDA kernels subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we're taking the entire process down."
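
A rough sketch of that suggestion, logging via C10's glog-style `LOG(ERROR)` before rethrowing the stored exception (hypothetical helper; the follow-up may word and place this differently):

```cpp
#include <c10/util/Logging.h>

#include <exception>

// Hypothetical helper: log loudly before rethrowing the exception stored for a
// failed or timed-out NCCL work object, so the reason for taking the process
// down is visible even if the exception itself is swallowed higher up.
void throwOnFailedWork(std::exception_ptr exception) {
  if (exception) {
    LOG(ERROR) << "Some NCCL operations have timed out/failed and due to the "
               << "async nature of CUDA kernels, subsequent GPU operations "
               << "might run on corrupted/incomplete data. To avoid this "
               << "inconsistency, we are taking the entire process down.";
    std::rethrow_exception(exception);
  }
}
```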

Member Author


Created #44988 with this change

@osalpekar osalpekar closed this Dec 18, 2020
@facebook-github-bot facebook-github-bot deleted the gh/osalpekar/72/head branch January 18, 2021 15:17