
[NCCL] use cudaEventQuery instead of cudaStreamAddCallback to catch NCCL errors #43232

Closed
osalpekar wants to merge 9 commits

Conversation

osalpekar
Member

@osalpekar osalpekar commented Aug 18, 2020

Stack from ghstack:

**This Commit:**
Here we introduce some experimental lock optimizations. This essentially combines the check for workNCCL completion with the `handleNCCLGuard` error handling so that the lock is acquired once instead of twice. The performance implications of this are still being measured. The first 5 PRs in this stack work perfectly well without this optimization.

**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.

Differential Revision: D22929042
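
As a rough illustration of the single-lock idea (not the PR's actual code; `WorkState` below is a standalone stand-in for the relevant workNCCL members):

```cpp
#include <exception>
#include <mutex>

struct WorkState {
  std::mutex mutex;
  bool completed = false;        // set once the recorded CUDA events have fired
  std::exception_ptr exception;  // set by the NCCL error watchdog thread
};

// Returns true once the work has finished, throws if the watchdog stored an
// error, and returns false while the collective is still running on the GPU.
// The mutex is acquired exactly once per call, instead of once for the
// completion check and again for a separate handleNCCLGuard()-style call.
bool isCompletedAndThrowException(WorkState& work) {
  std::lock_guard<std::mutex> lock(work.mutex);
  if (work.exception) {
    std::rethrow_exception(work.exception);
  }
  return work.completed;
}
```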

…CL errors

This method avoids the expensive serialization from adding a callback
and instead polls the CUDA event to check for completion. It then performs the
same error handling to throw an exception from the workCleanupThread.

Differential Revision: [D22929042](https://our.internmc.facebook.com/intern/diff/D22929042/)

[ghstack-poisoned]
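For illustration, a minimal sketch of the polling approach described above, assuming a hypothetical `eventCompleted` helper and a simplified `workCleanupLoop` standing in for the real workCleanupThread:

```cpp
#include <cuda_runtime.h>

#include <chrono>
#include <stdexcept>
#include <thread>

// Returns true once the event has fired, false while it is still pending, and
// throws if cudaEventQuery reports a real CUDA error.
bool eventCompleted(cudaEvent_t event) {
  cudaError_t err = cudaEventQuery(event);
  if (err == cudaErrorNotReady) {
    return false;
  }
  if (err != cudaSuccess) {
    throw std::runtime_error(cudaGetErrorString(err));
  }
  return true;
}

// Simplified stand-in for the cleanup thread: poll instead of registering a
// host callback with cudaStreamAddCallback, so nothing is serialized onto the
// CUDA stream and the thread can also check watchdog-reported NCCL errors and
// timeouts between queries.
void workCleanupLoop(cudaEvent_t event) {
  while (!eventCompleted(event)) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  // ...here the real code removes the completed work object and rethrows any
  // exception recorded for it...
}
```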
osalpekar added a commit that referenced this pull request Aug 18, 2020
@dr-ci

dr-ci bot commented Aug 18, 2020

💊 CI failures summary and remediations

As of commit 1d0a139 (more details on the Dr. CI page):



❄️ 1 failure tentatively classified as flaky, but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

Sep 03 05:27:03 RuntimeError: Process 0 terminated or timed out after 100.08590054512024 seconds
Sep 03 05:27:03 ====================================================================== 
Sep 03 05:27:03 ERROR [100.110s]: test_failure_recovery (__main__.DistributedDataParallelTest) 
Sep 03 05:27:03 ---------------------------------------------------------------------- 
Sep 03 05:27:03 Traceback (most recent call last): 
Sep 03 05:27:03   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 224, in wrapper 
Sep 03 05:27:03     self._join_processes(fn) 
Sep 03 05:27:03   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 337, in _join_processes 
Sep 03 05:27:03     self._check_return_codes(elapsed_time) 
Sep 03 05:27:03   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 375, in _check_return_codes 
Sep 03 05:27:03     raise RuntimeError('Process {} terminated or timed out after {} seconds'.format(i, elapsed_time)) 
Sep 03 05:27:03 RuntimeError: Process 0 terminated or timed out after 100.08590054512024 seconds 
Sep 03 05:27:03  
Sep 03 05:27:03 ---------------------------------------------------------------------- 
Sep 03 05:27:03 Ran 120 tests in 229.037s 
Sep 03 05:27:03  
Sep 03 05:27:03 FAILED (errors=1, skipped=9) 
Sep 03 05:27:03  
Sep 03 05:27:03 Generating XML reports... 
Sep 03 05:27:03 Generated XML report: test-reports/python-unittest/TEST-CommTest-20200903052314.xml 
Sep 03 05:27:03 Generated XML report: test-reports/python-unittest/TEST-ComputeBucketAssignmentTest-20200903052314.xml 
Sep 03 05:27:03 Generated XML report: test-reports/python-unittest/TEST-DistributedDataParallelTest-20200903052314.xml 

ci.pytorch.org: 2 failed



@osalpekar osalpekar changed the title [WIP] use cudaEventQuery instead of cudaStreamAddCallback to catch NCCL errors [NCCL] use cudaEventQuery instead of cudaStreamAddCallback to catch NCCL errors Aug 18, 2020
osalpekar added a commit that referenced this pull request Aug 19, 2020
osalpekar added a commit that referenced this pull request Aug 20, 2020
osalpekar added a commit that referenced this pull request Aug 20, 2020
osalpekar added a commit that referenced this pull request Aug 27, 2020
osalpekar added a commit that referenced this pull request Sep 3, 2020
osalpekar added a commit that referenced this pull request Sep 3, 2020
Contributor

@pritamdamania87 pritamdamania87 left a comment


Looks good overall, have a few minor comments inline.

Can we also have another PR on top of this to remove the busy waiting in wait() and replace it with cudaEventSynchronize?
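
For reference, such a blocking `wait()` could be sketched roughly as follows (hypothetical helper name; not code from this PR):

```cpp
#include <cuda_runtime.h>

#include <stdexcept>

// Hypothetical blocking wait: cudaEventSynchronize parks the calling host
// thread until the recorded event has completed, so no spin/sleep loop is
// needed.
void waitForWork(cudaEvent_t event) {
  cudaError_t err = cudaEventSynchronize(event);
  if (err != cudaSuccess) {
    throw std::runtime_error(cudaGetErrorString(err));
  }
}
```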

}
return false;
}

void ProcessGroupNCCL::WorkNCCL::handleNCCLGuard() {
Contributor


We no longer need this function?

work->handleNCCLGuard();
// Handle Exceptions on failed GPU operations and remove completed
// workNCCL objects from work vector.
if (work->isCompletedAndThrowException()) {
Contributor


It would be nice to add a LOG(ERROR) before we throw an exception mentioning the following: "Some NCCL operations have timed out/failed and due to the async nature of CUDA kernels subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we're taking the entire process down."
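
A rough sketch of that suggestion, logging via C10's glog-style `LOG(ERROR)` before rethrowing the stored exception (hypothetical helper; the follow-up may word and place this differently):

```cpp
#include <c10/util/Logging.h>

#include <exception>

// Hypothetical helper: log loudly before rethrowing the exception stored for a
// failed or timed-out NCCL work object, so the reason for taking the process
// down is visible even if the exception itself is swallowed higher up.
void throwOnFailedWork(std::exception_ptr exception) {
  if (exception) {
    LOG(ERROR) << "Some NCCL operations have timed out/failed and due to the "
               << "async nature of CUDA kernels, subsequent GPU operations "
               << "might run on corrupted/incomplete data. To avoid this "
               << "inconsistency, we are taking the entire process down.";
    std::rethrow_exception(exception);
  }
}
```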

Member Author


Created #44988 with this change

@osalpekar osalpekar closed this Dec 18, 2020
@facebook-github-bot facebook-github-bot deleted the gh/osalpekar/72/head branch January 18, 2021 15:17