Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Blocking Wait if both Blocking Wait and Async Error Handling Are Set #47926

Closed
wants to merge 1 commit into from

Conversation

osalpekar
Copy link
Member

@osalpekar osalpekar commented Nov 13, 2020

Stack from ghstack:

Given that we're soon enabling async error handling in PET, we should make the behavior explicit when users have set NCCL_BLOCKING_WAIT in their own code while also using PET. This PR essentially gives blocking wait precedence (for now). This way the blast radius of the PET change is smaller, while we continue working with blocking wait users and discussing whether moving to async error handling may be a good fit.

Fixes: #47943

Differential Revision: D24928149

Given that we're soon enabling async error handling in PET, we should make the behavior explicit when users have set NCCL_BLOCKING_WAIT in their own code while also using PET. This PR essentially gives blocking wait precedence (for now). This way the blast radius of the PET change is smaller, while we continue working with blocking wait users and discussing whether moving to async error handling may be a good fit.

Differential Revision: [D24928149](https://our.internmc.facebook.com/intern/diff/D24928149/)

[ghstack-poisoned]
@facebook-github-bot facebook-github-bot added cla signed oncall: distributed Add this issue/PR to distributed oncall triage queue labels Nov 13, 2020
osalpekar added a commit that referenced this pull request Nov 13, 2020
Given that we're soon enabling async error handling in PET, we should make the behavior explicit when users have set NCCL_BLOCKING_WAIT in their own code while also using PET. This PR essentially gives blocking wait precedence (for now). This way the blast radius of the PET change is smaller, while we continue working with blocking wait users and discussing whether moving to async error handling may be a good fit.

Differential Revision: [D24928149](https://our.internmc.facebook.com/intern/diff/D24928149/)

ghstack-source-id: 116553583
Pull Request resolved: #47926
@dr-ci
Copy link

dr-ci bot commented Nov 13, 2020

💊 CI failures summary and remediations

As of commit 675f47a (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


1 failure confirmed as flaky and can be ignored:

  • pytorch_linux_xenial_cuda9_2_cudnn7_py3_gcc5_4_build

This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 2 times.

@codecov
Copy link

codecov bot commented Nov 13, 2020

Codecov Report

Merging #47926 (675f47a) into gh/osalpekar/105/base (6aaf046) will decrease coverage by 0.00%.
The diff coverage is n/a.

@@                    Coverage Diff                    @@
##           gh/osalpekar/105/base   #47926      +/-   ##
=========================================================
- Coverage                  81.25%   81.25%   -0.01%     
=========================================================
  Files                       1838     1838              
  Lines                     198256   198256              
=========================================================
- Hits                      161098   161097       -1     
- Misses                     37158    37159       +1     

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 5d51b63.

tugsbayasgalan pushed a commit to tugsbayasgalan/pytorch that referenced this pull request Nov 16, 2020
…Set (pytorch#47926)

Summary:
Pull Request resolved: pytorch#47926

Given that we're soon enabling async error handling in PET, we should make the behavior explicit when users have set NCCL_BLOCKING_WAIT in their own code while also using PET. This PR essentially gives blocking wait precedence (for now). This way the blast radius of the PET change is smaller, while we continue working with blocking wait users and discussing whether moving to async error handling may be a good fit.
ghstack-source-id: 116553583

Test Plan: Simple FBL run/CI

Reviewed By: jiayisuse

Differential Revision: D24928149

fbshipit-source-id: d42c038ad44607feb3d46dd65925237c564ff7a3
@facebook-github-bot facebook-github-bot deleted the gh/osalpekar/105/head branch November 17, 2020 15:19
Comment on lines +456 to +461
if (blockingWait_ && asyncErrorHandling_) {
LOG(INFO) << "[Rank " << rank_
<< "] NCCL_BLOCKING_WAIT and NCCL_ASYNC_ERROR_HANDLING "
<< "should not both be enabled. "
<< "Only NCCL_BLOCKING_WAIT is being used in this process.";
asyncErrorHandling_ = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@osalpekar Sorry a bit late on this, but I think we should instead throw an error if both NCCL_BLOCKING_WAIT and NCCL_ASYNC_ERROR_HANDLING is set and ask user to set only one. It is usually confusing for users if we have such behavior where we unset one option. User's might not look closely at the logs and would feel that there might be a bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants