[PGNCCL] Enable non-blocking API mode by default #137544
Status: Closed
Conversation
CI status (hud.pytorch.org/pr/137544): as of commit fa2869a with merge base 195d0a6, 3 new job failures.
Resolves RFC #137007. Changelist:
- Set default value of `nccl_use_nonblocking` to true (previous: false).
- Set default value of `nccl_nonblocking_timeout` to 30 mins (previous: -1).

cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o
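The two defaults above are resolved from configuration at `ProcessGroupNCCL` construction time. A minimal Python sketch of that default-resolution logic is below; the environment-variable names and the seconds-based timeout unit are illustrative assumptions based on the option names in the changelist, not the exact PyTorch implementation.

```python
import os

# Hypothetical helper mirroring the new defaults described above:
# non-blocking is on by default, and the timeout defaults to 30 minutes
# instead of -1 (wait forever).
def resolve_nccl_nonblocking_config(env=os.environ):
    use_nonblocking = env.get("TORCH_NCCL_USE_COMM_NONBLOCKING", "1") != "0"
    timeout_secs = int(env.get("TORCH_NCCL_NONBLOCKING_TIMEOUT", str(30 * 60)))
    return use_nonblocking, timeout_secs

# With nothing set in the environment, the new defaults apply.
print(resolve_nccl_nonblocking_config({}))  # → (True, 1800)
```

A user who wants the old behavior back would set both variables explicitly, which takes precedence over the dangling defaults.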
This was referenced Oct 9, 2024
Resolves RFC #137007. Changelist:
- Set default value of `nccl_use_nonblocking` to true (previous: false).
- Set default value of `nccl_nonblocking_timeout` to 30 mins (previous: -1).
- Add sched_yield in wait loop of ncclInProgress.
- Throw async error during comm init wait.

cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o
Resolves RFC #137007. Changelist:
- Set default value of `nccl_use_nonblocking` to true (previous: false).
- Set default value of `nccl_nonblocking_timeout` to 30 mins (previous: -1).
- Add sched_yield in wait loop of ncclInProgress.

cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o
Resolves RFC #137007. Changelist:
- Set default value of `nccl_use_nonblocking` to true (previous: false).

cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o
pytorchmergebot pushed a commit that referenced this pull request on Oct 23, 2024:
### Why use non-blocking mode in eager init?
For overlapping comm init and model init, etc.

### Why can we set non-blocking as default?
If the setting is dangling -- i.e. neither passed in by the user nor set via env -- `ProcessGroupNCCL` can apply its own preferred logic. And the torch-level API semantics do not change whether the NCCL comm is blocking or non-blocking (that is handled within `ProcessGroupNCCL`).

### Why not make non-blocking the default for lazy mode as well?
PR #137544 tried it. Two reasons why that's not preferred today:
1. It is hard -- too big a blast radius.
2. There is no gain from doing lazy init in non-blocking mode: the very next CPU call is a collective, and we would block there waiting for the comm to be ready anyway, so the effect is the same as blocking init, with no "opening" for overlap compared to eager mode.

Pull Request resolved: #138527. Approved by: https://github.com/wconstab. ghstack dependencies: #137855, #138488, #138374, #138384
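The overlap argument for eager init can be pictured as two concurrent tasks: comm setup runs in the background while model init proceeds, and we only wait on the comm at the first point it is actually needed. The sketch below is a conceptual simulation of that timing, not the `ProcessGroupNCCL` code path; the durations and names are made up.

```python
import threading
import time

comm_ready = threading.Event()

def init_comm():
    time.sleep(0.2)          # stand-in for non-blocking NCCL comm init
    comm_ready.set()

start = time.monotonic()
t = threading.Thread(target=init_comm)
t.start()                    # eager init: kick off comm setup immediately

time.sleep(0.2)              # stand-in for model init, overlapped with the above

comm_ready.wait()            # block only where the first collective would run
t.join()
elapsed = time.monotonic() - start
# With overlap, total time is roughly max(0.2, 0.2), not the 0.4 s sum.
print(f"{elapsed:.2f}s")
```

Under lazy init there is no such window: the comm is created immediately before the first collective, so the wait cannot be hidden behind any other work.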
Replaced by #138527
Labels: `keep-going` (don't stop on first failure, keep running tests until the end), `oncall: distributed`, `release notes: distributed (c10d)`
Stack from ghstack (oldest at bottom): #138374 (`nccl_nonblocking_timeout`)

Resolves RFC #137007. Changelist:
- Set default value of `nccl_use_nonblocking` to true (previous: false).

cc @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o