Conversation

kwen2501 (Contributor) commented Oct 9, 2024

pytorch-bot bot commented Oct 9, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137544

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures

As of commit fa2869a with merge base 195d0a6:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the `oncall: distributed` and `release notes: distributed (c10d)` labels Oct 9, 2024
kwen2501 added a commit that referenced this pull request Oct 9, 2024
ghstack-source-id: 8903825
Pull Request resolved: #137544
kwen2501 requested a review from shuqiangzhang October 9, 2024 00:28
Resolves RFC #137007.

Changelist:
- Set default value of `nccl_use_nonblocking` to true (previous: false).
- Set default value of `nccl_nonblocking_timeout` to 30 mins (previous: -1).

cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
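For reference, a minimal sketch of how a user could pin these knobs explicitly instead of relying on the new defaults. It assumes the options are exposed through the `TORCH_NCCL_USE_COMM_NONBLOCKING` and `TORCH_NCCL_NONBLOCKING_TIMEOUT` environment variables and that the timeout is given in seconds; check the ProcessGroupNCCL docs of your PyTorch version for the exact names and units:

```python
import os

# Assumed env var names/units -- verify against your PyTorch version.
os.environ["TORCH_NCCL_USE_COMM_NONBLOCKING"] = "1"    # "0" would restore blocking comm init
os.environ["TORCH_NCCL_NONBLOCKING_TIMEOUT"] = "1800"  # assumed seconds; 30 min matches the new default

import torch.distributed as dist

# Settings must be in place before the NCCL process group is created.
# (Assumes the usual rendezvous env vars MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE are set.)
dist.init_process_group(backend="nccl")
```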
kwen2501 added a commit that referenced this pull request Oct 9, 2024
ghstack-source-id: 75c7d3a
Pull Request resolved: #137544
kwen2501 added the `keep-going` label (don't stop on first failure, keep running tests until the end) Oct 9, 2024
kwen2501 added a commit that referenced this pull request Oct 10, 2024
ghstack-source-id: 26062bb
Pull Request resolved: #137544

[PGNCCL] Add sched_yield in wait loop of ncclInProgress

Throw async error during comm init wait
Resolves RFC #137007.

Changelist:
- Set default value of `nccl_use_nonblocking` to true (previous: false).
- Set default value of `nccl_nonblocking_timeout` to 30 mins (previous: -1).
- Add sched_yield in wait loop of ncclInProgress.
- Throw async error during comm init wait.

cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
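The two new items follow a standard non-blocking polling pattern: while the communicator still reports `ncclInProgress`, check for an asynchronously reported error and raise it right away; otherwise yield the CPU instead of busy-spinning until the timeout. A rough sketch of that pattern follows -- the two callables are hypothetical stand-ins for the C++-side NCCL checks inside `ProcessGroupNCCL`, while `os.sched_yield()` is a real call on Unix:

```python
import os
import time

def wait_for_comm_ready(init_in_progress, get_async_error, timeout_s=1800.0):
    """Sketch of a non-blocking comm-init wait loop -- not the actual
    ProcessGroupNCCL code. The two callables are hypothetical stand-ins for
    the C++-side checks (roughly: ncclCommGetAsyncError reporting
    ncclInProgress, or an error)."""
    deadline = time.monotonic() + timeout_s
    while init_in_progress():
        err = get_async_error()
        if err is not None:
            # Surface async errors immediately instead of waiting out the timeout.
            raise RuntimeError(f"NCCL comm init failed asynchronously: {err}")
        if time.monotonic() > deadline:
            raise TimeoutError("NCCL comm init did not complete within the timeout")
        os.sched_yield()  # yield the core instead of busy-spinning

# Toy usage: simulate an init that becomes ready after ~10 ms.
if __name__ == "__main__":
    ready_at = time.monotonic() + 0.01
    wait_for_comm_ready(lambda: time.monotonic() < ready_at, lambda: None)
```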
kwen2501 added a commit that referenced this pull request Oct 10, 2024
ghstack-source-id: 671692d
Pull Request resolved: #137544

[PGNCCL] Add sched_yield in wait loop of ncclInProgress
kwen2501 marked this pull request as draft October 10, 2024 23:09
Resolves RFC #137007.

Changelist:
- Set default value of `nccl_use_nonblocking` to true (previous: false).
- Set default value of `nccl_nonblocking_timeout` to 30 mins (previous: -1).
- Add sched_yield in wait loop of ncclInProgress.

cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Oct 13, 2024
ghstack-source-id: cfab754
Pull Request resolved: #137544

[PGNCCL] Add sched_yield in wait loop of ncclInProgress
kwen2501 added a commit that referenced this pull request Oct 15, 2024
ghstack-source-id: e1b01dd
Pull Request resolved: #137544

[PGNCCL] Add sched_yield in wait loop of ncclInProgress
Resolves RFC #137007.

Changelist:
- Set default value of `nccl_use_nonblocking` to true (previous: false).

cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Oct 19, 2024
ghstack-source-id: 0f9e940
Pull Request resolved: #137544
kwen2501 added a commit that referenced this pull request Oct 21, 2024
ghstack-source-id: 568a77c
Pull Request resolved: #137544
pytorchmergebot pushed a commit that referenced this pull request Oct 23, 2024
### Why use non-blocking mode in eager init?
For overlapping comm init and model init, etc.
![image](https://github.com/user-attachments/assets/9b0bf7a9-be26-4d16-827b-dbe861f083cd)

### Why can we make non-blocking the default?
If the setting is left dangling -- i.e. neither passed in by the user nor set via an environment variable -- `ProcessGroupNCCL` is free to apply its own preferred logic. And the torch-level API semantics do not change whether the NCCL comm is blocking or non-blocking; that is handled entirely inside `ProcessGroupNCCL`.

### Why not make non-blocking the default for lazy mode as well?
PR #137544 tried it.
Two reasons why that is not preferred today:
1. It is hard -- the blast radius is too big.
2. There is no gain from doing lazy init in non-blocking mode: the very next CPU call is a collective, and we would block there waiting for the comm to be ready anyway, so the effect is the same as blocking init, with no window for overlap as there is in eager mode.

Pull Request resolved: #138527
Approved by: https://github.com/wconstab
ghstack dependencies: #137855, #138488, #138374, #138384
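For reference, a minimal sketch of the eager-init overlap described above. It assumes that passing `device_id` to `init_process_group` is what triggers eager NCCL comm init in recent PyTorch releases, so with non-blocking mode the call can return while the communicator is still being set up:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
device = torch.device("cuda", local_rank)
torch.cuda.set_device(device)

# Passing device_id requests eager NCCL comm init (assumed behavior of recent
# PyTorch releases). With non-blocking init this call can return while the
# communicator is still being created...
dist.init_process_group(backend="nccl", device_id=device)

# ...so comm init overlaps with model construction happening here.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).to(device)

# The first collective waits inside ProcessGroupNCCL for the comm to be ready.
t = torch.ones(1, device=device)
dist.all_reduce(t)
```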
kwen2501 (Contributor, Author) commented Nov 2, 2024

Replaced by #138527

kwen2501 closed this Nov 2, 2024
github-actions bot deleted the gh/kwen2501/70/head branch December 3, 2024 02:11