Conversation

kwen2501 (Contributor) commented Oct 9, 2024

pytorch-bot bot commented Oct 9, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137544

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures

As of commit fa2869a with merge base 195d0a6:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the `oncall: distributed` and `release notes: distributed (c10d)` labels Oct 9, 2024
kwen2501 added a commit that referenced this pull request Oct 9, 2024
ghstack-source-id: 8903825
Pull Request resolved: #137544
kwen2501 requested a review from shuqiangzhang October 9, 2024 00:28
Resolves RFC #137007.

Changelist:
- Set default value of `nccl_use_nonblocking` to true (previous: false).
- Set default value of `nccl_nonblocking_timeout` to 30 mins (previous: -1).

cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
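For reference, a minimal sketch of how a user could pin these knobs explicitly instead of relying on the new defaults. It assumes the options are exposed through the `TORCH_NCCL_USE_COMM_NONBLOCKING` and `TORCH_NCCL_NONBLOCKING_TIMEOUT` environment variables and that the timeout is given in seconds; check the ProcessGroupNCCL docs of your PyTorch version for the exact names and units:

```python
import os

# Assumed env var names/units -- verify against your PyTorch version.
os.environ["TORCH_NCCL_USE_COMM_NONBLOCKING"] = "1"    # "0" would restore blocking comm init
os.environ["TORCH_NCCL_NONBLOCKING_TIMEOUT"] = "1800"  # assumed seconds; 30 min matches the new default

import torch.distributed as dist

# Settings must be in place before the NCCL process group is created.
# (Assumes the usual rendezvous env vars MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE are set.)
dist.init_process_group(backend="nccl")
```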
kwen2501 added a commit that referenced this pull request Oct 9, 2024
ghstack-source-id: 75c7d3a
Pull Request resolved: #137544
kwen2501 added the `keep-going` label (don't stop on first failure, keep running tests until the end) Oct 9, 2024
kwen2501 added a commit that referenced this pull request Oct 10, 2024
ghstack-source-id: 26062bb
Pull Request resolved: #137544

[PGNCCL] Add sched_yield in wait loop of ncclInProgress

Throw async error during comm init wait
Resolves RFC #137007.

Changelist:
- Set default value of `nccl_use_nonblocking` to true (previous: false).
- Set default value of `nccl_nonblocking_timeout` to 30 mins (previous: -1).
- Add sched_yield in wait loop of ncclInProgress.
- Throw async error during comm init wait.

cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
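The two new items follow a standard non-blocking polling pattern: while the communicator still reports `ncclInProgress`, check for an asynchronously reported error and raise it right away; otherwise yield the CPU instead of busy-spinning until the timeout. A rough sketch of that pattern follows -- the two callables are hypothetical stand-ins for the C++-side NCCL checks inside `ProcessGroupNCCL`, while `os.sched_yield()` is a real call on Unix:

```python
import os
import time

def wait_for_comm_ready(init_in_progress, get_async_error, timeout_s=1800.0):
    """Sketch of a non-blocking comm-init wait loop -- not the actual
    ProcessGroupNCCL code. The two callables are hypothetical stand-ins for
    the C++-side checks (roughly: ncclCommGetAsyncError reporting
    ncclInProgress, or an error)."""
    deadline = time.monotonic() + timeout_s
    while init_in_progress():
        err = get_async_error()
        if err is not None:
            # Surface async errors immediately instead of waiting out the timeout.
            raise RuntimeError(f"NCCL comm init failed asynchronously: {err}")
        if time.monotonic() > deadline:
            raise TimeoutError("NCCL comm init did not complete within the timeout")
        os.sched_yield()  # yield the core instead of busy-spinning

# Toy usage: simulate an init that becomes ready after ~10 ms.
if __name__ == "__main__":
    ready_at = time.monotonic() + 0.01
    wait_for_comm_ready(lambda: time.monotonic() < ready_at, lambda: None)
```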
kwen2501 added a commit that referenced this pull request Oct 10, 2024
ghstack-source-id: 671692d
Pull Request resolved: #137544

[PGNCCL] Add sched_yield in wait loop of ncclInProgress
kwen2501 marked this pull request as draft October 10, 2024 23:09
Resolves RFC #137007.

Changelist:
- Set default value of `nccl_use_nonblocking` to true (previous: false).
- Set default value of `nccl_nonblocking_timeout` to 30 mins (previous: -1).
- Add sched_yield in wait loop of ncclInProgress.

cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Oct 13, 2024
ghstack-source-id: cfab754
Pull Request resolved: #137544

[PGNCCL] Add sched_yield in wait loop of ncclInProgress
kwen2501 added a commit that referenced this pull request Oct 15, 2024
ghstack-source-id: e1b01dd
Pull Request resolved: #137544

[PGNCCL] Add sched_yield in wait loop of ncclInProgress
Resolves RFC #137007.

Changelist:
- Set default value of `nccl_use_nonblocking` to true (previous: false).

cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Oct 19, 2024
ghstack-source-id: 0f9e940
Pull Request resolved: #137544
kwen2501 added a commit that referenced this pull request Oct 21, 2024
ghstack-source-id: 568a77c
Pull Request resolved: #137544
pytorchmergebot pushed a commit that referenced this pull request Oct 23, 2024
### Why use non-blocking mode in eager init?
For overlapping comm init and model init, etc.
![image](https://github.com/user-attachments/assets/9b0bf7a9-be26-4d16-827b-dbe861f083cd)

### Why can we make non-blocking the default?
If the setting is left dangling -- i.e. neither passed in by the user nor set via an environment variable -- `ProcessGroupNCCL` is free to apply its own preferred logic. And the torch-level API semantics do not change whether the NCCL comm is blocking or non-blocking; that is handled entirely inside `ProcessGroupNCCL`.

### Why not make non-blocking the default for lazy mode as well?
PR #137544 tried it.
Two reasons why that is not preferred today:
1. It is hard -- the blast radius is too big.
2. There is no gain from doing lazy init in non-blocking mode: the very next CPU call is a collective, and we would block there waiting for the comm to be ready anyway, so the effect is the same as blocking init, with no window for overlap as there is in eager mode.

Pull Request resolved: #138527
Approved by: https://github.com/wconstab
ghstack dependencies: #137855, #138488, #138374, #138384
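For reference, a minimal sketch of the eager-init overlap described above. It assumes that passing `device_id` to `init_process_group` is what triggers eager NCCL comm init in recent PyTorch releases, so with non-blocking mode the call can return while the communicator is still being set up:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
device = torch.device("cuda", local_rank)
torch.cuda.set_device(device)

# Passing device_id requests eager NCCL comm init (assumed behavior of recent
# PyTorch releases). With non-blocking init this call can return while the
# communicator is still being created...
dist.init_process_group(backend="nccl", device_id=device)

# ...so comm init overlaps with model construction happening here.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).to(device)

# The first collective waits inside ProcessGroupNCCL for the comm to be ready.
t = torch.ones(1, device=device)
dist.all_reduce(t)
```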
kwen2501 (Contributor, Author) commented Nov 2, 2024

Replaced by #138527

kwen2501 closed this Nov 2, 2024
github-actions bot deleted the gh/kwen2501/70/head branch December 3, 2024 02:11