Enable pg_nccl.reduce_scatter to perform vector ReduceScatter for uneven input splits #82924
Conversation
Dr. CI: ✅ No failures (1 pending) as of commit 4ace662.
This pull request was exported from Phabricator. Differential Revision: D38478781
Maybe also worth writing a bit in the PR description about the context --

Also good to comment the above in code.
The branch was then force-pushed several times (cb26161 → f92c52a → fa7204e → fff60a6 → 0bdd278); each push was re-exported from Phabricator under Differential Revision D38478781.
Enable pg_nccl.reduce_scatter to perform vector ReduceScatter for uneven input splits (pytorch#82924)

Summary: Pull Request resolved: pytorch#82924

A vector reduce_scatter requires each process to reduce and scatter an input tensor according to the input list provided. Internally, pg_nccl.reduce_scatter will coalesce a list of pg_nccl._reduce_oop calls to implement a vector reduce-scatter when any shape in the input list differs from the others. Otherwise, it performs a ncclReduceScatter as usual.

- This change adds a `CoalescedWorkNCCL` class which encapsulates the WorkNCCL requests from coalesced operations. A `.wait()` on a CoalescedWorkNCCL request calls wait on each of the coalesced WorkNCCL requests.
- This change adds an out-of-place `_reduce_oop` function to ProcessGroupNCCL. It allows reducing an input tensor and placing the output in a separate output tensor. Since reduce_scatter provides an out-of-place API, a reduce_scatter_v semantic implemented inside `pg_nccl.reduce_scatter` also needs to support out-of-place, which requires adding an out-of-place reduce.

Test Plan: Added a new test `test_reduce_scatter_v_cuda` for reduce_scatter_v to `distributed_nccl_spawn`.

Differential Revision: D38478781

fbshipit-source-id: 0157acdee4e9a1dd328a27d4e30c3b81c4523039
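For context, a minimal sketch of what a vector reduce_scatter looks like from the Python side after this change, assuming a NCCL process group is already initialized; the split sizes and tensor values here are illustrative, not taken from the PR's test:

```python
import torch
import torch.distributed as dist

def reduce_scatter_v_sketch(rank: int, world_size: int):
    # Illustrative uneven split sizes: rank i's output shard has i + 1 elements.
    input_split_sizes = [i + 1 for i in range(world_size)]
    device = torch.device("cuda", rank)

    # One input tensor per destination rank; shapes differ across the list,
    # which is what triggers the coalesced _reduce_oop path internally.
    input_list = [
        torch.full((n,), float(rank), device=device) for n in input_split_sizes
    ]
    output = torch.empty(input_split_sizes[rank], device=device)

    # With this PR, pg_nccl.reduce_scatter accepts unequal shapes in input_list.
    dist.reduce_scatter(output, input_list, op=dist.ReduceOp.SUM)

    # Every element of output is the sum of all ranks' contributions:
    # 0 + 1 + ... + (world_size - 1).
    assert torch.allclose(
        output, torch.full_like(output, sum(range(world_size)))
    )
```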
The branch was force-pushed once more, from 0bdd278 to the final commit 4ace662.
LGTM.
@pytorchbot merge

@pytorchbot successfully started a merge job. Check the current status here.

Hey @aashaka.
```python
if async_val:
    req.wait()

expected_value = 2 + (10 * (len(group) - 1))
```
nit: use the variables defined above instead of magic numbers.
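A minimal sketch of what this suggestion might look like, assuming the `master_value` and `worker_value` variables from the snippet below are in scope at this point in the test:

```python
# Derive the expectation from the named values rather than repeating
# the literals 2 and 10.
expected_value = master_value + worker_value * (len(group) - 1)
```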
```python
end_len = start_len + input_split_sizes[rank]
sum_len = sum(input_split_sizes)
master_value = 2
worker_value = 10
```
nit:
rename master_value --> value_to_self
rename worker_value --> value_to_others
to be clearer
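A sketch of the nit applied to the snippet above; the comments are one plausible reading of what the two values mean, not wording from the original test:

```python
value_to_self = 2     # value each rank contributes to its own output shard
value_to_others = 10  # value each rank contributes to other ranks' shards
```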
Enable pg_nccl.reduce_scatter to perform vector ReduceScatter for uneven input splits (#82924) (#82924)

Summary:
A vector reduce_scatter requires each process to reduce and scatter an input tensor according to the input list provided. Internally, pg_nccl.reduce_scatter will coalesce a list of pg_nccl._reduce_oop calls to implement a vector reduce-scatter when any shape in the input list differs from the others. Otherwise, it performs a ncclReduceScatter as usual.

- This change adds a `CoalescedWorkNCCL` class which encapsulates the WorkNCCL requests from coalesced operations. A `.wait()` on a CoalescedWorkNCCL request calls wait on each of the coalesced WorkNCCL requests.
- This change adds an out-of-place `_reduce_oop` function to ProcessGroupNCCL. It allows reducing an input tensor and placing the output in a separate output tensor. Since reduce_scatter provides an out-of-place API, a reduce_scatter_v semantic implemented inside `pg_nccl.reduce_scatter` also needs to support out-of-place, which requires adding an out-of-place reduce.

Pull Request resolved: #82924
Approved by: https://github.com/kwen2501

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/d6a30e213e2355e8ad553c02d205391c889a0254

Test plan from GitHub: Added a new test `test_reduce_scatter_v_cuda` for reduce_scatter_v to `distributed_nccl_spawn`.

Original Phabricator Test Plan: Added a new test `test_reduce_scatter_v_cuda` for reduce_scatter_v to `distributed_nccl_spawn`.

Reviewed By: kwen2501

Differential Revision: D38478781

Pulled By: aashaka

fbshipit-source-id: b5a203847241e83556e51b640b24eaa765dca6c4
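As a closing illustration of the `CoalescedWorkNCCL` behavior described in the summary, a hedged sketch of the async path, reusing the `output`/`input_list` setup from the earlier sketch; a single handle's `wait()` covers every coalesced operation:

```python
# With uneven splits, reduce_scatter internally coalesces several
# _reduce_oop calls but still returns one work handle.
req = dist.reduce_scatter(output, input_list, async_op=True)
# ... overlap independent computation here ...
req.wait()  # waits on each underlying coalesced WorkNCCL request
```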