
Conversation


@aashaka aashaka commented Aug 5, 2022

Summary: A vector reduce_scatter requires each process to reduce and scatter an input tensor according to the input list provided. Internally, pg_nccl.reduce_scatter will coalesce a list of pg_nccl._reduce_oop calls to implement a vector reduce-scatter when the shapes in the input list are not all the same. Otherwise, it will perform a ncclReduceScatter as usual.

  • This change adds a CoalescedWorkNCCL class which encapsulates the WorkNCCL requests from coalesced operations. A .wait() on a CoalescedWorkNCCL request will call a wait on each of the WorkNCCL requests that are coalesced.

  • This change adds an out-of-place _reduce_oop function to ProcessGroupNCCL. It allows reducing an input tensor and placing the output in a separate output tensor. Since reduce_scatter provides an out-of-place API, a reduce_scatter_v semantic implemented inside pg_nccl.reduce_scatter also needs to support out-of-place operation, which is why an out-of-place reduce needs to be added.

Test Plan: Added a new test test_reduce_scatter_v_cuda for reduce_scatter_v to distributed_nccl_spawn.
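
For illustration, a minimal usage sketch of the resulting behavior, assuming the default process group is initialized with the NCCL backend (the function name, shapes, and values below are made up, not taken from the added test): with uneven splits, dist.reduce_scatter goes down the coalesced _reduce_oop path instead of a single ncclReduceScatter.

```python
import torch
import torch.distributed as dist

def demo_uneven_reduce_scatter():
    # Hypothetical helper; assumes dist.init_process_group("nccl") has run.
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    # Uneven split: slot r holds r + 1 elements, so the list entries have
    # different shapes and the coalesced _reduce_oop path is taken.
    input_split_sizes = [r + 1 for r in range(world_size)]
    input_list = [
        torch.full((n,), float(rank + 1), device=f"cuda:{rank}")
        for n in input_split_sizes
    ]
    output = torch.empty(input_split_sizes[rank], device=f"cuda:{rank}")
    dist.reduce_scatter(output, input_list, op=dist.ReduceOp.SUM)
    # Every rank now holds the elementwise sum of its slot across ranks:
    # each element of output equals world_size * (world_size + 1) / 2.
    return output
```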

Differential Revision: D38478781

Contributor

facebook-github-bot commented Aug 5, 2022


✅ No Failures (1 Pending)

As of commit 4ace662 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D38478781

@facebook-github-bot facebook-github-bot added the oncall: distributed label Aug 5, 2022
Contributor

kwen2501 commented Aug 6, 2022

Maybe also worth writing a bit in the PR description about the context --
You need an out-of-place reduce from the backend because you want to compose a reduce_scatter_v pattern at the Python front end using coalesced reduces. Today, the dist.reduce_scatter API is out-of-place, so if the reduce_scatter_v pattern is implemented under the dist.reduce_scatter API, the coalesced reduces need out-of-place support as well.
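
For context, a rough pure-Python sketch of that composition (illustrative only; the PR does the coalescing inside ProcessGroupNCCL in C++, and the helper name here is made up). The clone() works around the in-place semantics of dist.reduce, which is exactly what an out-of-place backend reduce avoids:

```python
import torch
import torch.distributed as dist

def reduce_scatter_v_reference(output, input_list):
    # Decompose reduce_scatter_v into one reduce per destination rank:
    # rank `root` ends up with the reduction of input_list[root] over ranks.
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    for root in range(world_size):
        # dist.reduce is in-place, so reduce a scratch copy to avoid
        # clobbering the caller's input tensor.
        tmp = input_list[root].clone()
        dist.reduce(tmp, dst=root, op=dist.ReduceOp.SUM)
        if rank == root:
            output.copy_(tmp)
```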

Contributor

kwen2501 commented Aug 6, 2022

Also good to comment the above in code.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D38478781

@aashaka aashaka changed the title Expose an out-of-place _reduce from ProcessGroupNCCL Expose an out-of-place _reduce_oop from ProcessGroupNCCL Aug 8, 2022
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D38478781

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D38478781

@aashaka aashaka changed the title Expose an out-of-place _reduce_oop from ProcessGroupNCCL Enable pg_nccl.reduce_scatter to perform vector ReduceScatter for uneven input splits Aug 15, 2022
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D38478781

Enable pg_nccl.reduce_scatter to perform vector ReduceScatter for uneven input splits (pytorch#82924)

Summary:
Pull Request resolved: pytorch#82924

A vector reduce_scatter requires each process to reduce and scatter an input tensor according to the input list provided.
Internally, pg_nccl.reduce_scatter will coalesce a list of pg_nccl._reduce_oop calls to implement a vector reduce-scatter when the shapes in the input list are not all the same. Otherwise, it will perform a ncclReduceScatter as usual.

- This change adds a `CoalescedWorkNCCL` class which encapsulates the WorkNCCL requests from coalesced operations. A `.wait()` on a CoalescedWorkNCCL request will call a wait on each of the WorkNCCL requests that are coalesced.

- This change adds an out-of-place `_reduce_oop` function to ProcessGroupNCCL. It allows reducing an input tensor and placing the output in a separate output tensor. Since reduce_scatter provides an out-of-place API, a reduce_scatter_v semantic implemented inside `pg_nccl.reduce_scatter` also needs to support out-of-place operation, which is why an out-of-place reduce needs to be added.

Test Plan: Added a new test `test_reduce_scatter_v_cuda` for reduce_scatter_v to `distributed_nccl_spawn`.

Differential Revision: D38478781

fbshipit-source-id: 0157acdee4e9a1dd328a27d4e30c3b81c4523039
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D38478781

Contributor

@kwen2501 kwen2501 left a comment


LGTM.

@kwen2501
Contributor

@pytorchbot merge

@pytorchmergebot
Collaborator

@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered without a flag. This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

@github-actions
Contributor

Hey @aashaka.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

if async_val:
    req.wait()

expected_value = 2 + (10 * (len(group) - 1))

nit: use the variables defined above instead of hard-coded numbers.
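
A possible rewrite of this value following the nit, assuming the master_value and worker_value variables from the next snippet are defined earlier in the test:

```python
# Same value as 2 + (10 * (len(group) - 1)), without the hard-coded numbers.
expected_value = master_value + worker_value * (len(group) - 1)
```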

end_len = start_len + input_split_sizes[rank]
sum_len = sum(input_split_sizes)
master_value = 2
worker_value = 10

nit: for clarity, rename master_value --> value_to_self and worker_value --> value_to_others.
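
For illustration, the snippet above with the suggested renames applied:

```python
end_len = start_len + input_split_sizes[rank]
sum_len = sum(input_split_sizes)
value_to_self = 2       # formerly master_value
value_to_others = 10    # formerly worker_value
```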

@kwen2501 kwen2501 added the release notes: distributed (c10d) and topic: new features labels Aug 18, 2022
facebook-github-bot pushed a commit that referenced this pull request Aug 19, 2022
Enable pg_nccl.reduce_scatter to perform vector ReduceScatter for uneven input splits (#82924) (#82924)

Summary:
A vector reduce_scatter requires each process to reduce and scatter an input tensor according to the input list provided. Internally, pg_nccl.reduce_scatter will coalesce a list of pg_nccl._reduce_oop calls to implement a vector reduce-scatter when the shapes in the input list are not all the same. Otherwise, it will perform a ncclReduceScatter as usual.

- This change adds a `CoalescedWorkNCCL` class which encapsulates the WorkNCCL requests from coalesced operations. A `.wait()` on a CoalescedWorkNCCL request will call a wait on each of the WorkNCCL requests that are coalesced.

- This change adds an out-of-place `_reduce_oop` function to ProcessGroupNCCL. It allows reducing an input tensor and placing the output in a separate output tensor. Since reduce_scatter provides an out-of-place API, a reduce_scatter_v semantic implemented inside `pg_nccl.reduce_scatter` also needs to support out-of-place operation, which is why an out-of-place reduce needs to be added.

Pull Request resolved: #82924
Approved by: https://github.com/kwen2501

Test Plan:
contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/d6a30e213e2355e8ad553c02d205391c889a0254

Test plan from GitHub:
Added a new test `test_reduce_scatter_v_cuda` for reduce_scatter_v to `distributed_nccl_spawn`.

Original Phabricator Test Plan:
Added a new test `test_reduce_scatter_v_cuda` for reduce_scatter_v to `distributed_nccl_spawn`.

Reviewed By: kwen2501

Differential Revision: D38478781

Pulled By: aashaka

fbshipit-source-id: b5a203847241e83556e51b640b24eaa765dca6c4

Labels

cla signed, fb-exported, Merged, oncall: distributed, release notes: distributed (c10d), topic: new features
