[16/N] Add _allgather_base custom op with CPU/CUDA implementation #88889
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88889
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit 16fab8b.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
&::c10d::ProcessGroup::_allgather_base,
[](const c10::intrusive_ptr<::c10d::ProcessGroup>& self,
   at::Tensor& output_tensor,
   at::Tensor& input_tensor,
I think `input_tensor` can be `const &`.
I had a look at init.cpp; it seems the convention there has not been to use `const`. Not sure why. We can go with the convention, I guess.
LGTM.
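For reference, a minimal sketch of the reviewer's suggestion applied to this binding (not the code that landed; the `AllgatherOptions` parameter and the forwarding body are assumed from the surrounding pybind11 pattern). Note that for this to compile cleanly, the underlying virtual `ProcessGroup::_allgather_base` would also need to accept `const at::Tensor&`, which may be why the existing convention in init.cpp avoids `const` here:

```cpp
// Sketch only: input_tensor taken by const reference, since all-gather
// reads the input buffer and writes only into output_tensor.
[](const c10::intrusive_ptr<::c10d::ProcessGroup>& self,
   at::Tensor& output_tensor,
   const at::Tensor& input_tensor,          // was: at::Tensor&
   const ::c10d::AllgatherOptions& opts) {  // assumed from the pattern
  // Assumes the virtual method is updated to take const at::Tensor& too;
  // otherwise this call would not compile without a const_cast.
  return self->_allgather_base(output_tensor, input_tensor, opts);
}
```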
test/distributed/test_c10d_nccl.py (Outdated)
device = "cuda" | ||
tensor = torch.ones(10, 10, device=torch.device(device)) | ||
output_tensor = torch.zeros(10, 10, device=torch.device(device)) | ||
dist._all_gather_base(output_tensor, tensor) |
Please use `all_gather_into_tensor` at the Python level. We will also rename `_all_gather_base` at the binding and C++ levels to `all_gather_into_tensor` when we introduce extension-breaking changes in 2.0.
Good point! Thanks.
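For illustration, here is roughly how the test snippet above could be rewritten per the suggestion. This is a sketch, not the exact code from the PR: it assumes the test harness has already called `dist.init_process_group(...)` on every rank, and it sizes the output buffer explicitly, since `all_gather_into_tensor` requires the output's leading dimension to be `world_size` times the input's (the equal shapes in the original snippet would only hold for a single-rank group):

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already run on every rank.
device = "cuda"
world_size = dist.get_world_size()

tensor = torch.ones(10, 10, device=torch.device(device))
# The output buffer holds one input-sized slice per rank,
# concatenated along the first dimension.
output_tensor = torch.zeros(10 * world_size, 10, device=torch.device(device))

# Public API suggested by the reviewer, replacing dist._all_gather_base.
dist.all_gather_into_tensor(output_tensor, tensor)
```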
@H-Huang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
…ntation" Differential Revision: [D41227739](https://our.internmc.facebook.com/intern/diff/D41227739) [ghstack-poisoned]
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…torch#88889)

Differential Revision: [D41227739](https://our.internmc.facebook.com/intern/diff/D41227739)

Pull Request resolved: pytorch#88889
Approved by: https://github.com/kwen2501
Stack from ghstack:
Context: #86225
Differential Revision: D41227739