[4/N] [Dispatchable Collectives] Update all_reduce_ with CPU / CUDA implementations #83810
Conversation
LGTM!
// sparse all_reduce in the Gloo backend
TORCH_LIBRARY_IMPL(c10d, SparseCPU, m) {
  m.impl("allreduce_", allreduce_cpu_);
}
nit: is there actually a SparseCPU or SparseCUDA implementation in Gloo? If not, do we need to add it here?
There is logic within allreduce() to handle sparse tensors (https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/ProcessGroupGloo.cpp#L1465-L1467) that branches to different logic, so having each implementation just call allreduce_ keeps behavior the same. I think there may be a cleaner way to do this, though; we should discuss it.
@H-Huang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
### About this PR
* Update the all_reduce op to dispatch to CPU and CUDA implementations. Right now both perform the same logic, so this is essentially a no-op.
* Update the test to validate that a separate device implementation is not supported.

### About this stack
In the future we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. The CPU and CUDA implementations will be updated to have the process group select its CPU and CUDA backends respectively.

Differential Revision: [D39506979](https://our.internmc.facebook.com/intern/diff/D39506979)
@pytorchbot merge

@pytorchbot successfully started a merge job. Check the current status here and land check progress here.
Update all_reduce_ with CPU / CUDA implementations (#83810)

Pull Request resolved: #83810
Approved by: https://github.com/kwen2501
Stack from ghstack. Context: #86225. Differential Revision: D39506979