fix gloo cuda sparse_allreduce dispatch #111485
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/111485
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit 02946df with merge base 971f67c. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot rebase -s
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased.
    self.rank, self.world_size, num_inputs=1
)
for (inputs, outputs) in tests:
    tensors = inputs[-1].clone().cuda()
N00b question: does this mean CUDA tensors can work on gloo? So we first move tensors from GPU to CPU, do the communication, and then move them back to CUDA?
gloo supports all_reduce and broadcast for CUDA tensors (https://pytorch.org/docs/master/distributed.html#backends). In the ProcessGroupGloo implementation of all_reduce, it copies the CUDA tensors to pinned CPU tensors and then performs the allreduce. So pg_gloo.all_reduce(cpu_tensor) and pg_gloo.all_reduce(cuda_tensor) are both supported.
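To make the behavior above concrete, here is a minimal sketch of calling all_reduce through the public API on a gloo group. It assumes a torch build with distributed support; it uses a single-process group (world_size=1) so it runs without multiple ranks, and a CPU tensor so it runs without a GPU (swap in `.cuda()` to exercise the pinned-CPU copy path the comment describes).

```python
import tempfile
import torch
import torch.distributed as dist

# Single-process gloo group; a file:// init method avoids needing
# environment variables or a rendezvous server for this sketch.
init_file = tempfile.NamedTemporaryFile(delete=False)
dist.init_process_group(
    backend="gloo",
    init_method=f"file://{init_file.name}",
    rank=0,
    world_size=1,
)

t = torch.ones(4)   # use t.cuda() on a GPU machine; gloo copies CUDA
                    # tensors to pinned CPU memory for the collective
dist.all_reduce(t)  # public API -> dispatcher -> gloo backend
                    # with world_size=1, the summed result equals the input
dist.destroy_process_group()
```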
Just have a n00b question for my own learning, otherwise it looks good to me.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fixes pytorch#111422

allreduce_sparse_cuda gets dispatched to allreduce_sparse, which doesn't exist for gloo. However, gloo has an existing implementation, so this just fixes the dispatching to point at it.

The reason CI didn't catch this is that we were calling the backend directly. Added a test which calls the public API (dist.XYZ) and goes through the dispatcher.

Pull Request resolved: pytorch#111485
Approved by: https://github.com/fduwjj
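The distinction the description draws — calling the backend object directly versus going through the public API — can be sketched as follows. This is a hedged illustration, not the PR's actual test: it uses a single-process gloo group and a CPU sparse COO tensor (the bug itself involved the CUDA sparse path, which needs a GPU to hit).

```python
import tempfile
import torch
import torch.distributed as dist

# Single-process gloo group for illustration.
init_file = tempfile.NamedTemporaryFile(delete=False)
dist.init_process_group(
    backend="gloo",
    init_method=f"file://{init_file.name}",
    rank=0,
    world_size=1,
)

# A sparse COO tensor; on a GPU machine, sp = sp.cuda() would route
# through allreduce_sparse_cuda, the dispatch entry this PR fixes.
indices = torch.tensor([[0, 2]])
values = torch.tensor([1.0, 2.0])
sp = torch.sparse_coo_tensor(indices, values, (4,))

# Public API -> dispatcher -> gloo sparse allreduce. Calling the
# ProcessGroup object directly would bypass the dispatcher and
# would not have caught the missing dispatch entry.
dist.all_reduce(sp)

dist.destroy_process_group()
```

With world_size=1 the summed result equals the input, so the densified tensor should still be [1, 0, 2, 0].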