[10/N] Update barrier with CPU/CUDA implementations #86368
Conversation
[ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/86368
Note: Links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit 51f3b75. This comment was automatically generated by Dr. CI and updates every 15 minutes.
ghstack-source-id: e9d93b37f71e7a3d6a73d234b333e63e1cb00b17
Pull Request resolved: #86368
@H-Huang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
```diff
@@ -286,6 +302,15 @@ TORCH_LIBRARY_IMPL(c10d, CPU, m) {
 TORCH_LIBRARY_IMPL(c10d, CUDA, m) {
   m.impl("reduce_scatter_", reduce_scatter_cuda_);
 }

+TORCH_LIBRARY_IMPL(c10d, CPU, m) {
+  m.impl("barrier", barrier_cpu);
```
How do you decide if there should be a trailing underscore?
The convention for PT operators (https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/native_functions.yaml) is that if the tensor is modified in place, then the operator name should be appended with `_`. We don't do this for `barrier` and `send`.
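As a quick illustration of that convention, using stock `torch` tensor ops rather than the c10d operators touched in this PR:

```python
import torch

a = torch.ones(3)
b = torch.ones(3)

# No trailing underscore: returns a new tensor; the inputs are untouched.
c = torch.add(a, b)

# Trailing underscore: modifies `a` in place.
a.add_(b)

# Ops like barrier and send don't modify an input tensor in place,
# which is why they are registered without the trailing underscore.
```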
Thank you Professor Huang!
### Changes
- Updates for the barrier collective

### Context
#86225

Differential Revision: [D40145698](https://our.internmc.facebook.com/intern/diff/D40145698)

[ghstack-poisoned]
@H-Huang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
It is okay to land the current changes since they lead to the same behavior. But we should note that the current change would not achieve dispatching of barrier.
Also, I added some minor comments about the test. (Sorry I missed them previously; they can be improved in a future PR.)
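For context, a minimal sketch of why there is no tensor to drive device dispatch for barrier, unlike the tensor-taking collectives (single-process gloo group; the setup values are placeholders, not part of this PR):

```python
import os
import torch
import torch.distributed as dist

# Illustrative single-process setup; address/port are placeholders.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# all_reduce receives a tensor, so its device (CPU vs CUDA) is available
# to route the call to a device-specific kernel.
t = torch.ones(2)
dist.all_reduce(t)

# barrier takes no tensor argument, so there is nothing to read a device
# from; registering CPU/CUDA kernels by itself does not give device routing.
dist.barrier()

dist.destroy_process_group()
```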
```diff
@@ -1488,6 +1490,7 @@ def _test_collectives(self, backend):
             (dist.all_reduce,),
```
Sorry, I may have missed this earlier. Would passing `self.rank` to reduce and broadcast cause each rank to identify a different root and hang? They should have the same root.
Passing `self.rank` causes the broadcast operation to be sourced from that rank, but it does not hang waiting for other ranks to ACK the broadcast it sent. The same logic applies to reduce, so that's why I believe it is not hanging.
```diff
-        if collective == dist.all_gather:
+        if collective == dist.barrier:
+            collective()
+        elif collective == dist.all_gather:
             collective([tensor], tensor, *args)
```
- nit: this if-else block is getting bigger (as a result of wrapping/templating). Maybe the test would be easier to read if we just wrote out each test call in `_test_collectives`, like:
  ```python
  dist.barrier()
  dist.all_reduce(tensor)
  ...
  ```
- Is `collective([tensor], tensor, *args)` a correct format for `all_gather`? i.e. it will have a list of only one tensor for the output. Or are we testing the dispatching functionality with WORLD_SIZE=1 here? (if so the code makes sense)
- Makes sense, the if statement is getting a bit messy; I will change it in future PRs.
- We are just testing the dispatching functionality and making sure the operation is callable, so we're handling the latter case, since collective correctness should be handled by other tests.
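For what it's worth, a small sketch of that point (single-process setup mirroring the world_size=1 case; names and setup values are illustrative, not from this PR):

```python
import os
import torch
import torch.distributed as dist

# Illustrative single-process setup (world_size == 1); address/port are placeholders.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

tensor = torch.ones(2)
# all_gather expects one output tensor per rank, so with world_size == 1 a
# single-element output list is the correct shape; the call below mainly
# checks that the op dispatches and is callable, not multi-rank correctness.
output = [torch.empty_like(tensor)]
dist.all_gather(output, tensor)

dist.destroy_process_group()
```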
### Changes
- Updates for the barrier collective

### Context
#86225

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @kwen2501 @awgu

[ghstack-poisoned]
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
### Changes
- Updates for the barrier collective
- NOTE: current change will not achieve dispatching of barrier since there is no tensor to read from

### Context
pytorch#86225

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @kwen2501 @awgu

Pull Request resolved: pytorch#86368
Approved by: https://github.com/kwen2501
Stack from ghstack:

### Changes
- Updates for the barrier collective

### Context
#86225

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @kwen2501 @awgu