[C10d][NCCL] Refactor complex all_reduce and broadcast #121045
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/121045
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 9a304f7 with merge base 8861507.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Naive question: what happens when we view complex tensors as real here? Is the elementwise computation still valid?

Yes, the "elementwise" computation must be valid there, as long as the "pre-mul factor" is of the real type.
Would it make sense to add a minimal test, e.g. in test_c10d_nccl.py?
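Something along these lines could serve as a starting point. This is only a sketch: it assumes an already-initialized NCCL process group with one CUDA device per rank, and the helper name `_test_all_reduce_complex` is hypothetical rather than existing test scaffolding.

```python
import torch
import torch.distributed as dist

def _test_all_reduce_complex(rank: int, world_size: int) -> None:
    # Each rank contributes (rank + rank*1j); the default SUM reduction
    # should produce the same complex total on every rank.
    t = torch.full(
        (8,), complex(rank, rank), dtype=torch.cfloat, device=f"cuda:{rank}"
    )
    dist.all_reduce(t)
    total = sum(range(world_size))
    expected = torch.full(
        (8,), complex(total, total), dtype=torch.cfloat, device=f"cuda:{rank}"
    )
    torch.testing.assert_close(t, expected)
```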
LGTM.
Re: moving complex support down from Python to C++: we might need to give it more thought, as today all C++ backends rely on the view_as_real conversion at the Python level. If we move it, we'd need to add it back in every backend (like you did here). Another way to do it is at the dispatcher:
https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/Ops.cpp
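For context, the Python-level conversion referred to above looks roughly like this. It is a paraphrase, not the verbatim code from torch/distributed/distributed_c10d.py, and the deny list below is approximate.

```python
import torch
import torch.distributed as dist

# Reductions that are not well-defined elementwise on the real view of a
# complex tensor (approximate deny list).
_COMPLEX_UNSAFE_OPS = [dist.ReduceOp.MAX, dist.ReduceOp.MIN, dist.ReduceOp.PRODUCT]

def all_reduce_with_complex_view(tensor, op=dist.ReduceOp.SUM, group=None):
    if tensor.is_complex():
        if op in _COMPLEX_UNSAFE_OPS:
            raise ValueError(f"all_reduce does not support {op} on complex tensors")
        # The C++ backend only ever sees real data; the in-place collective
        # on the real view updates the underlying complex tensor as well.
        tensor = torch.view_as_real(tensor)
    dist.all_reduce(tensor, op=op, group=group)
```

Because this conversion happens before the backend is reached, any caller that enters the collective directly from C++ (such as DDP's reducer) bypasses it entirely, which is exactly the failure mode this PR addresses.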
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The necessity of this PR lies in the fact that the autograd engine + DDP calls `all_reduce` from C++, so the changes must be made in C++.

```
[rank0]: Traceback (most recent call last):
[rank0]:   File "~/complex_ddp.py", line 72, in <module>
[rank0]:     main()
[rank0]:   File "~/complex_ddp.py", line 64, in main
[rank0]:     loss.backward()
[rank0]:   File "/home/usr/pytorch/torch/_tensor.py", line 525, in backward
[rank0]:     torch.autograd.backward(
[rank0]:   File "/home/usr/pytorch/torch/autograd/__init__.py", line 267, in backward
[rank0]:     _engine_run_backward(
[rank0]:   File "/home/usr/pytorch/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank0]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]: TypeError: Input tensor data type is not supported for NCCL process group: ComplexFloat
```

I believe that, to minimize the Python overhead, the same could be done for the rest of the ops. What do you think @kwen2501?

Pull Request resolved: #121045
Approved by: https://github.com/eqy, https://github.com/kwen2501
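A condensed sketch of the kind of script (like the `complex_ddp.py` in the traceback) that exercises this path: DDP's reducer calls `all_reduce` from C++ during backward, so complex gradients never pass through the Python wrapper. The model and setup details here are assumptions, not the original script.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

class ComplexLinear(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Complex parameter -> complex gradient buckets in DDP.
        self.w = torch.nn.Parameter(torch.randn(8, 8, dtype=torch.cfloat))

    def forward(self, x):
        return x @ self.w

def main():
    rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group("nccl")
    torch.cuda.set_device(rank)
    model = DDP(ComplexLinear().cuda(rank), device_ids=[rank])
    x = torch.randn(4, 8, dtype=torch.cfloat, device=f"cuda:{rank}")
    loss = model(x).abs().sum()  # real-valued loss over complex outputs
    loss.backward()  # gradient-bucket all_reduce happens in C++ here
    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # launch with e.g.: torchrun --nproc_per_node=<N> this_script.py
```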
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @eqy @ptrblck