[RPC Framework] Forbid process group backend in RPC #56854
Conversation
To resolve #51670, forbid the process group RPC backend. This avoids the need to check the current backend in the torch script `remote_module_template`. Differential Revision: [D27984658](https://our.internmc.facebook.com/intern/diff/D27984658/) [ghstack-poisoned]
💊 Dr. CI bot: As of commit 36f1a3f, 6 new CI failures were recognized by patterns and do not appear to be due to upstream breakages, including a failure in the "Run mypy" step.
Pull Request resolved: #56854 To resolve #51670, forbid the process group RPC backend. This avoids the need to check the current backend in the torch script `remote_module_template`. ghstack-source-id: 127345847 Differential Revision: [D27984658](https://our.internmc.facebook.com/intern/diff/D27984658/)
Yep, the plan is to keep the PG RPC backend in v1.9. If the goal is to make RemoteModule only work with the TensorPipe backend, I think it is fine.
Yes, I understand that this PR does not completely deprecate the PG backend for RPC. This is just the minimum work to support a follow-up PR. I plan to discuss with @lw and create a separate PR to completely remove the PG backend.
I might be missing something, but IIUC this PR will throw an error whenever the PG backend is used. If this is what you would like to avoid, is it acceptable to let RemoteModule error out when the backend is PG?
Curious, how does RemoteModule know whether the device map is provided? Does it read it from the RPC agent state?
Yes, per discussion with @pritamdamania87, we plan to have RemoteModule error out on the PG backend.
I agree with @mrshenli and @rohan-varma; we should not forbid the PG backend in 1.9. If some component requires the RPC agent to support CUDA tensors, we can add runtime checks for this in that component. If it helps, I think we could consider adding a new method for this.
Thanks for the suggestion! I am not sure it's worthwhile to add a new method here temporarily, only for version 1.9. If we don't want to forbid the PG backend before 1.9, we can probably bypass this issue for now. I created #56943, which always moves the forward output back to CPU for now, and added some TODOs to fix this after 1.9.
Same thought, but none of the existing getters expose it. @lw: could you provide a pointer?
@mrshenli @pritamdamania87 Doesn't dist autograd also need to retrieve the device maps from the RPC agent in order to store them for use in the backward pass? How does it do so? If there's already some code to do that, we should reuse it. If not, I'm fine with adding that method.
This will break the promise we made in #55615.
@mrshenli Thanks for referring to this issue! I don't think there is a need to forbid the PG backend at this time. Now I realize that I don't really need to check the RPC backend in RemoteModule. Since the TensorPipe backend can only send GPU tensors over the wire when a device map is set, all I need to check in RemoteModule is whether a device map is set. We can close this PR once @lw's comment #56854 (comment) is addressed.
Yes, I added support for this in #44859.
Thanks for the pointer, @lw! So it seems that all we need now is to expose a Python API for the C++ getter.
Expose a Python API to get the device map and unblock RemoteModule work. See: #56854 (comment) Additionally, add a const decorator for the C++ getter. #Original PR issue: #51670 Differential Revision: [D28070160](https://our.internmc.facebook.com/intern/diff/D28070160/) [ghstack-poisoned]
Pull Request resolved: #57179 Expose a Python API to get the device map and unblock RemoteModule work. See: #56854 (comment) Additionally, add a const decorator for the C++ getter. #Original PR issue: #51670 ghstack-source-id: 127684266 Differential Revision: [D28070160](https://our.internmc.facebook.com/intern/diff/D28070160/)
Summary: Pull Request resolved: #57179 Expose a Python API to get the device map and unblock RemoteModule work. See: #56854 (comment) Additionally, add a const decorator for the C++ getter. #Original PR issue: #51670 ghstack-source-id: 127684266 Test Plan: waitforbuildbot Reviewed By: mrshenli Differential Revision: D28070160 fbshipit-source-id: 624d14552d82b99487f72e16428fa75c7a47f61f
Abandoning this PR. There is no need to forbid the Process Group RPC backend. Whether to allow sending GPU tensors over the wire is now determined by whether a device map on the remote worker is set.
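The resolution above reduces to a single predicate: a non-empty per-destination device map means CUDA tensors may cross the wire; an empty one means outputs must be moved to CPU first. The function name below is illustrative, standing in for a check built on the Python device-map accessor added in #57179:

```python
def gpu_tensors_allowed(device_map_for_dest: dict) -> bool:
    # TensorPipe only transfers CUDA tensors to a destination worker
    # when a device map for that worker was configured. The Process
    # Group backend never has one, so it falls out of this check
    # naturally, with no need to forbid the backend itself.
    return bool(device_map_for_dest)
```

This is why the backend-type check became unnecessary: the device map alone carries the information RemoteModule needs.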
Stack from ghstack:
To resolve #51670, forbid the process group RPC backend. This avoids the need to check the current backend in the torch script `remote_module_template`. Otherwise, we need to check the current RPC backend to determine whether we want to move the forward output back to CPU.
Differential Revision: D27984658