Allow to specify a set of devices for CUDAFuture #56515
Conversation
In #56405 we finally found a solution to support RPC remote user functions that create/use CUDA tensors on devices not used by their arguments: we define a "bounding set" of devices when constructing the agent and allow all functions to freely use any of those devices. We had the exact same problem with the callbacks of CUDAFuture, and in this PR I'm adopting the exact same solution: I allow specifying a set of devices when constructing a CUDAFuture, and every callback is then allowed to use any of those devices. (These devices will also be propagated to child futures.) I'm also making ProcessGroupNCCL pass these devices. I can't yet do it for TensorPipeAgent until #56405 lands. Differential Revision: [D27861067](https://our.internmc.facebook.com/intern/diff/D27861067/)
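For concreteness, here is a minimal usage sketch of the API described above, based on the ProcessGroupNCCL hunk quoted later in this review. The helper function name and the include list are assumptions made for illustration; only the constructor call itself is taken from the diff.

```cpp
#include <ATen/core/jit_type.h>
#include <ATen/cuda/CUDAFuture.h>
#include <c10/core/Device.h>

#include <utility>
#include <vector>

// Hypothetical helper (not from the PR): construct a CUDAFuture whose
// callbacks may freely use CUDA devices 0 and 1, even if the value they
// receive only holds tensors on one of those devices.
c10::intrusive_ptr<at::cuda::CUDAFuture> makeBoundedFuture() {
  std::vector<c10::DeviceIndex> deviceIndices = {0, 1};
  return c10::make_intrusive<at::cuda::CUDAFuture>(
      c10::ListType::create(c10::TensorType::get()),
      std::move(deviceIndices));
}
```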
💊 CI failures summary and remediations: as of commit 3e9423a (more details on the Dr. CI page), 💚 looks good so far! There are no failures yet. 💚 (This comment was automatically generated by Dr. CI.)
aten/src/ATen/cuda/CUDAFuture.h (Outdated)
// that the parent future didn't use. This field is set to the value provided
// in the constructor and will be "inherited" by all child futures.
// FIXME Remove the c10::optional once the TensorPipe agent can provide this.
c10::optional<std::vector<c10::DeviceIndex>> devices_;
looks like this can be const?
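If the member really is only assigned in the constructor, the change being asked about would be a one-line sketch along these lines (hedged: it only works if no code path reassigns the field later, e.g. if devices are propagated to child futures through their constructor rather than by assignment):

```cpp
// Sketch of the suggested change: make the device set immutable after
// construction. Only valid if devices_ is never reassigned elsewhere.
const c10::optional<std::vector<c10::DeviceIndex>> devices_;
```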
torch/lib/c10d/ProcessGroupNCCL.cpp (Outdated)
@@ -1086,8 +1086,14 @@ c10::intrusive_ptr<ProcessGroup::Work> ProcessGroupNCCL::collective(
  {
    at::cuda::CUDAMultiStreamGuard streamGuard(ncclStreams_[key]);
    std::vector<c10::DeviceIndex> deviceIndices;
nit: add deviceIndices.reserve(device.size())?
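A sketch of the pattern the nit refers to, assuming the surrounding collective() code already has a vector of devices to iterate over (the helper name and parameter are made up for illustration):

```cpp
#include <c10/core/Device.h>

#include <vector>

// Hypothetical helper: turn the device list that ProcessGroupNCCL already
// has into the plain indices the CUDAFuture constructor expects, reserving
// capacity up front as suggested to avoid reallocations.
std::vector<c10::DeviceIndex> toDeviceIndices(
    const std::vector<c10::Device>& devices) {
  std::vector<c10::DeviceIndex> deviceIndices;
  deviceIndices.reserve(devices.size());
  for (const c10::Device& device : devices) {
    deviceIndices.push_back(device.index());
  }
  return deviceIndices;
}
```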
aten/src/ATen/cuda/CUDAFuture.cpp (Outdated)
    isCudaDeviceUsed[data_ptr.device().index()] = true;
  }
}
std::vector<c10::DeviceIndex> device_indices;
camel naming?
I'm seeing a lot of inconsistency between snake case and camel case, hence I never know what to do. Is there some convention we're following/enforcing?
I usually try to keep the same naming convention within the same file.
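For context on the snippet this thread refers to, here is a self-contained sketch of the bookkeeping it performs; only the lines quoted in the diff are from the actual code, while the function name, parameter type, and includes are assumptions. A per-device usage bitmap is filled from the value's data pointers and then compacted into a list of device indices.

```cpp
#include <c10/core/Allocator.h>
#include <c10/core/Device.h>
#include <c10/cuda/CUDAFunctions.h>

#include <vector>

// Sketch only: given the DataPtrs extracted from the future's value, record
// which CUDA devices are touched and return them as a compact index list.
std::vector<c10::DeviceIndex> usedCudaDevices(
    const std::vector<c10::DataPtr>& dataPtrs) {
  std::vector<bool> isCudaDeviceUsed(c10::cuda::device_count(), false);
  for (const c10::DataPtr& data_ptr : dataPtrs) {
    if (data_ptr.device().is_cuda()) {
      isCudaDeviceUsed[data_ptr.device().index()] = true;
    }
  }
  // Compact the bitmap into indices (named camelCase here, per the nit above).
  std::vector<c10::DeviceIndex> deviceIndices;
  for (size_t idx = 0; idx < isCudaDeviceUsed.size(); idx++) {
    if (isCudaDeviceUsed[idx]) {
      deviceIndices.push_back(static_cast<c10::DeviceIndex>(idx));
    }
  }
  return deviceIndices;
}
```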
torch/lib/c10d/ProcessGroupNCCL.cpp (Outdated)
  work->future_ = c10::make_intrusive<at::cuda::CUDAFuture>(
-     c10::ListType::create(c10::TensorType::get()));
+     c10::ListType::create(c10::TensorType::get()),
+     std::move(deviceIndices));
I think there is also a CUDAFuture construction in ProcessGroupNCCL::pointToPoint; do you want to add those changes there too?
Good catch, thanks!
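A hedged sketch of what the analogous fix inside ProcessGroupNCCL::pointToPoint might look like, assuming it mirrors the collective() hunk quoted above; the `devices` and `work` variables are assumed to exist in that scope and this is not the actual diff.

```cpp
// Inside ProcessGroupNCCL::pointToPoint (sketch, not the actual change):
// bind the returned future to the devices used by this point-to-point op,
// exactly as done in collective() above.
std::vector<c10::DeviceIndex> deviceIndices;
deviceIndices.reserve(devices.size());
for (const at::Device& device : devices) {
  deviceIndices.push_back(device.index());
}
work->future_ = c10::make_intrusive<at::cuda::CUDAFuture>(
    c10::ListType::create(c10::TensorType::get()),
    std::move(deviceIndices));
```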
aten/src/ATen/cuda/CUDAFuture.cpp (Outdated)
    excessDevices.empty(),
    "The result contained tensors residing on device(s) ",
    formatSetOfDevices(excessDevices),
    " which are not among the expected device(s) ",
nit: mention that the user can specify these devices when constructing the CUDAFuture?
Uhm, that makes sense, but I'm not sure how we can do it properly: this error could also be raised when a callback returns a value on a "bad" device, but the parent future could come from a variety of places (hand-constructed, returned from ProcessGroupNCCL, or from the TensorPipeAgent) each of which has its own way of setting the supported devices, and I don't think we want to list and explain all of them here. Do you have a wording that you think would be generic but still convey the message?
I see, it's a good point that it can come from a variety of places, so probably logging the devices like we do is sufficient here.
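For reference, a hedged sketch of the check the quoted message belongs to: the function name, the `usedDevices`/`allowedDevices` parameters, and the signature of `formatSetOfDevices` (the helper visible in the diff) are all assumptions; both input vectors are assumed to be sorted.

```cpp
#include <c10/core/Device.h>
#include <c10/util/Exception.h>

#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Assumed signature for the helper visible in the diff above.
std::string formatSetOfDevices(const std::vector<c10::DeviceIndex>& devices);

// Sketch only: fail if the value uses any CUDA device outside the set the
// CUDAFuture was constructed with.
void checkDevices(
    const std::vector<c10::DeviceIndex>& usedDevices,
    const std::vector<c10::DeviceIndex>& allowedDevices) {
  std::vector<c10::DeviceIndex> excessDevices;
  std::set_difference(
      usedDevices.begin(), usedDevices.end(),
      allowedDevices.begin(), allowedDevices.end(),
      std::back_inserter(excessDevices));
  TORCH_CHECK(
      excessDevices.empty(),
      "The result contained tensors residing on device(s) ",
      formatSetOfDevices(excessDevices),
      " which are not among the expected device(s) ",
      formatSetOfDevices(allowedDevices));
}
```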
This pull request has been merged in 58d12eb.
Summary:
Pull Request resolved: pytorch#56515

In pytorch#56405 we finally found a solution to support RPC remote user functions that create/use CUDA tensors on devices not used by their arguments: we define a "bounding set" of devices when constructing the agent and allow all functions to freely use any of those devices. We had the exact same problem with the callbacks of CUDAFuture, and in this PR I'm adopting the exact same solution: I allow specifying a set of devices when constructing a CUDAFuture, and every callback is then allowed to use any of those devices. (These devices will also be propagated to child futures.) I'm also making ProcessGroupNCCL pass these devices. I can't yet do it for TensorPipeAgent until pytorch#56405 lands.

ghstack-source-id: 127261552

Test Plan: Added a test for this later in the stack.

Reviewed By: mrshenli

Differential Revision: D27861067

fbshipit-source-id: 8ab2c9d06a514c0407a7e96abc3704e8d5c5dc09
Stack from ghstack:
In #56405 we finally found a solution to support RPC remote user functions that create/use CUDA tensors on devices not used by their arguments: we define a "bounding set" of devices when constructing the agent and allow all functions to freely use any of those devices.
We had the exact same problem with the callbacks of CUDAFuture, and in this PR I'm adopting the exact same solution: I allow specifying a set of devices when constructing a CUDAFuture, and every callback is then allowed to use any of those devices. (These devices will also be propagated to child futures.)
I'm also making ProcessGroupNCCL pass these devices. I can't yet do it for TensorPipeAgent until #56405 lands.
Differential Revision: D27861067