Allow to specify a set of devices for CUDAFuture #56515
Conversation
In #56405 we finally found a solution to support RPC remote user functions that create/use CUDA tensors on devices not used by their arguments: we define a "bounding set" of devices when constructing the agent and allow all functions to freely use any of those devices. We had the exact same problem with the callbacks of CUDAFuture, and in this PR I'm adopting the exact same solution: I allow specifying a set of devices when constructing a CUDAFuture, and every callback is then allowed to use any of those devices. (These devices will also be propagated to child futures.) I'm also making ProcessGroupNCCL pass these devices. I can't yet do it for TensorPipeAgent until #56405 lands. Differential Revision: [D27861067](https://our.internmc.facebook.com/intern/diff/D27861067/)
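For concreteness, here is a minimal usage sketch of the API described above, based on the ProcessGroupNCCL hunk quoted later in this review. The helper function name and the include list are assumptions made for illustration; only the constructor call itself is taken from the diff.

```cpp
#include <ATen/core/jit_type.h>
#include <ATen/cuda/CUDAFuture.h>
#include <c10/core/Device.h>

#include <utility>
#include <vector>

// Hypothetical helper (not from the PR): construct a CUDAFuture whose
// callbacks may freely use CUDA devices 0 and 1, even if the value they
// receive only holds tensors on one of those devices.
c10::intrusive_ptr<at::cuda::CUDAFuture> makeBoundedFuture() {
  std::vector<c10::DeviceIndex> deviceIndices = {0, 1};
  return c10::make_intrusive<at::cuda::CUDAFuture>(
      c10::ListType::create(c10::TensorType::get()),
      std::move(deviceIndices));
}
```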
💊 CI failures summary and remediations: as of commit 3e9423a (more details on the Dr. CI page), 💚 looks good so far! There are no failures yet. 💚 (This comment was automatically generated by Dr. CI.)
aten/src/ATen/cuda/CUDAFuture.h (Outdated)
// that the parent future didn't use. This field is set to the value provided
// in the constructor and will be "inherited" by all child futures.
// FIXME Remove the c10::optional once the TensorPipe agent can provide this.
c10::optional<std::vector<c10::DeviceIndex>> devices_;
looks like this can be const?
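If the member really is only assigned in the constructor, the change being asked about would be a one-line sketch along these lines (hedged: it only works if no code path reassigns the field later, e.g. if devices are propagated to child futures through their constructor rather than by assignment):

```cpp
// Sketch of the suggested change: make the device set immutable after
// construction. Only valid if devices_ is never reassigned elsewhere.
const c10::optional<std::vector<c10::DeviceIndex>> devices_;
```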
torch/lib/c10d/ProcessGroupNCCL.cpp (Outdated)
@@ -1086,8 +1086,14 @@ c10::intrusive_ptr<ProcessGroup::Work> ProcessGroupNCCL::collective(
  {
    at::cuda::CUDAMultiStreamGuard streamGuard(ncclStreams_[key]);
    std::vector<c10::DeviceIndex> deviceIndices;
nit: add deviceIndices.reserve(device.size())?
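A sketch of the pattern the nit refers to, assuming the surrounding collective() code already has a vector of devices to iterate over (the helper name and parameter are made up for illustration):

```cpp
#include <c10/core/Device.h>

#include <vector>

// Hypothetical helper: turn the device list that ProcessGroupNCCL already
// has into the plain indices the CUDAFuture constructor expects, reserving
// capacity up front as suggested to avoid reallocations.
std::vector<c10::DeviceIndex> toDeviceIndices(
    const std::vector<c10::Device>& devices) {
  std::vector<c10::DeviceIndex> deviceIndices;
  deviceIndices.reserve(devices.size());
  for (const c10::Device& device : devices) {
    deviceIndices.push_back(device.index());
  }
  return deviceIndices;
}
```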
aten/src/ATen/cuda/CUDAFuture.cpp (Outdated)
    isCudaDeviceUsed[data_ptr.device().index()] = true;
  }
}
std::vector<c10::DeviceIndex> device_indices;
camel naming?
I'm seeing a lot of inconsistency between snake case and camel case, hence I never know what to do. Is there some convention we're following/enforcing?
I usually try to keep the same naming convention within the same file.
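For context on the snippet this thread refers to, here is a self-contained sketch of the bookkeeping it performs; only the lines quoted in the diff are from the actual code, while the function name, parameter type, and includes are assumptions. A per-device usage bitmap is filled from the value's data pointers and then compacted into a list of device indices.

```cpp
#include <c10/core/Allocator.h>
#include <c10/core/Device.h>
#include <c10/cuda/CUDAFunctions.h>

#include <vector>

// Sketch only: given the DataPtrs extracted from the future's value, record
// which CUDA devices are touched and return them as a compact index list.
std::vector<c10::DeviceIndex> usedCudaDevices(
    const std::vector<c10::DataPtr>& dataPtrs) {
  std::vector<bool> isCudaDeviceUsed(c10::cuda::device_count(), false);
  for (const c10::DataPtr& data_ptr : dataPtrs) {
    if (data_ptr.device().is_cuda()) {
      isCudaDeviceUsed[data_ptr.device().index()] = true;
    }
  }
  // Compact the bitmap into indices (named camelCase here, per the nit above).
  std::vector<c10::DeviceIndex> deviceIndices;
  for (size_t idx = 0; idx < isCudaDeviceUsed.size(); idx++) {
    if (isCudaDeviceUsed[idx]) {
      deviceIndices.push_back(static_cast<c10::DeviceIndex>(idx));
    }
  }
  return deviceIndices;
}
```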
torch/lib/c10d/ProcessGroupNCCL.cpp (Outdated)
  work->future_ = c10::make_intrusive<at::cuda::CUDAFuture>(
-     c10::ListType::create(c10::TensorType::get()));
+     c10::ListType::create(c10::TensorType::get()),
+     std::move(deviceIndices));
I think there is also a CUDAFuture construction in ProcessGroupNCCL::pointToPoint; do you want to add those changes there too?
Good catch, thanks!
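A hedged sketch of what the analogous fix inside ProcessGroupNCCL::pointToPoint might look like, assuming it mirrors the collective() hunk quoted above; the `devices` and `work` variables are assumed to exist in that scope and this is not the actual diff.

```cpp
// Inside ProcessGroupNCCL::pointToPoint (sketch, not the actual change):
// bind the returned future to the devices used by this point-to-point op,
// exactly as done in collective() above.
std::vector<c10::DeviceIndex> deviceIndices;
deviceIndices.reserve(devices.size());
for (const at::Device& device : devices) {
  deviceIndices.push_back(device.index());
}
work->future_ = c10::make_intrusive<at::cuda::CUDAFuture>(
    c10::ListType::create(c10::TensorType::get()),
    std::move(deviceIndices));
```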
aten/src/ATen/cuda/CUDAFuture.cpp (Outdated)
    excessDevices.empty(),
    "The result contained tensors residing on device(s) ",
    formatSetOfDevices(excessDevices),
    " which are not among the expected device(s) ",
nit: mention that the user can specify these devices when constructing the CUDAFuture?
Uhm, that makes sense, but I'm not sure how we can do it properly: this error could also be raised when a callback returns a value on a "bad" device, but the parent future could come from a variety of places (hand-constructed, returned from ProcessGroupNCCL, or from the TensorPipeAgent) each of which has its own way of setting the supported devices, and I don't think we want to list and explain all of them here. Do you have a wording that you think would be generic but still convey the message?
I see, it's a good point that it can come from a variety of places, so probably logging the devices like we do is sufficient here.
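For reference, a hedged sketch of the check the quoted message belongs to: the function name, the `usedDevices`/`allowedDevices` parameters, and the signature of `formatSetOfDevices` (the helper visible in the diff) are all assumptions; both input vectors are assumed to be sorted.

```cpp
#include <c10/core/Device.h>
#include <c10/util/Exception.h>

#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Assumed signature for the helper visible in the diff above.
std::string formatSetOfDevices(const std::vector<c10::DeviceIndex>& devices);

// Sketch only: fail if the value uses any CUDA device outside the set the
// CUDAFuture was constructed with.
void checkDevices(
    const std::vector<c10::DeviceIndex>& usedDevices,
    const std::vector<c10::DeviceIndex>& allowedDevices) {
  std::vector<c10::DeviceIndex> excessDevices;
  std::set_difference(
      usedDevices.begin(), usedDevices.end(),
      allowedDevices.begin(), allowedDevices.end(),
      std::back_inserter(excessDevices));
  TORCH_CHECK(
      excessDevices.empty(),
      "The result contained tensors residing on device(s) ",
      formatSetOfDevices(excessDevices),
      " which are not among the expected device(s) ",
      formatSetOfDevices(allowedDevices));
}
```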
This pull request has been merged in 58d12eb.
Summary:
Pull Request resolved: pytorch#56515

In pytorch#56405 we finally found a solution to support RPC remote user functions that create/use CUDA tensors on devices not used by their arguments: we define a "bounding set" of devices when constructing the agent and allow all functions to freely use any of those devices. We had the exact same problem with the callbacks of CUDAFuture, and in this PR I'm adopting the exact same solution: I allow specifying a set of devices when constructing a CUDAFuture, and every callback is then allowed to use any of those devices. (These devices will also be propagated to child futures.) I'm also making ProcessGroupNCCL pass these devices. I can't yet do it for TensorPipeAgent until pytorch#56405 lands.

ghstack-source-id: 127261552

Test Plan: Added a test for this later in the stack.

Reviewed By: mrshenli

Differential Revision: D27861067

fbshipit-source-id: 8ab2c9d06a514c0407a7e96abc3704e8d5c5dc09
Stack from ghstack:
In #56405 we finally found a solution to support RPC remote user functions that create/use CUDA tensors on devices not used by their arguments: we define a "bounding set" of devices when constructing the agent and allow all functions to freely use any of those devices.
We had the exact same problem with the callbacks of CUDAFuture, and in this PR I'm adopting the exact same solution: I allow specifying a set of devices when constructing a CUDAFuture, and every callback is then allowed to use any of those devices. (These devices will also be propagated to child futures.)
I'm also making ProcessGroupNCCL pass these devices. I can't yet do it for TensorPipeAgent until #56405 lands.
Differential Revision: D27861067