lw (Contributor) commented Apr 20, 2021

Stack from ghstack:

In #56405 we finally found a solution for supporting RPC remote user functions that create/use CUDA tensors on devices not used by their arguments: we define a "bounding set" of devices when constructing the agent and allow all functions to freely use any of those devices.

We had exactly the same problem with the callbacks of CUDAFuture, and in this PR I'm adopting exactly the same solution: I allow specifying a set of devices when constructing a CUDAFuture, and every callback is then allowed to use any of those devices. (These devices will also be propagated to child futures.)

I'm also making ProcessGroupNCCL pass these devices. I can't do the same for TensorPipeAgent until #56405 lands.
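To illustrate, here is a minimal sketch of the resulting usage, modeled on the ProcessGroupNCCL change in this PR (the device values and the callback remark are made up for the example):

```cpp
// Sketch: construct a CUDAFuture with the "bounding set" of devices
// that its callbacks (and any child futures) are allowed to use.
std::vector<c10::DeviceIndex> deviceIndices = {0, 1}; // hypothetical devices
auto future = c10::make_intrusive<at::cuda::CUDAFuture>(
    c10::ListType::create(c10::TensorType::get()),
    std::move(deviceIndices));
// A callback attached to this future may now freely create/use tensors
// on cuda:0 and cuda:1, even if the future's value doesn't touch them.
```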

Differential Revision: D27861067

facebook-github-bot commented Apr 20, 2021

💊 CI failures summary and remediations

As of commit 3e9423a (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

// that the parent future didn't use. This field is set to the value provided
// in the constructor and will be "inherited" by all child futures.
// FIXME Remove the c10::optional once the TensorPipe agent can provide this.
c10::optional<std::vector<c10::DeviceIndex>> devices_;
Contributor

looks like this can be const?
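For reference, the suggestion amounts to something like this (sketch of just the member declaration):

```cpp
// Set once in the constructor and only read afterwards, so it could be const:
const c10::optional<std::vector<c10::DeviceIndex>> devices_;
```

One trade-off to weigh: a const data member suppresses the implicitly generated copy/move assignment operators, which may or may not matter for this class.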

@@ -1086,8 +1086,14 @@ c10::intrusive_ptr<ProcessGroup::Work> ProcessGroupNCCL::collective(

{
at::cuda::CUDAMultiStreamGuard streamGuard(ncclStreams_[key]);
std::vector<c10::DeviceIndex> deviceIndices;
Contributor

nit: add deviceIndices.reserve(device.size())?
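This is the usual reserve-before-push_back pattern; roughly (the `devices` container name is an assumption based on the surrounding code):

```cpp
std::vector<c10::DeviceIndex> deviceIndices;
deviceIndices.reserve(devices.size()); // avoid reallocations in the loop below
for (const at::Device& device : devices) {
  deviceIndices.push_back(device.index());
}
```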

isCudaDeviceUsed[data_ptr.device().index()] = true;
}
}
std::vector<c10::DeviceIndex> device_indices;
Contributor

camelCase naming?

Contributor Author

I'm seeing a lot of inconsistency between snake case and camel case, hence I never know what to do. Is there some convention we're following/enforcing?

Contributor

I usually try to keep the same naming convention within the same file.

 work->future_ = c10::make_intrusive<at::cuda::CUDAFuture>(
-    c10::ListType::create(c10::TensorType::get()));
+    c10::ListType::create(c10::TensorType::get()),
+    std::move(deviceIndices));
Contributor

I think there is also a CUDAFuture constructed in ProcessGroupNCCL::pointToPoint; do you want to add those changes there too?

Contributor Author

Good catch, thanks!
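Presumably the fix in pointToPoint mirrors the collective path; a sketch, assuming a single device per peer-to-peer operation:

```cpp
// Sketch (not the verbatim change): pass the p2p tensor's device index
// to the future so its callbacks may use that device.
std::vector<c10::DeviceIndex> deviceIndices = {device.index()};
work->future_ = c10::make_intrusive<at::cuda::CUDAFuture>(
    c10::ListType::create(c10::TensorType::get()),
    std::move(deviceIndices));
```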

excessDevices.empty(),
"The result contained tensors residing on device(s) ",
formatSetOfDevices(excessDevices),
" which are not among the expected device(s) ",
Contributor

nit: mention that the user can specify these devices when constructing the CUDAFuture?

Contributor Author

Uhm, that makes sense, but I'm not sure how we can do it properly: this error could also be raised when a callback returns a value on a "bad" device, but the parent future could come from a variety of places (hand-constructed, returned from ProcessGroupNCCL, or from the TensorPipeAgent), each of which has its own way of setting the supported devices, and I don't think we want to list and explain all of them here. Do you have a wording that you think would be generic but still convey the message?

Contributor

I see, it's a good point that it can come from a variety of places, so probably logging the devices like we do is sufficient here.
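For context, the check under discussion validates the devices of the result value against the future's device set; reconstructed roughly around the excerpt above (the `dataPtrs` extraction is assumed, `std::find` needs `<algorithm>`, and a real implementation would likely dedupe):

```cpp
// Sketch: collect any CUDA devices used by the result tensors that are
// outside the allowed set, then fail with the quoted error message.
std::vector<c10::DeviceIndex> excessDevices;
for (const at::DataPtr& data_ptr : dataPtrs) { // data ptrs extracted from the value
  c10::DeviceIndex idx = data_ptr.device().index();
  if (data_ptr.device().is_cuda() &&
      std::find(devices_->begin(), devices_->end(), idx) == devices_->end()) {
    excessDevices.push_back(idx);
  }
}
TORCH_CHECK(
    excessDevices.empty(),
    "The result contained tensors residing on device(s) ",
    formatSetOfDevices(excessDevices),
    " which are not among the expected device(s) ",
    formatSetOfDevices(*devices_));
```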

lw added 2 commits April 22, 2021 06:24
facebook-github-bot

This pull request has been merged in 58d12eb.

facebook-github-bot deleted the gh/lw/129/head branch April 27, 2021 14:16
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
Summary:
Pull Request resolved: pytorch#56515

In pytorch#56405 we finally found a solution for supporting RPC remote user functions that create/use CUDA tensors on devices not used by their arguments: we define a "bounding set" of devices when constructing the agent and allow all functions to freely use any of those devices.

We had exactly the same problem with the callbacks of CUDAFuture, and in this PR I'm adopting exactly the same solution: I allow specifying a set of devices when constructing a CUDAFuture, and every callback is then allowed to use any of those devices. (These devices will also be propagated to child futures.)

I'm also making ProcessGroupNCCL pass these devices. I can't do the same for TensorPipeAgent until pytorch#56405 lands.
ghstack-source-id: 127261552
ghstack-source-id: 127261552

Test Plan: Added a test for this later in the stack.

Reviewed By: mrshenli

Differential Revision: D27861067

fbshipit-source-id: 8ab2c9d06a514c0407a7e96abc3704e8d5c5dc09