
Remove NCCL dependency from PythonFutureWrapper #48495

Closed · wants to merge 8 commits
Conversation

lw (Contributor) commented Nov 26, 2020

Stack from ghstack:

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).


PythonFutureWrapper needs to provide a GIL-aware way to extract tensors from an IValue of type PyObject. Since this was only used by FutureNCCL, it was guarded by #ifdef USE_C10D_NCCL. However, we will need to use it with CUDA-aware futures other than the NCCL one. This could have been achieved simply by replacing USE_C10D_NCCL with USE_CUDA, but I wanted to clean this up properly.

We're dealing with two independent dimensions: C++-vs-Python and CPU-vs-CUDA. To make the code more modular, the two dimensions should be dealt with by orthogonal solutions: the user setting a custom callback to handle Python, and the subclass being CUDA-aware. Mixing these two axes makes it more complicated.

Another reason for changing how this works is that later on, when we introduce multi-device support, we will need to extract data pointers for purposes other than just recording streams with the caching allocator, namely to inspect the value and determine which devices it resides on.

Differential Revision: D25177560
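To make the "custom callback to handle Python" idea concrete, here is a minimal sketch of what a Python-aware DataPtr extractor could look like. This is not the actual PyTorch implementation; extractTensorsFromPyObject is a hypothetical placeholder standing in for whatever logic unpacks tensors out of the Python object.

```cpp
#include <functional>
#include <vector>

#include <ATen/ATen.h>
#include <ATen/core/ivalue.h>
#include <pybind11/pybind11.h>

namespace py = pybind11;

// Hypothetical placeholder: real logic would traverse the Python object
// (e.g. via the pickler) and collect any tensors it contains.
std::vector<at::Tensor> extractTensorsFromPyObject(const py::object& /*obj*/) {
  return {};
}

// Sketch of a Python-aware extractor that could be registered on the future.
std::vector<std::reference_wrapper<const at::DataPtr>> pythonAwareDataPtrExtractor(
    const at::IValue& value) {
  std::vector<std::reference_wrapper<const at::DataPtr>> data_ptrs;
  if (value.isPyObject()) {
    // Touching a PyObject requires holding the GIL.
    py::gil_scoped_acquire gil;
    py::object obj = py::reinterpret_borrow<py::object>(value.toPyObject());
    for (const at::Tensor& tensor : extractTensorsFromPyObject(obj)) {
      data_ptrs.emplace_back(tensor.storage().data_ptr());
    }
  } else if (value.isTensor()) {
    // Non-Python values can go through the default extraction path.
    data_ptrs.emplace_back(value.toTensor().storage().data_ptr());
  }
  return data_ptrs;
}
```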

// Expose the default implementation so that external ones can defer to it.
static std::vector<std::reference_wrapper<const at::DataPtr>>
defaultDataPtrExtractor(const at::IValue& value) {
  // FIXME Should we support more types than just tensors and tensor lists?
Contributor:

Yes, if we are going to use this as a general CudaFuture. But it can come in follow-up PRs.

Contributor Author (lw):

Indeed, I'm doing this in #48502
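For context, a minimal sketch of a default extractor limited to tensors and tensor lists, as the FIXME above describes. This is illustrative only; the actual defaultDataPtrExtractor may differ.

```cpp
#include <functional>
#include <vector>

#include <ATen/ATen.h>
#include <ATen/core/ivalue.h>

std::vector<std::reference_wrapper<const at::DataPtr>> defaultDataPtrExtractorSketch(
    const at::IValue& value) {
  std::vector<std::reference_wrapper<const at::DataPtr>> data_ptrs;
  if (value.isTensor()) {
    data_ptrs.emplace_back(value.toTensor().storage().data_ptr());
  } else if (value.isTensorList()) {
    for (const at::Tensor& tensor : value.toTensorVector()) {
      data_ptrs.emplace_back(tensor.storage().data_ptr());
    }
  }
  // Other containers (generic lists, dicts, tuples) are not traversed here;
  // extending this is what the follow-up referenced above (#48502) is about.
  return data_ptrs;
}
```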

using DataPtrExtractor =
    std::function<std::vector<std::reference_wrapper<const at::DataPtr>>(
        const at::IValue&)>;
virtual void setDataPtrExtractor(DataPtrExtractor data_ptr_extractor) {}
Contributor:

I assume this is an intermediate state, as setDataPtrExtractor exists in the base class but dataPtrExtractor_ only lives in subclasses? If this will be the long-term solution, do we need to rename this function? Otherwise setDataPtrExtractor is not doing what the name suggests in non-FutureNCCL classes.

Contributor Author (lw):

To be honest, I was thinking of this as a long-term solution. Well, actually, I didn't think about it too much, because this is basically how it was already done (the setRecordStreamCallback was a no-op in ivalue::Future and was only implemented by FutureNCCL). I was fine with such a solution as I read the semantics of this method basically as "setDataPtrExtractorIfNeeded".

Also, later on I'll do something similar to this in order to merge some FutureNCCL logic into ivalue::Future: I'll define (protected) virtual methods that are left unimplemented in ivalue::Future and only do something when overridden by the FutureNCCL subclass. Admittedly that's not exactly the same, as those hooks are not part of the public interface.

I recognize that these solutions are not the nicest ones, but the hook one was the safest one I could find (minimum code duplication and protection from later updates to ivalue::Future). I'm not as attached to the DataPtrExtractor though, and I'd be happy to hear alternative proposals.

I've also only just realized that DataPtrExtractor will run into another issue once we support multi-GPU (in #48500), since it will then be used in two places (by the "parent" future, inside then, and by the "child" future, inside markCompleted). We'll therefore probably need the parent future to propagate its DataPtrExtractor to the child future, so that if the child future completes immediately (before it's wrapped in a PythonFutureWrapper) it already has the right DataPtrExtractor. This will be a bit tricky to get right, especially if multiple threads are at play and we need to protect against race conditions.
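To make the pattern under discussion concrete, here is a simplified sketch (not the actual class definitions) of a base future with a no-op virtual setDataPtrExtractor and a CUDA-aware subclass that actually stores it; the mutex hints at the race conditions mentioned above.

```cpp
#include <functional>
#include <mutex>
#include <vector>

#include <ATen/ATen.h>
#include <ATen/core/ivalue.h>

struct FutureSketch {
  using DataPtrExtractor =
      std::function<std::vector<std::reference_wrapper<const at::DataPtr>>(
          const at::IValue&)>;

  // No-op in the CPU-only base: there is nothing to record streams for,
  // so the extractor is simply ignored ("setDataPtrExtractorIfNeeded").
  virtual void setDataPtrExtractor(DataPtrExtractor /*unused*/) {}
  virtual ~FutureSketch() = default;
};

struct CudaAwareFutureSketch : FutureSketch {
  void setDataPtrExtractor(DataPtrExtractor data_ptr_extractor) override {
    // The extractor may be set while markCompleted()/then() run on other
    // threads, hence the lock; a child future created by then() would also
    // need the parent's extractor propagated to it.
    std::lock_guard<std::mutex> lock(mutex_);
    dataPtrExtractor_ = std::move(data_ptr_extractor);
  }

 private:
  std::mutex mutex_;
  DataPtrExtractor dataPtrExtractor_;
};
```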

Contributor:

> And thus we'll probably need the parent future to propagate its DataPtrExtractor to the child future, so that if the child future completes immediately (before it's wrapped in a PythonFutureWrapper) it already has the right DataPtrExtractor.

I assume this means the child Future created by .then() would always be the same type (CPU/CUDA) as the parent Future?

Contributor Author (lw):

> I assume this means the child Future created by .then() would always be the same type (CPU/CUDA) as the parent Future?

That's indeed the case. I didn't give it much thought; do you think it could present a problem? Since the CUDAFuture is a "generalization" of ivalue::Future (and, in fact, it behaves exactly the same when the vector of CUDAEvents is empty), it should be perfectly fine to attach a CPU-only callback to a CUDAFuture. Issues would start to arise if one wants to attach a CUDA callback to a CPU-only ivalue::Future. I'm not sure how we would tackle that...

Contributor:

> Issues would start to arise if one wants to attach a CUDA callback to a CPU-only ivalue::Future.

The current version LGTM. If users hit this, we can fix it later.

} else {
  tensor = value_.toTensor();
}
for (const at::DataPtr& data_ptr : extractDataPtrs(value_)) {
  c10::cuda::CUDACachingAllocator::recordStream(
Contributor:

Curious about the implications for RPC use cases. Does RPC also need to call recordStream? If yes, when? Is it when the tensors are retrieved from the Future (through result or wait) that we should call recordStream on the current stream?

Contributor Author (lw):

I'm still figuring out the correct usage of the caching allocator, but I think this should work the same way for RPC. The model I have in mind for RPC is the one we discussed in #44084 (comment). In that case, on the receiver (bottom right quadrant of the diagram), I think we need to record streams with the caching allocator whenever we "transfer" the result to streams other than the ones we used to receive it. This would happen both when using .wait()/.value() and in callbacks (basically the points in the diagram where we say "record events"). And this is exactly what we're doing here. Does this make sense?
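A minimal sketch of the pattern being described, assuming an extractDataPtrs callable like the ones discussed in this PR: whenever the value is handed over to the consumer's current streams, record those streams with the caching allocator so the underlying blocks are not reclaimed and reused too early.

```cpp
#include <functional>
#include <vector>

#include <ATen/core/ivalue.h>
#include <c10/cuda/CUDACachingAllocator.h>
#include <c10/cuda/CUDAStream.h>

using DataPtrExtractor =
    std::function<std::vector<std::reference_wrapper<const at::DataPtr>>(
        const at::IValue&)>;

// Call this at the points where the result crosses onto new streams,
// e.g. in wait()/value() and right before running a callback.
void recordValueWithCurrentStreams(
    const at::IValue& value,
    const DataPtrExtractor& extractDataPtrs) {
  for (const at::DataPtr& data_ptr : extractDataPtrs(value)) {
    c10::cuda::CUDACachingAllocator::recordStream(
        data_ptr,
        c10::cuda::getCurrentCUDAStream(data_ptr.device().index()));
  }
}
```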

Contributor:

Yep, makes sense to me.

mrshenli (Contributor) left a comment:

LGTM!

dr-ci bot commented Nov 27, 2020

💊 CI failures summary and remediations

As of commit 9b59aa6 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



lw (Contributor Author) commented Nov 27, 2020

I updated this: since the custom Python-aware DataPtr extractor was a stateless lambda, I made it a static method of the PythonFutureWrapper class.
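A rough sketch of that wiring (names are illustrative, not the exact PyTorch ones): the extractor is a static method on the wrapper, so it converts to the std::function-based DataPtrExtractor without capturing the wrapper itself.

```cpp
#include <functional>
#include <vector>

#include <ATen/core/ivalue.h>
#include <c10/util/intrusive_ptr.h>

struct PythonFutureWrapperSketch {
  explicit PythonFutureWrapperSketch(c10::intrusive_ptr<c10::ivalue::Future> fut)
      : fut_(std::move(fut)) {
    // A static method, like a stateless lambda, carries no state and no
    // reference to `this`, so registering it introduces no lifetime issues.
    fut_->setDataPtrExtractor(&PythonFutureWrapperSketch::dataPtrExtractor);
  }

  static std::vector<std::reference_wrapper<const at::DataPtr>> dataPtrExtractor(
      const at::IValue& value) {
    // Stub: the real body would be GIL-aware, as sketched earlier in the
    // description (acquire the GIL, unpack the PyObject, collect DataPtrs).
    (void)value;
    return {};
  }

  c10::intrusive_ptr<c10::ivalue::Future> fut_;
};
```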

lw mentioned this pull request Nov 29, 2020
facebook-github-bot (Contributor) commented:

This pull request has been merged in b7f5aa9.

Labels: cla signed, Merged, oncall: distributed, oncall: jit