
Remove NCCL dependency from PythonFutureWrapper #48495

Closed · wants to merge 8 commits
Conversation

lw (Contributor) commented Nov 26, 2020

Stack from ghstack:

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).


PythonFutureWrapper needs to provide a GIL-aware way to extract tensors from an IValue of type PyObject. Since this was only used by FutureNCCL, it was guarded by #ifdef USE_C10D_NCCL. However, we will need to use it with CUDA-aware futures other than the NCCL one. This could have been achieved simply by replacing USE_C10D_NCCL with USE_CUDA, but I wanted to clean this up properly.

We're dealing with two independent dimensions: C++-vs-Python and CPU-vs-CUDA. To make the code more modular, the two dimensions should be dealt with by orthogonal solutions: the user setting a custom callback to handle Python, and the subclass being CUDA-aware. Mixing these two axes makes it more complicated.

Another reason for changing how this works is that later on, when we introduce multi-device support, we will need to extract data pointers for purposes other than just recording streams with the caching allocator, namely to inspect the value and determine which devices it resides on.

Differential Revision: D25177560
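To make the "custom callback to handle Python" idea concrete, here is a minimal sketch of what a Python-aware DataPtr extractor could look like. This is not the actual PyTorch implementation; extractTensorsFromPyObject is a hypothetical placeholder standing in for whatever logic unpacks tensors out of the Python object.

```cpp
#include <functional>
#include <vector>

#include <ATen/ATen.h>
#include <ATen/core/ivalue.h>
#include <pybind11/pybind11.h>

namespace py = pybind11;

// Hypothetical placeholder: real logic would traverse the Python object
// (e.g. via the pickler) and collect any tensors it contains.
std::vector<at::Tensor> extractTensorsFromPyObject(const py::object& /*obj*/) {
  return {};
}

// Sketch of a Python-aware extractor that could be registered on the future.
std::vector<std::reference_wrapper<const at::DataPtr>> pythonAwareDataPtrExtractor(
    const at::IValue& value) {
  std::vector<std::reference_wrapper<const at::DataPtr>> data_ptrs;
  if (value.isPyObject()) {
    // Touching a PyObject requires holding the GIL.
    py::gil_scoped_acquire gil;
    py::object obj = py::reinterpret_borrow<py::object>(value.toPyObject());
    for (const at::Tensor& tensor : extractTensorsFromPyObject(obj)) {
      data_ptrs.emplace_back(tensor.storage().data_ptr());
    }
  } else if (value.isTensor()) {
    // Non-Python values can go through the default extraction path.
    data_ptrs.emplace_back(value.toTensor().storage().data_ptr());
  }
  return data_ptrs;
}
```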

// Expose the default implementation so that external ones can defer to it.
static std::vector<std::reference_wrapper<const at::DataPtr>>
defaultDataPtrExtractor(const at::IValue& value) {
  // FIXME Should we support more types than just tensors and tensor lists?
Contributor:

Yes, if we are going to use this as a general CudaFuture. But it can come in follow-up PRs.

Contributor Author (lw):

Indeed, I'm doing this in #48502
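For context, a minimal sketch of a default extractor limited to tensors and tensor lists, as the FIXME above describes. This is illustrative only; the actual defaultDataPtrExtractor may differ.

```cpp
#include <functional>
#include <vector>

#include <ATen/ATen.h>
#include <ATen/core/ivalue.h>

std::vector<std::reference_wrapper<const at::DataPtr>> defaultDataPtrExtractorSketch(
    const at::IValue& value) {
  std::vector<std::reference_wrapper<const at::DataPtr>> data_ptrs;
  if (value.isTensor()) {
    data_ptrs.emplace_back(value.toTensor().storage().data_ptr());
  } else if (value.isTensorList()) {
    for (const at::Tensor& tensor : value.toTensorVector()) {
      data_ptrs.emplace_back(tensor.storage().data_ptr());
    }
  }
  // Other containers (generic lists, dicts, tuples) are not traversed here;
  // extending this is what the follow-up referenced above (#48502) is about.
  return data_ptrs;
}
```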

using DataPtrExtractor =
    std::function<std::vector<std::reference_wrapper<const at::DataPtr>>(
        const at::IValue&)>;
virtual void setDataPtrExtractor(DataPtrExtractor data_ptr_extractor) {}
Contributor:

I assume this is an intermediate state, as setDataPtrExtractor exists in the base class but dataPtrExtractor_ only lives in subclasses? If this will be the long-term solution, do we need to rename this function? Otherwise setDataPtrExtractor is not doing what the name suggests in non-FutureNCCL classes.

Contributor Author (lw):

To be honest, I was thinking of this as a long-term solution. Well, actually, I didn't think about it too much, because this is basically how it was already done (the setRecordStreamCallback was a no-op in ivalue::Future and was only implemented by FutureNCCL). I was fine with such a solution as I read the semantics of this method basically as "setDataPtrExtractorIfNeeded".

Also, later on I'll do something similar to this in order to merge some FutureNCCL logic into ivalue::Future: I'll define (protected) virtual methods that are left unimplemented in ivalue::Future and only do something when overridden by the FutureNCCL subclass. Admittedly that's not exactly the same, as those hooks are not part of the public interface.

I recognize that these solutions are not the nicest ones, but the hook one was the safest one I could find (minimum code duplication and protection from later updates to ivalue::Future). I'm not as attached to the DataPtrExtractor though, and I'd be happy to hear alternative proposals.

I've also only just realized that DataPtrExtractor will run into another issue once we support multi-GPU (in #48500), since it will then be used in two places (by the "parent" future, inside then, and by the "child" future, inside markCompleted). We'll therefore probably need the parent future to propagate its DataPtrExtractor to the child future, so that if the child future completes immediately (before it's wrapped in a PythonFutureWrapper) it already has the right DataPtrExtractor. This will be a bit tricky to get right, especially if multiple threads are at play and we need to protect against race conditions.
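To make the pattern under discussion concrete, here is a simplified sketch (not the actual class definitions) of a base future with a no-op virtual setDataPtrExtractor and a CUDA-aware subclass that actually stores it; the mutex hints at the race conditions mentioned above.

```cpp
#include <functional>
#include <mutex>
#include <vector>

#include <ATen/ATen.h>
#include <ATen/core/ivalue.h>

struct FutureSketch {
  using DataPtrExtractor =
      std::function<std::vector<std::reference_wrapper<const at::DataPtr>>(
          const at::IValue&)>;

  // No-op in the CPU-only base: there is nothing to record streams for,
  // so the extractor is simply ignored ("setDataPtrExtractorIfNeeded").
  virtual void setDataPtrExtractor(DataPtrExtractor /*unused*/) {}
  virtual ~FutureSketch() = default;
};

struct CudaAwareFutureSketch : FutureSketch {
  void setDataPtrExtractor(DataPtrExtractor data_ptr_extractor) override {
    // The extractor may be set while markCompleted()/then() run on other
    // threads, hence the lock; a child future created by then() would also
    // need the parent's extractor propagated to it.
    std::lock_guard<std::mutex> lock(mutex_);
    dataPtrExtractor_ = std::move(data_ptr_extractor);
  }

 private:
  std::mutex mutex_;
  DataPtrExtractor dataPtrExtractor_;
};
```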

Contributor:

> And thus we'll probably need the parent future to propagate its DataPtrExtractor to the child future, so that if the child future completes immediately (before it's wrapped in a PythonFutureWrapper) it already has the right DataPtrExtractor.

I assume this means the child Future created by .then() would always be the same type (CPU/CUDA) as the parent Future?

Contributor Author (lw):

> I assume this means the child Future created by .then() would always be the same type (CPU/CUDA) as the parent Future?

That's indeed the case. I didn't give it much thought; do you think it could present a problem? Since the CUDAFuture is a "generalization" of ivalue::Future (and, in fact, it behaves exactly the same when the vector of CUDAEvents is empty), it should be perfectly fine to attach a CPU-only callback to a CUDAFuture. Issues would start to arise if one wants to attach a CUDA callback to a CPU-only ivalue::Future. I'm not sure how we would tackle that...

Contributor:

> Issues would start to arise if one wants to attach a CUDA callback to a CPU-only ivalue::Future.

The current version LGTM. If users hit this, we can fix it later.

} else {
  tensor = value_.toTensor();
}
for (const at::DataPtr& data_ptr : extractDataPtrs(value_)) {
  c10::cuda::CUDACachingAllocator::recordStream(
Contributor:

Curious about the implications for RPC use cases. Does RPC also need to call recordStream? If yes, when? Is it when the tensors are retrieved from the Future (through result or wait) that we should call recordStream on the current stream?

Contributor Author (lw):

I'm still figuring out the correct usage of the caching allocator, but I think this should work the same way for RPC. The model I have in mind for RPC is the one we discussed in #44084 (comment). In that case, on the receiver (bottom right quadrant of the diagram), I think we need to record streams with the caching allocator whenever we "transfer" the result to streams other than the ones we used to receive it. This would happen both when using .wait()/.value() and in callbacks (basically the points in the diagram where we say "record events"). And this is exactly what we're doing here. Does this make sense?
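A minimal sketch of the pattern being described, assuming an extractDataPtrs callable like the ones discussed in this PR: whenever the value is handed over to the consumer's current streams, record those streams with the caching allocator so the underlying blocks are not reclaimed and reused too early.

```cpp
#include <functional>
#include <vector>

#include <ATen/core/ivalue.h>
#include <c10/cuda/CUDACachingAllocator.h>
#include <c10/cuda/CUDAStream.h>

using DataPtrExtractor =
    std::function<std::vector<std::reference_wrapper<const at::DataPtr>>(
        const at::IValue&)>;

// Call this at the points where the result crosses onto new streams,
// e.g. in wait()/value() and right before running a callback.
void recordValueWithCurrentStreams(
    const at::IValue& value,
    const DataPtrExtractor& extractDataPtrs) {
  for (const at::DataPtr& data_ptr : extractDataPtrs(value)) {
    c10::cuda::CUDACachingAllocator::recordStream(
        data_ptr,
        c10::cuda::getCurrentCUDAStream(data_ptr.device().index()));
  }
}
```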

Contributor:

Yep, makes sense to me.

mrshenli (Contributor) left a comment:

LGTM!

dr-ci bot commented Nov 27, 2020

💊 CI failures summary and remediations

As of commit 9b59aa6 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



lw (Contributor Author) commented Nov 27, 2020

I updated this: since the custom Python-aware DataPtr extractor was a stateless lambda, I made it a static method of the PythonFutureWrapper class.
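A rough sketch of that wiring (names are illustrative, not the exact PyTorch ones): the extractor is a static method on the wrapper, so it converts to the std::function-based DataPtrExtractor without capturing the wrapper itself.

```cpp
#include <functional>
#include <vector>

#include <ATen/core/ivalue.h>
#include <c10/util/intrusive_ptr.h>

struct PythonFutureWrapperSketch {
  explicit PythonFutureWrapperSketch(c10::intrusive_ptr<c10::ivalue::Future> fut)
      : fut_(std::move(fut)) {
    // A static method, like a stateless lambda, carries no state and no
    // reference to `this`, so registering it introduces no lifetime issues.
    fut_->setDataPtrExtractor(&PythonFutureWrapperSketch::dataPtrExtractor);
  }

  static std::vector<std::reference_wrapper<const at::DataPtr>> dataPtrExtractor(
      const at::IValue& value) {
    // Stub: the real body would be GIL-aware, as sketched earlier in the
    // description (acquire the GIL, unpack the PyObject, collect DataPtrs).
    (void)value;
    return {};
  }

  c10::intrusive_ptr<c10::ivalue::Future> fut_;
};
```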

lw mentioned this pull request Nov 29, 2020
facebook-github-bot (Contributor) commented:

This pull request has been merged in b7f5aa9.

Labels: cla signed, Merged, oncall: distributed, oncall: jit