
Add multi-GPU support to FutureNCCL #48500

Closed · wants to merge 6 commits

Conversation

@lw (Contributor) commented Nov 26, 2020

Stack from ghstack:

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).


After the previous changes, this is now much simpler than it sounds. For the most part it just consists of repeating some operations multiple times, once per device (e.g., recording and blocking on events). Funnily enough, we already had a vector of events, even though we only ever stored one element in it (this probably comes from the fact that it is shared with WorkNCCL, which can hold more than one event). Here, we now also store a vector of device indices.
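For illustration, here is a minimal sketch (not the PR's exact code) of what "once per device" means for the event handling; the free-function shape and the `deviceIndices`/`cudaEvents` parameters are assumptions made for the example:

```cpp
#include <ATen/cuda/CUDAEvent.h>
#include <c10/cuda/CUDAGuard.h>
#include <c10/cuda/CUDAStream.h>
#include <vector>

// Record one event per device, each on that device's current stream.
void recordEvents(
    const std::vector<c10::DeviceIndex>& deviceIndices,
    std::vector<at::cuda::CUDAEvent>& cudaEvents) {
  for (size_t i = 0; i < deviceIndices.size(); i++) {
    c10::cuda::CUDAGuard deviceGuard(deviceIndices[i]);
    cudaEvents[i].record(c10::cuda::getCurrentCUDAStream(deviceIndices[i]));
  }
}

// Before consuming the result, make each device's current stream wait on the
// event that was recorded for that device.
void blockOnEvents(
    const std::vector<c10::DeviceIndex>& deviceIndices,
    std::vector<at::cuda::CUDAEvent>& cudaEvents) {
  for (size_t i = 0; i < deviceIndices.size(); i++) {
    cudaEvents[i].block(c10::cuda::getCurrentCUDAStream(deviceIndices[i]));
  }
}
```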

Perhaps the only non-trivial part of this is that now, for "follow-up" Futures (the ones used for callbacks), we can't know in advance which device the result will be on, so we must determine it dynamically when we receive the result, by inspecting it. That's also easier than it sounds, because we already have a dataptr extractor.
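A rough sketch of what that dynamic inspection could look like; the standalone `devicesOf` helper and its `std::function` signature are assumptions for illustration, not the PR's actual interface:

```cpp
#include <ATen/core/ivalue.h>
#include <algorithm>
#include <functional>
#include <vector>

// Collect the (deduplicated) CUDA device indices of all storages reachable
// from the value, using whatever DataPtr extractor the future holds.
std::vector<c10::DeviceIndex> devicesOf(
    const at::IValue& value,
    const std::function<std::vector<std::reference_wrapper<const at::DataPtr>>(
        const at::IValue&)>& dataPtrExtractor) {
  std::vector<c10::DeviceIndex> devices;
  for (const at::DataPtr& dataPtr : dataPtrExtractor(value)) {
    if (dataPtr.device().is_cuda()) {
      devices.push_back(dataPtr.device().index());
    }
  }
  std::sort(devices.begin(), devices.end());
  devices.erase(std::unique(devices.begin(), devices.end()), devices.end());
  return devices;
}
```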

Differential Revision: D25177556

@dr-ci bot commented Nov 26, 2020

💊 CI failures summary and remediations

As of commit 0928ffa (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



@mrshenli (Contributor) left a comment

LGTM!

@@ -197,9 +198,7 @@ class ProcessGroupNCCL : public ProcessGroup {
//
// If created by WorkNCCL's getFuture API, FutureNCCL has a reference to
// WorkNCCL's cudaEvents, NCCL collective's outputs, and the device index of
Contributor

index -> indices?

Contributor Author

Thanks, fixed (it'll be visible once I re-export it)

I was thinking of doing a complete pass over comments/docstrings/... at the very end, once we have split out CUDAFuture, to document the reasons behind our design choices, the subtleties around streams usage, and such.

// outputs' device. Its value is NCCL collective's
// outputs. FutureNCCL only supports single-process single-device mode where
// the size of outputs is equal to 1.
// outputs' device. Its value is NCCL collective's outputs.
Contributor

device -> devices?

Contributor Author

Thanks, fixed (it'll be visible once I re-export it)

// value, because the user's callback could use those other devices.
std::vector<at::cuda::CUDAStream> streams;
for (c10::DeviceIndex idx = 0; idx < c10::cuda::device_count(); idx++) {
// FIXME Should we find a way to allow to change the priority of
Contributor

when do we need high-priority streams?

Contributor Author

I actually have no idea. Do you know when high- vs low-priority streams were used in ProcessGroupNCCL? What was the reason? Does it still apply here?

@mrshenli (Contributor) commented Dec 2, 2020

I have never seen a user trying to configure stream priority for NCCL ops, and I don't think it's possible with the init_process_group API. Users would probably have to use the undocumented ProcessGroupNCCL ctor API.

Besides, I am also not sure how much impact stream priority has on scheduling, and how visible that is in end-to-end perf.

If there is no specific use case, we can probably keep it simple for now? It would be interesting to run some experiments later to quantify its impact on perf.

@mrshenli (Contributor) left a comment

Test failures look relevant:

https://app.circleci.com/pipelines/github/pytorch/pytorch/242887/workflows/0ca6759b-dd80-4abd-b627-849f3b436c05/jobs/9162893/steps

Nov 26 23:08:31 ======================================================================
Nov 26 23:08:31 FAIL [3.027s]: test_all_gather_multigpu (__main__.TestDistBackendWithSpawn)
Nov 26 23:08:31 ----------------------------------------------------------------------
Nov 26 23:08:31 Traceback (most recent call last):
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 278, in wrapper
Nov 26 23:08:31     self._join_processes(fn)
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 395, in _join_processes
Nov 26 23:08:31     self._check_return_codes(elapsed_time)
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 450, in _check_return_codes
Nov 26 23:08:31     msg="Expected zero exit code but got {}".format(first_process.exitcode)
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1149, in assertEqual
Nov 26 23:08:31     self.assertTrue(result, msg=msg)
Nov 26 23:08:31 AssertionError: False is not true : Expected zero exit code but got -11
Nov 26 23:08:31 
Nov 26 23:08:31 ======================================================================
Nov 26 23:08:31 FAIL [2.827s]: test_all_gather_multigpu_complex (__main__.TestDistBackendWithSpawn)
Nov 26 23:08:31 ----------------------------------------------------------------------
Nov 26 23:08:31 Traceback (most recent call last):
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 278, in wrapper
Nov 26 23:08:31     self._join_processes(fn)
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 395, in _join_processes
Nov 26 23:08:31     self._check_return_codes(elapsed_time)
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 450, in _check_return_codes
Nov 26 23:08:31     msg="Expected zero exit code but got {}".format(first_process.exitcode)
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1149, in assertEqual
Nov 26 23:08:31     self.assertTrue(result, msg=msg)
Nov 26 23:08:31 AssertionError: False is not true : Expected zero exit code but got -11
Nov 26 23:08:31 
Nov 26 23:08:31 ======================================================================
Nov 26 23:08:31 FAIL [2.926s]: test_all_reduce_sum_cuda (__main__.TestDistBackendWithSpawn)
Nov 26 23:08:31 ----------------------------------------------------------------------
Nov 26 23:08:31 Traceback (most recent call last):
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 278, in wrapper
Nov 26 23:08:31     self._join_processes(fn)
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 395, in _join_processes
Nov 26 23:08:31     self._check_return_codes(elapsed_time)
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 450, in _check_return_codes
Nov 26 23:08:31     msg="Expected zero exit code but got {}".format(first_process.exitcode)
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1149, in assertEqual
Nov 26 23:08:31     self.assertTrue(result, msg=msg)
Nov 26 23:08:31 AssertionError: False is not true : Expected zero exit code but got -11
Nov 26 23:08:31 
Nov 26 23:08:31 ======================================================================
Nov 26 23:08:31 FAIL [2.927s]: test_all_reduce_sum_cuda_async (__main__.TestDistBackendWithSpawn)
Nov 26 23:08:31 ----------------------------------------------------------------------
Nov 26 23:08:31 Traceback (most recent call last):
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 278, in wrapper
Nov 26 23:08:31     self._join_processes(fn)
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 395, in _join_processes
Nov 26 23:08:31     self._check_return_codes(elapsed_time)
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 450, in _check_return_codes
Nov 26 23:08:31     msg="Expected zero exit code but got {}".format(first_process.exitcode)
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1149, in assertEqual
Nov 26 23:08:31     self.assertTrue(result, msg=msg)
Nov 26 23:08:31 AssertionError: False is not true : Expected zero exit code but got -11
Nov 26 23:08:31 
Nov 26 23:08:31 ======================================================================
Nov 26 23:08:31 FAIL [2.927s]: test_all_reduce_sum_cuda_complex (__main__.TestDistBackendWithSpawn)
Nov 26 23:08:31 ----------------------------------------------------------------------
Nov 26 23:08:31 Traceback (most recent call last):
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 278, in wrapper
Nov 26 23:08:31     self._join_processes(fn)
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 395, in _join_processes
Nov 26 23:08:31     self._check_return_codes(elapsed_time)
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 450, in _check_return_codes
Nov 26 23:08:31     msg="Expected zero exit code but got {}".format(first_process.exitcode)
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1149, in assertEqual
Nov 26 23:08:31     self.assertTrue(result, msg=msg)
Nov 26 23:08:31 AssertionError: False is not true : Expected zero exit code but got -11
Nov 26 23:08:31 
Nov 26 23:08:31 ======================================================================
Nov 26 23:08:31 FAIL [2.927s]: test_broadcast_cuda (__main__.TestDistBackendWithSpawn)
Nov 26 23:08:31 ----------------------------------------------------------------------
Nov 26 23:08:31 Traceback (most recent call last):
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 278, in wrapper
Nov 26 23:08:31     self._join_processes(fn)
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 395, in _join_processes
Nov 26 23:08:31     self._check_return_codes(elapsed_time)
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 450, in _check_return_codes
Nov 26 23:08:31     msg="Expected zero exit code but got {}".format(first_process.exitcode)
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1149, in assertEqual
Nov 26 23:08:31     self.assertTrue(result, msg=msg)
Nov 26 23:08:31 AssertionError: False is not true : Expected zero exit code but got -11
Nov 26 23:08:31 
Nov 26 23:08:31 ======================================================================
Nov 26 23:08:31 FAIL [2.932s]: test_nccl_high_priority_stream (__main__.TestDistBackendWithSpawn)
Nov 26 23:08:31 ----------------------------------------------------------------------
Nov 26 23:08:31 Traceback (most recent call last):
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 278, in wrapper
Nov 26 23:08:31     self._join_processes(fn)
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 395, in _join_processes
Nov 26 23:08:31     self._check_return_codes(elapsed_time)
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 450, in _check_return_codes
Nov 26 23:08:31     msg="Expected zero exit code but got {}".format(first_process.exitcode)
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1149, in assertEqual
Nov 26 23:08:31     self.assertTrue(result, msg=msg)
Nov 26 23:08:31 AssertionError: False is not true : Expected zero exit code but got -11
Nov 26 23:08:31 
Nov 26 23:08:31 ======================================================================
Nov 26 23:08:31 FAIL [2.927s]: test_reduce_multigpu (__main__.TestDistBackendWithSpawn)
Nov 26 23:08:31 ----------------------------------------------------------------------
Nov 26 23:08:31 Traceback (most recent call last):
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 278, in wrapper
Nov 26 23:08:31     self._join_processes(fn)
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 395, in _join_processes
Nov 26 23:08:31     self._check_return_codes(elapsed_time)
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 450, in _check_return_codes
Nov 26 23:08:31     msg="Expected zero exit code but got {}".format(first_process.exitcode)
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1149, in assertEqual
Nov 26 23:08:31     self.assertTrue(result, msg=msg)
Nov 26 23:08:31 AssertionError: False is not true : Expected zero exit code but got -11
Nov 26 23:08:31 
Nov 26 23:08:31 ======================================================================
Nov 26 23:08:31 FAIL [2.927s]: test_reduce_sum_cuda (__main__.TestDistBackendWithSpawn)
Nov 26 23:08:31 ----------------------------------------------------------------------
Nov 26 23:08:31 Traceback (most recent call last):
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 278, in wrapper
Nov 26 23:08:31     self._join_processes(fn)
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 395, in _join_processes
Nov 26 23:08:31     self._check_return_codes(elapsed_time)
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 450, in _check_return_codes
Nov 26 23:08:31     msg="Expected zero exit code but got {}".format(first_process.exitcode)
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1149, in assertEqual
Nov 26 23:08:31     self.assertTrue(result, msg=msg)
Nov 26 23:08:31 AssertionError: False is not true : Expected zero exit code but got -11
Nov 26 23:08:31 
Nov 26 23:08:31 ======================================================================
Nov 26 23:08:31 FAIL [2.928s]: test_reduce_sum_cuda_twice (__main__.TestDistBackendWithSpawn)
Nov 26 23:08:31 ----------------------------------------------------------------------
Nov 26 23:08:31 Traceback (most recent call last):
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 278, in wrapper
Nov 26 23:08:31     self._join_processes(fn)
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 395, in _join_processes
Nov 26 23:08:31     self._check_return_codes(elapsed_time)
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 450, in _check_return_codes
Nov 26 23:08:31     msg="Expected zero exit code but got {}".format(first_process.exitcode)
Nov 26 23:08:31   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1149, in assertEqual
Nov 26 23:08:31     self.assertTrue(result, msg=msg)
Nov 26 23:08:31 AssertionError: False is not true : Expected zero exit code but got -11
Nov 26 23:08:31 
Nov 26 23:08:31 ----------------------------------------------------------------------

Comment on lines 327 to 367
auto fut = c10::make_intrusive<FutureNCCL>();
// The new future needs the DataPtr extractor when it gets marked complete
// but this might happen immediately inline or in parallel by another
// thread. In both these cases this would/might happen before the user has
// time to set their own DataPtr extractor, which might lead to failures
// if the default extractor can't handle some of the user's types.
// Therefore we propagate our extractor.
Contributor Author

I realized that since we now use the DataPtr extractor also in markCompleted, we need to do this ugly thing. I suspect that this was the cause of the CI failures (we'll see shortly). I welcome ideas on how to do this better...

@lw requested a review from apaszke as a code owner on November 28, 2020 11:35

@lw (Contributor Author) commented Nov 28, 2020

It turned out the segfault came from the assert inside the constructor accessing the parameters rather than the fields (i.e., the difference was a trailing underscore), which led to dereferencing a null pointer.
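For the record, here is a contrived, self-contained illustration of that class of bug (made-up names, not the PR's actual code):

```cpp
#include <cassert>
#include <memory>
#include <vector>

// The constructor parameter is `events`; the member is `events_` (they differ
// only by a trailing underscore). The member-initializer list runs before the
// body and moves the parameter into the member, so inside the body `events`
// is a null shared_ptr and dereferencing it crashes.
struct Example {
  explicit Example(std::shared_ptr<std::vector<int>> events)
      : events_(std::move(events)) {
    // Buggy pattern: dereferencing the moved-from parameter instead of the
    // field, e.g.:
    //   assert(!events->empty());
    // Intended check, against the field that actually owns the data:
    assert(events_ && !events_->empty());
  }

  std::shared_ptr<std::vector<int>> events_;
};

int main() {
  Example ex(std::make_shared<std::vector<int>>(std::vector<int>{1, 2, 3}));
  return 0;
}
```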

@@ -224,6 +224,7 @@ struct VISIBILITY_HIDDEN PythonFutureWrapper
// vector that has exactly one tensor.
static std::vector<std::reference_wrapper<const at::DataPtr>> dataPtrExtractor(
const at::IValue& value) {
TORCH_INTERNAL_ASSERT(PyGILState_Check() == 0);
Contributor Author

I also added this because I was seeing some deadlocks (before I fixed the segfault) which seemed unrelated, and my suspicion was that they came from a reentrant GIL acquisition, possibly due to the DataPtr extractor now being used in more places. However, at the moment the CI doesn't seem to trigger either the deadlock or this check.

@lw mentioned this pull request Nov 29, 2020

@lw (Contributor Author) commented Nov 29, 2020

The deadlocks persist, so, since I couldn't repro them on my local machine, I logged into the CircleCI executor and ran one test with CUDA_LAUNCH_BLOCKING=1. Here are the stack traces: https://gist.github.com/lw/64c4e677476461887c479c6fcbcf0d00. My current best guess for what is happening is:

  • We have two processes on the same machine, which has two devices. Both processes see both devices but, as far as NCCL is concerned, each of them only uses one device.
  • Each process, early on, sets up the device it's responsible for (e.g., creates input tensors on them).
  • Then each process does two things in a row: it attaches a callback to a future, and then launches the NCCL kernel. The callback is run inline, and when that happens we try to acquire the current stream for all devices. Since the "other" device of each process was until then uninitialized, this causes a CUDA context initialization and the creation of a stream.
  • So here's the situation: process 0 is attempting to initialize a CUDA context on device 1 and create a stream there, but this must wait for the NCCL kernel that was launched on device 1 by process 1, but that kernel won't terminate until process 0 has launched its own kernel, but that can't happen because process 0 is stuck.

For some more context on why it's happening:

  • Some CUDA operations are "kernel fences" (a term I made up, which means that any kernel launched after those ops won't run until all kernels launched before are done, regardless of what streams are used). This is a known limitation, documented here (and, in fact, I think this is one of the main reasons that the CUDA caching allocator exists). That documentation only refers to what happens inside a single process, and it doesn't mention context initialization and/or stream creation as such "kernel fences", but I suspect that they may actually be. (I'd love some confirmation on this).
  • The NCCL kernels are also weird, because the two kernels that are launched by the two processes on the two devices won't complete until both of them have started. In short, they form a barrier.

Note that I don't think the order of operations here is critically relevant. I suspect that even if we invoked the callback after we performed the NCCL operation a similar issue could occur.

I have a few ideas on how we could tackle this:

  • When running callbacks, we only set current streams for devices that are actually used by the value held in the future (the other devices are left untouched/unset). This means we wouldn't initialize any context or create any stream (sketched below). On the other hand, if the user then uses some other device in their callback, that's undefined behavior.
  • If we had a way to know which devices have already been initialized, then we could set the current stream for those, but not for any other ones. This is quite similar to the above solution, but a bit more "general".
  • We could keep setting the current stream for all devices but set CUDA_VISIBLE_DEVICES in our tests so that each process only sees the device it's responsible for, and doesn't try to initialize the other one. And then we'd require users to do the same in their code to avoid issues.
  • We could pre-emptively have all processes initialize all devices (for example when ProcessGroupNCCL starts) so that this won't interfere with later operations.

I don't really have a preference among these options yet; I don't like any of them. I'll have to think a bit more about it.

The "core" design choice we need to make, I think, is whether we want to initialize all devices in all processes. When a process creates a CUDA context on a device this will consume memory on that device, which is undesirable if users don't intend to use the device. So, to rephrase the issue: can we assume that if users didn't explicitly prevent us from using some devices (by using CUDA_VISIBLE_DEVICES) then they are fine with us touching all of them?

@mrshenli (Contributor) commented Dec 2, 2020

> This means we wouldn't initialize any context or create any stream. On the other hand, if the user then uses some other device in their callback, that's undefined behavior.

This actually looks fine to me if we are only talking about Python callbacks. IIUC, if users touch a different device in the callback, the Python program will initialize the context first. And if they also want to use the result tensor on that new device, I would assume they need a to/copy operation, which will synchronize streams on both src and dst. Did I miss any situation where this might go wrong?

@mrshenli (Contributor) left a comment

LGTM!

@@ -212,28 +211,37 @@ class ProcessGroupNCCL : public ProcessGroup {
public:
explicit FutureNCCL(
at::IValue value,
c10::DeviceIndex deviceIndex,
std::vector<c10::DeviceIndex> deviceIndices,
Contributor

do we expect devices to be distinct?

Contributor Author

We do, although I think it should be fine if there are duplicates. When the CUDAFuture itself determines the set of devices (inside markCompleted) it explicitly deduplicates them. Let me add a check to the other constructor (the one invoked by ProcessGroupNCCL) to ensure they are distinct.
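Something along these lines, presumably (just a sketch, not the actual check that will land):

```cpp
#include <c10/core/Device.h>
#include <c10/util/Exception.h>
#include <algorithm>
#include <vector>

// Assert that the device indices passed to the ProcessGroupNCCL-facing
// constructor are distinct: sort a copy, then look for adjacent duplicates.
void checkDistinct(std::vector<c10::DeviceIndex> deviceIndices) {
  std::sort(deviceIndices.begin(), deviceIndices.end());
  TORCH_INTERNAL_ASSERT(
      std::adjacent_find(deviceIndices.begin(), deviceIndices.end()) ==
          deviceIndices.end(),
      "Expected distinct device indices");
}
```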

@lw (Contributor Author) commented Dec 3, 2020

> Did I miss any situation where this might go wrong?

One issue I was concerned about was when the user's callback returned a value that contained a tensor in another device. Such a value will be stored in the child Future (the one returned by then()), which will need to record an event for each device that is used by the new value, and will do so in the current stream. Suppose the parent Future has a value that resides on device 0, and thus when invoking the callback it only gets a stream from the pool for device 0, and doesn't touch the other devices. If device 1 is not initialized yet, its current stream at this point will be the default stream (i.e., nullptr). The user callback can very well use device 1 (and thus initialize it), but the user's current stream for device 1 will be reset when the callback exits, meaning the child Future can't access that stream to record an event on it. If the user properly synchronizes their own stream with the default stream then everything should still work. However the concern is that using the default stream has poor performance, since it cannot run in parallel with any other stream AFAIK.
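To spell out the mechanism in question, here is a sketch (assumed names, not the exact code) of the child Future recording one event per device used by the value, on whatever stream is current for that device at that moment; this is exactly where a device whose current stream is still the default stream becomes a problem:

```cpp
#include <ATen/cuda/CUDAEvent.h>
#include <c10/cuda/CUDAStream.h>
#include <vector>

// When the child future is marked complete, record an event for each device
// the new value uses, on that device's *current* stream. If, from the
// future's point of view, that stream is still the default stream, that is
// what gets recorded and later synchronized on.
std::vector<at::cuda::CUDAEvent> recordEventsForValueDevices(
    const std::vector<c10::DeviceIndex>& valueDevices) {
  std::vector<at::cuda::CUDAEvent> events;
  events.reserve(valueDevices.size());
  for (const c10::DeviceIndex idx : valueDevices) {
    at::cuda::CUDAEvent event;
    event.record(c10::cuda::getCurrentCUDAStream(idx));
    events.push_back(std::move(event));
  }
  return events;
}
```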

Another concern is that we could run into the same deadlock issue with NCCL that we experienced at some point in the iterations of this PR. Especially with multiple threads, it could happen that one user callback initializes device 1 while another thread is performing an NCCL call, and if these two operations race in the wrong way they could get stuck. However this issue might be minor, as from what I could observe it doesn't seem to affect newer devices, and moreover the user can easily circumvent it by explicitly initializing the devices before starting any NCCL operations.

@facebook-github-bot
This pull request has been merged in e294c2d.

Labels: cla signed, Merged, oncall: distributed