
Merge common parts of FutureNCCL into at::ivalue::Future #48505

Closed
wants to merge 10 commits

Conversation

@lw (Contributor) commented Nov 26, 2020

Stack from ghstack:

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).


FutureNCCL isn't just adding CUDA support to ivalue::Future, it's also reimplementing a lot of the latter's logic (by overriding plenty of its methods). That's brittle: whenever a new method is added to ivalue::Future there's a risk of forgetting to add it to FutureNCCL, in which case calling that method on FutureNCCL would defer to the base class and give inconsistent results (e.g., the future appearing not to be completed when it actually is). This is already happening, for example with waitAndThrow or hasError, which are not implemented by FutureNCCL. In addition, this creates duplication between the two classes, which could lead to inconsistent behavior, bugs, and missing features.

The best solution is to keep the core future logic in ivalue::Future and have only the CUDA additions in FutureNCCL. That's what we're doing, in two steps. In the previous commit, I split the CUDA features into separate hooks, which were called by FutureNCCL's other methods. In this commit, I remove those methods and invoke the hooks directly from ivalue::Future.

Differential Revision: [D25180535](https://our.internmc.facebook.com/intern/diff/D25180535/)
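To illustrate the hook-based design described above, here is a minimal, simplified sketch of how the base class can drive the generic logic and call into virtual hooks that a CUDA-aware subclass overrides. This is not PyTorch's actual code; everything except the hook names (`postMarkCompletedHook`, `wrapCallback`, `postWaitHook`) is a standalone approximation.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <utility>

// Simplified sketch: the generic future logic lives in the base class,
// and device-specific behavior is injected only through the virtual hooks.
struct Future {
  virtual ~Future() = default;

  void markCompleted(/* IValue value */) {
    std::unique_lock<std::mutex> lock(mutex_);
    completed_ = true;
    postMarkCompletedHook(/* value */);  // hook: e.g. record CUDA events
    finished_cv_.notify_all();
  }

  void wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    finished_cv_.wait(lock, [this] { return completed_; });
    lock.unlock();
    postWaitHook();  // hook: e.g. make the caller's streams wait on the events
  }

  void addCallback(std::function<void()> cb) {
    // hook: e.g. wrap the callback so it runs on dedicated streams
    std::function<void()> wrapped = wrapCallback(std::move(cb));
    (void)wrapped;  // ...store or invoke depending on completion state...
  }

 protected:
  // Derived classes (e.g. a CUDA-aware future) override only these.
  virtual void postMarkCompletedHook(/* const IValue& */) {}
  virtual std::function<void()> wrapCallback(std::function<void()> cb) {
    return cb;
  }
  virtual void postWaitHook() {}

 private:
  std::mutex mutex_;
  std::condition_variable finished_cv_;
  bool completed_ = false;
};
```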

Comment on lines -405 to -407
// Cannot move capture std::function in lambda, because it cannot deduce
// the template type for std::function. Hence use std::bind to explicitly
// specify types.
lw (Contributor, Author) commented:

I was curious to see what the reason for this problem was, so I tried to undo this fix... and it seems to work? Maybe the comment was outdated?
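For context, here is a standalone illustration of the two patterns the old comment was contrasting. This is a hedged sketch, not the code from this file; the point is that a C++14 init-capture can move a `std::function` into a lambda, which may be why undoing the old `std::bind` workaround now compiles.

```cpp
#include <functional>
#include <utility>

void example() {
  std::function<void(int)> cb = [](int) { /* ... */ };

  // Move-capturing the std::function in a lambda (C++14 init-capture).
  auto viaLambda = [cb = std::move(cb)](int x) { cb(x); };

  std::function<void(int)> cb2 = [](int) { /* ... */ };
  // The older workaround: std::bind with an explicit placeholder so the
  // callable's argument types are spelled out.
  auto viaBind = std::bind(
      [](const std::function<void(int)>& f, int x) { f(x); },
      std::move(cb2),
      std::placeholders::_1);

  viaLambda(1);
  viaBind(2);
}
```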

@dr-ci
dr-ci bot commented Nov 26, 2020

💊 CI failures summary and remediations

As of commit d667d5e: 💚 Looks good so far! There are no failures yet. 💚

This comment has been revised 25 times.

@mrshenli (Contributor) left a comment:

LGTM. Please check with @wanchaol regarding whether adding cbs to ivalue Future is OK.

Comment on lines 486 to 494
  virtual c10::intrusive_ptr<Future> createInstance(at::TypePtr type) {
    return c10::make_intrusive<Future>(type);
  }

  virtual void postMarkCompletedHook(const at::IValue& value) {}

  virtual std::function<void(void)> wrapCallback(std::function<void(void)> callback) {
    return callback;
  }

  virtual void postWaitHook() {}
Contributor:

Shall we add some comments to these functions and explain what derived classes need to do when implementing them?
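As a rough idea of what such documentation could cover, here is a hedged sketch of how a CUDA-aware subclass might override these hooks. The class name and the hook bodies are hypothetical and only indicative of intent; this is not the actual FutureNCCL implementation, and it assumes the surrounding PyTorch types (`c10::intrusive_ptr`, `at::TypePtr`, `at::IValue`) and a matching constructor.

```cpp
// Hypothetical subclass illustrating the intended division of labor: the
// base class drives completion/wait/callback logic and calls these overrides
// at the right points.
struct CudaAwareFuture : Future {
  // Called by then(): child futures must be of the same derived type so
  // they inherit the CUDA-aware behavior.
  c10::intrusive_ptr<Future> createInstance(at::TypePtr type) override {
    return c10::make_intrusive<CudaAwareFuture>(std::move(type));
  }

  // Called right after the value is set: record CUDA events on the streams
  // that produced the value, so consumers can synchronize with them later.
  void postMarkCompletedHook(const at::IValue& value) override {
    // ...record events for the DataPtrs extracted from `value`...
  }

  // Called when a callback is registered: wrap it so it runs on streams
  // that first wait on the recorded events.
  std::function<void(void)> wrapCallback(std::function<void(void)> callback) override {
    return [this, callback = std::move(callback)]() {
      // ...set up streams, block them on the events, then run the callback...
      callback();
    };
  }

  // Called at the end of wait(): make the caller's current streams wait on
  // the recorded events before user code touches the value.
  void postWaitHook() override {
    // ...block the current streams on the recorded events...
  }
};
```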

if (error_) {
  throw *error_;

void setDataPtrExtractor(DataPtrExtractor data_ptr_extractor) override {
  // To avoid races with other threads that may be using the extractor, we
Contributor:

do we expect this function to be called multiple times? If no, do we need a lock + assert?

lw (Contributor, Author) replied:

Yes, indeed I was being sloppy here, thanks for calling me out. It can be called multiple times: once when the future is constructed (if it's a "child" future created by the then() method), and then one more time by PythonFutureWrapper for each time we call then() on that future (in order to create a "grand-child" future). Now that I think of it, this data race was probably there before, but we should still fix it.

I'm still hoping we can come up with a better idea for this whole DataPtrExtractor, or get rid of it entirely if we can get a JIT helper that can do it for us. If neither of those happens, I'll add a lock here.
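If a lock does end up being added, a minimal sketch of what the setter could look like is below. This is a class-member fragment under stated assumptions: the member and mutex names are hypothetical, and any code reading the extractor would need to take the same mutex.

```cpp
  void setDataPtrExtractor(DataPtrExtractor data_ptr_extractor) override {
    // Guard against concurrent calls (e.g. from then()/PythonFutureWrapper)
    // and against readers using the extractor while it is being replaced.
    std::unique_lock<std::mutex> lock(dataPtrExtractorMutex_);
    dataPtrExtractor_ = std::move(data_ptr_extractor);
  }

  // Hypothetical members (readers of dataPtrExtractor_ must also lock).
  DataPtrExtractor dataPtrExtractor_;
  mutable std::mutex dataPtrExtractorMutex_;
```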

@lw lw requested a review from apaszke as a code owner November 30, 2020 11:36
@@ -125,7 +125,7 @@ def compute_q(fut):
     return [
         dist.all_reduce(q, group=group_to_use, async_op=True)
         .get_future()
-        .value()[0]
+        .wait()[0]
lw (Contributor, Author) commented:

This is needed because, in the PythonFutureWrapper, calling wait() also returns the value, whereas calling value() does not wait. This was already true on the CPU side (i.e., value() fails if the future isn't complete), and now it's also true for the GPU part (we don't sync streams when calling value()).

This was a discrepancy between ivalue::Future and FutureNCCL which I think we should fix.

The value() method is thus only supposed to be used to retrieve the value once we know it has already been correctly waited on/synchronized, for example within a callback.
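In other words, at the C++ ivalue::Future level the intended usage looks roughly like the following. This is a schematic sketch, not code from the PR; it only restates the wait()-before-value() rule described above.

```cpp
void consume(c10::intrusive_ptr<at::ivalue::Future> fut) {
  fut->wait();  // blocks until completion; the CUDA-aware subclass also
                // synchronizes the caller's current streams here
  at::IValue result = fut->value();  // safe only after wait(), since value()
                                     // does not wait (and fails if incomplete)
  (void)result;

  fut->addCallback([fut]() {
    // Inside a callback the future is already completed (and synchronized),
    // so reading value() directly is fine here.
    at::IValue v = fut->value();
    (void)v;
  });
}
```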

@wanchaol (Contributor) left a comment:

looks good to me.

  std::unique_lock<std::mutex> lock(mutex_);
  while (!completed_) {
    finished_cv_.wait(lock);
  }

  if (!eptr_) {
Contributor:

shall we use hasError instead?

lw (Contributor, Author) replied:

I don't think we can use it as-is, since it tries to acquire the mutex, which this method already holds, and that would deadlock. If you think it's worth it, I can add a hasErrorInternal method that doesn't acquire the mutex.
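For reference, a sketch of what such a helper might look like. This is hypothetical, not part of the PR; it assumes the caller of the internal variant already holds `mutex_` and that `mutex_` is declared mutable so it can be locked in const methods.

```cpp
  // Hypothetical helper: same check as hasError(), but assumes the caller
  // already holds mutex_, so it can be used from wait()/markCompleted()
  // without deadlocking.
  bool hasErrorInternal() const {
    return eptr_ ? true : false;
  }

  bool hasError() const {
    std::unique_lock<std::mutex> lock(mutex_);
    return hasErrorInternal();
  }
```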

    return c10::make_intrusive<Future>(type);
  }

  virtual void postMarkCompletedHook(const at::IValue& value) {}
Contributor:

Can you briefly document how these three APIs are used? Probably only FutureNCCL is using them now, but if there are other Future-derived types later, it might be a good reference.

@facebook-github-bot (Contributor) commented:
This pull request has been merged in 4c425e8.

@facebook-github-bot deleted the gh/lw/94/head branch December 13, 2020 15:17
Labels: cla signed, Merged, oncall: distributed

4 participants