
Conversation

@lw (Contributor) commented Nov 29, 2020

Stack from ghstack:

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).


In this commit I'm adding a few asserts to the constructors of FutureNCCL to make sure that what's passed in is what we expect (fun fact: until two commits ago that wasn't the case, as we were passed some empty events).

I'm also making the second constructor private, as it's only supposed to be used by the then() method.

Differential Revision: D25210333
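
The following standalone C++ analogue (illustrative names only, not the actual FutureNCCL code) sketches the shape of both changes: eager validation of the events in the public constructor, and an "empty" constructor that only then() can reach.

#include <cassert>
#include <memory>
#include <vector>

class MyFuture {
 public:
  // Public constructor: callers must pass events that are already created
  // and that all live on the expected device (mirroring the new asserts).
  MyFuture(int deviceIndex, std::vector<int> eventDevices)
      : deviceIndex_(deviceIndex), eventDevices_(std::move(eventDevices)) {
    assert(!eventDevices_.empty());  // reject the "empty events" case
    for (int dev : eventDevices_) {
      assert(dev == deviceIndex_);
    }
  }

  // then() is the only code path allowed to create a not-yet-populated
  // future; its value and events get filled in when the callback runs.
  std::shared_ptr<MyFuture> then() {
    return std::shared_ptr<MyFuture>(new MyFuture(deviceIndex_));
  }

 private:
  // Private: reachable only from then().
  explicit MyFuture(int deviceIndex) : deviceIndex_(deviceIndex) {}

  int deviceIndex_;
  std::vector<int> eventDevices_;
};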

@dr-ci bot commented Nov 29, 2020

💊 CI failures summary and remediations

As of commit 203376d (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



TORCH_INTERNAL_ASSERT(event.isCreated());
TORCH_INTERNAL_ASSERT(event.device_index() == deviceIndex_);
}
for (const at::DataPtr& data_ptr : extractDataPtrs(value_)) {
Contributor:
nit: (this should not block this PR, please ignore for now and fix if necessary later) can extractDataPtrs be expensive (e.g., using pickling)? If so, do we need to cache the extracted data ptrs?

Contributor Author (@lw):

My rationale was that the cost of "extracting" the data ptrs should be of the same order as the cost of "iterating" over them after extraction: extraction is linear in the size of the data, so combined with the linear-time iteration over the data ptrs it doesn't change the overall (asymptotic) complexity.

I do realize, though, that this only holds when the data contains only (or mostly) tensors; if the value is a big user class holding a single tensor, it won't. And indeed we've ended up extracting the data ptrs multiple times, so it makes sense to me to cache them. I'll add this as a separate diff.
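
A standalone sketch of the memoization described above (illustrative names, not the actual PyTorch code): extraction runs at most once, on first access, and every later use iterates over the cached result.

#include <optional>
#include <vector>

struct Value {
  std::vector<int> tensors;  // stand-in for the tensors inside an IValue
};

// Stand-in for extractDataPtrs: linear in the size of the value.
std::vector<const int*> extractDataPtrs(const Value& v) {
  std::vector<const int*> out;
  for (const int& t : v.tensors) {
    out.push_back(&t);
  }
  return out;
}

class CachingFuture {
 public:
  explicit CachingFuture(Value value) : value_(std::move(value)) {}

  // The extraction cost is paid once, even if the data ptrs are needed
  // several times later on.
  const std::vector<const int*>& dataPtrs() {
    if (!dataPtrs_) {
      dataPtrs_ = extractDataPtrs(value_);
    }
    return *dataPtrs_;
  }

 private:
  Value value_;
  std::optional<std::vector<const int*>> dataPtrs_;
};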

@@ -231,8 +231,16 @@ class ProcessGroupNCCL : public ProcessGroup {
TORCH_INTERNAL_ASSERT(
cudaEvents_->size() == 1,
"FutureNCCL only supports single-process single-device mode.");
for (const at::cuda::CUDAEvent& event : *cudaEvents_) {
TORCH_INTERNAL_ASSERT(event.isCreated());
TORCH_INTERNAL_ASSERT(event.device_index() == deviceIndex_);
Contributor:

I might be missing something. Why are all events on the same device? For multi-input collective calls, aren't we going to get one event per device?

Contributor Author (@lw):

At this stage there is in fact a single event, which is on deviceIndex_ (see the check just above). I had no real reason to use a for loop here rather than just checking the first (and only) element of cudaEvents_. If you want, I can change this.
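
For reference, the loop-free variant the author describes would be a sketch along these lines (using the names from the diff above; the single-device assert guarantees exactly one element):

// Check only the first (and, by the assert above, only) event.
const at::cuda::CUDAEvent& event = cudaEvents_->front();
TORCH_INTERNAL_ASSERT(event.isCreated());
TORCH_INTERNAL_ASSERT(event.device_index() == deviceIndex_);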

@mrshenli (Contributor) left a review:
LGTM!

@facebook-github-bot (Contributor): This pull request has been merged in 868a1a4.

Labels: cla signed · Merged · oncall: distributed
Projects: None yet
3 participants