Avoid using FutureNCCL before it's ready #48561
Conversation
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks this transition up into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it changes).

WorkNCCL allows callers to extract a FutureNCCL through getFuture(). There is one instance of this method being called by ProcessGroupNCCL itself, in order to attach a callback to it. This happened _before_ the work was actually launched; however, FutureNCCL _always_ invokes its callbacks immediately inline. The events that the FutureNCCL was using hadn't been recorded yet, so blocking on them was a no-op. Moreover, the function being called was installed by the generic ProcessGroup superclass, which is not CUDA-aware and thus probably made no use of the CUDA events or streams.

https://github.com/pytorch/pytorch/blob/383abf1f0c1f74e0f471d47e505895d1b0e6bb20/torch/lib/c10d/ProcessGroup.cpp#L66

In short: I believe that creating a FutureNCCL and attaching a callback was equivalent to just invoking that function directly, without anything CUDA-specific. I'm thus converting the code to do just that, in order to simplify it.

Note that, given the comment, I don't think this was the original intention of that code. It seems the function was intended to run once the work finished. However, I am not familiar with this code, and I don't want to introduce any functional changes.

Differential Revision: [D25210337](https://our.internmc.facebook.com/intern/diff/D25210337/)

[ghstack-poisoned]
Codecov Report

@@            Coverage Diff             @@
##   gh/lw/97/base   #48561      +/-   ##
=========================================
+ Coverage      80.74%   80.75%   +0.01%
=========================================
  Files           1872     1867       -5
  Lines         201866   201619     -247
=========================================
- Hits          162992   162814     -178
+ Misses         38874    38805      -69
  // Note when can_profile is false, profilingTitle is not provided and so,
  // recordFunctionEndCallback_ is not set.
- work->getFuture()->addCallback(std::move(work->recordFunctionEndCallback_));
+ work->recordFunctionEndCallback_();
The events that the FutureNCCL was using hadn't been recorded yet, thus blocking on them was a no-op.
Does this mean the existing code for launching the cb was wrong?
The code of FutureNCCL was correct (because, for NCCL, it's "correct" to always invoke callbacks inline).
The mistake was in ProcessGroupNCCL, which was constructing a FutureNCCL with "incomplete" arguments, but then using the FutureNCCL before making those arguments complete. (These "arguments" are the CUDA events).
Once FutureNCCL was returned by ProcessGroupNCCL, its arguments were complete, so the users of ProcessGroupNCCL couldn't hit this issue.
I still believe that the high-level behavior here is wrong, since it caused recordFunctionEndCallback_ to be invoked before the NCCL function was called. But this commit does nothing more than make that mistake explicit. I think it should be fixed in a separate PR by someone who knows what's going on.
This LGTM as it does not change the existing behavior.
cc @SciPioneer (for FutureNCCL) @rohan-varma (for profiler), please comment if there are concerns. The question I have: does this mean it does not correctly profile the execution time of NCCL c10d operations?
  // Note when can_profile is false, profilingTitle is not provided and so,
  // recordFunctionEndCallback_ is not set.
- work->getFuture()->addCallback(std::move(work->recordFunctionEndCallback_));
+ work->recordFunctionEndCallback_();
I agree with your analysis that this code was initially incorrect.
However, the ideal behavior (which #48196 tries to fix) is indeed to run recordFunctionEndCallback_ after the NCCL collective has completed. I tried to achieve that by moving this addCallback to after the point where the work is launched, but that still didn't seem to work, since the callback is CPU-only when the profiler is enabled without use_cuda=True. Do you have any suggestions on how to ensure that a CPU callback runs only after CUDA operations such as NCCL collectives are guaranteed to have completed?
"since the callback is CPU only when the profiler is enabled without use_cuda=True"

Does this mean that, ideally, we need to record the TLS (including use_cuda=True) when calling addCallback (if it can behave like a normal future.addCallback)? And because FutureNCCL always runs callbacks inline, we don't even have the opportunity to check the state for use_cuda?

Question: for c10d ops, when/where does profiler initialization/enabling happen?
I don't know how the profiler is designed to work with CUDA, but it does sound odd to me to try to measure GPU timing using CPU code. I thought the proper tool for the job was CUDA events, created with the EnableTiming flag, together with the cudaEventElapsedTime function. If that can work, then such events can be created and enqueued even from an inline CPU callback, and CUDA should still be able to correctly collect timing for them.
@lw Right, to profile GPU code the user should invoke the profiler with use_cuda=True.
When this happens, this code path is run, which indeed uses CUDA events to do the timing:
https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/profiler_cuda.cpp#L35
@mrshenli, I think with NCCL we don't have to propagate any TLS state, since the RecordFunction start and end happen synchronously in the same thread. Even so, with the current way we fork the ProfilerState, the use_cuda flag is carried along, so events on another thread would be profiled with CUDA.
@mrshenli For profiler init, that happens when the user invokes the Python profiler with c10d ops wrapped inside. For RecordFunction initialization (basically RecordFunction::start), it happens here: https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/ProcessGroup.cpp#L60
This pull request has been merged in 7f7f0fa.