
Avoid using FutureNCCL before it's ready #48561

Closed · wants to merge 5 commits

Conversation

@lw (Contributor) commented Nov 29, 2020

Stack from ghstack:

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).


WorkNCCL allows extracting a FutureNCCL through getFuture(). There is one instance of this method being called by ProcessGroupNCCL itself, in order to attach a callback to it. This happened _before_ the work was actually launched; however, FutureNCCL _always_ invokes its callbacks immediately, inline. The events that the FutureNCCL was using hadn't been recorded yet, so blocking on them was a no-op. Moreover, the function being called was installed by the generic ProcessGroup superclass, which is not CUDA-aware, and thus probably made no use of the CUDA events or streams.

recordFunctionEndCallback_ = at::wrapPropagateTLSState(end_handler);

In short: I believe that creating a FutureNCCL and attaching a callback was equivalent to just invoking that function directly, without anything CUDA-specific. I'm thus converting the code to do just that, in order to simplify it.

Note that, given the comment, I don't think this was the original intention of that code. It seems that the function was intended to be run once the work finished. However, I am not familiar with this code, and I don't want to introduce any functional changes.

Differential Revision: D25210337
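The claimed equivalence can be illustrated with a minimal toy model (hypothetical types, not the real c10d API): a future that invokes its callbacks inline produces exactly the same sequence of operations as calling the function directly.

```cpp
#include <functional>
#include <string>
#include <vector>

// Toy stand-in for FutureNCCL: callbacks are invoked inline, at the
// moment they are attached, regardless of whether any work has run.
struct InlineFuture {
  void addCallback(std::function<void()> cb) { cb(); }
};

// Attaching the callback before launch...
inline std::vector<std::string> traceViaFuture() {
  std::vector<std::string> trace;
  InlineFuture fut;
  fut.addCallback([&] { trace.push_back("end_handler"); });
  trace.push_back("launch_work");
  return trace;
}

// ...yields the same trace as invoking the function directly.
inline std::vector<std::string> traceDirectCall() {
  std::vector<std::string> trace;
  trace.push_back("end_handler");  // direct invocation
  trace.push_back("launch_work");
  return trace;
}
```

In both cases the handler runs before the work is launched, which is exactly the simplification this PR makes explicit.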

@codecov bot commented Nov 29, 2020

Codecov Report

Merging #48561 (34e597b) into gh/lw/97/base (09b974c) will increase coverage by 0.01%.
The diff coverage is n/a.

@@                Coverage Diff                @@
##           gh/lw/97/base   #48561      +/-   ##
=================================================
+ Coverage          80.74%   80.75%   +0.01%     
=================================================
  Files               1872     1867       -5     
  Lines             201866   201619     -247     
=================================================
- Hits              162992   162814     -178     
+ Misses             38874    38805      -69     

// Note when can_profile is false, profilingTitle is not provided and so,
// recordFunctionEndCallback_ is not set.
- work->getFuture()->addCallback(std::move(work->recordFunctionEndCallback_));
+ work->recordFunctionEndCallback_();
Contributor:

> The events that the FutureNCCL was using hadn't been recorded yet, thus blocking on them was a no-op.

Does this mean the existing code for launching the cb was wrong?

Contributor Author (@lw):

The code of FutureNCCL was correct (because, for NCCL, it's "correct" to always invoke callbacks inline).

The mistake was in ProcessGroupNCCL, which was constructing a FutureNCCL with "incomplete" arguments, but then using the FutureNCCL before making those arguments complete. (These "arguments" are the CUDA events).

Once FutureNCCL was returned by ProcessGroupNCCL, its arguments were complete, so the users of ProcessGroupNCCL couldn't hit this issue.
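A minimal sketch of that ordering hazard (toy names, not the real implementation): if a future's completion event hasn't been recorded yet, waiting on it returns immediately, so the callback fires before the work is launched.

```cpp
#include <functional>

// Toy model: the "event" is a flag; a real FutureNCCL holds CUDA events
// that are recorded on the stream once the collective is enqueued.
struct ToyEvent {
  bool recorded = false;
  // Blocking on an event that was never recorded is a no-op.
  void wait() const { /* returns immediately when !recorded */ }
};

struct ToyFuture {
  ToyEvent* event;
  void addCallback(std::function<void()> cb) {
    event->wait();  // no-op: the event has not been recorded yet
    cb();           // so the callback runs immediately, inline
  }
};

inline bool callbackRunsBeforeWork() {
  ToyEvent ev;            // not yet recorded: work not launched
  ToyFuture fut{&ev};
  bool ran = false;
  fut.addCallback([&] { ran = true; });
  ev.recorded = true;     // only now is the work "launched"
  return ran;             // true: the callback already fired
}
```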

Contributor Author (@lw):

I still believe that the high-level behavior here is wrong, since it caused recordFunctionEndCallback_ to be invoked before the NCCL function was called. But this commit does nothing more than make that mistake explicit. I think it should be fixed in a separate PR by someone who knows what's going on.

Member:

Yes, this is a bug and the reason why we disabled some of the failing tests as part of #48129. Attempting to fix/debug this as part of #48664, but in the meantime this PR should be fine as-is.

@mrshenli (Contributor) left a comment:

This LGTM as it does not change the existing behavior.

cc @SciPioneer (for FutureNCCL) @rohan-varma (for profiler) please comment if there are concerns. The question I have is that does this mean it does not correctly profile the execution time of NCCL c10d operations?

// Note when can_profile is false, profilingTitle is not provided and so,
// recordFunctionEndCallback_ is not set.
- work->getFuture()->addCallback(std::move(work->recordFunctionEndCallback_));
+ work->recordFunctionEndCallback_();
Member:

I agree with your analysis that this code was initially incorrect.

However, the ideal functionality (trying to fix this in #48196) is indeed to run recordFunctionEndCallback_ after this NCCL collective has completed. I tried to make this the case by moving this addCallback to after where the work is launched, although that still didn't seem to work, since the callback is CPU-only when the profiler is enabled without use_cuda=True. Do you have any suggestions on how to ensure that a CPU callback runs only after CUDA operations such as NCCL collectives are guaranteed to have completed?
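One generic pattern, sketched here with standard C++ threading rather than CUDA (in a real stream-based setup the completion signal would be a CUDA event, waited on by a watcher thread, or a host function enqueued via cudaLaunchHostFunc), is to hand the CPU callback to a thread that first blocks on a completion signal:

```cpp
#include <functional>
#include <future>
#include <thread>

// Sketch: run a CPU callback only after asynchronous work has signaled
// completion. With NCCL, "completion" would be a CUDA event recorded
// after the collective; here it is modeled by a std::promise.
inline std::thread runAfterCompletion(std::shared_future<void> done,
                                      std::function<void()> cb) {
  return std::thread([done, cb] {
    done.wait();  // block until the async work signals completion
    cb();         // now the CPU-side callback is safe to run
  });
}

inline bool demo() {
  std::promise<void> workDone;
  bool callbackRan = false;
  auto t = runAfterCompletion(workDone.get_future().share(),
                              [&] { callbackRan = true; });
  bool ranTooEarly = callbackRan;  // still false: work not finished
  workDone.set_value();            // the "collective" completes
  t.join();
  return !ranTooEarly && callbackRan;
}
```

The design point is that the blocking happens on a dedicated thread, so the caller's thread never stalls, which mirrors what a CUDA-event watcher thread would do.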

Contributor:

> since the callback is CPU only when the profiler is enabled without use_cuda=True.

Does this mean that ideally we need to record the TLS, including use_cuda=True, when calling addCallback (if it can behave like a normal future.addCallback)? And because FutureNCCL always runs the callback inline, we don't even have the opportunity to check the state for use_cuda?

Question, for c10d ops, when/where does profiler initialization/enabling happen?

Contributor Author (@lw) commented Dec 2, 2020:

I don't know how the profiler is designed to work with CUDA, but it does sound odd to me to try to measure GPU timing using CPU code. I thought the proper tool for the job was CUDA events, created with the EnableTiming flag, together with the cudaEventElapsedTime function. If that can work, then such events can be created and enqueued even from an inline CPU callback, and CUDA should still be able to correctly collect timing for them.

Member:

@lw Right, to profile GPU code the user should invoke the profiler with use_cuda=True.

When this happens, this code path is run, which indeed uses CUDA events to do the timing:
https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/profiler_cuda.cpp#L35

@mrshenli, I think with NCCL we don't have to propagate any TLS state, since the RecordFunction start and end happen synchronously in the same thread. Even so, with the current way we fork the ProfilerState, the use_cuda flag is carried along, so events on another thread would be profiled with CUDA.

Member:

@mrshenli For profiler init, that would be by the user invoking python profiler with c10d ops wrapped inside. For RecordFunction initialization, basically RecordFunction::start, it happens here: https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/ProcessGroup.cpp#L60

@facebook-github-bot (Contributor):
This pull request has been merged in 7f7f0fa.

Labels: cla signed · Merged · oncall: distributed

4 participants