Avoid using FutureNCCL before it's ready #48561
Conversation
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks this transition up into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it changes).

WorkNCCL allows callers to extract a FutureNCCL through getFuture(). There is one instance of this method being called by ProcessGroupNCCL itself, in order to attach a callback to it. This happened _before_ the work was actually launched; however, FutureNCCL _always_ invokes its callbacks immediately inline. The events that the FutureNCCL was using hadn't been recorded yet, so blocking on them was a no-op. Moreover, the function being called was installed by the generic ProcessGroup superclass, which is not CUDA-aware and thus probably made no use of the CUDA events or streams.

https://github.com/pytorch/pytorch/blob/383abf1f0c1f74e0f471d47e505895d1b0e6bb20/torch/lib/c10d/ProcessGroup.cpp#L66

In short: I believe that creating a FutureNCCL and attaching a callback was equivalent to just invoking that function directly, without anything CUDA-specific. I'm thus converting the code to do just that, in order to simplify it.

Note that, given the comment, I don't think this was the original intention of that code. It seems the function was intended to run once the work finished. However, I am not familiar with this code, and I don't want to introduce any functional changes.

Differential Revision: [D25210337](https://our.internmc.facebook.com/intern/diff/D25210337/)

[ghstack-poisoned]
Codecov Report

@@            Coverage Diff             @@
##   gh/lw/97/base   #48561      +/-   ##
=========================================
+ Coverage      80.74%   80.75%   +0.01%
=========================================
  Files           1872     1867       -5
  Lines         201866   201619     -247
=========================================
- Hits          162992   162814     -178
+ Misses         38874    38805      -69
  // Note when can_profile is false, profilingTitle is not provided and so,
  // recordFunctionEndCallback_ is not set.
- work->getFuture()->addCallback(std::move(work->recordFunctionEndCallback_));
+ work->recordFunctionEndCallback_();
The events that the FutureNCCL was using hadn't been recorded yet, thus blocking on them was a no-op.
Does this mean the existing code for launching the cb was wrong?
The code of FutureNCCL was correct (because, for NCCL, it's "correct" to always invoke callbacks inline).
The mistake was in ProcessGroupNCCL, which was constructing a FutureNCCL with "incomplete" arguments, but then using the FutureNCCL before making those arguments complete. (These "arguments" are the CUDA events).
Once FutureNCCL was returned by ProcessGroupNCCL, its arguments were complete, so the users of ProcessGroupNCCL couldn't hit this issue.
I still believe that the high-level behavior here is wrong, since it caused recordFunctionEndCallback_ to be invoked before the NCCL function was called. But this commit does nothing more than make that mistake explicit. I think it should be fixed in a separate PR by someone who knows what's going on.
This LGTM as it does not change the existing behavior.
cc @SciPioneer (for FutureNCCL) @rohan-varma (for profiler), please comment if there are concerns. The question I have: does this mean it does not correctly profile the execution time of NCCL c10d operations?
  // Note when can_profile is false, profilingTitle is not provided and so,
  // recordFunctionEndCallback_ is not set.
- work->getFuture()->addCallback(std::move(work->recordFunctionEndCallback_));
+ work->recordFunctionEndCallback_();
I agree with your analysis that this code was initially incorrect.
However, the ideal behavior (which #48196 tries to fix) is indeed to run recordFunctionEndCallback_ after the NCCL collective has completed. I tried to achieve that by moving this addCallback to after the point where the work is launched, but that still didn't seem to work, since the callback is CPU-only when the profiler is enabled without use_cuda=True. Do you have any suggestions on how to ensure that a CPU callback runs only after CUDA operations such as NCCL collectives are guaranteed to have completed?
"since the callback is CPU only when the profiler is enabled without use_cuda=True"

Does this mean that, ideally, we need to record the TLS (including use_cuda=True) when calling addCallback (if it can behave like a normal future.addCallback)? And because FutureNCCL always runs callbacks inline, we don't even have the opportunity to check the state for use_cuda?

Question: for c10d ops, when/where does profiler initialization/enabling happen?
I don't know how the profiler is designed to work with CUDA, but it does sound odd to me to try to measure GPU timing using CPU code. I thought the proper tool for the job was CUDA events, created with the EnableTiming flag, together with the cudaEventElapsedTime function. If that can work, then such events can be created and enqueued even from an inline CPU callback, and CUDA should still be able to correctly collect timing for them.
@lw Right, to profile GPU code the user should invoke the profiler with use_cuda=True.
When this happens, this code path is run, which indeed uses CUDA events to do the timing:
https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/profiler_cuda.cpp#L35
@mrshenli, I think with NCCL we don't have to propagate any TLS state, since the RecordFunction start and end happen synchronously in the same thread. Even so, with the current way we fork the ProfilerState, the use_cuda flag is carried along, so events on another thread would be profiled with CUDA.
@mrshenli For profiler init, that happens when the user invokes the Python profiler with c10d ops wrapped inside. For RecordFunction initialization (basically RecordFunction::start), it happens here: https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/ProcessGroup.cpp#L60
This pull request has been merged in 7f7f0fa.