Avoid using FutureNCCL before it's ready #48561
Changes from all commits: a7e8ed6, b6af445, a50d4fc, 34e597b, 6244665
```diff
@@ -462,7 +462,7 @@ ProcessGroupNCCL::ProcessGroupNCCL(
   if (blockingWait_ && asyncErrorHandling_) {
     LOG(INFO) << "[Rank " << rank_
               << "] NCCL_BLOCKING_WAIT and NCCL_ASYNC_ERROR_HANDLING "
-              << "should not both be enabled. "
+              << "should not both be enabled. "
               << "Only NCCL_BLOCKING_WAIT is being used in this process.";
     asyncErrorHandling_ = false;
   }
```
```diff
@@ -1073,15 +1073,13 @@ c10::intrusive_ptr<ProcessGroup::Work> ProcessGroupNCCL::collective(

   if (work->recordFunctionEndCallback_) {
     // recordFunctionEndCallback_ is normally called in fininsh() function by
-    // base class, but since finish is not called by WorkNCCL, we schedule this
-    // function to be run when work is done.
+    // base class, but since finish is not called by WorkNCCL, we run this
+    // function now.
     // Note when can_profile is false, profilingTitle is not provided and so,
     // recordFunctionEndCallback_ is not set.
-    work->getFuture()->addCallback(std::move(work->recordFunctionEndCallback_));
+    work->recordFunctionEndCallback_();
   }

   at::cuda::OptionalCUDAGuard gpuGuard;

   pre(ncclStreams_[key]);
```

**Reviewer:** Does this mean the existing code for launching the cb was wrong?

**lw (author):** The code of FutureNCCL was correct (because, for NCCL, it's "correct" to always invoke callbacks inline). The mistake was in ProcessGroupNCCL, which was constructing a FutureNCCL with "incomplete" arguments, but then using the FutureNCCL before making those arguments complete. (These "arguments" are the CUDA events.) Once the FutureNCCL was returned by ProcessGroupNCCL, its arguments were complete, so the users of ProcessGroupNCCL couldn't hit this issue.

**lw (author):** I still believe that the high-level behavior here is wrong, since it caused recordFunctionEndCallback_ to be invoked before the NCCL function was called. But this commit does nothing more than make that mistake explicit. I think it should be fixed in a separate PR by someone who knows what's going on.
**Reviewer:** I agree with your analysis that this code was initially incorrect. However, the ideal functionality (which #48196 tries to fix) is indeed to run recordFunctionEndCallback_ after this NCCL collective has been completed. I tried to make this the case by moving this addCallback to after where the work is launched, although that still didn't seem to make it work, since the callback is CPU-only when the profiler is enabled without use_cuda=True. Do you have any suggestions on how to ensure that a CPU callback can run after CUDA operations such as NCCL collectives are guaranteed to have been completed?
**mrshenli:** Does this mean that ideally we need to record the TLS, including use_cuda=True, when calling addCallback (if it can behave like a normal future.addCallback)? And because FutureNCCL always runs the cb inline, we don't even have the opportunity to check the state for use_cuda? Question: for c10d ops, when/where does profiler initialization/enabling happen?
**lw (author):** I don't know how the profiler is designed to work with CUDA, but it does sound odd to me to try to measure GPU timing using CPU code. I thought the proper tools for the job were CUDA events, created with the EnableTiming flag, and the cudaEventElapsedTime function. If that can work, then such events can be created and enqueued even from an inline CPU callback, and CUDA should still be able to correctly collect timing for them.
**Reviewer:** @lw Right, to profile GPU code the user should invoke the profiler with use_cuda=True. When this happens, this code path is run, which indeed uses CUDA events to do the timing: https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/profiler_cuda.cpp#L35

@mrshenli, I think with NCCL we don't have to propagate any TLS state, since the RecordFunction start and end happen in the same thread synchronously. Even so, with the current way we fork the ProfilerState, the use_cuda flag is carried along, so events on another thread would be profiled with CUDA.
**Reviewer:** @mrshenli For profiler init, that would be the user invoking the Python profiler with c10d ops wrapped inside. For RecordFunction initialization, basically RecordFunction::start, it happens here: https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/ProcessGroup.cpp#L60