
Avoid using FutureNCCL before it's ready #48561

Closed · wants to merge 5 commits
torch/lib/c10d/ProcessGroupNCCL.cpp (10 changes: 4 additions & 6 deletions)

@@ -462,7 +462,7 @@ ProcessGroupNCCL::ProcessGroupNCCL(
  if (blockingWait_ && asyncErrorHandling_) {
    LOG(INFO) << "[Rank " << rank_
              << "] NCCL_BLOCKING_WAIT and NCCL_ASYNC_ERROR_HANDLING "
-             << "should not both be enabled. "
+             << "should not both be enabled. "
              << "Only NCCL_BLOCKING_WAIT is being used in this process.";
    asyncErrorHandling_ = false;
  }
@@ -1073,15 +1073,13 @@ c10::intrusive_ptr<ProcessGroup::Work> ProcessGroupNCCL::collective(

  if (work->recordFunctionEndCallback_) {
    // recordFunctionEndCallback_ is normally called in fininsh() function by
-   // base class, but since finish is not called by WorkNCCL, we schedule this
-   // function to be run when work is done.
+   // base class, but since finish is not called by WorkNCCL, we run this
+   // function now.
    // Note when can_profile is false, profilingTitle is not provided and so,
    // recordFunctionEndCallback_ is not set.
-   work->getFuture()->addCallback(std::move(work->recordFunctionEndCallback_));
+   work->recordFunctionEndCallback_();

Review comments on this change:

Member:
I agree with your analysis that this code was initially incorrect.

However, the ideal functionality (which #48196 tries to implement) is indeed to run recordFunctionEndCallback_ after this NCCL collective has completed. I tried to make that happen by moving this addCallback to after the point where the work is launched, but that still didn't seem to work, since the callback is CPU-only when the profiler is enabled without use_cuda=True. Do you have any suggestions on how to ensure that a CPU callback runs only after CUDA operations such as NCCL collectives are guaranteed to have completed?
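
For reference, a minimal sketch (raw CUDA runtime API only; an illustration, not what this PR or the profiler does) of one standard way to have a host callback run only after all work previously enqueued on a stream has completed, using cudaLaunchHostFunc:

#include <cuda_runtime.h>
#include <cstdio>

// Runs on a CUDA-managed thread once all work enqueued on the stream
// before cudaLaunchHostFunc has completed. Must not call CUDA APIs itself.
void CUDART_CB hostCallback(void* userData) {
  std::printf("collective finished: %s\n", static_cast<const char*>(userData));
}

void enqueueWithHostCallback(cudaStream_t stream) {
  // ... enqueue the GPU work (e.g. an NCCL collective) on `stream` here ...
  static const char tag[] = "allreduce";
  cudaLaunchHostFunc(stream, hostCallback, const_cast<char*>(tag));
}

One caveat relevant to the discussion below: the callback executes on a CUDA-managed thread rather than the enqueuing thread, so any thread-local profiler state would not automatically be visible there.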

Contributor:

Regarding "since the callback is CPU-only when the profiler is enabled without use_cuda=True":

Does this mean that, ideally, we need to record the TLS, including use_cuda=True, when calling addCallback (if it can behave like a normal future.addCallback)? And because FutureNCCL always runs the callback inline, we don't even get the opportunity to check the state for use_cuda?

Question: for c10d ops, when/where does profiler initialization/enabling happen?

Contributor Author (@lw), Dec 2, 2020:
I don't know how the profiler is designed to work with CUDA, but it does sound odd to me to try to measure GPU timing using CPU code. I thought the proper tools for the job were CUDA events, created with the EnableTiming flag, and the cudaEventElapsedTime function. If that can work, then such events can be created and enqueued even from an inline CPU callback, and CUDA should still be able to correctly collect timing for them.
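
As illustration only (not code from the PR or the profiler), a minimal sketch of that event-based timing pattern with the raw CUDA runtime API; timing is enabled by default on events created without the cudaEventDisableTiming flag:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
  cudaEvent_t start, stop;
  // cudaEventCreate enables timing by default; cudaEventDisableTiming would turn it off.
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start, 0);
  // ... enqueue the GPU work to be timed (e.g. a collective) on stream 0 here ...
  cudaEventRecord(stop, 0);

  // The events can be recorded from CPU code (even an inline callback); the
  // elapsed time is read back only after the stop event has completed.
  cudaEventSynchronize(stop);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  std::printf("GPU region took %.3f ms\n", ms);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return 0;
}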

Member:
@lw Right, to profile GPU code the user should invoke the profiler with use_cuda=True.

When this happens, this code path is run, which indeed uses CUDA events to do the timing:
https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/profiler_cuda.cpp#L35

@mrshenli, I think with NCCL we don't have to propagate any TLS state, since the RecordFunction start and end happen synchronously in the same thread. Even so, with the current way we fork the ProfilerState, the use_cuda flag is carried along, so events on another thread would be profiled with CUDA.

Member:
@mrshenli For profiler init, that would be the user invoking the Python profiler with c10d ops wrapped inside. For RecordFunction initialization (basically RecordFunction::start), it happens here: https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/ProcessGroup.cpp#L60

Contributor:
Regarding "The events that the FutureNCCL was using hadn't been recorded yet, thus blocking on them was a no-op":

Does this mean the existing code for launching the callback was wrong?

Contributor Author:
The code of FutureNCCL was correct (because, for NCCL, it's "correct" to always invoke callbacks inline).

The mistake was in ProcessGroupNCCL, which was constructing a FutureNCCL with "incomplete" arguments, but then using the FutureNCCL before making those arguments complete. (These "arguments" are the CUDA events.)

Once FutureNCCL was returned by ProcessGroupNCCL, its arguments were complete, so the users of ProcessGroupNCCL couldn't hit this issue.
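
To make the ordering concrete, here is a hypothetical, heavily simplified illustration of the pattern described above (invented names, not the actual FutureNCCL or ProcessGroupNCCL code): synchronizing on a CUDA event that has never been recorded returns immediately, so a callback added before the collective is enqueued runs right away.

#include <cuda_runtime.h>
#include <functional>
#include <vector>

struct ToyFutureNCCL {
  std::vector<cudaEvent_t> events;  // expected to be recorded once the collective is enqueued
  void addCallback(std::function<void()> cb) {
    for (cudaEvent_t e : events) {
      cudaEventSynchronize(e);  // returns immediately if the event was never recorded
    }
    cb();  // invoked inline, FutureNCCL-style
  }
};

void buggyOrdering(cudaStream_t stream) {
  ToyFutureNCCL fut;
  cudaEvent_t ev;
  cudaEventCreate(&ev);
  fut.events.push_back(ev);

  // Bug: the callback is added (and therefore runs) before the collective is
  // enqueued and before the event is recorded on its stream.
  fut.addCallback([] { /* e.g. recordFunctionEndCallback_ */ });

  // ... enqueue the NCCL collective on `stream` here ...
  cudaEventRecord(ev, stream);  // too late: the callback has already run
}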

Contributor Author:
I still believe that the high-level behavior here is wrong, since it caused recordFunctionEndCallback_ to be invoked before the NCCL function was called. But this commit does nothing more than make that mistake explicit. I think it should be fixed in a separate PR by someone who knows what's going on.

Member:
Yes, this is a bug, and it's the reason we disabled some of the failing tests as part of #48129. I'm attempting to fix/debug it as part of #48664, but in the meantime this PR should be fine as-is.

}



at::cuda::OptionalCUDAGuard gpuGuard;

pre(ncclStreams_[key]);