Have FutureNCCL record streams w/ allocator in addCallback (#48496)
Summary:
Pull Request resolved: #48496

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

There are two ways to add a callback to a Future: `then` and `addCallback` (with the former deferring to the latter). FutureNCCL only "patched" `then`, which caused `addCallback` to be unsupported. By patching `addCallback`, on the other hand, we cover both.
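The relationship described above can be illustrated with a minimal sketch. It is not the real `c10::ivalue::Future` or `FutureNCCL` code: `ToyFuture`, `ToyStreamFuture`, `log`, and `runBothPaths` are hypothetical names, the toy future is assumed to be already completed, and a log entry stands in for the CUDA stream setup. It only shows why overriding `addCallback` also covers `then` when the latter defers to the former.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a Future base class (not the real PyTorch type).
// For simplicity the toy future is treated as already completed.
class ToyFuture {
 public:
  virtual ~ToyFuture() = default;

  // Low-level entry point: runs the callback immediately.
  virtual void addCallback(std::function<void(void)> callback) {
    callback();
  }

  // `then` defers to `addCallback`, so any override of `addCallback`
  // is picked up here through virtual dispatch.
  std::shared_ptr<ToyFuture> then(std::function<int(void)> callback) {
    auto child = std::make_shared<ToyFuture>();
    addCallback([child, callback]() { child->value = callback(); });
    return child;
  }

  int value = 0;
};

// Hypothetical subclass that needs extra setup before any callback runs.
class ToyStreamFuture : public ToyFuture {
 public:
  void addCallback(std::function<void(void)> callback) override {
    // Placeholder for stream-specific work (recording streams, blocking
    // on events, installing a stream guard) in a CUDA-aware future.
    log.push_back("stream setup");
    callback();
  }

  std::vector<std::string> log;
};

// Exercises both registration paths and returns how many times the
// overridden addCallback performed its setup.
std::size_t runBothPaths() {
  ToyStreamFuture fut;
  fut.addCallback([]() {});                    // direct path
  auto child = fut.then([]() { return 42; });  // path through `then`
  assert(child->value == 42);
  return fut.log.size();
}
```

In this sketch `runBothPaths()` performs the setup twice, once per path, even though only `addCallback` was overridden. Had the subclass customized only `then`, the direct `addCallback` call would have skipped the setup.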

The high-level goal of this change though is to remove all CUDA-specific stuff from `then`, and move it to either `markCompleted` or to a wrapper around the callback. This will take a few more steps to achieve.
ghstack-source-id: 118180031

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25177558

fbshipit-source-id: ee0ad24eb2e56494c353db700319858ef9dcf32b
lw authored and facebook-github-bot committed Dec 10, 2020
1 parent 868a1a4 commit e4267eb
Showing 1 changed file with 8 additions and 8 deletions.
16 changes: 8 additions & 8 deletions torch/lib/c10d/ProcessGroupNCCL.hpp
@@ -307,7 +307,15 @@ class ProcessGroupNCCL : public ProcessGroup {
// this callback. This new FutureNCCL's cudaEvents will record the
// callback's stream and will have the result value of the callback.
void addCallback(std::function<void(void)> callback) override {
// Do not free the underlying data storage of value_ before its
// usage on futureNCCLCallbackStream_ finish.
for (const at::DataPtr& data_ptr : extractDataPtrs(value_)) {
c10::cuda::CUDACachingAllocator::recordStream(
data_ptr, *futureNCCLCallbackStream_);
}

(*cudaEvents_)[0].block(*futureNCCLCallbackStream_);
// Use the dedicated callback stream to run callback.
c10::OptionalStreamGuard streamGuard{
c10::Stream(*futureNCCLCallbackStream_)};
callback();
@@ -335,14 +343,6 @@ class ProcessGroupNCCL : public ProcessGroup {
// Therefore we propagate our extractor.
fut->setDataPtrExtractor(dataPtrExtractor_);

// Do not free the underlying data storage of value_ before its
// usage on futureNCCLCallbackStream_ finish.
for (const at::DataPtr& data_ptr : extractDataPtrs(value_)) {
c10::cuda::CUDACachingAllocator::recordStream(
data_ptr, *futureNCCLCallbackStream_);
}

// Use the dedicated callback stream to run callback.
// Cannot move capture std::function in lambda, because it cannot deduce
// the template type for std::function. Hence use std::bind to explicitly
// specify types.