Skip to content

Commit

Permalink
Fix FutureNCCL's completed() disagreeing with wait() (#48503)
Browse files Browse the repository at this point in the history
Summary:
Pull Request resolved: #48503

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

My impression is that one property of the upstream Future class is that once .wait() returns, or once a callback is invoked, then .completed() should return True. This was not the case for FutureNCCL because .wait() would return immediately, and callbacks would be invoked inline, but .completed() could return False if the CUDA async operations hadn't completed yet.

That was odd and confusing. Since there are other ways for users to check the status of CUDA operations (if they really need, and typically I don't think it's so common), perhaps it's best to avoid checking the status of CUDA events in .completed().
ghstack-source-id: 118180028

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25180531

fbshipit-source-id: e1207f6b91f010f278923cc5fec1190d0fcdab30
  • Loading branch information
lw authored and facebook-github-bot committed Dec 10, 2020
1 parent b91b087 commit 003c30b
Showing 1 changed file with 1 addition and 8 deletions.
9 changes: 1 addition & 8 deletions torch/lib/c10d/ProcessGroupNCCL.hpp
Expand Up @@ -339,15 +339,8 @@ class ProcessGroupNCCL : public ProcessGroup {
return fut;
}

// Checks cudaEventQuery with cudaEvents. Returns true if a FutureError was
// recorded or the entire operation is completed on the GPU.
bool completed() const override {
if (error_) {
return true;
}
// Checking the work's corresponding CUDA events' status
auto ret = cudaEventQuery((*cudaEvents_)[0]);
return ret != cudaErrorNotReady || ret == cudaSuccess;
return true;
}

bool hasValue() const override {
Expand Down

0 comments on commit 003c30b

Please sign in to comment.