[NCCL] Add Error log when ProcessGroupNCCL takes down process upon timeout/error

Pull Request resolved: #44988

The new NCCL async error handling feature throws an exception from the
workCleanup thread if one of the NCCL operations encounters an error or times
out. This PR adds an error log to make it clearer to the user why the
training process crashed.
ghstack-source-id: 113876146

Differential Revision: [D23794801](https://our.internmc.facebook.com/intern/diff/D23794801/)
osalpekar committed Oct 8, 2020
1 parent acca11b commit 4422bb2
Showing 1 changed file with 6 additions and 0 deletions.
6 changes: 6 additions & 0 deletions torch/lib/c10d/ProcessGroupNCCL.cpp
@@ -306,6 +306,12 @@ void ProcessGroupNCCL::WorkNCCL::handleNCCLGuard() {
   std::lock_guard<std::mutex> lock(mutex_);
   completed_ = true;
   if (exception_) {
+    auto exceptionMsg = c10::str(
+        "Some NCCL operations have failed or timed out. Due to the ",
+        "asynchronous nature of CUDA kernels, subsequent GPU operations ",
+        "might run on corrupted/incomplete data. To avoid this inconsistency, ",
+        "we are taking the entire process down.");
+    LOG(ERROR) << exceptionMsg;
     std::rethrow_exception(exception_);
   }
 }
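
The log-then-rethrow pattern in this hunk can be sketched outside of PyTorch. The block below is a minimal illustration under stated assumptions, not the ProcessGroupNCCL implementation: the `Work` class, `setException`, `handleGuard`, `isCompleted`, and `kCrashReason` are hypothetical names, and `std::cerr` stands in for `LOG(ERROR)` so that only the standard library is required.

```cpp
#include <exception>
#include <iostream>
#include <mutex>
#include <stdexcept>
#include <string>

// Hypothetical message, modeled on the one added in this commit.
const std::string kCrashReason =
    "Some operations have failed or timed out; taking the process down.";

// Hypothetical stand-in for a unit of asynchronous work. Not the PyTorch class.
class Work {
 public:
  // Records a failure, e.g. one observed by a monitoring thread.
  void setException(std::exception_ptr e) {
    std::lock_guard<std::mutex> lock(mutex_);
    exception_ = e;
  }

  // Marks the work as completed. If a failure was recorded, logs the reason
  // first and then rethrows, so the log line precedes any crash.
  void handleGuard() {
    std::lock_guard<std::mutex> lock(mutex_);
    completed_ = true;
    if (exception_) {
      std::cerr << kCrashReason << std::endl;  // stands in for LOG(ERROR)
      std::rethrow_exception(exception_);
    }
  }

  bool isCompleted() {
    std::lock_guard<std::mutex> lock(mutex_);
    return completed_;
  }

 private:
  std::mutex mutex_;
  bool completed_ = false;
  std::exception_ptr exception_;
};
```

In this sketch a caller that wraps `handleGuard()` in `try`/`catch` can recover. If the call instead runs on a `std::thread` and the exception escapes the thread's entry function, `std::terminate` is invoked and the process goes down, which is why logging before the rethrow matters.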