Skip to content

Commit

Permalink
[C10D] Fix nccl flightrecorder ignored dump timeout (#118142)
Browse files Browse the repository at this point in the history
Don't call future.get() unless it's ready, because it waits.
Also, refactor the code a bit for simplicity.

We should do a follow-on PR to clean up the timeouts further, but this
should fix the glaring timeout bug.

Pull Request resolved: #118142
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #118044, #118046, #118047
  • Loading branch information
wconstab authored and pytorchmergebot committed Jan 25, 2024
1 parent 87335fa commit a40951d
Showing 1 changed file with 20 additions and 18 deletions.
38 changes: 20 additions & 18 deletions torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1022,29 +1022,31 @@ void ProcessGroupNCCL::waitForDumpOrTimeout(
TORCH_CHECK(fut.valid(), "Expected a valid future");

auto futStatus = fut.wait_for(std::chrono::seconds(timeout_sec));
if (futStatus != std::future_status::ready) {
TORCH_CHECK(
futStatus != std::future_status::deferred,
"Expected the dump future to have been launched eagerly.");
TORCH_CHECK(
futStatus != std::future_status::deferred, "Expected eager launch.");
if (futStatus == std::future_status::ready) {
// Calling .get() will re-raise any exception from the future, and we don't
// care about the retval
try {
fut.get();
std::this_thread::sleep_until(wakeUpTime);
} catch (const std::exception& e) {
LOG(ERROR) << logPrefix()
<< "Caught exception during async debug dump: \"" << e.what()
<< "\"\n";
} catch (...) {
LOG(ERROR) << logPrefix()
<< "Caught unknown exception during async debug dump.";
}
} else {
LOG(INFO)
<< logPrefix() << "Debug dump timed out and is being abandoned."
<< " This may be due to slow ADDR2LINE performance processing stacktraces."
<< " Try TORCH_DISABLE_ADDR2LINE=1 and TORCH_NCCL_TRACE_CPP_STACK=0 to work around.";
}

// Calling .get() will raise any exception stored in the promise associated
// with the future. (but we can ignore the return value, which will be false
// if dumping is not enabled)
try {
fut.get();
std::this_thread::sleep_until(wakeUpTime);
} catch (const std::exception& e) {
LOG(ERROR) << logPrefix() << "Caught exception during async debug dump: \""
<< e.what() << "\"\n";
} catch (...) {
LOG(ERROR) << logPrefix()
<< "Caught unknown exception during async debug dump.";
}
// Ensure we sleep at least until wakeUpTime regardless of future execution
// time
std::this_thread::sleep_until(wakeUpTime);
}

void ProcessGroupNCCL::abortCommsFromMap(
Expand Down

0 comments on commit a40951d

Please sign in to comment.