Skip to content

Commit

Permalink
Update on "[C10D] Fix nccl flightrecorder ignored dump timeout"
Browse files Browse the repository at this point in the history
Don't call future.get() unless it's ready, because it waits.
Also, refactor the code a bit for simplicity.

We should do a follow-on PR to clean up the timeouts further, but this
should fix the glaring timeout bug.

cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l yf225

[ghstack-poisoned]
  • Loading branch information
wconstab committed Jan 23, 2024
1 parent 8200a6c commit 449e97f
Showing 1 changed file with 5 additions and 8 deletions.
13 changes: 5 additions & 8 deletions torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1013,12 +1013,10 @@ void ProcessGroupNCCL::waitForDumpOrTimeout(

auto futStatus = fut.wait_for(std::chrono::seconds(timeout_sec));
TORCH_CHECK(
futStatus != std::future_status::deferred,
"Expected the dump future to have been launched eagerly.");
futStatus != std::future_status::deferred, "Expected eager launch.");
if (futStatus == std::future_status::ready) {
// Calling .get() will raise any exception stored in the promise associated
// with the future. (but we can ignore the return value, which will be false
// if dumping is not enabled)
// Calling .get() will re-raise any exception from the future, and we don't
// care about the retval
try {
fut.get();
std::this_thread::sleep_until(wakeUpTime);
Expand All @@ -1036,10 +1034,9 @@ void ProcessGroupNCCL::waitForDumpOrTimeout(
<< " This may be due to slow ADDR2LINE performance processing stacktraces."
<< " Try TORCH_DISABLE_ADDR2LINE=1 and TORCH_NCCL_TRACE_CPP_STACK=0 to work around.";
}
// Ensure we sleep at least until wakeUpTime regardless of whether the future
// ran quickly or took the full timeout.
// Ensure we sleep at least until wakeUpTime regardless of future execution
// time
std::this_thread::sleep_until(wakeUpTime);
return;
}

void ProcessGroupNCCL::abortCommsFromMap(
Expand Down

0 comments on commit 449e97f

Please sign in to comment.