Skip to content

Commit

Permalink
Fix send()/recv() to adhere to timeout (#109611)
Browse files Browse the repository at this point in the history
Summary: Point to point ops don't enqueue their work to the `workMetaList_` which means that the NCCL watchdog does not watch over them, hence they do not respect the collective timeouts.

Test Plan:
While trying to add a test I found we dont have tests which validate the nccl watch dog. It looks like this is because we dont have a good way to detect when nccl watchdog has thrown an error (exception is thrown in a side thread) in our testing framework / `MultiprocessTestCase`

I manually tested this change with the script in #109401, but need to look more closely at how to automate a test for NCCL watchdog

Differential Revision: D49418976

Pull Request resolved: #109611
Approved by: https://github.com/wconstab
  • Loading branch information
H-Huang authored and pytorchmergebot committed Oct 3, 2023
1 parent a0bffe7 commit efb73fe
Showing 1 changed file with 10 additions and 0 deletions.
10 changes: 10 additions & 0 deletions torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1876,6 +1876,9 @@ c10::intrusive_ptr<Work> ProcessGroupNCCL::pointToPoint(
"TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point "
"collectives.");

// Bump sequence number, updated in collective() as well
seq_++;

const auto devices = getDeviceList(tensors);
std::string key;
int p2pRank = 0, p2pTargetRank = 0;
Expand Down Expand Up @@ -2021,6 +2024,13 @@ c10::intrusive_ptr<Work> ProcessGroupNCCL::pointToPoint(
/*uses_future=*/false);
}

// Enqueue P2P op so that it can be cancelled by NCCL watchdog
c10::cuda::CaptureStatus capture_status =
c10::cuda::currentStreamCaptureStatusMayInitCtx();
if (!coalescing_state_ && capture_status == c10::cuda::CaptureStatus::None) {
workEnqueue(work);
}

return work;
}

Expand Down

0 comments on commit efb73fe

Please sign in to comment.