ProcessGroupNCCL error/timeout handling

As of NCCL 2.4, the library provides functions to detect asynchronous I/O errors and to abort running kernels. ProcessGroupNCCL needs these to implement timeouts and to force workers to raise an error or terminate when other workers fail.

See `ncclCommAbort` and `ncclCommGetAsyncError` in https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/api/comms.html, and https://devblogs.nvidia.com/massively-scale-deep-learning-training-nccl-2-4/ for an example.
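
To make the request concrete, here is a rough sketch of the polling loop such a timeout could use. It is not the ProcessGroupNCCL implementation: `wait_with_timeout` and `WaitResult` are hypothetical names, and the three callables are placeholders so the snippet compiles without NCCL or CUDA. In a real build they would presumably wrap a stream or event query, `ncclCommGetAsyncError`, and `ncclCommAbort`.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper, not part of NCCL or PyTorch.
// The callables are stand-ins (assumptions) for the real calls:
//   is_done()    e.g. cudaStreamQuery(stream) == cudaSuccess
//   has_error()  e.g. ncclCommGetAsyncError(comm, &err) reporting err != ncclSuccess
//   abort_comm() e.g. ncclCommAbort(comm)
enum class WaitResult { Completed, AsyncError, TimedOut };

inline WaitResult wait_with_timeout(
    const std::function<bool()>& is_done,
    const std::function<bool()>& has_error,
    const std::function<void()>& abort_comm,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds poll_interval = std::chrono::milliseconds(1)) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (true) {
    // An asynchronous error (for example, a peer went away) means the
    // collective will never finish, so abort instead of waiting.
    if (has_error()) {
      abort_comm();
      return WaitResult::AsyncError;
    }
    if (is_done()) {
      return WaitResult::Completed;
    }
    // No error was reported but the deadline passed: abort so the caller
    // can raise an error rather than hang.
    if (std::chrono::steady_clock::now() >= deadline) {
      abort_comm();
      return WaitResult::TimedOut;
    }
    std::this_thread::sleep_for(poll_interval);
  }
}
```

The caller would turn `AsyncError` and `TimedOut` into an exception or terminate the process, which is the behavior asked for above. Polling from the host is only one option; a dedicated watchdog thread per process group would follow the same pattern.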

ProcessGroupNCCL error/timeout handling #17882
