ProcessGroupNCCL error/timeout handling #17882

@pietern

As of NCCL 2.4, there are functions to detect asynchronous errors (such as I/O errors) on a communicator and to abort running kernels. ProcessGroupNCCL needs these to implement timeouts and to force workers to raise an error or terminate when other workers fail.

See ncclCommAbort and ncclCommGetAsyncError in https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/api/comms.html, and https://devblogs.nvidia.com/massively-scale-deep-learning-training-nccl-2-4/ for an example.
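
A minimal sketch of what the polling loop could look like, not the actual ProcessGroupNCCL implementation. To keep it compilable without CUDA or NCCL, the three operations are passed in as callbacks. `CommHooks`, `WaitResult`, and `waitWithTimeout` are hypothetical names introduced here for illustration. In a real backend the hooks would presumably wrap `cudaStreamQuery`, `ncclCommGetAsyncError`, and `ncclCommAbort`, as noted in the comments.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Outcome of waiting on one collective operation (hypothetical type).
enum class WaitResult { Completed, CommError, TimedOut };

// Hypothetical hooks. Assumed wiring in a real backend:
//   isDone:        cudaStreamQuery(stream) == cudaSuccess
//   hasAsyncError: ncclCommGetAsyncError(comm, &err) succeeded and err != ncclSuccess
//   abort:         ncclCommAbort(comm)
struct CommHooks {
  std::function<bool()> isDone;
  std::function<bool()> hasAsyncError;
  std::function<void()> abort;
};

// Poll until the work completes, the communicator reports an asynchronous
// error, or the deadline passes. In the two failure cases the communicator
// is aborted so that a blocked kernel does not hang the worker forever.
inline WaitResult waitWithTimeout(
    const CommHooks& hooks,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds pollInterval = std::chrono::milliseconds(1)) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (true) {
    if (hooks.isDone()) {
      return WaitResult::Completed;
    }
    if (hooks.hasAsyncError()) {
      hooks.abort();
      return WaitResult::CommError;
    }
    if (std::chrono::steady_clock::now() >= deadline) {
      hooks.abort();
      return WaitResult::TimedOut;
    }
    // Yield the CPU between polls instead of spinning.
    std::this_thread::sleep_for(pollInterval);
  }
}
```

The caller can then turn `CommError` or `TimedOut` into an exception or terminate the process, which is the behavior this issue asks for when other workers fail.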

Metadata

Labels

oncall: distributed, triaged