Closed
Labels
oncall: distributed, triaged
Description
As of NCCL 2.4, there are functions to detect I/O errors and abort running kernels. These functions are required to implement timeouts and to force workers to raise an error or terminate when other workers fail.
See ncclCommAbort and ncclCommGetAsyncError in https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/api/comms.html, and https://devblogs.nvidia.com/massively-scale-deep-learning-training-nccl-2-4/ for an example.
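For illustration, here is a minimal sketch of the poll-then-abort pattern that these two functions make possible. It does not call NCCL. The callables `is_done`, `get_async_error`, and `abort` are hypothetical stand-ins (not part of any existing API) for a stream-completion query, ncclCommGetAsyncError, and ncclCommAbort respectively, and the convention that `0` means "no error" is an assumption of this sketch.

```python
import time
from typing import Callable


def wait_or_abort(
    is_done: Callable[[], bool],
    get_async_error: Callable[[], int],
    abort: Callable[[], None],
    timeout_s: float,
    poll_interval_s: float = 0.001,
) -> None:
    """Poll until the collective finishes; abort on async error or timeout.

    All three callables are hypothetical stand-ins:
      is_done          -> a non-blocking "has the kernel finished?" query
      get_async_error  -> wraps ncclCommGetAsyncError (0 assumed to mean success)
      abort            -> wraps ncclCommAbort
    """
    deadline = time.monotonic() + timeout_s
    while not is_done():
        # Check whether the communicator has reported an asynchronous error.
        err = get_async_error()
        if err != 0:
            abort()
            raise RuntimeError(f"communicator reported async error {err}")
        # Give up if the collective has not completed within the timeout.
        if time.monotonic() >= deadline:
            abort()
            raise TimeoutError(f"collective did not finish within {timeout_s} s")
        time.sleep(poll_interval_s)
```

In this sketch the caller never blocks indefinitely: a worker stuck waiting on a failed peer either sees the asynchronous error or hits the deadline, aborts the communicator, and raises.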