[NCCL] Additional Lock Optimizations for handleNCCLGuard from Cleanup Loop

Pull Request resolved: #43232

**This Commit:** Here we introduce some experimental lock optimizations: we combine the check for WorkNCCL completion with `handleNCCLGuard` so that the lock is acquired once instead of twice. The performance implications of this are still being measured; the first 5 diffs in this stack work perfectly well without this optimization.

**This Stack:** The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang while waiting on an unresponsive worker. This stack detects such hangs and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.

ghstack-source-id: 110300867
Differential Revision: [D22929042](https://our.internmc.facebook.com/intern/diff/D22929042/)
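To illustrate the optimization, here is a minimal C++ sketch (not the actual PyTorch source; member names like `completed_` and `exception_`, and the combined method name, are illustrative assumptions). It contrasts polling a work object with two separate lock acquisitions against a single call that checks completion and rethrows a stored exception under one acquisition:

```cpp
#include <mutex>
#include <exception>

class WorkNCCLSketch {
 public:
  // Before: the cleanup loop would take the lock once here...
  bool isCompleted() {
    std::lock_guard<std::mutex> lock(mutex_);
    return completed_;
  }

  // ...and a second time here, paying for two acquisitions per poll.
  void handleNCCLGuard() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (exception_) {
      std::rethrow_exception(exception_);
    }
  }

  // After: one call rethrows any stored exception and reports completion
  // under a single lock acquisition, in the spirit of this commit.
  bool isCompletedAndHandleGuard() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (exception_) {
      std::rethrow_exception(exception_);
    }
    return completed_;
  }

 private:
  std::mutex mutex_;
  bool completed_ = false;
  std::exception_ptr exception_;
};
```

Since the cleanup loop polls many outstanding works per iteration, halving the lock acquisitions per work reduces contention on the hot path; whether that shows up as a measurable win is exactly what is still being evaluated here.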