Commit
Fix send()/recv() to adhere to timeout (#109611)
Summary: Point-to-point ops don't enqueue their work to the `workMetaList_`, which means that the NCCL watchdog does not watch over them, so they do not respect the collective timeouts.

Test Plan: While trying to add a test, I found we don't have tests that validate the NCCL watchdog. It looks like this is because our testing framework / `MultiprocessTestCase` has no good way to detect when the NCCL watchdog has thrown an error (the exception is thrown in a side thread). I manually tested this change with the script in #109401, but need to look more closely at how to automate a test for the NCCL watchdog.

Differential Revision: D49418976

Pull Request resolved: #109611

Approved by: https://github.com/wconstab
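
The behavior described in the Summary can be illustrated with a toy model. This is only a sketch and not the ProcessGroupNCCL code: `ToyProcessGroup`, `Work`, `watchdog_scan`, and the `enqueue` flag are hypothetical names, and it assumes the fix amounts to registering point-to-point work in the watched list, which the commit message implies but does not spell out. Time is passed in explicitly so the example runs deterministically without threads.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Work:
    """Hypothetical stand-in for one pending communication op."""
    name: str
    started_at: float
    timeout: float
    done: bool = False


@dataclass
class ToyProcessGroup:
    # Stands in for `workMetaList_`: the only work the watchdog can see.
    work_list: List[Work] = field(default_factory=list)

    def p2p(self, name: str, now: float, timeout: float, enqueue: bool) -> Work:
        """Start a send/recv-like op; `enqueue` models whether it is registered."""
        work = Work(name, now, timeout)
        if enqueue:
            self.work_list.append(work)
        return work

    def watchdog_scan(self, now: float) -> List[str]:
        """Return names of registered, unfinished ops that exceeded their timeout."""
        return [
            w.name
            for w in self.work_list
            if not w.done and now - w.started_at > w.timeout
        ]


# Unregistered op: it hangs, but the scan never sees it.
unwatched = ToyProcessGroup()
unwatched.p2p("recv", now=0.0, timeout=10.0, enqueue=False)

# Registered op: the same hang is reported once the timeout has passed.
watched = ToyProcessGroup()
watched.p2p("recv", now=0.0, timeout=10.0, enqueue=True)

print(unwatched.watchdog_scan(now=60.0))
print(watched.watchdog_scan(now=60.0))
```

In this model, an op that is never appended to the list can hang indefinitely without being flagged, which mirrors why send()/recv() ignored the collective timeout.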