Sync collectives refactoring (#2039)
Summary:
Pull Request resolved: #2039

Reland of D57564130.

**What changed after the revert**: Torch Library cannot be used inside torch::deploy. In comm_ops.py, all operator definitions and autograd registrations are now guarded with `not torch._running_with_deploy():`.

**Catching deploy compatibility on diff test/land**: D57773561

**Previous diff summary:** This diff refactors the torchrec sync collectives and addresses issues with a missing wait_tensor() in backward:
- Refactored to the latest Torch Library custom-op API, with PT2 compatibility.
- Removed non-native functional collectives calls (c10d_functional): only native functional collectives exist in PyTorch now, and non-native calls are redispatched to the native ones.
- Added test cases mixing compiled and non-compiled ranks (covering a compilation failure on one of the ranks).

Issues fixed:
- Sync collectives eager backward did not produce a gradient -> fixed.
- Support for gradient_division in sync collectives, and its compilation -> done.
- Added test coverage for sync collectives, comparing results against async collectives and under compilation.
- Fixed the missing wait_tensor(), which triggered this warning:
```
W0520 07:16:25.135696 2546100 Functional.cpp:51] Warning: At the time of process termination, there are still 1 unwaited c10d_functional collective calls. Please review your program to ensure c10d_functional.wait_tensor() is invoked on all tensors returned from c10d_functional collective ops before they are used. (function ~WorkRegistry)
```

Reviewed By: ezyang

Differential Revision: D57774293

fbshipit-source-id: 76da888f4b6e876aa1ad170857e7db76ac418122