Sync collectives refactoring (#2039)
Summary:
Pull Request resolved: #2039

Reland of D57564130

**What changed after the revert**:
Torch Library cannot be used inside Deploy. In comm_ops.py, all operator definitions and autograd registrations are now guarded with `if not torch._running_with_deploy():`.

**Catching deploy compat on diff test/land**: D57773561

**Previous diff Summary:**
The diff refactors torchrec sync collectives and addresses issues with missing wait_tensor() for backward:
- Refactoring using the latest Torch Library custom op API, with PT2 compatibility
- Removing non-native functional collectives calls (c10d_functional), since only the native ops exist in PyTorch now and non-native calls are redispatched to native
- Adding test cases for compiled-with-noncompiled ranks (covering compilation failure on one of the ranks)

Issues fixed:
- Sync collectives eager backward did not produce a gradient -> Fixed
- Support gradient_division in sync collectives and its compilation -> Done
- Test coverage of sync collectives, comparing results against async collectives and under compilation
- Fixed missing wait_tensor(), which previously produced this warning:
```
W0520 07:16:25.135696 2546100 Functional.cpp:51] Warning: At the time of process termination, there are still 1 unwaited c10d_functional collective calls. Please review your program to ensure c10d_functional.wait_tensor() is invoked on all tensors returned from c10d_functional collective ops before they are used. (function ~WorkRegistry)
```

Reviewed By: ezyang

Differential Revision: D57774293

fbshipit-source-id: 76da888f4b6e876aa1ad170857e7db76ac418122
Ivan Kobzarev authored and facebook-github-bot committed May 24, 2024
1 parent f24c8dc commit 8c7fa2f
Showing 2 changed files with 431 additions and 515 deletions.