[Gradient Compression] Surface C++ comm hooks to Python API as built-in comm hooks #47270
…in comm hooks
This is almost the same as #46959, except that in caffe2/torch/nn/parallel/distributed.py, BuiltinCommHookType is imported conditionally, only when dist.is_available(). Otherwise, this Python enum type, defined in caffe2/torch/csrc/distributed/c10d/init.cpp, cannot be imported. This is similar to another enum type, ReduceOp, defined in the same file. See #47153
To review the diff on top of #46959, compare V1 vs Latest.
Main Changes in V1 (#46959):
1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set, a C++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the built-in comm hooks.
Original PR issue: C++ DDP Communication Hook #46348
Differential Revision: [D24700959](https://our.internmc.facebook.com/intern/diff/D24700959/)
[ghstack-poisoned]
Codecov Report
@@ Coverage Diff @@
## gh/SciPioneer/24/base #47270 +/- ##
=========================================================
- Coverage 60.84% 60.83% -0.01%
=========================================================
Files 2749 2749
Lines 254114 254144 +30
=========================================================
+ Hits 154605 154618 +13
- Misses 99509 99526 +17
…in comm hooks
Pull Request resolved: #47270
This is almost the same as #46959, except that in caffe2/torch/nn/parallel/distributed.py, BuiltinCommHookType should be imported conditionally, only when dist.is_available(). Otherwise, this Python enum type, defined in caffe2/torch/csrc/distributed/c10d/init.cpp, cannot be imported. See #47153
I tried to follow the approach used for another enum type, ReduceOp, defined in the same file, but it did not work: the C++ enum class for ReduceOp is defined in the torch/lib/c10d library, whereas BuiltinCommHookType is defined in the torch/csrc/distributed library. These two libraries are compiled in two different ways.
To avoid adding typing to the distributed package, which could be a separate project, I simply removed the BuiltinCommHookType argument type in this file.
To review the diff on top of #46959, compare V1 vs Latest: https://www.internalfb.com/diff/D24700959?src_version_fbid=270445741055617
Main Changes in V1 (#46959):
1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set, a C++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the built-in comm hooks.
Original PR issue: C++ DDP Communication Hook #46348
ghstack-source-id: 115783237
Differential Revision: [D24700959](https://our.internmc.facebook.com/intern/diff/D24700959/)
Just a heads-up, there might be a conflict with #47309. If so, please coordinate landing with @gmagogsfm.
builds look good so far, stamping to unblock. But please wait for all tests.
This pull request has been merged in f91fcef.
Stack from ghstack:
This is almost the same as #46959, except that in caffe2/torch/nn/parallel/distributed.py, BuiltinCommHookType should be imported conditionally, only when dist.is_available(). Otherwise, this Python enum type, defined in caffe2/torch/csrc/distributed/c10d/init.cpp, cannot be imported. See #47153
I tried to follow the approach used for another enum type, ReduceOp, defined in the same file, but it did not work: the C++ enum class for ReduceOp is defined in the torch/lib/c10d library, whereas BuiltinCommHookType is defined in the torch/csrc/distributed library. These two libraries are compiled in two different ways.
To avoid adding typing to the distributed package, which could be a separate project, I simply removed the BuiltinCommHookType argument type in this file.
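For readers unfamiliar with the guarded-import pattern described above, here is a minimal sketch. It is not the code from this diff: the helper name `load_builtin_comm_hook_type` is hypothetical, and the extra check for whether `torch` is installed at all is an assumption added so the snippet runs anywhere. The only PyTorch names it relies on are `torch.distributed.is_available()` and the `BuiltinCommHookType` enum this PR exposes.

```python
import importlib.util


def load_builtin_comm_hook_type():
    """Return torch.distributed.BuiltinCommHookType, or None if unavailable.

    Hypothetical helper for illustration only; it is not part of this PR.
    """
    # Assumption: tolerate environments where torch is not installed at all.
    if importlib.util.find_spec("torch") is None:
        return None

    import torch.distributed as dist

    # The enum is bound in the c10d Python bindings, so it only exists when
    # the distributed package was compiled into this build of PyTorch.
    if not dist.is_available():
        return None

    # getattr with a default keeps the sketch safe on builds that lack the enum.
    return getattr(dist, "BuiltinCommHookType", None)


# No type annotation on the argument: annotating it with the enum would
# require the name to be importable even when distributed is unavailable.
def describe_hook(comm_hook_type):
    return "no built-in hook" if comm_hook_type is None else str(comm_hook_type)


hook_type = load_builtin_comm_hook_type()
print(describe_hook(None if hook_type is None else hook_type.ALLREDUCE))
```

Leaving the argument untyped mirrors the choice described above: the function signature never has to name a type that may not exist in a non-distributed build.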
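Main Changes in #46959:
1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set, a C++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the built-in comm hooks.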
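As a rough illustration of the reducer change from #46959 (once the hook type is set, a comm hook instance is created in the autograd hook), the following is a toy pure-Python model. It is only a sketch under stated assumptions: `ToyCommHookType`, `ToyReducer`, and both hook functions are hypothetical stand-ins, the "ranks" are simulated with plain lists, and none of this reflects the actual C++ implementation.

```python
import struct
from enum import Enum
from typing import List


class ToyCommHookType(Enum):
    """Hypothetical stand-in for the BuiltinCommHookType enum."""
    ALLREDUCE = 0
    FP16_COMPRESS = 1


def allreduce_hook(per_rank_grads: List[List[float]]) -> List[float]:
    """Average gradients element-wise across the simulated ranks."""
    world_size = len(per_rank_grads)
    return [sum(vals) / world_size for vals in zip(*per_rank_grads)]


def _to_fp16(x: float) -> float:
    # Round-trip through IEEE 754 half precision to simulate compression.
    return struct.unpack("e", struct.pack("e", x))[0]


def fp16_compress_hook(per_rank_grads: List[List[float]]) -> List[float]:
    """Cast each gradient to half precision, then average as above."""
    compressed = [[_to_fp16(g) for g in grads] for grads in per_rank_grads]
    return allreduce_hook(compressed)


_HOOKS = {
    ToyCommHookType.ALLREDUCE: allreduce_hook,
    ToyCommHookType.FP16_COMPRESS: fp16_compress_hook,
}


class ToyReducer:
    """Toy model: registering only records the type; the hook is created lazily."""

    def __init__(self):
        self.builtin_comm_hook_type = None
        self.comm_hook = None

    def register_builtin_comm_hook(self, hook_type):
        if self.builtin_comm_hook_type is not None:
            raise RuntimeError("a built-in comm hook is already registered")
        self.builtin_comm_hook_type = hook_type

    def autograd_hook(self, per_rank_grads):
        # Create the hook on first use, once a type has been set.
        if self.builtin_comm_hook_type is not None and self.comm_hook is None:
            self.comm_hook = _HOOKS[self.builtin_comm_hook_type]
        hook = self.comm_hook or allreduce_hook  # default: plain allreduce
        return hook(per_rank_grads)


reducer = ToyReducer()
reducer.register_builtin_comm_hook(ToyCommHookType.FP16_COMPRESS)
print(reducer.autograd_hook([[1.0, 2.0], [3.0, 4.0]]))
```

The design point the sketch tries to capture is that Python only passes an enum value across the binding boundary, while the hook object itself is constructed on the reducer side.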
Original PR issue: C++ DDP Communication Hook #46348
Differential Revision: D24700959