New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Gradient Compression] Surface C++ comm hooks to Python API as built-in comm hooks #46959
Conversation
…in comm hooks 1. Implement the Pybind part. 2. In the reducer, once the builtin_comm_hook_type is set, a c++ comm hook instance will be created in Reducer::autograd_hook. 3. Added unit tests for the builit-in comm hooks. Original PR issue: C++ DDP Communication Hook #46348 Differential Revision: [D24471910](https://our.internmc.facebook.com/intern/diff/D24471910/) [ghstack-poisoned]
💊 CI failures summary and remediationsAs of commit 3f27a7b (more details on the Dr. CI page): 💚 💚 Looks good so far! There are no failures yet. 💚 💚 This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group. This comment has been revised 34 times. |
…I as built-in comm hooks" 1. Implemented the Pybind part. 2. In the reducer, once the builtin_comm_hook_type is set, a c++ comm hook instance will be created in Reducer::autograd_hook. 3. Added unit tests for the builit-in comm hooks. Original PR issue: C++ DDP Communication Hook #46348 Differential Revision: [D24471910](https://our.internmc.facebook.com/intern/diff/D24471910/) [ghstack-poisoned]
…in comm hooks Pull Request resolved: #46959 1. Implemented the Pybind part. 2. In the reducer, once the builtin_comm_hook_type is set, a c++ comm hook instance will be created in Reducer::autograd_hook. 3. Added unit tests for the builit-in comm hooks. Original PR issue: C++ DDP Communication Hook #46348 ghstack-source-id: 115321037 Differential Revision: [D24471910](https://our.internmc.facebook.com/intern/diff/D24471910/)
…I as built-in comm hooks" 1. Implemented the Pybind part. 2. In the reducer, once the builtin_comm_hook_type is set, a c++ comm hook instance will be created in Reducer::autograd_hook. 3. Added unit tests for the builit-in comm hooks. Original PR issue: C++ DDP Communication Hook #46348 Differential Revision: [D24471910](https://our.internmc.facebook.com/intern/diff/D24471910/) [ghstack-poisoned]
…in comm hooks Pull Request resolved: #46959 1. Implemented the Pybind part. 2. In the reducer, once the builtin_comm_hook_type is set, a c++ comm hook instance will be created in Reducer::autograd_hook. 3. Added unit tests for the builit-in comm hooks. Original PR issue: C++ DDP Communication Hook #46348 ghstack-source-id: 115335197 Differential Revision: [D24471910](https://our.internmc.facebook.com/intern/diff/D24471910/)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also validate that this PR shows the same perf improvements on the HPC trainer as our earlier C++ hacks?
…I as built-in comm hooks" 1. Implemented the Pybind part. 2. In the reducer, once the builtin_comm_hook_type is set, a c++ comm hook instance will be created in Reducer::autograd_hook. 3. Added unit tests for the builit-in comm hooks. Original PR issue: C++ DDP Communication Hook #46348 Differential Revision: [D24471910](https://our.internmc.facebook.com/intern/diff/D24471910/) [ghstack-poisoned]
…in comm hooks Pull Request resolved: #46959 1. Implemented the Pybind part. 2. In the reducer, once the builtin_comm_hook_type is set, a c++ comm hook instance will be created in Reducer::autograd_hook. 3. Added unit tests for the builit-in comm hooks. Original PR issue: C++ DDP Communication Hook #46348 ghstack-source-id: 115450614 Differential Revision: [D24471910](https://our.internmc.facebook.com/intern/diff/D24471910/)
…I as built-in comm hooks" 1. Implemented the Pybind part. 2. In the reducer, once the builtin_comm_hook_type is set, a c++ comm hook instance will be created in Reducer::autograd_hook. 3. Added unit tests for the builit-in comm hooks. Original PR issue: C++ DDP Communication Hook #46348 Differential Revision: [D24471910](https://our.internmc.facebook.com/intern/diff/D24471910/) [ghstack-poisoned]
…in comm hooks Pull Request resolved: #46959 1. Implemented the Pybind part. 2. In the reducer, once the builtin_comm_hook_type is set, a c++ comm hook instance will be created in Reducer::autograd_hook. 3. Added unit tests for the builit-in comm hooks. Original PR issue: C++ DDP Communication Hook #46348 ghstack-source-id: 115455858 Differential Revision: [D24471910](https://our.internmc.facebook.com/intern/diff/D24471910/)
It seems that this version is slightly (1%-2%) slower than the earlier C++ hacks. I suspect that the overheads are due to the usage of future.then() in our C++ comm hook. |
…I as built-in comm hooks" 1. Implemented the Pybind part. 2. In the reducer, once the builtin_comm_hook_type is set, a c++ comm hook instance will be created in Reducer::autograd_hook. 3. Added unit tests for the builit-in comm hooks. Original PR issue: C++ DDP Communication Hook #46348 Differential Revision: [D24471910](https://our.internmc.facebook.com/intern/diff/D24471910/) [ghstack-poisoned]
…in comm hooks Pull Request resolved: #46959 1. Implemented the Pybind part. 2. In the reducer, once the builtin_comm_hook_type is set, a c++ comm hook instance will be created in Reducer::autograd_hook. 3. Added unit tests for the builit-in comm hooks. Original PR issue: C++ DDP Communication Hook #46348 ghstack-source-id: 115496448 Differential Revision: [D24471910](https://our.internmc.facebook.com/intern/diff/D24471910/)
Does this mean it is actually on par with the Python Hook or is the python hook slower than this hook? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, although there are some merge conflicts in CI. Please fix those before landing.
Need to run it on Ads 10x to verify. Unfortunately, that training command was somehow broken when I wrote this diff. Will verify it again later. |
Sure. I don't think I can land it before the fix. |
…I as built-in comm hooks" 1. Implemented the Pybind part. 2. In the reducer, once the builtin_comm_hook_type is set, a c++ comm hook instance will be created in Reducer::autograd_hook. 3. Added unit tests for the builit-in comm hooks. Original PR issue: C++ DDP Communication Hook #46348 Differential Revision: [D24471910](https://our.internmc.facebook.com/intern/diff/D24471910/) [ghstack-poisoned]
…in comm hooks Pull Request resolved: #46959 1. Implemented the Pybind part. 2. In the reducer, once the builtin_comm_hook_type is set, a c++ comm hook instance will be created in Reducer::autograd_hook. 3. Added unit tests for the builit-in comm hooks. Original PR issue: C++ DDP Communication Hook #46348 ghstack-source-id: 115629230 Differential Revision: [D24471910](https://our.internmc.facebook.com/intern/diff/D24471910/)
This pull request has been merged in ee0033a. |
…in comm hooks This is almost same as #46959, except that in caffe2/torch/nn/parallel/distributed.py, BuiltinCommHookType is imported conditionally, only when dist.is_available(). Otherwise, this Python enum type defined in caffe2/torch/scrc/distributed/c10d/init.cpp cannot be imported, which is similar to another enum type ReduceOp defined in the same file. See #47153 To review the diff on top of #46959, compare V1 vs Latest. Main Changes in V1 (#46959): 1. Implemented the Pybind part. 2. In the reducer, once the builtin_comm_hook_type is set, a c++ comm hook instance will be created in Reducer::autograd_hook. 3. Added unit tests for the builit-in comm hooks. Original PR issue: C++ DDP Communication Hook #46348 Differential Revision: [D24700959](https://our.internmc.facebook.com/intern/diff/D24700959/) [ghstack-poisoned]
…in comm hooks This is almost same as #46959, except that in caffe2/torch/nn/parallel/distributed.py, BuiltinCommHookType is imported conditionally, only when dist.is_available(). Otherwise, this Python enum type defined in caffe2/torch/scrc/distributed/c10d/init.cpp cannot be imported, which is similar to another enum type ReduceOp defined in the same file. See #47153 To review the diff on top of #46959, compare V1 vs Latest. Main Changes in V1 (#46959): 1. Implemented the Pybind part. 2. In the reducer, once the builtin_comm_hook_type is set, a c++ comm hook instance will be created in Reducer::autograd_hook. 3. Added unit tests for the builit-in comm hooks. Original PR issue: C++ DDP Communication Hook #46348 Differential Revision: [D24700959](https://our.internmc.facebook.com/intern/diff/D24700959/) ghstack-source-id: 115753518 Pull Request resolved: #47270
…I as built-in comm hooks" This is almost same as #46959, except that in caffe2/torch/nn/parallel/distributed.py, BuiltinCommHookType is imported conditionally, only when dist.is_available(). Otherwise, this Python enum type defined in caffe2/torch/scrc/distributed/c10d/init.cpp cannot be imported, which is similar to another enum type ReduceOp defined in the same file. See #47153 Main Changes in #46959: 1. Implemented the Pybind part. 2. In the reducer, once the builtin_comm_hook_type is set, a c++ comm hook instance will be created in Reducer::autograd_hook. 3. Added unit tests for the builit-in comm hooks. Original PR issue: C++ DDP Communication Hook #46348 Differential Revision: [D24700959](https://our.internmc.facebook.com/intern/diff/D24700959/) [ghstack-poisoned]
…in comm hooks Pull Request resolved: #47270 This is almost same as #46959, except that in caffe2/torch/nn/parallel/distributed.py, BuiltinCommHookType should be imported conditionally, only when dist.is_available(). Otherwise, this Python enum type defined in caffe2/torch/scrc/distributed/c10d/init.cpp cannot be imported. See #47153 I tried to follow another enum type enum type ReduceOp defined in the same file, but did not work, because the C++ enum class is defined torch/lib/c10d library, but BuiltinCommHookType is defined in torch/csrc/distributed library. These two libraries are compiled in two different ways. To avoid adding typing to distributed package, which can be a new project, I simply removed the arg type of BuiltinCommHookType in this file. To review the diff on top of #46959, compare V1 vs Latest: https://www.internalfb.com/diff/D24700959?src_version_fbid=270445741055617 Main Changes in V1 (#46959): 1. Implemented the Pybind part. 2. In the reducer, once the builtin_comm_hook_type is set, a c++ comm hook instance will be created in Reducer::autograd_hook. 3. Added unit tests for the builit-in comm hooks. Original PR issue: C++ DDP Communication Hook #46348 ghstack-source-id: 115783237 Differential Revision: [D24700959](https://our.internmc.facebook.com/intern/diff/D24700959/)
…in comm hooks (#47270) Summary: Pull Request resolved: #47270 This is almost same as #46959, except that in caffe2/torch/nn/parallel/distributed.py, BuiltinCommHookType should be imported conditionally, only when dist.is_available(). Otherwise, this Python enum type defined in caffe2/torch/scrc/distributed/c10d/init.cpp cannot be imported. See #47153 I tried to follow another enum type enum type ReduceOp defined in the same file, but did not work, because the C++ enum class is defined torch/lib/c10d library, but BuiltinCommHookType is defined in torch/csrc/distributed library. These two libraries are compiled in two different ways. To avoid adding typing to distributed package, which can be a new project, I simply removed the arg type of BuiltinCommHookType in this file. To review the diff on top of #46959, compare V1 vs Latest: https://www.internalfb.com/diff/D24700959?src_version_fbid=270445741055617 Main Changes in V1 (#46959): 1. Implemented the Pybind part. 2. In the reducer, once the builtin_comm_hook_type is set, a c++ comm hook instance will be created in Reducer::autograd_hook. 3. Added unit tests for the builit-in comm hooks. Original PR issue: C++ DDP Communication Hook #46348 ghstack-source-id: 115783237 Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl //arvr/projects/eye_tracking/Masquerade:python_test USE_DISTRIBUTED=0 USE_GLOO=0 BUILD_TEST=0 USE_CUDA=1 USE_MKLDNN=0 DEBUG=0 python setup.py install Reviewed By: mrshenli Differential Revision: D24700959 fbshipit-source-id: 69f303a48ae275aa856e6e9b50e12ad8602e1c7a
Stack from ghstack:
Original PR issue: C++ DDP Communication Hook #46348
Differential Revision: D24471910