Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Gradient Compression] Surface C++ comm hooks to Python API as built-in comm hooks #47270

Closed
wants to merge 2 commits into from

Conversation

wayi1
Copy link
Contributor

@wayi1 wayi1 commented Nov 3, 2020

Stack from ghstack:

This is almost same as #46959, except that in caffe2/torch/nn/parallel/distributed.py, BuiltinCommHookType should be imported conditionally, only when dist.is_available(). Otherwise, this Python enum type defined in caffe2/torch/scrc/distributed/c10d/init.cpp cannot be imported. See #47153

I tried to follow another enum type enum type ReduceOp defined in the same file, but did not work, because the C++ enum class is defined torch/lib/c10d library, but BuiltinCommHookType is defined in torch/csrc/distributed library. These two libraries are compiled in two different ways.

To avoid adding typing to distributed package, which can be a new project, I simply removed the arg type of BuiltinCommHookType in this file.

Main Changes in #46959:

  1. Implemented the Pybind part.
  2. In the reducer, once the builtin_comm_hook_type is set, a c++ comm hook instance will be created in Reducer::autograd_hook.
  3. Added unit tests for the builit-in comm hooks.

Original PR issue: C++ DDP Communication Hook #46348

Differential Revision: D24700959

…in comm hooks

This is almost same as #46959, except that in caffe2/torch/nn/parallel/distributed.py, BuiltinCommHookType is imported conditionally, only when dist.is_available(). Otherwise, this Python enum type defined in caffe2/torch/scrc/distributed/c10d/init.cpp cannot be imported, which is similar to another enum type ReduceOp defined in the same file. See #47153

To review the diff on top of #46959, compare V1 vs Latest.

Main Changes in V1 (#46959):
1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set,  a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the builit-in comm hooks.

Original PR issue: C++ DDP Communication Hook #46348

Differential Revision: [D24700959](https://our.internmc.facebook.com/intern/diff/D24700959/)

[ghstack-poisoned]
@facebook-github-bot facebook-github-bot added cla signed oncall: distributed Add this issue/PR to distributed oncall triage queue labels Nov 3, 2020
wayi1 pushed a commit that referenced this pull request Nov 3, 2020
…in comm hooks

This is almost same as #46959, except that in caffe2/torch/nn/parallel/distributed.py, BuiltinCommHookType is imported conditionally, only when dist.is_available(). Otherwise, this Python enum type defined in caffe2/torch/scrc/distributed/c10d/init.cpp cannot be imported, which is similar to another enum type ReduceOp defined in the same file. See #47153

To review the diff on top of #46959, compare V1 vs Latest.

Main Changes in V1 (#46959):
1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set,  a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the builit-in comm hooks.

Original PR issue: C++ DDP Communication Hook #46348

Differential Revision: [D24700959](https://our.internmc.facebook.com/intern/diff/D24700959/)

ghstack-source-id: 115753518
Pull Request resolved: #47270
@codecov
Copy link

codecov bot commented Nov 3, 2020

Codecov Report

Merging #47270 into gh/SciPioneer/24/base will decrease coverage by 0.00%.
The diff coverage is 35.29%.

@@                    Coverage Diff                    @@
##           gh/SciPioneer/24/base   #47270      +/-   ##
=========================================================
- Coverage                  60.84%   60.83%   -0.01%     
=========================================================
  Files                       2749     2749              
  Lines                     254114   254144      +30     
=========================================================
+ Hits                      154605   154618      +13     
- Misses                     99509    99526      +17     

…I as built-in comm hooks"


This is almost same as #46959, except that in caffe2/torch/nn/parallel/distributed.py, BuiltinCommHookType is imported conditionally, only when dist.is_available(). Otherwise, this Python enum type defined in caffe2/torch/scrc/distributed/c10d/init.cpp cannot be imported, which is similar to another enum type ReduceOp defined in the same file. See #47153

Main Changes in #46959:
1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set,  a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the builit-in comm hooks.

Original PR issue: C++ DDP Communication Hook #46348

Differential Revision: [D24700959](https://our.internmc.facebook.com/intern/diff/D24700959/)

[ghstack-poisoned]
wayi1 pushed a commit that referenced this pull request Nov 3, 2020
…in comm hooks

Pull Request resolved: #47270

This is almost same as #46959, except that in caffe2/torch/nn/parallel/distributed.py, BuiltinCommHookType should be imported conditionally, only when dist.is_available(). Otherwise, this Python enum type defined in caffe2/torch/scrc/distributed/c10d/init.cpp cannot be imported. See #47153

I tried to follow another enum type enum type ReduceOp defined in the same file, but did not work, because the C++ enum class is defined torch/lib/c10d library, but BuiltinCommHookType is defined in torch/csrc/distributed library. These two libraries are compiled in two different ways.

To avoid adding typing to distributed package, which can be a new project, I simply removed the arg type of BuiltinCommHookType in this file.

To review the diff on top of #46959, compare V1 vs Latest:
https://www.internalfb.com/diff/D24700959?src_version_fbid=270445741055617

Main Changes in V1 (#46959):
1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set,  a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the builit-in comm hooks.

Original PR issue: C++ DDP Communication Hook #46348
ghstack-source-id: 115783237

Differential Revision: [D24700959](https://our.internmc.facebook.com/intern/diff/D24700959/)
@mrshenli
Copy link
Contributor

mrshenli commented Nov 3, 2020

Just a heads-up, there might be conflict with #47309. If so, please coordinate landing with @gmagogsfm

Copy link
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

builds look good so far, stamping to unblock. But please wait for all tests.

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in f91fcef.

@facebook-github-bot facebook-github-bot deleted the gh/SciPioneer/24/head branch November 7, 2020 15:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants