Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Gradient Compression] Surface C++ comm hooks to Python API as built-in comm hooks #46959

Closed
wants to merge 7 commits into from

Conversation

wayi1
Copy link
Contributor

@wayi1 wayi1 commented Oct 27, 2020

Stack from ghstack:

  1. Implemented the Pybind part.
  2. In the reducer, once the builtin_comm_hook_type is set, a c++ comm hook instance will be created in Reducer::autograd_hook.
  3. Added unit tests for the builit-in comm hooks.

Original PR issue: C++ DDP Communication Hook #46348

Differential Revision: D24471910

…in comm hooks

1. Implement the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set,  a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the builit-in comm hooks.

Original PR issue: C++ DDP Communication Hook #46348

Differential Revision: [D24471910](https://our.internmc.facebook.com/intern/diff/D24471910/)

[ghstack-poisoned]
@dr-ci
Copy link

dr-ci bot commented Oct 27, 2020

💊 CI failures summary and remediations

As of commit 3f27a7b (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 34 times.

…I as built-in comm hooks"


1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set,  a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the builit-in comm hooks.

Original PR issue: C++ DDP Communication Hook #46348

Differential Revision: [D24471910](https://our.internmc.facebook.com/intern/diff/D24471910/)

[ghstack-poisoned]
wayi1 pushed a commit that referenced this pull request Oct 28, 2020
…in comm hooks

Pull Request resolved: #46959

1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set,  a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the builit-in comm hooks.

Original PR issue: C++ DDP Communication Hook #46348
ghstack-source-id: 115321037

Differential Revision: [D24471910](https://our.internmc.facebook.com/intern/diff/D24471910/)
…I as built-in comm hooks"


1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set,  a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the builit-in comm hooks.

Original PR issue: C++ DDP Communication Hook #46348

Differential Revision: [D24471910](https://our.internmc.facebook.com/intern/diff/D24471910/)

[ghstack-poisoned]
wayi1 pushed a commit that referenced this pull request Oct 28, 2020
…in comm hooks

Pull Request resolved: #46959

1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set,  a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the builit-in comm hooks.

Original PR issue: C++ DDP Communication Hook #46348
ghstack-source-id: 115335197

Differential Revision: [D24471910](https://our.internmc.facebook.com/intern/diff/D24471910/)
Copy link
Contributor

@pritamdamania87 pritamdamania87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also validate that this PR shows the same perf improvements on the HPC trainer as our earlier C++ hacks?

torch/nn/parallel/distributed.py Outdated Show resolved Hide resolved
test/distributed/test_c10d.py Outdated Show resolved Hide resolved
torch/csrc/distributed/c10d/reducer.cpp Outdated Show resolved Hide resolved
torch/csrc/distributed/c10d/reducer.cpp Outdated Show resolved Hide resolved
torch/distributed/algorithms/ddp_comm_hooks/__init__.py Outdated Show resolved Hide resolved
torch/lib/c10d/ProcessGroupNCCL.hpp Outdated Show resolved Hide resolved
torch/nn/parallel/distributed.py Show resolved Hide resolved
…I as built-in comm hooks"


1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set,  a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the builit-in comm hooks.

Original PR issue: C++ DDP Communication Hook #46348

Differential Revision: [D24471910](https://our.internmc.facebook.com/intern/diff/D24471910/)

[ghstack-poisoned]
wayi1 pushed a commit that referenced this pull request Oct 29, 2020
…in comm hooks

Pull Request resolved: #46959

1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set,  a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the builit-in comm hooks.

Original PR issue: C++ DDP Communication Hook #46348
ghstack-source-id: 115450614

Differential Revision: [D24471910](https://our.internmc.facebook.com/intern/diff/D24471910/)
…I as built-in comm hooks"


1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set,  a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the builit-in comm hooks.

Original PR issue: C++ DDP Communication Hook #46348

Differential Revision: [D24471910](https://our.internmc.facebook.com/intern/diff/D24471910/)

[ghstack-poisoned]
wayi1 pushed a commit that referenced this pull request Oct 29, 2020
…in comm hooks

Pull Request resolved: #46959

1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set,  a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the builit-in comm hooks.

Original PR issue: C++ DDP Communication Hook #46348
ghstack-source-id: 115455858

Differential Revision: [D24471910](https://our.internmc.facebook.com/intern/diff/D24471910/)
@wayi1
Copy link
Contributor Author

wayi1 commented Oct 29, 2020

Can we also validate that this PR shows the same perf improvements on the HPC trainer as our earlier C++ hacks?

It seems that this version is slightly (1%-2%) slower than the earlier C++ hacks. I suspect that the overheads are due to the usage of future.then() in our C++ comm hook.

…I as built-in comm hooks"


1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set,  a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the builit-in comm hooks.

Original PR issue: C++ DDP Communication Hook #46348

Differential Revision: [D24471910](https://our.internmc.facebook.com/intern/diff/D24471910/)

[ghstack-poisoned]
wayi1 pushed a commit that referenced this pull request Oct 29, 2020
…in comm hooks

Pull Request resolved: #46959

1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set,  a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the builit-in comm hooks.

Original PR issue: C++ DDP Communication Hook #46348
ghstack-source-id: 115496448

Differential Revision: [D24471910](https://our.internmc.facebook.com/intern/diff/D24471910/)
@pritamdamania87
Copy link
Contributor

It seems that this version is slightly (1%-2%) slower than the earlier C++ hacks. I suspect that the overheads are due to the usage of future.then() in our C++ comm hook.

Does this mean it is actually on par with the Python Hook or is the python hook slower than this hook?

Copy link
Contributor

@pritamdamania87 pritamdamania87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, although there are some merge conflicts in CI. Please fix those before landing.

@wayi1
Copy link
Contributor Author

wayi1 commented Oct 30, 2020

Looks good to me, although there are some merge conflicts in CI. Please fix those before landing.

Need to run it on Ads 10x to verify. Unfortunately, that training command was somehow broken when I wrote this diff. Will verify it again later.

@wayi1
Copy link
Contributor Author

wayi1 commented Oct 30, 2020

Looks good to me, although there are some merge conflicts in CI. Please fix those before landing.

Sure. I don't think I can land it before the fix.

…I as built-in comm hooks"


1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set,  a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the builit-in comm hooks.

Original PR issue: C++ DDP Communication Hook #46348

Differential Revision: [D24471910](https://our.internmc.facebook.com/intern/diff/D24471910/)

[ghstack-poisoned]
wayi1 pushed a commit that referenced this pull request Oct 30, 2020
…in comm hooks

Pull Request resolved: #46959

1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set,  a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the builit-in comm hooks.

Original PR issue: C++ DDP Communication Hook #46348
ghstack-source-id: 115629230

Differential Revision: [D24471910](https://our.internmc.facebook.com/intern/diff/D24471910/)
@facebook-github-bot
Copy link
Contributor

This pull request has been merged in ee0033a.

wayi1 pushed a commit that referenced this pull request Nov 3, 2020
…in comm hooks

This is almost same as #46959, except that in caffe2/torch/nn/parallel/distributed.py, BuiltinCommHookType is imported conditionally, only when dist.is_available(). Otherwise, this Python enum type defined in caffe2/torch/scrc/distributed/c10d/init.cpp cannot be imported, which is similar to another enum type ReduceOp defined in the same file. See #47153

To review the diff on top of #46959, compare V1 vs Latest.

Main Changes in V1 (#46959):
1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set,  a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the builit-in comm hooks.

Original PR issue: C++ DDP Communication Hook #46348

Differential Revision: [D24700959](https://our.internmc.facebook.com/intern/diff/D24700959/)

[ghstack-poisoned]
wayi1 pushed a commit that referenced this pull request Nov 3, 2020
…in comm hooks

This is almost same as #46959, except that in caffe2/torch/nn/parallel/distributed.py, BuiltinCommHookType is imported conditionally, only when dist.is_available(). Otherwise, this Python enum type defined in caffe2/torch/scrc/distributed/c10d/init.cpp cannot be imported, which is similar to another enum type ReduceOp defined in the same file. See #47153

To review the diff on top of #46959, compare V1 vs Latest.

Main Changes in V1 (#46959):
1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set,  a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the builit-in comm hooks.

Original PR issue: C++ DDP Communication Hook #46348

Differential Revision: [D24700959](https://our.internmc.facebook.com/intern/diff/D24700959/)

ghstack-source-id: 115753518
Pull Request resolved: #47270
@facebook-github-bot facebook-github-bot deleted the gh/SciPioneer/19/head branch November 3, 2020 15:21
wayi1 pushed a commit that referenced this pull request Nov 3, 2020
…I as built-in comm hooks"


This is almost same as #46959, except that in caffe2/torch/nn/parallel/distributed.py, BuiltinCommHookType is imported conditionally, only when dist.is_available(). Otherwise, this Python enum type defined in caffe2/torch/scrc/distributed/c10d/init.cpp cannot be imported, which is similar to another enum type ReduceOp defined in the same file. See #47153

Main Changes in #46959:
1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set,  a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the builit-in comm hooks.

Original PR issue: C++ DDP Communication Hook #46348

Differential Revision: [D24700959](https://our.internmc.facebook.com/intern/diff/D24700959/)

[ghstack-poisoned]
wayi1 pushed a commit that referenced this pull request Nov 3, 2020
…in comm hooks

Pull Request resolved: #47270

This is almost same as #46959, except that in caffe2/torch/nn/parallel/distributed.py, BuiltinCommHookType should be imported conditionally, only when dist.is_available(). Otherwise, this Python enum type defined in caffe2/torch/scrc/distributed/c10d/init.cpp cannot be imported. See #47153

I tried to follow another enum type enum type ReduceOp defined in the same file, but did not work, because the C++ enum class is defined torch/lib/c10d library, but BuiltinCommHookType is defined in torch/csrc/distributed library. These two libraries are compiled in two different ways.

To avoid adding typing to distributed package, which can be a new project, I simply removed the arg type of BuiltinCommHookType in this file.

To review the diff on top of #46959, compare V1 vs Latest:
https://www.internalfb.com/diff/D24700959?src_version_fbid=270445741055617

Main Changes in V1 (#46959):
1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set,  a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the builit-in comm hooks.

Original PR issue: C++ DDP Communication Hook #46348
ghstack-source-id: 115783237

Differential Revision: [D24700959](https://our.internmc.facebook.com/intern/diff/D24700959/)
facebook-github-bot pushed a commit that referenced this pull request Nov 4, 2020
…in comm hooks (#47270)

Summary:
Pull Request resolved: #47270

This is almost same as #46959, except that in caffe2/torch/nn/parallel/distributed.py, BuiltinCommHookType should be imported conditionally, only when dist.is_available(). Otherwise, this Python enum type defined in caffe2/torch/scrc/distributed/c10d/init.cpp cannot be imported. See #47153

I tried to follow another enum type enum type ReduceOp defined in the same file, but did not work, because the C++ enum class is defined torch/lib/c10d library, but BuiltinCommHookType is defined in torch/csrc/distributed library. These two libraries are compiled in two different ways.

To avoid adding typing to distributed package, which can be a new project, I simply removed the arg type of BuiltinCommHookType in this file.

To review the diff on top of #46959, compare V1 vs Latest:
https://www.internalfb.com/diff/D24700959?src_version_fbid=270445741055617

Main Changes in V1 (#46959):
1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set,  a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the builit-in comm hooks.

Original PR issue: C++ DDP Communication Hook #46348
ghstack-source-id: 115783237

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl

//arvr/projects/eye_tracking/Masquerade:python_test

USE_DISTRIBUTED=0 USE_GLOO=0 BUILD_TEST=0 USE_CUDA=1 USE_MKLDNN=0 DEBUG=0 python setup.py install

Reviewed By: mrshenli

Differential Revision: D24700959

fbshipit-source-id: 69f303a48ae275aa856e6e9b50e12ad8602e1c7a
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants