
[Gradient Compression] Allow PowerSGD to run vanilla allreduce for the first K iterations #50973

Closed
wants to merge 4 commits

Conversation

@wayi1 (Contributor) commented Jan 23, 2021

Stack from ghstack:

This extends the original PowerSGD method into a hybrid approach: vanilla allreduce + PowerSGD. It can help further improve the accuracy, at the cost of a lower speedup.

Also add more comments on the fields in PowerSGDState.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: D26031478
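The hybrid behavior described above can be sketched as a toy, self-contained simulation. The field names `start_powerSGD_iter` and `iter` mirror the real `PowerSGDState` fields, but the class below is a simplified stand-in, and the two branch bodies are placeholders for the real `dist.all_reduce` and low-rank compression paths, not the actual implementation:

```python
# Hedged sketch of the hybrid "vanilla allreduce + PowerSGD" dispatch.
# `PowerSGDState` here is a toy stand-in for the real state object.

class PowerSGDState:
    def __init__(self, start_powerSGD_iter=10):
        # Run vanilla allreduce for the first `start_powerSGD_iter`
        # iterations, then switch to PowerSGD compression.
        self.start_powerSGD_iter = start_powerSGD_iter
        self.iter = 0  # incremented once per call in this toy version

def hybrid_hook(state, grad):
    if state.iter < state.start_powerSGD_iter:
        out = ("allreduce", grad)   # stand-in for dist.all_reduce
    else:
        out = ("powerSGD", grad)    # stand-in for low-rank compression
    state.iter += 1
    return out

state = PowerSGDState(start_powerSGD_iter=3)
kinds = [hybrid_hook(state, g)[0] for g in range(5)]
# First 3 calls use allreduce, later calls use PowerSGD.
```

The accuracy benefit comes from the warm-up phase: the earliest gradients, which compression distorts the most, are communicated exactly.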

@facebook-github-bot commented Jan 23, 2021

💊 CI failures summary and remediations

As of commit 61f32b6 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

This comment was automatically generated by Dr. CI.

@facebook-github-bot facebook-github-bot added cla signed oncall: distributed Add this issue/PR to distributed oncall triage queue labels Jan 23, 2021
wayi1 pushed commits that referenced this pull request Jan 23, 2021
@rohan-varma (Member) left a comment

Awesome, excited to see the results of this! One interesting area to explore would be hook composition, i.e., letting the user specify whether to run vanilla allreduce, fp16, or even some other algorithm for the first t iterations. Ideally, it would be great if we could express this by composing the different hooks as building blocks, maybe something like:

def main_hook(state, iter, args):
    hook_for_iter = dispatch_table[iter, args]  # dispatch table that determines which hook to run based on args
    return hook_for_iter(state, args)

This is admittedly overkill for the current use case, but may be an interesting area to explore if we realize we are composing more hooks.
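The composition idea above can be fleshed out as a runnable toy. Everything here is hypothetical illustration of the reviewer's sketch, not an existing PyTorch API: the three hook functions just tag their input, and `pick_hook` plays the role of the dispatch table, keyed on the iteration:

```python
# Toy sketch of composing comm hooks via a dispatch function.
# All names are hypothetical; real hooks would perform communication.

def allreduce_hook(state, bucket):
    return ("allreduce", bucket)

def fp16_hook(state, bucket):
    return ("fp16", bucket)

def powerSGD_hook(state, bucket):
    return ("powerSGD", bucket)

def pick_hook(iteration, warmup_iters=10):
    # Dispatch: vanilla allreduce during warm-up, compression afterwards.
    return allreduce_hook if iteration < warmup_iters else powerSGD_hook

def main_hook(state, iteration, bucket):
    hook = pick_hook(iteration)
    return hook(state, bucket)

results = [main_hook(None, i, "grad")[0] for i in (0, 9, 10)]
```

With more registered hooks, `pick_hook` could grow into a real table mapping iteration ranges (or other state) to hooks, which is the building-block composition the comment proposes.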

@rohan-varma (Member) commented:
Also, can we add an appropriate unittest for this change?

@wayi1 (Contributor, Author) commented Jan 26, 2021

Awesome, excited to see the results of this! One interesting area to explore would be hook composition, i.e. the user can specify whether to run vanilla allreduce or fp16 or even some other algorithm for the first t iterations. Ideally, it would be great if we can just express this by composing the different hooks as building blocks, maybe something like:

def main_hook(state, iter, args):
    hook_for_iter = dispatch_table[iter, args]  # dispatch table that determines which hook to run based on args
    return hook_for_iter(state, args)

This is admittedly overkill for the current use case, but may be an interesting area to explore if we realize we are composing more hooks.

That will be one direction for future work. Since we don't have many comm hooks available yet, it's hard to see how it will work out, and I would prioritize exploring other compression approaches over this. A few comments about the more generic combination:

  1. I don't plan to treat fp16 as a compression hook in this setup. Instead, I will treat it as a wrapper, so it can be combined with vanilla allreduce, PowerSGD, and any future gradient compression. FP16 is just a type cast; it can be applied in conjunction with any other comm hook rather than being mutually exclusive with them.

Therefore, if we do not count fp16 as a standalone choice, so far we only have two valid options: vanilla allreduce and PowerSGD.

  2. I am not sure how useful it would be to support switching back and forth between different compression schemes, especially for a biased compressor like PowerSGD, which typically requires error compensation. PowerSGD needs to cache some error tensors and warm-up tensors, and switching to another compression scheme would discard these cached states, so there would be some extra accuracy loss from the discontinuity.
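The "fp16 as a wrapper" idea can be sketched as follows. This is a hypothetical, simplified illustration (PyTorch later shipped a similar `fp16_compress_wrapper`, but the code below works on plain Python floats via `struct`, not tensors): the wrapper casts gradients to half precision and then delegates to whichever hook it wraps, which is exactly why fp16 composes with allreduce, PowerSGD, or anything else rather than being mutually exclusive.

```python
import struct

def to_fp16(x):
    # Round-trip a Python float through IEEE-754 half precision
    # ("type cast" to fp16 and back).
    return struct.unpack("e", struct.pack("e", x))[0]

def fp16_wrapper(hook):
    # Wrap ANY comm hook: cast gradients to fp16 before the hook runs.
    def wrapped(state, grads):
        halved = [to_fp16(g) for g in grads]
        return hook(state, halved)
    return wrapped

def allreduce_hook(state, grads):
    return sum(grads)  # stand-in for dist.all_reduce

compressed_allreduce = fp16_wrapper(allreduce_hook)
out = compressed_allreduce(None, [0.1, 0.2])  # close to 0.3, minus fp16 rounding
```

Wrapping `powerSGD_hook` instead of `allreduce_hook` would need no changes to the wrapper itself, which is the point of treating fp16 as orthogonal to the compression choice.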

@wayi1 (Contributor, Author) commented Jan 26, 2021

Also, can we add an appropriate unittest for this change?

By default, start_powerSGD_iter is 10 rather than 0, so the first 10 iterations already run vanilla allreduce; this path is therefore covered by the existing unit test.

wayi1 pushed a commit that referenced this pull request Jan 26, 2021
… by creating a util function that make a vanilla allreduce future

Address #50973 (comment)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D26070147](https://our.internmc.facebook.com/intern/diff/D26070147/)

@wayi1 wayi1 closed this Jan 26, 2021
facebook-github-bot pushed a commit that referenced this pull request Jan 26, 2021
…e first K iterations (#50973)

Summary:
Pull Request resolved: #50973

This can extend the original PowerSGD method to a hybrid approach: vanilla allreduce + PowerSGD. This can help further improve the accuracy, at the cost of a lower speedup.

Also add more comments on the fields in `PowerSGDState`.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120257202

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D26031478

fbshipit-source-id: d72e70bb28ba018f53223c2a4345306980b3084e
wayi1 pushed a commit that referenced this pull request Jan 28, 2021
… for the first K iterations

Similar to #50973, allow the batched version to run vanilla allreduce for the first K iterations.

This may be useful if the batched version can be applied to some use cases where the accuracy requirement is not very strict.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D26077709](https://our.internmc.facebook.com/intern/diff/D26077709/)

wayi1 pushed commits that referenced this pull request Jan 29, 2021
facebook-github-bot pushed a commit that referenced this pull request Jan 29, 2021
… by creating a util function that make a vanilla allreduce future (#51094)

Summary:
Pull Request resolved: #51094

Address #50973 (comment)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120619680

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl

Reviewed By: rohan-varma

Differential Revision: D26070147

fbshipit-source-id: 8c9339f1511e8f24cc906b9411cfe4850a5a6d81
wayi1 pushed a commit that referenced this pull request Jan 30, 2021
…werSGD_hook.py by creating a util function that make a vanilla allreduce future

Resubmission of #51094

Address #50973 (comment)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D26162333](https://our.internmc.facebook.com/intern/diff/D26162333/)

wayi1 pushed commits that referenced this pull request Jan 30, 2021
wayi1 pushed commits that referenced this pull request Jan 31, 2021
facebook-github-bot pushed a commit that referenced this pull request Feb 1, 2021
…werSGD_hook.py by creating a util function that make a vanilla allreduce future (#51400)

Summary:
Pull Request resolved: #51400

Resubmission of #51094

Address #50973 (comment)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120725690

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl

Reviewed By: rohan-varma

Differential Revision: D26162333

fbshipit-source-id: ccc2eae5383a23673e00d61cb5570fb8bf749cd0
facebook-github-bot pushed a commit that referenced this pull request Feb 1, 2021
… for the first K iterations (#51270)

Summary:
Pull Request resolved: #51270

Similar to #50973, allow the batched version to run vanilla allreduce for the first K iterations.

This may be useful if the batched version can be applied to some use cases where the accuracy requirement is not very strict.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120725858

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

baseline: f248001754
batched PowerSGD: f246960752

The training time was reduced from 54m48s to 30m33s, and the accuracy is approximately the same: 44.21 vs. 44.35.

Reviewed By: rohan-varma

Differential Revision: D26077709

fbshipit-source-id: 6afeefad7a3fbdd7da2cbffb56dfbad855a96cb5
@facebook-github-bot facebook-github-bot deleted the gh/SciPioneer/45/head branch February 25, 2021 15:17