
[Gradient Compression] Allow PowerSGD to run vanilla allreduce for the first K iterations #50973

Closed
wants to merge 4 commits

Conversation

@wayi1 (Contributor) commented Jan 23, 2021

Stack from ghstack:

This extends the original PowerSGD method into a hybrid approach: vanilla allreduce + PowerSGD. It can help further improve the accuracy, at the cost of a lower speedup.

Also add more comments on the fields in PowerSGDState.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: D26031478
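The hybrid behavior described above can be sketched as a toy, self-contained simulation. The field names `start_powerSGD_iter` and `iter` mirror the real `PowerSGDState` fields, but the class below is a simplified stand-in, and the two branch bodies are placeholders for the real `dist.all_reduce` and low-rank compression paths, not the actual implementation:

```python
# Hedged sketch of the hybrid "vanilla allreduce + PowerSGD" dispatch.
# `PowerSGDState` here is a toy stand-in for the real state object.

class PowerSGDState:
    def __init__(self, start_powerSGD_iter=10):
        # Run vanilla allreduce for the first `start_powerSGD_iter`
        # iterations, then switch to PowerSGD compression.
        self.start_powerSGD_iter = start_powerSGD_iter
        self.iter = 0  # incremented once per call in this toy version

def hybrid_hook(state, grad):
    if state.iter < state.start_powerSGD_iter:
        out = ("allreduce", grad)   # stand-in for dist.all_reduce
    else:
        out = ("powerSGD", grad)    # stand-in for low-rank compression
    state.iter += 1
    return out

state = PowerSGDState(start_powerSGD_iter=3)
kinds = [hybrid_hook(state, g)[0] for g in range(5)]
# First 3 calls use allreduce, later calls use PowerSGD.
```

The accuracy benefit comes from the warm-up phase: the earliest gradients, which compression distorts the most, are communicated exactly.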

@facebook-github-bot commented Jan 23, 2021

💊 CI failures summary and remediations

As of commit 61f32b6 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

This comment was automatically generated by Dr. CI.

@facebook-github-bot facebook-github-bot added cla signed oncall: distributed Add this issue/PR to distributed oncall triage queue labels Jan 23, 2021
wayi1 pushed commits that referenced this pull request Jan 23, 2021
@rohan-varma (Member) left a comment

Awesome, excited to see the results of this! One interesting area to explore would be hook composition, i.e., letting the user specify whether to run vanilla allreduce, fp16, or even some other algorithm for the first t iterations. Ideally, it would be great if we could express this by composing the different hooks as building blocks, maybe something like:

def main_hook(state, iter, args):
    hook_for_iter = dispatch_table[iter, args]  # dispatch table that determines which hook to run based on args
    return hook_for_iter(state, args)

This is admittedly overkill for the current use case, but may be an interesting area to explore if we realize we are composing more hooks.
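The composition idea above can be fleshed out as a runnable toy. Everything here is hypothetical illustration of the reviewer's sketch, not an existing PyTorch API: the three hook functions just tag their input, and `pick_hook` plays the role of the dispatch table, keyed on the iteration:

```python
# Toy sketch of composing comm hooks via a dispatch function.
# All names are hypothetical; real hooks would perform communication.

def allreduce_hook(state, bucket):
    return ("allreduce", bucket)

def fp16_hook(state, bucket):
    return ("fp16", bucket)

def powerSGD_hook(state, bucket):
    return ("powerSGD", bucket)

def pick_hook(iteration, warmup_iters=10):
    # Dispatch: vanilla allreduce during warm-up, compression afterwards.
    return allreduce_hook if iteration < warmup_iters else powerSGD_hook

def main_hook(state, iteration, bucket):
    hook = pick_hook(iteration)
    return hook(state, bucket)

results = [main_hook(None, i, "grad")[0] for i in (0, 9, 10)]
```

With more registered hooks, `pick_hook` could grow into a real table mapping iteration ranges (or other state) to hooks, which is the building-block composition the comment proposes.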

@rohan-varma (Member) commented:
Also, can we add an appropriate unittest for this change?

@wayi1 (Contributor, Author) commented Jan 26, 2021

Awesome, excited to see the results of this! One interesting area to explore would be hook composition, i.e. the user can specify whether to run vanilla allreduce or fp16 or even some other algorithm for the first t iterations. Ideally, it would be great if we can just express this by composing the different hooks as building blocks, maybe something like:

def main_hook(state, iter, args):
    hook_for_iter = dispatch_table[iter, args]  # dispatch table that determines which hook to run based on args
    return hook_for_iter(state, args)

This is admittedly overkill for the current use case, but may be an interesting area to explore if we realize we are composing more hooks.

That will be one direction for future work. Since we don't have many comm hooks available yet, it's hard to see how it will work out, and I would prioritize exploring other compression approaches over this. A few comments about the more generic combination:

  1. I don't plan to treat fp16 as a compression hook in this setup. Instead, I will treat it as a wrapper, so it can be combined with vanilla allreduce, PowerSGD, and any future gradient compression. FP16 is just a type cast; it can be applied in conjunction with any other comm hook rather than being mutually exclusive with them.

Therefore, if we do not count fp16 as a standalone choice, so far we only have two valid options: vanilla allreduce and PowerSGD.

  2. I am not sure how useful it would be to support switching back and forth between different compression schemes, especially for a biased compressor like PowerSGD, which typically requires error compensation. PowerSGD needs to cache some error tensors and warm-up tensors, and switching to another compression scheme would discard these cached states, so there would be some extra accuracy loss from the discontinuity.
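The "fp16 as a wrapper" idea can be sketched as follows. This is a hypothetical, simplified illustration (PyTorch later shipped a similar `fp16_compress_wrapper`, but the code below works on plain Python floats via `struct`, not tensors): the wrapper casts gradients to half precision and then delegates to whichever hook it wraps, which is exactly why fp16 composes with allreduce, PowerSGD, or anything else rather than being mutually exclusive.

```python
import struct

def to_fp16(x):
    # Round-trip a Python float through IEEE-754 half precision
    # ("type cast" to fp16 and back).
    return struct.unpack("e", struct.pack("e", x))[0]

def fp16_wrapper(hook):
    # Wrap ANY comm hook: cast gradients to fp16 before the hook runs.
    def wrapped(state, grads):
        halved = [to_fp16(g) for g in grads]
        return hook(state, halved)
    return wrapped

def allreduce_hook(state, grads):
    return sum(grads)  # stand-in for dist.all_reduce

compressed_allreduce = fp16_wrapper(allreduce_hook)
out = compressed_allreduce(None, [0.1, 0.2])  # close to 0.3, minus fp16 rounding
```

Wrapping `powerSGD_hook` instead of `allreduce_hook` would need no changes to the wrapper itself, which is the point of treating fp16 as orthogonal to the compression choice.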

@wayi1 (Contributor, Author) commented Jan 26, 2021

Also, can we add an appropriate unittest for this change?

By default, start_powerSGD_iter is 10 rather than 0, so the first 10 iterations already run vanilla allreduce; this path is therefore covered by the existing unit test.

wayi1 pushed a commit that referenced this pull request Jan 26, 2021
… by creating a util function that make a vanilla allreduce future

Address #50973 (comment)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D26070147](https://our.internmc.facebook.com/intern/diff/D26070147/)

@wayi1 wayi1 closed this Jan 26, 2021
facebook-github-bot pushed a commit that referenced this pull request Jan 26, 2021
…e first K iterations (#50973)

Summary:
Pull Request resolved: #50973

This can extend the original PowerSGD method to a hybrid approach: vanilla allreduce + PowerSGD. This can help further improve the accuracy, at the cost of a lower speedup.

Also add more comments on the fields in `PowerSGDState`.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120257202

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D26031478

fbshipit-source-id: d72e70bb28ba018f53223c2a4345306980b3084e
wayi1 pushed a commit that referenced this pull request Jan 28, 2021
… for the first K iterations

Similar to #50973, allow the batched version to run vanilla allreduce for the first K iterations.

This may be useful if the batched version can be applied to some use cases where the accuracy requirement is not very strict.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D26077709](https://our.internmc.facebook.com/intern/diff/D26077709/)

wayi1 pushed commits that referenced this pull request Jan 29, 2021
facebook-github-bot pushed a commit that referenced this pull request Jan 29, 2021
… by creating a util function that make a vanilla allreduce future (#51094)

Summary:
Pull Request resolved: #51094

Address #50973 (comment)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120619680

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl

Reviewed By: rohan-varma

Differential Revision: D26070147

fbshipit-source-id: 8c9339f1511e8f24cc906b9411cfe4850a5a6d81
wayi1 pushed a commit that referenced this pull request Jan 30, 2021
…werSGD_hook.py by creating a util function that make a vanilla allreduce future

Resubmission of #51094

Address #50973 (comment)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D26162333](https://our.internmc.facebook.com/intern/diff/D26162333/)

wayi1 pushed commits that referenced this pull request Jan 30, 2021
wayi1 pushed commits that referenced this pull request Jan 31, 2021
facebook-github-bot pushed a commit that referenced this pull request Feb 1, 2021
…werSGD_hook.py by creating a util function that make a vanilla allreduce future (#51400)

Summary:
Pull Request resolved: #51400

Resubmission of #51094

Address #50973 (comment)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120725690

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl

Reviewed By: rohan-varma

Differential Revision: D26162333

fbshipit-source-id: ccc2eae5383a23673e00d61cb5570fb8bf749cd0
facebook-github-bot pushed a commit that referenced this pull request Feb 1, 2021
… for the first K iterations (#51270)

Summary:
Pull Request resolved: #51270

Similar to #50973, allow the batched version to run vanilla allreduce for the first K iterations.

This may be useful if the batched version can be applied to some use cases where the accuracy requirement is not very strict.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120725858

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

baseline: f248001754
batched PowerSGD: f246960752

The training time was reduced from 54m48s to 30m33s, and the accuracy is approximately the same: 44.21 vs. 44.35.

Reviewed By: rohan-varma

Differential Revision: D26077709

fbshipit-source-id: 6afeefad7a3fbdd7da2cbffb56dfbad855a96cb5
@facebook-github-bot facebook-github-bot deleted the gh/SciPioneer/45/head branch February 25, 2021 15:17