[Gradient Compression] Allow PowerSGD to run vanilla allreduce for the first K iterations #50973
Conversation
Allow PowerSGD to run vanilla allreduce for the first K iterations. This extends the original PowerSGD method to a hybrid approach: vanilla allreduce + PowerSGD. This can further improve accuracy, at the cost of a lower speedup. Also add more comments on the fields in `PowerSGDState`. Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202 Differential Revision: [D26031478](https://our.internmc.facebook.com/intern/diff/D26031478/)
Awesome, excited to see the results of this! One interesting area to explore would be hook composition, i.e., the user could specify whether to run vanilla allreduce, fp16, or even some other algorithm for the first t iterations. Ideally, we could express this by composing the different hooks as building blocks, maybe something like:
```python
def main_hook(state, iter, args):
    # dispatch table that determines which hook to run based on args
    hook_for_iter = dispatch_table[iter, args]
    return hook_for_iter(state, args)
```
This is admittedly overkill for the current use case, but may be an interesting area to explore if we realize we are composing more hooks.
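Fleshing out that suggestion, here is a minimal self-contained sketch of dispatching between hooks by iteration count. Note that `HookState`, `start_powerSGD_iter`, and the two stub hooks are illustrative stand-ins for this comment, not the real DDP comm-hook API (real hooks take a `GradBucket` and return a `Future`):

```python
def allreduce_hook(state, bucket):
    # Stub for the vanilla allreduce hook.
    return f"allreduce({bucket})"

def powerSGD_hook(state, bucket):
    # Stub for the PowerSGD compression hook.
    return f"powerSGD({bucket})"

class HookState:
    """Hypothetical shared state: tracks the iteration and the switch point."""
    def __init__(self, start_powerSGD_iter):
        self.iter = 0
        self.start_powerSGD_iter = start_powerSGD_iter

def main_hook(state, bucket):
    # Pick the hook for the current iteration, then advance the counter.
    hook = allreduce_hook if state.iter < state.start_powerSGD_iter else powerSGD_hook
    state.iter += 1
    return hook(state, bucket)
```

With `start_powerSGD_iter=2`, the first two calls run the allreduce stub and later calls run the PowerSGD stub; swapping in fp16 or another hook would only change one entry of the dispatch logic.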
Also, can we add an appropriate unit test for this change?
That will be one direction of future work. Since we don't have many comm hooks available yet, it's hard to see how it will work out, and I would prioritize exploring other compression approaches over this. A few comments on the more generic combination:
Therefore, if we do not consider fp16 a valid choice, then so far we have only two valid options: vanilla allreduce and PowerSGD.
… by creating a util function that makes a vanilla allreduce future. Address #50973 (comment). Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202 Differential Revision: [D26070147](https://our.internmc.facebook.com/intern/diff/D26070147/)
Allow PowerSGD to run vanilla allreduce for the first K iterations (#50973)
Summary: Pull Request resolved: #50973 This can extend the original PowerSGD method to a hybrid approach: vanilla allreduce + PowerSGD. This can help further improve the accuracy, at the cost of a lower speedup. Also add more comments on the fields in `PowerSGDState`. Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120257202
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook
Reviewed By: rohan-varma
Differential Revision: D26031478
fbshipit-source-id: d72e70bb28ba018f53223c2a4345306980b3084e
… for the first K iterations. Similar to #50973, allow the batched version to run vanilla allreduce for the first K iterations. This may be useful if the batched version can be applied to use cases where the accuracy requirement is not very strict. Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202 Differential Revision: [D26077709](https://our.internmc.facebook.com/intern/diff/D26077709/)
… by creating a util function that makes a vanilla allreduce future (#51094)
Summary: Pull Request resolved: #51094 Address #50973 (comment). Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120619680
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
Reviewed By: rohan-varma
Differential Revision: D26070147
fbshipit-source-id: 8c9339f1511e8f24cc906b9411cfe4850a5a6d81
…werSGD_hook.py by creating a util function that makes a vanilla allreduce future. Resubmission of #51094. Address #50973 (comment). Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202 Differential Revision: [D26162333](https://our.internmc.facebook.com/intern/diff/D26162333/)
…werSGD_hook.py by creating a util function that makes a vanilla allreduce future (#51400)
Summary: Pull Request resolved: #51400 Resubmission of #51094. Address #50973 (comment). Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120725690
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
Reviewed By: rohan-varma
Differential Revision: D26162333
fbshipit-source-id: ccc2eae5383a23673e00d61cb5570fb8bf749cd0
… for the first K iterations (#51270)
Summary: Pull Request resolved: #51270 Similar to #50973, allow the batched version to run vanilla allreduce for the first K iterations. This may be useful if the batched version can be applied to use cases where the accuracy requirement is not very strict. Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120725858
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
baseline: f248001754
batched PowerSGD: f246960752
The training time was reduced from 54m48s to 30m33s, and the accuracy is approximately the same: 44.21 vs. 44.35.
Reviewed By: rohan-varma
Differential Revision: D26077709
fbshipit-source-id: 6afeefad7a3fbdd7da2cbffb56dfbad855a96cb5
Stack from ghstack:
This can extend the original PowerSGD method to a hybrid approach: vanilla allreduce + PowerSGD. This can help further improve the accuracy, at the cost of a lower speedup.
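For context, the compression PowerSGD applies once it takes over from vanilla allreduce is a low-rank approximation computed by one step of power iteration. A minimal single-process sketch of that math (NumPy only, no communication; in the distributed algorithm the `P` and `Q` factors would be allreduced across workers, and the function name here is illustrative):

```python
import numpy as np

def powersgd_compress_decompress(M, rank=1, rng=None):
    """Approximate the gradient matrix M with a rank-`rank` product P @ Q.T
    via one step of power iteration: project, orthogonalize, re-project."""
    rng = rng or np.random.default_rng(0)
    n, m = M.shape
    Q = rng.standard_normal((m, rank))  # random initial projection
    P = M @ Q                           # P: (n, rank)
    P, _ = np.linalg.qr(P)              # orthogonalize the columns of P
    Q = M.T @ P                         # Q: (m, rank)
    return P @ Q.T                      # rank-`rank` reconstruction of M

# For an exactly rank-1 matrix, one step recovers it (up to floating point).
M = np.outer(np.arange(1.0, 4.0), np.arange(1.0, 3.0))
approx = powersgd_compress_decompress(M, rank=1)
```

Since only `P` (n x r) and `Q` (m x r) are communicated instead of the full n x m gradient, the bandwidth saving grows with matrix size, which is where the speedup vs. accuracy tradeoff in this PR comes from.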
Also add more comments on the fields in `PowerSGDState`.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
Differential Revision: D26031478