[Gradient Compression] Error feedback for PowerSGD (still need to fix the key in error_dict) #48670
Conversation
… the key in error_dict)

Support an optional error feedback for PowerSGD -- storing the difference (i.e., the local error caused by compression) between the input gradient (adjusted by the existing error) and the gradient after decompression, and re-inserting it at the next iteration.

Still need to add an index field to GradBucket as the key of error_dict, because the current key, the input tensor of the bucket, can change across steps: the buckets may be rebuilt in the forward pass to save peak memory usage.

This is a halfway implementation of error feedback; the plan is to add the new index field in a separate PR.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D25240290](https://our.internmc.facebook.com/intern/diff/D25240290/)
💊 CI failures summary (as of commit 5309f70, via Dr. CI): ✅ None of the CI failures appear to be your fault. 🚧 5 ongoing upstream failures, probably caused by upstream breakages that are not fixed yet.
Looks good overall, have a couple of comments inline.
This pull request has been merged in 9c6979a.
@SciPioneer asked me to have a look at this PR. I added a few suggestions for debugging, but in the end I think the bug might be in the use of the tensor `p`:

- in line 160, it is first created empty;
- in line 167, a new tensor `p` is created within the scope of `compute_q`;
- in line 180, the (still empty) global `p` is used, but it should be the one from line 167.
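The scoping pattern the reviewer describes can be reproduced in a torch-free sketch (all names here are illustrative, not the PR's actual code): an outer `p` is created empty, a callback rebinds a *new* local `p`, and later code reads the still-empty outer one. The "line 160/167/180" comments below mirror the reviewer's references, not real line numbers.

```python
def broken_chain():
    p = []  # "line 160": created empty in the outer scope

    def compute_q():
        p = [1.0, 2.0]  # "line 167": a NEW local p that shadows the outer one
        return p

    compute_q()
    return p  # "line 180": still the empty outer p -- the suspected bug

def fixed_chain():
    p = []  # outer p

    def compute_q():
        p.extend([1.0, 2.0])  # mutate the same object instead of rebinding
        return p

    compute_q()
    return p  # the same object throughout the chain, now filled in
```

If the callbacks mutate one shared object in place (as in `fixed_chain`), the outer reference stays valid, which matches the author's reply below that it is the same `p` object throughout the future chain.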
… the key in error_dict) (pytorch#48670)

Summary: Pull Request resolved: pytorch#48670. (Same description as the PR summary below.)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression pytorch#47202

ghstack-source-id: 117636492

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

Reviewed By: rohan-varma

Differential Revision: D25240290

fbshipit-source-id: 5b6e11e711caccfb8984ac2767dd107dbf4c9b3b
See my reply in the same thread: it's the same `p` object throughout the future chain.
@tvogels Is there any quick way to roughly estimate such a "lower limit"? I guess we can only tune by experimentation, and such a limit varies from case to case.
Good question. I am not aware of any theory-based rule here. I guess you can find it quickly by starting with little compression and repeatedly halving the communication budget until you lose too much accuracy. One note: this fundamental limit seems to change during training. I recently saw this paper, which demonstrates that you have to be careful at the beginning of training and when the learning rate decays, but you can get away with stronger compression during the rest of training. To verify the implementation, maybe you could start with the settings from the PowerSGD paper (ResNet-18, 16 workers, rank-2 PowerSGD, corresponding to 136× compression) and compare results.
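For context on where figures like 136× come from, here is a back-of-the-envelope sketch (an assumption on my part, not from this PR): rank-r PowerSGD approximates an n×m gradient matrix as P Qᵀ with P of shape (n, r) and Q of shape (m, r), so the communicated floats drop from n·m to r·(n + m).

```python
def powersgd_compression_ratio(n: int, m: int, rank: int) -> float:
    # full gradient sends n*m floats; the rank-r factorization sends
    # r*(n + m) floats (P is n x r, Q is m x r)
    return (n * m) / (rank * (n + m))

# Example with a hypothetical 512 x 4608 layer (a 3x3 conv with 512
# input and output channels, flattened) at rank 2:
print(f"{powersgd_compression_ratio(512, 4608, 2):.0f}x")  # prints "230x"
```

The ratio varies per layer; the 136× figure quoted above is presumably aggregated over all of ResNet-18's parameters.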
Stack from ghstack:
Support an optional error feedback for PowerSGD -- storing the difference (i.e., the local error caused by compression) between the input gradient (adjusted by the existing error) and the gradient after decompression, and reinserting it at the next iteration.
Still need to add an index field to GradBucket as the key of error_dict, because the current key, the input tensor of the bucket, can change across steps: the buckets may be rebuilt in the forward pass to save peak memory usage.
This is a halfway implementation of error feedback; the plan is to add the new index field in a separate PR.
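The mechanism above can be sketched in a torch-free toy (hedged: this is NOT the PR's actual hook, and `lossy_compress`, `error_feedback_hook`, and the bucket-index key are illustrative stand-ins for the planned GradBucket field):

```python
def lossy_compress(grad):
    # stand-in lossy compressor: reconstruct each entry as sign(g) * 0.5
    return [0.5 if g >= 0 else -0.5 for g in grad]

error_dict = {}  # bucket_index -> residual left over from the previous step

def error_feedback_hook(bucket_index, grad):
    # re-insert the error stored at the previous iteration, if any
    prev_error = error_dict.get(bucket_index, [0.0] * len(grad))
    adjusted = [g + e for g, e in zip(grad, prev_error)]

    approx = lossy_compress(adjusted)  # compressed-then-decompressed gradient

    # local error = adjusted input minus what decompression recovers
    error_dict[bucket_index] = [a - d for a, d in zip(adjusted, approx)]
    return approx

out1 = error_feedback_hook(0, [0.9, -0.2])
out2 = error_feedback_hook(0, [0.9, -0.2])  # stored error flips the 2nd entry's sign
```

Keying `error_dict` by a stable integer index survives bucket rebuilds, whereas keying by the input tensor (the current scheme this PR flags as broken) does not.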
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
Differential Revision: D25240290