
[Gradient Compression] Error feedback for PowerSGD (still need to fix the key in error_dict) #48670

Closed

Conversation

wayi1 (Contributor) commented Dec 1, 2020

Stack from ghstack:

Supports optional error feedback for PowerSGD: store the difference (i.e., the local error caused by compression) between the input gradient (adjusted by the existing error) and the gradient after decompression, and reinsert it at the next iteration.

An index field still needs to be added to GradBucket to serve as the key of error_dict, because the current key, the input tensor of the bucket, can change across steps: the buckets may be rebuilt in the forward pass to save peak memory usage.

This is only the first half of error feedback; the new index field will be added in a separate PR.
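A minimal sketch of the idea (hypothetical names; the toy low-rank round trip stands in for the real compress/allreduce/decompress path, and the dict is keyed by bucket index as planned above):

```python
import torch

# Hypothetical sketch of error feedback, keyed by bucket index.
error_dict = {}

def toy_compress_decompress(grad, rank=2):
    # Toy low-rank round trip standing in for PowerSGD compression + decompression.
    u, s, v = torch.svd_lowrank(grad, q=rank)
    return u @ torch.diag(s) @ v.t()

def hook_step(bucket_index, grad):
    # 1. Re-insert the local error remembered from the previous iteration.
    if bucket_index in error_dict:
        grad = grad + error_dict[bucket_index]
    # 2. Compress and decompress the adjusted gradient.
    decompressed = toy_compress_decompress(grad)
    # 3. Store the new local compression error for the next iteration.
    error_dict[bucket_index] = grad - decompressed
    return decompressed

for step in range(3):
    out = hook_step(0, torch.randn(64, 32))
```

In the real hook the compressed factors are allreduced across workers; this sketch only illustrates the error_dict bookkeeping.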

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: D25240290

facebook-github-bot added the cla signed and oncall: distributed labels Dec 1, 2020
wayi1 pushed a commit that referenced this pull request Dec 1, 2020
[Gradient Compression] Error feedback for PowerSGD (still need to fix the key in error_dict)

ghstack-source-id: 117597707
Pull Request resolved: #48670
dr-ci bot commented Dec 2, 2020

💊 CI failures summary and remediations

As of commit 5309f70 (more details on the Dr. CI page):

None of the CI failures appear to be your fault 💚

🚧 5 ongoing upstream failures:

These were probably caused by upstream breakages that are not fixed yet.

rohan-varma (Member) left a comment


Looks good overall, have a couple of comments inline.

wayi1 pushed a commit that referenced this pull request Dec 2, 2020
[Gradient Compression] Error feedback for PowerSGD (still need to fix the key in error_dict)

Pull Request resolved: #48670

ghstack-source-id: 117636492

Differential Revision: [D25240290](https://our.internmc.facebook.com/intern/diff/D25240290/)
facebook-github-bot (Contributor) commented
This pull request has been merged in 9c6979a.

tvogels left a comment


@SciPioneer asked me to have a look at this PR. I added a few suggestions for debugging, but in the end I think the bug might be in the use of the tensor p:

  • in line 160, it is first created empty.
  • in line 167, a new tensor p is created within the scope of compute_q.
  • in line 180, the (still empty) global p is used, but it should be the one from line 167.

shaibagon pushed a commit to shaibagon/pytorch that referenced this pull request Dec 3, 2020
[Gradient Compression] Error feedback for PowerSGD (still need to fix the key in error_dict) (pytorch#48670)

Summary:
Pull Request resolved: pytorch#48670

ghstack-source-id: 117636492

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

Reviewed By: rohan-varma

Differential Revision: D25240290

fbshipit-source-id: 5b6e11e711caccfb8984ac2767dd107dbf4c9b3b
wayi1 (Contributor, Author) commented Dec 3, 2020

> @SciPioneer asked me to have a look at this PR. I added a few suggestions for debugging, but in the end I think the bug might be in the use of the tensor p:
>
>   • in line 160, it is first created empty.
>   • in line 167, a new tensor p is created within the scope of compute_q.
>   • in line 180, the (still empty) global p is used, but it should be the one from line 167.

See my reply in the same thread:
#48670 (comment)

It's the same p object throughout the future chain.
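For illustration, a minimal standalone sketch (hypothetical names; not the hook's actual code) of how a tensor allocated empty before a chain of steps can be filled in place later, so every step sees the same `p` object rather than a shadowed local copy:

```python
import torch

# Hypothetical sketch: `p` is allocated empty up front and filled in place
# (via out=) inside a later step, so no new local `p` shadows the outer one.
p = torch.empty(4, 2)

def compute_q(grad, proj):
    # Writing with out=p mutates the outer tensor; `p` is not re-bound.
    torch.matmul(grad, proj, out=p)
    return p

def decompress(q):
    # The `p` used here is the same object that compute_q filled above.
    return p @ q.t()

q = compute_q(torch.randn(4, 3), torch.randn(3, 2))
print(decompress(q).shape)  # torch.Size([4, 4])
```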

wayi1 (Contributor, Author) commented Dec 4, 2020

> At some point, there is a fundamental lower limit to how much you have to communicate.

@tvogels Is there any quick way to roughly estimate such a "lower limit"? I guess we can only tune by experimentation, and the limit varies from case to case.

tvogels commented Dec 4, 2020

> @tvogels Is there any quick way to roughly estimate such a "lower limit"? I guess we can only tune by experimentation, and the limit varies from case to case.

Good question. I am not aware of any theory-based rule here. I guess you can quickly find it by starting with little compression and repeatedly halving the communication budget until you lose too much accuracy.

One note: This fundamental limit seems to change during training. I recently saw this paper that demonstrates that you have to be careful during the beginning of training and when the learning rate decays, but you can get away with stronger compression during the rest of training.

To verify the implementation, maybe you could start with the settings from the PowerSGD paper (ResNet-18, 16 workers, rank-2 PowerSGD, corresponding to 136× compression) and compare results.
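For a rough sense of scale, a back-of-the-envelope sketch (not code from the paper, and ignoring vectors/biases that are typically left uncompressed): a rank-r PowerSGD approximation of an n × m gradient matrix communicates two factors of shape n × r and m × r, i.e., about r·(n + m) numbers instead of n·m.

```python
def approx_compression_ratio(n, m, rank):
    # Dense gradient: n*m numbers; rank-`rank` PowerSGD factors: rank*(n + m).
    return (n * m) / (rank * (n + m))

# Halving the communication budget roughly corresponds to halving the rank:
for rank in (16, 8, 4, 2, 1):
    print(f"rank {rank}: ~{approx_compression_ratio(512, 512, rank):.0f}x per-matrix compression")
```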

facebook-github-bot deleted the gh/SciPioneer/28/head branch December 5, 2020 15:30