[Gradient Compression] Error feedback for PowerSGD (still need to fix the key in error_dict) #48670
Conversation
… the key in error_dict)

Support an optional error feedback for PowerSGD -- storing the difference (i.e., the local error caused by compression) between the input gradient (adjusted by the existing error) and the gradient after decompression, and re-inserting it at the next iteration.

Still need to add an index field to GradBucket as the key of error_dict, because the current key, the input tensor of the bucket, can change across steps: the buckets may be rebuilt in the forward pass to save peak memory usage.

This is a halfway implementation of error feedback; the plan is to add the new index field in a separate PR.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D25240290](https://our.internmc.facebook.com/intern/diff/D25240290/)
💊 CI failures summary (as of commit 5309f70, via Dr. CI): ✅ None of the CI failures appear to be your fault. 🚧 5 ongoing upstream failures, probably caused by upstream breakages that are not fixed yet.
Looks good overall, have a couple of comments inline.
This pull request has been merged in 9c6979a.
@SciPioneer asked me to have a look at this PR. I added a few suggestions for debugging, but in the end I think the bug might be in the use of the tensor `p`:

- in line 160, it is first created empty;
- in line 167, a new tensor `p` is created within the scope of `compute_q`;
- in line 180, the (still empty) global `p` is used, but it should be the one from line 167.
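The scoping pattern the reviewer describes can be reproduced in a torch-free sketch (all names here are illustrative, not the PR's actual code): an outer `p` is created empty, a callback rebinds a *new* local `p`, and later code reads the still-empty outer one. The "line 160/167/180" comments below mirror the reviewer's references, not real line numbers.

```python
def broken_chain():
    p = []  # "line 160": created empty in the outer scope

    def compute_q():
        p = [1.0, 2.0]  # "line 167": a NEW local p that shadows the outer one
        return p

    compute_q()
    return p  # "line 180": still the empty outer p -- the suspected bug

def fixed_chain():
    p = []  # outer p

    def compute_q():
        p.extend([1.0, 2.0])  # mutate the same object instead of rebinding
        return p

    compute_q()
    return p  # the same object throughout the chain, now filled in
```

If the callbacks mutate one shared object in place (as in `fixed_chain`), the outer reference stays valid, which matches the author's reply below that it is the same `p` object throughout the future chain.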
… the key in error_dict) (pytorch#48670)

Summary: Pull Request resolved: pytorch#48670. (Same description as the PR summary below.)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression pytorch#47202

ghstack-source-id: 117636492

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

Reviewed By: rohan-varma

Differential Revision: D25240290

fbshipit-source-id: 5b6e11e711caccfb8984ac2767dd107dbf4c9b3b
See my reply in the same thread: it's the same `p` object throughout the future chain.
@tvogels Is there any quick way to roughly estimate such a "lower limit"? I guess we can only tune by experimentation, and such a limit varies from case to case.
Good question. I am not aware of any theory-based rule here. I guess you can find it quickly by starting with little compression and repeatedly halving the communication budget until you lose too much accuracy. One note: this fundamental limit seems to change during training. I recently saw this paper, which demonstrates that you have to be careful at the beginning of training and when the learning rate decays, but you can get away with stronger compression during the rest of training. To verify the implementation, maybe you could start with the settings from the PowerSGD paper (ResNet-18, 16 workers, rank-2 PowerSGD, corresponding to 136× compression) and compare results.
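For context on where figures like 136× come from, here is a back-of-the-envelope sketch (an assumption on my part, not from this PR): rank-r PowerSGD approximates an n×m gradient matrix as P Qᵀ with P of shape (n, r) and Q of shape (m, r), so the communicated floats drop from n·m to r·(n + m).

```python
def powersgd_compression_ratio(n: int, m: int, rank: int) -> float:
    # full gradient sends n*m floats; the rank-r factorization sends
    # r*(n + m) floats (P is n x r, Q is m x r)
    return (n * m) / (rank * (n + m))

# Example with a hypothetical 512 x 4608 layer (a 3x3 conv with 512
# input and output channels, flattened) at rank 2:
print(f"{powersgd_compression_ratio(512, 4608, 2):.0f}x")  # prints "230x"
```

The ratio varies per layer; the 136× figure quoted above is presumably aggregated over all of ResNet-18's parameters.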
Stack from ghstack:
Support an optional error feedback for PowerSGD -- storing the difference (i.e., the local error caused by compression) between the input gradient (adjusted by the existing error) and the gradient after decompression, and reinserting it at the next iteration.
Still need to add an index field to GradBucket as the key of error_dict, because the current key, the input tensor of the bucket, can change across steps: the buckets may be rebuilt in the forward pass to save peak memory usage.
This is a halfway implementation of error feedback; the plan is to add the new index field in a separate PR.
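The mechanism above can be sketched in a torch-free toy (hedged: this is NOT the PR's actual hook, and `lossy_compress`, `error_feedback_hook`, and the bucket-index key are illustrative stand-ins for the planned GradBucket field):

```python
def lossy_compress(grad):
    # stand-in lossy compressor: reconstruct each entry as sign(g) * 0.5
    return [0.5 if g >= 0 else -0.5 for g in grad]

error_dict = {}  # bucket_index -> residual left over from the previous step

def error_feedback_hook(bucket_index, grad):
    # re-insert the error stored at the previous iteration, if any
    prev_error = error_dict.get(bucket_index, [0.0] * len(grad))
    adjusted = [g + e for g, e in zip(grad, prev_error)]

    approx = lossy_compress(adjusted)  # compressed-then-decompressed gradient

    # local error = adjusted input minus what decompression recovers
    error_dict[bucket_index] = [a - d for a, d in zip(adjusted, approx)]
    return approx

out1 = error_feedback_hook(0, [0.9, -0.2])
out2 = error_feedback_hook(0, [0.9, -0.2])  # stored error flips the 2nd entry's sign
```

Keying `error_dict` by a stable integer index survives bucket rebuilds, whereas keying by the input tensor (the current scheme this PR flags as broken) does not.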
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
Differential Revision: D25240290