
[Gradient Compression] Refactor tensor grouping in PowerSGD #52981

Closed
wants to merge 5 commits

Conversation

wayi1
Contributor

@wayi1 wayi1 commented Feb 28, 2021

Stack from ghstack:

No need to create a hard boundary between rank-1 tensors and high-rank tensors, since some high-rank tensors will not be compressed if the compression cannot save enough bandwidth, according to the `_should_compress` function.

Therefore, refactor and simplify the tensor grouping logic, which addresses the comment in #52541 (comment).

Differential Revision: [D26713689](https://our.internmc.facebook.com/intern/diff/D26713689/)
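The idea behind the refactor can be sketched as follows. `should_compress` and `group_gradients` below are hypothetical simplifications for illustration, not the actual PyTorch implementation: a single bandwidth check replaces the rank-based split, because a rank-1 tensor matricized to shape `(n, 1)` naturally fails the check.

```python
from math import prod

def should_compress(num_rows, num_cols, rank, min_compression_rate=2.0):
    """Hypothetical bandwidth check: compress an (num_rows x num_cols) matrix
    into rank-r factors only if that shrinks the payload enough."""
    uncompressed_size = num_rows * num_cols
    compressed_size = rank * (num_rows + num_cols)  # P: n x r, Q: m x r
    return uncompressed_size >= min_compression_rate * compressed_size

def group_gradients(shapes, rank):
    """Single-pass grouping with no special rank-1 bucket: a tensor of shape
    (d0, d1, ...) is matricized to (d0, d1*d2*...), and rank-1 tensors become
    (d0, 1), which fails the bandwidth check on its own."""
    to_compress, uncompressed = [], []
    for shape in shapes:
        n = shape[0]
        m = prod(shape[1:]) if len(shape) > 1 else 1
        (to_compress if should_compress(n, m, rank) else uncompressed).append(shape)
    return to_compress, uncompressed
```

For example, with `rank=1` a `(1024, 1024)` weight and a `(256, 3, 3, 3)` convolution kernel land in the compressed group, while a `(512,)` bias stays uncompressed, with no explicit rank-1 boundary needed.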

@facebook-github-bot
Contributor

facebook-github-bot commented Feb 28, 2021

💊 CI failures summary and remediations

As of commit f997deb (more details on the Dr. CI page):



❄️ 1 failure tentatively classified as flaky

but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

Mar 03 22:20:25 urllib.error.HTTPError: HTTP Error 403: Forbidden
Mar 03 22:20:25   File "/opt/conda/lib/python3.6/urllib/request.py", line 532, in open
Mar 03 22:20:25     response = meth(req, response)
Mar 03 22:20:25   File "/opt/conda/lib/python3.6/urllib/request.py", line 642, in http_response
Mar 03 22:20:25     'http', request, response, code, msg, hdrs)
Mar 03 22:20:25   File "/opt/conda/lib/python3.6/urllib/request.py", line 570, in error
Mar 03 22:20:25     return self._call_chain(*args)
Mar 03 22:20:25   File "/opt/conda/lib/python3.6/urllib/request.py", line 504, in _call_chain
Mar 03 22:20:25     result = func(*args)
Mar 03 22:20:25   File "/opt/conda/lib/python3.6/urllib/request.py", line 650, in http_error_default
Mar 03 22:20:25     raise HTTPError(req.full_url, code, msg, hdrs, fp)
Mar 03 22:20:25 urllib.error.HTTPError: HTTP Error 403: Forbidden
Mar 03 22:20:25 
Mar 03 22:20:25 ----------------------------------------------------------------------
Mar 03 22:20:25 Ran 1 test in 0.725s
Mar 03 22:20:25 
Mar 03 22:20:25 FAILED (errors=1)
Mar 03 22:20:25 Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to /tmp/mnist-data/MNIST/raw/train-images-idx3-ubyte.gz
Mar 03 22:20:25 
0it [00:00, ?it/s]
Mar 03 22:20:25 =================== sccache compilation log ===================
Mar 03 22:20:25 + cleanup
Mar 03 22:20:25 + retcode=1

This comment was automatically generated by Dr. CI (expand for details). Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@wayi1
Contributor Author

wayi1 commented Feb 28, 2021

@tvogels Can you take a look? Thanks!

wayi1 pushed a commit that referenced this pull request Feb 28, 2021
Pull Request resolved: #52981

ghstack-source-id: 122703391

Differential Revision: [D26713689](https://our.internmc.facebook.com/intern/diff/D26713689/)
@tvogels

tvogels commented Feb 28, 2021

Looks good to me.

wayi1 pushed a commit that referenced this pull request Feb 28, 2021
Pull Request resolved: #52981

ghstack-source-id: 122712478

Differential Revision: [D26713689](https://our.internmc.facebook.com/intern/diff/D26713689/)
Member

@rohan-varma rohan-varma left a comment


LGTM! Thanks for refactoring this.

@facebook-github-bot
Contributor

This pull request has been merged in b59075e.

@facebook-github-bot facebook-github-bot deleted the gh/SciPioneer/66/head branch March 7, 2021 15:16
aocsa pushed a commit to Quansight/pytorch that referenced this pull request Mar 15, 2021
[Gradient Compression] Refactor tensor grouping in PowerSGD (#52981)

Summary:
Pull Request resolved: pytorch#52981

No need to create a hard boundary between rank-1 tensors and high-rank tensors, since some high-rank tensors will not be compressed if the compression cannot save enough bandwidth, according to the `_should_compress` function.

Therefore, refactor and simplify the tensor grouping logic, which addresses the comment in pytorch#52541 (comment)
ghstack-source-id: 122997032

Test Plan:
waitforbuildbot

Already LGTMed by PowerSGD paper author.

Ads1x (completed):
https://www.internalfb.com/intern/tupperware/details/job/?handle=priv3_global%2Fmast_hpc%2Ftsm_hpc-wayi_ads_10x_POWER_SGD_gpu8_2021-02-28_15-29.trainer&tatwTabs=tasks&task_id=0&task_tab=TASK_LOGS

Detectron2:
1) Before refactoring:
f254353864
Accuracy: 39.972
Overall training speed: 67498 iterations in 6:15:42 (0.3340 s / it)

2) After refactoring:
f254353380
Accuracy: 39.944
Overall training speed: 67498 iterations in 6:09:41 (0.3286 s / it)

Reviewed By: rohan-varma

Differential Revision: D26713689

fbshipit-source-id: 12cfcb65feaa2a2d94e3c7793073031f13828305
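As a quick sanity check on the Detectron2 numbers above (pure arithmetic, no PyTorch needed), the reported s/it figures match total wall time divided by iteration count, and the refactor corresponds to roughly a 1.6% per-iteration speedup:

```python
def seconds_per_iter(hours, minutes, seconds, iterations):
    # Convert "H:MM:SS for N iterations" into seconds per iteration.
    return (hours * 3600 + minutes * 60 + seconds) / iterations

before = seconds_per_iter(6, 15, 42, 67498)  # before refactoring
after = seconds_per_iter(6, 9, 41, 67498)    # after refactoring
print(f"{before:.4f} s/it -> {after:.4f} s/it")  # 0.3340 s/it -> 0.3286 s/it
```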
xsacha pushed a commit to xsacha/pytorch that referenced this pull request Mar 31, 2021
Labels: cla signed, Merged, oncall: distributed
4 participants