
[Gradient Compression] Add error feedback to layerwise PowerSGD #49418

Closed
wants to merge 2 commits

Conversation


@wayi1 wayi1 commented Dec 15, 2020

Stack from ghstack:

Add error feedback to the original implementation of PowerSGD.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D25555538](https://our.internmc.facebook.com/intern/diff/D25555538/)
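For context, error feedback in gradient compression means the residual lost by the lossy compressor at one step is stored locally and added back into the gradient at the next step, so the compression error does not accumulate. The sketch below is a hypothetical, simplified illustration of that idea with a rank-1 PowerSGD-style compressor; the function names (`compress_rank1`, `step_with_error_feedback`) are invented here and are not the actual PyTorch communication-hook API.

```python
import numpy as np

def compress_rank1(M):
    # One step of power iteration: approximate M by the rank-1 matrix p q^T,
    # where q is a random unit vector and p = M q. (A sketch, not the real
    # multi-iteration PowerSGD compressor.)
    q = np.random.default_rng(0).standard_normal(M.shape[1])
    q /= np.linalg.norm(q)
    p = M @ q
    return p, q

def step_with_error_feedback(grad, error):
    # 1. Add the residual left over from the previous iteration.
    corrected = grad + error
    # 2. Compress (low-rank approximation) and decompress.
    p, q = compress_rank1(corrected)
    approx = np.outer(p, q)
    # 3. Remember what the compression lost; it is fed back next step.
    new_error = corrected - approx
    return approx, new_error
```

By construction `approx + new_error` equals the error-corrected gradient, so nothing is permanently discarded; in the layerwise variant this bookkeeping is kept per tensor rather than over one flattened bucket.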
@facebook-github-bot facebook-github-bot added cla signed oncall: distributed Add this issue/PR to distributed oncall triage queue labels Dec 15, 2020
wayi1 pushed a commit that referenced this pull request Dec 15, 2020
Add the error feedback to the original implementation of PowerSGD.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D25555538](https://our.internmc.facebook.com/intern/diff/D25555538/)

ghstack-source-id: 118634310
Pull Request resolved: #49418
@facebook-github-bot
Contributor

facebook-github-bot commented Dec 15, 2020

💊 CI failures summary and remediations

As of commit 960c458 (more details on the Dr. CI page):


  • 2/2 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test2 (1/2)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Dec 16 01:38:33 ERROR [2.222s]: test_DistributedDataParallel_powerSGD_ddp_comm_hook (__main__.TestDistBackendWithFork)
Dec 16 01:38:30   test_scatter_group (__main__.TestDistBackendWithFork) ... skip (0.119s)
Dec 16 01:38:30   test_scatter_object_list (__main__.TestDistBackendWithFork) ... ok (0.120s)
Dec 16 01:38:30   test_send_recv (__main__.TestDistBackendWithFork) ... ok (0.220s)
Dec 16 01:38:30   test_send_recv_any_source (__main__.TestDistBackendWithFork) ... ok (0.220s)
Dec 16 01:38:30   test_send_recv_nccl (__main__.TestDistBackendWithFork) ... skip (0.002s)
Dec 16 01:38:30   test_send_recv_with_tag (__main__.TestDistBackendWithFork) ... ok (0.219s)
Dec 16 01:38:31   test_sparse_all_reduce_sum (__main__.TestDistBackendWithFork) ... ok (0.118s)
Dec 16 01:38:33   test_sparse_all_reduce_sum_cuda (__main__.TestDistBackendWithFork) ... ok (2.124s)
Dec 16 01:38:33 
Dec 16 01:38:33 ======================================================================
Dec 16 01:38:33 ERROR [2.222s]: test_DistributedDataParallel_powerSGD_ddp_comm_hook (__main__.TestDistBackendWithFork)
Dec 16 01:38:33 ----------------------------------------------------------------------
Dec 16 01:38:33 Traceback (most recent call last):
Dec 16 01:38:33   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 280, in wrapper
Dec 16 01:38:33     self._join_processes(fn)
Dec 16 01:38:33   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 397, in _join_processes
Dec 16 01:38:33     self._check_return_codes(elapsed_time)
Dec 16 01:38:33   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 433, in _check_return_codes
Dec 16 01:38:33     raise RuntimeError(error)
Dec 16 01:38:33 RuntimeError: Processes 0 1 exited with error code 10
Dec 16 01:38:33 

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test1 (2/2)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Dec 16 02:44:11 ERROR [3.125s]: test_DistributedDataParallel_powerSGD_ddp_comm_hook (__main__.TestDistBackendWithSpawn)
Dec 16 02:44:03   test_scatter_group (__main__.TestDistBackendWithSpawn) ... skip (1.023s)
Dec 16 02:44:04   test_scatter_object_list (__main__.TestDistBackendWithSpawn) ... ok (1.024s)
Dec 16 02:44:05   test_send_recv (__main__.TestDistBackendWithSpawn) ... ok (1.125s)
Dec 16 02:44:06   test_send_recv_any_source (__main__.TestDistBackendWithSpawn) ... ok (1.125s)
Dec 16 02:44:06   test_send_recv_nccl (__main__.TestDistBackendWithSpawn) ... skip (0.002s)
Dec 16 02:44:07   test_send_recv_with_tag (__main__.TestDistBackendWithSpawn) ... ok (1.124s)
Dec 16 02:44:08   test_sparse_all_reduce_sum (__main__.TestDistBackendWithSpawn) ... ok (1.024s)
Dec 16 02:44:11   test_sparse_all_reduce_sum_cuda (__main__.TestDistBackendWithSpawn) ... ok (2.927s)
Dec 16 02:44:11 
Dec 16 02:44:11 ======================================================================
Dec 16 02:44:11 ERROR [3.125s]: test_DistributedDataParallel_powerSGD_ddp_comm_hook (__main__.TestDistBackendWithSpawn)
Dec 16 02:44:11 ----------------------------------------------------------------------
Dec 16 02:44:11 Traceback (most recent call last):
Dec 16 02:44:11   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 280, in wrapper
Dec 16 02:44:11     self._join_processes(fn)
Dec 16 02:44:11   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 397, in _join_processes
Dec 16 02:44:11     self._check_return_codes(elapsed_time)
Dec 16 02:44:11   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 433, in _check_return_codes
Dec 16 02:44:11     raise RuntimeError(error)
Dec 16 02:44:11 RuntimeError: Processes 0 exited with error code 10
Dec 16 02:44:11 

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


wayi1 pushed a commit that referenced this pull request Dec 16, 2020
Pull Request resolved: #49418

Add the error feedback to the original implementation of PowerSGD.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 118670930

Differential Revision: [D25555538](https://our.internmc.facebook.com/intern/diff/D25555538/)
@wayi1 wayi1 changed the title [Gradient Compression] Add error feedback [Gradient Compression] Add error feedback to layerwise PowerSGD Dec 16, 2020
Member

@rohan-varma rohan-varma left a comment


LGTM, minor comments in line

@facebook-github-bot
Contributor

This pull request has been merged in 342bfd8.

hwangdeyu pushed a commit to hwangdeyu/pytorch that referenced this pull request Jan 6, 2021
…rch#49418)

Summary:
Pull Request resolved: pytorch#49418

Add error feedback to the original implementation of PowerSGD.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression pytorch#47202
ghstack-source-id: 118670930

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D25555538

fbshipit-source-id: c01145cc9acf574a4c6aa337dbbba0ba7d9350b2
Labels
cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue

3 participants