Update on "[Gradient Compression] Implement the original layerwise PowerSGD"


The existing implementation applies PowerSGD to a single batch of flattened tensors, which is a coarse-grained compression. That hook is now renamed to "batched_powerSGD_hook".

This change implements the original algorithm from the paper, which applies PowerSGD to each per-parameter tensor, i.e., layerwise fine-grained compression. Although this original implementation is slower, it is expected to achieve higher accuracy, especially when the shapes of the per-parameter tensors cannot be aligned.
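
For context, here is a minimal sketch of how the two hooks might be registered on a DistributedDataParallel model. The PowerSGDState constructor arguments and the tiny model are assumptions for illustration and may differ across PyTorch versions:

```python
# Usage sketch: registering the PowerSGD communication hooks on a DDP model.
# NOTE: the PowerSGDState arguments shown here are assumptions and may differ
# between PyTorch versions; this is not the authoritative API.
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

def setup_model(rank: int) -> DDP:
    # Assumes the default process group has already been initialized,
    # e.g., via dist.init_process_group("nccl", ...).
    model = DDP(nn.Linear(1024, 1024).to(rank), device_ids=[rank])

    # State shared by the hook across iterations (error feedback, etc.).
    state = powerSGD.PowerSGDState(
        process_group=None,           # None selects the default process group
        matrix_approximation_rank=2,  # rank of the low-rank approximation
    )

    # Layerwise (per-parameter) compression: slower, but expected higher accuracy.
    model.register_comm_hook(state, powerSGD.powerSGD_hook)

    # Alternatively, compress one flattened tensor per bucket (coarse-grained):
    # model.register_comm_hook(state, powerSGD.batched_powerSGD_hook)
    return model
```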

Also add a test in distributed_test.py.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D25511543](https://our.internmc.facebook.com/intern/diff/D25511543/)

[ghstack-poisoned]
wayi committed Dec 18, 2020
1 parent 452ced1 commit 9e60718
1 changed file, 1 addition and 1 deletion: torch/distributed/algorithms/ddp_comm_hooks/__init__.py
@@ -66,7 +66,7 @@ class DDPCommHookType(Enum):
         comm_hook=powerSGD.powerSGD_hook,
         matrix_approximation_rank=2,
     )
-    # Batching can lead to a faster training in the cost of accuracy.
+    # Batching can lead to a faster training at the cost of accuracy.
     BATCHED_POWER_SGD = partial(
         _powerSGD_comm_hook_wrapper,
         comm_hook=powerSGD.batched_powerSGD_hook,
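
For reference, a hedged sketch of how the enum entries touched by this diff might be used through this module's helper; the register_ddp_comm_hook call and the meaning of its state argument are read from this module and stated as assumptions rather than guaranteed API:

```python
# Hypothetical usage of the DDPCommHookType entries shown in the diff above.
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import (
    DDPCommHookType,
    register_ddp_comm_hook,
)

def attach_powersgd(ddp_model: DDP, layerwise: bool = True) -> None:
    # For the PowerSGD wrappers, `state` appears to be forwarded as the
    # process group; None selects the default group (assumption).
    hook_type = (
        DDPCommHookType.POWER_SGD           # layerwise, per-parameter compression
        if layerwise
        else DDPCommHookType.BATCHED_POWER_SGD  # one flattened tensor per bucket
    )
    register_ddp_comm_hook(hook_type, ddp_model, state=None)
```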
