Update on "[Gradient Compression] Implement the original layerwise PowerSGD"


The existing implementation applies PowerSGD to a single batch of flattened tensors, which is a coarse-grained compression. That hook is now renamed to "batched_powerSGD_hook".

This change implements the original algorithm from the paper, which applies PowerSGD to each per-parameter tensor, i.e., layerwise fine-grained compression. Although this original implementation is slower, it is expected to achieve higher accuracy, especially when the shapes of the per-parameter tensors cannot be aligned.
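
For context, here is a minimal sketch of how the two hooks might be registered on a DistributedDataParallel model. The PowerSGDState constructor arguments and the tiny model are assumptions for illustration and may differ across PyTorch versions:

```python
# Usage sketch: registering the PowerSGD communication hooks on a DDP model.
# NOTE: the PowerSGDState arguments shown here are assumptions and may differ
# between PyTorch versions; this is not the authoritative API.
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

def setup_model(rank: int) -> DDP:
    # Assumes the default process group has already been initialized,
    # e.g., via dist.init_process_group("nccl", ...).
    model = DDP(nn.Linear(1024, 1024).to(rank), device_ids=[rank])

    # State shared by the hook across iterations (error feedback, etc.).
    state = powerSGD.PowerSGDState(
        process_group=None,           # None selects the default process group
        matrix_approximation_rank=2,  # rank of the low-rank approximation
    )

    # Layerwise (per-parameter) compression: slower, but expected higher accuracy.
    model.register_comm_hook(state, powerSGD.powerSGD_hook)

    # Alternatively, compress one flattened tensor per bucket (coarse-grained):
    # model.register_comm_hook(state, powerSGD.batched_powerSGD_hook)
    return model
```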

Also add a test in distributed_test.py.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D25511543](https://our.internmc.facebook.com/intern/diff/D25511543/)

[ghstack-poisoned]
wayi committed Dec 18, 2020
1 parent 452ced1 commit 9e60718
1 changed file, 1 addition and 1 deletion: torch/distributed/algorithms/ddp_comm_hooks/__init__.py
@@ -66,7 +66,7 @@ class DDPCommHookType(Enum):
         comm_hook=powerSGD.powerSGD_hook,
         matrix_approximation_rank=2,
     )
-    # Batching can lead to a faster training in the cost of accuracy.
+    # Batching can lead to a faster training at the cost of accuracy.
     BATCHED_POWER_SGD = partial(
         _powerSGD_comm_hook_wrapper,
         comm_hook=powerSGD.batched_powerSGD_hook,
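
For reference, a hedged sketch of how the enum entries touched by this diff might be used through this module's helper; the register_ddp_comm_hook call and the meaning of its state argument are read from this module and stated as assumptions rather than guaranteed API:

```python
# Hypothetical usage of the DDPCommHookType entries shown in the diff above.
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import (
    DDPCommHookType,
    register_ddp_comm_hook,
)

def attach_powersgd(ddp_model: DDP, layerwise: bool = True) -> None:
    # For the PowerSGD wrappers, `state` appears to be forwarded as the
    # process group; None selects the default group (assumption).
    hook_type = (
        DDPCommHookType.POWER_SGD           # layerwise, per-parameter compression
        if layerwise
        else DDPCommHookType.BATCHED_POWER_SGD  # one flattened tensor per bucket
    )
    register_ddp_comm_hook(hook_type, ddp_model, state=None)
```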
