Update on "[Gradient Compression] Add a random generator to PowerSGD state for initializing low-rank matrix Q"

Previously, the random seed was the length of the input tensor, which is not guaranteed to differ between batches. Now a random generator is initialized in the PowerSGD state, and this generator is used to create a random seed that randomizes the low-rank tensor Q at every step.

As a result, the initial tensor Q is the same across all the replicas at a given step, but differs from step to step.

`torch.manual_seed` is used in the same way as in https://github.com/epfml/powersgd/blob/master/gradient_reducers.py#L675
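
A minimal sketch of the seeding pattern described above (not the hook's exact code; `init_low_rank_q`, `square_side_length`, and the `1_000_000_000` bound are illustrative placeholders):

```python
import numpy as np
import torch

# Held in the PowerSGD state; every replica constructs it with the same random_seed,
# so all replicas draw the same sequence of per-step seeds.
rng = np.random.RandomState(0)

def init_low_rank_q(square_side_length, matrix_approximation_rank):
    # Fork the global RNG so seeding here does not perturb random sampling
    # elsewhere in training (e.g. dropout or data augmentation).
    with torch.random.fork_rng(devices=[]):
        # Same seed on every replica at a given step, different seed at each step.
        torch.manual_seed(rng.randint(1_000_000_000))
        return torch.randn(square_side_length, matrix_approximation_rank)

q_step_0 = init_low_rank_q(1024, 1)  # identical on all replicas
q_step_1 = init_low_rank_q(1024, 1)  # identical on all replicas, but differs from step 0
```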

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D25191589](https://our.internmc.facebook.com/intern/diff/D25191589/)

[ghstack-poisoned]
wayi committed Nov 30, 2020
1 parent a44241f commit 973be1c
Showing 1 changed file with 6 additions and 4 deletions.
10 changes: 6 additions & 4 deletions torch/distributed/algorithms/ddp_comm_hooks/powerSGD_hook.py
@@ -35,9 +35,11 @@ class PowerSGDState(object):
     def __init__(self, process_group, matrix_approximation_rank=1, random_seed=0):
         self.process_group = process_group
         self.matrix_approximation_rank = matrix_approximation_rank
-        # The purpose of RNG is to generate different random seed for initializing Q across iterations, but in the same order for all replicas.
-        # Different random seeds across iterations means different 'projections' of the gradients at different SGD steps.
-        # If the same random projection is used, there will be differences between the gradients that are never synchronized.
+        # The purpose of this RNG is to generate different random seeds for initializing Q across iterations,
+        # but in the same order for all the DDP replicas.
+        # Different random seeds across iterations indicate different 'projections' of the gradients at different SGD steps.
+        # If the same random projection is used,
+        # there will be differences between the gradients that are never synchronized.
         self.rng = np.random.RandomState(random_seed)


@@ -102,7 +104,7 @@ def create_low_rank_tensor(fill_random_values, rng):
         "Returns a low-rank 2D tensor of square_side_length * matrix_approximation_rank."
         if fill_random_values:
             with torch.random.fork_rng(devices=[]):
-                # Fork this RNG to avoid chaning the seed globally and affecting the random sampling anywhere else in the training.
+                # Fork this RNG to avoid changing the seed globally and affecting the random sampling anywhere else in the training.
                 # The seed makes sure that the initial random values are the same across all the DDP replicas.
                 # Such seed should differ at every step.
                 # Since it is very slow to fork RNG state across all the CUDA devices,
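
For context, a hedged usage sketch of how this state could be wired into DDP (assuming the process group has already been initialized and `model`/`rank` exist; not taken from this PR):

```python
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook
from torch.nn.parallel import DistributedDataParallel as DDP

# Construct the state with an explicit random_seed so every replica builds an
# identical np.random.RandomState and Q stays synchronized across replicas.
state = powerSGD_hook.PowerSGDState(
    process_group=None,          # None: the hook falls back to the default group
    matrix_approximation_rank=1,
    random_seed=0,
)
ddp_model = DDP(model, device_ids=[rank])  # `model` and `rank` assumed to exist
ddp_model.register_comm_hook(state, powerSGD_hook.powerSGD_hook)
```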