Update on "[Gradient Compression] Add a random generator to PowerSGD state for initializing low-rank matrix Q"

Previously, the random seed was the length of the input tensor, which is not guaranteed to differ between batches. Now a random generator is initialized in the PowerSGD state, and this generator is used to create a random seed that randomizes the low-rank tensor Q at every step.

As a result, the initial tensor Q is the same across all the replicas at a given step, but differs from step to step.

`torch.manual_seed` is used in the same way as in https://github.com/epfml/powersgd/blob/master/gradient_reducers.py#L675
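
A minimal sketch of the seeding pattern described above (not the hook's exact code; `init_low_rank_q`, `square_side_length`, and the `1_000_000_000` bound are illustrative placeholders):

```python
import numpy as np
import torch

# Held in the PowerSGD state; every replica constructs it with the same random_seed,
# so all replicas draw the same sequence of per-step seeds.
rng = np.random.RandomState(0)

def init_low_rank_q(square_side_length, matrix_approximation_rank):
    # Fork the global RNG so seeding here does not perturb random sampling
    # elsewhere in training (e.g. dropout or data augmentation).
    with torch.random.fork_rng(devices=[]):
        # Same seed on every replica at a given step, different seed at each step.
        torch.manual_seed(rng.randint(1_000_000_000))
        return torch.randn(square_side_length, matrix_approximation_rank)

q_step_0 = init_low_rank_q(1024, 1)  # identical on all replicas
q_step_1 = init_low_rank_q(1024, 1)  # identical on all replicas, but differs from step 0
```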

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D25191589](https://our.internmc.facebook.com/intern/diff/D25191589/)

[ghstack-poisoned]
wayi committed Nov 30, 2020
1 parent a44241f commit 973be1c
Showing 1 changed file with 6 additions and 4 deletions.
10 changes: 6 additions & 4 deletions torch/distributed/algorithms/ddp_comm_hooks/powerSGD_hook.py
@@ -35,9 +35,11 @@ class PowerSGDState(object):
     def __init__(self, process_group, matrix_approximation_rank=1, random_seed=0):
         self.process_group = process_group
         self.matrix_approximation_rank = matrix_approximation_rank
-        # The purpose of RNG is to generate different random seed for initializing Q across iterations, but in the same order for all replicas.
-        # Different random seeds across iterations means different 'projections' of the gradients at different SGD steps.
-        # If the same random projection is used, there will be differences between the gradients that are never synchronized.
+        # The purpose of this RNG is to generate different random seeds for initializing Q across iterations,
+        # but in the same order for all the DDP replicas.
+        # Different random seeds across iterations indicate different 'projections' of the gradients at different SGD steps.
+        # If the same random projection is used,
+        # there will be differences between the gradients that are never synchronized.
         self.rng = np.random.RandomState(random_seed)


@@ -102,7 +104,7 @@ def create_low_rank_tensor(fill_random_values, rng):
         "Returns a low-rank 2D tensor of square_side_length * matrix_approximation_rank."
         if fill_random_values:
             with torch.random.fork_rng(devices=[]):
-                # Fork this RNG to avoid chaning the seed globally and affecting the random sampling anywhere else in the training.
+                # Fork this RNG to avoid changing the seed globally and affecting the random sampling anywhere else in the training.
                 # The seed makes sure that the initial random values are the same across all the DDP replicas.
                 # Such seed should differ at every step.
                 # Since it is very slow to fork RNG state across all the CUDA devices,
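
For context, a hedged usage sketch of how this state could be wired into DDP (assuming the process group has already been initialized and `model`/`rank` exist; not taken from this PR):

```python
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook
from torch.nn.parallel import DistributedDataParallel as DDP

# Construct the state with an explicit random_seed so every replica builds an
# identical np.random.RandomState and Q stays synchronized across replicas.
state = powerSGD_hook.PowerSGDState(
    process_group=None,          # None: the hook falls back to the default group
    matrix_approximation_rank=1,
    random_seed=0,
)
ddp_model = DDP(model, device_ids=[rank])  # `model` and `rank` assumed to exist
ddp_model.register_comm_hook(state, powerSGD_hook.powerSGD_hook)
```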