From 9082d4b6f1a427b0338d1ad8a2d6e88cf4bbadb3 Mon Sep 17 00:00:00 2001
From: Tim Gianitsos
Date: Mon, 5 Jun 2023 22:49:03 -0700
Subject: [PATCH] Optimizer state is not synchronized across replicas like
 model state is

From `DistributedDataParallel` docs: "The module... assumes that [gradients]
will be modified by the optimizer in all processes in the same way."
Note that this is "assumed", not enforced.

From https://pytorch.org/tutorials/recipes/zero_redundancy_optimizer.html :
"each process keeps a dedicated replica of the optimizer. Since DDP has
already synchronized gradients in the backward pass, all optimizer replicas
will operate on the same parameter and gradient values in every iteration,
and this is how DDP keeps model replicas in the same state"
---
 intermediate_source/FSDP_tutorial.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/intermediate_source/FSDP_tutorial.rst b/intermediate_source/FSDP_tutorial.rst
index d69a03b68be..0399d397088 100644
--- a/intermediate_source/FSDP_tutorial.rst
+++ b/intermediate_source/FSDP_tutorial.rst
@@ -16,7 +16,7 @@ In this tutorial, we show how to use `FSDP APIs `__,
-In `DistributedDataParallel `__, (DDP) training, each process/ worker owns a replica of the model and processes a batch of data, finally it uses all-reduce to sum up gradients over different workers. In DDP the model weights and optimizer states are replicated across all workers. FSDP is a type of data parallelism that shards model parameters, optimizer states and gradients across DDP ranks.
+In `DistributedDataParallel `__, (DDP) training, each process/ worker owns a replica of the model and processes a batch of data, finally it uses all-reduce to sum up gradients over different workers. In DDP the model weights are replicated across all workers. FSDP is a type of data parallelism that shards model parameters, optimizer states and gradients across DDP ranks.
 FSDP GPU memory footprint would be smaller than DDP across all workers. This makes the training of some very large models feasible and helps to fit larger models or batch sizes for our training job. This would come with the cost of increased communication volume. The communication overhead is reduced by internal optimizations like communication and computation overlapping.
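
For reference, below is a minimal sketch (not part of the patch) of the DDP behavior the commit message describes: each rank wraps the same model in DDP but constructs its own optimizer, and DDP never synchronizes that optimizer object; the replicas only stay identical because they start from the same broadcast weights and each applies its step to the same all-reduced gradients. The toy linear model, gloo backend, and single-node spawn setup are illustrative assumptions, not code from the tutorial.

    # Sketch only: illustrates that DDP synchronizes gradients, not optimizer state.
    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def run(rank: int, world_size: int):
        # Hypothetical single-node rendezvous settings for the sketch.
        os.environ.setdefault("MASTER_ADDR", "localhost")
        os.environ.setdefault("MASTER_PORT", "29500")
        dist.init_process_group("gloo", rank=rank, world_size=world_size)

        model = nn.Linear(10, 10)       # DDP broadcasts rank 0's weights at wrap time
        ddp_model = DDP(model)
        # One optimizer *per process*; DDP never touches or communicates this object.
        optimizer = torch.optim.Adam(ddp_model.parameters(), lr=1e-3)

        inputs = torch.randn(4, 10)     # each rank sees a different batch
        loss = ddp_model(inputs).sum()
        loss.backward()                 # DDP all-reduces (averages) gradients here
        optimizer.step()                # each replica applies the same update locally

        dist.destroy_process_group()

    if __name__ == "__main__":
        world_size = 2
        torch.multiprocessing.spawn(run, args=(world_size,), nprocs=world_size)

Because every optimizer replica starts from the same state and sees identical averaged gradients, the model replicas remain in lockstep; but as the commit message notes, this is an assumption about the optimizer's behavior, not something DDP enforces.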