Gradient collection in stage 1 #7197
Based on resources on ZeRO (e.g. page here), for stage 1, where the optimizer state is sharded, an all-reduce operation follows the backward pass to update the parameters on the corresponding state holder. However, if I understand correctly, a GPU holding the optimizer state for parameter W1 doesn't need to get the gradient for parameter W2 from the other GPUs, right? A reduce-scatter seems more reasonable: we reduce the gradients so as to average over all batches, and each GPU only receives the part of the gradient for which it holds an optimizer state. Unfortunately, the original paper skips the communication volume for stage 1 and only discusses stages 2 and 3. Thanks in advance :)
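
To make the pattern concrete, here is a minimal `torch.distributed` sketch of the reduce-scatter I have in mind (the flat-gradient layout, the function name, the divisibility assumption, and the use of `dist.reduce_scatter_tensor` from recent torch are my own, not DeepSpeed's):

```python
import torch
import torch.distributed as dist

def reduce_gradients_for_partitioned_optimizer(flat_grad: torch.Tensor) -> torch.Tensor:
    """flat_grad: the full flattened gradient on this rank (same length on every
    rank, assumed divisible by world_size for simplicity)."""
    world_size = dist.get_world_size()
    shard = torch.empty(flat_grad.numel() // world_size,
                        dtype=flat_grad.dtype, device=flat_grad.device)
    # Sum the matching slice from every rank; this rank receives only its own slice.
    dist.reduce_scatter_tensor(shard, flat_grad, op=dist.ReduceOp.SUM)
    shard.div_(world_size)  # average over the data-parallel ranks
    return shard  # the optimizer partition on this rank updates only this shard
```

After the optimizer step each rank would presumably still all-gather its updated parameter shard, which I assume is why the overall communication volume ends up comparable to plain all-reduce data parallelism.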
Replies: 1 comment 1 reply
@ghadialhajj, yes, reduce-scatter is the better option for gradient reduction.
We manually implemented reduce-scatter because the collective was not supported by torch at the time. We haven't had the bandwidth to upgrade the code:
DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py
Line 1067 in 79ff162
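
For illustration only (this is not the DeepSpeed code linked above), a reduce-scatter can be emulated with one `dist.reduce` per shard, which is roughly the kind of manual fallback described; the function name and layout here are hypothetical:

```python
import torch
import torch.distributed as dist

def manual_reduce_scatter(flat_grad: torch.Tensor) -> torch.Tensor:
    """Emulate reduce-scatter with one reduce per shard: shard i is summed onto rank i.
    Assumes flat_grad has the same length on every rank, divisible by world_size."""
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    shards = list(flat_grad.chunk(world_size))
    for dst, shard in enumerate(shards):
        # Every rank contributes its copy of shard `dst`; only rank `dst` keeps the sum.
        dist.reduce(shard, dst=dst, op=dist.ReduceOp.SUM)
    own_shard = shards[rank]
    own_shard.div_(world_size)  # average over the data-parallel ranks
    return own_shard
```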