Gradient collection in stage 1 #7197
Based on resources on ZeRO (e.g. page here), for stage 1, where the optimizer state is sharded, an all-reduce operation follows the backward pass to update the parameters on the corresponding state holder. However, if I understand correctly, a GPU holding the optimizer state for parameter W1 doesn't need to get the gradient for parameter W2 from the other GPUs, right? A reduce-scatter seems more reasonable: we reduce the gradients so as to average over all batches, and each GPU only receives the part of the gradient for which it holds an optimizer state. Unfortunately, the original paper skips the communication volume for stage 1 and only discusses stages 2 and 3. Thanks in advance :)
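
To make the pattern concrete, here is a minimal `torch.distributed` sketch of the reduce-scatter I have in mind (the flat-gradient layout, the function name, the divisibility assumption, and the use of `dist.reduce_scatter_tensor` from recent torch are my own, not DeepSpeed's):

```python
import torch
import torch.distributed as dist

def reduce_gradients_for_partitioned_optimizer(flat_grad: torch.Tensor) -> torch.Tensor:
    """flat_grad: the full flattened gradient on this rank (same length on every
    rank, assumed divisible by world_size for simplicity)."""
    world_size = dist.get_world_size()
    shard = torch.empty(flat_grad.numel() // world_size,
                        dtype=flat_grad.dtype, device=flat_grad.device)
    # Sum the matching slice from every rank; this rank receives only its own slice.
    dist.reduce_scatter_tensor(shard, flat_grad, op=dist.ReduceOp.SUM)
    shard.div_(world_size)  # average over the data-parallel ranks
    return shard  # the optimizer partition on this rank updates only this shard
```

After the optimizer step each rank would presumably still all-gather its updated parameter shard, which I assume is why the overall communication volume ends up comparable to plain all-reduce data parallelism.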
Replies: 1 comment 1 reply
@ghadialhajj, yes, reduce-scatter is the better option for gradient reduction.
We manually implemented reduce-scatter because the collective was not supported by torch at the time. We haven't had the bandwidth to upgrade the code:
DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py
Line 1067 in 79ff162
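
For illustration only (this is not the DeepSpeed code linked above), a reduce-scatter can be emulated with one `dist.reduce` per shard, which is roughly the kind of manual fallback described; the function name and layout here are hypothetical:

```python
import torch
import torch.distributed as dist

def manual_reduce_scatter(flat_grad: torch.Tensor) -> torch.Tensor:
    """Emulate reduce-scatter with one reduce per shard: shard i is summed onto rank i.
    Assumes flat_grad has the same length on every rank, divisible by world_size."""
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    shards = list(flat_grad.chunk(world_size))
    for dst, shard in enumerate(shards):
        # Every rank contributes its copy of shard `dst`; only rank `dst` keeps the sum.
        dist.reduce(shard, dst=dst, op=dist.ReduceOp.SUM)
    own_shard = shards[rank]
    own_shard.div_(world_size)  # average over the data-parallel ranks
    return own_shard
```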