Scatter-reduce in stage 2 #7198

ghadialhajj · 2025-04-02T20:08:22Z

My understanding of gradient partitioning is that during the backward pass, once each GPU has computed the gradients for a given layer on its own batch, only one GPU (the one holding the optimizer state for that layer) performs a reduce operation to collect the gradients from all batches.

In the paper, however, the description of stage 2 says: "ZeRO only requires a scatter-reduce operation on the gradients". First, is this "scatter-reduce" the same as the one here? If so, why do we need that instead of a simple reduce?
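
To make the question concrete, here is a toy single-process sketch. Plain Python lists stand in for per-GPU gradient buffers, and the function names `reduce_to_owner` and `reduce_scatter` are made up for illustration; they are not DeepSpeed or `torch.distributed` APIs. It assumes the gradient length divides evenly by the number of ranks, and compares a reduce-scatter against one plain reduce per partition, each rooted at the rank that owns that partition:

```python
def reduce_to_owner(grads_per_rank, chunk):
    """Plain reduce of one chunk: sum it across all ranks.

    In a real setup only the owning rank would receive this result.
    """
    lo, hi = chunk
    return [sum(g[i] for g in grads_per_rank) for i in range(lo, hi)]


def reduce_scatter(grads_per_rank):
    """Reduce-scatter: rank r ends up with the summed r-th chunk only."""
    world = len(grads_per_rank)
    size = len(grads_per_rank[0]) // world  # assumes even divisibility
    return [
        [sum(g[i] for g in grads_per_rank)
         for i in range(r * size, (r + 1) * size)]
        for r in range(world)
    ]


# Two "ranks", each holding a full local gradient of length 4.
grads = [
    [1.0, 2.0, 3.0, 4.0],      # rank 0
    [10.0, 20.0, 30.0, 40.0],  # rank 1
]

scattered = reduce_scatter(grads)

# One plain reduce per partition, rooted at the rank that owns it.
per_owner = [reduce_to_owner(grads, (r * 2, (r + 1) * 2)) for r in range(2)]

assert scattered == per_owner
```

In this toy setup the two give the same per-rank result, so the "scatter-reduce" could simply be the collective view of all the per-owner reduces taken together. Whether that is what the paper means, and how DeepSpeed actually schedules it, is an assumption on my part.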

Thanks :)

Replies: 0 comments
