About the reason for doing all-gather in the propagation #7123
Yeosu-expo started this conversation in General
Replies: 1 comment · 1 reply
- I am a student studying DNN training engines. I'm currently analyzing DeepSpeed, which is powered by ZeRO, and I've been reading the ZeRO papers (e.g., ZeRO, ZeRO-Offload, ZeRO++). I've been wondering why ZeRO performs an all-gather on all parameter partitions in both the forward and backward passes. My understanding is that because ZeRO partitions parameters in an N-way manner (where N denotes the degree of Data Parallelism), without regard to the computation, an all-gather is required before computation, unlike in Model Parallelism. So, my main question is: since ZeRO partitions parameters only according to Data Parallelism, does it need to all-gather all parameter partitions in both the forward and backward passes? And if ZeRO partitioned parameters the way Model Parallelism does (split column-wise or row-wise, with the computation in mind), would an all-gather no longer be necessary?
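To make the N-way partitioning concrete, here is a minimal single-process sketch (plain PyTorch; `world_size`, `shards`, and `gathered` are illustrative names, not DeepSpeed APIs, and the N ranks are simulated with a Python list). ZeRO-3 flattens each parameter and gives every data-parallel rank a 1/N slice, so no single rank can run the layer's matmul until the slices are gathered back:

```python
import torch

world_size = 4                 # degree of data parallelism (N)
torch.manual_seed(0)

W = torch.randn(8, 8)          # full weight of a linear layer, Y = X W
x = torch.randn(2, 8)          # a mini-batch of activations

# ZeRO-style partitioning: flatten W and give each "rank" a 1/N slice.
# A slice is just a run of numbers with no row/column structure, so
# rank i cannot compute x @ W from shards[i] alone.
flat = W.flatten()
shards = list(flat.chunk(world_size))

# The all-gather reconstructs the full flat parameter on every rank.
gathered = torch.cat(shards).view_as(W)
assert torch.equal(gathered, W)

y = x @ gathered               # the forward matmul needs the whole W
print(y.shape)                 # torch.Size([2, 8])
```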
- Consider a simple linear layer, $Y = XW$: in the backward pass, to calculate the gradient with respect to the input (say $X$), the weight matrix $W$ is needed, since $dX = dY \cdot W^T$. Hence one needs to all-gather $W$ in the backward pass as well.
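The formula above is easy to verify with autograd. Here is a minimal sketch (plain PyTorch, single process; `dy` stands in for the upstream gradient) showing that the input gradient of $Y = XW$ consumes the full weight matrix, which is why a ZeRO-3 rank holding only a 1/N slice must all-gather $W$ again before the backward matmul:

```python
import torch

torch.manual_seed(0)
W = torch.randn(8, 8)
x = torch.randn(2, 8, requires_grad=True)

y = x @ W                      # forward: Y = X W
dy = torch.randn_like(y)       # stand-in for the upstream gradient dY
y.backward(dy)                 # autograd computes dX for us

# The manual gradient w.r.t. the input uses the FULL weight matrix:
# dX = dY @ W^T
manual_dx = dy @ W.T
assert torch.allclose(x.grad, manual_dx)
```

A column-wise or row-wise (Model Parallel) split would let each rank keep its slice and reduce partial activations or gradients instead, but ZeRO partitions along a flat 1/N boundary precisely so the computation stays identical to plain data parallelism; the forward and backward all-gathers are the price for that.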