
Increase default bucket_cap_mb value from 25MB to a more optimal value #118421

Open
atalman opened this issue Jan 26, 2024 · 3 comments
Labels: module: distributed, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@atalman (Contributor) commented Jan 26, 2024

🚀 The feature, motivation and pitch

Context: #117748

All-reduce comms are used in DDP's backward pass and by default the bucket size is set to 25MB via bucket_cap_mb. Documentation about this can be found here: https://github.com/pytorch/pytorch/blob/main/benchmarks/distributed/ddp/README.md?plain=1#L160

The default 25MB bucket size is very small, and most users would have to increase it. Hence this feature request to find and set a more optimal default value for general usage.
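
For reference, `bucket_cap_mb` is a constructor argument of `torch.nn.parallel.DistributedDataParallel`. A minimal sketch of overriding it (the value 100 is illustrative only, not a settled recommendation from this thread):

```python
# Minimal sketch: overriding the default bucket size when wrapping a model in DDP.
# Assumes a distributed job has already been launched (e.g. via torchrun) so that
# init_process_group can read rank/world-size from the environment.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
model = torch.nn.Linear(1024, 1024).cuda()
ddp_model = DDP(model, bucket_cap_mb=100)  # default is 25 (MB)
```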

cc @ptrblck @malfet @roywei @wconstab @chrisG

Alternatives

No response

Additional context

No response

@wconstab (Contributor) commented:

Agree, we ought to build a tool for users to apply to find their own optimal size. Before then, I think it would be appropriate to increase the default. 250MB? 1GB? No idea. Calling all experts for input here.

@rohan-varma @xw285cornell @lessw2020 ...

@lessw2020 (Contributor) commented:

I would propose 100MB as a very safe increase that is likely to be an improvement in almost every case vs the current 25MB.

I would be more hesitant to go too 'big' on this setting until there's a better/dynamic way to tune it, as too big can become almost as bad as too small, but I haven't really seen a case where 100MB would not outperform 25MB for current models.

Anyway, @atalman is definitely correct that 25MB is not a good default anymore.

@kwen2501 (Contributor) commented:

[figure: ddp_time]

What DDP + bucketing cares about is exposed time. The exposed time comprises:

  1. transmission of the last bucket (the smaller the better), and
  2. the accumulated overhead of all buckets (the less the better).

That is, t_e = s_b / w + t_o * n_b, where:

  • s_b is the size of a bucket,
  • w is the algorithm bandwidth of all-reduce,
  • t_o is the overhead per all-reduce, and
  • n_b is the number of buckets.

Since n_b = s / s_b, where s is the full size of the model, we have:
t_e = s_b / w + t_o * s / s_b

t_e achieves its minimum when the two terms are equal (set dt_e/ds_b = 0, or apply the AM-GM inequality):
s_b / w = t_o * s / s_b
thus the best s_b would be:
s_b = sqrt( s * t_o * w )

Let's pick some values:

  • s = 4 GB (taking a 1B-parameter model in fp32, i.e. 4 bytes per parameter, for example)
  • t_o = 4 us (CUDA kernel launch overhead + NCCL kernel overhead)
  • w = 100 GB/s (algorithm bandwidth of all-reduce on 8 x A100, note this is half of the 200 GB/s bus bandwidth)

then we have the best s_b for this case being:
s_b = 40 MB
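
A quick numeric check of the formula with these assumed constants (a sketch, not a benchmark; the constants are exactly the ones listed above):

```python
import math

s = 4e9      # model size in bytes (1B parameters * 4 bytes in fp32)
t_o = 4e-6   # per-all-reduce overhead in seconds
w = 100e9    # all-reduce algorithm bandwidth in bytes/s

def exposed_time(s_b):
    """t_e = s_b / w + t_o * s / s_b, from the derivation above."""
    return s_b / w + t_o * s / s_b

s_b_opt = math.sqrt(s * t_o * w)  # 4e7 bytes
print(s_b_opt / 1e6)              # 40.0 -> 40 MB, matching the result above

# With these constants, both the current and the proposed defaults lose to 40 MB:
for mb in (25, 40, 100):
    print(mb, "MB ->", exposed_time(mb * 1e6) * 1e3, "ms")
# 25 MB -> 0.89 ms, 40 MB -> 0.80 ms, 100 MB -> 1.16 ms
```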

Now the question is whether the above values are at the "center" of all DDP models x systems. Would like to hear what people think.

@colesbury added the triaged label Jan 29, 2024