
Increase default bucket_cap_mb value from 25MB to a more optimal value #118421

Open
atalman opened this issue Jan 26, 2024 · 3 comments
Labels: module: distributed, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@atalman (Contributor) commented Jan 26, 2024

🚀 The feature, motivation and pitch

Context: #117748

All-reduce comms are used in DDP's backward pass and by default the bucket size is set to 25MB via bucket_cap_mb. Documentation about this can be found here: https://github.com/pytorch/pytorch/blob/main/benchmarks/distributed/ddp/README.md?plain=1#L160

The default 25MB bucket size is very small, and most users would have to increase it. Hence this feature request to find and set a more optimal default value for general usage.
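
For reference, `bucket_cap_mb` is a constructor argument of `torch.nn.parallel.DistributedDataParallel`. A minimal sketch of overriding it (the value 100 is illustrative only, not a settled recommendation from this thread):

```python
# Minimal sketch: overriding the default bucket size when wrapping a model in DDP.
# Assumes a distributed job has already been launched (e.g. via torchrun) so that
# init_process_group can read rank/world-size from the environment.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
model = torch.nn.Linear(1024, 1024).cuda()
ddp_model = DDP(model, bucket_cap_mb=100)  # default is 25 (MB)
```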

cc @ptrblck @malfet @roywei @wconstab @chrisG

Alternatives

No response

Additional context

No response

@wconstab (Contributor) commented:

Agree, we ought to build a tool for users to apply to find their own optimal size. Before then, I think it would be appropriate to increase the default. 250MB? 1GB? No idea. Calling all experts for input here.

@rohan-varma @xw285cornell @lessw2020 ...

@lessw2020 (Contributor) commented:

I would propose 100MB as a very safe increase that is likely to be an improvement in almost every case vs the current 25MB.

I would be more hesitant to go too 'big' on this setting until there's a better/dynamic way to tune it, as too big can become almost as bad as too small, but I haven't really seen a case where 100MB would not outperform 25MB for current models.

Anyway, @atalman is definitely correct that 25MB is not a good default anymore.

@kwen2501 (Contributor) commented:

[figure: ddp_time]

What DDP + bucketing cares about is exposed time. The exposed time comprises:

  1. transmission of the last bucket (the smaller the better), and
  2. the accumulated overhead of all buckets (the less the better).

That is, t_e = s_b / w + t_o * n_b, where:

  • s_b is the size of a bucket,
  • w is the algorithm bandwidth of all-reduce,
  • t_o is the overhead per all-reduce, and
  • n_b is the number of buckets.

Since n_b = s / s_b, where s is the full size of the model, we have:
t_e = s_b / w + t_o * s / s_b

t_e achieves its minimum when the two terms are equal (set dt_e/ds_b = 0, or apply the AM-GM inequality):
s_b / w = t_o * s / s_b
thus the best s_b would be:
s_b = sqrt( s * t_o * w )

Let's pick some values:

  • s = 4 GB (taking a 1B-parameter model in fp32, i.e. 4 bytes per parameter, for example)
  • t_o = 4 us (CUDA kernel launch overhead + NCCL kernel overhead)
  • w = 100 GB/s (algorithm bandwidth of all-reduce on 8 x A100, note this is half of the 200 GB/s bus bandwidth)

then we have the best s_b for this case being:
s_b = 40 MB
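
A quick numeric check of the formula with these assumed constants (a sketch, not a benchmark; the constants are exactly the ones listed above):

```python
import math

s = 4e9      # model size in bytes (1B parameters * 4 bytes in fp32)
t_o = 4e-6   # per-all-reduce overhead in seconds
w = 100e9    # all-reduce algorithm bandwidth in bytes/s

def exposed_time(s_b):
    """t_e = s_b / w + t_o * s / s_b, from the derivation above."""
    return s_b / w + t_o * s / s_b

s_b_opt = math.sqrt(s * t_o * w)  # 4e7 bytes
print(s_b_opt / 1e6)              # 40.0 -> 40 MB, matching the result above

# With these constants, both the current and the proposed defaults lose to 40 MB:
for mb in (25, 40, 100):
    print(mb, "MB ->", exposed_time(mb * 1e6) * 1e3, "ms")
# 25 MB -> 0.89 ms, 40 MB -> 0.80 ms, 100 MB -> 1.16 ms
```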

Now the question is whether the above values are at the "center" of all DDP models x systems. Would like to hear what people think.

@colesbury added the triaged label Jan 29, 2024