Open
Labels
oncall: distributed, triaged
Description
🚀 The feature, motivation and pitch
Context: #117748
All-reduce communication is used in DDP's backward pass to synchronize gradients, and by default the bucket size is set to 25 MB via bucket_cap_mb. Documentation about this can be found here: https://github.com/pytorch/pytorch/blob/main/benchmarks/distributed/ddp/README.md?plain=1#L160
The default 25 MB bucket size is very small, and most users end up having to increase it. Hence this feature request: find and set a better default value for general usage.
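
To illustrate why the cap matters, here is a rough sketch of how the bucket size limit translates into the number of all-reduce calls per backward pass. It is a simplified greedy model, not DDP's actual bucketing logic (which, among other things, treats the first bucket specially and groups gradients by dtype and device). The function name `count_buckets` and the 100-layer model are hypothetical and used only for illustration.

```python
from typing import List


def count_buckets(grad_bytes: List[int], bucket_cap_mb: float) -> int:
    """Greedy estimate of how many gradient buckets a given cap produces.

    Assumption: gradients are packed in order, and a bucket is closed as soon
    as the next gradient would push it past the cap. Real DDP differs in detail.
    """
    cap = int(bucket_cap_mb * 1024 * 1024)
    buckets = 0
    current = 0
    for size in grad_bytes:
        # Close the current bucket if this gradient would exceed the cap.
        if current > 0 and current + size > cap:
            buckets += 1
            current = 0
        current += size
    # Count the last, partially filled bucket.
    if current > 0:
        buckets += 1
    return buckets


# Hypothetical model: 100 layers, each producing 4 MiB of gradients.
layer_bytes = [4 * 1024 * 1024] * 100

for cap_mb in (25, 100, 400):
    print(f"bucket_cap_mb={cap_mb}: {count_buckets(layer_bytes, cap_mb)} buckets")
```

Under these assumptions, the 25 MB cap yields 17 buckets (and so 17 all-reduce calls), versus 4 at 100 MB and 1 at 400 MB. In practice the value is changed by passing `bucket_cap_mb` to `torch.nn.parallel.DistributedDataParallel`; larger buckets mean fewer collectives but less overlap between communication and the remaining backward computation, so the best value depends on the model and interconnect.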
cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @ptrblck @malfet @roywei @chrisG
Alternatives
No response
Additional context
No response