
2 NCCL all_reduces per step for large models in multi-node training #61353

Description

@puririshi98

🐛 Bug

When training large models with DDP, we observe two NCCL all_reduces per iteration and have no way to increase the first bucket size. This is because kDefaultFirstBucketBytes in reducer.hpp is a constexpr int that cannot be modified through the Python API, even though it is exposed as an attribute of torch.distributed.
The value is defined here:

constexpr int kDefaultFirstBucketBytes = int(1024 * 1024);

The bucket size is set here:
https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/reducer.cpp#L1649
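
For context, here is a minimal sketch of the symptom (assuming a torchrun launch with the NCCL backend; the model size is arbitrary): even when bucket_cap_mb is large enough to hold every gradient in a single bucket, DDP still carves out a roughly 1 MiB first bucket, so any model whose gradients exceed 1 MiB issues two all_reduces per backward pass.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Sketch of the symptom; assumes a torchrun launch so LOCAL_RANK is set.
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(4096, 4096).cuda()  # ~64 MiB of fp32 gradients
# bucket_cap_mb can hold all gradients in one bucket, yet the first ~1 MiB
# still lands in its own bucket because of kDefaultFirstBucketBytes.
ddp_model = DDP(model, device_ids=[local_rank], bucket_cap_mb=1024)

out = ddp_model(torch.randn(8, 4096, device="cuda"))
out.sum().backward()  # profiling this step shows two NCCL all_reduce calls
dist.destroy_process_group()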

There is a Python API exposure; however, since the value is a constexpr, it is fixed at compile time, which makes changing it through Python at runtime problematic. Changing the value through the Python binding does not currently work.
Python API:

module.attr("_DEFAULT_FIRST_BUCKET_BYTES") = ::c10d::kDefaultFirstBucketBytes;
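
The natural workaround attempt, sketched below, is to rebind this attribute from Python before constructing DDP. But because the attribute is initialized from the compile-time constexpr, the C++ reducer still uses the original 1 MiB constant when it computes bucket assignments, and the two all_reduces persist.

import torch.distributed as dist

# Attempted workaround (does NOT work): this only rebinds the Python-side
# name; the C++ reducer keeps using the compile-time 1 MiB constant.
dist._DEFAULT_FIRST_BUCKET_BYTES = 512 * 1024 * 1024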

The user should be able to increase this kDefaultFirstBucketBytes value through the Python API.

Reproduction steps (should work in any PyTorch container from 21.05 onwards):

  1. tar -xvzf ddp_overlap.tgz
  2. cp -r ddp_overlap /opt/
  3. cd /opt/ddp_overlap
  4. bash run.sh

Repro Scripts:
ddp_overlap.zip

Example of the issue: see the attached screenshot "evidence".

This issue persists when the kDefaultFirstBucketBytes value is edited through the Python binding, confirming that the binding does not work. In contrast, manually changing the hardcoded constexpr in reducer.hpp and recompiling PyTorch does eliminate the second all_reduce. This value should be defined in the PyTorch backend in a way that lets a user edit it through the Python API, without recompiling, for situations like this one with very large models.
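
One possible shape for such a fix, purely as a hypothetical sketch (first_bucket_cap_mb does not exist in current PyTorch), would be to let the first bucket cap be configured at DDP construction time and plumbed through to the reducer:

ddp_model = DDP(
    model,
    device_ids=[local_rank],
    bucket_cap_mb=1024,
    first_bucket_cap_mb=1024,  # hypothetical parameter: merge the first bucket into the main one
)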

cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @anjali411 @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu @gcramer23

Labels: high priority, module: ddp, oncall: distributed, triaged
