Device agnostic gradient reduction #17757

@pietern

There are separate implementations of distributed data parallelism for CUDA and CPU tensors.

The CUDA one has evolved quite a bit and includes:

  • Dynamic bucketing of gradient tensors to reduce the number of reduction calls
  • Starting reduction as soon as a bucket is full (i.e. all participating gradients have been computed); a rough sketch of this pattern follows the list
  • Handling of fp32 and fp16 gradients
  • Handling of multiple model copies across devices
  • Buffer broadcast

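For illustration only, here is a minimal sketch of the hook-based "reduce a bucket as soon as it is full" pattern; this is not the actual DDP code. The helper name `install_bucketed_allreduce` and the `bucket_cap_bytes` parameter are made up, and `register_post_accumulate_grad_hook` needs a recent PyTorch (2.1+), whereas the implementation at the time hooked the AccumulateGrad nodes directly.

```python
import torch
import torch.distributed as dist


def install_bucketed_allreduce(model, bucket_cap_bytes=1 << 20):
    """Allreduce each bucket as soon as all of its gradients have been
    accumulated. Assumes dist.init_process_group() was already called."""
    params = [p for p in model.parameters() if p.requires_grad]

    # Greedily pack parameters into buckets no larger than bucket_cap_bytes
    # (definition order here; see the execution-order discussion below).
    buckets, current, current_bytes = [], [], 0
    for p in params:
        nbytes = p.numel() * p.element_size()
        if current and current_bytes + nbytes > bucket_cap_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(p)
        current_bytes += nbytes
    if current:
        buckets.append(current)

    pending = [len(b) for b in buckets]
    handles = []  # async work handles; wait on them before optimizer.step()

    def make_hook(bucket_idx):
        def hook(_param):
            pending[bucket_idx] -= 1
            if pending[bucket_idx] == 0:
                # Every gradient in this bucket is ready: start the reduction
                # without waiting for the rest of the backward pass.
                for q in buckets[bucket_idx]:
                    handles.append(dist.all_reduce(q.grad, async_op=True))
        return hook

    for i, bucket in enumerate(buckets):
        for p in bucket:
            p.register_post_accumulate_grad_hook(make_hook(i))

    # Resetting `pending` and draining `handles` between iterations is
    # omitted here for brevity.
    return handles
```
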
The CPU one has not evolved and does the following (a minimal sketch follows the list):

  1. Wait for all gradients to be computed
  2. Flatten all gradients into one big tensor
  3. Allreduce
  4. Unflatten

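Roughly, the four steps above amount to the snippet below; this is a sketch, not a copy of the actual code. `allreduce_gradients` is a made-up name, `_flatten_dense_tensors` / `_unflatten_dense_tensors` are internal `torch._utils` helpers, and the division by world size (averaging) is added for completeness.

```python
import torch.distributed as dist
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors


def allreduce_gradients(model):
    # 1. called after loss.backward(), so all gradients already exist
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    # 2. flatten all gradients into one big tensor
    flat = _flatten_dense_tensors(grads)
    # 3. a single allreduce over the flat buffer, averaged across ranks
    dist.all_reduce(flat)
    flat.div_(dist.get_world_size())
    # 4. unflatten and copy the synchronized values back
    for grad, synced in zip(grads, _unflatten_dense_tensors(flat, grads)):
        grad.copy_(synced)
```
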
Ideally, we turn the CUDA approach into a device-agnostic one, such that we can remove the CPU-specific implementation. Parts of the CUDA approach are written in C++ (see #12852, #12954, #13496, #13607) and these can be made device agnostic as well. Both the bucketing and the autograd hooks are currently defined in Python but could be turned into C++ code too.

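To illustrate the device-agnostic part: at the Python level, nothing about the reduction itself needs to know which device the gradients live on, since the backend is a property of the process group. A minimal sketch (the helper name `average_gradients` is made up, and the process group is assumed to be initialized already):

```python
import torch.distributed as dist

# e.g. dist.init_process_group("gloo")  # CPU tensors
#      dist.init_process_group("nccl")  # CUDA tensors


def average_gradients(params):
    world_size = dist.get_world_size()
    for p in params:
        if p.grad is not None:
            dist.all_reduce(p.grad)    # dispatches to whatever backend the
            p.grad.div_(world_size)    # default process group was built with
```
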
Separately, @mcarilli and team made really nice improvements to standard DDP in Apex. There, the ideal bucketing order is discovered dynamically so that it reflects the execution order instead of the definition order, which can yield significant speedups when the two orderings differ (e.g. layers that are used last are defined first).

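A sketch of that discovery step, under the same assumptions as the earlier snippet (the helper name `record_gradient_ready_order` is made up): register a hook per parameter, record the order in which gradients actually become ready during the first backward pass, and later rebuild the buckets in that order instead of definition order.

```python
import torch


def record_gradient_ready_order(model):
    ready_order = []  # parameter indices, in the order their grads arrive

    def make_hook(index):
        def hook(_param):
            ready_order.append(index)
        return hook

    handles = [
        p.register_post_accumulate_grad_hook(make_hook(i))
        for i, p in enumerate(model.parameters())
        if p.requires_grad
    ]
    # After the first loss.backward(), `ready_order` holds the execution
    # order; the hooks would then be removed via `handles` and the buckets
    # rebuilt in that order. Removal and rebucketing are omitted here.
    return ready_order, handles
```
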
Goals:

  • Use the same DDP implementation regardless of backend
  • Enable use of DDP for models defined with the C++ frontend
  • Discover ideal bucketing order dynamically

Let's use this as a starting point for discussion.

cc @mcarilli @mrshenli
