There are separate implementations of distributed data parallelism for CUDA and CPU tensors.
The CUDA one has evolved quite a bit and includes:
- Dynamic bucketing of gradient tensors to reduce the number of reduction calls
- Start reduction when a bucket is full (i.e. all participating gradients have been computed)
- Handling of fp32 and fp16 gradients
- Handling of multiple model copies across devices
- Buffer broadcast
The CPU one has not evolved and does the following (a minimal sketch follows the list):
- Wait for all gradients to be computed
- Flatten all gradients into one big tensor
- Allreduce
- Unflatten
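For reference, a minimal sketch of that CPU path, assuming a process group has already been initialized; the flatten helpers are from `torch._utils`, and the function name is illustrative, not the actual DDP code:

```python
import torch.distributed as dist
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

def allreduce_gradients(model, world_size):
    # All gradients are assumed to be computed already (backward() has returned).
    grads = [p.grad.data for p in model.parameters() if p.grad is not None]

    # Flatten everything into one contiguous buffer so a single collective
    # call covers the whole model.
    flat = _flatten_dense_tensors(grads)

    # One allreduce over the flat buffer, averaged across ranks.
    dist.all_reduce(flat)
    flat.div_(world_size)

    # Unflatten and copy the averaged values back into each parameter's grad.
    for grad, synced in zip(grads, _unflatten_dense_tensors(flat, grads)):
        grad.copy_(synced)
```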
Ideally, we turn the CUDA approach into a device-agnostic one, so that we can remove the CPU-specific implementation. Parts of the CUDA approach are already written in C++ (see #12852, #12954, #13496, #13607) and can be made device-agnostic as well. Both the bucketing and the autograd hooks are currently defined in Python but could be moved to C++ too; a rough sketch of that Python-side mechanism follows.
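This is only an illustration of the bucket-plus-hook idea, not the actual DDP internals; the `GradBucket` name is made up and the copy-back of reduced values after the handles complete is omitted:

```python
import torch.distributed as dist
from torch._utils import _flatten_dense_tensors

class GradBucket:
    """One group of parameters whose gradients are reduced together."""

    def __init__(self, params):
        self.params = list(params)
        self.ready = {}
        self.handle = None

    def mark_ready(self, param, grad):
        # Record this parameter's gradient; once every parameter in the bucket
        # has reported, launch an asynchronous allreduce so communication
        # overlaps with the remainder of the backward pass.
        self.ready[param] = grad
        if len(self.ready) == len(self.params):
            self.flat = _flatten_dense_tensors([self.ready[p] for p in self.params])
            self.handle = dist.all_reduce(self.flat, async_op=True)

def register_hooks(buckets):
    # Each parameter gets an autograd hook that notifies its bucket as soon
    # as its gradient has been produced.
    for bucket in buckets:
        for p in bucket.params:
            p.register_hook(lambda grad, p=p, b=bucket: b.mark_ready(p, grad))
```

After `backward()` returns, the caller would wait on each bucket's handle and copy the averaged values back into the corresponding `param.grad` tensors.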
Separately, @mcarilli and team made really nice improvements to standard DDP in Apex. The ideal bucketing order is discovered dynamically so that it reflects the execution order instead of the definition order, which can yield significant speedups when the two orderings are mismatched (e.g. layers that are used last but defined first).
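One way such an ordering could be discovered dynamically (illustrative only, not the Apex implementation): register a hook per parameter during the first iteration, record the order in which the hooks fire, and rebuild the buckets in that order for subsequent iterations.

```python
# First iteration: record the order in which gradient hooks fire.
execution_order = []

def record_execution_order(model):
    for idx, p in enumerate(model.parameters()):
        # Returning None leaves the gradient untouched; we only log the index.
        p.register_hook(lambda grad, idx=idx: execution_order.append(idx))

# After the first backward(), execution_order lists parameter indices in the
# order their gradients were produced. Buckets built in this order can start
# reducing earlier when definition order and execution order disagree.
```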
Goals:
- Use the same DDP implementation regardless of backend
- Enable use of DDP for models defined with the C++ frontend
- Discover ideal bucketing order dynamically
Let's use this as a starting point for discussion.