There are separate implementations of distributed data parallelism for CUDA and CPU tensors.
The CUDA one has evolved quite a bit and includes:
- Dynamic bucketing of gradient tensors to reduce the number of reduction calls
- Start reduction when a bucket is full (i.e. all participating gradients have been computed)
- Handling of fp32 and fp16 gradients
- Handling of multiple model copies across devices
- Buffer broadcast
The CPU one has not evolved and does the following (a minimal sketch follows the list):
- Wait for all gradients to be computed
- Flatten all gradients into one big tensor
- Allreduce
- Unflatten
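For reference, a minimal sketch of that CPU path, assuming a process group has already been initialized; the flatten helpers are from `torch._utils`, and the function name is illustrative, not the actual DDP code:

```python
import torch.distributed as dist
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

def allreduce_gradients(model, world_size):
    # All gradients are assumed to be computed already (backward() has returned).
    grads = [p.grad.data for p in model.parameters() if p.grad is not None]

    # Flatten everything into one contiguous buffer so a single collective
    # call covers the whole model.
    flat = _flatten_dense_tensors(grads)

    # One allreduce over the flat buffer, averaged across ranks.
    dist.all_reduce(flat)
    flat.div_(world_size)

    # Unflatten and copy the averaged values back into each parameter's grad.
    for grad, synced in zip(grads, _unflatten_dense_tensors(flat, grads)):
        grad.copy_(synced)
```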
Ideally, we turn the CUDA approach into a device-agnostic one, so that we can remove the CPU-specific implementation. Parts of the CUDA approach are already written in C++ (see #12852, #12954, #13496, #13607) and can be made device-agnostic as well. Both the bucketing and the autograd hooks are currently defined in Python but could be moved to C++ too; a rough sketch of that Python-side mechanism follows.
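This is only an illustration of the bucket-plus-hook idea, not the actual DDP internals; the `GradBucket` name is made up and the copy-back of reduced values after the handles complete is omitted:

```python
import torch.distributed as dist
from torch._utils import _flatten_dense_tensors

class GradBucket:
    """One group of parameters whose gradients are reduced together."""

    def __init__(self, params):
        self.params = list(params)
        self.ready = {}
        self.handle = None

    def mark_ready(self, param, grad):
        # Record this parameter's gradient; once every parameter in the bucket
        # has reported, launch an asynchronous allreduce so communication
        # overlaps with the remainder of the backward pass.
        self.ready[param] = grad
        if len(self.ready) == len(self.params):
            self.flat = _flatten_dense_tensors([self.ready[p] for p in self.params])
            self.handle = dist.all_reduce(self.flat, async_op=True)

def register_hooks(buckets):
    # Each parameter gets an autograd hook that notifies its bucket as soon
    # as its gradient has been produced.
    for bucket in buckets:
        for p in bucket.params:
            p.register_hook(lambda grad, p=p, b=bucket: b.mark_ready(p, grad))
```

After `backward()` returns, the caller would wait on each bucket's handle and copy the averaged values back into the corresponding `param.grad` tensors.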
Separately, @mcarilli and team made really nice improvements to standard DDP in Apex. The ideal bucketing order is discovered dynamically so that it reflects the execution order instead of the definition order, which can yield significant speedups when the two orderings are mismatched (e.g. layers that are used last but defined first).
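One way such an ordering could be discovered dynamically (illustrative only, not the Apex implementation): register a hook per parameter during the first iteration, record the order in which the hooks fire, and rebuild the buckets in that order for subsequent iterations.

```python
# First iteration: record the order in which gradient hooks fire.
execution_order = []

def record_execution_order(model):
    for idx, p in enumerate(model.parameters()):
        # Returning None leaves the gradient untouched; we only log the index.
        p.register_hook(lambda grad, idx=idx: execution_order.append(idx))

# After the first backward(), execution_order lists parameter indices in the
# order their gradients were produced. Buckets built in this order can start
# reducing earlier when definition order and execution order disagree.
```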
Goals:
- Use the same DDP implementation regardless of backend
- Enable use of DDP for models defined with the C++ frontend
- Discover ideal bucketing order dynamically
Let's use this as a starting point for discussion.