# AllReduce


## Introduction

Data parallelism enables distributed training by communicating gradients before the optimizer step to make sure that parameters of all model replicas are updated using exactly the same set of gradients, and hence model replicas can stay consistent across iterations.
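To make this procedure concrete, here is a minimal sketch of one naive data-parallel training step in Go; `allReduceSum` is a placeholder for a real collective call, and the flat `[]float64` gradients are a simplification rather than the GoTorch API.

```go
// A minimal sketch of one naive data-parallel training step (assumed names,
// not a real API): sum gradients across replicas, average them, then apply
// the optimizer update so every replica stays consistent.
package main

import "fmt"

// allReduceSum stands in for a real collective call (e.g. an NCCL/Gloo AllReduce);
// it is a no-op placeholder here so the sketch runs standalone.
func allReduceSum(grads []float64) {}

func step(params, grads []float64, lr float64, worldSize int) {
	allReduceSum(grads) // 1. sum gradients of all model replicas
	for i := range grads {
		grads[i] /= float64(worldSize) // 2. average, so every replica applies the same gradients
	}
	for i := range params {
		params[i] -= lr * grads[i] // 3. optimizer step (plain SGD here)
	}
}

func main() {
	params := []float64{0.5, -0.3}
	grads := []float64{0.1, 0.2} // pretend these came from backward()
	step(params, grads, 0.01, 4)
	fmt.Println(params)
}
```

In real training the gradients are torch tensors and the AllReduce is carried out by a collective communication backend, as discussed below.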
PyTorch offers several tools to facilitate distributed training, including DataParallel and DistributedDataParallel.

Single-process multi-GPU is not the recommended mode now because of the overhead of scatter/gather and GIL contention in every forward pass. So, let's focus on DistributedDataParallel.


### Collective Communication Library

PyTorch can use different collective communication libraries as the backend, including [NCCL](https://developer.nvidia.com/nccl) and [Gloo](https://github.com/facebookincubator/gloo). NCCL supports GPU, while Gloo supports both CPU and GPU. NCCL's performance on GPU is better than Gloo's, so we use NCCL for GPU training and Gloo for CPU training.
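As a tiny illustration of that rule, the sketch below picks the backend string from the device type; `chooseBackend` and its argument are hypothetical names, not an existing GoTorch API.

```go
// A tiny sketch of the backend choice described above; chooseBackend is a
// hypothetical helper, not an existing API.
package main

import "fmt"

func chooseBackend(useGPU bool) string {
	if useGPU {
		return "nccl" // NCCL performs best on GPUs
	}
	return "gloo" // Gloo covers CPU training
}

func main() {
	fmt.Println(chooseBackend(true)) // prints "nccl"
}
```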
The training samples are partitioned statically in PyTorch distributed training. The [DistributedSampler](https://pytorch.org/docs/stable/_modules/torch/utils/data/distributed.html#DistributedSampler) generates a sequence of sample indices for each training process. Each training process then loads the subset of samples selected by its indices.

**Note:** The dataset is assumed to be of constant size.
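The sketch below shows the static partitioning idea in Go under simplifying assumptions (no shuffling, no padding): every worker keeps the indices `i` with `i % worldSize == rank`. The `partition` helper is hypothetical, not part of DistributedSampler or GoTorch.

```go
// A minimal sketch of static sample partitioning in the spirit of
// DistributedSampler: worker `rank` out of `worldSize` workers keeps every
// worldSize-th sample index, starting at its own rank.
package main

import "fmt"

func partition(numSamples, worldSize, rank int) []int {
	indices := make([]int, 0, (numSamples+worldSize-1)/worldSize)
	for i := rank; i < numSamples; i += worldSize {
		indices = append(indices, i)
	}
	return indices
}

func main() {
	// With 10 samples and 3 workers, rank 1 loads samples 1, 4, and 7.
	fmt.Println(partition(10, 3, 1))
}
```

The real DistributedSampler also shuffles the indices every epoch and pads them so that each process receives the same number of samples.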
### Launch Utility
The `torch.distributed` package provides a launch utility in [torch.distributed.launch](https://github.com/pytorch/pytorch/blob/master/torch/distributed/launch.py). This helper utility can be used to launch multiple processes per node for distributed training. If the utility is used for GPU training, each distributed process will be operating on a single GPU.
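For illustration, here is a minimal worker-side sketch in Go, assuming the launcher exports `RANK`, `WORLD_SIZE`, and `MASTER_ADDR` in the environment, as `torch.distributed.launch` does:

```go
// A minimal sketch of a worker reading the configuration that a launch
// utility passes through environment variables.
package main

import (
	"fmt"
	"os"
	"strconv"
)

func main() {
	rank, _ := strconv.Atoi(os.Getenv("RANK"))
	worldSize, _ := strconv.Atoi(os.Getenv("WORLD_SIZE"))
	masterAddr := os.Getenv("MASTER_ADDR")
	fmt.Printf("worker %d of %d, master at %s\n", rank, worldSize, masterAddr)
	// A real worker would now create its process group and bind to one GPU.
}
```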
The naive implementation of the training procedure in [Introduction](#Introduction) has two performance problems: each gradient tensor triggers its own AllReduce call, and communication does not overlap with the backward computation.

PyTorch applies further optimizations to solve these two problems:
- Bucketing gradients to reduce the overhead of AllReduce kernels (see the sketch below).
- Registering AllReduce kernels as autograd hooks to overlap communication and computation.

For more details, please refer to the [paper](https://arxiv.org/abs/2006.15704).
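The sketch below illustrates the bucketing idea in Go: instead of issuing one AllReduce per gradient tensor, ready gradients are appended to a bucket and a single AllReduce is issued once the bucket fills up. `allReduce` and the `bucketer` type are placeholders for illustration, not PyTorch's or GoTorch's implementation.

```go
// A minimal sketch of gradient bucketing: gradients are copied into a large
// bucket and reduced with one collective call instead of many small ones.
package main

import "fmt"

// allReduce stands in for a real collective kernel; it only reports its work here.
func allReduce(bucket []float64) { fmt.Printf("allreduce of %d elements\n", len(bucket)) }

type bucketer struct {
	capacity int
	bucket   []float64
}

// add appends one gradient tensor (flattened) and flushes when the bucket is full.
func (b *bucketer) add(grad []float64) {
	b.bucket = append(b.bucket, grad...)
	if len(b.bucket) >= b.capacity {
		b.flush()
	}
}

// flush issues a single AllReduce over everything accumulated so far.
func (b *bucketer) flush() {
	if len(b.bucket) == 0 {
		return
	}
	allReduce(b.bucket)
	b.bucket = b.bucket[:0]
}

func main() {
	b := &bucketer{capacity: 4}
	// In DistributedDataParallel, autograd hooks would call add() as each
	// gradient becomes ready during backward().
	b.add([]float64{1, 2})
	b.add([]float64{3, 4, 5}) // bucket reaches capacity, one AllReduce is issued
	b.flush()                 // reduce whatever is left at the end of backward
}
```

Registering the `add` step as an autograd hook on each parameter is what lets the AllReduce of early buckets overlap with the remainder of the backward pass.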
## AllReduce in GoTorch
### Stage 1
We plan to implement the functionality of DistributedDataParallel in GoTorch gradually. At stage 1, we provide a naive solution; a distributed MNIST example is the target of this stage. At stage 2, we will provide an optimized solution, implementing gradient bucketing and autograd hooks.
### RecordIODataLoader
### Go Wrapper of c10d Library
[ProcessGroupNCCL](https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/ProcessGroupNCCL.hpp) implements the NCCL bindings for the c10d library. After adding a Go wrapper for this class, we could do AllReduce on torch tensors in Go.
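Below is a hypothetical sketch of how such a wrapper could be used from Go; none of the names (`Tensor`, `ProcessGroup`, `AllReduce`) are an existing GoTorch API, and a real wrapper would call `ProcessGroupNCCL::allreduce` through cgo.

```go
// A hypothetical sketch of a Go wrapper around c10d's ProcessGroupNCCL;
// the types below are placeholders so the sketch is self-contained.
package main

import "fmt"

// Tensor stands in for a GoTorch tensor backed by libtorch.
type Tensor struct{ data []float64 }

// ProcessGroup stands in for a cgo wrapper over c10d::ProcessGroupNCCL.
type ProcessGroup struct{ rank, size int }

// AllReduce would forward to ProcessGroupNCCL::allreduce through cgo;
// here it only reports what it would do.
func (pg *ProcessGroup) AllReduce(ts []Tensor) {
	fmt.Printf("rank %d: allreduce %d tensors\n", pg.rank, len(ts))
}

func main() {
	pg := &ProcessGroup{rank: 0, size: 2}
	grads := []Tensor{{data: []float64{0.1, 0.2}}}
	pg.AllReduce(grads) // sum gradients across replicas after backward()
}
```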
### Go Launch Utility
Go provides the [os/exec](https://golang.org/pkg/os/exec/) package to spawn processes.
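A minimal sketch of such a launcher built on `os/exec` follows; the worker binary name and the environment variable names mirror `torch.distributed.launch`, but they are assumptions here rather than a fixed interface.

```go
// A minimal sketch of a launch utility: spawn one worker process per rank and
// pass rank information through environment variables.
package main

import (
	"fmt"
	"log"
	"os"
	"os/exec"
)

func main() {
	worldSize := 2
	var workers []*exec.Cmd
	for rank := 0; rank < worldSize; rank++ {
		cmd := exec.Command("./trainer") // hypothetical worker binary
		cmd.Env = append(os.Environ(),
			fmt.Sprintf("RANK=%d", rank),
			fmt.Sprintf("WORLD_SIZE=%d", worldSize),
			"MASTER_ADDR=127.0.0.1",
			"MASTER_PORT=29500",
		)
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Start(); err != nil {
			log.Fatal(err)
		}
		workers = append(workers, cmd)
	}
	for _, w := range workers {
		if err := w.Wait(); err != nil {
			log.Printf("worker exited with error: %v", err)
		}
	}
}
```

Each spawned worker then reads its rank information from the environment, as in the worker-side sketch above.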
### Stage 2
TBD
## Reference