### MPI Collective Operations

Collective operations involve every process in a communicator. They perform:

* Synchronization - processes wait until all members of the group have reached the synchronization point.
* Data Movement - broadcast, scatter/gather, all to all.
* Collective Computation (reductions) - one member of the group collects data from the other members and performs an operation (min, max, add, multiply, etc.) on that data.


<img src="https://hpc-tutorials.llnl.gov/mpi/images/collective_comm.gif" width=512 />

An MPI barrier can be thought of as a synchronous collective message with no data.

> `MPI_Barrier(comm)`: Synchronization operation. Creates a barrier synchronization in a group. Each task, when reaching the MPI_Barrier call, blocks until all tasks in the group reach the same MPI_Barrier call. Then all tasks are free to proceed.
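
The blocking behavior described above can be illustrated without MPI. The sketch below is only an analogy: it uses Python threads and `threading.Barrier` as stand-ins for MPI tasks and the barrier call, and the names `run_barrier_demo`, `task`, and `events` are made up for this example.

```python
import threading

def run_barrier_demo(num_tasks=4):
    """Each thread stands in for one MPI task (an analogy, not MPI itself)."""
    barrier = threading.Barrier(num_tasks)
    lock = threading.Lock()
    events = []  # ordered log of ("arrive" | "leave", rank)

    def task(rank):
        with lock:
            events.append(("arrive", rank))
        barrier.wait()  # blocks until all num_tasks threads have called wait()
        with lock:
            events.append(("leave", rank))

    threads = [threading.Thread(target=task, args=(r,)) for r in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events

events = run_barrier_demo(4)
# Index of the first task that got past the barrier.
first_leave = next(i for i, (kind, _) in enumerate(events) if kind == "leave")
print(events[:first_leave])
```

Every "arrive" entry is logged before the first "leave" entry, regardless of how the threads are scheduled, which is the guarantee a barrier provides.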


#### All Reduce

This section is drawn from https://tech.preferred.jp/en/blog/technologies-behind-distributed-deep-learning-allreduce/.

The critical operation in distributed deep-learning has become all-reduce.

<img src="https://tech.preferred.jp/wp-content/uploads/2018/07/fig_1.png" width=386 />

The blog states:
> In synchronized data-parallel distributed deep learning, the major computation steps are:
> 1. Compute the gradient of the loss function using a minibatch on each GPU.
> 2. Compute the mean of the gradients by inter-GPU communication.
> 3. Update the model.
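
The three steps can be sketched in plain Python. This is a toy illustration under stated assumptions: the model is a single weight `w` fitting `y = w * x`, each "GPU" is just a list of `(x, y)` pairs, and the function names `local_gradient` and `data_parallel_step` are hypothetical. The averaging in step 2 is the part that all-reduce would perform in a real system.

```python
def local_gradient(w, batch):
    """Gradient of mean squared error for y = w * x on one worker's minibatch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def data_parallel_step(w, batches, lr=0.05):
    # Step 1: every worker computes a gradient on its own minibatch.
    grads = [local_gradient(w, batch) for batch in batches]
    # Step 2: average the gradients across workers (the all-reduce step).
    mean_grad = sum(grads) / len(grads)
    # Step 3: every worker applies the same update, so the replicas stay identical.
    return w - lr * mean_grad

# Two workers, each holding a minibatch drawn from y = 2x.
batches = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
w = 0.0
for _ in range(50):
    w = data_parallel_step(w, batches)
print(w)
```

With this data and learning rate the weight converges toward 2, the slope used to generate the minibatches.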

This is an early effort (2017) in the space. It goes on to describe the ring algorithm, which has been used for a long time in HPC. The algorithm takes two passes around the ring.
  * The first pass reduces the data so that each node ends up holding one fully reduced element. It sends $O(n^2)$ messages in total over $n-1$ steps.
  * The second pass distributes the reduced elements to every node, with the same complexity.
  
<img src="https://tech.preferred.jp/wp-content/uploads/2018/07/fig_2.png" width=256 />  
<img src="https://tech.preferred.jp/wp-content/uploads/2018/07/fig_4.png" width=256 />
<img src="https://tech.preferred.jp/wp-content/uploads/2018/07/fig_5.png" width=256 />

All-Reduce is actually a compound operation. It consists of two phases:
  * scatter-reduce: reduce all elements in a distributed fashion
  * all-gather: collect the reduced elements
  
We see this in the two phases of the ring algorithm.
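
The two phases can be simulated in a single process. The sketch below assumes each of the `n` ranks holds an array already split into `n` chunks, and it models one simultaneous send per rank per step. The function name `ring_allreduce` and the chunk schedule are one common way to write the algorithm, not code taken from either blog.

```python
def ring_allreduce(data):
    """Simulate ring all-reduce (sum). data[r] is rank r's list of n chunks,
    each chunk a list of numbers, with n == len(data)."""
    n = len(data)
    bufs = [[list(chunk) for chunk in rank] for rank in data]  # work on a copy
    messages = 0

    # Phase 1: scatter-reduce. In step s, rank r sends chunk (r - s) mod n to
    # its right neighbour, which adds it into its own copy of that chunk.
    for s in range(n - 1):
        sends = [((r - s) % n, list(bufs[r][(r - s) % n])) for r in range(n)]
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            bufs[dst][c] = [a + b for a, b in zip(bufs[dst][c], payload)]
            messages += 1

    # Rank r now holds the fully reduced chunk (r + 1) mod n.
    # Phase 2: all-gather. In step s, rank r forwards chunk (r + 1 - s) mod n
    # and the receiver overwrites its stale copy.
    for s in range(n - 1):
        sends = [((r + 1 - s) % n, list(bufs[r][(r + 1 - s) % n])) for r in range(n)]
        for r, (c, payload) in enumerate(sends):
            bufs[(r + 1) % n][c] = payload
            messages += 1

    return bufs, messages

data = [
    [[1], [2], [3]],
    [[10], [20], [30]],
    [[100], [200], [300]],
]
result, messages = ring_allreduce(data)
print(result[0], messages)  # [[111], [222], [333]] 12
```

Each phase runs for $n-1$ steps with $n$ messages per step, so the simulation counts $2n(n-1)$ messages in total.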

The ring reduce algorithm is implemented by TensorFlow. It is a simple algorithm that has nice properties.

Another [blog](https://andrew.gibiansky.com/blog/machine-learning/baidu-allreduce/) writes:

> In the system we described, each of the N GPUs will send and receive values $N-1$ times for the scatter-reduce, and $N-1$ times for the allgather. Each time, the GPUs will send $K / N$ values, where $K$ is the total number of values in array being summed across the different GPUs. Therefore, the total amount of data transferred to and from every GPU is $2(N-1)\frac{K}{N}$ which, crucially, is independent of the number of GPUs.
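
A quick calculation makes the per-GPU transfer volume, $2(N-1)K/N$, concrete. The helper name `ring_values_per_gpu` and the array size `K` below are chosen only for illustration.

```python
def ring_values_per_gpu(num_gpus, num_values):
    """Values each GPU sends in a ring all-reduce: 2 * (N - 1) * K / N."""
    return 2 * (num_gpus - 1) * num_values / num_gpus

K = 1_000_000  # illustrative array size
volumes = {n: ring_values_per_gpu(n, K) for n in (2, 4, 8, 64, 1024)}
for n, v in volumes.items():
    print(n, v)
```

Strictly speaking the expression grows slightly with $N$, but it never exceeds $2K$, so the per-GPU traffic is effectively constant as GPUs are added.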
  
  





### All-Reduce Strategies

There are a variety of more complex reduction strategies that have latency/data tradeoffs. A [2013 paper](https://arxiv.org/abs/1312.3020) characterizes the tradeoffs.

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*-hwfc8HGjvbOpUR4.png" width=386 />

We cannot apply a simple work/span analysis to these algorithms because the message sizes and the amount of work per node vary. For example, in the tree all-reduce:
  * messages in the first half of the protocol double in size
  * messages in the second half of the tree are the entire array

Tree distribution does send a minimum amount of data, at the expense of poor parallelism.
  
Round robin all-reduce has asymptotically optimal bandwidth. The downside is that it sends lots of messages. It is perfectly parallel.

Butterfly networks minimize rounds of communication. They can also be mapped onto HPC networks with physical restrictions: the reduction pattern can be matched to the specific network topology.
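
A minimal sketch of a butterfly exchange pattern is shown below. It uses the recursive-doubling variant, assumes the number of ranks is a power of two, and reduces one scalar per rank for brevity. The function name `butterfly_allreduce` is hypothetical.

```python
def butterfly_allreduce(values):
    """Simulate recursive-doubling all-reduce (sum) over len(values) ranks."""
    n = len(values)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("this sketch assumes a power-of-two number of ranks")
    current = list(values)
    rounds = 0
    distance = 1
    while distance < n:
        # Every rank swaps its partial sum with the partner at rank XOR distance.
        current = [current[r] + current[r ^ distance] for r in range(n)]
        distance *= 2
        rounds += 1
    return current, rounds

totals, rounds = butterfly_allreduce([1, 2, 3, 4, 5, 6, 7, 8])
print(totals, rounds)  # [36, 36, 36, 36, 36, 36, 36, 36] 3
```

With 8 ranks every rank holds the full sum after $\log_2 8 = 3$ rounds. In this variant each message carries the whole partial result, so the small round count is paid for with more data per message.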



### NCCL: NVIDIA Collective Communications Library

> The NVIDIA Collective Communications Library (NCCL, pronounced “Nickel”) is a library providing inter-GPU communication primitives that are topology-aware and can be easily integrated into applications.

> NCCL implements both collective communication and point-to-point send/receive primitives. It is not a full-blown parallel programming framework; rather, it is a library focused on accelerating inter-GPU communication.

Essentially, NCCL is a set of MPI routines for GPUs over NVidia 

> The AllGather operation gathers N values from k ranks into an output of size k*N, and distributes that result to all ranks. The output is ordered by rank index. The AllGather operation is therefore impacted by a different rank or device mapping.
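
The semantics in the quote can be modeled in a few lines of plain Python. This is not the NCCL API: `all_gather` is a hypothetical helper that takes one send buffer per rank and returns the output each rank would see.

```python
def all_gather(send_buffers):
    """send_buffers[r] holds the N values contributed by rank r.
    Returns one output per rank: all contributions concatenated in rank order."""
    gathered = [value for buf in send_buffers for value in buf]
    return [list(gathered) for _ in send_buffers]

# k = 3 ranks, N = 2 values each, so every rank receives k * N = 6 values.
outputs = all_gather([["a0", "a1"], ["b0", "b1"], ["c0", "c1"]])
print(outputs[0])  # ['a0', 'a1', 'b0', 'b1', 'c0', 'c1']

# Assigning the same data to different ranks changes the output order.
remapped = all_gather([["c0", "c1"], ["a0", "a1"], ["b0", "b1"]])
print(remapped[0])  # ['c0', 'c1', 'a0', 'a1', 'b0', 'b1']
```

The second call shows why the result depends on the rank or device mapping: the output is ordered by rank index, not by the data itself.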

<img src="https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/_images/allgather.png" width=512 />
