<!-- ---
title: Collective Communication with Ignite
weight: 5
date: 2021-10-5
downloads: true
sidebar: true
tags:
  - idist
  - all_gather
  - all_reduce
  - broadcast
  - barrier
--- -->

# Collective Communication with Ignite

This tutorial shows how to use Ignite's collective communication functions `all_reduce()`, `all_gather()`, `broadcast()` and `barrier()`. Each one is introduced with a typical use case and, where available, a diagram.

<!--more-->

## Required Dependencies

In [None]:
!pip install pytorch-ignite

## Imports

In [2]:
import torch

import ignite.distributed as idist

## All Reduce

![All Reduce Diagram](assets/all-reduce.png)

The [`all_reduce()`](https://pytorch.org/ignite/distributed.html#ignite.distributed.utils.all_reduce) method combines a tensor from every process with a specified operation (sum, product, min, max, etc.) and makes the result available on every process. For example, you may need to average the gradients that are held by different processes:

> First, we get the number of GPUs available, with the get_world_size method. Then, for every model parameter, we do the following:
>
>    1. Gather the gradients on each process
>    2. Apply the sum operation on the gradients
>    3. Divide by the world size to average them
>
> Finally, we can go on to update the model parameters using the averaged gradients!
>
> -- <cite>[Distributed Deep Learning 101: Introduction](https://towardsdatascience.com/distributed-deep-learning-101-introduction-ebfc1bcd59d9)</cite>

You can get the number of GPUs (processes) available using another helper method [`idist.get_world_size()`](https://pytorch.org/ignite/distributed.html#ignite.distributed.utils.get_world_size) and then use `all_reduce()` to collect the gradients and apply the SUM operation.

In [None]:
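def average_gradients(model):
    num_processes = idist.get_world_size()
    for param in model.parameters():
        if param.grad is None:
            continue  # skip parameters that did not receive a gradient
        # Sum the gradient across all processes, then divide to get the mean.
        # The returned tensor is used instead of relying on in-place modification.
        param.grad.data = idist.all_reduce(param.grad.data, op="SUM") / num_processes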
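
To make the sum-then-divide steps concrete without launching several processes, here is a minimal pure-Python sketch that simulates what each rank would hold after the reduction. `simulate_all_reduce` and the gradient values are hypothetical and only illustrate the semantics; they are not part of Ignite's API.

```python
# Pure-Python simulation of all_reduce semantics; no processes are launched.
def simulate_all_reduce(per_rank_values, op="SUM"):
    """Return what each rank would hold after an all_reduce."""
    ops = {"SUM": sum, "MIN": min, "MAX": max}
    # reduce element-wise across ranks
    reduced = [ops[op](column) for column in zip(*per_rank_values)]
    # every rank ends up with its own copy of the same reduced result
    return [list(reduced) for _ in per_rank_values]


# hypothetical gradients held by 3 processes for a 2-element parameter
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
world_size = len(grads)

summed = simulate_all_reduce(grads, op="SUM")
averaged = [[g / world_size for g in rank] for rank in summed]
print(averaged[0])  # [3.0, 4.0]
```

Every simulated rank ends up with the same averaged values, which is why each process can then apply an identical parameter update.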

## All Gather

![All Gather Diagram](assets/all-gather.png)

The [`all_gather()`](https://pytorch.org/ignite/distributed.html#ignite.distributed.utils.all_gather) method is used when you just want to collect a tensor, number or string from all participating processes. For example, to store predictions that are distributed across the processes in a single file, you can gather them and then write the file from the main process only:

In [None]:
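def write_preds_to_file(predictions, filename):
    prediction_tensor = torch.tensor(predictions)
    # After this call, every process holds the predictions of all processes
    prediction_tensor = idist.all_gather(prediction_tensor)

    # Write from the main process only, so the file is saved once
    if idist.get_rank() == 0:
        torch.save(prediction_tensor, filename)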
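
As a rough illustration of what the gathered result looks like, the following pure-Python sketch simulates the operation for three processes. `simulate_all_gather` and the prediction values are hypothetical, and the sketch assumes the per-process values are concatenated in rank order.

```python
# Pure-Python simulation of all_gather semantics; no processes are launched.
def simulate_all_gather(per_rank_values):
    """Return what each rank would hold after an all_gather."""
    gathered = []
    for values in per_rank_values:  # concatenate in rank order: 0, 1, 2, ...
        gathered.extend(values)
    # every rank receives its own copy of the full result
    return [list(gathered) for _ in per_rank_values]


# hypothetical predictions held by 3 processes
predictions = [[0, 1], [1, 1], [0, 0]]
after = simulate_all_gather(predictions)
print(after[0])              # [0, 1, 1, 1, 0, 0]
print(after[0] == after[2])  # True
```

Because every rank ends up with the full result, checking the rank before writing is what keeps the file from being saved once per process.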

## Broadcast

![Broadcast Diagram](assets/broadcast.png)

The [`broadcast()`](https://pytorch.org/ignite/distributed.html#ignite.distributed.utils.broadcast) method copies a tensor, float or string from a source process to all the other processes. 

For example, suppose you want to compute a metric on rank 0 only, which avoids running the computation, and risking a memory error, on every process. You can do this by first collecting the predicted and actual values with `all_gather()`, then computing the metric on rank 0, and finally using `broadcast()` to share the result with all processes. `src` below refers to the rank of the source process.

In [None]:
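def compute_metric(prediction_tensor, target_tensor):
    # Collect predictions and targets from every process
    prediction_tensor = idist.all_gather(prediction_tensor)
    target_tensor = idist.all_gather(target_tensor)

    # Placeholder value on the non-source ranks; it is overwritten by the broadcast
    result = 0.0
    if idist.get_rank() == 0:
        # compute_fn is a user-defined metric function (not defined in this tutorial)
        result = compute_fn(prediction_tensor, target_tensor)

    # Share the value computed on rank 0 with all processes
    result = idist.broadcast(result, src=0)

    return result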
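
The gather, compute, broadcast sequence can be traced with a small pure-Python sketch. Everything here is hypothetical and only mimics the semantics for two processes: `simulate_broadcast` stands in for the broadcast step, and `accuracy` stands in for the user-defined `compute_fn`.

```python
# Pure-Python simulation of the gather -> compute on rank 0 -> broadcast pattern.
def simulate_broadcast(per_rank_values, src=0):
    """Every rank ends up with the value held by the source rank."""
    return [per_rank_values[src] for _ in per_rank_values]


def accuracy(predictions, targets):
    """Stand-in for compute_fn: fraction of predictions that match the targets."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# hypothetical predictions and targets held by rank 0 and rank 1
preds = [[1, 0], [1, 1]]
targets = [[1, 1], [1, 0]]
world_size = len(preds)

# all_gather step: concatenate the per-rank values in rank order
all_preds = [p for rank in preds for p in rank]
all_targets = [t for rank in targets for t in rank]

# only rank 0 computes the metric; the other ranks keep the 0.0 placeholder
results = [accuracy(all_preds, all_targets) if rank == 0 else 0.0
           for rank in range(world_size)]
results = simulate_broadcast(results, src=0)
print(results)  # [0.5, 0.5]
```

Before the broadcast only rank 0 holds the real value; afterwards the placeholder on the other rank has been replaced by it.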

## Barrier

The [`barrier()`](https://pytorch.org/ignite/distributed.html#ignite.distributed.utils.barrier) method synchronizes all processes: each process blocks at the barrier until every process has reached it. For example, when datasets are downloaded at the start of training, only the main process (`rank = 0`) should download them, so that the other processes (`rank > 0`) do not download the same file to the same path at the same time. This is where `barrier()` helps: the other processes wait at the barrier while the main process downloads the datasets. When the main process reaches its own `barrier()` call after downloading, all processes are released together, and the other processes instantiate the datasets from the copy that is already on disk. Note that the code below checks `idist.get_local_rank()`, so in a multi-node setup one process per node performs the download.

In [None]:
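def get_datasets(config):
    # Every process except local rank 0 waits here until the download is done
    if idist.get_local_rank() > 0:
        idist.barrier()

    # get_train_test_datasets is a user-defined helper (not defined in this tutorial)
    # that downloads the data if it is missing and returns the two datasets
    train_dataset, test_dataset = get_train_test_datasets(config["data_path"])

    # Local rank 0 has finished downloading and releases the waiting processes
    if idist.get_local_rank() == 0:
        idist.barrier()

    return train_dataset, test_dataset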
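
The ordering guarantee of this pattern can be checked on a single machine with threads standing in for processes. This is only a sketch of the idea: it uses Python's `threading.Barrier` rather than `idist.barrier()`, and the `"download"` and `"load_cached"` events are hypothetical stand-ins for the dataset helper.

```python
import threading

WORLD_SIZE = 3
barrier = threading.Barrier(WORLD_SIZE)  # stand-in for idist.barrier()
events = []                              # ordered log of what each rank did
lock = threading.Lock()


def get_dataset(rank):
    if rank > 0:
        barrier.wait()  # wait until rank 0 has finished downloading

    # rank 0 "downloads"; the others "load" the already downloaded copy
    with lock:
        events.append(("download" if rank == 0 else "load_cached", rank))

    if rank == 0:
        barrier.wait()  # download is done: release the waiting ranks


threads = [threading.Thread(target=get_dataset, args=(r,)) for r in range(WORLD_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(events[0])  # ('download', 0)
```

Each rank calls the barrier exactly once, and rank 0 only reaches it after its download step, so the download is always the first logged event no matter how the threads are scheduled.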