
<p> <center> <a href="../Start_Here.ipynb">Home Page</a> </center> </p>

 
<div>
    <span style="float: left; width: 33%; text-align: left;"><a href="3.Hands-on-Multi-GPU.ipynb">Previous Notebook</a></span>
    <span style="float: left; width: 33%; text-align: center;">
        <a href="1.Introduction-to-Distributed-Deep-Learning.ipynb">1</a>
        <a href="2.1.System-Topology.ipynb">2</a>
        <a href="3.Hands-on-Multi-GPU.ipynb">3</a>
        <a >4</a>
    </span>
</div>



# Introduction to Distributed Deep Learning - Part 4

**Table of Contents**

- [**Challenges with convergence**](#Challenges-with-convergence)
    - [Concepts](#Concepts)
        - [Impact of Batch size](#Impact-of-Batch-size)
        - [Impact on test and validation accuracy](#Impact-on-test-and-validation-accuracy)
    - [Techniques for faster convergence](#Techniques-for-faster-convergence)
        - [Batch normalisation](#Batch-normalisation)
        - [Learning rate scaling](#Learning-rate-scaling)
        - [Learning rate warmup](#Learning-rate-warmup)
        - [Using Optimizers built for Exascale Deep learning](#Using-Optimizers-built-for-Exascale-Deep-learning)

**The objectives of this notebook are to:**

- Clarify why convergence slows down as the batch size increases.
- Learn techniques that make convergence faster.

# Challenges with convergence

We noticed in the previous notebook that distributed training with 2 or more GPUs had higher throughput than single-GPU training, but when we focused on convergence, the distributed run converged more slowly than the single-GPU run.
*Quoted from the previous notebook*: If we take a closer look at the results, we find that after 8 epochs the single-GPU run has a **loss of 0.0487 and accuracy of 0.9851**, whereas the 8-GPU run has a **loss of 0.3582 and accuracy of 0.8957**.

We also noticed that the accuracy took a bigger drop during training. Let's try to understand why this happens and which techniques can help.

## Concepts


### Impact of Batch size
In the paper **Measuring the Effects of Data Parallelism on Neural Network Training** by **Christopher J. Shallue, Jaehoon Lee, Joseph Antognini, Jascha Sohl-Dickstein, Roy Frostig, George E. Dahl**, we see a relationship between the batch size and the number of steps needed to converge.


<center><img src="images/paper1.png" width="700"/></center>


To understand this phenomenon, the term *critical batch size* was coined in the paper **An Empirical Model of Large-Batch Training** by **Sam McCandlish, Jared Kaplan, Dario Amodei, OpenAI Dota Team**.

#### Critical batch size 

The critical batch size is the point at which compute efficiency drops below 50% of optimal and larger batch sizes yield diminishing returns.
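The trade-off behind this definition can be illustrated numerically. The snippet below is a minimal sketch, assuming the simple model from the McCandlish et al. paper in which the number of optimizer steps is $S = S_{min}(1 + B_{crit}/B)$ and the number of processed examples is $E = S \cdot B$. The function name and the values of `S_MIN` and `B_CRIT` are made up for illustration and are not measurements from this notebook.

```python
def training_cost(batch_size, s_min, b_crit):
    """Return (steps, examples) needed to reach a target loss under the
    simple large-batch model S = S_min * (1 + B_crit / B), E = S * B."""
    steps = s_min * (1.0 + b_crit / batch_size)
    examples = steps * batch_size
    return steps, examples


# Hypothetical values, chosen only for illustration.
S_MIN = 1_000           # fewest steps achievable with a very large batch
B_CRIT = 512            # critical batch size
E_MIN = S_MIN * B_CRIT  # fewest examples achievable with a very small batch

for b in (64, 512, 4096, 32768):
    steps, examples = training_cost(b, S_MIN, B_CRIT)
    efficiency = E_MIN / examples  # fraction of optimal compute efficiency
    print(f"B={b:>6}  steps={steps:>8.0f}  compute efficiency={efficiency:.2f}")
```

Under this model, the run at `B = B_CRIT` needs twice the minimum number of steps and reaches 50% compute efficiency; larger batches reduce the step count only slightly while wasting more compute.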


It is found that we can approximately predict the maximum useful batch size by measuring the gradient noise scale, a simple statistic that quantifies the signal-to-noise ratio of the network gradients. Heuristically, the noise scale measures the variation in the data as seen by the model (at a given stage in training). When the noise scale is small, looking at a lot of data in parallel quickly becomes redundant, whereas when it is large, we can still learn a lot from huge batches of data.
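As a rough illustration of this statistic, the sketch below computes a naive estimate of the "simple" noise scale $B_{simple} = \mathrm{tr}(\Sigma) / |G|^2$, where $G$ is the true gradient and $\Sigma$ is the covariance of the per-example gradients. It assumes per-example gradients are already available as rows of an array; the function name and the synthetic data are hypothetical, and this plug-in estimate is biased, so treat it as a teaching aid and not as the estimator used in the paper.

```python
import numpy as np


def simple_noise_scale(per_example_grads):
    """Naive estimate of B_simple = tr(Sigma) / |G|^2.

    per_example_grads: array of shape (num_examples, num_params), one
    flattened gradient per training example.
    """
    grads = np.asarray(per_example_grads, dtype=np.float64)
    mean_grad = grads.mean(axis=0)                 # estimate of the true gradient G
    trace_sigma = grads.var(axis=0, ddof=1).sum()  # trace of the per-example covariance
    return trace_sigma / np.dot(mean_grad, mean_grad)


rng = np.random.default_rng(0)
signal = np.ones(10)  # hypothetical "true" gradient shared by all examples

low_noise = signal + 0.1 * rng.standard_normal((1000, 10))
high_noise = signal + 10.0 * rng.standard_normal((1000, 10))

print(simple_noise_scale(low_noise))   # small: large batches quickly become redundant
print(simple_noise_scale(high_noise))  # large: large batches still help
```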

Below is the critical batch size of some popular networks.

<center><img src="images/noise-summary-3.svg" width="720"/></center>


Now we will dissect another problem that we face in large-batch training. 

### Impact on test and validation accuracy 

It is found that test/validation accuracy decreases as the batch size increases. We saw this in our previous notebook, and it is also visible in the figure below from the paper **ImageNet Training in Minutes** by `Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer`.

<center><img src="images/accuracy.png"/></center>

This lack of generalization ability arises because large-batch methods tend to converge to sharp minimizers of the training function. These minimizers are characterized by a significant number of large positive eigenvalues in the Hessian $\nabla^2 f(x)$ and tend to generalize less well. In contrast, small-batch methods converge to flat minimizers, characterized by numerous small eigenvalues of $\nabla^2 f(x)$. It has been observed that the loss landscape of deep neural networks is such that large-batch methods are attracted to regions with sharp minimizers and, unlike small-batch methods, are unable to escape the basins of attraction of these minimizers.

<center><img src="images/minima.png" width="600"/></center>
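To make the eigenvalue criterion concrete, the following sketch compares the Hessian eigenvalues of two toy quadratic losses at their minimum. The two loss functions and the finite-difference helper are hypothetical illustrations, not part of the training scripts used in this notebook; for a real network the Hessian is far too large to form explicitly in this way.

```python
import numpy as np


def numerical_hessian(f, x, eps=1e-4):
    """Approximate the Hessian of f at x with central finite differences."""
    x = np.asarray(x, dtype=np.float64)
    n = x.size
    steps = np.eye(n) * eps
    hessian = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            hessian[i, j] = (
                f(x + steps[i] + steps[j]) - f(x + steps[i] - steps[j])
                - f(x - steps[i] + steps[j]) + f(x - steps[i] - steps[j])
            ) / (4.0 * eps ** 2)
    return hessian


def sharp_loss(w):
    # Toy loss with a narrow, steep minimum at the origin.
    return 50.0 * w[0] ** 2 + 40.0 * w[1] ** 2


def flat_loss(w):
    # Toy loss with a wide, shallow minimum at the origin.
    return 0.5 * w[0] ** 2 + 0.1 * w[1] ** 2


minimum = np.zeros(2)
print("sharp:", np.linalg.eigvalsh(numerical_hessian(sharp_loss, minimum)))
print("flat: ", np.linalg.eigvalsh(numerical_hessian(flat_loss, minimum)))
```

The sharp minimum shows large positive eigenvalues, so a small shift between the training and test functions changes the loss a lot; the flat minimum has small eigenvalues and is much less sensitive to such a shift.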

This convergence to sharp minima, combined with the difference between the training and test functions, leads to a higher validation loss.
Now that we understand the challenges with convergence, let's look at some techniques to improve the convergence rate.

`To do so, we will use a different dataset, CIFAR-10, whose images have 3 channels (RGB) and which is slightly more complex than FMNIST, whose images have only 1 channel.`

Let's start by benchmarking CIFAR-10 with a single GPU and with 8 GPUs, and compare the convergence using training accuracy.

In [1]:
SINGULARITY_RUN="singularity run --nv --env TF_CPP_MIN_LOG_LEVEL=3  ~/DDL.simg "

In [2]:
COMMAND = SINGULARITY_RUN + ' horovodrun -np 1 python3 ../source_code/N4/cifar_base.py --batch-size=8192 2> /dev/null'
!echo $COMMAND > command && srun --partition=gpu -n1 --gres=gpu:1 /bin/bash ./command
#!TF_CPP_MIN_LOG_LEVEL=3 horovodrun -np 1 python3 ../source_code/N4/cifar_base.py --batch-size=512 2> /dev/null

[1,0]<stdout>:Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
[1,0]<stdout>:Epoch 1/12
[1,0]<stdout>:Epoch time : 21.723037481307983
[1,0]<stdout>:Images/sec: 2301.7
[1,0]<stdout>:Epoch 2/12
[1,0]<stdout>:Epoch time : 1.3986859321594238
[1,0]<stdout>:Images/sec: 35747.84
[1,0]<stdout>:Epoch 3/12
[1,0]<stdout>:Epoch time : 0.6776523590087891
[1,0]<stdout>:Images/sec: 73784.15
[1,0]<stdout>:Epoch 4/12
[1,0]<stdout>:Epoch time : 0.6524465084075928
[1,0]<stdout>:Images/sec: 76634.63
[1,0]<stdout>:Epoch 5/12
[1,0]<stdout>:Epoch time : 0.6554052829742432
[1,0]<stdout>:Images/sec: 76288.67
[1,0]<stdout>:Epoch 6/12
[1,0]<stdout>:Epoch time : 0.6544959545135498
[1,0]<stdout>:Images/sec: 76394.67
[1,0]<stdout>:Epoch 7/12
[1,0]<stdout>:Epoch time : 0.6476020812988281
[1,0]<stdout>:Images/sec: 77207.91
[1,0]<stdout>:Epoch 8/12
[1,0]<stdout>:Epoch time : 0.6485230922698975
[1,0]<stdout>:Images/sec: 77098.26
[1,0]<stdout>:Epoch 9/12
[1,0]<stdout>:Epoch time : 0.64487528

In [3]:
COMMAND = SINGULARITY_RUN + ' horovodrun -np 2 --mpi-args="--oversubscribe" python3 ../source_code/N4/cifar_base.py --batch-size=8192 2> /dev/null'
!echo $COMMAND > command && srun --partition=gpu -n1 --gres=gpu:2 /bin/bash ./command
#!TF_CPP_MIN_LOG_LEVEL=3 horovodrun -np 8 --mpi-args="--oversubscribe" python3 ../source_code/N4/cifar_base.py --batch-size=512 2> /dev/null

[1,0]<stdout>:Epoch 1/12
[1,0]<stdout>:Epoch time : 23.96260094642639
[1,0]<stdout>:Images/sec: 2086.58
[1,0]<stdout>:Epoch 2/12
[1,0]<stdout>:Epoch time : 2.6798129081726074
[1,0]<stdout>:Images/sec: 18658.02
[1,0]<stdout>:Epoch 3/12
[1,0]<stdout>:Epoch time : 2.242502212524414
[1,0]<stdout>:Images/sec: 22296.52
[1,0]<stdout>:Epoch 4/12
[1,0]<stdout>:Epoch time : 0.9757363796234131
[1,0]<stdout>:Images/sec: 51243.35
[1,0]<stdout>:Epoch 5/12
[1,0]<stdout>:Epoch time : 0.8617544174194336
[1,0]<stdout>:Images/sec: 58021.17
[1,0]<stdout>:Epoch 6/12
[1,0]<stdout>:Epoch time : 0.5665063858032227
[1,0]<stdout>:Images/sec: 88260.26
[1,0]<stdout>:Epoch 7/12
[1,0]<stdout>:Epoch time : 0.5755453109741211
[1,0]<stdout>:Images/sec: 86874.13
[1,0]<stdout>:Epoch 8/12
[1,0]<stdout>:Epoch time : 0.536384105682373
[1,0]<stdout>:Images/sec: 93216.78
[1,0]<stdout>:Epoch 9/12
[1,0]<stdout>:Epoch time : 0.5372896194458008
[1,0]<stdout>:Images/sec: 93059.68
[1,0]<stdout>:Epoch 10/12
[1,0]<stdout>:Epoch time


Now that we have a baseline, let us try to improve the convergence rate of the models.

# Techniques for faster convergence

### Batch normalisation

The paper **Train longer, generalize better: closing the generalization gap in large batch training of neural networks** by `Elad Hoffer, Itay Hubara, Daniel Soudry` introduced Ghost Batch Normalization to reduce the generalization error.

Batch Normalization is known to accelerate training, increase the robustness of the neural network to different initialization schemes, and improve generalization. Nonetheless, since it uses batch statistics, it is bound to depend on the chosen batch size. The authors study this dependency and observe that acquiring the statistics on small virtual ("ghost") batches instead of the real large batch reduces the generalization error substantially.

In a multi-GPU setting, this can be implemented by normalizing the smaller batch present on each GPU instead of the larger batch across all GPUs.
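The idea of ghost batches can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: the function name `ghost_batch_norm` is hypothetical, the learnable scale and shift and the running statistics of a real Batch Normalization layer are omitted, and the input is a plain 2-D array of features.

```python
import numpy as np


def ghost_batch_norm(x, ghost_batch_size, eps=1e-5):
    """Normalize each ghost batch with its own mean and variance.

    x: array of shape (batch_size, num_features); batch_size must be a
    multiple of ghost_batch_size. Learnable scale/shift and running
    statistics are omitted for brevity.
    """
    batch_size, num_features = x.shape
    if batch_size % ghost_batch_size != 0:
        raise ValueError("batch_size must be divisible by ghost_batch_size")
    ghosts = x.reshape(-1, ghost_batch_size, num_features)
    mean = ghosts.mean(axis=1, keepdims=True)  # per-ghost-batch mean
    var = ghosts.var(axis=1, keepdims=True)    # per-ghost-batch variance
    normalized = (ghosts - mean) / np.sqrt(var + eps)
    return normalized.reshape(batch_size, num_features)


rng = np.random.default_rng(0)
large_batch = rng.normal(loc=3.0, scale=2.0, size=(8192, 4))
out = ghost_batch_norm(large_batch, ghost_batch_size=128)
print(out[:128].mean(axis=0))  # close to 0 for the first ghost batch
print(out[:128].std(axis=0))   # close to 1 for the first ghost batch
```

With data-parallel training, each worker's local batch plays the role of one ghost batch, which is why an ordinary, non-synchronized Batch Normalization layer on every GPU has a similar effect.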

Let us now implement Batch Norm on every GPU and test its performance.

This can be done by adding `BatchNormalization` layers to the model using the Keras API:

```python3
X = BatchNormalization(axis=3)(X)
```

In [4]:
COMMAND = SINGULARITY_RUN + ' horovodrun -np 2 --mpi-args="--oversubscribe" python3 ../source_code/N4/cifar_batch_norm.py --batch-size=8192 2> /dev/null'
!echo $COMMAND > command && srun --partition=gpu -n1 --gres=gpu:2 /bin/bash ./command
#!TF_CPP_MIN_LOG_LEVEL=3 horovodrun -np 8 --mpi-args="--oversubscribe" python3 ../source_code/N4/cifar_batch_norm.py --batch-size=512 2> /dev/null

[1,0]<stdout>:Epoch 1/12
[1,0]<stdout>:Epoch time : 29.39343810081482
[1,0]<stdout>:Images/sec: 1701.06
[1,0]<stdout>:Epoch 2/12
[1,0]<stdout>:Epoch time : 3.5292608737945557
[1,0]<stdout>:Images/sec: 14167.27
[1,0]<stdout>:Epoch 3/12
[1,0]<stdout>:Epoch time : 1.8264081478118896
[1,0]<stdout>:Images/sec: 27376.14
[1,0]<stdout>:
[1,0]<stdout>:Epoch 3: finished gradual learning rate warmup to 0.002.
[1,0]<stdout>:
[1,0]<stdout>:Epoch 4/12
[1,0]<stdout>:Epoch time : 0.8266372680664062
[1,0]<stdout>:Images/sec: 60486.02
[1,0]<stdout>:Epoch 5/12
[1,0]<stdout>:Epoch time : 0.8698344230651855
[1,0]<stdout>:Images/sec: 57482.2
[1,0]<stdout>:Epoch 6/12
[1,0]<stdout>:Epoch time : 0.7305412292480469
[1,0]<stdout>:Images/sec: 68442.41
[1,0]<stdout>:Epoch 7/12
[1,0]<stdout>:Epoch time : 0.6021847724914551
[1,0]<stdout>:Images/sec: 83030.99
[1,0]<stdout>:Epoch 8/12
[1,0]<stdout>:Epoch time : 0.6670920848846436
[1,0]<stdout>:Images/sec: 74952.17
[1,0]<stdout>:Epoch 9/12
[1,0]<stdout>:Epoch time : 0.6

### Learning rate scaling

In the research paper **Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour**, researchers at Facebook introduced the Linear Scaling Rule, which states: `When the minibatch size is multiplied by k, multiply the learning rate by k.`

With Horovod, this can be implemented using the following lines.

```python 
scaled_lr = 0.001 * hvd.size()       # Scale the learning rate by the number of GPUs
opt = tf.optimizers.Adam(scaled_lr)  # Use the scaled learning rate in the optimizer
opt = hvd.DistributedOptimizer(opt)  # Wrap the optimizer with hvd.DistributedOptimizer()
```
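The arithmetic behind the rule is small enough to write out. The helper below is a hypothetical illustration (the function name and the batch sizes are assumptions, not taken from the training scripts): when every worker keeps the same per-GPU batch size, the global batch grows by a factor of `hvd.size()`, which is why the Horovod snippet multiplies the learning rate by that value.

```python
def linearly_scaled_lr(base_lr, base_batch_size, batch_size):
    """Linear Scaling Rule: multiply the learning rate by k = batch_size / base_batch_size."""
    k = batch_size / base_batch_size
    return base_lr * k


# Hypothetical numbers: 0.001 tuned for a batch of 512 on one GPU,
# then 8 GPUs each processing 512 samples (global batch of 4096).
print(linearly_scaled_lr(0.001, 512, 8 * 512))  # 0.008
```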

Now, let's try running the cell with a scaled learning rate. 


In [5]:
COMMAND = SINGULARITY_RUN + ' horovodrun -np 2 --mpi-args="--oversubscribe" python3 ../source_code/N4/cifar_scalelr.py --batch-size=8192 2> /dev/null'
!echo $COMMAND > command && srun --partition=gpu -n1 --gres=gpu:2 /bin/bash ./command
#!TF_CPP_MIN_LOG_LEVEL=3 horovodrun -np 8 --mpi-args="--oversubscribe" python3 ../source_code/N4/cifar_scalelr.py --batch-size=512 2> /dev/null

[1,0]<stdout>:Epoch 1/12
[1,0]<stdout>:Epoch time : 27.99455165863037
[1,0]<stdout>:Images/sec: 1786.06
[1,0]<stdout>:Epoch 2/12
[1,0]<stdout>:Epoch time : 3.2106778621673584
[1,0]<stdout>:Images/sec: 15573.04
[1,0]<stdout>:Epoch 3/12
[1,0]<stdout>:Epoch time : 1.9681758880615234
[1,0]<stdout>:Images/sec: 25404.23
[1,0]<stdout>:Epoch 4/12
[1,0]<stdout>:Epoch time : 0.7592966556549072
[1,0]<stdout>:Images/sec: 65850.42
[1,0]<stdout>:Epoch 5/12
[1,0]<stdout>:Epoch time : 0.8540310859680176
[1,0]<stdout>:Images/sec: 58545.88
[1,0]<stdout>:Epoch 6/12
[1,0]<stdout>:Epoch time : 0.868973970413208
[1,0]<stdout>:Images/sec: 57539.12
[1,0]<stdout>:Epoch 7/12
[1,0]<stdout>:Epoch time : 0.6741039752960205
[1,0]<stdout>:Images/sec: 74172.53
[1,0]<stdout>:Epoch 8/12
[1,0]<stdout>:Epoch time : 0.7635788917541504
[1,0]<stdout>:Images/sec: 65481.12
[1,0]<stdout>:Epoch 9/12
[1,0]<stdout>:Epoch time : 0.7279303073883057
[1,0]<stdout>:Images/sec: 68687.89
[1,0]<stdout>:Epoch 10/12
[1,0]<stdout>:Epoch tim


You might notice that this time the training accuracy did not converge. This is because the linear scaling rule breaks down when the network is changing rapidly, which commonly occurs in the early stages of training. This issue can be alleviated by a properly designed warmup.

**Note**: You might get lucky and see it converge, so try running the above cell multiple times to observe the effect of the larger batch size.

### Learning rate warmup

Learning rate warmup gradually ramps up the learning rate from a small value to a large one. This ramp avoids a sudden jump in the learning rate, allowing healthy convergence at the start of training. In practice, with a large minibatch of size $kn$, we start from a learning rate of $\eta$ and increment it by a constant amount at each iteration such that it reaches $\hat{\eta} = k\eta$ after 5 epochs. After the warmup phase, we go back to the original learning rate schedule.
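The constant-increment ramp described above can be sketched as a small schedule function. This is a minimal illustration, assuming a linear ramp measured in iterations; the function name, the parameter names, and the numbers in the usage loop are hypothetical, and any decay schedule that follows the warmup is left out.

```python
def warmup_lr(iteration, base_lr, k, warmup_iterations):
    """Linearly ramp the learning rate from base_lr to k * base_lr.

    After warmup_iterations the target rate k * base_lr is returned;
    any later decay schedule is left out of this sketch.
    """
    target_lr = k * base_lr
    if iteration >= warmup_iterations:
        return target_lr
    increment = (target_lr - base_lr) / warmup_iterations  # constant per-iteration step
    return base_lr + increment * iteration


# Hypothetical setup: base rate 0.001, 8 workers, warmup over 30 iterations.
for it in (0, 15, 30, 100):
    print(it, warmup_lr(it, base_lr=0.001, k=8, warmup_iterations=30))
```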

Learning rate warmup can be implemented in Horovod using the following callback.

```python
# Horovod: using the fully scaled learning rate (`scaled_lr`) from the very beginning
# leads to worse final accuracy, so ramp the learning rate up to `scaled_lr`
# during the first three epochs.
hvd.callbacks.LearningRateWarmupCallback(initial_lr=scaled_lr, warmup_epochs=3, verbose=1)
```

Let us now run the training with warmup enabled.

In [6]:
COMMAND = SINGULARITY_RUN + ' horovodrun -np 2 --mpi-args="--oversubscribe" python3 ../source_code/N4/cifar_warmup.py --batch-size=8192 2> /dev/null'
!echo $COMMAND > command && srun --partition=gpu -n1 --gres=gpu:2 /bin/bash ./command
#!TF_CPP_MIN_LOG_LEVEL=3 horovodrun -np 8 --mpi-args="--oversubscribe" python3 ../source_code/N4/cifar_warmup.py --batch-size=512 2> /dev/null

[1,0]<stdout>:Epoch 1/12
[1,0]<stdout>:Epoch time : 28.79374885559082
[1,0]<stdout>:Images/sec: 1736.49
[1,0]<stdout>:Epoch 2/12
[1,0]<stdout>:Epoch time : 1.8994009494781494
[1,0]<stdout>:Images/sec: 26324.09
[1,0]<stdout>:Epoch 3/12
[1,0]<stdout>:Epoch time : 1.008152961730957
[1,0]<stdout>:Images/sec: 49595.65
[1,0]<stdout>:
[1,0]<stdout>:Epoch 3: finished gradual learning rate warmup to 0.002.
[1,0]<stdout>:Epoch 4/12
[1,0]<stdout>:Epoch time : 0.752589225769043
[1,0]<stdout>:Images/sec: 66437.31
[1,0]<stdout>:Epoch 5/12
[1,0]<stdout>:Epoch time : 0.7895505428314209
[1,0]<stdout>:Images/sec: 63327.17
[1,0]<stdout>:Epoch 6/12
[1,0]<stdout>:Epoch time : 0.6286990642547607
[1,0]<stdout>:Images/sec: 79529.31
[1,0]<stdout>:Epoch 7/12
[1,0]<stdout>:Epoch time : 0.6321053504943848
[1,0]<stdout>:Images/sec: 79100.74
[1,0]<stdout>:Epoch 8/12
[1,0]<stdout>:Epoch time : 0.6572461128234863
[1,0]<stdout>:Images/sec: 76075.0
[1,0]<stdout>:Epoch 9/12
[1,0]<stdout>:Epoch time : 0.6581833362579346




In this specific case, we obtain results close to those of Batch Normalisation, but with the batch size increased to a much higher value. We can see the improvement provided by learning rate scaling and warmup.

### Using Optimizers built for Exascale Deep learning

#### The LAMB Optimizer

A series of optimizers have been created to address this problem and allow scaling to very large batch sizes and learning rates. One such optimizer is LAMB, which uses layer-wise adaptive large-batch optimization to train with very little hyperparameter tuning. It is designed to enable very large batch sizes with little or no degradation in accuracy.

We can implement LAMB using the following lines.

```python
# Import LAMB from TensorFlow Addons
from tensorflow_addons.optimizers import LAMB

# Replace the Adam optimizer with LAMB:
opt = LAMB(learning_rate=scaled_lr)
```
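To show what "layer-wise adaptive" means, here is a simplified NumPy sketch of a single LAMB update for one layer, following the algorithm as described in the LAMB paper: an Adam-style update is rescaled by a trust ratio, the ratio of the weight norm to the update norm. This is not the TensorFlow Addons implementation; the function name `lamb_step`, its default hyperparameters, and the example weights and gradient are assumptions made for illustration.

```python
import numpy as np


def lamb_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.0):
    """One simplified LAMB update for the weights of a single layer."""
    # Adam-style first and second moment estimates with bias correction.
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w

    w_norm = np.linalg.norm(w)
    update_norm = np.linalg.norm(update)
    # Layer-wise trust ratio; fall back to 1.0 if either norm is zero.
    trust_ratio = w_norm / update_norm if w_norm > 0 and update_norm > 0 else 1.0

    new_w = w - lr * trust_ratio * update
    return new_w, m, v


w = np.array([3.0, 4.0])  # hypothetical layer weights, norm 5
g = np.array([1.0, 1.0])  # hypothetical gradient
new_w, m, v = lamb_step(w, g, m=np.zeros(2), v=np.zeros(2), t=1, lr=0.01)
print(np.linalg.norm(w - new_w))  # step length is lr * ||w|| for this layer
```

Because the step length is tied to the norm of each layer's weights, layers with very different gradient magnitudes move at comparable relative rates, which is the property that helps with large batches and large learning rates.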

Let's now train using the LAMB optimizer and compare the results. 

You can learn more about the LAMB optimizer from its paper [here](https://arxiv.org/pdf/1904.00962.pdf)

In [7]:
COMMAND = SINGULARITY_RUN + ' horovodrun -np 2 --mpi-args="--oversubscribe" python3 ../source_code/N4/cifar_lamb.py --batch-size=8192 2> /dev/null'
!echo $COMMAND > command && srun --partition=gpu -n1 --gres=gpu:2 /bin/bash ./command
#!TF_CPP_MIN_LOG_LEVEL=3 horovodrun -np 8 --mpi-args="--oversubscribe" python3 ../source_code/N4/cifar_lamb.py --batch-size=512 2> /dev/null

[1,0]<stdout>:Epoch 1/12
[1,0]<stdout>:Epoch time : 32.26944613456726
[1,0]<stdout>:Images/sec: 1549.45
[1,0]<stdout>:Epoch 2/12
[1,0]<stdout>:Epoch time : 3.8513550758361816
[1,0]<stdout>:Images/sec: 12982.44
[1,0]<stdout>:Epoch 3/12
[1,0]<stdout>:Epoch time : 2.732008457183838
[1,0]<stdout>:Images/sec: 18301.55
[1,0]<stdout>:
[1,0]<stdout>:Epoch 3: finished gradual learning rate warmup to 0.002.
[1,0]<stdout>:Epoch 4/12
[1,0]<stdout>:Epoch time : 0.7786927223205566
[1,0]<stdout>:Images/sec: 64210.18
[1,0]<stdout>:Epoch 5/12
[1,0]<stdout>:Epoch time : 0.7232086658477783
[1,0]<stdout>:Images/sec: 69136.34
[1,0]<stdout>:Epoch 6/12
[1,0]<stdout>:Epoch time : 0.7267439365386963
[1,0]<stdout>:Images/sec: 68800.02
[1,0]<stdout>:Epoch 7/12
[1,0]<stdout>:Epoch time : 0.6811413764953613
[1,0]<stdout>:Images/sec: 73406.2
[1,0]<stdout>:Epoch 8/12
[1,0]<stdout>:Epoch time : 0.6811294555664062
[1,0]<stdout>:Images/sec: 73407.48
[1,0]<stdout>:Epoch 9/12
[1,0]<stdout>:Epoch time : 0.7070589065551758


We now tabulate all the results together to understand the improvements in convergence time.

|# of GPUs|Condition|Accuracy on V100 after 12 epochs|Accuracy on A100 after 12 epochs|
|-|-|-|-|
|1|Single GPU||0.41|
|8|Multi-GPU naive||0.31|
|8|Batch Normalisation||0.61|
|8|Learning rate scaling+warmup||0.55 + 0.62|
|8|LAMB Optimizer||0.48|


Now that we are familiar with the concepts of distributed deep learning training and the techniques for faster convergence, go to the next section to run the challenge notebook.

***

## Licensing

This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0).

