# Benchmark mixed precision training on CIFAR100

In this notebook we will benchmark 1) the native PyTorch mixed precision module [`torch.cuda.amp`](https://pytorch.org/docs/master/amp.html) and 2) the Nvidia/Apex package.

We will train a Wide-ResNet model on the CIFAR100 dataset using a Turing-architecture GPU and compare training times.

**TL;DR**

The ranking by training time, fastest first, is the following:
- 1st place: Nvidia/Apex "O2"
- 2nd place: `torch.cuda.amp`: autocast and scaler
- 3rd place: Nvidia/Apex "O1"
- 4th place: fp32

According to @mcarilli: "Native amp is more like a faster, better integrated, locally enabled O1"

## Installations and setup

1) The recently added [`torch.cuda.amp`](https://pytorch.org/docs/master/notes/amp_examples.html#working-with-multiple-models-losses-and-optimizers) module performs automatic mixed precision training without the Nvidia/Apex package and is available in PyTorch >=1.6.0. At the time of writing, we need to install a nightly release to use it.

In [1]:
# !pip install --pre --upgrade torch==1.6.0.dev20200411+cu101 torchvision -f https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html
# !pip install --pre --upgrade pytorch-ignite 
# !pip install --upgrade pynvml fire

2) Let's install the Nvidia/Apex package:

In [2]:
# !pip install --upgrade --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" git+https://github.com/NVIDIA/apex/

In [3]:
import torch
import torchvision
import ignite
torch.__version__, torchvision.__version__, ignite.__version__

('1.6.0.dev20200411+cu101', '0.6.0+cu101', '0.4.0.dev20200411')

3) The scripts we will execute are located in `ignite/examples/contrib/cifar100_amp_benchmark` of the GitHub repository. Let's clone the repository and set up PYTHONPATH to execute the benchmark scripts:

In [4]:
!git clone https://github.com/pytorch/ignite.git /tmp/ignite
scriptspath="/tmp/ignite/examples/contrib/cifar100_amp_benchmark/"
setup=f"cd {scriptspath} && export PYTHONPATH=$PWD:$PYTHONPATH"

Cloning into '/tmp/ignite'...
remote: Enumerating objects: 5534, done.
remote: Total 5534 (delta 0), reused 0 (delta 0), pack-reused 5534
Receiving objects: 100% (5534/5534), 21.83 MiB | 14.43 MiB/s, done.
Resolving deltas: 100% (3458/3458), done.


4) Download dataset

In [7]:
from torchvision.datasets.cifar import CIFAR100
CIFAR100(root="/tmp/cifar100/", train=True, download=True)

Downloading https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz to /tmp/cifar100/cifar-100-python.tar.gz



Extracting /tmp/cifar100/cifar-100-python.tar.gz to /tmp/cifar100/


Dataset CIFAR100
    Number of datapoints: 50000
    Root location: /tmp/cifar100/
    Split: Train

## Training in fp32

In [8]:
!{setup} && python benchmark_fp32.py /tmp/cifar100/ --batch_size=256 --max_epochs=20

Files already downloaded and verified
Epoch [1/20]: [195/195] 100%|████████████████████, batch loss=4.53 [00:16<00:00]
Epoch [2/20]: [195/195] 100%|████████████████████, batch loss=4.25 [00:16<00:00]
Epoch [3/20]: [195/195] 100%|████████████████████, batch loss=4.19 [00:16<00:00]
Epoch [4/20]: [195/195] 100%|████████████████████, batch loss=3.94 [00:16<00:00]
Epoch [5/20]: [195/195] 100%|████████████████████, batch loss=3.98 [00:16<00:00]
Epoch [6/20]: [195/195] 100%|████████████████████, batch loss=3.91 [00:16<00:00]
Epoch [7/20]: [195/195] 100%|█████████████████████, batch loss=3.8 [00:17<00:00]
Epoch [8/20]: [195/195] 100%|████████████████████, batch loss=3.68 [00:17<00:00]
Epoch [9/20]: [195/195] 100%|████████████████████, batch loss=3.52 [00:17<00:00]
Epoch [10/20]: [195/195] 100%|███████████████████, batch loss=3.61 [00:17<00:00]
Epoch [11/20]: [195/195] 100%|███████████████████, batch loss=3.63 [00:17<00:00]
Epoch [12/20]: [195/195] 100%|███████████████████, batch loss=3.63 [00:

## Training with `torch.cuda.amp`
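
The contents of the benchmark script are not shown in this notebook. As a rough sketch of what "autocast and scaler" usually means, a training step wraps the forward pass in `torch.cuda.amp.autocast` and routes the backward pass and the optimizer step through `torch.cuda.amp.GradScaler`. The small linear model, the random batch and the `train_step` name below are placeholders for illustration and are not taken from `benchmark_torch_cuda_amp.py`.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
# Mixed precision is only enabled on a GPU; on CPU both helpers act as no-ops.
use_amp = device == "cuda"

# Placeholder model standing in for Wide-ResNet (assumption for this sketch).
model = nn.Linear(32, 100).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)


def train_step(x, y):
    optimizer.zero_grad()
    # Forward pass and loss run under autocast, which picks fp16 or fp32 per op.
    with torch.cuda.amp.autocast(enabled=use_amp):
        y_pred = model(x)
        loss = criterion(y_pred, y)
    # Scale the loss before backward so small fp16 gradients do not underflow.
    scaler.scale(loss).backward()
    # step() unscales gradients and skips the update if they contain inf/NaN.
    scaler.step(optimizer)
    # update() adjusts the scale factor for the next iteration.
    scaler.update()
    return loss.item()


# Random batch with 100 classes, mimicking the CIFAR100 label range.
x = torch.randn(8, 32, device=device)
y = torch.randint(0, 100, (8,), device=device)
batch_loss = train_step(x, y)
print(f"batch loss={batch_loss:.2f}")
```

Because autocast is entered locally with a context manager, only the code inside the `with` block is affected, which fits the "locally enabled O1" description quoted in the TL;DR.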

In [9]:
!{setup} && python benchmark_torch_cuda_amp.py /tmp/cifar100/ --batch_size=256 --max_epochs=20

Files already downloaded and verified
Epoch [1/20]: [195/195] 100%|████████████████████, batch loss=4.62 [00:10<00:00]
Epoch [2/20]: [195/195] 100%|████████████████████, batch loss=4.22 [00:10<00:00]
Epoch [3/20]: [195/195] 100%|████████████████████, batch loss=4.22 [00:10<00:00]
Epoch [4/20]: [195/195] 100%|████████████████████, batch loss=3.96 [00:10<00:00]
Epoch [5/20]: [195/195] 100%|████████████████████, batch loss=3.88 [00:10<00:00]
Epoch [6/20]: [195/195] 100%|████████████████████, batch loss=3.93 [00:10<00:00]
Epoch [7/20]: [195/195] 100%|████████████████████, batch loss=3.71 [00:10<00:00]
Epoch [8/20]: [195/195] 100%|████████████████████, batch loss=3.73 [00:10<00:00]
Epoch [9/20]: [195/195] 100%|████████████████████, batch loss=3.61 [00:10<00:00]
Epoch [10/20]: [195/195] 100%|███████████████████, batch loss=3.52 [00:10<00:00]
Epoch [11/20]: [195/195] 100%|███████████████████, batch loss=3.39 [00:10<00:00]
Epoch [12/20]: [195/195] 100%|███████████████████, batch loss=3.35 [00:

## Training with `Nvidia/apex`


- We check 2 optimization levels: "O1" and "O2"
    - "O1" optimization level: automatic casts around PyTorch functions and tensor methods
    - "O2" optimization level: fp16 training with fp32 batchnorm and fp32 master weights
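
For orientation, the two levels are typically selected through the `opt_level` argument of `apex.amp.initialize`, and the backward pass goes through `amp.scale_loss`. The sketch below is an assumption about how such a training step can look and is not copied from `benchmark_nvidia_apex.py`; the linear model, the random batch and the `build_apex_training` helper are hypothetical. It only runs the training step when Apex and a CUDA device are both available.

```python
import torch
import torch.nn as nn

try:
    from apex import amp  # needs the Nvidia/Apex install from the setup section
except ImportError:
    amp = None


def build_apex_training(opt_level="O1"):
    """Hypothetical helper: returns a train_step function patched by apex.amp."""
    # Placeholder model standing in for Wide-ResNet (assumption for this sketch).
    model = nn.Linear(32, 100).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    criterion = nn.CrossEntropyLoss()
    # "O1" patches torch functions with casts; "O2" casts the model to fp16
    # while keeping fp32 batchnorm and fp32 master weights.
    model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)

    def train_step(x, y):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        # scale_loss applies the (dynamic) loss scale before backward.
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step()
        return loss.item()

    return train_step


apex_ready = amp is not None and torch.cuda.is_available()
if apex_ready:
    train_step = build_apex_training(opt_level="O2")
    x = torch.randn(8, 32).cuda()
    y = torch.randint(0, 100, (8,)).cuda()
    print(train_step(x, y))
```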

In [10]:
!{setup} && python benchmark_nvidia_apex.py /tmp/cifar100/ --batch_size=256 --max_epochs=20 --opt="O1"

Files already downloaded and verified
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
Epoch [1/20]: [1/195]   1%|                      , batch loss=5.03 [00:00<00:00]Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0
Epoch [1/20]: [1/195]   1%|        

In [11]:
!{setup} && python benchmark_nvidia_apex.py /tmp/cifar100/ --batch_size=256 --max_epochs=20 --opt="O2"

Files already downloaded and verified
Selected optimization level O2:  FP16 training with FP32 batchnorm and FP32 master weights.

Defaults for this optimization level are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
Epoch [1/20]: [1/195]   1%|                      , batch loss=5.01 [00:00<00:00]Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0
Epoch [1/20]: [1/195]   
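
The "Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to ..." messages in both Apex runs come from dynamic loss scaling (`loss_scale : dynamic`). A minimal pure-Python sketch of the idea follows. It is not Apex's implementation: the `DynamicLossScaler` class is hypothetical, the starting scale of 65536 is inferred from the log (the first reduction lands on 32768), and the growth interval of 2000 clean steps is an illustrative assumption.

```python
import math


class DynamicLossScaler:
    """Toy dynamic loss scaler: halve on overflow, grow after a clean streak."""

    def __init__(self, init_scale=65536.0, factor=2.0, growth_interval=2000):
        self.scale = init_scale
        self.factor = factor
        self.growth_interval = growth_interval
        self.clean_steps = 0

    def update(self, grads):
        """Return True if the optimizer step should be applied."""
        overflow = any(not math.isfinite(g) for g in grads)
        if overflow:
            # inf/NaN gradients: skip this step and reduce the scale.
            self.scale /= self.factor
            self.clean_steps = 0
            return False
        self.clean_steps += 1
        if self.clean_steps == self.growth_interval:
            # Long streak without overflow: try a larger scale again.
            self.scale *= self.factor
            self.clean_steps = 0
        return True


scaler = DynamicLossScaler()
for grads in ([float("inf")], [float("nan")], [0.1, -0.2]):
    stepped = scaler.update(grads)
    print(f"step applied: {stepped}, loss scale: {scaler.scale}")
```

With these assumptions, the first two overflowing iterations reduce the scale to 32768.0 and then 16384.0, the same values that appear in the log, and the third iteration is applied normally.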