# CNN Training on IPUs
Last updated: Jun 27th 2022

This notebook demonstrates how to run convolutional neural network (CNN) models such as ResNet and EfficientNet for image recognition training on the IPU.

**TODO**: Sanity check runs OK after Graphcore cluster is stabilized  
**OPTIONAL**: Add ImageNet 150GB data as another Gradient Dataset; see if model runs in OK amount of time; if it does update text accordingly

## 0. Deep Learning on IPUs

Graphcore Intelligence Processing Units (IPUs) provide a way to radically speed up deep learning models compared to conventional hardware, and even compared to Paperspace's regular hardware accelerators, i.e., GPUs.

Optimized implementations of deep learning models on IPU hardware can reach training times and throughput that are not otherwise feasible. This allows new models to be built and new problems to be solved.

Achieving these optimized implementations requires some code changes.

Graphcore supports the two most common deep learning frameworks, PyTorch and TensorFlow, making the changes needed straightforward.

## 1. Overview of this notebook

We have a series of example notebooks showing deep learning models running on Graphcore IPUs in Paperspace:

- TensorFlow 1: Convolutional neural networks (CNNs)
- TensorFlow 2: Cluster GCNs
- PyTorch: BERT fine-tuning
- PopART: BERT training and inference

This notebook covers the first of these: convolutional neural networks in TensorFlow 1.

### 1.1: Graphcore ResNet-50 and EfficientNet models

Deep CNN residual learning models such as ResNet and EfficientNet are used for image recognition and classification. The training examples given below use models implemented in TensorFlow 1, optimized to best utilize Graphcore's IPU processors.

As the whole of the model is always in memory on the IPU, smaller micro batch sizes of data become more efficient than on other hardware. The micro batch size is the number of samples processed in one full forward/backward pass of the algorithm. This contrasts with the global batch size which is the total number of samples processed in a weight update, and is defined as the product of micro batch size, number of accumulated gradients and total number of replicas.
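The relationship between the micro batch size and the global batch size can be checked with a few lines of Python. This is a minimal sketch: the function name is illustrative, and the numbers are the ones quoted for the ResNet-50 and EfficientNet-B4 configurations later in this notebook.

```python
def global_batch_size(micro_batch_size, gradient_accumulation_count, replicas):
    """Samples contributing to one weight update."""
    return micro_batch_size * gradient_accumulation_count * replicas


# ResNet-50 on 16 IPUs: micro batch 20, 6 accumulated gradients, 16 replicas.
print(global_batch_size(20, 6, 16))  # 1920

# EfficientNet-B4 on 16 IPUs: micro batch 3, 64 accumulated gradients, 4 replicas.
print(global_batch_size(3, 64, 4))  # 768
```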

The IPU's built-in stochastic rounding support improves accuracy when using half-precision, which allows greater throughput. The model uses loss scaling to maintain accuracy at low precision, and uses techniques such as cosine learning rates and label smoothing to improve final validation accuracy. Both model and data parallelism can be used to scale training efficiently over many IPUs.
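To give a feel for two of the techniques named above, here is a plain-Python sketch of the textbook forms of a cosine learning rate schedule and label smoothing. The function names are illustrative, and the training scripts may implement these differently (for example with warm-up steps), so treat this as an illustration of the idea rather than the repository's code.

```python
import math


def cosine_learning_rate(step, total_steps, base_lr):
    """Cosine decay from base_lr at step 0 down to 0 at total_steps."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def smooth_labels(true_class, num_classes, smoothing):
    """One-hot target with `smoothing` probability mass spread over all classes."""
    off_value = smoothing / num_classes
    labels = [off_value] * num_classes
    labels[true_class] += 1.0 - smoothing
    return labels


# Learning rate at the start, middle and end of a 100-step schedule.
for step in (0, 50, 100):
    print(step, cosine_learning_rate(step, total_steps=100, base_lr=0.1))

# Smoothed target for class 1 out of 4 classes.
print(smooth_labels(true_class=1, num_classes=4, smoothing=0.1))
```

The smoothed target still sums to 1, but the true class no longer receives all of the probability mass, which discourages over-confident predictions.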

A ResNeXt model is also available. ResNeXt is a variant of ResNet that adds multiple paths to the Residual Blocks.

## 2. Setup
In Paperspace, the notebook is ready to run: after signing in, no setup outside the notebook and no data downloading are required.

In this section we

1. Install the additional libraries required to run this notebook's models
2. Mount the datasets to be used for model training: CIFAR-10 and CIFAR-100

This is followed by a simple example of training the model, ways to extend this, and finally next steps.

### 2.1: Requirements
We add installation of Git as it is required by the code, followed by the rest of the extra library requirements.

Note the `apt` and `pip` installs have to be run each time the Gradient Notebook is restarted, as the state is not retained.

In [8]:
### TODO: Remove this if Graphcore updates and adds git to their graphcore/tensorflow-jupyter:1-intel-2.5.1-ubuntu-18.04-20220513 Docker image

!apt-get update
!apt-get install -y --no-install-recommends git

Get:1 http://archive.ubuntu.com/ubuntu bionic InRelease [242 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]      
Get:3 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]    
Get:4 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]    
Get:5 http://archive.ubuntu.com/ubuntu bionic/multiverse amd64 Packages [186 kB]
Get:6 http://archive.ubuntu.com/ubuntu bionic/restricted amd64 Packages [13.5 kB]
Get:7 http://archive.ubuntu.com/ubuntu bionic/universe amd64 Packages [11.3 MB]
Get:8 http://archive.ubuntu.com/ubuntu bionic/main amd64 Packages [1344 kB]    
Get:9 http://archive.ubuntu.com/ubuntu bionic-updates/restricted amd64 Packages [1047 kB]
Get:10 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 Packages [2297 kB]
Get:11 http://archive.ubuntu.com/ubuntu bionic-updates/multiverse amd64 Packages [29.8 kB]
Get:12 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages [3298 kB]
Get:13 http://sec

In [9]:
### TODO cd to requirements.txt's dir first
### The command path will change when we are using our own repo rather than Graphcore's https://github.com/graphcore/examples

!pip install -r /notebooks/applications/tensorflow/cnns/training/requirements.txt

Obtaining mlperf-logging from git+https://github.com/mlcommons/logging.git@44767a7aec43ed70dfe3c37f34f27412dcac2c2d#egg=mlperf-logging (from -r /notebooks/applications/tensorflow/cnns/training/requirements.txt (line 7))
  Skipping because already up-to-date.
  Preparing metadata (setup.py) ... done
Collecting portpicker>=1.3.1
  Downloading portpicker-1.5.2-py3-none-any.whl (14 kB)
Collecting pytest==6.2.5
  Downloading pytest-6.2.5-py3-none-any.whl (280 kB)
     |████████████████████████████████| 280 kB 9.7 MB/s
Collecting pytest-pythonpath>=0.7.3
  Downloading pytest_pythonpath-0.7.4-py3-none-any.whl (3.7 kB)
Collecting pyyaml>=5.4.1
  Downloading PyYAML-6.0-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (603 kB)
     |████████████████████████████████| 603 kB 77.0 MB/s
Collecting wandb>=0.12.1
  Downloading wandb-0.12.19-py2.py3-none-any.whl (1.8 MB)
     |███████████████████████████████

### 2.2: Data

**TODO**: Update if `.gradient/settings.yaml` automount works: user doesn't need to mount them?

The data used in this notebook are

- CIFAR-10: https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz
- CIFAR-100: https://www.cs.toronto.edu/~kriz/cifar-100-binary.tar.gz

We have made these available as Gradient Datasets, and they can therefore be accessed here by mounting them.

#### Mounting the data

To mount the data, click on the Data Sources tab on the left-hand navigation bar, locate the Data Sources named cifar-10 and cifar-100 and click Mount. This will make them available in the directory `/datasets`.
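Before starting a training run, it can be worth confirming that the mount succeeded. The sketch below assumes the datasets appear at the paths used by the training commands in section 3.1; the helper name `missing_datasets` is illustrative and not part of the repository.

```python
from pathlib import Path

# Paths taken from the training commands used later in this notebook.
EXPECTED_DATASETS = {
    "cifar-10": Path("/datasets/cifar-10/cifar-10-batches-bin"),
    "cifar-100": Path("/datasets/cifar-100/cifar-100-binary"),
}


def missing_datasets(expected):
    """Return the names of datasets whose directory is absent or empty."""
    missing = []
    for name, path in expected.items():
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing


missing = missing_datasets(EXPECTED_DATASETS)
if missing:
    print("Not mounted yet:", ", ".join(missing))
else:
    print("All datasets are mounted.")
```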

## 3. Train the model

Model training is currently invoked by calling the `.py` Python scripts from this notebook.

### 3.1: Basic training run
We can therefore do a basic training run on CIFAR-10 by running the appropriate command.  
It takes about 6 minutes to run.

In [3]:
### TODO cd to train.py's dir first
### The command path will change when we are using our own repo

!python3 /notebooks/applications/tensorflow/cnns/training/train.py --dataset cifar-10 --data-dir /datasets/cifar-10/cifar-10-batches-bin

2022-06-22 22:34:59.871107: I tensorflow/compiler/plugin/poplar/driver/poplar_platform.cc:47] Poplar version: 2.5.0 (76e88974fc) Poplar package: e94d646535



The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

Namespace(config=None, config_path='/notebooks/applications/tensorflow/cnns/training/configs.yml', help=False, lr_schedule='stepped', model='resnet', restore_path=None)
{'model': 'resnet', 'lr_schedule': 'stepped', 'restore_path': None, 'help': False, 'config': None, 'config_path': '/notebooks/applications/tensorflow/cnns/training/configs.yml', 'model_size': None, 'batch_norm': False, 'group_norm': False, 'groups': None, 'BN_decay': None, 'dataset': 'cifar-10', 'data_dir':

Similarly, training on CIFAR-100 can be run with

```
!python3 /notebooks/applications/tensorflow/cnns/training/train.py --dataset cifar-100 --data-dir /datasets/cifar-100/cifar-100-binary
```

### 3.2: Training configurations

The training script supports a large number of program arguments, allowing you to make the most of the IPU performance for a given model.

In order to facilitate the handling of a large number of parameters, you can define an ML configuration, in the more human-readable YAML format, in the `configs.yml` file.

After the configuration is defined, you can run the training script with this option:

```
python3 train.py --config my_config
```

From the command line you can override any option previously defined in a configuration. For example, you can change the number of training epochs of a configuration with:

```
python3 train.py --config my_config --epochs 50
```

We provide reference configurations for the models described below.
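The override behaviour described above follows a common pattern: start from the named configuration, then let any explicitly passed command-line value win. The sketch below illustrates that pattern with `argparse`. The `CONFIGS` dictionary and its values are hypothetical stand-ins for entries in `configs.yml`, and this is not the code used by `train.py`.

```python
import argparse

# Hypothetical named configuration; real ones live in configs.yml.
CONFIGS = {
    "my_config": {"epochs": 100, "micro_batch_size": 4},
}


def resolve_options(argv):
    """Merge a named configuration with command-line overrides."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", choices=sorted(CONFIGS))
    parser.add_argument("--epochs", type=int)
    parser.add_argument("--micro-batch-size", type=int)
    args = parser.parse_args(argv)

    # Start from the configuration, if one was named.
    options = dict(CONFIGS.get(args.config, {}))
    # Values given explicitly on the command line take precedence.
    for key, value in vars(args).items():
        if key != "config" and value is not None:
            options[key] = value
    return options


print(resolve_options(["--config", "my_config"]))
# {'epochs': 100, 'micro_batch_size': 4}
print(resolve_options(["--config", "my_config", "--epochs", "50"]))
# {'epochs': 50, 'micro_batch_size': 4}
```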

### 3.3: Selecting checkpoints for validation

Large deep learning models can take a long time to train, even on an IPU, so the model's state is recorded at intervals during training as checkpoints. Besides training, we also want to evaluate the model, i.e., see how well it is performing. It is therefore useful to be able to evaluate intermediate checkpoints as well as the final model at the end of a training run.

With `validation.py`, either a single checkpoint or all generated checkpoints can be evaluated. To evaluate a single checkpoint, specify its name using `--restore-path`:

```
python3 validation.py --config my_config --data-dir <path-to-dataset> --restore-path logs/<training-dir>/<ckpt-id>
```

To evaluate all generated checkpoints specify the directory that contains them:

```
python3 validation.py --config my_config --data-dir <path-to-dataset> --restore-path logs/<training-dir>
```
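The two forms of `--restore-path` above can be pictured as follows. This is only a sketch of the idea, not how `validation.py` is implemented: it assumes checkpoints are stored in TensorFlow's usual layout, where each checkpoint has a `<prefix>.index` file, and the function name and demo file names are illustrative.

```python
import tempfile
from pathlib import Path


def checkpoints_to_evaluate(restore_path):
    """Return checkpoint prefixes for a directory, or the path itself otherwise."""
    path = Path(restore_path)
    if path.is_dir():
        # One "<prefix>.index" file is assumed per checkpoint.
        # Note that the sort is lexicographic, not by training step.
        return sorted(str(p.with_suffix("")) for p in path.glob("*.index"))
    return [str(path)]


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as training_dir:
    for name in ("ckpt-1.index", "ckpt-1.meta", "ckpt-2.index"):
        (Path(training_dir) / name).touch()
    for prefix in checkpoints_to_evaluate(training_dir):
        print(Path(prefix).name)
```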

## 4. Distributed training

**OPTIONAL**: Try these out; needs stabilized cluster to run.

To get the most performance from IPU-PODs, this application example supports PopDist, Graphcore's Poplar distributed configuration library. For more information about PopDist and PopRun, see the [User Guide](https://docs.graphcore.ai/projects/poprun-user-guide/). Besides allowing an ML workload to scale, distributed training benefits the CNN application because each additional instance launched adds an input data feed, which increases the throughput of the workload.

For example the ResNet50 Group Norm configuration for 16 IPUs is defined in `configs.yml`. You can distribute this configuration on an IPU-POD16 with:

```
poprun -v --numa-aware 1 --num-instances 8 --num-replicas 16 --ipus-per-replica 1 --mpi-local-args="--tag-output" \
python3 train.py --config mk2_resnet50_gn_16ipus --data-dir your_dataset_dir_path --no-validation
```

As mentioned above, each instance sets up an independent input data feed to the devices. The maximum number of instances is limited by the number of replicas, so we could, in theory, define `--num-instances 16` and have one input feed for each replica. However, we are finding that, given the NUMA configuration currently in use on IPU-PODs, the optimal maximum number of instances is 8. So while there are 16 replicas in total, each instance manages 2 local replicas. A side effect is that, when executing distributed workloads, the value of the program option `--replicas` is ignored and overridden by the number of local replicas. All of this is handled programmatically in `train.py` to make it more manageable.
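The arithmetic relating the `poprun` options can be sketched in a few lines. The helper below is illustrative and not part of PopDist or `train.py`; it assumes that replicas are divided evenly between instances.

```python
def poprun_layout(num_replicas, num_instances, ipus_per_replica):
    """Derive local replicas per instance and the total IPU count."""
    if num_replicas % num_instances != 0:
        raise ValueError("replicas must divide evenly between instances")
    return {
        "local_replicas": num_replicas // num_instances,
        "total_ipus": num_replicas * ipus_per_replica,
    }


# ResNet50 Group Norm: 16 single-IPU replicas managed by 8 instances.
print(poprun_layout(num_replicas=16, num_instances=8, ipus_per_replica=1))
# {'local_replicas': 2, 'total_ipus': 16}

# Pipelined ResNet50 Batch Norm: 4 replicas of 4 IPUs each, one per instance.
print(poprun_layout(num_replicas=4, num_instances=4, ipus_per_replica=4))
# {'local_replicas': 1, 'total_ipus': 16}
```

Both layouts occupy the same 16 IPUs; what changes is how many replicas each instance feeds.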

Note that there is currently a limitation with distributed training that prevents validation from running after training in the same process, so we need to pass the option `--no-validation`. After training is complete, you can run validation with:

```
poprun -v --numa-aware 1 --num-instances 8 --num-replicas 16 --ipus-per-replica 1 --mpi-local-args="--tag-output" \
python3 validation.py --config mk2_resnet50_gn_16ipus --data-dir your_dataset_dir_path --restore-path generated_checkpoints_dir_path
```

When running pipelined models, such as the ResNet50 Batch Norm 16 IPU configuration, where each model is pipelined across 4 IPUs during training, you need to change the distributed training command line accordingly:

```
poprun -v --numa-aware 1 --num-instances 4 --num-replicas 4 --ipus-per-replica 4 --mpi-local-args="--tag-output" \
python3 train.py --config mk2_resnet50_bn_16ipus --data-dir your_dataset_path --no-validation
```

Note that we reduced the number of instances from 8 to 4 since we are only running 4 replicas. During validation the model does not need to be pipelined as it fits in a single IPU. So distributed validation can be executed with:

```
poprun -v --numa-aware 1 --num-instances 8 --num-replicas 16 --ipus-per-replica 1 --mpi-local-args="--tag-output" \
python3 validation.py --config mk2_resnet50_bn_16ipus --shards 1 --data-dir your_dataset_dir_path  \
--restore-path generated_checkpoints_dir_path
```

IPU-POD systems can also be equipped with more than one host server, enabling instances to be run across all servers. For more on this, see the [original repo readme](https://github.com/graphcore/examples/tree/master/applications/tensorflow/cnns/training).

## 5. Optimal model configurations

**OPTIONAL**: Try these out; needs stabilized cluster to run.

In this section we show some examples of optimal model configurations for training ResNet, ResNeXt, and EfficientNet on the IPU. These configurations provide examples that can be adapted to your own use cases.

### 5.1: ImageNet - ResNet-50

The following configuration trains ResNet50 using 16 Mk2 IPUs. Each IPU runs a single data-parallel replica of the model with a micro-batch size of 20. We use a gradient accumulation count of 6, and 16 replicas in total for a global batch size of 1920 (20 * 6 * 16). Activations for the forwards pass are re-calculated during the backwards pass. Partials saved to tile memory within the convolutions are set to half-precision to maximise throughput on the IPU. Batch norm statistics computed for each batch of samples are distributed across groups of 2 IPUs to improve numerical stability and convergence. This example uses the SGD-M optimizer, cosine learning rate and label smoothing to train to >75.90% validation accuracy in 45 epochs. The example uses PopDist with 8 instances to maximize throughput.

```
POPLAR_ENGINE_OPTIONS='{"opt.enableMultiAccessCopies":"false"}' poprun -vv --mpi-global-args='--tag-output --allow-run-as-root' \
--mpi-local-args='-x POPLAR_ENGINE_OPTIONS' --ipus-per-replica 1 --numa-aware 1 \
--num-instances 8 --num-replicas 16 python train.py --config mk2_resnet50_mlperf_pod16_bs20 --epochs-per-sync 20 \
--data-dir your_dataset_path --no-validation
```

After training is complete, you can validate the previously saved checkpoints. As above, each IPU runs one replica of the model and the model is replicated over the 16 IPUs. To make sure there are no validation samples discarded when sharding the validation dataset across 8 instances, a batch size of 25 is used.

```
POPLAR_ENGINE_OPTIONS='{"opt.enableMultiAccessCopies":"false"}' poprun -vv --mpi-global-args='--tag-output --allow-run-as-root' \
--mpi-local-args='-x POPLAR_ENGINE_OPTIONS' --ipus-per-replica 1 --numa-aware 1 \
--num-instances 8 --num-replicas 16 python validation.py --config mk2_resnet50_mlperf_pod16_bs20 --no-stochastic-rounding \
--micro-batch-size 25 --available-memory-proportion 0.6 --data-dir your_dataset_path --restore-path generated_checkpoints_dir_path
```
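The choice of a micro-batch size of 25 for validation can be checked with some simple arithmetic. The sketch below assumes the 50,000-image ImageNet validation set is split evenly across instances and that each instance consumes its shard in whole steps of (local replicas × micro-batch size). The actual sharding logic in `validation.py` may differ, and the function name is illustrative.

```python
IMAGENET_VALIDATION_SAMPLES = 50_000


def discarded_samples(num_samples, num_instances, local_replicas, micro_batch_size):
    """Samples left unused when every instance runs only whole steps."""
    shard_size = num_samples // num_instances
    step_size = local_replicas * micro_batch_size
    used = num_instances * (shard_size // step_size) * step_size
    return num_samples - used


# 8 instances, each managing 2 of the 16 replicas.
for micro_batch_size in (20, 25):
    dropped = discarded_samples(IMAGENET_VALIDATION_SAMPLES, 8, 2, micro_batch_size)
    print(f"micro-batch size {micro_batch_size}: {dropped} samples discarded")
```

Under these assumptions, the training micro-batch size of 20 would leave 80 samples unused, whereas 25 divides each shard of 6,250 samples exactly.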

### 5.2: ImageNet - ResNeXt

ResNeXt is a variant of ResNet that adds multiple paths to the Residual Blocks.

The following configuration will train a ResNeXt-101 model to 78.8% validation accuracy in 120 epochs on 16 Mk2 IPUs. The model is pipelined over 2 IPUs with a micro batch size of 6. We use a gradient accumulation count of 16 and 8 replicas for a global batch size of 768 (6 * 16 * 8).

```
poprun --mpi-global-args="--allow-run-as-root --tag-output" --numa-aware 1 --num-replicas 8 --ipus-per-replica 2 \
--num-instances 8 python3 train.py --config mk2_resnext101_16ipus --data-dir your_dataset_path --no-validation
```

As above, you can run validation after training:

```
poprun --mpi-global-args="--allow-run-as-root --tag-output" --numa-aware 1 --num-replicas 16 --ipus-per-replica 1 \
--num-instances 16 python3 validation.py --config mk2_resnext101_16ipus --shards 1 --data-dir your_dataset_path \
--restore-path generated_checkpoints_dir_path
```

### 5.3: ImageNet - EfficientNet
#### 5.3.1: Training

The following configuration trains EfficientNet-B4 to ~82% using 16 Mk2 IPUs. Each model is pipelined across 4 IPUs with a micro-batch size of 3. We use a gradient accumulation count of 64, and 4 replicas in total for a global batch size of 768 (3 * 64 * 4).

```
poprun --mpi-global-args="--allow-run-as-root --tag-output" --numa-aware 1 --num-replicas 4 --num-instances 4 --ipus-per-replica 4 \
python3 train.py --config mk2_efficientnet_b4_g1_16ipus --data-dir your_dataset_path --no-validation
```

As above, you can run validation after training:

```
poprun -v --numa-aware 1 --num-instances 8 --num-replicas 16 --ipus-per-replica 1 --mpi-local-args="--tag-output" \
python3 validation.py --config mk2_efficientnet_b4_g1_16ipus --shards 1 --data-dir your_dataset_dir_path  \
--restore-path generated_checkpoints_dir_path
```

Changing the dimension of the group convolutions can make the model more efficient on the IPU. To keep the number of parameters approximately the same, you can reduce the expansion ratio. For example, a modified EfficientNet-B4 model with a similar number of trainable parameters can be trained using:

```
poprun --mpi-global-args="--allow-run-as-root --tag-output" --numa-aware 1 --num-replicas 8 --num-instances 8 --ipus-per-replica 2 \
python3 train.py --config mk2_efficientnet_b4_g16_16ipus --data-dir your_dataset_path --no-validation --identical-replica-seeding
```

This configuration trains EfficientNet-B4 to ~82.6% validation accuracy with improved training throughput, achieved by using half-precision arithmetic and by pipelining across just 2 IPUs. The global batch size is 6144, enabled by using the LARS optimizer and polynomial decay learning rate, in addition to other hyperparameter tuning. This makes the configuration appropriate for a range of different sized systems.

#### 5.3.2: Inference
The training harness can also be used to demonstrate inference performance using the `validation.py` script. For example, to check inference for EfficientNet use:

```
python validation.py --model efficientnet --model-size B0 --dataset imagenet --micro-batch-size 8 \
--generated-data --repeat 10 --batch-norm
```

Inference can also be run using the embedded application runtime, which saves a precompiled graph to a file so that compilation is skipped in subsequent runs. This can be tested using the `inference_embedded.py` script. Each time the script is executed, it looks for the precompiled graph in the working directory; if found, the graph is loaded and executed for the given number of iterations. If the graph is not found, it is constructed, compiled and saved to a file. For example, to test the performance of EfficientNet inference use:

```
python inference_embedded.py --model efficientnet --model-size B0 --dataset imagenet --micro-batch-size 1 \
--iterations 1000 --batches-per-step 100 --eight-bit-io --no-dataset-cache --generated-data
```
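The load-or-compile behaviour described for `inference_embedded.py` is an instance of a general caching pattern, sketched below. This is a generic illustration only: it uses `pickle` and a stand-in `build` function, whereas the real script stores a compiled graph through the embedded application runtime, whose API is not shown here.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(cache_file, build):
    """Return (artifact, loaded_from_cache), building and saving on a miss."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        with cache_file.open("rb") as f:
            return pickle.load(f), True
    artifact = build()  # the expensive step, e.g. graph compilation
    with cache_file.open("wb") as f:
        pickle.dump(artifact, f)
    return artifact, False


def expensive_build():
    # Stand-in for constructing and compiling a graph.
    return {"model": "efficientnet", "compiled": True}


with tempfile.TemporaryDirectory() as workdir:
    cache = Path(workdir) / "precompiled_graph.bin"
    for run in (1, 2):
        _, from_cache = load_or_build(cache, expensive_build)
        print(f"run {run}: loaded from cache = {from_cache}")
```

The first run pays the build cost and writes the file; the second run finds the file and skips the build.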

## 6. View the results in Weights & Biases

**TODO**: Show screenshot of W&B working. Paperspace has some W&B integration, so this is good to highlight. Needs stabilized cluster to run.

Weights & Biases is a tool that helps you track different metrics of your machine learning job, for example the loss and accuracy, but also the memory utilisation. For more information, see https://www.wandb.com/. Installing the `requirements.txt` file installs a version of `wandb`. You can log in to `wandb` as you prefer and then activate it using the flag `--wandb`, e.g.,

```
python train.py --config mk2_resnet8_test --wandb
```

Near the start of the run, a link to your run appears in the output.

## 7. More model options

The scripts come with a large number of model options, some of which have been explored above. Use `--help` to show all available options.

These include

- Training options
  - ResNet
  - EfficientNet
  - General
  - IPU
- Validation options
- Other options
 
For full details of these, see the original repo readme [training options section](https://github.com/graphcore/examples/tree/master/applications/tensorflow/cnns/training).

## 8. Resuming training runs

Training can be resumed from a checkpoint using the `restore.py` script. You must supply the `--restore-path` option with a valid checkpoint.

## 9. Benchmarking

How fast are the IPUs training the models?

To see this in more detail, check out the [benchmarks section](https://github.com/graphcore/examples/blob/master/applications/tensorflow/cnns/training/README_Benchmarks.md) of the original repo.

**OPTIONAL:** Sanity-check this as a link

# Conclusions and next steps

**TODO:** Add links to TF2, PyTorch, PopART when done.

Now that you have seen convolutional neural networks being trained in TensorFlow 1 on Paperspace, some next steps that you can take are:

- Check out the other Graphcore Paperspace content: TensorFlow 2, PyTorch, and PopART.
- The [original repository](https://github.com/graphcore/examples/tree/master/applications/tensorflow/cnns/training) for this content includes some details omitted here for brevity, such as a tabulated description of the other files in the repository.
- Explore the extensive and informative [Graphcore documentation](https://docs.graphcore.ai/).
- In particular, the [Examples and Tutorials](https://docs.graphcore.ai/en/latest/examples.html) section is a great way to navigate all of the GitHub repository content that appears under the [Graphcore code examples](https://github.com/graphcore/examples) and [Graphcore Tutorials](https://github.com/graphcore/tutorials). Note that most of these are Python-script-based (`.py`) rather than Jupyter-Notebook-based (`.ipynb`).