# Using PyTorch Ignite with DALI:

You can find the general documentation on DALI here: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/index.html

In this notebook we show one way of building a DALI pipeline for Ignite that mimics, syntax-wise, how we might do it with the default PyTorch transforms and loaders.

DALI is a GPU-accelerated library for data loading and augmentation. For example, it can use a GPU to accelerate JPEG decoding when reading from a directory structure, and it can also do basic resizing and normalization on the decoded images.

DALI is unique in that it allows you to do __every__ step of the data loading and transforming process on GPU by composing pipelines of exclusively GPU ops (DALI does support CPU ops, but they must come _before_ the GPU accelerated part of the pipeline). If __every__ step of the data loading process up until training is all on GPU, it saves having to do CPU-GPU communication. So that is what DALI does - it helps build and optimize the "GPU onramp" to a deep learning model.
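
The ordering rule above (CPU ops first, then the GPU part) can be sketched in plain Python. This is only an illustration: `first_ordering_violation` is a hypothetical helper, not part of DALI, and it checks nothing beyond "no CPU op after the data has reached the GPU".

```python
# Hypothetical helper, not part of DALI: checks the ordering rule described above.
def first_ordering_violation(devices):
    """Return the index of the first "cpu" op that follows a "mixed"/"gpu" op, or None."""
    on_gpu = False
    for index, device in enumerate(devices):
        if device in ("mixed", "gpu"):
            on_gpu = True           # from here on the data lives on the GPU
        elif device == "cpu" and on_gpu:
            return index            # a CPU op after the GPU part breaks the rule
    return None

valid = ["cpu", "mixed", "gpu", "gpu"]   # e.g. CPU reader, mixed decoder, GPU transforms
invalid = ["mixed", "gpu", "cpu"]        # a CPU op placed after the GPU part

print(first_ordering_violation(valid))    # None
print(first_ordering_violation(invalid))  # 2
```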

CPU-GPU communication can end up bottlenecking training workflows, so avoiding it can be really useful. This is especially true as more training GPUs are added to a particular node. This is perfect for high-level APIs like Ignite when combined with PyTorch Distributed - their rapid prototyping nature means they can quickly build intense multi-GPU workloads. A lot of the time, the first bottleneck in these workloads ends up being the IO pipeline.

So ideally we make building a better IO pipeline just as easy as building any normal sequential PyTorch model with Ignite!

# Running This Notebook:

You'll need a computer with an NVIDIA GPU, the appropriate drivers, CUDA 9 or 10, and PyTorch properly configured for GPU use.

You can also run this notebook on Google Colab, which gives you free GPU access for research with a Google account:
https://colab.research.google.com/notebooks/welcome.ipynb

Another option is to use one of the free NGC containers with nvidia-docker on a computer with an NVIDIA GPU, which can help sidestep the configuration.

The one-liner for doing this (on your own machine) and starting Jupyter (with nvidia-docker installed) is:

You can get nvidia-docker here: https://github.com/NVIDIA/nvidia-docker

# Environment:

nvidia-smi lets us verify what GPUs we have:

In [1]:
!nvidia-smi

Wed Feb  5 21:48:41 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla V100-DGXS...  On   | 00000000:07:00.0  On |                    0 |
| N/A   42C    P0    58W / 300W |    198MiB / 16125MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-DGXS...  On   | 00000000:08:00.0 Off |                    0 |
| N/A   41C    P0    38W / 300W |      0MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-DGXS...  On   | 00000000:0E:00.0 Off |                    0 |
| N/A   

We need PyTorch, Ignite and DALI for this. 

We will also use ipywidgets, as progress bars help visualize performance differences and are also fun.

If you need the dependencies, run this cell and then refresh the page to reset Jupyter for the widgets:

In [None]:
!pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/nightly/cuda/10.0 nvidia-dali-nightly
!pip install git+https://github.com/pytorch/ignite.git
!pip install ipywidgets

# Download Cats vs Dogs dataset

To demo this workflow we are just going to use a generic Cats vs Dogs classification task.

The images are evenly split between cats and dogs, with 1000 samples per class in JPEG format. They vary in size, so one step in the pipeline resizes them to 128x128.

In [3]:
!wget --no-check-certificate \
    https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip \
    -O /tmp/cats_and_dogs_filtered.zip

--2020-02-05 21:48:53--  https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.11.208, 2607:f8b0:4025:800::2010
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.11.208|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 68606236 (65M) [application/zip]
Saving to: ‘/tmp/cats_and_dogs_filtered.zip’


2020-02-05 21:48:54 (90.3 MB/s) - ‘/tmp/cats_and_dogs_filtered.zip’ saved [68606236/68606236]



In [4]:
import os
import zipfile

local_zip = '/tmp/cats_and_dogs_filtered.zip'
# The context manager closes the archive even if extraction fails.
with zipfile.ZipFile(local_zip, 'r') as zip_ref:
    zip_ref.extractall('/tmp')

In [5]:
!ls /tmp/cats_and_dogs_filtered/train/
!file /tmp/cats_and_dogs_filtered/train/cats/cat.0.jpg 

cats  dogs
/tmp/cats_and_dogs_filtered/train/cats/cat.0.jpg: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 500x374, frames 3


In [6]:
DATASET_NAME = '/tmp/cats_and_dogs_filtered'
BATCH_SIZE=64

NUM_LABELS=2
WIDTH=128
HEIGHT=128
NUM_CHANNELS=3
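
As a quick sanity check on these constants, here is how many batches one training epoch contains. This is a sketch that assumes 2000 training images (1000 per class, as stated above). Whether the final partial batch is counted depends on the loader, so both conventions are computed.

```python
import math

BATCH_SIZE = 64
NUM_TRAIN_IMAGES = 2000  # assumption: 1000 cats + 1000 dogs, as stated above

full_batches = NUM_TRAIN_IMAGES // BATCH_SIZE            # final partial batch not counted
all_batches = math.ceil(NUM_TRAIN_IMAGES / BATCH_SIZE)   # final partial batch counted
last_batch_size = NUM_TRAIN_IMAGES - full_batches * BATCH_SIZE

print(full_batches, all_batches, last_batch_size)  # 31 32 16
```

Both values show up later as the `epoch_length` of the two training runs (31 for the DALI loader, 32 for the default `DataLoader`).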

# Set up DALI Pipeline

ComposeOps makes it really easy to make a DALI workflow like you would a regular PyTorch transform Compose:

In [7]:
from dali_transform_utilities import ComposeOps

from nvidia.dali.pipeline import Pipeline
import nvidia.dali.ops as ops
import nvidia.dali.types as types

oplist = []

# ImageDecoder takes in the raw encoded images (JPEGs in this case) and uses the GPU to accelerate decoding.
# "mixed" means the op takes its input on the CPU and produces its output on the GPU.
oplist.append(ops.ImageDecoder(device = "mixed", output_type = types.RGB))

# Every op that follows a "gpu" (or "mixed") op in a DALI pipeline must also be a "gpu" op,
# as the data is assumed to stay on the GPU through training.
                                                         
oplist.append(ops.Resize(device = "gpu", image_type = types.RGB, 
                            interp_type = types.INTERP_LINEAR, resize_x=WIDTH, resize_y=HEIGHT))

# CropMirrorNormalize is a multipurpose op. We are just using it for normalizing, 
# but if we also want to crop or mirror we use the same op. Doing them all at once saves time.
oplist.append(ops.CropMirrorNormalize(device="gpu",output_dtype=types.FLOAT,
                                                    output_layout=types.NCHW,
                                                    image_type=types.RGB,
                                                    mean=[255//2, 255//2, 255//2],
                                                    std=[255//2, 255//2, 255//2]))
              
transforms_list = ComposeOps(oplist)
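
The `mean` and `std` values above work on the raw 0-255 pixel range, while the torchvision pipeline used later for comparison first scales pixels to 0-1 (`ToTensor`) and then applies `Normalize` with 0.5. A minimal per-pixel sketch of the two formulas (the function names are just for illustration) shows that they agree closely but not exactly:

```python
def dali_style(pixel, mean=255 // 2, std=255 // 2):
    # Normalization on the raw 0..255 range, as configured in CropMirrorNormalize above
    return (pixel - mean) / std

def torchvision_style(pixel, mean=0.5, std=0.5):
    # ToTensor scales to 0..1 first, then Normalize subtracts the mean and divides by std
    return (pixel / 255.0 - mean) / std

# Largest difference between the two formulas over every possible 8-bit pixel value
max_gap = max(abs(dali_style(p) - torchvision_style(p)) for p in range(256))

print(dali_style(0), torchvision_style(0))  # -1.0 -1.0
print(max_gap < 0.01)                       # True
```

The gap comes from `255//2` being 127 rather than 127.5. It is small enough not to matter for this comparison.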

# Create 'TransformPipeline', feed DALI operators into it:

Once we have composed our DALI operations, we feed them into TransformPipeline, which takes the composed ops and translates them into an Ignite-compatible DALI graph.
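
The helper module itself is not shown in this notebook. As a rough mental model only, a compose-style wrapper can be sketched as below. `ComposeSketch` is a hypothetical stand-in, not the actual `ComposeOps` implementation, and plain functions stand in for DALI operators.

```python
class ComposeSketch:
    """Hypothetical stand-in for a compose helper: applies callables in order."""

    def __init__(self, ops):
        self.ops = list(ops)

    def __call__(self, data):
        # Feed the output of each op into the next one
        for op in self.ops:
            data = op(data)
        return data

# Plain functions stand in for DALI operators here
pipeline = ComposeSketch([lambda x: x + 1, lambda x: x * 2])
print(pipeline(3))  # 8
```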

In [8]:
from dali_transform_utilities import TransformPipeline, DALILoader

In [9]:
from nvidia.dali.plugin.pytorch import DALIClassificationIterator

pipe = TransformPipeline(batch_size=BATCH_SIZE,
        num_threads=8,
        device_id=0,
        transform=transforms_list,
        reader=ops.FileReader(file_root = "%s/train"%DATASET_NAME, random_shuffle = True))

dali_iter = DALILoader([pipe])

In [10]:
pipe = TransformPipeline(batch_size=BATCH_SIZE,
        num_threads=8,
        device_id=0,
        transform=transforms_list,
        reader=ops.FileReader(file_root = "%s/validation"%DATASET_NAME))

dali_iter_valid = DALILoader([pipe])

# Set up PyTorch ImageFolder for comparison

Let's set up a traditional PyTorch workflow with all the default loaders just for comparison, both syntax and performance wise:

In [11]:
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from PIL import Image

transform_list = []

transform_list.append(transforms.Resize((HEIGHT,WIDTH)))  # Resize expects (height, width)
transform_list.append(transforms.ToTensor())
transform_list.append(transforms.Normalize((0.5,), (0.5,)))

transform = transforms.Compose(transform_list)

train_set = datasets.ImageFolder("%s/train" % DATASET_NAME, transform=transform, target_transform=None, loader=Image.open)
valid_set = datasets.ImageFolder("%s/validation" % DATASET_NAME, transform=transform, target_transform=None, loader=Image.open)

train_loader_folder = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)
valid_loader_folder = DataLoader(valid_set, batch_size=BATCH_SIZE, shuffle=True)
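
`ImageFolder` derives its integer labels from the sorted names of the class subdirectories. The sketch below mirrors that logic with the standard library only (`class_to_index` is an illustrative name). That the DALI `FileReader` assigns the same indices is an assumption here and worth verifying before comparing predictions across the two loaders.

```python
import os
import tempfile

def class_to_index(root):
    # Sorted subdirectory names become class indices 0, 1, ...
    classes = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
    return {name: index for index, name in enumerate(classes)}

with tempfile.TemporaryDirectory() as root:
    # Create the folders in "wrong" order to show that sorting decides the labels
    for name in ("dogs", "cats"):
        os.mkdir(os.path.join(root, name))
    mapping = class_to_index(root)

print(mapping)  # {'cats': 0, 'dogs': 1}
```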

# Build a LeNet-esque test network

Build a simple example just so we can see how to put everything together:

In [12]:
import torch
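import numpy as np

def get_flat_size(in_size, fts):
    # Run a dummy batch of one through the feature layers to measure the flattened output size.
    # torch.autograd.Variable is deprecated; plain tensors work directly.
    with torch.no_grad():
        f = fts(torch.ones(1, *in_size))
    return int(np.prod(f.size()[1:]))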

In [13]:
import torch.nn as nn
import torch.nn.functional as F

layers = []

def add_conv_layer(layers, size_in=64, size_out=64, maxpool=True):
    layers.append(nn.Conv2d(size_in,size_out,3))
    layers.append(nn.ReLU())
    layers.append(nn.BatchNorm2d(size_out))
    if maxpool:
        layers.append(nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

add_conv_layer(layers, size_in=NUM_CHANNELS)
add_conv_layer(layers)
add_conv_layer(layers)

layers.append(nn.Flatten())
size = get_flat_size((NUM_CHANNELS,WIDTH,HEIGHT), nn.Sequential(*layers))

layers.append(nn.Linear(size,32))
layers.append(nn.Dropout(.6))
layers.append(nn.Linear(32,10))  # note: 10 outputs, although only NUM_LABELS=2 classes are ever targeted
layers.append(nn.Softmax(dim=1))

model = nn.Sequential(*layers)
model.to("cuda:0")

Sequential(
  (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1))
  (1): ReLU()
  (2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (3): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
  (5): ReLU()
  (6): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (7): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (8): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
  (9): ReLU()
  (10): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (11): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (12): Flatten()
  (13): Linear(in_features=14400, out_features=32, bias=True)
  (14): Dropout(p=0.6, inplace=False)
  (15): Linear(in_features=32, out_features=10, bias=True)
  (16): Softmax(dim=1)
)
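
The `in_features=14400` of the first Linear layer can be checked by hand with the standard output-size formula for convolution and pooling layers. This is a small sketch, and `conv_out` is just an illustrative helper:

```python
def conv_out(size, kernel=3, stride=1, padding=0):
    # Standard output-size formula for Conv2d / MaxPool2d along one spatial dimension
    return (size + 2 * padding - kernel) // stride + 1

side = 128  # WIDTH and HEIGHT are both 128
for _ in range(3):
    side = conv_out(side, kernel=3, stride=1, padding=0)  # Conv2d with a 3x3 kernel
    side = conv_out(side, kernel=3, stride=2, padding=1)  # MaxPool2d(kernel_size=3, stride=2, padding=1)

flat = 64 * side * side  # 64 channels after the last conv block
print(side, flat)  # 15 14400
```

The side length goes 128 → 126 → 63 → 61 → 31 → 29 → 15, which matches what `get_flat_size` measures by running a dummy input.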

# Set up trainer and evaluator
### ...and make them look like Keras (Mostly for fun, but also for timing)

prepare_batch just reformats the batches coming out of DALI so they can be fed into Ignite:

In [14]:
from dali_example_utilities import create_custom_supervised_trainer
from ignite.engine import create_supervised_evaluator

def prepare_batch(batch, device, non_blocking=True):
    x = batch[0]["data"]
    y = batch[0]["label"]
    y = y.squeeze().long().to(device)
    return x, y

Set up our trainers using the custom trainer functions:

In [15]:
from ignite.metrics import Accuracy, Loss, RunningAverage, ConfusionMatrix
import torch.optim as optim

opt = optim.Adam(model.parameters(), lr=0.0001)
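# NLLLoss expects log-probabilities. The model ends in Softmax, so the logged loss values come out negative;
# nn.LogSoftmax(dim=1) would be the matching output layer.
loss = nn.NLLLoss()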

def get_metrics():
    return {'accuracy' : Accuracy(), 'nll' : Loss(loss)}

trainer = create_custom_supervised_trainer(model, opt, loss, metrics=get_metrics(), prepare_batch=prepare_batch, device="cuda:0")
evaluator = create_supervised_evaluator(model, metrics=get_metrics(), prepare_batch=prepare_batch, device="cuda:0")

trainer_default = create_custom_supervised_trainer(model, opt, loss, metrics=get_metrics(), device="cuda:0")
evaluator_default = create_supervised_evaluator(model, metrics=get_metrics(), device="cuda:0")
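
For a single sample, NLLLoss returns the negated model output at the target class. A small worked example (plain Python, with a made-up two-class prediction) shows what that gives for probabilities versus log-probabilities:

```python
import math

def nll(scores, target):
    # What NLLLoss computes for one sample: the negated score of the target class
    return -scores[target]

probs = [0.5, 0.5]                          # made-up Softmax output for two classes
log_probs = [math.log(p) for p in probs]    # what a LogSoftmax output would contain

print(nll(probs, 0))                # -0.5
print(round(nll(log_probs, 0), 4))  # 0.6931
```

With probabilities as input the loss is bounded between -1 and 0, so a value near -0.50 corresponds to a model that is guessing.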

Add bells and whistles: (functionality of said bells and whistles is highly dependent on your Jupyter/ipywidgets environment)

In [16]:
from dali_example_utilities import make_keras_like # Add the progress bars

make_keras_like(trainer, evaluator, dali_iter_valid)
make_keras_like(trainer_default, evaluator_default, valid_loader_folder)

# Set up early stopping

In [17]:
from ignite.engine import Events
from ignite.handlers import ModelCheckpoint, EarlyStopping

def score_function(engine):
    return -1*engine.state.metrics['nll']

handler = EarlyStopping(patience=10, score_function=score_function, trainer=trainer)
evaluator.add_event_handler(Events.COMPLETED, handler)

handler_trainer_default = EarlyStopping(patience=10, score_function=score_function, trainer=trainer_default)
evaluator_default.add_event_handler(Events.COMPLETED, handler_trainer_default)

<ignite.engine.events.RemovableEventHandle at 0x7f10c00e5dd8>
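
To make the role of `patience` concrete, here is a simplified sketch of patience-based stopping. It is not Ignite's actual implementation, and the validation losses are made up for illustration. As in the cell above, the score is the negated loss, so a higher score is better.

```python
def evaluations_until_stop(scores, patience):
    """Return the 1-based evaluation at which training would stop, or None."""
    best = None
    stale = 0  # evaluations since the last improvement
    for evaluation, score in enumerate(scores, start=1):
        if best is None or score > best:
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return evaluation
    return None

val_nll = [0.70, 0.65, 0.66, 0.67, 0.68]   # made-up validation losses
scores = [-v for v in val_nll]             # same convention as score_function above

print(evaluations_until_stop(scores, patience=3))   # 5
print(evaluations_until_stop(scores, patience=10))  # None
```

Assuming the evaluator runs once per epoch, a patience of 10 cannot trigger within the 5 epochs trained below. It only matters for longer runs.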

# Train model!

In [18]:
trainer.run(dali_iter, max_epochs=5)

HBox(children=(IntProgress(value=0, max=31), HTML(value='')))



HBox(children=(IntProgress(value=0, max=15), HTML(value='')))

Train Epoch 1:  acc: 47.83% loss: -0.46, train time: 2.22s --- Valid Epoch 1:  acc: 52.50% loss: -0.50


HBox(children=(IntProgress(value=0, max=31), HTML(value='')))



HBox(children=(IntProgress(value=0, max=15), HTML(value='')))

Train Epoch 2:  acc: 50.35% loss: -0.50, train time: 0.77s --- Valid Epoch 2:  acc: 57.81% loss: -0.58


HBox(children=(IntProgress(value=0, max=31), HTML(value='')))



HBox(children=(IntProgress(value=0, max=15), HTML(value='')))

Train Epoch 3:  acc: 57.36% loss: -0.57, train time: 0.78s --- Valid Epoch 3:  acc: 64.27% loss: -0.64


HBox(children=(IntProgress(value=0, max=31), HTML(value='')))



HBox(children=(IntProgress(value=0, max=15), HTML(value='')))

Train Epoch 4:  acc: 62.10% loss: -0.62, train time: 0.77s --- Valid Epoch 4:  acc: 66.35% loss: -0.66


HBox(children=(IntProgress(value=0, max=31), HTML(value='')))



HBox(children=(IntProgress(value=0, max=15), HTML(value='')))

Train Epoch 5:  acc: 65.68% loss: -0.65, train time: 0.77s --- Valid Epoch 5:  acc: 70.42% loss: -0.70


State:
	iteration: 155
	epoch: 5
	epoch_length: 31
	max_epochs: 5
	output: <class 'tuple'>
	batch: <class 'list'>
	metrics: <class 'dict'>
	dataloader: <class 'dali_transform_utilities.DALILoader'>
	seed: 12

# Default Loaders:

In [19]:
trainer_default.run(train_loader_folder, max_epochs=5)

HBox(children=(IntProgress(value=0, max=32), HTML(value='')))



HBox(children=(IntProgress(value=0, max=16), HTML(value='')))

Train Epoch 1:  acc: 70.30% loss: -0.70, train time: 4.84s --- Valid Epoch 1:  acc: 71.10% loss: -0.71


HBox(children=(IntProgress(value=0, max=32), HTML(value='')))



HBox(children=(IntProgress(value=0, max=16), HTML(value='')))

Train Epoch 2:  acc: 73.15% loss: -0.73, train time: 4.79s --- Valid Epoch 2:  acc: 73.70% loss: -0.74


HBox(children=(IntProgress(value=0, max=32), HTML(value='')))



HBox(children=(IntProgress(value=0, max=16), HTML(value='')))

Train Epoch 3:  acc: 76.75% loss: -0.75, train time: 4.79s --- Valid Epoch 3:  acc: 73.10% loss: -0.73


HBox(children=(IntProgress(value=0, max=32), HTML(value='')))



HBox(children=(IntProgress(value=0, max=16), HTML(value='')))

Train Epoch 4:  acc: 78.20% loss: -0.77, train time: 4.82s --- Valid Epoch 4:  acc: 74.90% loss: -0.74


HBox(children=(IntProgress(value=0, max=32), HTML(value='')))



HBox(children=(IntProgress(value=0, max=16), HTML(value='')))

Train Epoch 5:  acc: 78.75% loss: -0.78, train time: 4.83s --- Valid Epoch 5:  acc: 74.90% loss: -0.75


State:
	iteration: 160
	epoch: 5
	epoch_length: 32
	max_epochs: 5
	output: <class 'tuple'>
	batch: <class 'list'>
	metrics: <class 'dict'>
	dataloader: <class 'torch.utils.data.dataloader.DataLoader'>
	seed: 12
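
The per-epoch train times logged in the two runs above can be turned into a rough speedup figure. The sketch below uses epochs 2-5 only, treating the first epoch as warm-up (the first DALI epoch took 2.22s). It is a single run on one machine, so read the result as indicative rather than as a benchmark.

```python
# Train times in seconds, copied from the logs above (epochs 2-5)
dali_epoch_times = [0.77, 0.78, 0.77, 0.77]
default_epoch_times = [4.79, 4.79, 4.82, 4.83]

def mean(values):
    return sum(values) / len(values)

speedup = mean(default_epoch_times) / mean(dali_epoch_times)
print(round(speedup, 1))  # 6.2
```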

# Conclusion:

In conclusion, we built a DALI pipeline for Ignite that mimics, syntax-wise, how we might do it with the default PyTorch transforms and loaders! Depending on the GPU used, you can get a great speedup just by offloading the JPEG decode alone onto the GPU, even before doing any true data augmentation (in the runs above, roughly 0.8s versus 4.8s per training epoch).

DALI does a lot more than this though: the whole data augmentation pipeline can be managed by DALI, and it has a variety of operations and transformations to help with this. Finally, it is worth noting that, while not covered in this notebook, DALI does have multi-GPU support. This is especially useful when moving to very large jobs with complex augmentation pipelines.

For more information on the transformations and features available be sure to check out the DALI documentation!

https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/index.html