# Accelerating and scaling inference with ONNX on GPU
## 01 - Getting started
#### By Ramon Lins
------------------

**Table of contents**
* [Introduction](#introduction)
* [Setup](#setup)
* [Tutorial](#tutorial)
* [Visualization](#zetane)
* [Optional](#option)

Reference:

- Tutorial
    > https://pytorch.org/tutorials/advanced/super_resolution_with_onnxruntime.html (cpu tutorial)
    
    > https://pytorch.org/docs/master/onnx.html

- Setup
    > https://onnxruntime.ai/

    > https://developer.nvidia.com/cuda-10.1-download-archive-update2?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=deblocal (cuda 10.1)

    > https://developer.nvidia.com/compute/machine-learning/cudnn/secure/7.6.5.32/Production/10.2_20191118/cudnn-10.2-linux-x64-v7.6.5.32.tgz (cudnn 7.6.5)

    > https://docs.nvidia.com/cuda/cuda-installation-guide-linux/

- ONNX
    > https://onnxruntime.ai/docs/tutorials/export-pytorch-model.html
    
    > https://pytorch.org/docs/master/onnx.html

- ONNXRuntime
    > https://onnxruntime.ai/docs/tutorials/
    
    > https://github.com/microsoft/onnxruntime
    
    > https://onnxruntime.ai/docs/tutorials/accelerate-pytorch/pytorch.html

    > https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/transformers/notebooks/PyTorch_Bert-Squad_OnnxRuntime_GPU.ipynb (gpu tutorial)
    
    > https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html (version compatibility)
    
    > https://onnxruntime.ai/docs/build/eps.html#cuda
    
- Visualization
    > https://github.com/onnx/tutorials/blob/main/tutorials/VisualizingAModel.md

- Comparison
    > https://github.com/onnx/tutorials/blob/main/tutorials/CorrectnessVerificationAndPerformanceComparison.ipynb

- Optional
    > https://github.com/onnx/onnx-docker/blob/master/onnx-ecosystem/converter_scripts/float32_float16_onnx.ipynb
    
    > https://github.com/onnx/onnx-docker

<a id="introduction"></a>
### Introduction


ONNX Runtime is an open source project designed to accelerate machine learning across a wide variety of 
frameworks, operating systems, and hardware platforms.

The main objective of this task is to use the ONNX engine to optimize the patch-based density model,
a customized VGG-16 network, in order to reduce latency.

<a id="setup"></a>
### Setup

Create an `environment.yml` with:
```
name: onnx_gpu
channels:
  - pytorch-lts
dependencies:
  - python=3.7.*
  - pytorch=1.8.2
  - torchvision=0.9.2
  - cudatoolkit=10.2
  - pip
  - pip:
      - onnx
      - onnxruntime-gpu==1.4
```
Then run in a terminal:

```
conda env create
```

Prerequisites to run the Jupyter notebook:
```bash
conda activate onnx_gpu
conda install -c anaconda ipykernel
conda install -c conda-forge ipywidgets
python -m ipykernel install --user --name=onnx_gpu
```

<a id="tutorial"></a>
### Tutorial


PyTorch uses its own built-in CUDA and cuDNN versions:

In [1]:
import io
import numpy as np

import torch.nn.init as init
import torch.utils.model_zoo as model_zoo
import torch.onnx

from torch import nn

print("pytorch version:", torch.__version__)
print("cuda version:" , torch.version.cuda)
print("cudnn version:", torch.backends.cudnn.version())

pytorch version: 1.8.2
cuda version: 10.1
cudnn version: 7603
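
The cuDNN version is printed as a single integer. As a minimal sketch, assuming the cuDNN 7.x/8.x encoding `major*1000 + minor*100 + patch`, it can be decoded as follows. `decode_cudnn_version` is a hypothetical helper, not part of PyTorch.

```python
def decode_cudnn_version(version):
    """Split an integer such as 7603 into (major, minor, patch).

    Assumes the cuDNN 7.x/8.x encoding major*1000 + minor*100 + patch.
    """
    major, rest = divmod(version, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


# 7603 is the value printed by torch.backends.cudnn.version() above.
print(decode_cudnn_version(7603))  # (7, 6, 3)
```

Under that assumption, the cuDNN bundled with PyTorch here is 7.6.3, which is not the same build as the 7.6.5 release installed for ONNX Runtime in this section.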


To run ONNX Runtime on the GPU, CUDA and cuDNN must also be installed on the system, separately from the versions bundled with PyTorch.

To install ***CUDA*** (10.1), follow these instructions:

```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin

sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600

wget https://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb

sudo dpkg -i cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb

sudo apt-key add /var/cuda-repo-10-1-local-10.1.243-418.87.00/7fa2af80.pub

sudo apt-get update

sudo apt-get -y install cuda
```

To install ***cuDNN*** (7.6.5), follow these instructions:

```bash
wget https://developer.nvidia.com/compute/machine-learning/cudnn/secure/7.6.5.32/Production/10.2_20191118/cudnn-10.2-linux-x64-v7.6.5.32.tgz

tar -zxf cudnn-10.2-linux-x64-v7.6.5.32.tgz

cd cuda
sudo cp -P lib64/* /usr/local/cuda/lib64/
sudo cp -P include/* /usr/local/cuda/include/
```

The environment paths may not be set correctly after installing CUDA and cuDNN. If so:
- The path to the CUDA installation must be provided via the `CUDA_PATH` environment variable.
- The path to the cuDNN installation (include the `cuda` folder in the path) must be provided via the `cuDNN_PATH` environment variable. The cuDNN path should contain `bin`, `include` and `lib` directories.

Also add the CUDA binaries and libraries to the search paths, as described in https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html:

export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}

export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
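
Whether these exports took effect can be checked from Python. The sketch below assumes the default `/usr/local/cuda` prefix used in this section and Linux-style `:` separators. `missing_cuda_entries` is a hypothetical helper, not part of any library.

```python
import os

# Assumed default install locations, taken from the export lines above.
EXPECTED = {
    "PATH": "/usr/local/cuda/bin",
    "LD_LIBRARY_PATH": "/usr/local/cuda/lib64",
}


def missing_cuda_entries(env=None):
    """Return the names of variables that do not list the expected CUDA directory."""
    env = os.environ if env is None else env
    missing = []
    for name, expected_dir in EXPECTED.items():
        # Split on ":" (Linux) and ignore empty entries and trailing slashes.
        entries = [e.rstrip("/") for e in env.get(name, "").split(":") if e]
        if expected_dir not in entries:
            missing.append(name)
    return missing


# An empty list means both variables contain the expected directory.
print("missing:", missing_cuda_entries())
```

If either name is reported as missing inside the notebook, the kernel was probably started from a shell that did not have the exports applied.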

In [2]:
batch_size = 1 # batch size, also used by the benchmarking cells below

class SuperResolutionNet(nn.Module):
    def __init__(self, upscale_factor: int):
        """Super Resolution Network for increasing the resolution of images

        Args:
            upscale_factor (int): The factor by which the image resolution is increased
        """
        super().__init__()
        in_channels = 1 # the network takes a single-channel (Y, luminance) image
        num_filters = 64 # number of filters
        kernel_size_in = 5 # 5x5 kernel for input convolution
        kernel_size_hl = 3 # 3x3 kernel for hidden layer convolution
        stride = 1 # stride of the convolution
        padding_in = 2 # padding for input convolution
        padding_hl = 1 # padding for hidden layers

        self.relu = nn.ReLU()
        # the first argument of Conv2d is the number of input channels, not the batch size
        self.conv1 = nn.Conv2d(in_channels, num_filters, kernel_size_in, stride, padding_in)
        self.conv2 = nn.Conv2d(num_filters, num_filters, kernel_size_hl, stride, padding_hl)
        self.conv3 = nn.Conv2d(num_filters, num_filters//2, kernel_size_hl, stride, padding_hl)
        self.conv4 = nn.Conv2d(num_filters//2, upscale_factor**2, kernel_size_hl, stride, padding_hl)
        self.pixel_shuffle = nn.PixelShuffle(upscale_factor)

        self._initialize_weights()

    def forward(self, x):
        """forward operation

        Args:
            x (tensor): input image of shape (batch_size, 1, H, W)

        Returns:
            tensor: output image of shape (batch_size, 1, H*upscale_factor, W*upscale_factor)
        """
        x = self.relu(self.conv1(x))
        x = self.relu(self.conv2(x))
        x = self.relu(self.conv3(x))
        
        return self.pixel_shuffle(self.conv4(x))
    
    def _initialize_weights(self):
        #initialize weights for the network using orthogonal initialization
        init.orthogonal_(self.conv1.weight, init.calculate_gain('relu'))
        init.orthogonal_(self.conv2.weight, init.calculate_gain('relu'))
        init.orthogonal_(self.conv3.weight, init.calculate_gain('relu'))
        init.orthogonal_(self.conv4.weight)


model = SuperResolutionNet(upscale_factor=3)

### Inference

In [3]:
# URL of the pretrained model weights
model_url = 'https://s3.amazonaws.com/pytorch/test_data/export/superres_epoch100-44c6958e.pth'

use_gpu = torch.cuda.is_available()
device = torch.device("cuda" if use_gpu else "cpu")

# load the pretrained weights (map_location lets the checkpoint load on CPU-only machines)
model.load_state_dict(model_zoo.load_url(model_url, map_location=device))

# evaluation mode
model.eval()
model.to(device)

SuperResolutionNet(
  (relu): ReLU()
  (conv1): Conv2d(1, 64, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv3): Conv2d(64, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv4): Conv2d(32, 9, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (pixel_shuffle): PixelShuffle(upscale_factor=3)
)
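
The shapes in the printout can be checked by hand: every convolution keeps the spatial size (stride 1 with matching padding), `conv4` emits `upscale_factor**2 = 9` channels, and the pixel shuffle rearranges those channels into a single channel at three times the resolution. A minimal sketch of that arithmetic, where `pixel_shuffle_shape` is a hypothetical helper, not a PyTorch function:

```python
def pixel_shuffle_shape(shape, upscale_factor):
    """Output shape of a pixel shuffle applied to an (N, C*r^2, H, W) input."""
    n, c, h, w = shape
    r2 = upscale_factor ** 2
    if c % r2 != 0:
        raise ValueError("channels must be divisible by upscale_factor squared")
    return (n, c // r2, h * upscale_factor, w * upscale_factor)


# conv4 outputs 9 channels for upscale_factor=3; 224x224 is the input size used later.
print(pixel_shuffle_shape((1, 9, 224, 224), 3))  # (1, 1, 672, 672)
```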

PyTorch inference latency

In [4]:
import time

total_samples = 10000
torch_inputs = torch.randn(total_samples, batch_size, 224, 224, requires_grad=True).to(device)

latency = []

# inference torch
for i in range(total_samples):
    torch_input = torch_inputs[i].unsqueeze_(0)
    start = time.time()
    model(torch_input)
    latency.append(time.time() - start)

print(f"Pytorch inference time = {sum(latency)}")

Pytorch inference time = 41.80208492279053
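
CUDA kernels are launched asynchronously, so wall-clock timing around a GPU call can under-report the work unless the device is synchronized before the clock is read. The sketch below shows one way to structure such a measurement, with warm-up calls and an optional synchronization hook. `time_calls`, `summarize` and the `sync` parameter are hypothetical names for illustration, and a stand-in workload is used so the block runs without a GPU.

```python
import statistics
import time


def time_calls(fn, inputs, warmup=10, sync=None):
    """Time fn(x) for each x in inputs and return per-call latencies in seconds.

    sync is an optional zero-argument callable (an assumption of this sketch);
    on a GPU it could be torch.cuda.synchronize, so that queued kernels finish
    before the clock is read.
    """
    sync = sync or (lambda: None)
    for x in inputs[:warmup]:  # warm-up calls are not timed
        fn(x)
    sync()
    latencies = []
    for x in inputs:
        start = time.perf_counter()
        fn(x)
        sync()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Return total, mean and median latency in seconds."""
    return {
        "total": sum(latencies),
        "mean": statistics.mean(latencies),
        "median": statistics.median(latencies),
    }


# Stand-in workload so the sketch runs without a GPU.
print(summarize(time_calls(lambda x: x * x, list(range(100)))))
```

With the notebook's objects, the call might look like `time_calls(model, [t.unsqueeze(0) for t in torch_inputs], sync=torch.cuda.synchronize)`, ideally inside `torch.no_grad()` so that no autograd graph is built during timing.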


Export the PyTorch model to ONNX

In [5]:
import onnx
import onnxruntime

print("onnx version:", onnx.__version__)
print("onnxruntime version:", onnxruntime.__version__)

onnx_version = onnx.__version__.split('.')[1]

onnx version: 1.12.0
onnxruntime version: 1.4.0


In [6]:
# input example necessary to export onnx model
torch_input = torch.randn(1, 1, 224, 224, requires_grad=True).to(device)

torch.onnx.export(
    model, # model being run
    torch_input, # model input (or a tuple for multiple inputs)
    "superres.onnx", # where to save the model (can be a file or file-like object)
    export_params=True, # store the trained parameter weights inside the model file
    input_names=['input'], # the model's input names
    output_names=['output'], # the model's output names
    dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}}) # variable length axes

# print a human-readable representation of the exported ONNX graph
# (the graph belongs to the loaded ONNX model, not to the PyTorch module)
print(onnx.helper.printable_graph(onnx.load("superres.onnx").graph))


Before verifying the model's output with ONNX Runtime, we check the ONNX model with ONNX's API.

1. First, `onnx.load("superres.onnx")` loads the saved model and returns an `onnx.ModelProto` structure, a top-level file/container format for bundling an ML model. For more information, see the `onnx.proto` documentation.
2. Then, `onnx.checker.check_model(onnx_model)` verifies the model's structure and confirms that the model has a valid schema. The validity of the ONNX graph is verified by checking the model's version, the graph's structure, and the nodes with their inputs and outputs.

In [7]:
onnx_model = onnx.load("superres.onnx")
# Check the consistency of the model.
# If the model is larger than 2GB, pass the file path to the checker instead of the loaded model.
# check_model returns None on success and raises on failure.
try:
    onnx.checker.check_model(onnx_model)
    print("ONNX export is valid.")
except onnx.checker.ValidationError as error:
    print("ONNX export is invalid.")
    print(error)

ONNX export is valid.


Compute the inference output using ONNX Runtime.

To run the model, it is necessary to create an inference session. Once it is created, the model is evaluated using the `run()` API.

In [8]:
# explicitly select the execution provider (EP)
# BUG: a warning is issued when the EP is used.
provider = ['CPUExecutionProvider'] if device.type == "cpu" else ['CUDAExecutionProvider']

# create an ONNX Runtime session
session = onnxruntime.InferenceSession("superres.onnx", providers=provider)

# NOTE: This can be a bottleneck on GPU devices: the tensor is first transferred to the CPU
# to be converted into a NumPy array, and the array is then transferred back to the GPU.

# turn tensor into numpy array 
def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

# inference onnx
onnx_output = session.run(None, {"input": to_numpy(torch_input)},)

# inference pytorch
torch_output = model(torch_input)

# compare the ONNX Runtime output with the PyTorch output
try:
    np.testing.assert_allclose(to_numpy(torch_output), onnx_output[0], rtol=1e-03, atol=1e-05)
    print("ONNX and PyTorch results match!")
except AssertionError as error:
    print("ONNX and PyTorch results do not match!")
    print(error)

ONNX and PyTorch results match!
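
Matching outputs do not show which execution provider actually ran the model. A minimal sketch of choosing providers from the ones that are really available, with the CPU kept as a fallback, is given below. `select_providers` is a hypothetical helper and the lists passed to it here are example values.

```python
def select_providers(available, prefer_gpu=True):
    """Pick execution providers in priority order from those actually available."""
    preferred = ["CUDAExecutionProvider"] if prefer_gpu else []
    preferred.append("CPUExecutionProvider")  # CPU is kept as the fallback
    chosen = [p for p in preferred if p in available]
    if not chosen:
        raise RuntimeError(f"none of {preferred} found in {available}")
    return chosen


print(select_providers(["CUDAExecutionProvider", "CPUExecutionProvider"]))
# ['CUDAExecutionProvider', 'CPUExecutionProvider']
print(select_providers(["CPUExecutionProvider"]))
# ['CPUExecutionProvider']
```

In the notebook, `available` would come from `onnxruntime.get_available_providers()`, and `session.get_providers()` reports what an existing session is using. If `CUDAExecutionProvider` is absent from either list, inference is running on the CPU.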


Measure inference latency with ONNX Runtime

In [9]:
onnx_inputs = []
for i in range(total_samples):
    onnx_inputs.append(to_numpy(torch_inputs[i].unsqueeze_(0)))

In [10]:
latency = []
for onnx_input in onnx_inputs:
    onnx_input = {"input": onnx_input}
    start = time.time()
    onnx_output = session.run(None, onnx_input)
    latency.append(time.time() - start)

print(f"ONNX inference time = {sum(latency)}")

ONNX inference time = 49.88228797912598
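
The two totals are easier to compare per sample. The sketch below only redoes the arithmetic on the numbers printed above, and `per_sample_ms` is a hypothetical helper.

```python
total_samples = 10000
pytorch_total_s = 41.80208492279053  # printed by the PyTorch run above
onnx_total_s = 49.88228797912598     # printed by the ONNX Runtime run above


def per_sample_ms(total_seconds, n):
    """Average latency per sample in milliseconds."""
    return total_seconds / n * 1000.0


pytorch_ms = per_sample_ms(pytorch_total_s, total_samples)
onnx_ms = per_sample_ms(onnx_total_s, total_samples)
ratio = onnx_total_s / pytorch_total_s

print(f"PyTorch: {pytorch_ms:.2f} ms/sample")       # PyTorch: 4.18 ms/sample
print(f"ONNX Runtime: {onnx_ms:.2f} ms/sample")     # ONNX Runtime: 4.99 ms/sample
print(f"ONNX Runtime / PyTorch = {ratio:.2f}")      # ONNX Runtime / PyTorch = 1.19
```

In this particular run, the ONNX Runtime total was about 19% higher than the PyTorch total. The two loops are not timed in the same way (the ONNX Runtime loop feeds NumPy arrays from host memory on every call), so the ratio should be read with that in mind.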


Test using images

In [11]:
from PIL import Image
import torchvision.transforms as transforms

img = Image.open("/home/ramon/Git/utils/img/cat.jpg")

resize= transforms.Resize([224, 224])
img_rs = resize(img)

img_ycbcr = img_rs.convert('YCbCr')
img_y, img_cb, img_cr = img_ycbcr.split()

to_tensor = transforms.ToTensor()
img_y = to_tensor(img_y)
img_y = img_y.unsqueeze_(0)

# inference
onnx_inputs = {session.get_inputs()[0].name: to_numpy(img_y)}
onnx_output = session.run(None, onnx_inputs)
img_out_y = onnx_output[0]

img_out_y = Image.fromarray(np.uint8((img_out_y[0] * 255.0).clip(0, 255)[0]), mode='L')

# build the output image, following the post-processing step from the PyTorch implementation
final_img = Image.merge(
    "YCbCr", [
        img_out_y,
        img_cb.resize(img_out_y.size, Image.BICUBIC),
        img_cr.resize(img_out_y.size, Image.BICUBIC),
    ]).convert("RGB")

# save the super-resolved image, creating the output directory if needed
import os
os.makedirs("out", exist_ok=True)
final_img.save("./out/cat.jpg")