# Accelerating and scaling inference with ONNX on GPU
## 01 - Getting started
#### By Ramon Lins
------------------

**Table of contents**
* [Introduction](#introduction)
* [Setup](#setup)
* [Tutorial](#tutorial)
* [Visualization](#zetane)
* [Optional](#option)

Reference:

- Tutorial
    > https://pytorch.org/tutorials/advanced/super_resolution_with_ort.html (cpu tutorial)
    
    > https://pytorch.org/docs/master/onnx.html

- Setup
    > https://onnxruntime.ai/

    > https://developer.nvidia.com/cuda-10.1-download-archive-update2?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=deblocal (cuda 10.1)

    > https://developer.nvidia.com/compute/machine-learning/cudnn/secure/7.6.5.32/Production/10.2_20191118/cudnn-10.2-linux-x64-v7.6.5.32.tgz (cudnn 7.6.5)

    > https://docs.nvidia.com/cuda/cuda-installation-guide-linux/

- ONNX
    > https://onnxruntime.ai/docs/tutorials/export-pytorch-model.html
    
    > https://pytorch.org/docs/master/onnx.html

- ONNXRuntime
    > https://onnxruntime.ai/docs/tutorials/
    
    > https://github.com/microsoft/onnxruntime
    
    > https://onnxruntime.ai/docs/tutorials/accelerate-pytorch/pytorch.html

    > https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/transformers/notebooks/PyTorch_Bert-Squad_OnnxRuntime_GPU.ipynb (gpu tutorial)
    
    > https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html (version compatibility)
    
    > https://onnxruntime.ai/docs/build/eps.html#cuda
    
- Visualization
    > https://github.com/onnx/tutorials/blob/main/tutorials/VisualizingAModel.md

- Comparison
    > https://github.com/onnx/tutorials/blob/main/tutorials/CorrectnessVerificationAndPerformanceComparison.ipynb

- Optional
    > https://github.com/onnx/onnx-docker/blob/master/onnx-ecosystem/converter_scripts/float32_float16_onnx.ipynb
    
    > https://github.com/onnx/onnx-docker

<a id="introduction"></a>
### Introduction


ONNX is an open source project designed to accelerate machine learning across a wide variety of 
frameworks, operating systems, and hardware platforms.

The main objective of this task is to use the ONNX engine to optimize the patch-based density model,
a customized VGG-16 network, and reduce its latency.

<a id="setup"></a>
### Setup

Create an `environment.yml` with:
```
name: onnx_gpu
channels:
  - pytorch-lts
dependencies:
  - python=3.7.*
  - pytorch=1.8.2
  - torchvision=0.9.2
  - cudatoolkit=10.1
  - pip
  - pip:
      - onnx
      - onnxruntime-gpu==1.4
```
Then run in a terminal:

```
conda env create
```

Prerequisites for running the Jupyter notebook:
```bash
conda activate onnx_gpu
conda install -c anaconda ipykernel
conda install -c conda-forge ipywidgets
python -m ipykernel install --user --name=onnx_gpu
```

<a id="tutorial"></a>
### Tutorial


PyTorch uses its own built-in CUDA and cuDNN versions:

In [28]:
import numpy as np
import torch.onnx

print("pytorch version:", torch.__version__)
print("cuda version:" , torch.version.cuda)
print("cudnn version:", torch.backends.cudnn.version())

pytorch version: 1.8.2
cuda version: 10.1
cudnn version: 7603
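
`torch.backends.cudnn.version()` returns a single integer rather than a dotted version string. A minimal sketch of decoding it, assuming the `major * 1000 + minor * 100 + patch` encoding used by cuDNN 7.x and 8.x (the helper name `decode_cudnn_version` is hypothetical, not part of PyTorch):

```python
def decode_cudnn_version(version: int) -> tuple:
    """Split a cuDNN 7.x/8.x version integer into (major, minor, patch).

    Assumes the encoding major * 1000 + minor * 100 + patch.
    """
    major, rest = divmod(version, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


# The value printed above, 7603, decodes to cuDNN 7.6.3.
print(decode_cudnn_version(7603))  # (7, 6, 3)
```

Under that assumption, the cuDNN bundled with this PyTorch build is 7.6.3, which is worth keeping in mind when comparing it with the system-wide cuDNN version used by ONNX Runtime.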


To run ONNX Runtime on the GPU, CUDA and cuDNN must also be installed system-wide.

To install ***CUDA*** (10.1), follow these instructions:

```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin

sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600

wget https://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb

sudo dpkg -i cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb

sudo apt-key add /var/cuda-repo-10-1-local-10.1.243-418.87.00/7fa2af80.pub

sudo apt-get update

sudo apt-get -y install cuda
```

To install ***cuDNN*** (7.6.5), follow these instructions:

```bash
wget https://developer.nvidia.com/compute/machine-learning/cudnn/secure/7.6.5.32/Production/10.2_20191118/cudnn-10.2-linux-x64-v7.6.5.32.tgz

tar -zxf cudnn-10.2-linux-x64-v7.6.5.32.tgz

cd cuda
sudo cp -P lib64/* /usr/local/cuda/lib64/
sudo cp -P include/* /usr/local/cuda/include/
```

The environment paths may not be set correctly, so after installing CUDA and cuDNN:
- The path to the CUDA installation must be provided via the `CUDA_PATH` environment variable.
- The path to the cuDNN installation (include the cuda folder in the path) must be provided via the `cuDNN_PATH` environment variable. The cuDNN path should contain bin, include and lib directories.

Following the [CUDA installation guide](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html), also add CUDA to `PATH` and `LD_LIBRARY_PATH`:

export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}

export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
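
A quick way to confirm that the two `export` lines took effect is to inspect the environment from Python. The sketch below is illustrative only: the helper name `cuda_dirs_on_path` is hypothetical, and it assumes the default `/usr/local/cuda` install location used in the commands above.

```python
import os


def cuda_dirs_on_path(env, cuda_root="/usr/local/cuda"):
    """Report whether the CUDA bin and lib64 directories are listed in the environment."""
    path_entries = env.get("PATH", "").split(":")
    lib_entries = env.get("LD_LIBRARY_PATH", "").split(":")
    return {
        "PATH": f"{cuda_root}/bin" in path_entries,
        "LD_LIBRARY_PATH": f"{cuda_root}/lib64" in lib_entries,
    }


# Check the environment of the current process (for example, the notebook kernel).
print(cuda_dirs_on_path(os.environ))
```

If either entry is `False` inside the notebook, the kernel was probably started from a shell where the exports had not been applied yet.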

### Inference

PyTorch

In [29]:
from models.estimator import Estimator

img_res = (512, 512)
model_path = "/home/ramon/Git/adroit/vision_foliage_density/weights/foliage_density_v3/density_model_reg.pth"

predictor = Estimator(img_res, model_path)

In [30]:
import time

device = 'cuda'
batch_size = 1
total_samples = 1000

In [31]:
# Create random input data
torch_inputs = torch.randn(total_samples, 3, 512, 512, requires_grad=False)
torch_inputs.shape

torch.Size([1000, 3, 512, 512])
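
It helps to know how much memory this input tensor takes, since it stays in host memory and only one batch at a time is moved to the GPU. A small sketch of the arithmetic, assuming the default `float32` dtype of `torch.randn` (4 bytes per element); the helper name `tensor_nbytes` is hypothetical:

```python
def tensor_nbytes(shape, bytes_per_element=4):
    """Return the size in bytes of a dense tensor (float32 by default)."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * bytes_per_element


nbytes = tensor_nbytes((1000, 3, 512, 512))
print(f"{nbytes} bytes = {nbytes / 2**30:.2f} GiB")  # 3145728000 bytes = 2.93 GiB
```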

In [36]:
# inference torch
# Remember that the profiler itself adds overhead
with torch.autograd.profiler.profile(use_cuda=True) as pytorch_profiler:
    for i in range(0, total_samples, batch_size):
        torch_input = torch_inputs[i:i+batch_size].to(device)
        predictor.estimate(torch_input)

In [37]:
# profile pytorch by cuda time total
print(pytorch_profiler.key_averages().table(sort_by="self_cuda_time_total"))

---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
          aten::cudnn_convolution         1.54%        1.200s         1.66%        1.293s      58.760us       59.034s        74.43%       59.034s       2.683ms         22000  
                       aten::add_         0.58%     449.405ms         0.58%     449.405ms      20.427us        7.735s         9.75%        7.735s     351.600us         22000  
                 aten::threshold_         0.32%     252.075ms         0.32%     252.075ms      10.960us        6.446s  

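Loading the model onto the GPU seems to be the main source of latency, although the rows shown above attribute 74.43% of the self CUDA time to `aten::cudnn_convolution`.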
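
The figures in the (truncated) profiler table can be turned into per-sample numbers with a little arithmetic. The sketch below only reuses values printed in the `aten::cudnn_convolution` row; the total CUDA time is an estimate derived from the reported percentage, and all timings include profiler overhead.

```python
total_samples = 1000        # forward passes profiled above (batch_size = 1)
conv_calls = 22000          # "# of Calls" for aten::cudnn_convolution
conv_self_cuda_s = 59.034   # "Self CUDA" for aten::cudnn_convolution, in seconds
conv_self_cuda_pct = 74.43  # "Self CUDA %" for aten::cudnn_convolution

convs_per_sample = conv_calls // total_samples
conv_ms_per_call = conv_self_cuda_s / conv_calls * 1e3
conv_ms_per_sample = conv_self_cuda_s / total_samples * 1e3
# Estimate of the total CUDA time per sample, from the convolution share.
total_cuda_ms_per_sample = conv_ms_per_sample / (conv_self_cuda_pct / 100)

print(convs_per_sample)                    # 22
print(round(conv_ms_per_call, 3))          # 2.683
print(round(conv_ms_per_sample, 1))        # 59.0
print(round(total_cuda_ms_per_sample, 1))  # 79.3
```

The per-call value matches the "CUDA time avg" column (2.683ms), which is a useful sanity check on how the table is being read.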

Export the PyTorch model to ONNX

In [38]:
import onnx
import onnxruntime as ort

ort.set_default_logger_severity(0)

print("onnx version:", onnx.__version__)
print("onnxruntime version:", ort.__version__)

onnx_version = onnx.__version__.split('.')[1]

onnx version: 1.12.0
onnxruntime version: 1.4.0


In [39]:
# input example necessary to export onnx model
torch_input = torch.randn(batch_size, 3, 512, 512, requires_grad=True).to(device)

model = Estimator((512, 512), model_path).model

torch.onnx.export(
    model, # model being run
    torch_input, # model input (or a tuple for multiple inputs)
    "density.onnx", # where to save the model (can be a file or file-like object)
    opset_version=12,
    export_params=True, # store the trained parameter weights inside the model file
    input_names=['input'], # the model's input names
    output_names=['output'], # the model's output names
    dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}}) # variable length axes

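try:
    # print a human-readable representation of the graph if it exists
    # NOTE: `model` is the PyTorch module, which has no `graph` attribute, so the except
    # branch runs; pass onnx.load("density.onnx").graph to print the exported graph instead
    print(onnx.helper.printable_graph(model.graph))
except AttributeError as error:
    print(error)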

'VGGReg' object has no attribute 'graph'


Before verifying the model’s output with ONNX Runtime, we will check the ONNX model with ONNX’s API.

1. First, onnx.load("density.onnx") will load the saved model and will output an onnx.ModelProto structure (a top-level file/container format for bundling an ML model. For more information, see the onnx.proto documentation.).
2. Then, onnx.checker.check_model(onnx_model) will verify the model’s structure and confirm that the model has a valid schema. The validity of the ONNX graph is verified by checking the model’s version, the graph’s structure, as well as the nodes and their inputs and outputs.

In [40]:
onnx_model = onnx.load("density.onnx")
# Check the consistency of the model (check_model returns None on success and raises on failure).
# If the model is larger than 2GB, pass the file path instead of the loaded model.
try:
    onnx.checker.check_model(onnx_model)
    print("ONNX export is valid.")
except onnx.checker.ValidationError as error:
    print("ONNX export is invalid:", error)

ONNX export is valid.


Compute the inference output using ONNX Runtime.

To run the model, it is necessary to create an inference session. Once it is created, the model is evaluated using the `run()` API.
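
When creating the session it is worth checking which execution provider will actually run the model, since a session that silently falls back to the CPU makes any GPU latency comparison meaningless. The sketch below shows only the selection logic; `pick_providers` is a hypothetical helper, and whether the installed ONNX Runtime version accepts a `providers` argument on `InferenceSession` is an assumption to verify against its documentation.

```python
def pick_providers(available, preferred=("CUDAExecutionProvider", "CPUExecutionProvider")):
    """Return the preferred execution providers that are actually available, in priority order."""
    return [name for name in preferred if name in available]


# With onnxruntime installed, `available` would come from ort.get_available_providers(),
# and the result would be passed as InferenceSession(..., providers=...). Afterwards,
# session.get_providers() reports which providers the session really uses.
print(pick_providers(["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]))
# ['CUDAExecutionProvider', 'CPUExecutionProvider']
print(pick_providers(["CPUExecutionProvider"]))
# ['CPUExecutionProvider']
```

If `CUDAExecutionProvider` is missing from the available list, the usual suspects are the CUDA/cuDNN versions or the library paths set up earlier.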

In [41]:
ort.set_default_logger_severity(2)

# create an ONNX Runtime session
session = ort.InferenceSession("density.onnx")

# NOTE: This can be a bottleneck for GPU devices, since the tensor is transferred to the CPU to
# convert it to a numpy array. Then, the numpy array is transferred back to the GPU.

# turn a tensor into a numpy array
def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

# inference with ONNX Runtime
onnx_output = session.run(None, {"input": to_numpy(torch_input)})

# inference with PyTorch
torch_output = predictor.estimate(torch_input)

# compare ONNX Runtime results with PyTorch results
try:
    np.testing.assert_allclose(to_numpy(torch_output), onnx_output[0], rtol=1e-03, atol=1e-05)
    print("ONNX and PyTorch results match!")
except AssertionError as error:
    print("ONNX and PyTorch results do not match!")
    print(error)

ONNX and PyTorch results match!
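
For reference, the `rtol` and `atol` arguments of `np.testing.assert_allclose(actual, desired, ...)` combine into a single per-element bound, `|actual - desired| <= atol + rtol * |desired|`. A pure-Python sketch of that criterion for scalars (the function name `within_tolerance` is hypothetical):

```python
def within_tolerance(actual, desired, rtol=1e-03, atol=1e-05):
    """Scalar version of the assert_allclose criterion:
    |actual - desired| <= atol + rtol * |desired|."""
    return abs(actual - desired) <= atol + rtol * abs(desired)


# Allowed error for desired = 0.5 is 1e-05 + 1e-03 * 0.5 = 5.1e-04.
print(within_tolerance(0.5004, 0.5))  # True
print(within_tolerance(0.5010, 0.5))  # False
```

In the cell above, the PyTorch output is `actual` and the ONNX Runtime output is `desired`, so the relative part of the tolerance scales with the ONNX Runtime values.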


Measure inference latency with ONNX Runtime

In [42]:
with torch.autograd.profiler.profile(use_cuda=True) as onnx_profiler:
    for i in range(0, total_samples, batch_size):
        onnx_input = {"input": to_numpy(torch_inputs[i:i+batch_size])}
        onnx_output = session.run(None, onnx_input)

In [43]:
print(onnx_profiler.key_averages().table(sort_by="self_cuda_time_total"))

--------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
--------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
            aten::to        30.89%       9.258ms        30.89%       9.258ms       9.258us       4.676ms       100.00%       4.676ms       4.676us          1000  
         aten::slice        52.08%      15.609ms        69.11%      20.713ms      20.713us       0.000us         0.00%       0.000us       0.000us          1000  
    aten::as_strided        17.03%       5.104ms        17.03%       5.104ms       5.104us       0.000us         0.00%       0.000us       0.000us          1000  
-------------------- 


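The PyTorch profiler does not work for the ONNX model: the table only contains the PyTorch-side slicing and conversion ops (`aten::to`, `aten::slice`, `aten::as_strided`), and nothing from the ONNX Runtime session itself.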
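
Since the PyTorch profiler cannot see inside the ONNX Runtime session, one backend-agnostic alternative is a plain wall-clock measurement. The following is a minimal sketch rather than the method used in this notebook: `measure_latency` and its parameters are hypothetical, and the workload is a stand-in so the block runs on its own.

```python
import statistics
import time


def measure_latency(fn, n_runs=100, warmup=10, sync=None):
    """Time fn() with a wall clock and return latency statistics in milliseconds.

    sync is an optional callable that blocks until pending device work is done.
    """
    for _ in range(warmup):  # warm-up runs are not timed
        fn()
    if sync is not None:
        sync()
    latencies_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for asynchronous GPU work before stopping the clock
        latencies_ms.append((time.perf_counter() - start) * 1e3)
    return {
        "mean_ms": statistics.mean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "max_ms": max(latencies_ms),
    }


# Stand-in workload so the sketch runs without a GPU or a model.
stats = measure_latency(lambda: sum(range(10_000)), n_runs=20, warmup=2)
print(stats)
```

For the PyTorch model, `fn` could wrap `predictor.estimate(...)` with `sync=torch.cuda.synchronize`, because CUDA kernels are launched asynchronously. For ONNX Runtime, `fn` could wrap `session.run(None, onnx_input)`. Using the same timer for both keeps the comparison like for like.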