In [None]:
# Copyright 2022 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

<img src="https://developer.download.nvidia.com/tesla/notebook_assets/nv_logo_torch_trt_resnet_notebook.png" style="width: 90px; float: right;">

# ResNet C++ Serving Example

This example shows how to load a pretrained ResNet-50 model, convert it to a Torch-TensorRT optimized model with the Torch-TensorRT Python API, save it as a TorchScript module, and then load and serve it with the PyTorch C++ API. The workflow is summarized in the diagram below:

<img src="./images/Torch-TensorRT-CPP-inference.JPG">

The Python conversion largely follows the [Resnet50-example](./Resnet50-example.ipynb). For simplicity, here we only download the model and convert it.


## Prerequisites
This example should be run from an NGC PyTorch container.
```
docker pull nvcr.io/nvidia/pytorch:22.05-py3
docker run --rm --net=host -it nvcr.io/nvidia/pytorch:22.05-py3 bash
```
Although this example was tested with the `22.05` container, you can try a later version by replacing `22.05` in the commands above.

Inside the container, install and start JupyterLab with:
```
apt update && pip install jupyterlab
jupyter lab --ip 0.0.0.0 --allow-root
```


## 1. Download and optimize the ResNet-50 pretrained model

In [None]:
import torch
import torchvision

torch.hub._validate_not_a_forked_repo=lambda a,b,c: True

resnet50_model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True)
resnet50_model.eval()

### Torch-TensorRT optimization

In [None]:
import torch_tensorrt

# The compiled module runs with the precisions listed in "enabled_precisions".
# Here, it runs in FP32.
trt_model_fp32 = torch_tensorrt.compile(
    resnet50_model,
    inputs = [torch_tensorrt.Input((128, 3, 224, 224), dtype=torch.float32)],
    enabled_precisions = {torch.float32}, # Run with FP32
    workspace_size = 1 << 22
)

Next, we save this optimized model for later inference in C++.

In [None]:
trt_model_fp32.save('trt_model_fp32.ts')

Similarly, we optimize and save the model with FP16 precision.

In [None]:
# The compiled module runs with the precisions listed in "enabled_precisions".
# Here, it runs in FP16.
trt_model_fp16 = torch_tensorrt.compile(resnet50_model, inputs = [torch_tensorrt.Input((128, 3, 224, 224), dtype=torch.half)],
    enabled_precisions = {torch.half}, # Run with FP16
    workspace_size = 1 << 22
)
trt_model_fp16.save('trt_model_fp16.ts')
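
To put the `(128, 3, 224, 224)` input shape and `workspace_size = 1 << 22` from the cells above in perspective, here is a small arithmetic sketch that uses only the Python standard library. The helper name `tensor_bytes` is hypothetical and not part of Torch-TensorRT; it simply multiplies the element count by the element size.

```python
from math import prod

# Input shape used in the compile calls above: (batch, channels, height, width).
INPUT_SHAPE = (128, 3, 224, 224)

# Bytes per element for the two precisions used in this example.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2}


def tensor_bytes(shape, precision):
    """Size in bytes of a dense tensor (hypothetical helper, for illustration)."""
    return prod(shape) * BYTES_PER_ELEMENT[precision]


MIB = 1 << 20

workspace_bytes = 1 << 22  # the value passed as workspace_size above
fp32_input_bytes = tensor_bytes(INPUT_SHAPE, "fp32")
fp16_input_bytes = tensor_bytes(INPUT_SHAPE, "fp16")

print(f"workspace_size:   {workspace_bytes / MIB:.2f} MiB")
print(f"FP32 input batch: {fp32_input_bytes / MIB:.2f} MiB")
print(f"FP16 input batch: {fp16_input_bytes / MIB:.2f} MiB")
```

The workspace here is 4 MiB, while one FP32 input batch is 73.5 MiB and the FP16 batch is half that. If TensorRT warns that it skipped tactics for lack of workspace memory, raising `workspace_size` is a reasonable thing to try.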

## 2. Load and serve the model in C++

First, we will need to download the PyTorch C++ API dependencies.

### Dependencies

In [None]:
%%bash
# -p and -f keep the cell re-runnable: no error if deps exists or libtorch is absent.
mkdir -p deps
cd deps
wget https://download.pytorch.org/libtorch/cu113/libtorch-cxx11-abi-shared-with-deps-1.11.0%2Bcu113.zip
rm -rf libtorch
unzip libtorch-cxx11-abi-shared-with-deps-1.11.0+cu113.zip

In [None]:
%%bash
cd deps
wget https://github.com/pytorch/TensorRT/releases/download/v1.1.0/libtorchtrt-v1.1.0-cudnn8.2-tensorrt8.2-cuda11.3-libtorch1.11.0.tar.gz
tar -xvzf libtorchtrt-v1.1.0-cudnn8.2-tensorrt8.2-cuda11.3-libtorch1.11.0.tar.gz
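
The two archives need to agree on the LibTorch and CUDA versions, and both file names encode that information. The sketch below parses the names used in the cells above and compares them. The regular expressions are assumptions based only on these two file names, not on any official naming scheme, and the function names are hypothetical.

```python
import re

# Archive names copied from the download cells above.
LIBTORCH_ZIP = "libtorch-cxx11-abi-shared-with-deps-1.11.0+cu113.zip"
TORCHTRT_TAR = "libtorchtrt-v1.1.0-cudnn8.2-tensorrt8.2-cuda11.3-libtorch1.11.0.tar.gz"


def parse_libtorch(name):
    """Return (libtorch_version, cuda_version) from a LibTorch archive name."""
    match = re.search(r"-(\d+\.\d+\.\d+)\+cu(\d+)\.zip$", name)
    if match is None:
        raise ValueError(f"unrecognized LibTorch archive name: {name}")
    version, cuda_digits = match.groups()
    # "cu113" is read as CUDA 11.3: the last digit is taken as the minor version.
    return version, f"{cuda_digits[:-1]}.{cuda_digits[-1]}"


def parse_torchtrt(name):
    """Return the component versions embedded in a libtorchtrt archive name."""
    fields = {}
    for key in ("cudnn", "tensorrt", "cuda", "libtorch"):
        match = re.search(rf"-{key}(\d+(?:\.\d+)*)", name)
        if match is None:
            raise ValueError(f"no {key} version found in: {name}")
        fields[key] = match.group(1)
    return fields


libtorch_version, cuda_version = parse_libtorch(LIBTORCH_ZIP)
trt_fields = parse_torchtrt(TORCHTRT_TAR)

# The Torch-TensorRT build should target the same LibTorch and CUDA versions.
compatible = (
    trt_fields["libtorch"] == libtorch_version
    and trt_fields["cuda"] == cuda_version
)
print(f"LibTorch {libtorch_version}, CUDA {cuda_version}, compatible: {compatible}")
```

If you switch to a different container or LibTorch release, a check like this can catch a mismatched download before the C++ build fails.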

### Prepare C++ Code for FP32 Inference

The following is a minimal C++ harness for loading the FP32 model and running inference with it:
- A Makefile
- The C++ code that loads the model and runs inference on a dummy input

In [None]:
%%file Makefile
CXX=g++
DEP_DIR=$(PWD)/deps
INCLUDE_DIRS=-I$(DEP_DIR)/libtorch/include -I$(DEP_DIR)/torch_tensorrt/include
LIB_DIRS=-L$(DEP_DIR)/torch_tensorrt/lib -L$(DEP_DIR)/libtorch/lib 
LIBS=-Wl,--no-as-needed -ltorchtrt_runtime -Wl,--as-needed -ltorch -ltorch_cuda -ltorch_cpu -ltorch_global_deps -lbackend_with_compiler -lc10 -lc10_cuda
SRCS=main.cpp

TARGET=torchtrt_runtime_example

$(TARGET): $(SRCS)
	$(CXX) $(SRCS) $(INCLUDE_DIRS) $(LIB_DIRS) $(LIBS) -o $(TARGET)

clean:
	$(RM) $(TARGET)
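
To make the Makefile easier to read, the sketch below assembles the `g++` command that its rule expands to, as a Python list. Nothing is compiled; the command is only printed. The function name `build_command` and the `/workspace` project directory are hypothetical placeholders.

```python
from pathlib import PurePosixPath
import shlex


def build_command(project_dir, source="main.cpp", target="torchtrt_runtime_example"):
    """Return the g++ argument list that the Makefile rule expands to."""
    deps = PurePosixPath(project_dir) / "deps"
    include_dirs = [
        f"-I{deps / 'libtorch' / 'include'}",
        f"-I{deps / 'torch_tensorrt' / 'include'}",
    ]
    lib_dirs = [
        f"-L{deps / 'torch_tensorrt' / 'lib'}",
        f"-L{deps / 'libtorch' / 'lib'}",
    ]
    libs = [
        # Keep the runtime linked even though main.cpp uses none of its symbols.
        "-Wl,--no-as-needed", "-ltorchtrt_runtime", "-Wl,--as-needed",
        "-ltorch", "-ltorch_cuda", "-ltorch_cpu", "-ltorch_global_deps",
        "-lbackend_with_compiler", "-lc10", "-lc10_cuda",
    ]
    return ["g++", source, *include_dirs, *lib_dirs, *libs, "-o", target]


command = build_command("/workspace")  # hypothetical project directory
print(shlex.join(command))
```

The ordering of the `-Wl` flags matters: `--no-as-needed` applies to the libraries that follow it, so it has to come before `-ltorchtrt_runtime`. Presumably this is needed because `main.cpp` never calls into the runtime library directly, yet the saved module depends on it being loaded.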

In [None]:
%%file main.cpp
#include <iostream>
#include <fstream>
#include <memory>
#include <sstream>
#include <vector>
#include "torch/script.h"

int main(int argc, const char* argv[]) {
  if (argc < 2) {
    // Print the actual program name rather than a hard-coded one.
    std::cerr << "usage: " << argv[0] << " <path-to-pre-built-trt-ts-module>\n";
    return -1;
  }

  std::string trt_ts_module_path = argv[1];

  torch::jit::Module trt_ts_mod;
  try {
    // Deserialize the ScriptModule from a file using torch::jit::load().
    trt_ts_mod = torch::jit::load(trt_ts_module_path);
  } catch (const c10::Error& e) {
    // Report both the path and the underlying error message.
    std::cerr << "error loading the model from: " << trt_ts_module_path << "\n" << e.what() << std::endl;
    return -1;
  }

  std::cout << "Running TRT engine" << std::endl;
  std::vector<torch::jit::IValue> trt_inputs_ivalues;
  trt_inputs_ivalues.push_back(at::randint(-5, 5, {128, 3, 224, 224}, {at::kCUDA}).to(torch::kFloat32));
  torch::jit::IValue trt_results_ivalues = trt_ts_mod.forward(trt_inputs_ivalues);
  std::cout << "==================TRT outputs================" << std::endl;
  std::cout << trt_results_ivalues << std::endl;
  std::cout << "=============================================" << std::endl;
  std::cout << "TRT engine execution completed. " << std::endl;
}


We are now ready to compile.

In [None]:
!make clean && make

And finally, run the inference in C++.

In [None]:
!./torchtrt_runtime_example $PWD/trt_model_fp32.ts
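
Outside a notebook, the same binary can be launched from a Python script. The sketch below is one way to do that with the standard `subprocess` module. The helper name `run_and_capture` is hypothetical, and a stand-in command is used so the snippet runs without the compiled example.

```python
import subprocess
import sys


def run_and_capture(command):
    """Run a command and return (exit_status, stdout, stderr) as text."""
    completed = subprocess.run(command, capture_output=True, text=True, check=False)
    return completed.returncode, completed.stdout, completed.stderr


# Stand-in for ["./torchtrt_runtime_example", "trt_model_fp32.ts"], so that
# this sketch also runs where the C++ example has not been built.
status, out, err = run_and_capture(
    [sys.executable, "-c", "print('Running TRT engine')"]
)
print(status, out.strip())

# A failing stand-in, mimicking the `return -1` paths in main.cpp.
fail_status, _, _ = run_and_capture(
    [sys.executable, "-c", "import sys; sys.exit(-1)"]
)
print("failure status is nonzero:", fail_status != 0)
```

Because `main.cpp` returns `-1` on a usage or load error, a nonzero exit status is the signal to check; on Linux a `-1` return value is reported as status 255.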

### Prepare C++ Code for FP16 Inference

In a similar fashion, we can carry out inference with the FP16 model. The only functional change to `main.cpp` is that the dummy input is cast to `torch::kFloat16`.

In [None]:
%%file main.cpp
#include <iostream>
#include <fstream>
#include <memory>
#include <sstream>
#include <vector>
#include "torch/script.h"

int main(int argc, const char* argv[]) {
  if (argc < 2) {
    // Print the actual program name rather than a hard-coded one.
    std::cerr << "usage: " << argv[0] << " <path-to-pre-built-trt-ts-module>\n";
    return -1;
  }

  std::string trt_ts_module_path = argv[1];

  torch::jit::Module trt_ts_mod;
  try {
    // Deserialize the ScriptModule from a file using torch::jit::load().
    trt_ts_mod = torch::jit::load(trt_ts_module_path);
  } catch (const c10::Error& e) {
    // Report both the path and the underlying error message.
    std::cerr << "error loading the model from: " << trt_ts_module_path << "\n" << e.what() << std::endl;
    return -1;
  }

  std::cout << "Running TRT engine" << std::endl;
  std::vector<torch::jit::IValue> trt_inputs_ivalues;
  trt_inputs_ivalues.push_back(at::randint(-5, 5, {128, 3, 224, 224}, {at::kCUDA}).to(torch::kFloat16));
  torch::jit::IValue trt_results_ivalues = trt_ts_mod.forward(trt_inputs_ivalues);
  std::cout << "==================TRT outputs================" << std::endl;
  std::cout << trt_results_ivalues << std::endl;
  std::cout << "=============================================" << std::endl;
  std::cout << "TRT engine execution completed. " << std::endl;
}


In [None]:
!make clean && make

In [None]:
!./torchtrt_runtime_example $PWD/trt_model_fp16.ts

## Conclusion

In this example, we walked through a bare-bones workflow: optimizing a ResNet-50 model with the Torch-TensorRT Python API and then running inference with the optimized model in C++. Next, try this on your own models.