## Lab - TensorRT + Profiling - TensorRT Python API
## E6692 Spring 2022

Using this notebook and the Python script **pytorch_inference.py**, you will load the trained weights generated in **TrainPytorchMNIST.ipynb** into a TensorRT model. TensorRT is a framework and C++ library developed by NVIDIA for model deployment. It offers high-performance inference optimization on NVIDIA hardware (such as the Jetson Nano or a T4 on GCP), which is why we're interested in it.

To get a general feel for TensorRT, I recommend starting [here](https://developer.nvidia.com/tensorrt). Then it would be helpful to read the [How TensorRT Works](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#work) section of the TRT docs. Finally, the [TensorRT's Capabilities](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#fit) and [The TensorRT Python API](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#python_topics) sections will guide you as you work on this lab.

In this first part you will define a model structure using the TensorRT Python API. Then you will load the trained PyTorch weights into your TensorRT model to perform inference optimizations. The architecture of the TensorRT model needs to be identical to the PyTorch model defined in **pyTorchCNN.py**; otherwise the weights will not transfer successfully. You can print the model summary of the PyTorch MnistClassifier to use as a blueprint when defining the TensorRT model with the Python API.
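
One way to sanity-check that a layer-by-layer TensorRT definition matches the PyTorch blueprint is to track the spatial size of the feature map through each layer. The sketch below uses the standard convolution/pooling output-size formula. The layer sizes and channel count are hypothetical (a LeNet-style stack), not the actual MnistClassifier; substitute the values printed by the PyTorch model summary.

```python
def conv_out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


# Hypothetical layer stack, NOT the real MnistClassifier:
# replace these values with the ones from the PyTorch model summary.
size = 28                                        # MNIST images are 28x28
size = conv_out_size(size, kernel=5)             # 5x5 convolution
size = conv_out_size(size, kernel=2, stride=2)   # 2x2 max pooling
size = conv_out_size(size, kernel=5)             # 5x5 convolution
size = conv_out_size(size, kernel=2, stride=2)   # 2x2 max pooling

channels = 50                                    # assumed channel count of the last conv layer
flattened = channels * size * size               # input features of the first dense layer
print(f"final feature map: {size}x{size}, flattened features: {flattened}")
```

If the flattened size computed this way disagrees with the input size of the first fully connected layer in the PyTorch summary, the two architectures do not match and the weights will not load correctly.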

You will need to review the [documentation of the Python TensorRT API](https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/index.html). 

In [1]:
import tensorrt as trt # import modules

from utils.utils import load_serialized_engine
from serializeMNIST import serializeMNIST, trt_prediction

trained_pytorch_weights = './weights/trained_pytorch_weights' # define pytorch weights path, change if necessary
trt_mnist_engine = './engines/trt_model.engine' # define the tensorrt serialized engine path

TEST_CASES = 100 # test case iterations

TRT_LOGGER = trt.Logger(trt.Logger.WARNING) # define trt logger object with warnings enabled
runtime = trt.Runtime(TRT_LOGGER) # define the runtime object used to deserialize engines

%load_ext autoreload
%autoreload 2


**TODO:** Use the Python API to define the MNIST classifier structure with the function **generate_serialized_trt_engine()** in **tensorRTCNN.py**. Use the function comments as a guide. The cell below can be used to print the model summary of the PyTorch MnistClassifier.

In [None]:
# TODO: it may be helpful to print the MnistClassifier summary here as a reference to the model structure



**TODO:** After you have finished implementing the TRT MNIST classifier in **generate_serialized_trt_engine()**, use **serializeMNIST()** to (i) load the trained PyTorch weights, (ii) generate the serialized engine, and (iii) save the serialized engine to a file. 

In [None]:
# TODO: call serializeMNIST() to generate the serialized MNIST classifier TRT engine


As you've learned, the Jetson Nano has very limited memory resources. It's therefore a good idea to restart the kernel with `Kernel` --> `Restart Kernel...` to free as much memory as possible for the next steps. You might not need to, but if you run into out-of-memory (OOM) errors, try restarting here. Once you have generated the serialized engine file, you do not need to repeat the weight loading and inference optimization process.

**TODO:** Load the serialized engine in the cell below with **load_serialized_engine()**.

In [None]:
# TODO: Use load_serialized_engine() to load the serialized engine into memory
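
For orientation, a serialized engine is just a binary file on disk, so loading it amounts to reading its bytes. The sketch below is a stand-in under that assumption: `read_engine_bytes` is a hypothetical name, and the actual signature and behavior of **load_serialized_engine()** in **utils/utils.py** may differ, so check its docstring before using it.

```python
from pathlib import Path


def read_engine_bytes(engine_path):
    """Return the raw bytes of a serialized engine file.

    Hypothetical stand-in for the lab's load_serialized_engine() helper.
    """
    path = Path(engine_path)
    if not path.is_file():
        raise FileNotFoundError(f"no serialized engine found at {path}")
    return path.read_bytes()


# Usage (assumes the engine file was already written by serializeMNIST()):
# serialized_engine = read_engine_bytes('./engines/trt_model.engine')
```

Failing early with a clear message when the file is missing is easier to debug than a deserialization error later on.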


**TODO:** Deserialize the TRT engine with [**runtime.deserialize_cuda_engine()**](https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/infer/Core/Runtime.html#tensorrt.Runtime.deserialize_cuda_engine).

In [None]:
# TODO: generate an ICudaEngine with runtime.deserialize_cuda_engine()
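
A minimal sketch of the deserialization step, assuming `runtime` is the `trt.Runtime` created in the first cell and `serialized_engine` holds the bytes loaded above. The wrapper name `deserialize_engine` is hypothetical; the only TensorRT call it relies on is `deserialize_cuda_engine()`, which returns `None` when deserialization fails.

```python
def deserialize_engine(runtime, serialized_engine):
    """Deserialize engine bytes with a trt.Runtime-like object, failing loudly."""
    engine = runtime.deserialize_cuda_engine(serialized_engine)
    if engine is None:
        # TensorRT reports the reason through its logger rather than raising.
        raise RuntimeError("engine deserialization failed; check the TensorRT log output")
    return engine


# Usage in this notebook (requires TensorRT and a valid engine file):
# engine = deserialize_engine(runtime, serialized_engine)
```

A common cause of failure is loading an engine that was built with a different TensorRT version or on a different GPU than the one running inference.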


**TODO:** Calculate the accuracy of the TensorRT MNIST Classifier on `TEST_CASES` validation images. You can use **trt_prediction()** in **serializeMNIST.py** to get the ground truth and prediction for randomly chosen validation images, or you can use **allocate_buffers()** and **do_inference()** to allocate input/output memory buffers and make predictions without choosing validation images randomly. Either approach is fine here.

In [None]:
# TODO: calculate accuracy of TRT MNIST classifier on the validation set
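
Whichever prediction route you choose, the accuracy itself is the fraction of test cases where the predicted digit equals the ground truth. The sketch below shows only that bookkeeping. The `preds` and `truth` lists are made-up placeholder values; in the lab they would be collected from **trt_prediction()** or **do_inference()**, whose exact signatures are not assumed here.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that equal the ground-truth labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one test case")
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Placeholder values for illustration only; collect the real ones
# by running the TensorRT engine on TEST_CASES validation images.
preds = [7, 2, 1, 0, 4]
truth = [7, 2, 1, 0, 9]
print(f"accuracy: {accuracy(preds, truth):.2%}")
```

With randomly sampled images the result varies slightly between runs, so a larger `TEST_CASES` gives a more stable estimate.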


## Discussion

### In what situation would converting a model to TensorRT be useful?

**TODO:** Your answer here.

### How does the PyTorch model validation accuracy compare to the TensorRT model validation accuracy?

**TODO:** Your answer here.

### Briefly explain how the following inference optimization techniques can increase the throughput of a model.

#### Reduced Floating Point Precision

**TODO:** Your answer here.

#### Layer Fusion

**TODO:** Your answer here.

#### Dynamic Tensor Memory

**TODO:** Your answer here.
