<a href="https://colab.research.google.com/github/vilsonrodrigues/MLOps/blob/main/notebooks/tensorrt_build_engines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

NVIDIA TensorRT is an SDK for optimizing trained deep learning models to enable high-performance inference. TensorRT contains a deep learning inference optimizer for trained deep learning models, and a runtime for execution.

**Network Definition**:
A representation of a model in TensorRT. A network definition is a graph of tensors and operators.

**Builder**:
TensorRT’s model optimizer. The builder takes as input a network definition, performs device-independent and device-specific optimizations, and creates an engine.

**Engine**:
A representation of a model that has been optimized by the TensorRT builder.

**Logger**: Associated with the builder and engine to capture errors, warnings, and other information during the build and inference phases.

**ONNX parser**: Takes a trained PyTorch model that has been converted to the ONNX format as input and populates a network object in TensorRT.

**Plan**:
An optimized inference engine in a serialized format. To initialize the inference engine, the application will first deserialize the model from the plan file. A typical application will build an engine once, and then serialize it as a plan file for later use.

**Runtime**:
The component of TensorRT that performs inference on a TensorRT engine. The runtime API supports synchronous and asynchronous execution, profiling, and enumeration and querying of the bindings for an engine's inputs and outputs.




In [20]:
!pip install -U torch


In [None]:
!pip install tensorrt onnx

## Export Model to ONNX

In [None]:
!pip install -U "timm>=0.9.0" torchvision

In [11]:
import timm

model = timm.create_model("resnet50.a1_in1k", pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)


In [13]:
import torch

channels = 3
width = 224
height = 224
input_model = [channels, height, width]
max_batch_size = 4

In [15]:
shape_input_model = [max_batch_size] + input_model

In [16]:
shape_input_model

[4, 3, 224, 224]

In [52]:
tensor_input = torch.randn(shape_input_model)

In [13]:
tensor_input.shape

torch.Size([4, 3, 224, 224])

In [66]:
# https://pytorch.org/docs/stable/onnx.html
# PyTorch has two ways to export a model to ONNX:
# dynamo-based and TorchScript-based.
# The dynamo exporter preserves the dynamic nature of the model instead
# of using traditional static tracing techniques,
# but in PyTorch 2.3.0 it is still in beta.
# When the TensorRT ONNX parser reads a dynamo-exported model, it raises:
# the input ~input_name~ is duplicate
# so the TorchScript-based exporter is used here.
if tensor_input.size(0) > 1:
    dynamic = {
        "inputs": {0: "batch", 2: "height", 3: "width"},
        "outputs": {0: "batch", 1: "logits"},
    }
else:
    dynamic = None

opset_version = 18
f = "model.onnx"

torch.onnx.export(
    model,
    tensor_input,
    f,
    verbose=True,
    input_names=["inputs"],
    output_names=["outputs"],
    opset_version=opset_version,
    do_constant_folding=True,  # some setups with torch>=1.12 may require do_constant_folding=False
    dynamic_axes=dynamic,
)



In [67]:
import onnx
model_onnx = onnx.load(f)
onnx.checker.check_model(model_onnx)

In [69]:
print(model_onnx.graph.input)

[name: "inputs"
type {
  tensor_type {
    elem_type: 1
    shape {
      dim {
        dim_param: "batch"
      }
      dim {
        dim_value: 3
      }
      dim {
        dim_param: "height"
      }
      dim {
        dim_param: "width"
      }
    }
  }
}
]
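
The printed graph input mixes fixed axes (`dim_value`) with symbolic ones (`dim_param`). A minimal sketch of turning such a dimension list into a plain shape, with `-1` standing for a symbolic axis. `onnx_dims_to_shape` is a hypothetical helper, and the `SimpleNamespace` objects only stand in for the ONNX dimension messages so the example runs without the model file:

```python
from types import SimpleNamespace
from typing import Iterable, List


def onnx_dims_to_shape(dims: Iterable) -> List[int]:
    """Map ONNX dimensions to a shape list, using -1 for symbolic (dynamic) axes."""
    shape = []
    for dim in dims:
        # A symbolic axis carries a non-empty dim_param such as "batch".
        if getattr(dim, "dim_param", ""):
            shape.append(-1)
        else:
            shape.append(int(dim.dim_value))
    return shape


# Stand-in objects mirroring the printed graph input above.
example_dims = [
    SimpleNamespace(dim_param="batch", dim_value=0),
    SimpleNamespace(dim_param="", dim_value=3),
    SimpleNamespace(dim_param="height", dim_value=0),
    SimpleNamespace(dim_param="width", dim_value=0),
]
print(onnx_dims_to_shape(example_dims))  # [-1, 3, -1, -1]
```

With the real model, the same helper should accept `model_onnx.graph.input[0].type.tensor_type.shape.dim`.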


## Build Engine

In [1]:
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)

In [2]:
builder = trt.Builder(TRT_LOGGER)
config = builder.create_builder_config()

In [None]:
# Set cache
cache = config.create_timing_cache(b"")
config.set_timing_cache(cache, ignore_mismatch=False)
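
The cell above starts from an empty cache (`b""`), so its timings are lost when the session ends. One possible way to reuse them across builds is to keep the serialized cache in a file. This is a sketch under assumptions: the helper names and the cache path are hypothetical, and `config` is expected to be the builder config created earlier.

```python
import os


def read_timing_cache_bytes(path: str) -> bytes:
    """Return the saved cache contents, or b"" when no cache file exists yet."""
    if os.path.exists(path):
        with open(path, "rb") as cache_file:
            return cache_file.read()
    return b""


def attach_timing_cache(config, path: str):
    """Create a timing cache from disk (or an empty one) and attach it to the config."""
    cache = config.create_timing_cache(read_timing_cache_bytes(path))
    config.set_timing_cache(cache, ignore_mismatch=False)
    return cache


def save_timing_cache(config, path: str) -> None:
    """Serialize the config's timing cache after a build so later builds can reuse it."""
    cache = config.get_timing_cache()
    with open(path, "wb") as cache_file:
        cache_file.write(bytes(cache.serialize()))
```

A typical flow would call `attach_timing_cache(config, "timing.cache")` before the build and `save_timing_cache(config, "timing.cache")` after it.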

In [None]:
# https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#build_engine_python
# The maximum workspace size defines a memory limit for TensorRT layer implementations.
# From documentation:
# One important property is the maximum workspace size. Layer implementations often require a
# temporary workspace, and this parameter limits the maximum size that any layer in the network
# can use. If insufficient workspace is provided, it is possible that TensorRT will not be able
# to find an implementation for a layer. By default, the workspace is set to the total global
# memory size of the given device; restrict it when necessary, for example, when multiple engines
# are to be built on a single device.

# max_workspace = (1 << 30)
# config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, max_workspace)
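
The commented-out limit uses `1 << 30`, which is 1 GiB expressed in bytes. A small sketch that makes the unit explicit; `gib_to_bytes` and `limit_workspace` are hypothetical helper names, and TensorRT is imported lazily so the conversion can be tried without it:

```python
def gib_to_bytes(gib: float) -> int:
    """Convert a size in GiB to bytes (1 GiB = 1 << 30 bytes)."""
    return int(gib * (1 << 30))


def limit_workspace(config, gib: float) -> int:
    """Cap the WORKSPACE memory pool of a TensorRT builder config."""
    import tensorrt as trt  # lazy import: only needed when a config is passed in

    limit = gib_to_bytes(gib)
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, limit)
    return limit


print(gib_to_bytes(1))  # 1073741824
```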

In [3]:
# https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#version-compat
# https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#explicit-implicit-batch
# In implicit batch mode, every tensor has an implicit batch dimension and all other dimensions must have constant length.
# In explicit batch mode, all dimensions are explicit and can be dynamic, that is their length can change at execution time.
# Many new features, such as dynamic shapes and loops, are available only in this mode. It is also required by the ONNX parser.
# In TensorRT 10, implicit batch is deprecated; explicit batch is the default and cannot be disabled.
flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
network = builder.create_network(flag)
parser = trt.OnnxParser(network, TRT_LOGGER)

In [4]:
path_onnx_model = "./model.onnx"

In [5]:
with open(path_onnx_model, "rb") as f:
    if not parser.parse(f.read()):
        print(f"ERROR: Failed to parse the ONNX file {path_onnx_model}")
        for error in range(parser.num_errors):
            print(parser.get_error(error))
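
The cell above prints parser errors but then lets the notebook continue with a possibly incomplete network. A hedged alternative is to collect the errors and stop. `collect_parser_errors` and `parse_or_raise` are hypothetical helpers that rely only on `parse`, `num_errors` and `get_error`, the same parser members used above:

```python
from typing import List


def collect_parser_errors(parser) -> List[str]:
    """Return every error recorded by the parser as a string."""
    return [str(parser.get_error(i)) for i in range(parser.num_errors)]


def parse_or_raise(parser, onnx_path: str) -> None:
    """Parse an ONNX file and raise instead of continuing with a partial network."""
    with open(onnx_path, "rb") as onnx_file:
        ok = parser.parse(onnx_file.read())
    if not ok:
        errors = "\n".join(collect_parser_errors(parser))
        raise RuntimeError(f"Failed to parse {onnx_path}:\n{errors}")
```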

In [6]:
inputs = [network.get_input(i) for i in range(network.num_inputs)]
outputs = [network.get_output(i) for i in range(network.num_outputs)]

In [7]:
inputs

[<tensorrt_bindings.tensorrt.ITensor at 0x7cabaef44f70>]

In [9]:
outputs

[<tensorrt_bindings.tensorrt.ITensor at 0x7cabaef1b670>]

In [11]:
for input in inputs:
    print(f"Model {input.name} shape: {input.shape} {input.dtype}")
for output in outputs:
    print(f"Model {output.name} shape: {output.shape} {output.dtype}")
# https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#work_dynamic_shapes
# -1 indicates that the dimension is a runtime dimension.
# Its real size does not have to be given to TensorRT in the build phase,
# only at runtime.

Model inputs shape: (-1, 3, -1, -1) DataType.FLOAT
Model outputs shape: (-1, 1000) DataType.FLOAT
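
To see at a glance which axes still need a size at runtime, the `-1` entries can be located programmatically. A minimal sketch; `dynamic_axes_of` is a hypothetical helper that works on plain tuples, so a TensorRT shape would be passed as `tuple(input.shape)`:

```python
from typing import List, Sequence


def dynamic_axes_of(shape: Sequence[int]) -> List[int]:
    """Return the indices of axes reported as -1, i.e. resolved only at runtime."""
    return [axis for axis, size in enumerate(shape) if size == -1]


print(dynamic_axes_of((-1, 3, -1, -1)))  # [0, 2, 3]
print(dynamic_axes_of((-1, 1000)))       # [0]
```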


In [14]:
max_batch_size

4

In [17]:
shape_input_model

[4, 3, 224, 224]

In [18]:
shape_input_model[-3:]

[3, 224, 224]

In [19]:
if max_batch_size > 1:
    # https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#opt_profiles
    # With explicit batch and dynamic shapes, set the min, opt and max shapes.
    # This helps TensorRT search for better optimizations.
    profile = builder.create_optimization_profile()
    min_shape = [1] + shape_input_model[-3:]
    opt_shape = [max_batch_size // 2] + shape_input_model[-3:]
    max_shape = shape_input_model
    for input in inputs:
        profile.set_shape(input.name, min_shape, opt_shape, max_shape)
    config.add_optimization_profile(profile)
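
The three shapes passed to `set_shape` must satisfy min <= opt <= max on every axis. A minimal sketch that derives them from the maximum batch size and checks that ordering before they reach TensorRT; `profile_shapes` is a hypothetical helper, and halving the batch for the opt shape simply mirrors the choice made in the cell above:

```python
from typing import List, Sequence, Tuple


def profile_shapes(
    max_batch_size: int, sample_shape: Sequence[int]
) -> Tuple[List[int], List[int], List[int]]:
    """Return (min, opt, max) shapes for one input with a dynamic batch axis."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    sample = list(sample_shape)
    min_shape = [1] + sample
    opt_shape = [max(1, max_batch_size // 2)] + sample
    max_shape = [max_batch_size] + sample
    # Every axis must be ordered min <= opt <= max.
    assert all(lo <= mid <= hi for lo, mid, hi in zip(min_shape, opt_shape, max_shape))
    return min_shape, opt_shape, max_shape


print(profile_shapes(4, [3, 224, 224]))
# ([1, 3, 224, 224], [2, 3, 224, 224], [4, 3, 224, 224])
```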

In [23]:
config.get_calibration_profile()

In [21]:
# Check whether fast FP16 (half precision) is available
builder.platform_has_fast_fp16

True

In [30]:
trt.BuilderFlag.BF16

<BuilderFlag.BF16: 17>

In [31]:
# https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#reduced-precision
# https://docs.nvidia.com/deeplearning/tensorrt/support-matrix/index.html#hardware-precision-matrix
# Reduced precision. There are three options: FP16, INT8 and TF32 (Tensor Cores)
# Note that TensorRT will still choose a higher-precision kernel if it
# results in overall lower runtime, or if no low-precision implementation exists.
half = True
int8 = False
if half:
    config.set_flag(trt.BuilderFlag.FP16)
elif int8:
    config.set_flag(trt.BuilderFlag.INT8)
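
The precision cell sets a flag without consulting the capability check shown earlier (`platform_has_fast_fp16`). One way to combine the two is sketched below. `choose_precision` is a hypothetical helper; it only reads the builder's `platform_has_fast_fp16` and `platform_has_fast_int8` attributes and returns a flag name rather than touching the config:

```python
from typing import Optional


def choose_precision(
    builder, want_half: bool = True, want_int8: bool = False
) -> Optional[str]:
    """Return "FP16", "INT8" or None depending on the request and platform support."""
    if want_half and builder.platform_has_fast_fp16:
        return "FP16"
    if want_int8 and builder.platform_has_fast_int8:
        return "INT8"
    return None  # no reduced-precision flag: keep the default build
```

The result could then be applied with `config.set_flag(getattr(trt.BuilderFlag, name))` when `name` is not `None`.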

In [36]:
# https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#weightless-build
# https://github.com/NVIDIA/TensorRT/tree/main/samples/python/sample_weight_stripping
# Helps to create and optimize an engine without storing unnecessary weights.
# At inference time, load the engine and refit it with the weights from the ONNX model.
# The plan file is smaller and the weights are not duplicated.
strip_weights = False
if strip_weights:
    config.set_flag(trt.BuilderFlag.STRIP_PLAN)
# To remove the STRIP_PLAN flag from the config:
# config.flags &= ~(1 << int(trt.BuilderFlag.STRIP_PLAN))
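
The commented line clears a flag by masking one bit out of `config.flags`. The same bit arithmetic is shown here on plain integers, as a sketch with hypothetical helper names. The value 17 is the one printed for `BuilderFlag.BF16` earlier and is used purely as an example bit position:

```python
def set_flag_bit(flags: int, flag_value: int) -> int:
    """Return flags with the bit for flag_value switched on."""
    return flags | (1 << flag_value)


def clear_flag_bit(flags: int, flag_value: int) -> int:
    """Return flags with the bit for flag_value switched off."""
    return flags & ~(1 << flag_value)


def has_flag_bit(flags: int, flag_value: int) -> bool:
    """Report whether the bit for flag_value is set."""
    return bool(flags & (1 << flag_value))


flags = set_flag_bit(0, 17)
print(flags, has_flag_bit(flags, 17))  # 131072 True
flags = clear_flag_bit(flags, 17)
print(flags, has_flag_bit(flags, 17))  # 0 False
```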

In [34]:
# Build engine
engine_bytes = builder.build_serialized_network(network, config)

In [35]:
engine_path = "./model.engine"
with open(engine_path, "wb") as f:
    f.write(engine_bytes)
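
`build_serialized_network` returns `None` when the build fails, in which case the write above raises a `TypeError` that hides the real cause. A minimal sketch of a guarded write; `save_engine` is a hypothetical helper:

```python
def save_engine(engine_bytes, engine_path: str) -> int:
    """Write a serialized engine to disk and return the number of bytes written."""
    if engine_bytes is None:
        raise RuntimeError("Engine build failed; check the TensorRT log for the cause.")
    with open(engine_path, "wb") as engine_file:
        return engine_file.write(engine_bytes)
```

It would replace the plain `open`/`write` pair, for example `save_engine(engine_bytes, engine_path)`.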

## Execute Engine

In [48]:
# Colab bug...
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [49]:
!pip install "cuda-python>=12.2.0"



In [37]:
def load_stripped_engine_and_refit(
    engine_path: str,
    onnx_model_path: str,
) -> trt.ICudaEngine:
    runtime = trt.Runtime(TRT_LOGGER)
    with open(engine_path, "rb") as engine_file:
        engine = runtime.deserialize_cuda_engine(engine_file.read())
        refitter = trt.Refitter(engine, TRT_LOGGER)
        parser_refitter = trt.OnnxParserRefitter(refitter, TRT_LOGGER)
        assert parser_refitter.refit_from_file(onnx_model_path)
        assert refitter.refit_cuda_engine()
        return engine

def load_normal_engine(engine_path: str) -> trt.ICudaEngine:
    runtime = trt.Runtime(TRT_LOGGER)
    with open(engine_path, "rb") as plan:
        engine = runtime.deserialize_cuda_engine(plan.read())
        return engine

In [38]:
if strip_weights:
    engine = load_stripped_engine_and_refit(engine_path, path_onnx_model)
else:
    engine = load_normal_engine(engine_path)

In [39]:
engine

<tensorrt_bindings.tensorrt.ICudaEngine at 0x7caa3f5d32f0>
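
The object's repr says little about the engine. A hedged sketch for inspecting its inputs and outputs with the name-based tensor API (`num_io_tensors`, `get_tensor_name`, `get_tensor_mode`, `get_tensor_shape`, `get_tensor_dtype`), assuming a TensorRT version that provides it. `describe_io_tensors` is a hypothetical helper:

```python
from typing import Dict, List


def describe_io_tensors(engine) -> List[Dict[str, object]]:
    """List name, mode, shape and dtype of every input/output tensor of an engine."""
    described = []
    for index in range(engine.num_io_tensors):
        name = engine.get_tensor_name(index)
        described.append(
            {
                "name": name,
                "mode": str(engine.get_tensor_mode(name)),
                "shape": tuple(engine.get_tensor_shape(name)),
                "dtype": str(engine.get_tensor_dtype(name)),
            }
        )
    return described
```

Calling `describe_io_tensors(engine)` on the engine loaded above should report one input and one output, with `-1` in the axes left dynamic by the optimization profile.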