# Comparing trtexec and the TensorRT Python API

| Criterion            | `trtexec`               | TensorRT Python API                                                      |
| -------------------- | ----------------------- | ------------------------------------------------------------------------ |
| Ease of use          | ✅ very easy             | ❌ requires learning the API                                              |
| Quick experiments    | ✅                       | ❌                                                                        |
| Benchmark            | ✅ very good             | ✅ (more involved)                                                        |
| Production inference | ❌ not suitable          | ✅ production standard                                                    |
| Buffer control       | ❌ none                  | ✅ full control                                                           |
| Dynamic batching     | ❌                       | ✅                                                                        |
| Timing debug         | ✅ very strong           | ✅ but you must log it yourself                                           |
| Installation         | Ships with TensorRT     | Requires the TensorRT Python package plus a CUDA binding such as pycuda  |


# Converting a YOLOv5 object detection model from ONNX to TensorRT: build the engine, optimize the batch size, and benchmark inference speed

- This walkthrough touches on:
    - Dynamic shapes (dynamic batch size)
    - Workspace size (GPU memory the engine builder is allowed to allocate)
    - FP16 optimization
    - TensorRT calibration for INT8 (introduced here, but full calibration code is not required yet)

# Importing libraries from other packages

In [31]:
import sys
import os

# Add the project root (/app) to sys.path
project_root = os.path.abspath(os.path.join(os.getcwd(), "../.."))
sys.path.insert(0, project_root)

# Then import the project modules
from util.util import *
from util.config import *
import trt_infer

# Configuration

In [25]:
onnx_path = "yolov5s.onnx"
engine_path = "yolov5s.engine"
img_folder_path = './data/images'
output_folder_path = 'output_results_trt'

## 1️⃣ Preparing the model

In [15]:
!pip install numpy==1.24.4
!pip install onnx==1.14.1 --upgrade
!apt update
!apt install -y libgl1

import cv2
print("OpenCV version:", cv2.__version__)

Hit:1 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:2 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:3 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
91 packages can be upgraded. Run 'apt list --upgradable' to see them.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libgl1 is already the newest version (1.4.0-1).
0 upgraded, 0 newly installed, 0 to remove and 91 not upgraded.
OpenCV version: 4.11.0


In [None]:
!git clone https://github.com/ultralytics/yolov5
# %cd yolov5

!pip install -r requirements.txt

# Export the model to ONNX
!python export.py --weights yolov5s.pt --include onnx

Creating new Ultralytics Settings v0.0.6 file ✅ 
View Ultralytics Settings with 'yolo settings' or at '/tmp/Ultralytics/settings.json'
Update Settings with 'yolo settings key=value', i.e. 'yolo settings runs_dir=path/to/dir'. For help see https://docs.ultralytics.com/quickstart/#ultralytics-settings.
  import pkg_resources as pkg
[34m[1mexport: [0mdata=data/coco128.yaml, weights=['yolov5s.pt'], imgsz=[640, 640], batch_size=1, device=cpu, half=False, inplace=False, keras=False, optimize=False, int8=False, per_tensor=False, dynamic=False, cache=, simplify=False, mlmodel=False, opset=17, verbose=False, workspace=4, nms=False, agnostic_nms=False, topk_per_class=100, topk_all=100, iou_thres=0.45, conf_thres=0.25, include=['onnx']
fatal: detected dubious ownership in repository at '/app/tensorRt/yolo/yolov5'
To add an exception for this directory, call:

	git config --global --add safe.directory /app/tensorRt/yolo/yolov5
YOLOv5 🚀 2025-6-15 Python-3.10.12 torch-1.11.0+cpu CPU

Downloading 

**If you hit this error**
```
ONNX: starting export with onnx 1.14.1...
ONNX: export failure ❌ 0.0s: Unsupported ONNX opset version: 17
```
- Explanation:
    - YOLOv5 currently exports to ONNX with opset=17 by default.
    - The export toolchain in this environment does not support opset 17. The message is raised by PyTorch's ONNX exporter, so the limit most likely comes from the installed PyTorch (torch 1.11.0 in the log above) rather than from the `onnx` package.
- There are two ways to fix it:
    - Option 1:
        - Export ONNX with a lower opset (the safest choice).
    - Option 2:
        - Upgrade the export toolchain (PyTorch and, if needed, `onnx`) to a version that supports opset 17.

- Why is lowering opset_version usually safer?
    - Compatibility
        - An ONNX opset defines the set of operators (ops) available at each version.
        - Exporters (such as torch.onnx.export() or tf2onnx) translate framework functions into the corresponding ONNX operators.
        - A newer opset may not yet be fully supported by onnxruntime or other inference engines, which leads to failures at run time (unsupported ops, runtime errors, ...).
        - Lowering opset_version keeps you on a stable operator set that is widely supported across platforms.
        - It reduces the chance of hitting "Unsupported operator" errors when running inference in production, in the cloud, on mobile, or on embedded devices.
    - ...
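
The "lower the opset" approach can be sketched as a small retry loop. This is only an illustration: `export_with_opset_fallback` and `fake_export` are hypothetical names, and the fake exporter simply pretends that opsets above 15 are unsupported, which mirrors the error shown above. In a real notebook the callable would wrap the actual export call.

```python
from typing import Callable, Iterable, Tuple


def export_with_opset_fallback(
    export_fn: Callable[[int], str],
    opsets: Iterable[int] = (17, 16, 15, 14, 13, 12),
) -> Tuple[int, str]:
    """Try each opset from newest to oldest and return the first that exports.

    `export_fn` is a hypothetical callable: it takes an opset number, runs the
    export, and returns the output path. It is expected to raise ValueError
    when the exporter does not support the requested opset.
    """
    candidates = tuple(opsets)
    last_error = None
    for opset in candidates:
        try:
            return opset, export_fn(opset)
        except ValueError as err:
            last_error = err  # remember why this opset failed, then go lower
    raise RuntimeError(f"No opset in {candidates} could be exported") from last_error


def fake_export(opset: int) -> str:
    # Stand-in for the real exporter; pretends opsets above 15 are unsupported.
    if opset > 15:
        raise ValueError(f"Unsupported ONNX opset version: {opset}")
    return f"yolov5s_opset{opset}.onnx"


chosen_opset, exported_path = export_with_opset_fallback(fake_export)
print(chosen_opset, exported_path)
```

Starting from the newest opset and stepping down keeps the model on the most recent operator set that the toolchain can actually produce.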

In [16]:
# Lower the opset version to 15 (matches the --opset flag below)
!python export.py --weights yolov5s.pt --include onnx --opset 15


  import pkg_resources as pkg
[34m[1mexport: [0mdata=data/coco128.yaml, weights=['yolov5s.pt'], imgsz=[640, 640], batch_size=1, device=cpu, half=False, inplace=False, keras=False, optimize=False, int8=False, per_tensor=False, dynamic=False, cache=, simplify=False, mlmodel=False, opset=15, verbose=False, workspace=4, nms=False, agnostic_nms=False, topk_per_class=100, topk_all=100, iou_thres=0.45, conf_thres=0.25, include=['onnx']
fatal: detected dubious ownership in repository at '/app/tensorRt/yolo/yolov5'
To add an exception for this directory, call:

	git config --global --add safe.directory /app/tensorRt/yolo/yolov5
YOLOv5 🚀 2025-6-15 Python-3.10.12 torch-1.11.0+cpu CPU

Downloading https://github.com/ultralytics/yolov5/releases/download/v7.0/yolov5s.pt to yolov5s.pt...
100%|███████████████████████████████████████| 14.1M/14.1M [00:27<00:00, 546kB/s]

Fusing layers... 
YOLOv5s summary: 213 layers, 7225885 parameters, 0 gradients

[34m[1mPyTorch:[0m starting from yolov5s.pt with

# 2️⃣ Building a TensorRT Engine with Dynamic Shapes

## Using trtexec

### 1️⃣ Validating the ONNX file

In [None]:
import onnx
import os

# Check that the file exists
if not os.path.isfile(onnx_path):
    raise FileNotFoundError(f"File not found: {onnx_path}")

# Load and validate the model
onnx_model = onnx.load(onnx_path)
onnx.checker.check_model(onnx_model)
print("✅ ONNX model is valid!")

✅ ONNX model is valid!


### 2️⃣ Checking that trtexec is available

In [None]:
!which trtexec || echo "⚠️ trtexec is not installed or is not on PATH."

/opt/tensorrt/bin/trtexec


### 3️⃣ Converting ONNX to a TensorRT engine

#### Look up the input name in the ONNX file to pass to minShapes, optShapes, maxShapes

In [None]:
model = onnx.load(onnx_path)
# Use a loop variable that does not shadow the built-in input()
for graph_input in model.graph.input:
    print("INPUT NAME: " + graph_input.name)

INPUT NAME: images


#### Check whether the ONNX model input has a static or dynamic shape

In [None]:
check_onnx_input_shapes(onnx_path)

Model inputs of yolov5s.onnx:
- images: [1, 3, 640, 640] --> STATIC


{'images': {'shape': [1, 3, 640, 640], 'dynamic': False}}
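
`check_onnx_input_shapes` comes from the project's `util` package and its source is not shown here. The sketch below is a guess at what such a helper might do, not the actual implementation: `describe_input_shapes` and `read_onnx_input_shapes` are hypothetical names, and the rule "a dimension is dynamic unless it is a positive integer" is an assumption. The `onnx` import is kept inside the reader function so the classification logic can run without the package.

```python
from typing import Dict, List, Union

Dim = Union[int, str, None]


def describe_input_shapes(inputs: Dict[str, List[Dim]]) -> Dict[str, dict]:
    """Mark each input as static or dynamic.

    A dimension counts as dynamic when it is not a positive integer, which
    covers symbolic names such as "batch", None, and 0 / -1 placeholders.
    """
    report = {}
    for name, shape in inputs.items():
        dynamic = any(not (isinstance(d, int) and d > 0) for d in shape)
        report[name] = {"shape": list(shape), "dynamic": dynamic}
    return report


def read_onnx_input_shapes(onnx_path: str) -> Dict[str, List[Dim]]:
    """Collect input shapes from an ONNX file (needs the `onnx` package)."""
    import onnx  # imported lazily so describe_input_shapes works without it

    model = onnx.load(onnx_path)
    shapes = {}
    for graph_input in model.graph.input:
        dims = graph_input.type.tensor_type.shape.dim
        # dim_param holds a symbolic name; dim_value holds a fixed size.
        shapes[graph_input.name] = [d.dim_param or d.dim_value for d in dims]
    return shapes


static_report = describe_input_shapes({"images": [1, 3, 640, 640]})
dynamic_report = describe_input_shapes({"images": ["batch", 3, 640, 640]})
print(static_report)
print(dynamic_report)
```

With a real file, `describe_input_shapes(read_onnx_input_shapes(onnx_path))` would produce a report in the same format as the output above.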

In [60]:

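# --onnx: path to the ONNX file.
# --saveEngine: path of the output engine file.
# --explicitBatch: was required from TensorRT 7 for dynamic shapes; in TensorRT 8.x it is deprecated and has no effect (see the warning in the log below).
# --minShapes, --optShapes, --maxShapes: the range of input shapes the engine is optimized for (here batch sizes 1 to 8).
#   They only apply when the ONNX model has dynamic axes; the export above used dynamic=False, so the static command is the one actually run.
# --workspace: caps GPU memory for the engine build at 4096 MiB; deprecated in favor of --memPoolSize (see the log below).
# --fp16: enables FP16 (half precision) kernels for mixed-precision execution, which speeds up inference while keeping accuracy reasonably high.
# --int8 can be added once the model has been calibrated.

'''
Dynamic input shape
'''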
# cmd = """
# trtexec \
#   --onnx=yolov5s.onnx \
#   --saveEngine=yolov5s.engine \
#   --explicitBatch \
#   --minShapes=images:1x3x640x640 \
#   --optShapes=images:4x3x640x640 \
#   --maxShapes=images:8x3x640x640 \
#   --workspace=4096 \
#   --fp16
# """
# !$cmd

cmd = """
trtexec \
  --onnx=yolov5s.onnx \
  --saveEngine=yolov5s.engine \
  --explicitBatch \
  --workspace=4096 \
  --fp16
"""
!$cmd


&&&& RUNNING TensorRT.trtexec [TensorRT v8603] # trtexec --onnx=yolov5s.onnx --saveEngine=yolov5s.engine --explicitBatch --workspace=4096 --fp16
[06/15/2025-09:50:31] [W] --explicitBatch flag has been deprecated and has no effect!
[06/15/2025-09:50:31] [W] Explicit batch dim is automatically enabled if input model is ONNX or if dynamic shapes are provided when the engine is built.
[06/15/2025-09:50:31] [W] --workspace flag has been deprecated by --memPoolSize flag.
[06/15/2025-09:50:31] [I] === Model Options ===
[06/15/2025-09:50:31] [I] Format: ONNX
[06/15/2025-09:50:31] [I] Model: yolov5s.onnx
[06/15/2025-09:50:31] [I] Output:
[06/15/2025-09:50:31] [I] === Build Options ===
[06/15/2025-09:50:31] [I] Max batch: explicit batch
[06/15/2025-09:50:31] [I] Memory Pools: workspace: 4096 MiB, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[06/15/2025-09:50:31] [I] minTiming: 1
[06/15/2025-09:50:31] [I] avgTiming: 8
[06/15/2025-09:50:31] [I] Precision: FP32+FP16
[06/15/2025-0
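
The log above flags `--explicitBatch` and `--workspace` as deprecated. One way to keep the command in step with those warnings is to assemble it in Python. This is a sketch under assumptions: `build_trtexec_cmd` is a hypothetical helper, and it assumes `--memPoolSize=workspace:<N>` takes the size in MiB, as the trtexec help for TensorRT 8.x describes. The shape flags are only added when a profile is supplied, since they apply to models with dynamic axes.

```python
import shlex
from typing import Dict, List, Optional, Tuple

Shape = Tuple[int, int, int, int]


def build_trtexec_cmd(
    onnx_path: str,
    engine_path: str,
    workspace_mib: int = 4096,
    fp16: bool = True,
    shapes: Optional[Dict[str, Tuple[Shape, Shape, Shape]]] = None,
) -> List[str]:
    """Return a trtexec argument list without the deprecated flags."""
    cmd = [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
        f"--memPoolSize=workspace:{workspace_mib}",  # replaces --workspace
    ]
    if fp16:
        cmd.append("--fp16")
    if shapes:
        # Each input maps to a (min, opt, max) triple of shapes.
        for flag, index in (("--minShapes", 0), ("--optShapes", 1), ("--maxShapes", 2)):
            spec = ",".join(
                f"{name}:{'x'.join(map(str, triple[index]))}"
                for name, triple in shapes.items()
            )
            cmd.append(f"{flag}={spec}")
    return cmd


static_cmd = build_trtexec_cmd("yolov5s.onnx", "yolov5s.engine")
dynamic_cmd = build_trtexec_cmd(
    "yolov5s.onnx",
    "yolov5s.engine",
    shapes={"images": ((1, 3, 640, 640), (4, 3, 640, 640), (8, 3, 640, 640))},
)
print(shlex.join(static_cmd))
print(shlex.join(dynamic_cmd))
```

Returning a list rather than a single string means the command can be handed to `subprocess.run` without shell quoting issues, while `shlex.join` gives a printable form for the notebook's `!` syntax.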

In [None]:
### 4️⃣ Check the engine file after the build
if os.path.isfile(engine_path):
    print("✅ TensorRT engine was created successfully!")
else:
    raise FileNotFoundError("❌ TensorRT engine was not created.")

✅ TensorRT engine was created successfully!


In [62]:
### 5️⃣ Benchmark the TensorRT engine
# --loadEngine: run the inference benchmark on the engine that was already built
# --iterations=100: run at least 100 iterations to measure speed

!trtexec --loadEngine=yolov5s.engine --iterations=100


&&&& RUNNING TensorRT.trtexec [TensorRT v8603] # trtexec --loadEngine=yolov5s.engine --iterations=100
[06/15/2025-10:08:37] [I] === Model Options ===
[06/15/2025-10:08:37] [I] Format: *
[06/15/2025-10:08:37] [I] Model: 
[06/15/2025-10:08:37] [I] Output:
[06/15/2025-10:08:37] [I] === Build Options ===
[06/15/2025-10:08:37] [I] Max batch: 1
[06/15/2025-10:08:37] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[06/15/2025-10:08:37] [I] minTiming: 1
[06/15/2025-10:08:37] [I] avgTiming: 8
[06/15/2025-10:08:37] [I] Precision: FP32
[06/15/2025-10:08:37] [I] LayerPrecisions: 
[06/15/2025-10:08:37] [I] Layer Device Types: 
[06/15/2025-10:08:37] [I] Calibration: 
[06/15/2025-10:08:37] [I] Refit: Disabled
[06/15/2025-10:08:37] [I] Version Compatible: Disabled
[06/15/2025-10:08:37] [I] ONNX Native InstanceNorm: Disabled
[06/15/2025-10:08:37] [I] TensorRT runtime: full
[06/15/2025-10:08:37] [I] Lean DLL Path: 
[06/15/2025-10:08:37] [I] Tempfile 
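
When the same kind of measurement has to be taken from Python rather than from trtexec, a timing loop along these lines is one option. It is a minimal sketch: `benchmark` is a hypothetical helper, the warm-up count is an arbitrary choice, and the workload here is a CPU stand-in. For GPU inference the callable would also need to wait for the device to finish (for example by synchronizing the CUDA stream) before returning, otherwise only the launch time is measured.

```python
import statistics
import time
from typing import Callable


def benchmark(fn: Callable[[], object], warmup: int = 10, iterations: int = 100) -> dict:
    """Time `fn` and report latency statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not recorded

    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    p99_index = min(len(samples_ms) - 1, int(0.99 * len(samples_ms)))
    return {
        "iterations": len(samples_ms),
        "mean_ms": statistics.fmean(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "p99_ms": samples_ms[p99_index],
    }


# CPU stand-in for an inference call
stats = benchmark(lambda: sum(range(1000)), warmup=5, iterations=100)
print(stats)
```

Reporting the median and a high percentile next to the mean makes occasional slow iterations visible instead of folding them into a single average.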

# Option 2: Using the TensorRT Python API

In [None]:
import tensorrt as trt
import os

FP16_MODE = True
WORKSPACE_SIZE = 1 << 30  # 1 GiB

# Dynamic shape configs
MIN_BATCH = 1
OPT_BATCH = 4
MAX_BATCH = 16
HEIGHT = 640
WIDTH = 640

In [73]:
# ==== STEP 1: Setup TensorRT logger & builder ==== #
TRT_LOGGER = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(TRT_LOGGER)
network_flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
network = builder.create_network(network_flags)
parser = trt.OnnxParser(network, TRT_LOGGER)

[06/15/2025-10:15:38] [TRT] [I] The logger passed into createInferBuilder differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
[06/15/2025-10:15:38] [TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1653, GPU 1265 (MiB)
[06/15/2025-10:15:38] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading


In [None]:
# ==== STEP 2: Parse ONNX model ==== #
if not os.path.isfile(onnx_path):
    raise FileNotFoundError(f"ONNX file not found: {onnx_path}")

with open(onnx_path, 'rb') as f:
    if not parser.parse(f.read()):
        # Print every parser error before failing
        for idx in range(parser.num_errors):
            print(parser.get_error(idx))
        raise RuntimeError("❌ Failed to parse the ONNX model!")

print("✅ ONNX parsed successfully!")

[06/15/2025-10:15:39] [TRT] [W] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
✅ ONNX parsed successfully!


In [75]:
# ==== STEP 3: Inspect the model input/output ==== #
input_tensor = network.get_input(0)
print(f"Input tensor: {input_tensor.name}, shape: {input_tensor.shape}")
output_tensor = network.get_output(0)
print(f"Output tensor: {output_tensor.name}, shape: {output_tensor.shape}")

Input tensor: images, shape: (1, 3, 640, 640)
Output tensor: output0, shape: (1, 25200, 85)
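
The output shape `(1, 25200, 85)` can be reproduced with a little arithmetic. The sketch below assumes the usual YOLOv5s head layout (three detection scales with strides 8, 16 and 32, three anchors per grid cell) and the 80 COCO classes; `yolov5_output_shape` is a hypothetical helper, not part of any library.

```python
def yolov5_output_shape(img_size=640, num_classes=80, strides=(8, 16, 32),
                        anchors_per_cell=3, batch=1):
    """Compute the (batch, predictions, values) shape of the raw YOLOv5 output."""
    # One grid per stride: 640/8 = 80, 640/16 = 40, 640/32 = 20 cells per side.
    cells = sum((img_size // stride) ** 2 for stride in strides)
    # Each prediction holds 4 box coordinates, 1 objectness score, and class scores.
    values_per_prediction = 4 + 1 + num_classes
    return (batch, anchors_per_cell * cells, values_per_prediction)


shape = yolov5_output_shape()
print(shape)  # (1, 25200, 85)
```

So 25200 is 3 × (6400 + 1600 + 400) candidate boxes, and 85 is 4 + 1 + 80 values per box.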


In [76]:
# ==== STEP 4: Create the builder config and enable FP16 ==== #
config = builder.create_builder_config()
# set_memory_pool_limit replaces the deprecated max_workspace_size attribute (TensorRT 8.4+)
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, WORKSPACE_SIZE)
if FP16_MODE:
    config.set_flag(trt.BuilderFlag.FP16)


In [77]:
# ==== STEP 5: Create the optimization profile ==== #
profile = builder.create_optimization_profile()
input_name = input_tensor.name

## dynamic shape input
# profile.set_shape(input_name,
#                   (MIN_BATCH, 3, HEIGHT, WIDTH),
#                   (OPT_BATCH, 3, HEIGHT, WIDTH),
#                   (MAX_BATCH, 3, HEIGHT, WIDTH))

## static shape input
profile.set_shape(input_name,
                  (1, 3, HEIGHT, WIDTH),
                  (1, 3, HEIGHT, WIDTH),
                  (1, 3, HEIGHT, WIDTH))
config.add_optimization_profile(profile)

0
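
An optimization profile takes three shapes per input (minimum, optimum, maximum), and the batch sizes need to satisfy min ≤ opt ≤ max. The sketch below builds and checks those shapes in plain Python; `batch_profile` is a hypothetical helper, and the values 1, 4, 16 are the `MIN_BATCH`, `OPT_BATCH`, `MAX_BATCH` constants defined earlier.

```python
from typing import Tuple

Shape = Tuple[int, int, int, int]


def batch_profile(min_batch: int, opt_batch: int, max_batch: int,
                  channels: int = 3, height: int = 640,
                  width: int = 640) -> Tuple[Shape, Shape, Shape]:
    """Return (min, opt, max) NCHW shapes that differ only in batch size."""
    if not (1 <= min_batch <= opt_batch <= max_batch):
        raise ValueError(
            f"Expected 1 <= min <= opt <= max, got {min_batch}, {opt_batch}, {max_batch}"
        )
    return tuple((b, channels, height, width) for b in (min_batch, opt_batch, max_batch))


min_shape, opt_shape, max_shape = batch_profile(1, 4, 16)
print(min_shape, opt_shape, max_shape)

# With TensorRT available and a model exported with a dynamic batch axis,
# the shapes would be passed on as:
# profile.set_shape(input_name, min_shape, opt_shape, max_shape)
```

Checking the ordering up front gives a readable error in the notebook instead of a failure later in the build.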

In [78]:
# ==== STEP 6: Build engine ==== #
print("⚙️ Building the TensorRT engine...")
# build_engine is deprecated in TensorRT 8.x; builder.build_serialized_network is its replacement
engine = builder.build_engine(network, config)
if engine is None:
    raise RuntimeError("❌ Engine build failed!")

⚙️ Building the TensorRT engine...
[06/15/2025-10:15:54] [TRT] [I] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[06/15/2025-10:15:54] [TRT] [I] Graph optimization time: 0.0303711 seconds.
[06/15/2025-10:15:54] [TRT] [I] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[06/15/2025-10:15:54] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.


  engine = builder.build_engine(network, config)


[06/15/2025-10:21:07] [TRT] [I] Detected 1 inputs and 4 output network tensors.
[06/15/2025-10:21:07] [TRT] [I] Total Host Persistent Memory: 290704
[06/15/2025-10:21:07] [TRT] [I] Total Device Persistent Memory: 540672
[06/15/2025-10:21:07] [TRT] [I] Total Scratch Memory: 512
[06/15/2025-10:21:07] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 17 MiB, GPU 213 MiB
[06/15/2025-10:21:07] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 108 steps to complete.
[06/15/2025-10:21:07] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 4.05187ms to assign 11 blocks to 108 nodes requiring 17066496 bytes.
[06/15/2025-10:21:07] [TRT] [I] Total Activation Memory: 17066496
[06/15/2025-10:21:07] [TRT] [W] TensorRT encountered issues when converting weights between types and that could affect accuracy.
[06/15/2025-10:21:07] [TRT] [W] If this is not the desired behavior, please modify the weights or retrain with regularization to adjus

In [None]:
# ==== STEP 7: Save engine to file ==== #
with open(engine_path, 'wb') as f:
    f.write(engine.serialize())
print(f"✅ Engine saved: {engine_path}")


✅ Engine saved: yolov5s_api.engine


# 4️⃣ Running inference with TensorRT

In [None]:
# Build the command string with an f-string
cmd = f"python trt_infer.py --engine {engine_path} --input {img_folder_path} --output {output_folder_path}"

# Then pass it to the shell with !
!{cmd}


[06/15/2025-15:23:09] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
Found 2 images in ./data/images
Processed: ./data/images/bus.jpg -> output_results_trt/bus_output.jpg
Processed: ./data/images/zidane.jpg -> output_results_trt/zidane_output.jpg


# Running inference with ONNX

In [39]:
onnx_path = "yolov5s.onnx"
img_folder_path = './data/images'
output_folder_path = 'output_results_onnx'

In [43]:
# Build the command string with an f-string
cmd = f"python onnx_infer.py --onnx {onnx_path} --input {img_folder_path} --output {output_folder_path}"

# Then pass it to the shell with !
!{cmd}

Found 2 images
Saved ONNX output to output_results_onnx/bus_onnx.jpg
Saved ONNX output to output_results_onnx/zidane_onnx.jpg


# Comparing ONNX and engine results

In [4]:
!python tensorrt_benchmark.py

=== Starting benchmark ===
ONNX avg time: 62.121 ms
[06/15/2025-11:27:19] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
  input_shape = engine.get_binding_shape(0)
  output_shape = engine.get_binding_shape(1)
TensorRT avg time: 4.687 ms
TensorRT is 13.25x faster than ONNX
=== Output comparison ===
Max difference: 4.368988037109375
Mean difference: 0.0042354934848845005
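
The source of `tensorrt_benchmark.py` is not shown, so the following is only a sketch of how numbers like these could be derived. `compare_outputs` and `speedup` are hypothetical helpers, and the use of absolute differences is an assumption about what "Max difference" and "Mean difference" mean. The small arrays stand in for the two model outputs.

```python
import numpy as np


def compare_outputs(reference, candidate) -> dict:
    """Return the max and mean absolute element-wise difference."""
    reference = np.asarray(reference, dtype=np.float32)
    candidate = np.asarray(candidate, dtype=np.float32)
    if reference.shape != candidate.shape:
        raise ValueError(f"Shape mismatch: {reference.shape} vs {candidate.shape}")
    abs_diff = np.abs(reference - candidate)
    return {"max_diff": float(abs_diff.max()), "mean_diff": float(abs_diff.mean())}


def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """How many times faster the optimized run is than the baseline."""
    return baseline_ms / optimized_ms


# Average latencies taken from the benchmark output above
print(f"{speedup(62.121, 4.687):.2f}x")  # 13.25x

# Tiny stand-in arrays instead of real model outputs
report = compare_outputs([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.5], [3.0, 3.0]])
print(report)
```

A large maximum next to a very small mean, as in the output above, suggests that only a few elements differ noticeably. That may be FP16 rounding on large values such as box coordinates, but confirming it would require inspecting where the largest differences occur.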
