Deep Learning Model Optimization Using the TensorRT API on Windows

yester31/TensorRT_API

TensorRT_EX

Environments

  • Windows 10 laptop
  • CPU i7-11375H
  • GPU RTX-3060
  • Visual Studio 2017
  • CUDA 11.1
  • TensorRT 8.0.3.4 (unet)
  • TensorRT 8.2.0.6 (detr, yolov5s, real-esrgan)
  • OpenCV 3.4.5
  • Create an Engine directory for engine files
  • Create an Int8_calib_table directory for PTQ calibration tables

Custom plugin example

  • Layer for input preprocessing (NHWC -> NCHW, BGR -> RGB, [0, 255] -> [0, 1] normalization)
  • plugin_ex1.cpp (plugin sample code)
  • preprocess.hpp (plugin definition)
  • preprocess.cu (preprocessing CUDA kernel)
  • Validation_py/Validation_preproc.py (result validation against PyTorch)
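For reference, the transformation the plugin performs can be sketched as a CPU function. This is a minimal sketch of the same logic (NHWC -> NCHW, BGR -> RGB, [0, 255] -> [0, 1]); the function and parameter names are illustrative, not the repository's API, and the real plugin runs this per-pixel loop as a CUDA kernel in preprocess.cu.

```cpp
#include <cstdint>
#include <vector>

// CPU reference for the preprocessing the plugin performs on the GPU:
// layout change NHWC -> NCHW, channel swap BGR -> RGB, scale [0,255] -> [0,1].
std::vector<float> preprocess_cpu(const std::vector<std::uint8_t>& nhwc_bgr,
                                  int h, int w) {
    std::vector<float> nchw_rgb(3 * h * w);
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            for (int c = 0; c < 3; ++c) {
                // source index: interleaved NHWC, channels in BGR order
                std::uint8_t v = nhwc_bgr[(y * w + x) * 3 + c];
                // destination: planar NCHW; channel (2 - c) swaps B and R
                nchw_rgb[(2 - c) * h * w + y * w + x] = v / 255.0f;
            }
        }
    }
    return nchw_rgb;
}
```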

Classification model

vgg11 model

  • vgg11.cpp
  • with preprocess plugin

resnet18 model

  • resnet18.cpp
  • 100 images from the COCO val2017 dataset for PTQ calibration
  • All results match PyTorch
  • Comparison of average execution time over 100 iterations and GPU memory usage for one 224x224x3 image
| | PyTorch FP32 | TensorRT FP32 | TensorRT FP16 | TensorRT Int8 (PTQ) |
|---|---|---|---|---|
| Avg duration [ms] | 4.1 | 1.7 | 0.7 | 0.6 |
| FPS [frame/sec] | 243 | 590 | 1385 | 1577 |
| Memory [GB] | 1.551 | 1.288 | 0.941 | 0.917 |
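The Int8 (PTQ) column relies on the calibration table built from the 100 COCO images. The core idea is mapping each tensor's observed dynamic range onto int8. The sketch below uses a plain max-abs symmetric range for clarity; this is a simplification — TensorRT's entropy calibrator (IInt8EntropyCalibrator2) instead chooses the range by minimizing KL divergence over the calibration set — and the function names are illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Pick a symmetric quantization scale from observed activations:
// map [-max_abs, max_abs] onto the int8 range [-127, 127].
float compute_scale(const std::vector<float>& activations) {
    float max_abs = 0.0f;
    for (float v : activations) max_abs = std::max(max_abs, std::fabs(v));
    return max_abs / 127.0f;
}

// Quantize one value with that scale, clamping to the int8 range.
std::int8_t quantize(float v, float scale) {
    int q = static_cast<int>(std::round(v / scale));
    return static_cast<std::int8_t>(std::min(127, std::max(-127, q)));
}
```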

Semantic Segmentation model

  • UNet model (unet.cpp)
  • use TensorRT 8.0.3.4 for the UNet model (version 8.2.0.6 raises an error with this model)
  • unet_carvana_scale0.5_epoch1.pth
  • additional preprocessing (resize & letterbox padding) with OpenCV
  • postprocessing (model output to image)
  • All results match PyTorch
  • Comparison of average execution time over 100 iterations and GPU memory usage for one 512x512x3 image
| | PyTorch FP32 | PyTorch FP16 | TensorRT FP32 | TensorRT FP16 | TensorRT Int8 (PTQ) |
|---|---|---|---|---|---|
| Avg duration [ms] | 66.21 | 34.58 | 40.81 | 13.52 | 8.19 |
| FPS [frame/sec] | 15 | 29 | 25 | 77 | 125 |
| Memory [GB] | 3.863 | 2.677 | 1.552 | 1.367 | 1.051 |
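The letterbox preprocessing above scales the input to fit the network's square input while preserving aspect ratio, then pads the remainder. A minimal sketch of that geometry, assuming a square target side (512 here) and evenly split padding; struct and function names are illustrative — the repository performs the actual work with OpenCV resize and border padding.

```cpp
#include <algorithm>

// Resized size and padding for letterboxing a src_w x src_h image
// into a dst x dst square.
struct Letterbox {
    int new_w, new_h;   // image size after aspect-preserving resize
    int pad_x, pad_y;   // padding per side (left/right, top/bottom)
};

Letterbox letterbox(int src_w, int src_h, int dst) {
    // scale so the longer side exactly fills the target
    double r = std::min(static_cast<double>(dst) / src_w,
                        static_cast<double>(dst) / src_h);
    Letterbox lb;
    lb.new_w = static_cast<int>(src_w * r);
    lb.new_h = static_cast<int>(src_h * r);
    lb.pad_x = (dst - lb.new_w) / 2;
    lb.pad_y = (dst - lb.new_h) / 2;
    return lb;
}
```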

Object Detection model (ViT)

  • DETR model (detr_trt.cpp)
  • additional preprocessing (mean/std normalization)
  • postprocessing (draw detection results on the image)
  • All results match PyTorch
  • Comparison of average execution time over 100 iterations and GPU memory usage for one 500x500x3 image
| | PyTorch FP32 | PyTorch FP16 | TensorRT FP32 | TensorRT FP16 | TensorRT Int8 (PTQ) |
|---|---|---|---|---|---|
| Avg duration [ms] | 37.03 | 30.71 | 16.40 | 6.07 | 5.30 |
| FPS [frame/sec] | 27 | 33 | 61 | 165 | 189 |
| Memory [GB] | 1.563 | 1.511 | 1.212 | 1.091 | 1.005 |

Object Detection model

  • Yolov5s model (yolov5s.cpp)
  • Comparison of average execution time over 100 iterations and GPU memory usage for one 640x640x3 resized & padded image
| | PyTorch FP32 | TensorRT FP32 | TensorRT Int8 (PTQ) |
|---|---|---|---|
| Avg duration [ms] | 7.72 | 6.16 | 2.86 |
| FPS [frame/sec] | 129 | 162 | 350 |
| Memory [GB] | 1.670 | 1.359 | 0.920 |

Super-Resolution model

  • Real-ESRGAN model (real-esrgan.cpp)
  • RealESRGAN_x4plus.pth
  • Scale up 4x (448x640x3 -> 1792x2560x3)
  • Comparison of average execution time over 100 iterations and GPU memory usage
  • [update] RealESRGAN_x2plus model (set OUT_SCALE=2)
| | PyTorch FP32 | PyTorch FP16 | TensorRT FP32 | TensorRT FP16 |
|---|---|---|---|---|
| Avg duration [ms] | 4109 | 1936 | 2139 | 737 |
| FPS [frame/sec] | 0.24 | 0.52 | 0.47 | 1.35 |
| Memory [GB] | 5.029 | 4.407 | 3.807 | 3.311 |

Object Detection model 2

  • Yolov6s model (yolov6.cpp)
  • Comparison of average execution time over 1000 iterations and GPU memory usage (with preprocessing, without NMS, 536x640x3)
| | PyTorch FP32 | TensorRT FP32 | TensorRT FP16 | TensorRT Int8 (PTQ) |
|---|---|---|---|---|
| Avg duration [ms] | 20.7 | 10.3 | 3.54 | 2.58 |
| FPS [frame/sec] | 48.14 | 96.21 | 282.26 | 387.89 |
| Memory [GB] | 1.582 | 1.323 | 0.956 | 0.913 |
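The benchmark above excludes NMS, the postprocessing step that suppresses duplicate detections. Its core primitive is the intersection-over-union test between two boxes, sketched below for axis-aligned boxes given as (x1, y1, x2, y2); the function name and box layout are illustrative, not the repository's API.

```cpp
#include <algorithm>

// Intersection-over-Union between two axis-aligned boxes.
// NMS discards a box when its IoU with a higher-scoring box
// exceeds a threshold (commonly ~0.45 for YOLO-family models).
float iou(float ax1, float ay1, float ax2, float ay2,
          float bx1, float by1, float bx2, float by2) {
    float ix1 = std::max(ax1, bx1), iy1 = std::max(ay1, by1);
    float ix2 = std::min(ax2, bx2), iy2 = std::min(ay2, by2);
    float iw = std::max(0.0f, ix2 - ix1);
    float ih = std::max(0.0f, iy2 - iy1);
    float inter = iw * ih;
    float area_a = (ax2 - ax1) * (ay2 - ay1);
    float area_b = (bx2 - bx1) * (by2 - by1);
    return inter / (area_a + area_b - inter);
}
```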

Object Detection model 3 (in progress)

  • Yolov7 model (yolov7.cpp)

Using a C++ TensorRT model in Python via a DLL

A typical TensorRT model creation sequence using the TensorRT API

  1. Prepare the trained model in the training framework (generate the weight file to be used by TensorRT).
  2. Implement the model with the TensorRT API to match the trained model's structure.
  3. Extract the weights from the trained model.
  4. Pass the weights to the corresponding layers of the TensorRT model.
  5. Build and run.
  6. Once the TensorRT model is built, the model stream is serialized and saved as an engine file.
  7. In subsequent runs, load only the engine file for inference (if model parameters or layers are modified, re-run from step 4).
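Steps 6 and 7 boil down to a byte-blob round trip: write the serialized engine once, and on later runs read it back instead of rebuilding. In TensorRT the bytes come from ICudaEngine::serialize() (an IHostMemory blob) and go into IRuntime::deserializeCudaEngine(); the file round trip itself is plain C++, sketched below with illustrative function names.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Step 6: dump the serialized engine bytes to a file.
// In the repository, `data`/`size` would come from engine->serialize().
void save_engine(const std::string& path, const void* data, std::size_t size) {
    std::ofstream f(path, std::ios::binary);
    f.write(static_cast<const char*>(data), size);
}

// Step 7: read the engine file back into memory.
// The buffer is then handed to IRuntime::deserializeCudaEngine().
std::vector<char> load_engine(const std::string& path) {
    std::ifstream f(path, std::ios::binary | std::ios::ate);
    std::vector<char> buf(static_cast<std::size_t>(f.tellg()));
    f.seekg(0);
    f.read(buf.data(), buf.size());
    return buf;
}
```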

Reference
