
Inference with structured sparsity and quantization


yester31/TensorRT_Sparse


TensorRT_ASP

0. Introduction

  • Goal : model compression using 2:4 structured sparsity
  • Base Model : ResNet-18
  • Dataset : Imagenet100
  • Pruning Process :
    1. Train base model with Imagenet100 dataset
    2. Prune the model in a 2:4 sparse pattern for the FC and convolution layers.
    3. Retrain the pruned model
    4. Convert to a TensorRT PTQ INT8 model
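Step 2 of the process above (2:4 pruning) can be sketched in plain NumPy: in every group of 4 consecutive weights, the 2 smallest-magnitude values are zeroed, giving 50% structured sparsity. The repo itself uses NVIDIA's ASP tooling for this, so `prune_2_4` below is only an illustrative sketch, not the repo's code:

```python
import numpy as np

def prune_2_4(weights):
    """Apply a 2:4 structured sparsity mask: in every group of 4
    consecutive weights, keep the 2 with the largest magnitude and
    zero the other 2. (Illustrative sketch; the repo relies on
    NVIDIA's ASP tooling rather than this function.)"""
    flat = weights.reshape(-1, 4)                   # groups of 4
    # indices of the 2 smallest-magnitude weights in each group
    drop = np.argsort(np.abs(flat), axis=1)[:, :2]
    mask = np.ones_like(flat)
    np.put_along_axis(mask, drop, 0.0, axis=1)
    return (flat * mask).reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16))                    # e.g. an FC weight matrix
pruned = prune_2_4(w)
# every group of 4 now contains exactly 2 zeros -> 50% sparsity
assert ((pruned.reshape(-1, 4) == 0).sum(axis=1) == 2).all()
```

Hardware with Sparse Tensor Cores (e.g. the RTX 3060 used here) can skip the zeroed half of each group, which is where the latency gain in section 3 comes from.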

1. Development Environment

  • Device
    • MSI laptop
    • CPU i7-11375H
    • GPU RTX-3060
  • Dependency
    • WSL (Ubuntu 22.04)
    • CUDA 12.1
    • cuDNN 8.9.2
    • TensorRT 8.6.1
    • PyTorch 2.1.0+cu121

2. Code Scheme

    Quantization_EX/
    ├── calibrator.py       # calibration class for TensorRT PTQ
    ├── common.py           # utils for TensorRT
    ├── onnx_export.py      # export the ASP model to ONNX
    ├── train.py            # train the base model with ASP
    ├── trt_infer_2.py      # TensorRT model build using Polygraphy
    ├── trt_infer_acc.py    # TensorRT model accuracy check
    ├── trt_infer.py        # TensorRT model infer
    ├── utils.py            # utils
    ├── LICENSE
    └── README.md
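calibrator.py feeds calibration batches to TensorRT's INT8 PTQ. The underlying idea of symmetric INT8 quantization with a calibration scale can be sketched as follows; note this is only an illustration (TensorRT's default entropy calibrator picks the scale more carefully than the simple max-calibration shown here, and `int8_scale_max` / `quantize_int8` are hypothetical helpers, not the repo's API):

```python
import numpy as np

def int8_scale_max(activations):
    """Symmetric 'max' calibration: choose the scale so the largest
    absolute activation maps to 127. (Simplest variant; TensorRT's
    entropy calibrator chooses the scale differently.)"""
    return np.abs(activations).max() / 127.0

def quantize_int8(x, scale):
    """Quantize float values to int8 with a symmetric scale."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

x = np.linspace(-2.0, 2.0, 9, dtype=np.float32)   # stand-in activation batch
s = int8_scale_max(x)
q = quantize_int8(x, s)
x_hat = q.astype(np.float32) * s                  # dequantize
# round-trip error is bounded by half a quantization step
assert np.max(np.abs(x - x_hat)) <= s / 2 + 1e-6
```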

3. Performance Evaluation

  • Measured over 10,000 iterations with a single input of shape [1, 3, 224, 224]
|                     | TensorRT PTQ | TensorRT PTQ with ASP |
| ------------------- | ------------ | --------------------- |
| Precision           | Int8         | Int8                  |
| Avg Latency [ms]    | 0.418        | 0.388                 |
| Avg FPS [frame/sec] | 2388.33      | 2572.17               |
| GPU Memory [MB]     | 123          | 119                   |
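As a quick consistency check (my own arithmetic, not from the repo): average FPS should be roughly the reciprocal of average latency, and the ASP engine's speedup follows directly from the measured FPS values:

```python
# Sanity-check the table: FPS ~= 1000 / latency_ms, and the ASP
# engine's speedup is the ratio of the two measured FPS values.
lat_dense, lat_sparse = 0.418, 0.388        # ms, from the table
fps_dense, fps_sparse = 2388.33, 2572.17    # frame/sec, from the table

assert abs(1000 / lat_dense - fps_dense) / fps_dense < 0.01
assert abs(1000 / lat_sparse - fps_sparse) / fps_sparse < 0.01

speedup = fps_sparse / fps_dense            # ~1.077 -> ~7.7% faster with ASP
```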

4. Guide

  • train.py -> onnx_export.py -> trt_infer.py -> trt_infer_acc.py
