Benchmark of TVM quantized model on CUDA
This repository contains benchmark code for the int8 inference speed of TVM, accompanying the blog post Automating Optimization of Quantized Deep Learning Models on CUDA. Benchmarks of MXNet and TensorRT are provided as baselines.
How to Run
The benchmark was conducted using tvm@e22b58. (This is an outdated version; please check out this branch to run with a recent tvm version.)
LLVM and CUDA need to be enabled.
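In TVM's build configuration this is usually done in config.cmake; a sketch of the relevant lines, assuming the standard source build:

```cmake
set(USE_CUDA ON)   # build the CUDA codegen and runtime
set(USE_LLVM ON)   # enable LLVM for host-side code generation
                   # (can also be a path to llvm-config)
```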
A CUDA device with Compute Capability 6.1 or above is required to support the dp4a instruction.
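The dp4a instruction computes a 4-way int8 dot product accumulated into a 32-bit integer, which is what makes fast int8 inference possible on these devices. A minimal NumPy sketch of its semantics (the helper name here is ours, not part of the benchmark code):

```python
import numpy as np

def dp4a(a, b, c):
    # Simulate CUDA's dp4a: dot product of two 4-element int8 vectors,
    # accumulated into the 32-bit integer c.
    a32 = np.asarray(a, dtype=np.int8).astype(np.int32)
    b32 = np.asarray(b, dtype=np.int8).astype(np.int32)
    return int(np.dot(a32, b32) + c)

# Example: 1*5 + 2*6 + 3*7 + 4*8 + 10 = 80
print(dp4a([1, 2, 3, 4], [5, 6, 7, 8], 10))
```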
We only provide auto-tuning logs for the NVIDIA GTX 1080. To run on other devices, follow the AutoTVM tutorial to generate your own tuning logs.
python3 run_tvm.py --log_file logs/history_best_1080.log
MXNet 1.4 and cuDNN 7.3+ are required.
TensorRT 5 is required. We use ONNX models as input; they are exported from MXNet when the benchmark script runs.
cd TensorRT; make; cd -; python3 run_tensorrt.py