Description
Describe the issue
I'm seeing much worse performance with int8 than with fp32 on the CUDAExecutionProvider (see below).
My timings are:
~ 0.3s for fp32
~ 4s for int8
Disclaimer:
I built onnxruntime-gpu for ppc64le (Power9 with an Nvidia V100) using the following parameters:
# CUDA 12.4
# 70;75 supports P9 V100 & T4 GPUs
./build.sh --config Release --use_cuda --cuda_home=/usr/local/cuda --cudnn_home=/usr/local/cuda --skip_submodule_sync --parallel --build_shared_lib --build_wheel --update --build --allow_running_as_root --cmake_extra_defines=CMAKE_CUDA_ARCHITECTURES="70;75"
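A quick check that the resulting wheel at least exposes the CUDA EP (standard onnxruntime Python API, nothing specific to my build assumed):

import onnxruntime as ort

# Execution providers compiled into this wheel; should list 'CUDAExecutionProvider'
print(ort.__version__)
print(ort.get_available_providers())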
Is there a build flag needed to enable int8 support? I suspect it might be missing from my compilation.
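If it is not a build flag, my other guess is that the quantized ops fall back to the CPU EP. A sketch of how I would check node placement, assuming verbose session logging prints the per-node provider assignment:

import onnxruntime as ort

so = ort.SessionOptions()
so.log_severity_level = 0  # 0 = VERBOSE; logs which provider each node is assigned to
session = ort.InferenceSession("/tmp/models/fcn-resnet50-12-int8.onnx",
                               sess_options=so,
                               providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
print(session.get_providers())  # providers actually used by this session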
To reproduce
Minimal onnxruntime CUDA example (fp32 vs int8)
Get the fp32 and int8 models for comparison:
mkdir -p /tmp/models
#fp32
wget https://github.com/onnx/models/raw/main/validated/vision/object_detection_segmentation/fcn/model/fcn-resnet50-12.onnx -O /tmp/models/fcn-resnet50-12.onnx
#int8
wget https://github.com/onnx/models/raw/main/validated/vision/object_detection_segmentation/fcn/model/fcn-resnet50-12-int8.onnx -O /tmp/models/fcn-resnet50-12-int8.onnx
#demo.jpg
wget https://github.com/onnx/models/raw/main/validated/vision/object_detection_segmentation/faster-rcnn/dependencies/demo.jpg -O /tmp/models/demo.jpg
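As a sanity check that the int8 file is actually quantized (my assumption is that it uses QLinearConv/QuantizeLinear-style nodes), the op types can be counted with onnx:

import onnx
from collections import Counter

model = onnx.load("/tmp/models/fcn-resnet50-12-int8.onnx")
print(Counter(node.op_type for node in model.graph.node))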
Run inference
import onnxruntime as ort
import onnx
import time
from PIL import Image
import numpy as np
def preprocess_image(path):
    # Load the image, resize to 1200x1200, convert HWC -> NCHW and add a batch dim
    image = Image.open(path)
    img = image.resize((1200, 1200), Image.Resampling.BILINEAR)
    img_data = np.array(img)
    img_data = np.transpose(img_data, [2, 0, 1])
    img_data = np.expand_dims(img_data, 0)
    # Normalize each channel with ImageNet mean/std
    mean_vec = np.array([0.485, 0.456, 0.406])
    stddev_vec = np.array([0.229, 0.224, 0.225])
    norm_img_data = np.zeros(img_data.shape).astype('float32')
    for i in range(img_data.shape[1]):
        norm_img_data[:, i, :, :] = (img_data[:, i, :, :] / 255 - mean_vec[i]) / stddev_vec[i]
    return norm_img_data
img_data = preprocess_image("/tmp/models/demo.jpg")
providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
#fp32
model = onnx.load("/tmp/models/fcn-resnet50-12.onnx")
session = ort.InferenceSession(model.SerializeToString(), providers=providers)
ort_inputs = {session.get_inputs()[0].name: img_data}
st = time.time()
preds = session.run(None, ort_inputs)
ed = time.time()
print(f"Time taken for fp32: {ed-st:.6f} seconds")
#int8
model = onnx.load("/tmp/models/fcn-resnet50-12-int8.onnx")
session = ort.InferenceSession(model.SerializeToString(), providers=providers)
ort_inputs = {session.get_inputs()[0].name: img_data}
st = time.time()
preds = session.run(None, ort_inputs)
ed = time.time()
print(f"Time taken for int8: {ed-st:.6f} seconds")
Urgency
No response
Platform
Linux
OS Version
AlmaLinux 8.10
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
1.22
ONNX Runtime API
Python
Architecture
IBM Power
Execution Provider
CUDA
Execution Provider Library Version
CUDA 12.4
Model File
No response
Is this a quantized model?
Yes