[Performance] Slow inference on CUDA V100 int8 #24807

@mgiessing

Description

Describe the issue

I'm seeing poor performance with int8 compared to fp32 on the CUDAExecutionProvider (see below).

My timings are:

~ 0.3s for fp32
~ 4s for int8

Disclaimer:
I built onnxruntime-gpu for ppc64le (Power9 with an NVIDIA V100) using the following parameters:

# CUDA 12.4
# 70;75 covers P9 V100 & T4 GPUs (sm_70, sm_75)

./build.sh --config Release --use_cuda --cuda_home=/usr/local/cuda --cudnn_home=/usr/local/cuda --skip_submodule_sync --parallel --build_shared_lib --build_wheel --update --build --allow_running_as_root --cmake_extra_defines="CMAKE_CUDA_ARCHITECTURES=70;75"

Is there a build flag needed to enable int8 support? I suspect it might be missing from my build.
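For reference, a quick sanity check along these lines (not part of the timings above) can confirm whether the self-built wheel exposes the CUDA EP at all, and whether a session actually ends up using it rather than silently falling back to CPU:

import onnxruntime as ort

# Providers compiled into this wheel; CUDAExecutionProvider should be listed
print(ort.get_available_providers())

# Providers the session actually uses after graph partitioning
sess = ort.InferenceSession("/tmp/models/fcn-resnet50-12-int8.onnx",
                            providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
print(sess.get_providers())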

To reproduce

Minimal onnxruntime CUDA example (fp32 vs int8)

Get the fp32 and int8 models for comparison

mkdir -p /tmp/models

#fp32
wget https://github.com/onnx/models/raw/main/validated/vision/object_detection_segmentation/fcn/model/fcn-resnet50-12.onnx -O /tmp/models/fcn-resnet50-12.onnx

#int8
wget https://github.com/onnx/models/raw/main/validated/vision/object_detection_segmentation/fcn/model/fcn-resnet50-12-int8.onnx -O /tmp/models/fcn-resnet50-12-int8.onnx

#demo.jpg
wget https://github.com/onnx/models/raw/main/validated/vision/object_detection_segmentation/faster-rcnn/dependencies/demo.jpg -O /tmp/models/demo.jpg
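Optionally, the int8 download can be checked to confirm it really contains quantized operators (a quick sketch using the onnx package; the op-type filtering is just illustrative):

import onnx
from collections import Counter

m = onnx.load("/tmp/models/fcn-resnet50-12-int8.onnx")
ops = Counter(node.op_type for node in m.graph.node)
# A QOperator-format int8 model should show QLinearConv / QuantizeLinear / DequantizeLinear here
print({op: n for op, n in ops.items() if "QLinear" in op or "Quant" in op})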

Run inference

import onnxruntime as ort
import onnx
import time
from PIL import Image
import numpy as np

def preprocess_image(path):
    # Resize to the model's expected 1200x1200 input, convert HWC -> NCHW,
    # and normalize with the ImageNet mean/stddev.
    image = Image.open(path)
    img = image.resize((1200, 1200), Image.Resampling.BILINEAR)
    img_data = np.array(img)
    img_data = np.transpose(img_data, [2, 0, 1])
    img_data = np.expand_dims(img_data, 0)
    mean_vec = np.array([0.485, 0.456, 0.406])
    stddev_vec = np.array([0.229, 0.224, 0.225])
    norm_img_data = np.zeros(img_data.shape).astype('float32')
    for i in range(img_data.shape[1]):
        norm_img_data[:, i, :, :] = (img_data[:, i, :, :] / 255 - mean_vec[i]) / stddev_vec[i]
    return norm_img_data


img_data = preprocess_image("/tmp/models/demo.jpg")
providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']

#fp32
model = onnx.load("/tmp/models/fcn-resnet50-12.onnx")
session = ort.InferenceSession(model.SerializeToString(), providers=providers)
ort_inputs = {session.get_inputs()[0].name: img_data}

st = time.time()
preds = session.run(None, ort_inputs)
ed = time.time()

print(f"Time taken for fp32: {ed-st:.6f} seconds")


#int8
model = onnx.load("/tmp/models/fcn-resnet50-12-int8.onnx")
session = ort.InferenceSession(model.SerializeToString(), providers=providers)
ort_inputs = {session.get_inputs()[0].name: img_data}

st = time.time()
preds = session.run(None, ort_inputs)
ed = time.time()
print(f"Time taken for int8: {ed-st:.6f} seconds")

Urgency

No response

Platform

Linux

OS Version

AlmaLinux 8.10

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.22

ONNX Runtime API

Python

Architecture

IBM Power

Execution Provider

CUDA

Execution Provider Library Version

CUDA 12.4

Model File

No response

Is this a quantized model?

Yes

Labels

ep:CUDA (issues related to the CUDA execution provider), performance (issues related to performance regressions)
