[Performance] Slow inference on CUDA V100 int8 #24807

@mgiessing

Description

Describe the issue

I'm seeing poor performance with int8 compared to fp32 on the CUDAExecutionProvider (see below).

My timings are:

~ 0.3s for fp32
~ 4s for int8

Disclaimer:
I built onnxruntime-gpu for ppc64le (Power9 with an NVIDIA V100) using the following parameters:

# CUDA 12.4
# 70;75 covers P9 V100 & T4 GPUs (sm_70, sm_75)

./build.sh --config Release --use_cuda --cuda_home=/usr/local/cuda --cudnn_home=/usr/local/cuda --skip_submodule_sync --parallel --build_shared_lib --build_wheel --update --build --allow_running_as_root --cmake_extra_defines="CMAKE_CUDA_ARCHITECTURES=70;75"

Is there a build flag needed to enable int8 support? I suspect it might be missing from my build.
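For reference, a quick sanity check along these lines (not part of the timings above) can confirm whether the self-built wheel exposes the CUDA EP at all, and whether a session actually ends up using it rather than silently falling back to CPU:

import onnxruntime as ort

# Providers compiled into this wheel; CUDAExecutionProvider should be listed
print(ort.get_available_providers())

# Providers the session actually uses after graph partitioning
sess = ort.InferenceSession("/tmp/models/fcn-resnet50-12-int8.onnx",
                            providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
print(sess.get_providers())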

To reproduce

Minimal onnxruntime CUDA example (fp32 vs int8)

Get the fp32 and int8 models for comparison

mkdir -p /tmp/models

#fp32
wget https://github.com/onnx/models/raw/main/validated/vision/object_detection_segmentation/fcn/model/fcn-resnet50-12.onnx -O /tmp/models/fcn-resnet50-12.onnx

#int8
wget https://github.com/onnx/models/raw/main/validated/vision/object_detection_segmentation/fcn/model/fcn-resnet50-12-int8.onnx -O /tmp/models/fcn-resnet50-12-int8.onnx

#demo.jpg
wget https://github.com/onnx/models/raw/main/validated/vision/object_detection_segmentation/faster-rcnn/dependencies/demo.jpg -O /tmp/models/demo.jpg
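Optionally, the int8 download can be checked to confirm it really contains quantized operators (a quick sketch using the onnx package; the op-type filtering is just illustrative):

import onnx
from collections import Counter

m = onnx.load("/tmp/models/fcn-resnet50-12-int8.onnx")
ops = Counter(node.op_type for node in m.graph.node)
# A QOperator-format int8 model should show QLinearConv / QuantizeLinear / DequantizeLinear here
print({op: n for op, n in ops.items() if "QLinear" in op or "Quant" in op})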

Run inference

import onnxruntime as ort
import onnx
import time
from PIL import Image
import numpy as np

def preprocess_image(path):
    # Resize to the model's expected 1200x1200 input, convert HWC -> NCHW,
    # and normalize with the ImageNet mean/stddev.
    image = Image.open(path)
    img = image.resize((1200, 1200), Image.Resampling.BILINEAR)
    img_data = np.array(img)
    img_data = np.transpose(img_data, [2, 0, 1])
    img_data = np.expand_dims(img_data, 0)
    mean_vec = np.array([0.485, 0.456, 0.406])
    stddev_vec = np.array([0.229, 0.224, 0.225])
    norm_img_data = np.zeros(img_data.shape).astype('float32')
    for i in range(img_data.shape[1]):
        norm_img_data[:, i, :, :] = (img_data[:, i, :, :] / 255 - mean_vec[i]) / stddev_vec[i]
    return norm_img_data


img_data = preprocess_image("/tmp/models/demo.jpg")
providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']

#fp32
model = onnx.load("/tmp/models/fcn-resnet50-12.onnx")
session = ort.InferenceSession(model.SerializeToString(), providers=providers)
ort_inputs = {session.get_inputs()[0].name: img_data}

st = time.time()
preds = session.run(None, ort_inputs)
ed = time.time()

print(f"Time taken for fp32: {ed-st:.6f} seconds")


#int8
model = onnx.load("/tmp/models/fcn-resnet50-12-int8.onnx")
session = ort.InferenceSession(model.SerializeToString(), providers=providers)
ort_inputs = {session.get_inputs()[0].name: img_data}

st = time.time()
preds = session.run(None, ort_inputs)
ed = time.time()
print(f"Time taken for int8: {ed-st:.6f} seconds")

Urgency

No response

Platform

Linux

OS Version

AlmaLinux 8.10

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.22

ONNX Runtime API

Python

Architecture

IBM Power

Execution Provider

CUDA

Execution Provider Library Version

CUDA 12.4

Model File

No response

Is this a quantized model?

Yes

Labels

ep:CUDA (issues related to the CUDA execution provider), performance (issues related to performance regressions)
