# Accelerated inference on NVIDIA GPUs

By default, ONNX Runtime runs inference on CPU devices. However, it is possible to place supported operations on an NVIDIA GPU while leaving any unsupported ones on CPU. In most cases, this allows costly operations to be placed on the GPU and significantly accelerates inference.

This guide will show you how to run inference on two execution providers that ONNX Runtime supports for NVIDIA GPUs:

<ul><li>CUDAExecutionProvider: Generic acceleration on NVIDIA CUDA-enabled GPUs.</li>
<li>TensorrtExecutionProvider: Uses NVIDIA’s TensorRT inference engine and generally provides the best runtime performance.</li></ul>

## CUDAExecutionProvider

### CUDA installation


Provided the CUDA and cuDNN requirements are satisfied, install the additional dependencies by running pip install optimum[onnxruntime-gpu].

To avoid conflicts between onnxruntime and onnxruntime-gpu, make sure the package onnxruntime is not installed by running pip uninstall onnxruntime prior to installing Optimum.
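
A quick way to check for this conflict from Python is to look up both package names in the current environment. The snippet below is a minimal sketch using only the standard library; the helper names `installed_version` and `find_conflict` are hypothetical and not part of Optimum or ONNX Runtime.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_conflict(cpu_version, gpu_version):
    """Describe the install state given the two package versions (or None)."""
    if cpu_version and gpu_version:
        return "conflict: both onnxruntime and onnxruntime-gpu are installed"
    if gpu_version:
        return "ok: only onnxruntime-gpu is installed"
    if cpu_version:
        return "cpu-only: onnxruntime is installed without onnxruntime-gpu"
    return "missing: no ONNX Runtime package is installed"


# Inspect the current environment and report what was found.
print(find_conflict(installed_version("onnxruntime"),
                    installed_version("onnxruntime-gpu")))
```

If the report starts with "conflict", uninstalling onnxruntime as described above is the likely fix.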

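Before going further, run the following sample code to check whether the installation was successful:

In [1]: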
import numpy as np
import tensorrt
import torch

In [2]:
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer



In [3]:
tokenizer = AutoTokenizer.from_pretrained("philschmid/tiny-bert-sst2-distilled")
inputs = tokenizer("expectations were low, actual enjoyment was high", return_tensors="pt", padding=True)


In [5]:
ort_model = ORTModelForSequenceClassification.from_pretrained(
  "philschmid/tiny-bert-sst2-distilled",
  export=True,
  provider="CUDAExecutionProvider",
)

Framework not specified. Using pt to export to ONNX.

In [6]:
outputs = ort_model(**inputs)
assert ort_model.providers == ["CUDAExecutionProvider", "CPUExecutionProvider"]

If this code runs without errors, congratulations, the installation is successful! If it fails instead, for example with an error reporting that CUDAExecutionProvider is not among the available execution providers, then something is wrong with the CUDA or ONNX Runtime installation.
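
One way to narrow down such a failure is to ask ONNX Runtime which execution providers the installed build exposes. The following is a minimal sketch: `onnxruntime.get_available_providers()` is the real ONNX Runtime call, while `available_providers` and `diagnose_providers` are hypothetical helpers written for this illustration.

```python
def available_providers():
    """Return the providers exposed by the installed ONNX Runtime, or [] if it is not importable."""
    try:
        import onnxruntime
    except ImportError:
        return []
    return onnxruntime.get_available_providers()


def diagnose_providers(available, wanted="CUDAExecutionProvider"):
    """Summarize whether the wanted provider is among the available ones."""
    if wanted in available:
        return f"{wanted} is available"
    return f"{wanted} is missing; available providers: {available}"


print(diagnose_providers(available_providers()))
```

Note that a provider being listed only means the installed build includes it. Session creation can still fail if the CUDA or cuDNN libraries on the machine do not match what the build expects.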

### Use CUDA execution provider with floating-point models

For non-quantized models, usage is straightforward: specify the provider argument in the ORTModel.from_pretrained() method. Here’s an example:

In [8]:
from optimum.onnxruntime import ORTModelForSequenceClassification

In [9]:
ort_model = ORTModelForSequenceClassification.from_pretrained(
  "distilbert-base-uncased-finetuned-sst-2-english",
  export=True,
  provider="CUDAExecutionProvider",
)

Framework not specified. Using pt to export to ONNX.

The model can then be used with the common 🤗 Transformers API for inference and evaluation, such as pipelines. When using a Transformers pipeline, set the device argument so that pre- and post-processing also run on the GPU, as in the example below:

In [10]:
from optimum.pipelines import pipeline
from transformers import AutoTokenizer

In [11]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")


In [12]:
pipe = pipeline(task="text-classification", model=ort_model, tokenizer=tokenizer, device="cuda:0")
result = pipe("Both the music and visual were astounding, not to mention the actors performance.")

2024-01-24 11:54:57.390759710 [W:onnxruntime:, session_state.cc:1162 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-01-24 11:54:57.390771289 [W:onnxruntime:, session_state.cc:1164 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.


In [13]:
print(result)

[{'label': 'POSITIVE', 'score': 0.9997727274894714}]


Additionally, you can pass the session option log_severity_level = 0 (verbose) to check whether all nodes are indeed placed on the CUDA execution provider:

In [14]:
import onnxruntime

In [15]:
session_options = onnxruntime.SessionOptions()
session_options.log_severity_level = 0

In [16]:
ort_model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    export=True,
    provider="CUDAExecutionProvider",
    session_options=session_options
)

Framework not specified. Using pt to export to ONNX.

## Reduce memory footprint with IOBinding

IOBinding is an efficient way to avoid expensive data copying when using GPUs. By default, ONNX Runtime copies inputs from the CPU (even if the tensors are already on the target device) and assumes that outputs also need to be copied back to the CPU after the run. These copies between host and device are expensive and can lead to worse inference latency than vanilla PyTorch, especially during decoding.

To avoid this slowdown, 🤗 Optimum uses IOBinding to copy inputs onto the GPU and pre-allocate memory for outputs prior to inference. When instantiating the ORTModel, set the use_io_binding argument to choose whether IOBinding is used during inference. use_io_binding defaults to True if you choose CUDA as the execution provider.
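
To get a feel for why these copies matter during decoding, the sketch below estimates how many bytes one decoding step would hand back to the host if every output were copied. It is a rough back-of-the-envelope calculation, not a measurement: the function name is hypothetical, the defaults are assumed to mirror the t5-small configuration (6 decoder layers, 8 heads, head size 64, vocabulary 32128), outputs are assumed to be float32, and only the logits and the self-attention key/value cache are counted.

```python
def decoder_step_output_bytes(batch_size, decoded_len, num_layers=6, num_heads=8,
                              head_dim=64, vocab_size=32128, bytes_per_value=4):
    """Rough size of one decoding step's outputs: logits plus self-attention key/value cache."""
    # Logits for the single newly generated position.
    logits = batch_size * 1 * vocab_size
    # One key tensor and one value tensor per decoder layer, covering all decoded positions.
    cache = num_layers * 2 * batch_size * num_heads * decoded_len * head_dim
    return (logits + cache) * bytes_per_value


# The cache grows with every generated token, so the per-step copy grows too.
for step in (1, 32, 128):
    size = decoder_step_output_bytes(batch_size=1, decoded_len=step)
    print(f"step {step:>3}: {size / 1024:.0f} KiB")
```

Under these assumptions the amount copied per step grows linearly with the number of generated tokens, which is consistent with the slowdown being most visible in the decoding loop.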

In [17]:
# To turn off IOBinding, pass use_io_binding=False when loading the model:
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM

In [18]:
# Load the model from the hub and export it to the ONNX format
model = ORTModelForSeq2SeqLM.from_pretrained("t5-small", export=True, use_io_binding=False)
tokenizer = AutoTokenizer.from_pretrained("t5-small")

Framework not specified. Using pt to export to ONNX.

In [19]:
# Create a pipeline
onnx_translation = pipeline("translation_en_to_fr", model=model, tokenizer=tokenizer, device="cuda:0")

2024-01-24 12:00:06.835635247 [W:onnxruntime:, session_state.cc:1162 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-01-24 12:00:06.835649568 [W:onnxruntime:, session_state.cc:1164 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.

For the time being, IOBinding is supported for task-defined ORT models. If you want us to add support for custom models, file an issue on the Optimum repository.