# End-to-End tutorial on accelerating RoBERTa for Question-Answering including quantization and optimization
From https://github.com/huggingface/blog/blob/main/optimum-inference.md

In this End-to-End tutorial on accelerating RoBERTa for question-answering, you will learn how to:

1. Install Optimum for ONNX Runtime
2. Convert a Hugging Face Transformers model to ONNX for inference
3. Use the ORTOptimizer to optimize the model
4. Use the ORTQuantizer to apply dynamic quantization
5. Run accelerated inference using Transformers pipelines
6. Evaluate the performance and speed
   
Let’s get started 🚀

This tutorial was created and run on an m5.xlarge AWS EC2 Instance and also works on the SUTD cluster.

---

### 1. Transformer Model
### 2. ONNX Model
### 3. Optimized ONNX Model
### 4. Quantized Optimized ONNX Model

---

## Install Optimum for ONNX Runtime
Our first step is to install Optimum with the ONNX Runtime utilities.


In [12]:
# !pip install optimum[onnxruntime]


## Convert a Hugging Face Transformers model to ONNX for inference
Before we can start optimizing, we need to convert our vanilla Transformers model to the ONNX format. The model we are using is deepset/roberta-base-squad2, a RoBERTa model fine-tuned on the SQuAD2 question-answering dataset.

In [None]:
from pathlib import Path # Library for os-independent file paths
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForQuestionAnswering


model_id = "deepset/roberta-base-squad2" # A fine-tuned RoBERTa model trained for question-answering on the SQuAD2 dataset
onnx_path = Path("onnx") # Path to save the onnx model
task = "question-answering" # The task for the pipeline


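# Loads the original RoBERTa model from Hugging Face and automatically converts it to ONNX.
# `export=True` replaces the deprecated `from_transformers=True`, which triggered the
# deprecation warning shown in the recorded output below.
model = ORTModelForQuestionAnswering.from_pretrained(model_id, export=True)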
# Loads the pre-trained tokenizer for RoBERTa.
tokenizer = AutoTokenizer.from_pretrained(model_id)


# Saves the exported ONNX model (not yet optimized)
model.save_pretrained(onnx_path)
# Saves the tokenizer files
tokenizer.save_pretrained(onnx_path)


# Creates a Hugging Face pipeline for question-answering
# Uses the exported (not yet optimized) ONNX model
optimum_qa = pipeline(task, model=model, tokenizer=tokenizer, handle_impossible_answer=True)


prediction = optimum_qa(question="What's my name?", context="My name is Philipp and I live in Nuremberg.")
print(prediction)
# {'score': 0.9041663408279419, 'start': 11, 'end': 18, 'answer': 'Philipp'}

The argument `from_transformers` is deprecated, and will be removed in optimum 2.0.  Use `export` instead
Device set to use mps:0


{'score': 0.904166042804718, 'start': 11, 'end': 18, 'answer': 'Philipp'}


We successfully converted our vanilla Transformers model to ONNX and ran a first prediction with it through transformers.pipelines. Now let's optimize it.

## Use the ORTOptimizer to optimize the model
After saving our ONNX checkpoint to onnx/, we can use the ORTOptimizer to apply graph optimizations, such as operator fusion and constant folding, to reduce latency and speed up inference.

### Optimization

1. Constant Folding: Precomputes constant expressions during model conversion.
For example: If the model has 3 + 5, it replaces it with 8, reducing runtime computation.

2. Operator Fusion: Merges multiple operations into one to reduce computation overhead.
Example: Instead of MatMul → Add → Activation, it creates a single FusedLayer.

3. Graph Pruning: Removes unused or redundant nodes to make inference faster.

4. Memory Optimization: Rearranges tensor allocations to reduce memory usage.

5. Hardware-Specific Optimizations: Uses CPU/GPU-specific instructions (e.g., AVX, Tensor Cores) for speedup.
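To make constant folding concrete, here is a minimal sketch on a toy expression tree. It is not how ONNX Runtime implements the pass; the `Const`, `Var`, and `Add` node types and the `fold_constants` helper are hypothetical names made up for this illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Node"
    right: "Node"


Node = Union[Const, Var, Add]


def fold_constants(node: Node) -> Node:
    """Replace every Add whose inputs are both constants with a single Const."""
    if isinstance(node, Add):
        left = fold_constants(node.left)
        right = fold_constants(node.right)
        if isinstance(left, Const) and isinstance(right, Const):
            return Const(left.value + right.value)
        return Add(left, right)
    return node


# x + (3 + 5): the constant subexpression can be computed ahead of time
graph = Add(Var("x"), Add(Const(3), Const(5)))
folded = fold_constants(graph)
print(folded)
# Add(left=Var(name='x'), right=Const(value=8))
```

The folded graph has one `Add` node instead of two, so the addition `3 + 5` is never executed at inference time.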

### Why Isn't Optimization Applied from the Beginning?

The main reasons involve trade-offs between flexibility, generalization, hardware compatibility, and debugging ease.

For example:
- Inference only needs forward passes, so operations can be fused.
- Unoptimized models keep each operation separate and readable, which makes debugging easier.
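The fusion point can be illustrated with a toy dense layer in plain Python. The unfused version runs MatMul, Add, and ReLU as three separate passes that each build an intermediate list, which is easy to inspect step by step; the fused version computes the same result in a single pass. All function names are hypothetical, and the code only mimics the idea rather than ONNX Runtime's actual kernels.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # one row of weights per output unit


def matmul(weights: Matrix, x: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def add(a: Vector, b: Vector) -> Vector:
    return [p + q for p, q in zip(a, b)]


def relu(a: Vector) -> Vector:
    return [max(0.0, v) for v in a]


def dense_unfused(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    # Three separate operators, two intermediate vectors
    return relu(add(matmul(weights, x), bias))


def dense_fused(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    # One pass per output unit, no intermediate vectors
    return [
        max(0.0, sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


weights = [[1.0, -2.0], [0.5, 0.5]]
bias = [0.5, -1.0]
x = [2.0, 3.0]

print(dense_unfused(weights, bias, x))  # [0.0, 1.5]
print(dense_fused(weights, bias, x))    # [0.0, 1.5]
```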

In [None]:
from optimum.onnxruntime import ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig


# Loads the previously saved ONNX model from the onnx/ directory.
optimizer = ORTOptimizer.from_pretrained(onnx_path)


# Defines how aggressively the model should be optimized.
# The optimization level 99 applies all possible optimizations.
optimization_config = OptimizationConfig(optimization_level=99)


# Apply the optimization configuration to the model
optimizer.optimize(save_dir=onnx_path, optimization_config=optimization_config)

2025-02-20 14:07:47.393799 [W:onnxruntime:, inference_session.cc:2048 Initialize] Serializing optimized model with Graph Optimization level greater than ORT_ENABLE_EXTENDED and the NchwcTransformer enabled. The generated model may contain hardware specific optimizations, and should only be used in the same environment the model was optimized in.


PosixPath('onnx')

To test performance we can use the ORTModelForQuestionAnswering class again and provide an additional file_name parameter to load our optimized model. 

In [15]:
from optimum.onnxruntime import ORTModelForQuestionAnswering

# load optimized model
opt_model = ORTModelForQuestionAnswering.from_pretrained(onnx_path, file_name="model_optimized.onnx")

# test the optimized model using the transformers pipeline
opt_optimum_qa = pipeline(task, model=opt_model, tokenizer=tokenizer, handle_impossible_answer=True)
prediction = opt_optimum_qa(question="What's my name?", context="My name is Philipp and I live in Nuremberg.")
print(prediction)
# {'score': 0.9041661620140076, 'start': 11, 'end': 18, 'answer': 'Philipp'}

Device set to use mps:0


{'score': 0.9041661620140076, 'start': 11, 'end': 18, 'answer': 'Philipp'}


## Use the ORTQuantizer to apply dynamic quantization
Another option to reduce model size and accelerate inference is to quantize the model using the ORTQuantizer.
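As a rough intuition for what 8-bit quantization does to the weights, the sketch below maps a list of float values to signed 8-bit integers with a single scale factor and then maps them back. It is a simplified symmetric, per-tensor scheme written for illustration only; the helper names are hypothetical, and the real quantizer is more involved (the configuration used in this tutorial, for instance, requests per-channel scales).

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.05, 0.9]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

print(q_weights)  # [42, -127, 5, 90]
# Largest rounding error introduced by the round trip
print(max(abs(w - r) for w, r in zip(weights, restored)))
```

Each weight is stored in one byte instead of the four bytes of a float32 value, which is in line with the roughly 4x smaller model file measured in the size comparison in this tutorial. The price is the small rounding error printed on the last line.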

In [None]:
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig


# Load the ONNX Model for Quantization
quantizer = ORTQuantizer.from_pretrained(onnx_path, file_name="model.onnx")
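# Define the quantization configuration: dynamic quantization (is_static=False)
# with per-channel scales, using the preset for CPUs with AVX512-VNNI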
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=True)


# Quantize it!
quantizer.quantize(save_dir=onnx_path, quantization_config=qconfig)

PosixPath('onnx')

We can now compare the size of the quantized model with the original and, later, measure its latency.



In [17]:
import os
# get model file size
size = os.path.getsize(onnx_path / "model.onnx")/(1024*1024)
print(f"Vanilla Onnx Model file size: {size:.2f} MB")
size = os.path.getsize(onnx_path / "model_quantized.onnx")/(1024*1024)
print(f"Quantized Onnx Model file size: {size:.2f} MB")

# Vanilla Onnx Model file size: 473.51 MB
# Quantized Onnx Model file size: 119.15 MB

Vanilla Onnx Model file size: 473.54 MB
Quantized Onnx Model file size: 119.61 MB


## Run accelerated inference using pipelines

Optimum has built-in support for transformers pipelines. This allows us to leverage the same API that we know from using PyTorch and TensorFlow models.

In [18]:
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForQuestionAnswering

quant_model = ORTModelForQuestionAnswering.from_pretrained(onnx_path, file_name="model_quantized.onnx")

quantized_optimum_qa = pipeline("question-answering", model=quant_model, tokenizer=tokenizer)
prediction = quantized_optimum_qa(question="What's my name?", context="My name is Philipp and I live in Nuremberg.")

print(prediction)
# {'score': 0.806605339050293, 'start': 11, 'end': 18, 'answer': 'Philipp'}

Device set to use mps:0


{'score': 0.8086288571357727, 'start': 11, 'end': 18, 'answer': 'Philipp'}


In addition, Optimum has its own pipelines API, which guarantees more safety for your accelerated models.

In [19]:
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForQuestionAnswering
from optimum.pipelines import pipeline

# The tokenizer files were saved next to the model; the tokenizer takes no file_name argument
tokenizer = AutoTokenizer.from_pretrained(onnx_path)
quant_model = ORTModelForQuestionAnswering.from_pretrained(onnx_path, file_name="model_quantized.onnx")

quantized_optimum_qa = pipeline("question-answering", model=quant_model, tokenizer=tokenizer, handle_impossible_answer=True)
prediction = quantized_optimum_qa(question="What's my name?", context="My name is Philipp and I live in Nuremberg.")

print(prediction)
# {'score': 0.806605339050293, 'start': 11, 'end': 18, 'answer': 'Philipp'}

Device set to use mps:0


{'score': 0.8086288571357727, 'start': 11, 'end': 18, 'answer': 'Philipp'}


## Evaluate performance and speed
As the last step, we want to take a detailed look at the performance and accuracy of our model. Optimization techniques such as graph optimization and quantization affect not only performance (latency) but may also affect the accuracy of the model, so accelerating your model comes with a trade-off.

Let's evaluate our models. Our transformers model deepset/roberta-base-squad2 was fine-tuned on the SQUAD2 dataset. This will be the dataset we use to evaluate our models.
To save time, we only load 10% of the dataset.
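For orientation, the two numbers reported by the `squad_v2` metric used in the cells below are exact match and token-level F1. The sketch below is a simplified version for a single prediction and a single reference answer. The official metric additionally strips punctuation and articles, takes the best score over several reference answers, and handles unanswerable questions, so treat the helper names and the reduced normalization as assumptions.

```python
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase and split on whitespace (simplified normalization)."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    # Multiset intersection counts the tokens shared by both answers
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Technical Lead", "technical lead"))
print(f1_score("a Technical Lead at Hugging Face", "Technical Lead"))
```

The second call shows why F1 is usually higher than exact match: a prediction that contains the reference answer plus extra words scores 0 on exact match but still earns partial credit (here about 0.5, from a precision of 2/6 and a recall of 2/2).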

In [20]:
from datasets import load_dataset

import evaluate
metric = evaluate.load("squad_v2")

# load 10% of the data to save time
dataset = load_dataset("squad_v2", split="validation[:10%]")

print(f"length of dataset {len(dataset)}")
#length of dataset 1187

length of dataset 1187


In [21]:
# Named evaluate_example so that it does not shadow the imported `evaluate` module
def evaluate_example(example):
  default = optimum_qa(question=example["question"], context=example["context"])
  optimized = opt_optimum_qa(question=example["question"], context=example["context"])
  quantized = quantized_optimum_qa(question=example["question"], context=example["context"])
  return {
      'reference': {'id': example['id'], 'answers': example['answers']},
      'default': {'id': example['id'],'prediction_text': default['answer'], 'no_answer_probability': 0.},
      'optimized': {'id': example['id'],'prediction_text': optimized['answer'], 'no_answer_probability': 0.},
      'quantized': {'id': example['id'],'prediction_text': quantized['answer'], 'no_answer_probability': 0.},
      }

result = dataset.map(evaluate_example)

Map: 100%|██████████| 1187/1187 [05:19<00:00,  3.72 examples/s]


Now let's compare the results.



In [22]:
default_acc = metric.compute(predictions=result["default"], references=result["reference"])
optimized = metric.compute(predictions=result["optimized"], references=result["reference"])
quantized = metric.compute(predictions=result["quantized"], references=result["reference"])

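print(f"vanilla model: exact={default_acc['exact']}% f1={default_acc['f1']}%")
# Note: the recorded run below printed quantized['exact'] on the "optimized" line,
# so the exact-match value shown there is the quantized model's score.
print(f"optimized model: exact={optimized['exact']}% f1={optimized['f1']}%")
print(f"quantized model: exact={quantized['exact']}% f1={quantized['f1']}%")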

# vanilla model: exact=81.12889637742207% f1=83.27089343306695%
# quantized model: exact=80.6234203875316% f1=82.86541222514259%

vanilla model: exact=81.12889637742207% f1=83.27089343306695%
optimized model: exact=80.45492839090143% f1=83.27089343306695%
quantized model: exact=80.45492839090143% f1=82.63289411141132%


The quantized model achieved an exact match of 80.45% and an F1 score of 82.63%, which is about 99% of the original model's scores.
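The 99% figure is simply the ratio of the quantized scores to the vanilla scores. A quick check, using the values from the recorded output above rounded to two decimals:

```python
# Scores copied from the recorded output above, rounded to two decimals
vanilla = {"exact": 81.13, "f1": 83.27}
quantized_scores = {"exact": 80.45, "f1": 82.63}

# Share of each vanilla score that the quantized model retains, in percent
retained = {
    name: 100 * quantized_scores[name] / vanilla[name] for name in vanilla
}

for name, pct in retained.items():
    print(f"{name}: quantized model retains {pct:.1f}% of the vanilla score")
# exact: quantized model retains 99.2% of the vanilla score
# f1: quantized model retains 99.2% of the vanilla score
```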

Okay, let's test the performance (latency) of our optimized and quantized model.

But first, let’s extend our context and question to a more meaningful sequence length of 128.

In [23]:
context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value."
question="As what is Philipp working?"

To keep it simple, we are going to use a python loop and calculate the avg/mean latency for our vanilla model and for the optimized and quantized model.



In [24]:
from time import perf_counter
import numpy as np

def measure_latency(pipe):
    latencies = []
    # warm up
    for _ in range(10):
        _ = pipe(question=question, context=context)
    # Timed run
    for _ in range(100):
        start_time = perf_counter()
        _ =  pipe(question=question, context=context)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    return f"Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f}"

print(f"Vanilla model {measure_latency(optimum_qa)}")
print(f"Optimized model {measure_latency(opt_optimum_qa)}")
print(f"Quantized model {measure_latency(quantized_optimum_qa)}")

# Vanilla model Average latency (ms) 102
# Optimized model Average latency (ms) 101
# Quantized model Average latency (ms) 46

Vanilla model Average latency (ms) - 86.95 +/- 7.75
Optimized model Average latency (ms) - 85.69 +/- 7.62
Quantized model Average latency (ms) - 70.74 +/- 2.71


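In this run, quantization reduced the average latency from 86.95 ms to 70.74 ms, or by about 19%, while keeping 99% of the accuracy. The reference values in the code comments (102 ms to 46 ms, a reduction of about 55%) come from a different run, so expect the gain to vary with the environment.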
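The relative speedup follows directly from two mean latencies. A small helper (the name is hypothetical) applied to the numbers that appear in this tutorial:

```python
def latency_reduction_pct(baseline_ms: float, accelerated_ms: float) -> float:
    """Percentage by which latency dropped relative to the baseline."""
    return 100 * (baseline_ms - accelerated_ms) / baseline_ms


# Mean latencies from the recorded output of this run
print(f"{latency_reduction_pct(86.95, 70.74):.1f}%")  # 18.6%
# Mean latencies quoted in the code comments
print(f"{latency_reduction_pct(102, 46):.1f}%")       # 54.9%
```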