## Apply optimizations to ONNX model

Now that we have an ONNX model, we can apply some basic optimizations. After completing this section, you should be able to apply:

-   graph optimizations, e.g. fusing operations
-   post-training quantization (dynamic and static)
-   and hardware-specific execution providers

to improve inference performance.
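
As a small illustration of the third item, choosing an execution provider amounts to passing a different `providers` list when the session is created. The sketch below is only an assumption about how one might pick that list; `pick_providers` and the preference order are hypothetical names and choices, while `ort.get_available_providers()` is the ONNX Runtime call that reports what the installed build supports.

```python
def pick_providers(preferred, available):
    """Return the preferred providers that are available, in preferred order."""
    chosen = [p for p in preferred if p in available]
    # Keep a CPU fallback at the end of the list.
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen

try:
    import onnxruntime as ort
    available = ort.get_available_providers()
except ImportError:
    # Assumption for illustration only: pretend just the CPU provider exists.
    available = ["CPUExecutionProvider"]

# Hypothetical preference order; adjust to the hardware actually present.
preferred = ["CUDAExecutionProvider", "OpenVINOExecutionProvider", "CPUExecutionProvider"]
providers = pick_providers(preferred, available)
print(providers)
```

The resulting list can be passed as the `providers=` argument of `ort.InferenceSession`, in the same way the cells in this section pass `['CPUExecutionProvider']`.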

You will execute this notebook *in a Jupyter container running on a compute instance*, not on the general-purpose Chameleon Jupyter environment from which you provision resources.

Since we are going to evaluate several models, we’ll define a benchmark function here to help us compare them:

In [1]:
import os
import time
import numpy as np
import torch
import onnx
import onnxruntime as ort
from torch.utils.data import Dataset, DataLoader

In [2]:
from utilities import pad_or_truncate

class SequentialEvalDataset(Dataset):
    def __init__(self, filepath, seq_max_len=100):
        self.user_sequences = {}
        with open(filepath, "r") as f:
            for line in f:
                uid, iid = map(int, line.strip().split("\t"))
                self.user_sequences.setdefault(uid, []).append(iid)

        self.samples = []
        for uid, seq in self.user_sequences.items():
            if len(seq) < 2:
                continue
            self.samples.append((uid, seq[:-1], seq[-1]))  # label is optional (unused below)

        self.seq_max_len = seq_max_len

        print(f"Loaded {len(self.samples)} valid user sequences from {filepath}")

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        uid, seq, _ = self.samples[idx]
        return (
            torch.tensor(uid, dtype=torch.long),
            torch.tensor(pad_or_truncate(seq, self.seq_max_len), dtype=torch.long)
        )

dataset = SequentialEvalDataset("/mnt/data/evaluation/movielens_192m_eval.txt", seq_max_len=100)
loader = DataLoader(dataset, batch_size=64, shuffle=False)


Loaded 1374159 valid user sequences from /mnt/data/evaluation/movielens_192m_eval.txt
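
The `pad_or_truncate` helper is imported from a `utilities` module that is not shown here. The following is only a guess at what such a helper might do, assuming zero is the padding value, padding goes on the left, and the most recent items are kept when a sequence is too long. The name `pad_or_truncate_sketch` is hypothetical, and the real helper may behave differently.

```python
def pad_or_truncate_sketch(seq, max_len, pad_value=0):
    """Return a list of exactly max_len items.

    Assumptions (not confirmed by the source): keep the most recent
    items when truncating, and left-pad shorter sequences with pad_value.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    recent = list(seq)[-max_len:]
    return [pad_value] * (max_len - len(recent)) + recent

print(pad_or_truncate_sketch([1, 2, 3], 5))           # short sequence is padded
print(pad_or_truncate_sketch([1, 2, 3, 4, 5, 6], 4))  # long sequence is truncated
```

Whatever the exact padding convention, the point is that every sample has the same length, which is what lets the `DataLoader` stack sequences into a `(batch, seq_max_len)` tensor.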


In [3]:
def benchmark_session(ort_session):
    print(f"Execution provider: {ort_session.get_providers()}")

    user_input_name = ort_session.get_inputs()[0].name
    seq_input_name = ort_session.get_inputs()[1].name

    # Single sample latency
    user_tensor, seq_tensor = dataset[0]
    u = user_tensor.unsqueeze(0).numpy()
    s = seq_tensor.unsqueeze(0).numpy()

    ort_session.run(None, {user_input_name: u, seq_input_name: s})  # warmup

    latencies = []
    for _ in range(100):
        start = time.perf_counter()  # monotonic, high-resolution timer
        ort_session.run(None, {user_input_name: u, seq_input_name: s})
        latencies.append(time.perf_counter() - start)

    print(f"Inference Latency (median): {np.percentile(latencies, 50)*1000:.2f} ms")
    print(f"Inference Latency (95th): {np.percentile(latencies, 95)*1000:.2f} ms")
    print(f"Inference Latency (99th): {np.percentile(latencies, 99)*1000:.2f} ms")
    print(f"Inference Throughput (single sample): {100/np.sum(latencies):.2f} FPS")

    # Batch throughput
    user_tensor, seq_tensor = next(iter(loader))
    u = user_tensor.numpy()
    s = seq_tensor.numpy()

    ort_session.run(None, {user_input_name: u, seq_input_name: s})  # warmup

    batch_times = []
    for _ in range(50):
        start = time.perf_counter()  # monotonic, high-resolution timer
        ort_session.run(None, {user_input_name: u, seq_input_name: s})
        batch_times.append(time.perf_counter() - start)

    batch_fps = (len(user_tensor) * 50) / np.sum(batch_times)
    print(f"Batch Throughput: {batch_fps:.2f} FPS")
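
The timing pattern used in `benchmark_session` (warm up, time repeated calls, report percentiles) can be tried out without a model. The sketch below is a standalone illustration using only the standard library; `time_callable` and `percentile` are hypothetical helper names, and the workload is a stand-in for `ort_session.run`.

```python
import time

def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of a pre-sorted list."""
    if not sorted_vals:
        raise ValueError("no samples")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def time_callable(fn, n_runs=100, n_warmup=5):
    """Time repeated calls of fn() after a few untimed warmup calls."""
    for _ in range(n_warmup):
        fn()
    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return {
        "p50_ms": percentile(samples, 50) * 1000,
        "p95_ms": percentile(samples, 95) * 1000,
        "p99_ms": percentile(samples, 99) * 1000,
        "calls_per_sec": n_runs / sum(samples),
    }

# Stand-in workload; in the notebook this would be a call to ort_session.run.
stats = time_callable(lambda: sum(range(1000)), n_runs=50)
print(stats)
```

Reporting the 95th and 99th percentiles alongside the median matters because occasional slow calls can be hidden by an average.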


### Apply basic graph optimizations

In [4]:
onnx_model_path = "models/SSE_PT10kemb.onnx"
optimized_model_path = "models/SSE_PT10kemb_optimized.onnx"

session_options = ort.SessionOptions()
session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
session_options.optimized_model_filepath = optimized_model_path

ort_session = ort.InferenceSession(
    onnx_model_path,
    sess_options=session_options,
    providers=['CPUExecutionProvider']
)

print(f"Optimized ONNX model saved to {optimized_model_path}")


Optimized ONNX model saved to models/SSE_PT10kemb_optimized.onnx


Next, evaluate the optimized model. Graph optimizations may improve inference performance, have a negligible effect, or even make it worse, depending on the model and on the hardware environment in which the model is executed.

In [5]:
optimized_session = ort.InferenceSession(optimized_model_path, providers=['CPUExecutionProvider'])
benchmark_session(optimized_session)

Execution provider: ['CPUExecutionProvider']
Inference Latency (median): 1.91 ms
Inference Latency (95th): 1.93 ms
Inference Latency (99th): 2.19 ms
Inference Throughput (single sample): 522.37 FPS
Batch Throughput: 1072.27 FPS
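
To make "fusing operations" more concrete, here is a toy example of one classic fusion: folding an inference-time normalization (scale and shift) into the linear layer that precedes it, so that two operations become one. This is only an illustration of the idea with hypothetical parameters. It is not ONNX Runtime's implementation, and it does not claim that this particular pattern occurs in the model used in this section.

```python
import math

def linear(x, W, b):
    """y = W @ x + b for plain Python lists."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def scale_shift(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization, applied elementwise."""
    return [g * (v - m) / math.sqrt(s + eps) + bt
            for v, g, bt, m, s in zip(y, gamma, beta, mean, var)]

def fuse(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold the normalization into the linear layer's weights and bias."""
    scale = [g / math.sqrt(s + eps) for g, s in zip(gamma, var)]
    W_fused = [[sc * w for w in row] for sc, row in zip(scale, W)]
    b_fused = [sc * (bi - m) + bt for sc, bi, m, bt in zip(scale, b, mean, beta)]
    return W_fused, b_fused

# Hypothetical toy parameters and input.
W = [[1.0, 2.0], [0.5, -1.0]]
b = [0.1, -0.2]
gamma, beta = [1.5, 0.8], [0.0, 0.3]
mean, var = [0.2, -0.1], [1.0, 4.0]
x = [0.7, -1.2]

two_ops = scale_shift(linear(x, W, b), gamma, beta, mean, var)  # unfused graph
W_f, b_f = fuse(W, b, gamma, beta, mean, var)
one_op = linear(x, W_f, b_f)                                    # fused graph
print(two_ops)
print(one_op)
```

Both paths produce the same values up to floating-point rounding, but the fused version runs one operation instead of two. Whether that saves measurable time depends on the model and the hardware, which is why the benchmark is still needed.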


<!--

On gigaio AMD EPYC:


Execution provider: ['CPUExecutionProvider']
Accuracy: 90.59% (3032/3347 correct)
Inference Latency (single sample, median): 8.70 ms
Inference Latency (single sample, 95th percentile): 8.88 ms
Inference Latency (single sample, 99th percentile): 9.24 ms
Inference Throughput (single sample): 114.63 FPS
Batch Throughput: 1153.63 FPS

On liqid Intel:

Execution provider: ['CPUExecutionProvider']
Accuracy: 90.59% (3032/3347 correct)
Inference Latency (single sample, median): 4.63 ms
Inference Latency (single sample, 95th percentile): 4.67 ms
Inference Latency (single sample, 99th percentile): 4.75 ms
Inference Throughput (single sample): 214.45 FPS
Batch Throughput: 2488.54 FPS

-->

### Apply post-training quantization


#### Dynamic quantization

We will start with dynamic quantization. Because the quantization parameters for activations are computed at inference time, no calibration dataset is required.

In [13]:
import onnxruntime as ort
import neural_compressor
from neural_compressor import quantization
from neural_compressor.config import PostTrainingQuantConfig
from neural_compressor.model.onnx_model import ONNXModel

fp32_model_path = "models/SSE_PT10kemb.onnx"
fp32_model = ONNXModel(fp32_model_path)

config = PostTrainingQuantConfig(approach="dynamic")

q_model = quantization.fit(
    model=fp32_model,
    conf=config
)

quantized_model_path = "models/SSE_PT10kemb_quant_dynamic.onnx"
q_model.save_model_to_file(quantized_model_path)
print(f"Quantized model saved to: {quantized_model_path}")

model_size = os.path.getsize(quantized_model_path)
print(f"Quantized Model Size on Disk: {model_size / 1e6:.2f} MB")

ort_session = ort.InferenceSession(quantized_model_path, providers=['CPUExecutionProvider'])
benchmark_session(ort_session)


2025-05-13 11:51:15 [INFO] Start auto tuning.
2025-05-13 11:51:15 [INFO] Quantize model without tuning!
2025-05-13 11:51:15 [INFO] Quantize the model with default configuration without evaluating the model.                To perform the tuning process, please either provide an eval_func or provide an                    eval_dataloader an eval_metric.
2025-05-13 11:51:15 [INFO] Adaptor has 5 recipes.
2025-05-13 11:51:15 [INFO] 0 recipes specified by user.
2025-05-13 11:51:15 [INFO] 3 recipes require future tuning.
2025-05-13 11:51:15 [INFO] *** Initialize auto tuning
2025-05-13 11:51:15 [INFO] {
2025-05-13 11:51:15 [INFO]     'PostTrainingQuantConfig': {
2025-05-13 11:51:15 [INFO]         'AccuracyCriterion': {
2025-05-13 11:51:15 [INFO]             'criterion': 'relative',
2025-05-13 11:51:15 [INFO]             'higher_is_better': True,
2025-05-13 11:51:15 [INFO]             'tolerable_loss': 0.01,
2025-05-13 11:51:15 [INFO]             'absolute': None,
2025-05-13 11:51:15 [INFO]     

Quantized model saved to: models/SSE_PT10kemb_quant_dynamic.onnx
Quantized Model Size on Disk: 18.85 MB
Execution provider: ['CPUExecutionProvider']
Inference Latency (median): 4.93 ms
Inference Latency (95th): 5.08 ms
Inference Latency (99th): 5.36 ms
Inference Throughput (single sample): 202.33 FPS
Batch Throughput: 940.05 FPS
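
The idea behind dynamic quantization can be shown on a handful of numbers. The sketch below uses a simple symmetric int8 scheme, which is an assumption for illustration and not necessarily the scheme Neural Compressor applies. The scale is derived from each tensor at the moment it is seen, so inputs with very different ranges each get their own scale.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers using a scale derived from the values."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map quantized integers back to approximate floats."""
    return [qi * scale for qi in q]

# Two hypothetical activation tensors with very different ranges.
for activations in ([0.02, -0.01, 0.005], [12.0, -3.0, 7.5]):
    q, scale = quantize_symmetric(activations)   # scale computed "at run time"
    restored = dequantize(q, scale)
    max_err = max(abs(a - r) for a, r in zip(activations, restored))
    print(f"scale={scale:.6f} q={q} max_error={max_err:.6f}")
```

The extra work of computing a scale for every input is one reason a dynamically quantized model can end up slower than the original, even though it is smaller.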


<!-- 

On liqid AMD EPYC

Model Size on Disk: 2.42 MB
Execution provider: ['CPUExecutionProvider']
Accuracy: 82.04% (2746/3347 correct)
Inference Latency (single sample, median): 22.32 ms
Inference Latency (single sample, 95th percentile): 22.97 ms
Inference Latency (single sample, 99th percentile): 23.14 ms
Inference Throughput (single sample): 44.71 FPS
Batch Throughput: 38.34 FPS

On liqid Intel

Execution provider: ['CPUExecutionProvider']
Accuracy: 84.58% (2831/3347 correct)
Inference Latency (single sample, median): 28.29 ms
Inference Latency (single sample, 95th percentile): 29.00 ms
Inference Latency (single sample, 99th percentile): 29.07 ms
Inference Throughput (single sample): 35.28 FPS

-->

#### Static quantization

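Next, we will try static quantization, which fixes the quantization parameters for activations ahead of time using a calibration dataset.

First, let’s prepare the calibration dataset. The same dataset could also be used to evaluate the quantized model against an accuracy criterion; the cells below pass it to `quantization.fit` for calibration only.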

In [None]:
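def collate_fn(batch):
    # Each dataset item is a (user_id, item_seq) tuple of tensors, so index by position.
    # np.vstack gives user ids the shape (batch, 1); benchmark_session feeds shape (batch,).
    user_batch = np.vstack([b[0].numpy() for b in batch]).astype(np.int64)
    seq_batch = np.vstack([b[1].numpy() for b in batch]).astype(np.int64)
    print(f"user_ids shape: {user_batch.shape}, item_seqs shape: {seq_batch.shape}")
    return {
        "user_ids": user_batch,
        "item_seqs": seq_batch
    }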


from torch.utils.data import DataLoader

calib_loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=False,
    collate_fn=collate_fn
)

sample = next(iter(calib_loader))
print("Returned keys:", sample.keys())

user_ids shape: (64, 1), item_seqs shape: (64, 100)
Returned keys: dict_keys(['user_ids', 'item_seqs'])
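
What calibration buys us can be seen in a toy example. In the sketch below, the scale is fixed once from a few calibration batches and then reused for every later input, so nothing has to be computed per input at inference time. The symmetric int8 scheme, the helper names, and the numbers are assumptions for illustration, not what Neural Compressor does internally.

```python
def calibrate_scale(calibration_batches, num_bits=8):
    """Derive one fixed scale from the largest magnitude seen during calibration."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for batch in calibration_batches for v in batch)
    return max_abs / qmax if max_abs > 0 else 1.0

def quantize_with_scale(values, scale, num_bits=8):
    """Quantize with a fixed scale, clipping values outside the calibrated range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]

# Hypothetical calibration data: the largest magnitude observed is 2.0.
calib = [[-0.5, 0.25], [1.0, -2.0]]
scale = calibrate_scale(calib)

# At inference time the scale is reused; 4.0 lies outside the calibrated range.
q = quantize_with_scale([0.5, 4.0], scale)
print(scale, q)
```

The value 4.0 is clipped to the largest representable integer because nothing that large appeared during calibration. This is why the calibration data should be representative of the inputs the model will see in deployment.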


In [None]:
from neural_compressor.config import PostTrainingQuantConfig

dataset = SequentialEvalDataset("/mnt/data/evaluation/movielens_192m_eval.txt", seq_max_len=100)

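def collate_fn(batch):
    # Each dataset item is a (user_id, item_seq) tuple of tensors, so index by position.
    user_batch = np.vstack([b[0].numpy() for b in batch]).astype(np.int64)
    seq_batch = np.vstack([b[1].numpy() for b in batch]).astype(np.int64)
    print(f"user_ids shape: {user_batch.shape}, item_seqs shape: {seq_batch.shape}")
    inputs = {
        "user_ids": user_batch,
        "item_seqs": seq_batch
    }
    # Neural Compressor documents calib_dataloader as yielding (input, label) pairs;
    # calibration needs no labels, so None is used as a placeholder.
    return inputs, None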

eval_loader = DataLoader(dataset, batch_size=64, shuffle=False, collate_fn=collate_fn)

fp32_model_path = "models/SSE_PT10kemb.onnx"
fp32_model = ONNXModel(fp32_model_path)

config_ptq = PostTrainingQuantConfig(
    approach="static",
    device="cpu",
    quant_level=1,
    quant_format="QOperator",
    recipes={"graph_optimization_level": "ENABLE_EXTENDED"},
    calibration_sampling_size=128
)

q_model = quantization.fit(
    model=fp32_model,
    conf=config_ptq,
    calib_dataloader=eval_loader  
)

quantized_model_path = "models/SSE_PT10kemb_quant_static.onnx"
if q_model:
    q_model.save_model_to_file(quantized_model_path)
    print(f"Static quantized model saved to: {quantized_model_path}")

    model_size = os.path.getsize(quantized_model_path)
    print(f"Quantized Model Size on Disk: {model_size / 1e6:.2f} MB")

    ort_session = ort.InferenceSession(quantized_model_path, providers=["CPUExecutionProvider"])
    benchmark_session(ort_session)
else:
    print("Quantization failed")
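
The log from the dynamic quantization run lists an accuracy criterion with `'criterion': 'relative'`, `'higher_is_better': True` and `'tolerable_loss': 0.01`. One plausible reading of those settings is sketched below. The function name and the metric values are hypothetical, and the exact rule Neural Compressor applies during tuning may differ.

```python
def meets_accuracy_criterion(baseline, candidate, tolerable_loss=0.01,
                             criterion="relative", higher_is_better=True):
    """Check whether a quantized model's metric stays within the allowed loss."""
    if criterion == "relative":
        margin = abs(baseline) * tolerable_loss   # fraction of the baseline
    elif criterion == "absolute":
        margin = tolerable_loss                   # fixed amount
    else:
        raise ValueError(f"unknown criterion: {criterion}")
    if higher_is_better:
        return candidate >= baseline - margin
    return candidate <= baseline + margin

# Hypothetical metric values for an FP32 baseline and two quantized candidates.
print(meets_accuracy_criterion(0.90, 0.895))  # within 1% of the baseline
print(meets_accuracy_criterion(0.90, 0.88))   # loses more than 1%
```

A check like this only takes effect when an evaluation function or evaluation data loader and metric are supplied to `quantization.fit`. Without them, as the log notes, the model is quantized with the default configuration and is not evaluated.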


<pre style="font-size:84%; line-height:1.3em; font-family:monospace;">
On AMD EPYC

Model Size on Disk: 6.01 MB
Accuracy: 90.20% (3019/3347 correct)
Inference Latency (single sample, median): 10.20 ms
Inference Latency (single sample, 95th percentile): 10.39 ms
Inference Latency (single sample, 99th percentile): 10.66 ms
Inference Throughput (single sample): 97.87 FPS
Batch Throughput: 277.23 FPS

On intel

Execution provider: ['CPUExecutionProvider']
Accuracy: 90.44% (3027/3347 correct)
Inference Latency (single sample, median): 6.60 ms
Inference Latency (single sample, 95th percentile): 6.66 ms
Inference Latency (single sample, 99th percentile): 6.68 ms
Inference Throughput (single sample): 151.36 FPS
Batch Throughput: 540.19 FPS
</pre>

<!--


::: {.cell .markdown}

### Quantization aware training

To achieve the best of both worlds - high accuracy, but the small model size and faster inference time of a quantized model - we can try quantization aware training. In QAT, the effect of quantization is "simulated" during training, so that we learn weights that are more robust to quantization. Then, when we quantize the model, we can achieve better accuracy.

:::

-->