## Measure inference performance of ONNX model on CPU

To squeeze even more inference performance out of our model, we are going to convert it to ONNX format. ONNX allows models from different frameworks (PyTorch, TensorFlow, Keras) to be deployed on a variety of hardware platforms (CPU, GPU, edge devices) using many optimizations (graph optimizations, quantization, target device-specific implementations, and more).

In [1]:
import os
import time
import numpy as np
import torch
import onnx
import onnxruntime as ort
import pandas as pd
from torch.utils.data import DataLoader, Dataset

from utilities import build_model_from_ckpt, pad_or_truncate

In [2]:
onnx_model_path = "models/SSE_PT10kemb.onnx"
onnx_model = onnx.load(onnx_model_path)
onnx.checker.check_model(onnx_model)
ort_session = ort.InferenceSession(onnx_model_path, providers=['CPUExecutionProvider'])


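The model has already been converted to ONNX (for example, with PyTorch's built-in `torch.onnx.export`) and saved at `models/SSE_PT10kemb.onnx`. First, let's load the evaluation data. The dataset below groups each user's interactions into a sequence and holds out the last item as the label: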

In [11]:
from utilities import get_max_item_id

class SequentialEvalDataset(Dataset):
    def __init__(self, filepath, seq_max_len, return_label=True):
        print(">>> Entered __init__")

        self.user_sequences = {}
        with open(filepath, "r") as f:
            for line in f:
                uid, iid = map(int, line.strip().split("\t"))
                self.user_sequences.setdefault(uid, []).append(iid)          

        self.samples = []
        self.seq_max_len = seq_max_len
        self.return_label = return_label

        for uid, seq in self.user_sequences.items():
            if len(seq) < 2:
                continue
            sequence = seq[:-1]
            label = seq[-1]
            self.samples.append((uid, sequence, label))

        print(f"Loaded {len(self.samples)} valid sequences from {filepath}")

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        uid, seq, label = self.samples[idx]
        seq_tensor = torch.tensor(pad_or_truncate(seq, self.seq_max_len), dtype=torch.long)
        uid_tensor = torch.tensor(uid, dtype=torch.long)
        if self.return_label:
            return uid_tensor, seq_tensor, torch.tensor(label, dtype=torch.long)
        else:
            return uid_tensor, seq_tensor

seq_max_len = 100
dataset = SequentialEvalDataset("/mnt/data/evaluation/movielens_192m_eval.txt", seq_max_len, return_label=False)
loader = DataLoader(dataset, batch_size=64, shuffle=False)


>>> Entered __init__
Loaded 1374159 valid sequences from /mnt/data/evaluation/movielens_192m_eval.txt
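
To make the hold-out logic in `SequentialEvalDataset.__init__` concrete, here is a minimal sketch of the same grouping and split applied to a few made-up lines. The function name `build_eval_samples` and the toy data are illustrative assumptions and are not part of the notebook or the MovieLens file.

```python
from collections import defaultdict

def build_eval_samples(lines):
    """Group tab-separated 'uid<TAB>iid' lines by user, then hold out the last item."""
    user_sequences = defaultdict(list)
    for line in lines:
        uid, iid = map(int, line.strip().split("\t"))
        user_sequences[uid].append(iid)

    samples = []
    for uid, seq in user_sequences.items():
        if len(seq) < 2:  # need at least one history item plus one label
            continue
        samples.append((uid, seq[:-1], seq[-1]))  # (user, history, held-out label)
    return samples

# Made-up toy interactions, for illustration only
toy_lines = ["1\t10", "1\t11", "1\t12", "2\t20", "3\t30", "3\t31"]
samples = build_eval_samples(toy_lines)
print(samples)  # user 2 is skipped because it has only one interaction
```

Users with a single interaction are dropped because there would be no history left to feed the model once the label is held out.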


## Create an inference session

Now, we can evaluate our model! To use an ONNX model, we create an *inference session*, and then use the model within that session. Let’s start an inference session:

In [12]:
onnx_model = onnx.load(onnx_model_path)
onnx.checker.check_model(onnx_model)

ort_session = ort.InferenceSession(onnx_model_path, providers=["CPUExecutionProvider"])

print("Inputs:", ort_session.get_inputs())
print("Outputs:", ort_session.get_outputs())


Inputs: [<onnxruntime.capi.onnxruntime_pybind11_state.NodeArg object at 0x7e78cc2bcb70>, <onnxruntime.capi.onnxruntime_pybind11_state.NodeArg object at 0x7e78cc2bf730>]
Outputs: [<onnxruntime.capi.onnxruntime_pybind11_state.NodeArg object at 0x7e78cc2bc630>]
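
The default printout above only shows object addresses. Each `NodeArg` exposes `name`, `type`, and `shape` attributes, so a small helper can print something more readable. This is a sketch: the helper names `describe_io` and `print_io` are assumptions introduced here, not part of the notebook.

```python
def describe_io(session):
    """Return (kind, name, type, shape) rows for a session's inputs and outputs."""
    rows = []
    for kind, args in (("input", session.get_inputs()),
                       ("output", session.get_outputs())):
        for arg in args:
            rows.append((kind, arg.name, arg.type, arg.shape))
    return rows

def print_io(session):
    """Print one readable line per model input and output."""
    for kind, name, type_, shape in describe_io(session):
        print(f"{kind:>6}: name={name!r} type={type_} shape={shape}")

# Usage with the session created above:
# print_io(ort_session)
```

Knowing the input names and shapes up front also makes it easier to build the feed dictionary passed to `ort_session.run`.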


#### Model size

We are also concerned with the size of the ONNX model on disk. To start, it will be similar to the size of the equivalent PyTorch model.

In [14]:
model_size = os.path.getsize(onnx_model_path) 
print(f"Model Size on Disk: {model_size/ (1e6) :.2f} MB")

Model Size on Disk: 72.53 MB


#### Inference latency

Now, we’ll measure how long it takes the model to return a prediction for a single sample. We will run 100 trials, and then compute aggregate statistics.

In [15]:
num_trials = 100  

user_tensor, seq_tensor = next(iter(loader))
user_input = user_tensor[:1].numpy()
seq_input = seq_tensor[:1].numpy()

# Warm-up
ort_session.run(None, {
    ort_session.get_inputs()[0].name: user_input,
    ort_session.get_inputs()[1].name: seq_input
})

latencies = []
for _ in range(num_trials):
    # perf_counter is a monotonic, high-resolution clock meant for timing short intervals
    start = time.perf_counter()
    ort_session.run(None, {
        ort_session.get_inputs()[0].name: user_input,
        ort_session.get_inputs()[1].name: seq_input
    })
    latencies.append(time.perf_counter() - start)

print(f"Inference Latency (single sample, median): {np.percentile(latencies, 50) * 1000:.2f} ms")
print(f"Inference Latency (single sample, 95th percentile): {np.percentile(latencies, 95) * 1000:.2f} ms")
print(f"Inference Latency (single sample, 99th percentile): {np.percentile(latencies, 99) * 1000:.2f} ms")
print(f"Inference Throughput (single sample): {num_trials / np.sum(latencies):.2f} FPS")

Inference Latency (single sample, median): 2.05 ms
Inference Latency (single sample, 95th percentile): 3.86 ms
Inference Latency (single sample, 99th percentile): 4.46 ms
Inference Throughput (single sample): 430.97 FPS
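
The warm-up, timing loop, and percentile summary above can be packaged into a reusable helper, which is handy when the same measurement is repeated for other models or execution providers. The sketch below uses only the standard library. The names `measure_latency` and `summarize` are assumptions introduced here, and the timed callable is a stand-in for `ort_session.run(...)`.

```python
import time
import statistics

def measure_latency(fn, num_trials=100, warmup=1):
    """Call fn() a few times to warm up, then return per-call latencies in seconds."""
    for _ in range(warmup):
        fn()
    latencies = []
    for _ in range(num_trials):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies

def summarize(latencies):
    """Median, 95th and 99th percentile (ms), and throughput (calls per second)."""
    # With n=100, quantiles() returns the 99 cut points p1..p99
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return {
        "median_ms": statistics.median(latencies) * 1000,
        "p95_ms": cuts[94] * 1000,
        "p99_ms": cuts[98] * 1000,
        "throughput_per_s": len(latencies) / sum(latencies),
    }

# Stand-in workload; in the notebook this would wrap ort_session.run(...)
stats = summarize(measure_latency(lambda: sum(range(1000)), num_trials=50))
for key, value in stats.items():
    print(f"{key}: {value:.4f}")
```

Reporting the median together with the tail percentiles matters because a few slow calls can dominate the mean while leaving the median almost unchanged.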


#### Batch throughput

Finally, we’ll measure the rate at which the model can return predictions for batches of data.

In [16]:
num_batches = 50  

user_tensor, seq_tensor = next(iter(loader))
user_input = user_tensor.numpy()
seq_input = seq_tensor.numpy()

# Warm-up
ort_session.run(None, {
    ort_session.get_inputs()[0].name: user_input,
    ort_session.get_inputs()[1].name: seq_input
})

batch_times = []
for _ in range(num_batches):
    # perf_counter is a monotonic, high-resolution clock meant for timing short intervals
    start_time = time.perf_counter()
    ort_session.run(None, {
        ort_session.get_inputs()[0].name: user_input,
        ort_session.get_inputs()[1].name: seq_input
    })
    batch_times.append(time.perf_counter() - start_time)

batch_fps = (user_input.shape[0] * num_batches) / np.sum(batch_times)
print(f"Batch Throughput: {batch_fps:.2f} FPS")


Batch Throughput: 914.38 FPS
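
To relate the batch throughput to the single-sample numbers, the arithmetic can be worked backwards from the reported values. This sketch plugs in the figures printed above (batch size 64 from the `DataLoader`, 50 batches) and assumes nothing beyond them.

```python
batch_size = 64            # DataLoader(batch_size=64)
num_batches = 50
batch_fps = 914.38         # reported batch throughput
single_fps = 430.97        # reported single-sample throughput

# batch_fps = (batch_size * num_batches) / total_time, so invert it
total_time_s = batch_size * num_batches / batch_fps
per_batch_ms = total_time_s / num_batches * 1000
per_sample_ms = 1000 / batch_fps
speedup = batch_fps / single_fps

print(f"Total time for {num_batches} batches: {total_time_s:.2f} s")
print(f"Time per batch: {per_batch_ms:.1f} ms")
print(f"Amortized time per sample: {per_sample_ms:.2f} ms")
print(f"Throughput gain from batching: {speedup:.2f}x")
```

Each batch of 64 takes roughly 70 ms, so batching a little more than doubles throughput here, while an individual request waits much longer than the roughly 2 ms single-sample median.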


#### Summary of results

In [17]:
print(f"Model Size on Disk: {model_size/ (1e6) :.2f} MB")
print(f"Inference Latency (single sample, median): {np.percentile(latencies, 50) * 1000:.2f} ms")
print(f"Inference Latency (single sample, 95th percentile): {np.percentile(latencies, 95) * 1000:.2f} ms")
print(f"Inference Latency (single sample, 99th percentile): {np.percentile(latencies, 99) * 1000:.2f} ms")
print(f"Inference Throughput (single sample): {num_trials / np.sum(latencies):.2f} FPS")
print(f"Batch Throughput: {batch_fps:.2f} FPS")

Model Size on Disk: 72.53 MB
Inference Latency (single sample, median): 2.05 ms
Inference Latency (single sample, 95th percentile): 3.86 ms
Inference Latency (single sample, 99th percentile): 4.46 ms
Inference Throughput (single sample): 430.97 FPS
Batch Throughput: 914.38 FPS



When you are done, download the fully executed notebook from the Jupyter container environment for later reference. (Note: because it is an executable file, and you are downloading it from a site that is not secured with HTTPS, you may have to explicitly confirm the download in some browsers.)

Also download the `SSE_PT10kemb.onnx` model from inside the `models` directory.