## Measure inference performance of ONNX model on CPU

To squeeze even more inference performance out of our model, we are going to convert it to ONNX format. ONNX allows models from different frameworks (PyTorch, TensorFlow, Keras) to be deployed on a variety of hardware platforms (CPU, GPU, edge devices), and it makes many optimizations available (graph optimizations, quantization, target device-specific implementations, and more).

In [None]:
import os
import time
import numpy as np
import torch
import onnx
import onnxruntime as ort
import pandas as pd
from torch.utils.data import DataLoader, Dataset

from utilities import build_model_from_ckpt, pad_or_truncate

In [None]:
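# Prepare test dataset
import ast

SEQ_LEN = 50


class MovieLensTestDataset(Dataset):
    def __init__(self, csv_path):
        self.df = pd.read_csv(csv_path)
        self.data = self.df.to_dict(orient="records")

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data[idx]
        user_id = int(row["user_id"])
        # Sequences read from the CSV may arrive as strings; literal_eval parses
        # Python literals without executing arbitrary code (unlike eval).
        sequence = ast.literal_eval(row["sequence"]) if isinstance(row["sequence"], str) else row["sequence"]
        sequence = pad_or_truncate(sequence, SEQ_LEN)
        # Return an array so the default collate function yields a (batch, SEQ_LEN) tensor.
        return user_id, np.asarray(sequence)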

# load data
movielens_data_dir = os.getenv("MOVIELENS_DATA_DIR", "/mnt/data")
test_dataset = MovieLensTestDataset(os.path.join(movielens_data_dir, "test.csv"))
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)


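The `pad_or_truncate` helper is imported from `utilities` and is not shown in this notebook. As a rough illustration of what such a helper might do, here is a minimal sketch. The name `pad_or_truncate_sketch`, the left-padding with `0`, and the choice to keep the most recent items are all assumptions, not the actual implementation.

```python
import ast


def pad_or_truncate_sketch(sequence, length, pad_value=0):
    """Hypothetical stand-in for utilities.pad_or_truncate (behavior assumed)."""
    if length <= 0:
        return []
    # Assumption: keep the most recent `length` items.
    sequence = list(sequence)[-length:]
    # Assumption: pad on the left so the newest items sit at the end.
    return [pad_value] * (length - len(sequence)) + sequence


# A sequence stored as a string, as it might appear in a CSV cell.
parsed = ast.literal_eval("[12, 7, 93]")
padded = pad_or_truncate_sketch(parsed, 5)
truncated = pad_or_truncate_sketch(list(range(10)), 5)
print(padded)
print(truncated)
```

Either way, every sample ends up with the same fixed length, which is what lets the sequences be stacked into a single batch array.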
## Create an inference session

To use an ONNX model, we create an *inference session*, and then use the model within that session. 
Let’s start an inference session:

In [None]:
onnx_model_path = "models/SSE_PT10kemb.onnx" 
onnx_model = onnx.load(onnx_model_path)
onnx.checker.check_model(onnx_model)

ort_session = ort.InferenceSession(onnx_model_path, providers=['CPUExecutionProvider'])

inputs = ort_session.get_inputs()
user_input_name = inputs[0].name 
seq_input_name = inputs[1].name 

#### Test accuracy

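First, let’s run the model over the test set. The test loader yields only user IDs and sequences, with no target labels, so the cell below counts every prediction as correct; the number it prints is a placeholder, not a real accuracy:

In [None]:
correct, total = 0, 0
for user_ids, sequences in test_loader:
    u = np.array(user_ids)
    s = np.stack(sequences)
    outputs = ort_session.run(None, {
        user_input_name: u,
        seq_input_name: s
    })[0]
    preds = np.argmax(outputs, axis=1)  # top-1 predicted class per sample
    total += len(preds)
    # Placeholder: no labels are available here, so every prediction counts as correct.
    correct += len(preds)

accuracy = correct / total * 100
print(f"Accuracy (dummy): {accuracy:.2f}%")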

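The cell above does not compare predictions against labels. If each test row also carried a target item, top-1 accuracy could be accumulated across batches roughly as follows. The `labels` arrays and the toy `logits` here are made up for illustration; the dataset in this notebook does not expose a label column.

```python
import numpy as np


def count_correct(logits, labels):
    """Return how many rows have their arg-max class equal to the label."""
    preds = np.argmax(logits, axis=1)
    return int(np.sum(preds == np.asarray(labels)))


# Two toy "batches" of model outputs with hypothetical labels.
batches = [
    (np.array([[0.1, 2.0, 0.3], [1.5, 0.2, 0.1]]), [1, 0]),
    (np.array([[0.0, 0.1, 3.0], [0.9, 0.8, 0.7]]), [2, 1]),
]

n_correct, n_total = 0, 0
for logits, labels in batches:
    n_correct += count_correct(logits, labels)
    n_total += len(labels)

toy_accuracy = n_correct / n_total * 100
print(f"Accuracy: {toy_accuracy:.2f}% ({n_correct}/{n_total} correct)")
```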
#### Model size

We are also concerned with the size of the ONNX model on disk. It will be similar to the size of the equivalent PyTorch model.

In [None]:
model_size = os.path.getsize(onnx_model_path)
print(f"Model Size on Disk: {model_size / 1e6:.2f} MB")

#### Inference latency

Now, we’ll measure how long it takes the model to return a prediction for a single sample. We will run 100 trials, and then compute aggregate statistics.

In [None]:
user, seq = test_dataset[0]
u = np.array([user])
s = np.array([seq])

# warm-up
ort_session.run(None, {user_input_name: u, seq_input_name: s})

latencies = []
for _ in range(100):
    start = time.perf_counter()  # monotonic, high-resolution clock
    ort_session.run(None, {user_input_name: u, seq_input_name: s})
    latencies.append(time.perf_counter() - start)

print(f"Inference Latency (median): {np.percentile(latencies, 50)*1000:.2f} ms")
print(f"Inference Latency (95th): {np.percentile(latencies, 95)*1000:.2f} ms")
print(f"Inference Latency (99th): {np.percentile(latencies, 99)*1000:.2f} ms")
print(f"Inference Throughput (single sample): {len(latencies)/np.sum(latencies):.2f} FPS")

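The warm-up, timed loop, and percentile reporting above can be wrapped in a small reusable helper. This is a minimal sketch using only the standard library; the `benchmark` function and its parameter names are illustrative, and the lambda is a stand-in workload where a call wrapping `ort_session.run` would normally go.

```python
import statistics
import time


def benchmark(fn, n_trials=100, n_warmup=5):
    """Time fn() n_trials times and return latency statistics."""
    for _ in range(n_warmup):
        fn()  # warm-up calls are not timed
    latencies = []
    for _ in range(n_trials):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    # quantiles(..., n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return {
        "median_ms": cuts[49] * 1000,
        "p95_ms": cuts[94] * 1000,
        "p99_ms": cuts[98] * 1000,
        "throughput_per_s": n_trials / sum(latencies),
    }


# Stand-in workload (hypothetical); replace with the inference call to measure.
stats = benchmark(lambda: sum(range(10_000)), n_trials=50)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Reporting the median together with the 95th and 99th percentiles, rather than the mean alone, keeps occasional slow runs visible instead of averaging them away.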

#### Batch throughput

Finally, we’ll measure the rate at which the model can return predictions for batches of data.

In [None]:
batch_user, batch_seq = next(iter(test_loader))
u = np.array(batch_user)
s = np.stack(batch_seq)

# warm-up
ort_session.run(None, {user_input_name: u, seq_input_name: s})

batch_times = []
for _ in range(50):
    start = time.perf_counter()  # monotonic, high-resolution clock
    ort_session.run(None, {user_input_name: u, seq_input_name: s})
    batch_times.append(time.perf_counter() - start)

# samples processed per second = (samples per batch * number of batches) / total time
batch_fps = (len(batch_user) * len(batch_times)) / np.sum(batch_times)
print(f"Batch Throughput: {batch_fps:.2f} FPS")

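Throughput usually depends on the batch size, so it can be informative to repeat the measurement above for several sizes. The sketch below uses only the standard library; `batch_throughput` and `fake_run_batch` are hypothetical names, the workload is a stand-in for a function that would call `ort_session.run` on a batch of the given size, and the batch sizes are arbitrary.

```python
import time


def batch_throughput(run_batch, batch_size, n_trials=50):
    """Return samples processed per second for one batch size."""
    run_batch(batch_size)  # warm-up call, not timed
    start = time.perf_counter()
    for _ in range(n_trials):
        run_batch(batch_size)
    elapsed = time.perf_counter() - start
    return batch_size * n_trials / elapsed


def fake_run_batch(batch_size):
    # Hypothetical stand-in workload; replace with a function that builds a
    # batch of `batch_size` samples and runs inference on it.
    return [x * x for x in range(batch_size * 100)]


results = {bs: batch_throughput(fake_run_batch, bs, n_trials=20) for bs in (1, 8, 32, 128)}
for bs, rate in results.items():
    print(f"batch_size={bs:>3}: {rate:,.0f} samples/s")
```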
#### Summary of results

In [None]:
print(f"Model Size on Disk: {model_size / 1e6:.2f} MB")
print(f"Accuracy: {accuracy:.2f}% ({correct}/{total} dummy)")
print(f"Inference Latency (single sample, median): {np.percentile(latencies, 50) * 1000:.2f} ms")
print(f"Inference Latency (single sample, 95th percentile): {np.percentile(latencies, 95) * 1000:.2f} ms")
print(f"Inference Latency (single sample, 99th percentile): {np.percentile(latencies, 99) * 1000:.2f} ms")
print(f"Inference Throughput (single sample): {100/np.sum(latencies):.2f} FPS")
print(f"Batch Throughput: {batch_fps:.2f} FPS")
