## Measure inference performance of ONNX model on CPU

To squeeze even more inference performance out of our model, we will convert it to ONNX format. ONNX lets models from different frameworks (PyTorch, TensorFlow, Keras) be deployed on a variety of hardware platforms (CPU, GPU, edge devices) and benefit from many optimizations (graph optimizations, quantization, target device-specific implementations, and more).

After finishing this section, you should know:

-   how to convert a PyTorch model to ONNX
-   how to measure the inference latency and batch throughput of the ONNX model

You will then use these techniques to evaluate the optimized models you develop in the next section.

You will execute this notebook *in a Jupyter container running on a compute instance*, not on the general-purpose Chameleon Jupyter environment from which you provision resources.

In [None]:
import os
import time
import numpy as np
import torch
import onnx
import onnxruntime as ort
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
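from data_utils import build_sample_batch
# NOTE: SSEPTModel (used below) must also be defined or imported before running the next cell;
# its module is not shown in this notebook.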

In [None]:
model_path = "model/SSE_PT10kemb.pth"
device = torch.device("cpu")

model = SSEPTModel()
model.load_state_dict(torch.load(model_path, map_location=device))
model.eval()

onnx_model_path = "/mnt/models/ssept_dynamic.onnx"

dummy_input = (
    torch.tensor([[1]], dtype=torch.long),                 # userId
    torch.tensor([[101]], dtype=torch.long),               # movieId
    torch.randint(0, 1000, (1, 10), dtype=torch.long),      # cast
    torch.randint(0, 20, (1, 5), dtype=torch.long),         # genre
    torch.randn(1, 768),                                   # transcript_embedding
    torch.randn(1, 512)                                    # audio_embedding
)

# Export to an ONNX model
torch.onnx.export(model, dummy_input, onnx_model_path,
                  export_params=True,
                  opset_version=20,
                  do_constant_folding=True,
                  input_names=['userId', 'movieId', 'cast', 'genre', 'transcript_embedding', 'audio_embedding'],
                  output_names=['output'],
                  dynamic_axes={'userId': {0: 'batch_size'},
                                'movieId': {0: 'batch_size'},
                                'cast': {0: 'batch_size'},
                                'genre': {0: 'batch_size'},
                                'transcript_embedding': {0: 'batch_size'},
                                'audio_embedding': {0: 'batch_size'},
                                'output': {0: 'batch_size'}})

print(f"ONNX model saved to: {onnx_model_path}")

onnx_model = onnx.load(onnx_model_path)
onnx.checker.check_model(onnx_model)
print("ONNX model structure is valid.")


## Create an inference session



In [None]:
# Model paths
model_path = "/model/SSE_PT10kemb.pth"
onnx_model_path = "/mnt/movielens/models/ssept_dynamic.onnx"

# Load the model
device = torch.device("cpu")
model = SSEPTModel()
model.load_state_dict(torch.load(model_path, map_location=device))
model.eval()

Construct a dummy input and export it to ONNX

In [None]:
dummy_input = build_sample_batch(batch_size=1)  # returns Dict[str, Tensor]

torch.onnx.export(
    model,
    (dummy_input,),
    onnx_model_path,
    export_params=True,
    opset_version=20,
    do_constant_folding=True,
    input_names=list(dummy_input.keys()),
    output_names=["output"],
    dynamic_axes={k: {0: "batch_size"} for k in dummy_input.keys()}
)
print(f"ONNX model saved to {onnx_model_path}")
onnx_model = onnx.load(onnx_model_path)
onnx.checker.check_model(onnx_model)
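
The first export above marks `output` as dynamic along the batch axis, while the second export marks only the inputs. A minimal sketch of a helper that builds the `dynamic_axes` mapping for inputs and outputs together is shown here. `make_dynamic_axes` is a hypothetical name (it is not part of `torch.onnx`), and the input names are copied from the first export cell as an illustration.

```python
def make_dynamic_axes(input_names, output_names, axis=0, label="batch_size"):
    """Map every named input and output to one dynamic axis label.

    The result has the shape expected by the `dynamic_axes` argument of
    torch.onnx.export: {tensor_name: {axis_index: axis_label}}.
    """
    return {name: {axis: label} for name in [*input_names, *output_names]}


# Illustrative names, mirroring the first export cell in this notebook.
input_names = ["userId", "movieId", "cast", "genre",
               "transcript_embedding", "audio_embedding"]
dynamic_axes = make_dynamic_axes(input_names, ["output"])

for name, axes in dynamic_axes.items():
    print(name, axes)
```

The resulting dictionary could be passed as `dynamic_axes=dynamic_axes` in either export call, assuming the names match the `input_names` and `output_names` given to that call.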


Create the ONNX inference session

In [None]:
ort_session = ort.InferenceSession(onnx_model_path, providers=["CPUExecutionProvider"])
print("Execution Providers:", ort_session.get_providers())


#### Test accuracy


In [None]:
def get_test_batch(batch_size=32):
    return build_sample_batch(batch_size=batch_size)

test_loader = [get_test_batch(32) for _ in range(10)] 


In [None]:
correct = 0
total = 0
for inputs in test_loader:
    inputs_numpy = {k: v.numpy() for k, v in inputs.items()}
    outputs = ort_session.run(None, inputs_numpy)[0]
    predicted = np.argmax(outputs, axis=1)
    labels = inputs["movie_id"].numpy().flatten()
    total += labels.shape[0]
    correct += (predicted == labels).sum()

accuracy = (correct / total) * 100
print(f"Accuracy: {accuracy:.2f}% ({correct}/{total} correct)")


#### Model size

We are also concerned with the size of the ONNX model on disk. Before any optimization, it will be similar to the size of the equivalent PyTorch model.

In [None]:
model_size = os.path.getsize(onnx_model_path)
print(f"Model Size on Disk: {model_size / 1e6:.2f} MB")

#### Inference latency

Now, we’ll measure how long it takes the model to return a prediction for a single sample. We will run 100 trials, and then compute aggregate statistics.

In [None]:
num_trials = 100  # Number of trials

# Get a single sample from the test data: keep the first row of every input
single_sample = {k: v[:1].numpy() for k, v in test_loader[0].items()}

# Warm-up run (not timed)
ort_session.run(None, single_sample)

latencies = []
for _ in range(num_trials):
    start_time = time.perf_counter()
    ort_session.run(None, single_sample)
    latencies.append(time.perf_counter() - start_time)

In [None]:
print(f"Inference Latency (single sample, median): {np.percentile(latencies, 50) * 1000:.2f} ms")
print(f"Inference Latency (single sample, 95th percentile): {np.percentile(latencies, 95) * 1000:.2f} ms")
print(f"Inference Latency (single sample, 99th percentile): {np.percentile(latencies, 99) * 1000:.2f} ms")
print(f"Inference Throughput (single sample): {num_trials/np.sum(latencies):.2f} FPS")
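
The timing pattern above (warm-up, repeated timed runs, then percentiles) can be wrapped in a reusable helper so that the same procedure is applied to every model variant. The following is a minimal sketch using only the standard library. `measure_latency` and `percentile` are hypothetical names, and the workload is a stand-in lambda rather than a real ONNX session.

```python
import time


def percentile(sorted_values, q):
    """Percentile (q in [0, 100]) of a pre-sorted list, using the same
    linear interpolation that np.percentile applies by default."""
    if not sorted_values:
        raise ValueError("no samples")
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def measure_latency(run_fn, num_trials=100, warmup=10):
    """Call run_fn repeatedly and summarize per-call latency."""
    for _ in range(warmup):          # untimed warm-up calls
        run_fn()
    latencies = []
    for _ in range(num_trials):
        start = time.perf_counter()
        run_fn()
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    return {
        "median_ms": percentile(latencies, 50) * 1000,
        "p95_ms": percentile(latencies, 95) * 1000,
        "p99_ms": percentile(latencies, 99) * 1000,
        "throughput_per_s": num_trials / sum(latencies),
    }


# Stand-in workload; in this notebook run_fn would wrap an ort_session.run call.
stats = measure_latency(lambda: sum(range(10_000)), num_trials=50, warmup=5)
for key, value in stats.items():
    print(f"{key}: {value:.3f}")
```

In this notebook, the helper could be called as `measure_latency(lambda: ort_session.run(None, feed))`, where `feed` is assumed to be a dictionary mapping every model input name to a NumPy array.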

#### Batch throughput

Finally, we’ll measure the rate at which the model can return predictions for batches of data.

In [None]:
num_batches = 50

# Get a batch from the test data
batch_input = test_loader[0]
batch_input_np = {k: v.numpy() for k, v in batch_input.items()}
batch_size = next(iter(batch_input_np.values())).shape[0]

# Warm-up run (not timed)
ort_session.run(None, batch_input_np)

batch_times = []
for _ in range(num_batches):
    start_time = time.perf_counter()
    ort_session.run(None, batch_input_np)
    batch_times.append(time.perf_counter() - start_time)

In [None]:
batch_fps = (batch_size * num_batches) / np.sum(batch_times)
print(f"Batch Throughput: {batch_fps:.2f} FPS")
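
The throughput figure is the total number of samples processed divided by the total time spent in inference. As a worked sketch of that arithmetic, with `batch_throughput` as a hypothetical helper name and made-up timings:

```python
def batch_throughput(batch_size, batch_times):
    """Samples per second, given a fixed batch size and a list of
    per-batch wall-clock times in seconds."""
    total_time = sum(batch_times)
    if total_time <= 0:
        raise ValueError("batch_times must sum to a positive duration")
    return batch_size * len(batch_times) / total_time


# Two batches of 32 samples, each taking 0.5 s: 64 samples in 1.0 s.
example_fps = batch_throughput(32, [0.5, 0.5])
print(f"Batch Throughput: {example_fps:.2f} FPS")  # Batch Throughput: 64.00 FPS
```

Dividing by the summed per-batch times, rather than by the wall-clock time of the whole loop, excludes loop overhead from the estimate.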

#### Summary of results

In [None]:
print(f"Accuracy: {accuracy:.2f}% ({correct}/{total} correct)")
print(f"Model Size on Disk: {model_size / 1e6:.2f} MB")
print(f"Inference Latency (single sample, median): {np.percentile(latencies, 50) * 1000:.2f} ms")
print(f"Inference Latency (single sample, 95th percentile): {np.percentile(latencies, 95) * 1000:.2f} ms")
print(f"Inference Latency (single sample, 99th percentile): {np.percentile(latencies, 99) * 1000:.2f} ms")
print(f"Inference Throughput (single sample): {num_trials/np.sum(latencies):.2f} FPS")
print(f"Batch Throughput: {batch_fps:.2f} FPS")


When you are done, download the fully executed notebook from the Jupyter container environment for later reference. (Note: because it is an executable file, and you are downloading it from a site that is not secured with HTTPS, you may have to explicitly confirm the download in some browsers.)

Also download the `ssept_dynamic.onnx` model from inside the `models` directory.

In [None]:

# SSE-PT ONNX Inference Performance Evaluation for Movie Recommendation

import onnxruntime as ort
import numpy as np
import time

# Load ONNX model
onnx_model_path = "/mnt/models/ssept_dynamic.onnx"
session = ort.InferenceSession(onnx_model_path, providers=["CPUExecutionProvider"])

# Dummy input for SSEPT; the keys must match the input_names used when
# the model at this path was exported ('userId', 'movieId', ...)
input_tensor = {
    'userId': np.array([[1]], dtype=np.int64),
    'movieId': np.array([[101]], dtype=np.int64),
    'cast': np.random.randint(0, 1000, size=(1, 10), dtype=np.int64),
    'genre': np.random.randint(0, 20, size=(1, 5), dtype=np.int64),
    'transcript_embedding': np.random.rand(1, 768).astype(np.float32),
    'audio_embedding': np.random.rand(1, 512).astype(np.float32)
}

# Warm-up runs (not timed)
for _ in range(10):
    _ = session.run(None, input_tensor)

# Measure inference latency
start = time.perf_counter()
for _ in range(100):
    outputs = session.run(None, input_tensor)
end = time.perf_counter()

avg_latency = (end - start) / 100
print(f"Average inference time for SSEPT ONNX (CPU): {avg_latency:.6f} sec/sample")
