## Measure inference performance of PyTorch model on CPU

First, we are going to measure the inference performance of an already-trained PyTorch model on CPU. 

In [11]:
!pip install torchinfo


Collecting torchinfo
  Downloading torchinfo-1.8.0-py3-none-any.whl.metadata (21 kB)
Downloading torchinfo-1.8.0-py3-none-any.whl (23 kB)
Installing collected packages: torchinfo
Successfully installed torchinfo-1.8.0


In [12]:
import os
import torch
import time
import numpy as np
from torch.utils.data import Dataset, DataLoader
import pandas as pd

Next, let’s load our saved model in evaluation mode, and print a summary of it. Note that for now, we will use the CPU for inference, not the GPU.

In [None]:
!wget https://raw.githubusercontent.com/hzsnow/NYU-ECE-GY-7123-Deep-Learning-Final-Project/main/DL_final_project.ipynb

In [None]:
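# Load the trained model onto the CPU in evaluation mode, then print a summary
from torchinfo import summary
from utilities import build_model_from_ckpt
device = torch.device("cpu")
model_path = "models/SSE_PT10kemb.pth"
model = build_model_from_ckpt(model_path, device)
model.eval()
SEQ_LEN = model.seq_max_len
summary(model)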

In [None]:
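import ast

class MovieLensTestDataset(Dataset):
    """Test split: one (user_id, item sequence) pair per CSV row."""
    def __init__(self, csv_path):
        self.df = pd.read_csv(csv_path)
        self.data = self.df.to_dict(orient="records")

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data[idx]
        user_id = int(row["user_id"])
        # Parse stringified sequences with literal_eval, which only evaluates literals (unlike eval)
        sequence = ast.literal_eval(row["sequence"]) if isinstance(row["sequence"], str) else row["sequence"]
        # NOTE: pad_or_truncate is not defined in this notebook; define or import it before running this cell
        sequence = pad_or_truncate(sequence, SEQ_LEN)
        return user_id, sequence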
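
The dataset class above calls `pad_or_truncate`, which this notebook does not define. Below is a minimal sketch of what such a helper might look like. It assumes left-padding with a padding ID of 0 and keeping the most recent items when truncating; both choices are assumptions, and the project's actual helper may behave differently.

```python
import ast

PAD_ID = 0  # assumption: item ID 0 is reserved for padding


def pad_or_truncate(sequence, max_len, pad_id=PAD_ID):
    """Return a list of exactly max_len item IDs (hypothetical helper)."""
    sequence = list(sequence)
    if len(sequence) >= max_len:
        # Keep only the most recent max_len items
        return sequence[len(sequence) - max_len:]
    # Left-pad so the most recent item stays at the end
    return [pad_id] * (max_len - len(sequence)) + sequence


# A sequence read from a CSV column arrives as a string
raw = "[5, 12, 7]"
items = ast.literal_eval(raw)

print(pad_or_truncate(items, 5))  # [0, 0, 5, 12, 7]
print(pad_or_truncate(items, 2))  # [12, 7]
```

Whichever padding side and padding ID are used here should match what the model saw during training.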

In [None]:
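# DataLoader
def collate_batch(batch):
    # Build (batch,) user IDs and (batch, SEQ_LEN) sequences; the default collate would split list sequences column-wise
    user_ids, sequences = zip(*batch)
    return torch.tensor(user_ids, dtype=torch.long), torch.tensor(sequences, dtype=torch.long)
movielens_data_dir = os.getenv("MOVIELENS_DATA_DIR", "/mnt/data")
test_dataset = MovieLensTestDataset(os.path.join(movielens_data_dir, "test.csv"))
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False, collate_fn=collate_batch)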

We will measure:

-   the size of the model on disk
-   the latency when doing inference on single samples
-   the throughput when doing inference on batches of data
-   and the test accuracy

#### Model size

We’ll start with model size, measured as the size on disk of the saved checkpoint, `models/SSE_PT10kemb.pth`.

In [None]:
model_size = os.path.getsize(model_path)
print(f"Model Size on Disk: {model_size / 1e6:.2f} MB")
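
The cell above divides by `1e6`, so it reports decimal megabytes. File managers and some tools report binary mebibytes instead, which gives a slightly smaller number for the same file. A small sketch of the difference, using a throwaway temporary file as a stand-in for a checkpoint (the helper names are illustrative, not part of the project):

```python
import os
import tempfile


def file_size_mb(path):
    """Size of a file in decimal megabytes (1 MB = 10**6 bytes)."""
    return os.path.getsize(path) / 1e6


def file_size_mib(path):
    """Size of a file in binary mebibytes (1 MiB = 2**20 bytes)."""
    return os.path.getsize(path) / 2**20


# Write a throwaway 2,000,000-byte file to measure
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * 2_000_000)
    demo_path = f.name

size_mb = file_size_mb(demo_path)
size_mib = file_size_mib(demo_path)
os.remove(demo_path)

print(f"{size_mb:.2f} MB, {size_mib:.2f} MiB")  # 2.00 MB, 1.91 MiB
```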

#### Test accuracy

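Next, we’ll run the model over the test data. Note that the loop below does not yet compare predictions against ground-truth labels: it counts every prediction as correct, so the reported accuracy is a placeholder that is always 100%.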

In [None]:
correct = 0
total = 0
with torch.no_grad():
    for user_ids, sequences in test_loader:
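        # as_tensor avoids a copy (and a warning) when the batch is already a tensor
        user_tensor = torch.as_tensor(user_ids, dtype=torch.long)
        seq_tensor = torch.as_tensor(sequences, dtype=torch.long)
        outputs = model(user_tensor, seq_tensor)
        predicted = torch.argmax(outputs, dim=1)
        total += predicted.size(0)
        # Placeholder: no labels are compared, so every prediction is counted as correct
        correct += predicted.size(0)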

accuracy = correct / total * 100
print(f"Accuracy (assumed dummy match): {accuracy:.2f}%")
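
The loop above cannot report a meaningful accuracy without a ground-truth target for each test row. The sketch below shows how a top-1 accuracy could be computed once targets are available. It assumes each row has a hypothetical target item ID (whether `test.csv` contains such a column is not shown in this notebook) and that the model returns one score per candidate item. Plain Python lists stand in for tensors, and the scores are made up.

```python
def top1_accuracy(scores, targets):
    """Count rows whose highest-scoring item index equals the target."""
    correct = 0
    for row_scores, target in zip(scores, targets):
        # Index of the largest score, i.e. an argmax over candidate items
        predicted = max(range(len(row_scores)), key=row_scores.__getitem__)
        correct += int(predicted == target)
    return correct, len(targets)


# Toy scores for 3 users over 4 candidate items (made-up numbers)
scores = [
    [0.1, 0.7, 0.1, 0.1],  # argmax = 1
    [0.3, 0.2, 0.4, 0.1],  # argmax = 2
    [0.6, 0.1, 0.2, 0.1],  # argmax = 0
]
targets = [1, 3, 0]  # hypothetical ground-truth items

correct, total = top1_accuracy(scores, targets)
print(f"Accuracy: {correct / total * 100:.2f}% ({correct}/{total} correct)")
# Accuracy: 66.67% (2/3 correct)
```

With tensors, the same count is usually written as `(predicted == targets).sum().item()` inside the evaluation loop.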

#### Inference latency

Now, we’ll measure how long it takes the model to return a prediction for a single sample. We will run 100 trials, and then compute aggregate statistics.

In [None]:
single_user, single_seq = test_dataset[0]
single_user = torch.tensor([single_user])
single_seq = torch.tensor([single_seq])

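# Keep the timed calls inside no_grad so autograd bookkeeping is not measured
with torch.no_grad():
    model(single_user, single_seq)  # warmup

    latencies = []
    for _ in range(100):
        # perf_counter is a monotonic, high-resolution clock intended for timing
        start = time.perf_counter()
        model(single_user, single_seq)
        latencies.append(time.perf_counter() - start)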

print(f"Inference Latency (median): {np.percentile(latencies, 50) * 1000:.2f} ms")
print(f"Inference Latency (95th): {np.percentile(latencies, 95) * 1000:.2f} ms")
print(f"Inference Latency (99th): {np.percentile(latencies, 99) * 1000:.2f} ms")
print(f"Throughput (single sample): {100 / np.sum(latencies):.2f} FPS")
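
The measurement pattern above (warm up, time each call, then summarize with percentiles) can be wrapped in a reusable helper. This is a sketch using only the standard library; `measure_latency` and `percentile` are illustrative names, and a cheap stand-in workload replaces the model call. The `percentile` function uses linear interpolation, which is also the default method of `np.percentile`.

```python
import time


def measure_latency(fn, n_trials=100, n_warmup=1):
    """Call fn repeatedly and return per-call wall-clock times in seconds."""
    for _ in range(n_warmup):
        fn()  # warm-up calls are not timed
    latencies = []
    for _ in range(n_trials):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies


def percentile(values, q):
    """q-th percentile (0 to 100) using linear interpolation."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


# Stand-in workload; in the notebook this would be the model's forward pass
latencies = measure_latency(lambda: sum(range(10_000)), n_trials=100)

print(f"median: {percentile(latencies, 50) * 1000:.3f} ms")
print(f"95th:   {percentile(latencies, 95) * 1000:.3f} ms")
print(f"99th:   {percentile(latencies, 99) * 1000:.3f} ms")
```

For the model, the call would look like `measure_latency(lambda: model(single_user, single_seq))`, run inside a `torch.no_grad()` block. Reporting the 95th and 99th percentiles alongside the median shows how much slower the worst calls are than a typical one.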


#### Batch throughput

Finally, we’ll measure the rate at which the model can return predictions for batches of data.

In [None]:
user_ids, sequences = next(iter(test_loader))
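user_tensor = torch.as_tensor(user_ids, dtype=torch.long)
seq_tensor = torch.as_tensor(sequences, dtype=torch.long)

# Keep the timed calls inside no_grad so autograd bookkeeping is not measured
with torch.no_grad():
    model(user_tensor, seq_tensor)  # warmup

    batch_times = []
    for _ in range(50):
        start = time.perf_counter()
        model(user_tensor, seq_tensor)
        batch_times.append(time.perf_counter() - start)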

batch_fps = (user_tensor.shape[0] * 50) / np.sum(batch_times)
print(f"Batch Throughput: {batch_fps:.2f} FPS")
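
The throughput figure above is the total number of samples processed divided by the total time spent processing them. A short worked example of that arithmetic, with made-up timings (50 batches of 32 samples at 20 ms each) and an illustrative helper name:

```python
def throughput(batch_size, batch_times):
    """Samples processed per second across all timed batches."""
    total_samples = batch_size * len(batch_times)
    return total_samples / sum(batch_times)


# Made-up timings: 50 batches of 32 samples, each taking 20 ms
batch_times = [0.020] * 50
rate = throughput(32, batch_times)

print(f"Batch throughput: {rate:.2f} samples/s")
print(f"Average time per sample: {1000 / rate:.3f} ms")
```

Comparing the average time per sample here with the single-sample latency shows how much batching amortizes per-call overhead.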


#### Summary of results

In [None]:
print(f"Model Size on Disk: {model_size/1e6:.2f} MB")
print(f"Accuracy: {accuracy:.2f}% ({correct}/{total} correct)")
print(f"Inference Latency (single sample, median): {np.percentile(latencies, 50) * 1000:.2f} ms")
print(f"Inference Latency (single sample, 95th percentile): {np.percentile(latencies, 95) * 1000:.2f} ms")
print(f"Inference Latency (single sample, 99th percentile): {np.percentile(latencies, 99) * 1000:.2f} ms")
print(f"Inference Throughput (single sample): {100/np.sum(latencies):.2f} FPS")
print(f"Batch Throughput: {batch_fps:.2f} FPS")

<!-- 

compute_gigaio 

  Model name:             AMD EPYC 7763 64-Core Processor
    CPU family:           25
    Model:                1
    Thread(s) per core:   2
    Core(s) per socket:   64

-->