## Exercise 3: Deploy a compressed model to mobile with ExecuTorch and ONNX

In this exercise, you'll prepare and deploy a compressed model using two popular mobile deployment frameworks: PyTorch's ExecuTorch and ONNX Runtime. You'll compare their performance characteristics to make an informed deployment decision.

> **Task**: Convert a compressed image classification model to both ExecuTorch and ONNX formats, and compare their performance for mobile deployment.
> 
> **Goal**: By the end of this exercise, you'll understand the tradeoffs between different mobile deployment frameworks and learn how to optimize models for specific target platforms.
> 
> **Scenario**: Your team has developed an image classification model that can run directly on mobile devices to protect customer privacy and stay responsive even with poor connectivity. You've been tasked with preparing the compressed model for deployment to various Android and iOS devices.
> 
> **Tools**: pytorch, executorch, onnx, onnxruntime, matplotlib
> <br> _Prior experience recommended!_
> 
> **Estimated Time**: 15 minutes

## Step 1: Setup

First, let's set up our environment with the necessary libraries.

In [1]:
# # Uncomment to install necessary libraries, then comment out the cell block again and restart the notebook
# ! pip install --upgrade torch torchvision executorch onnx onnxruntime flatbuffers typing_extensions

In [2]:
# Import libraries
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile
import torchvision
import torchvision.transforms as transforms
from torchvision.models import mobilenet_v3_small, MobileNet_V3_Small_Weights
from torch.export import Dim, export
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.runtime import Runtime
import onnx
import onnxruntime as ort
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from PIL import Image
import time
import os
import json
from pathlib import Path

# For reproducibility
torch.manual_seed(42)
np.random.seed(42)

# TODO: Set up the device to CPU for simulating mobile conditions
# Hint: Use torch.device with the appropriate string argument
device = torch.device("cpu")  # Add your code here

# Create output directory
output_dir = Path("assets/exercise3")
output_dir.mkdir(parents=True, exist_ok=True)

# Ensure libraries are in path
os.environ["PATH"] = f"{os.path.expanduser('~/.local/bin')}:" + os.environ["PATH"]

print(f"Device: {device}")
print(f"PyTorch version: {torch.__version__}")
print(f"Torchvision version: {torchvision.__version__}")
print(f"ONNX Runtime version: {ort.__version__}")
print("Libraries imported successfully!")

Device: cpu
PyTorch version: 2.7.0+cu126
Torchvision version: 0.22.0+cu126
ONNX Runtime version: 1.22.0
Libraries imported successfully!


> **Why are we using CPU instead of GPU?**
> We are deliberately using CPU for benchmarking to simulate our mobile device conditions. By using CPU, we're creating a more realistic test environment that aligns with the constraints you'll face in the real-world deployment.

## Step 2: Load the pre-compressed model

We'll use a pre-trained and compressed MobileNetV3-Small model, which is already optimized for mobile deployment.

In [3]:
# Load pre-trained model
weights = MobileNet_V3_Small_Weights.DEFAULT
model = mobilenet_v3_small(weights=weights)

# TODO: Move the model to the device 
# Hint: Use one of the PyTorch model's built-in methods with the `device` variable set up in Step 1.
# See: https://docs.pytorch.org/docs/stable/generated/torch.nn.Module.html
model = model.to(device)  # Add your code here

# TODO: Set the model to evaluation mode
# Hint: PyTorch models have an evaluation mode for inference
# See: https://docs.pytorch.org/docs/stable/generated/torch.nn.Module.html
model.eval()  # Add your code here

# Calculate model size and parameters
param_size = sum(p.numel() * p.element_size() for p in model.parameters()) / (1024 * 1024)
buffer_size = sum(b.numel() * b.element_size() for b in model.buffers()) / (1024 * 1024)
model_size = param_size + buffer_size

print(f"Model: MobileNetV3-Small")
print(f"Number of parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Model size: {model_size:.2f} MB")

# Save ImageNet class labels to a JSON file for later lookup
with open(output_dir / "imagenet_classes.json", "w") as f:
    class_idx = json.dumps({str(i): weights.meta["categories"][i] for i in range(1000)})
    f.write(class_idx)

# Define image preprocessing pipeline
preprocess = weights.transforms()

Downloading: "https://download.pytorch.org/models/mobilenet_v3_small-047dcff4.pth" to /home/student/.cache/torch/hub/checkpoints/mobilenet_v3_small-047dcff4.pth


100%|██████████| 9.83M/9.83M [00:00<00:00, 80.7MB/s]


Model: MobileNetV3-Small
Number of parameters: 2,542,856
Model size: 9.75 MB
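
The reported size follows almost directly from the parameter count. The sketch below redoes that arithmetic in plain Python, assuming 4 bytes per float32 weight; `size_mb` is a hypothetical helper, and the FP16 and INT8 lines are rough estimates that ignore buffers and any quantization metadata (scales, zero-points).

```python
# Back-of-the-envelope size estimate from the parameter count printed above.
NUM_PARAMS = 2_542_856        # "Number of parameters" from the cell output
BYTES_PER_MB = 1024 * 1024


def size_mb(num_values: int, bytes_per_value: int) -> float:
    """Storage for `num_values` numbers of a given byte width, in MB."""
    return num_values * bytes_per_value / BYTES_PER_MB


fp32_mb = size_mb(NUM_PARAMS, 4)  # float32: 4 bytes per weight
fp16_mb = size_mb(NUM_PARAMS, 2)  # float16: 2 bytes per weight
int8_mb = size_mb(NUM_PARAMS, 1)  # int8: 1 byte per weight (estimate only)

print(f"FP32 parameters: {fp32_mb:.2f} MB")
print(f"FP16 estimate:   {fp16_mb:.2f} MB")
print(f"INT8 estimate:   {int8_mb:.2f} MB")
```

The FP32 figure lands slightly below the 9.75 MB reported above because the notebook also counts buffers (for example, batch-norm running statistics).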


> **Why MobileNetV3-Small?** MobileNetV3 represents a family of models specifically designed for mobile deployment. These models use several compression techniques:
> 
> - *Depthwise Separable Convolutions* - Factorizes standard convolutions into depthwise and pointwise operations, reducing computation by 8-9x
> - *Squeeze-and-Excitation* - Adds channel attention mechanisms with minimal overhead
> - *Architectural Search* - The model architecture was optimized using Neural Architecture Search (NAS)
> - *Activation Functions* - Uses h-swish activation, which is a hardware-friendly approximation of swish
> 
> These built-in optimizations make it an excellent starting point for mobile deployment, requiring less manual compression compared to standard architectures.
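
To see where the "8-9x" figure for depthwise separable convolutions comes from, the sketch below counts multiply-accumulate operations (MACs) for a standard convolution and for its depthwise-plus-pointwise factorization. It assumes a 3x3 kernel, stride 1, and an output feature map of the same spatial size; the function names and layer sizes are illustrative only and are not taken from MobileNetV3 itself.

```python
def standard_conv_macs(k: int, c_in: int, c_out: int, h: int, w: int) -> int:
    """MACs for a k x k standard convolution over an h x w output map."""
    return k * k * c_in * c_out * h * w


def depthwise_separable_macs(k: int, c_in: int, c_out: int, h: int, w: int) -> int:
    """MACs for a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in * h * w   # one k x k filter per input channel
    pointwise = c_in * c_out * h * w   # 1 x 1 conv mixes the channels
    return depthwise + pointwise


reductions = {}
for c_out in (64, 256, 1024):
    standard = standard_conv_macs(3, 64, c_out, 56, 56)
    separable = depthwise_separable_macs(3, 64, c_out, 56, 56)
    reductions[c_out] = standard / separable
    print(f"c_out={c_out:5d}: {reductions[c_out]:.2f}x fewer MACs")
```

The ratio simplifies to `1 / (1/c_out + 1/k**2)`, so with 3x3 kernels it approaches 9x as the number of output channels grows.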

## Step 3: Define sample inputs for testing

Let's load and prepare a sample image for testing our model conversions.

In [4]:
def prepare_sample_image(size=224, batch_size=1):
    """Create sample image data for inference testing."""
    # Create a random RGB image tensor
    image = torch.rand(batch_size, 3, size, size, device=device)
    return image

# Create sample input
sample_input = prepare_sample_image(size=224, batch_size=1)
print(f"Sample input shape: {sample_input.shape}")

# TODO: Move the input tensor to CPU for compatibility with deployment frameworks
# Hint: PyTorch tensors have a built-in method for this functionality
# See: https://docs.pytorch.org/docs/stable/tensors.html 
sample_input = sample_input.cpu()  # Add your code here

# Test the original model with the sample input
with torch.no_grad():
    original_output = model(sample_input)
    
print(f"Original model output shape: {original_output.shape}")
print(f"Top prediction: {original_output.argmax(dim=1).item()}")

Sample input shape: torch.Size([1, 3, 224, 224])
Original model output shape: torch.Size([1, 1000])
Top prediction: 21


> **Should we use random data?** In production, preferably not. Synthetic data works well for pure performance benchmarking, but you should also test with representative real-world data to validate model accuracy.

## Step 4: Export model to ONNX format
Now, let's export the PyTorch model to ONNX format. 

In our example, we have assumed the model to already have been compressed. In practice, you'd typically run post-training quantization (to INT8!) and/or graph optimizations as part of the export.
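
As a rough illustration of what quantizing to INT8 means, the sketch below applies symmetric per-tensor quantization to a handful of made-up weights in plain Python. This is a simplified teaching example under stated assumptions, not the procedure that PyTorch or ONNX Runtime quantization tooling follows (those typically add calibration and per-channel scales); `quantize_int8` and `dequantize` are hypothetical helpers.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.003, 0.9, -0.55]  # made-up example weights
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"scale:     {scale:.4f}")
print(f"int8:      {q_weights}")
print(f"max error: {max_error:.5f}")
```

Each weight is stored in one byte instead of four, and the rounding error per weight is bounded by half the scale, which is the accuracy cost you would validate on real data.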

**Note**: We don't need to export to a different format for ExecuTorch which, unlike its predecessor PyTorch Mobile, does not require conversion to TorchScript!

In [5]:
def export_to_onnx(model, sample_input, onnx_path):
    """Export model to ONNX format."""
    # Create directory if it doesn't exist
    os.makedirs(os.path.dirname(onnx_path), exist_ok=True)
    
    # TODO: Step 1. Define dynamic axes for batch and input dimensions
    # Hint: The dictionary should define which dimensions can vary (like batch size)
    # See: https://pytorch.org/docs/stable/onnx.html#torch.onnx.export
    dynamic_axes = {
        'input': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    }  # Add your code here 
    
    # Step 2. Export the model to ONNX format
    with torch.no_grad():
        torch.onnx.export(
            model,               # model being run
            sample_input,        # model input (or a tuple for multiple inputs)
            onnx_path,           # where to save the model
            export_params=True,  # store the trained parameter weights inside the model file
            opset_version=12,    # the ONNX opset version to target
            input_names=['input'],     # the model's input names
            output_names=['output'],   # the model's output names
            dynamic_axes=dynamic_axes   # variable length axes
        )
    
    # TODO: Step 3: Verify the model structure and check for errors
    # Hint: You only need one single call to a built-in onnx method!
    # See: https://onnx.ai/onnx/api/checker.html
    onnx_model = onnx.load(onnx_path)
    onnx.checker.check_model(onnx_model)  # Add your code here 
    
    print(f"ONNX model saved to {onnx_path}")
    print(f"ONNX model size: {os.path.getsize(onnx_path) / (1024 * 1024):.2f} MB")
    
    return onnx_path

# Export the model to ONNX
onnx_path = str(output_dir / "models/mobilenetv3_small.onnx")
exported_onnx_path = export_to_onnx(model, sample_input, onnx_path)

ONNX model saved to assets/exercise3/models/mobilenetv3_small.onnx
ONNX model size: 9.71 MB


> **Understanding ONNX export parameters**: Export parameters are crucial for handling the real-world conditions you expect for your application once deployed.
> 
> - *dynamic_axes*: Allows the model to handle varying input dimensions (like different batch sizes)
> - *opset_version*: Defines which ONNX operations are available (higher versions support more operations)
> - *do_constant_folding*: Pre-computes constant expressions at export time to optimize inference (not set explicitly in the call above)
> - *input_names/output_names*: Names tensors for easier integration with inference engines
> 
> _**IMPORTANT**_: Operators not supported by your chosen opset version can cause export failures, a common issue when deploying to mobile.

## Step 5: Prepare model for mobile deployment

Let's optimize both models specifically for mobile deployment. 

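Mobile devices have unique constraints like limited battery life, restricted memory, and diverse hardware capabilities. Both frameworks address some of these constraints out of the box: ExecuTorch lowers the model to the XNNPACK CPU backend, and ONNX Runtime applies graph-level optimizations when a session is created.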

In [6]:
def export_to_executorch(model, sample_input, output_path):
    """Convert PyTorch model to Executorch format."""
    # Create directory if it doesn't exist
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    
    print("Exporting model to Executorch...")

    # TODO: Step 1: Define dynamic shapes for varying input sizes
    # Hint: Use Dim objects to define min/max ranges for each dimension
    # See: https://pytorch.org/docs/stable/export.html    
    dynamic_shapes = {
        # For MobileNetV3, let's assume height and width can vary between 224 and 640
        "x": {
            2: Dim("h", min=224, max=640),
            3: Dim("w", min=224, max=640),
        }
    }  # Add your code here 
    
    # Step 2: Export the model using torch.export
    exported_program = export(model, (sample_input,), dynamic_shapes=dynamic_shapes)
    
    # TODO: Step 3: Transform to mobile and lower with XNNPACK partitioner
    # Hint: The required methods have already been imported in Step 1!
    # See: https://docs.pytorch.org/executorch/stable/export-to-executorch-api-reference.html
    executorch_program = to_edge_transform_and_lower(
        exported_program,
        partitioner=[XnnpackPartitioner()]
    ).to_executorch()
    
    # Step 4: Save the program to a file
    with open(output_path, "wb") as f:
        f.write(executorch_program.buffer)
        
    print(f"Executorch model saved to {output_path}")
    print(f"Executorch model size: {os.path.getsize(output_path) / (1024 * 1024):.2f} MB")

# Export the model to ExecuTorch for mobile deployment
executorch_path = str(output_dir / "models/mobilenetv3_small.pte")
export_to_executorch(
    model, sample_input, executorch_path
)

Exporting model to Executorch...
Executorch model saved to assets/exercise3/models/mobilenetv3_small.pte
Executorch model size: 9.76 MB


In [7]:
def optimize_onnx_for_mobile(onnx_path):
    """Optimize ONNX model for mobile deployment."""
    # Load the ONNX model
    onnx_model = onnx.load(onnx_path)
    
    # Create optimized model path
    optimized_path = onnx_path.replace(".onnx", "_optimized.onnx")
    
    # TODO: Configure session options for mobile-specific optimizations
    # Hint: You should set `.graph_optimization_level` at minimum. Also, how do you save the converted model?
    # See: https://onnxruntime.ai/docs/api/python/api_summary.html#sessionoptions
    sess_options = ort.SessionOptions()
    # Add your code here
    sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
    sess_options.optimized_model_filepath = optimized_path
    
    # Create a session with the options (this will save the optimized model)
    _ = ort.InferenceSession(onnx_path, sess_options)
    
    print(f"Optimized ONNX model saved to {optimized_path}")
    print(f"Optimized ONNX model size: {os.path.getsize(optimized_path) / (1024 * 1024):.2f} MB")
    
    return optimized_path

# Optimize the ONNX model for mobile
optimized_onnx_path = optimize_onnx_for_mobile(exported_onnx_path)

Optimized ONNX model saved to assets/exercise3/models/mobilenetv3_small_optimized.onnx
Optimized ONNX model size: 9.71 MB


## Step 6: Benchmark the model with both frameworks

Now, let's compare the performance of the models with both frameworks.

**IMPORTANT**: While mathematically equivalent, the reorderings applied by model conversion to ONNX and ExecuTorch can cause small numerical differences. As long as these differences are extremely small (in the 10^-5 to 10^-6 range), this level of variation can be ignored.
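
The benchmark code below checks this with `torch.allclose(original_output, converted_output, rtol=1e-3, atol=1e-5)`. As a minimal sketch of what that test means, the pure-Python version here applies the same element-wise rule, `|a - b| <= atol + rtol * |b|`, to a few made-up values; `all_close` is a hypothetical stand-in, not the PyTorch implementation.

```python
def all_close(inputs, others, rtol=1e-3, atol=1e-5):
    """True if every pair satisfies |a - b| <= atol + rtol * |b|."""
    return all(abs(a - b) <= atol + rtol * abs(b) for a, b in zip(inputs, others))


reference = [2.5, -0.75, 0.0]               # made-up "original model" outputs
converted = [2.500014, -0.750007, 0.000004]  # made-up "converted model" outputs

max_abs_diff = max(abs(a - b) for a, b in zip(reference, converted))
print(f"max abs diff: {max_abs_diff:.6f}")
print(f"within tolerance: {all_close(reference, converted)}")

# Near zero the relative term vanishes, so only atol protects the comparison.
print(f"near-zero mismatch passes: {all_close([0.0], [0.00002])}")
```

The relative term scales with the magnitude of the converted output, so large logits tolerate larger absolute drift, while values near zero must agree to within `atol`.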

In [8]:
def benchmark_executorch(model_path, sample_input, n_runs=5):
    """Benchmark Executorch model performance."""
    
    # TODO: Load and initialize the Executorch model
    # Hint: Think of Executorch like a minimal runtime engine for mobile inference.
    # See: https://docs.pytorch.org/executorch/stable/getting-started.html#testing-the-model
    print(f"Loading Executorch model from {model_path}...")
    runtime = Runtime.get()  # Add your code here 
    program = runtime.load_program(model_path)  # Add your code here 
    method = program.load_method("forward")  # Add your code here 
    
    # Warmup
    print("Running warmup...")
    for _ in range(10):
        _ = method.execute([sample_input])
    
    # Benchmark
    print("Running benchmark...")
    latencies = []
    start_time = time.time()
    
    with torch.no_grad():
        for _ in range(n_runs):
            start = time.time()
            ts_output = method.execute([sample_input])
            latencies.append((time.time() - start) * 1000)  # ms
    
    total_time = time.time() - start_time

    # Compare outputs
    is_output_close = torch.allclose(original_output, ts_output[0], rtol=1e-3, atol=1e-5)
    max_accuracy_diff = (original_output - ts_output[0]).abs().max().item()
    
    # Calculate metrics
    metrics = {
        "framework": "Executorch",
        "accuracy_check": is_output_close,
        "accuracy_diff": max_accuracy_diff,
        "avg_latency_ms": np.mean(latencies),
        "p90_latency_ms": np.percentile(latencies, 90),
        "max_latency_ms": np.max(latencies),
        "min_latency_ms": np.min(latencies),
        "throughput": n_runs / total_time,
        "model_size_mb": os.path.getsize(model_path) / (1024 * 1024)
    }
    
    return metrics

def benchmark_onnx(model_path, sample_input, n_runs=5):
    """Benchmark ONNX model performance."""
    
    # TODO: Define the execution configuration for ONNX to simulate a mobile environment
    # Hint: 
    # - How many CPU threads are typically available for an app?
    # - Would you want to use GPU acceleration here?
    # - What kind of optimization level balances speed and compatibility?
    # - How should execution behave on low-core devices?
    # See: https://onnxruntime.ai/docs/performance/tune-performance/threading.html
    session_config = {
        "threads": 2,  # Add your code here
        "providers": ['CPUExecutionProvider'],  # Add your code here
        # ORT_ENABLE_ALL is also typically a safe option for more aggressive optimization; let's keep ORT_ENABLE_EXTENDED for easier debugging as a first run
        "optimization_level": ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED,  # Add your code here
        "execution_mode": ort.ExecutionMode.ORT_SEQUENTIAL  # Add your code here
    }

    # Create ONNX Runtime session
    sess_options = ort.SessionOptions()
    # In sequential execution mode, intra-op threads control parallelism inside each operator
    sess_options.intra_op_num_threads = session_config["threads"]
    sess_options.execution_mode = session_config["execution_mode"]
    sess_options.graph_optimization_level = session_config['optimization_level']
    session = ort.InferenceSession(model_path, sess_options, providers=session_config['providers'])
    
    # Prepare input on CPU (simulating mobile environment)
    sample_input_np = sample_input.numpy()
    
    # Get input name
    input_name = session.get_inputs()[0].name
    
    # Warmup
    for _ in range(10):
        session.run(None, {input_name: sample_input_np})
    
    # Benchmark
    latencies = []
    start_time = time.time()
    
    for _ in range(n_runs):
        start = time.time()
        onnx_output = session.run(None, {input_name: sample_input_np})
        latencies.append((time.time() - start) * 1000)  # ms
    
    total_time = time.time() - start_time

    # Compare outputs
    onnx_output_tensor = torch.from_numpy(onnx_output[0])
    is_output_close = torch.allclose(original_output, onnx_output_tensor, rtol=1e-3, atol=1e-5)
    max_accuracy_diff = (original_output - onnx_output_tensor).abs().max().item()
    
    # Calculate metrics
    metrics = {
        "framework": "ONNX Runtime",
        "accuracy_check": is_output_close,
        "accuracy_diff": max_accuracy_diff,
        "avg_latency_ms": np.mean(latencies),
        "p90_latency_ms": np.percentile(latencies, 90),
        "max_latency_ms": np.max(latencies),
        "min_latency_ms": np.min(latencies),
        "throughput": n_runs / total_time,
        "model_size_mb": os.path.getsize(model_path) / (1024 * 1024)
    }
    
    return metrics

# Benchmark both frameworks
print("Benchmarking ExecuTorch model...")
executorch_metrics = benchmark_executorch(executorch_path, sample_input)

print("Benchmarking ONNX Runtime model...")
onnx_metrics = benchmark_onnx(optimized_onnx_path, sample_input)

# Compare results
results_df = pd.DataFrame([executorch_metrics, onnx_metrics])
print("\nBenchmark Results:")
print(results_df.set_index('framework'))

[program.cpp:135] InternalConsistency verification requested but not available


Benchmarking ExecuTorch model...
Loading Executorch model from assets/exercise3/models/mobilenetv3_small.pte...
Running warmup...
Running benchmark...
Benchmarking ONNX Runtime model...

Benchmark Results:
              accuracy_check  accuracy_diff  avg_latency_ms  p90_latency_ms  \
framework                                                                     
Executorch              True       0.000014     2239.517164     4557.667303   
ONNX Runtime            True       0.000007      121.779823      162.108755   

              max_latency_ms  min_latency_ms  throughput  model_size_mb  
framework                                                                
Executorch       5396.149397      701.857805    0.446519       9.758087  
ONNX Runtime      199.659586       99.669218    8.211110       9.712790  


> **Why is ExecuTorch performing so poorly?** ExecuTorch is designed for mobile and edge devices, not desktop CPUs. Running it on a development machine—especially without hardware acceleration—can result in much slower performance compared to ONNX Runtime, which is optimized for desktop/server inference. Always test on representative target devices rather than relying solely on development machines. 
> <br> And, when it comes to mobile, test on a variety of devices! This is because performance varies significantly across device models: high-end phones might show minimal differences between frameworks, while budget devices often reveal larger gaps. 
> 
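> Running the inference benchmark here is just to demo the functionality: with only five timed runs, a single slow iteration skews the average (the ExecuTorch latencies above range from about 700 ms to 5,400 ms). You could consider using [ExecuTorch with CMake for more advanced profiling](https://docs.pytorch.org/executorch/stable/tutorial-xnnpack-delegate-lowering.html#running-the-xnnpack-model-with-cmake).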
>
> **Brainstorming question**: Different environments have different capabilities. How would you set the session configuration `session_config` for cloud and edge?
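
The benchmarks above time only five runs with `time.time()`. A slightly more robust harness might look like the sketch below, which uses the monotonic `time.perf_counter()` clock, more timed runs, and a linear-interpolation percentile equivalent to NumPy's default method. `benchmark` and `percentile` are hypothetical helpers, and the lambda is a placeholder workload standing in for `method.execute(...)` or `session.run(...)`.

```python
import statistics
import time


def percentile(samples, q):
    """Linear-interpolation percentile for q in [0, 100]."""
    ordered = sorted(samples)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def benchmark(fn, warmup=10, n_runs=50):
    """Time `fn` after a warmup phase and summarize latencies in milliseconds."""
    for _ in range(warmup):
        fn()
    latencies_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return {
        "avg_latency_ms": statistics.mean(latencies_ms),
        "p90_latency_ms": percentile(latencies_ms, 90),
        "min_latency_ms": min(latencies_ms),
        "max_latency_ms": max(latencies_ms),
        "throughput": n_runs / (sum(latencies_ms) / 1000),  # runs per second
    }


stats = benchmark(lambda: sum(range(10_000)))  # placeholder workload
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

With only five samples, the 90th percentile is interpolated between the two slowest runs, so one outlier dominates it; collecting more runs makes both the average and the tail figures more meaningful.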

## Step 7: Analyze mobile-specific considerations

Let's analyze some mobile-specific considerations for each framework that impact deployment decisions.

In [11]:
# Ensure the output is fully printed 
pd.set_option('display.max_colwidth', None)

def analyze_mobile_considerations(executorch_metrics, onnx_metrics):
    """Analyze mobile-specific considerations for both frameworks."""
    # Calculate differences
    latency_diff_pct = ((executorch_metrics["avg_latency_ms"] - onnx_metrics["avg_latency_ms"]) / 
                      onnx_metrics["avg_latency_ms"]) * 100
    size_diff_pct = ((executorch_metrics["model_size_mb"] - onnx_metrics["model_size_mb"]) / 
                   onnx_metrics["model_size_mb"]) * 100
    
    # Create comparison table
    considerations = {
        "Category": [
            "App Size Impact", 
            "Battery Usage", 
            "Integration Complexity",
            "Runtime Compatibility",
            "Update Flexibility"
        ],
        "ExecuTorch": [
            "May increase app size due to bundled ExecuTorch C++ runtime",
            "Depends on backend optimizations (e.g., XNNPACK)",
            "Easier if already using PyTorch; tight integration",
            "Requires ExecuTorch C++ runtime integration",
            "Supports updating .pte files without app rebuild"
        ],
        "ONNX Runtime": [
            "Smaller runtime possible; better for minimal builds",
            "Also backend-dependent; supports efficient threading",
            "More flexible; supports multiple languages/frameworks",
            "Requires packaging ONNX Runtime with app",
            "Model updates require ONNX export + possible conversion"
        ]
    }

    return pd.DataFrame(considerations)

# Analyze mobile considerations
considerations_df = analyze_mobile_considerations(executorch_metrics, onnx_metrics)
print("\nMobile-Specific Considerations:")
print(considerations_df)


Mobile-Specific Considerations:
                 Category  \
0         App Size Impact   
1           Battery Usage   
2  Integration Complexity   
3   Runtime Compatibility   
4      Update Flexibility   

                                                    ExecuTorch  \
0  May increase app size due to bundled ExecuTorch C++ runtime   
1             Depends on backend optimizations (e.g., XNNPACK)   
2           Easier if already using PyTorch; tight integration   
3                  Requires ExecuTorch C++ runtime integration   
4             Supports updating .pte files without app rebuild   

                                              ONNX Runtime  
0      Smaller runtime possible; better for minimal builds  
1     Also backend-dependent; supports efficient threading  
2    More flexible; supports multiple languages/frameworks  
3                 Requires packaging ONNX Runtime with app  
4  Model updates require ONNX export + possible conversion  
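
Note that `analyze_mobile_considerations` computes `latency_diff_pct` and `size_diff_pct` but never reports them. As a small worked example, the sketch below redoes that arithmetic using the average latency and model size copied from the Step 6 benchmark table; `pct_diff` is a hypothetical helper, and the numbers are specific to that one run on a development machine.

```python
# Values copied from the Step 6 benchmark output above.
executorch = {"avg_latency_ms": 2239.517164, "model_size_mb": 9.758087}
onnx_runtime = {"avg_latency_ms": 121.779823, "model_size_mb": 9.712790}


def pct_diff(value: float, baseline: float) -> float:
    """Relative difference of `value` against `baseline`, in percent."""
    return (value - baseline) / baseline * 100


latency_diff_pct = pct_diff(
    executorch["avg_latency_ms"], onnx_runtime["avg_latency_ms"]
)
size_diff_pct = pct_diff(
    executorch["model_size_mb"], onnx_runtime["model_size_mb"]
)

print(f"ExecuTorch latency vs. ONNX Runtime: {latency_diff_pct:+.0f}%")
print(f"ExecuTorch size vs. ONNX Runtime:    {size_diff_pct:+.2f}%")
```

On this run the file sizes differ by well under one percent while the latency gap is more than an order of magnitude, which is why the latency numbers deserve re-measurement on a real target device before drawing conclusions.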


> **Beyond performance: Mobile ML success factors**: While benchmarks provide valuable data, successful mobile ML deployment requires considering many other factors:
> 
> - *First-run latency*: Initial startup time can be significantly longer than steady-state
> - *Memory footprint*: Runtime memory usage (not just model size) affects app stability
> - *Battery consumption*: Especially important for background processing
> - *App size increase*: Each MB added to your app can reduce installation rates (roughly estimated at ~0.5% loss per MB; treat this figure as indicative only)
> - *OS compatibility*: Newer ML features may require recent OS versions
> - *Versioning strategy*: How you'll update models without requiring app updates
> 
> These considerations often outweigh small performance differences between frameworks. A slightly slower model that uses less memory or enables easier updates may be preferable in real-world scenarios.

## Conclusion

In this exercise, you've learned how to:

- Export a pre-trained MobileNetV3 model to both ExecuTorch and ONNX formats
- Optimize these models specifically for mobile deployment
- Benchmark and compare performance across different deployment frameworks
- Analyze mobile-specific considerations like battery usage and app size
- Make data-driven decisions about deployment frameworks and optimization techniques

These skills are essential for mobile ML engineers working on vision applications where efficient deployment is critical for user experience. By understanding the trade-offs between different frameworks and optimization techniques, you can make informed decisions for your specific use case.