# Exercise 3: Apply cross-platform hardware optimizations with ONNX Runtime

[ONNX Runtime](https://onnxruntime.ai/) is a widely used runtime for hardware-agnostic AI deployment. It lets developers optimize models across diverse hardware platforms (NVIDIA GPUs, Intel CPUs, mobile processors) using a single framework with consistent APIs, while still leveraging hardware-specific acceleration libraries for production-grade performance.

> **Overview:** You need to deploy a model across multiple cloud providers (AWS, Azure, GCP) and on-premise servers with different hardware configurations, with consistent performance and without vendor lock-in.
> 
> **Scenario:** Your startup's image classification service processes 100,000+ images daily across heterogeneous infrastructure. The DevOps team reports suboptimal performance: your PyTorch model achieves only 60% GPU utilization, while CPU-only deployments are bottlenecked by poor threading configuration. You need to maximize performance on each platform while avoiding vendor lock-in and maintaining deployment simplicity.
> 
> **Goal:** Explore ONNX Runtime's cross-platform optimization capabilities including I/O binding, thread management, memory optimization, and execution provider selection to achieve consistent performance across diverse hardware environments.
> 
> **Tools:** torch, torchvision, onnx, onnxruntime-gpu, numpy, pillow
> 
> **Estimated Time:** 15 minutes

## Step 1: Setup

Let's establish your cross-platform testing environment.

In [1]:
# # Uncomment to install necessary libraries, then comment out and restart
# ! pip install onnx onnxruntime-gpu==1.19.2 torchvision pillow

In [2]:
import os
import warnings
warnings.filterwarnings("ignore")

import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.models as models
import onnx
import onnxruntime as ort
import numpy as np
import time
from PIL import Image
import json
from datetime import datetime

# Create output directory
output_dir = "assets/exercise3"
os.makedirs(output_dir, exist_ok=True)

In [3]:
print("=== CROSS-PLATFORM DEPLOYMENT ENVIRONMENT ===")

# Check PyTorch and ONNX versions
print(f"PyTorch version: {torch.__version__}")
print(f"ONNX version: {onnx.__version__}")
print(f"ONNX Runtime version: {ort.__version__}")

# Verify GPU availability for cross-platform testing
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_properties = torch.cuda.get_device_properties(0)
    
    print(f"\n=== NVIDIA T4 HARDWARE ANALYSIS ===")
    print(f"GPU: {gpu_name}")
    print(f"Compute Capability: {gpu_properties.major}.{gpu_properties.minor}")
    print(f"Total Memory: {gpu_properties.total_memory / 1e9:.1f} GB")
    # The next three values are T4 datasheet figures, hard-coded rather than queried from the device
    print(f"CUDA Cores: ~2,560")
    print(f"Tensor Cores: ~320 (2nd gen)")
    print(f"Memory Bandwidth: ~320 GB/s")
    
    # Check available execution providers
    print(f"\n=== AVAILABLE EXECUTION PROVIDERS ===")
    available_providers = ort.get_available_providers()
    for provider in available_providers:
        print(f"✓ {provider}")
    
    # Verify TensorRT availability
    tensorrt_available = 'TensorrtExecutionProvider' in available_providers
    print(f"\nTensorRT Support: {'✓ Available' if tensorrt_available else '✗ Not Available'}")
    
else:
    print("CUDA not available - some optimizations will be CPU-only")

print("\n✓ Environment setup complete!")

=== CROSS-PLATFORM DEPLOYMENT ENVIRONMENT ===
PyTorch version: 2.5.0+cu124
ONNX version: 1.18.0
ONNX Runtime version: 1.19.2

=== NVIDIA T4 HARDWARE ANALYSIS ===
GPU: Tesla T4
Compute Capability: 7.5
Total Memory: 15.6 GB
CUDA Cores: ~2,560
Tensor Cores: ~320 (2nd gen)
Memory Bandwidth: ~320 GB/s

=== AVAILABLE EXECUTION PROVIDERS ===
✓ TensorrtExecutionProvider
✓ CUDAExecutionProvider
✓ CPUExecutionProvider

TensorRT Support: ✓ Available

✓ Environment setup complete!


> **What do we mean by execution providers?** ONNX Runtime's execution provider system enables hardware abstraction by separating model optimization logic from hardware-specific implementation.
> 
> Each provider _(CPU, CUDA, TensorRT, ...)_ uses the same ONNX graph but applies different optimization strategies—CPU providers focus on vectorization and threading, CUDA providers leverage parallel processing, while TensorRT providers add graph-level optimization and kernel fusion.
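
The priority-ordered provider list can be pictured with a small helper that filters a preferred chain down to what the current machine offers. This is a minimal sketch, not part of ONNX Runtime: `resolve_providers` is a hypothetical helper, and the `gpu_box` and `cpu_box` lists are hard-coded stand-ins for what `ort.get_available_providers()` would return.

```python
def resolve_providers(preferred, available):
    """Keep the preferred providers that are available, preserving priority order."""
    chain = [p for p in preferred if p in available]
    # The CPU provider is the last-resort fallback, so make sure it ends the chain.
    if "CPUExecutionProvider" not in chain:
        chain.append("CPUExecutionProvider")
    return chain


preferred = ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]

# Hard-coded for illustration; on a real machine use ort.get_available_providers().
gpu_box = ["CUDAExecutionProvider", "CPUExecutionProvider"]
cpu_box = ["CPUExecutionProvider"]

gpu_chain = resolve_providers(preferred, gpu_box)
cpu_chain = resolve_providers(preferred, cpu_box)
print(gpu_chain)
print(cpu_chain)
```

The resulting list is what you would pass as `providers=` when creating an `ort.InferenceSession`, so the same code path can serve GPU and CPU-only machines.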

## Step 2: Load model

For this exercise, we'll use [EfficientNet-B0](https://docs.pytorch.org/vision/main/models/generated/torchvision.models.efficientnet_b0.html) as it represents modern CNN architectures commonly deployed across platforms.

In [4]:
# Load pre-trained EfficientNet model
print("Loading EfficientNet-B0 for cross-platform optimization...")

# Load pre-trained EfficientNet-B0 model (`weights=` replaces the deprecated `pretrained=True`)
efficientnet_model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.DEFAULT)  # Add your code here

# Set model to evaluation mode for inference
efficientnet_model.eval()

print(f"Model loaded: {efficientnet_model.__class__.__name__}")
print(f"Total parameters: {sum(p.numel() for p in efficientnet_model.parameters()):,}")
print(f"Model size: {sum(p.numel() * p.element_size() for p in efficientnet_model.parameters()) / 1024**2:.1f} MB")

# Prepare sample input for testing
# EfficientNet-B0 expects 224x224 RGB images
batch_size = 32
input_shape = (batch_size, 3, 224, 224)
sample_input = torch.randn(input_shape)

print(f"\nSample input shape: {sample_input.shape}")
print(f"Input tensor size: {sample_input.numel() * sample_input.element_size() / 1024**2:.1f} MB")

# Verify model works with sample input
with torch.no_grad():
    output = efficientnet_model(sample_input)
    print(f"Output shape: {output.shape}")
    print(f"Model successfully processes input ✓")

Loading EfficientNet-B0 for cross-platform optimization...
Model loaded: EfficientNet
Total parameters: 5,288,548
Model size: 20.2 MB

Sample input shape: torch.Size([32, 3, 224, 224])
Input tensor size: 18.4 MB
Output shape: torch.Size([32, 1000])
Model successfully processes input ✓
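
The sizes printed above can be checked by hand: a float32 element takes 4 bytes, and the notebook divides by 1024² to report MB. The sketch below redoes that arithmetic with the shape and parameter count from the output; `tensor_size_mib` is a hypothetical helper introduced only for this check, and it assumes every element is float32.

```python
BYTES_PER_FLOAT32 = 4
MIB = 1024 ** 2


def tensor_size_mib(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Size of a dense tensor with the given shape, in units of 1024**2 bytes."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * bytes_per_element / MIB


input_mib = tensor_size_mib((32, 3, 224, 224))   # one batch of 32 RGB images
weights_mib = tensor_size_mib((5_288_548,))      # parameter count reported above
print(f"Input batch: {input_mib:.1f} MB")
print(f"Parameters:  {weights_mib:.1f} MB")
```

At this batch size the input batch is nearly as large as the model weights, which is one reason host-to-device copies matter for GPU inference.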


> **EfficientNet hardware-architecture characteristics**: EfficientNet's compound scaling and mobile-optimized blocks make it an excellent test case for cross-platform optimization, as different hardware platforms will benefit from different optimization strategies.

## Step 3: Convert to ONNX with optimization analysis

Convert the PyTorch model to ONNX format and analyze the computational graph for cross-platform deployment.

In [5]:
def convert_to_onnx_with_analysis(model, sample_input, output_path):
    """Convert PyTorch model to ONNX with detailed analysis"""
    
    print("=== PYTORCH TO ONNX CONVERSION ===")
    
    # TODO: Export PyTorch model to ONNX format
    # HINT: ONNX export creates a hardware-agnostic intermediate representation
    # Think about: How do you enable variable batch sizes for flexible deployment?
    # Key considerations: opset version compatibility, dynamic shape support, operator coverage
    # Reference: https://pytorch.org/docs/stable/onnx.html

    # Add your code here
    torch.onnx.export(
        model,
        sample_input, 
        output_path,
        export_params=True,
        opset_version=17,
        do_constant_folding=True,
        input_names=['input'],
        output_names=['output'],
        dynamic_axes={
            'input': {0: 'batch_size'},
            'output': {0: 'batch_size'}
        }
    )
    
    # Verify ONNX model
    onnx_model = onnx.load(output_path)
    onnx.checker.check_model(onnx_model)
    
    # Analyze model structure for cross-platform insights
    print(f"✓ ONNX export successful")
    print(f"ONNX model size: {os.path.getsize(output_path) / 1024**2:.1f} MB")
    print(f"ONNX opset version: {onnx_model.opset_import[0].version}")
    print(f"Total nodes: {len(onnx_model.graph.node)}")
    
    # Count operator types for provider compatibility analysis
    op_counts = {}
    for node in onnx_model.graph.node:
        op_counts[node.op_type] = op_counts.get(node.op_type, 0) + 1
    
    print(f"\nTop operator types:")
    for op_type, count in sorted(op_counts.items(), key=lambda x: x[1], reverse=True)[:5]:
        print(f"  {op_type}: {count} nodes")
    
    return output_path

# Convert model to ONNX
onnx_model_path = os.path.join(output_dir, "efficientnet_b0.onnx")
onnx_model_path = convert_to_onnx_with_analysis(efficientnet_model, sample_input, onnx_model_path)

=== PYTORCH TO ONNX CONVERSION ===
✓ ONNX export successful
ONNX model size: 20.2 MB
ONNX opset version: 17
Total nodes: 239

Top operator types:
  Conv: 81 nodes
  Sigmoid: 65 nodes
  Mul: 65 nodes
  GlobalAveragePool: 17 nodes
  Add: 9 nodes


> **What insights can we gather from the operator analysis?** The operator breakdown reveals EfficientNet's optimization-friendly architecture—Conv operations dominate the computational graph (ideal for GPU acceleration), while Sigmoid and Mul operations are well-supported across all execution providers. 
> 
> This operator distribution indicates the model will benefit significantly from TensorRT's convolution fusion optimizations while maintaining broad compatibility for CPU fallback scenarios.
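
To put the operator counts in proportion, the sketch below turns the five counts printed above into shares of the 239 graph nodes. Keep in mind that a node count is not a measure of compute: one `Conv` node can cost far more than one `Mul` node, so these shares only describe graph structure.

```python
# Counts copied from the conversion output above.
top_ops = {"Conv": 81, "Sigmoid": 65, "Mul": 65, "GlobalAveragePool": 17, "Add": 9}
total_nodes = 239

shares = {op: count / total_nodes for op, count in top_ops.items()}
for op, share in shares.items():
    print(f"{op:<18} {share:6.1%}")

# Nodes whose operator type is not among the top five.
other_nodes = total_nodes - sum(top_ops.values())
print(f"Other operators: {other_nodes} nodes")
```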

## Step 4: Execution provider performance comparison with default values

Let's test different execution providers to understand hardware abstraction trade-offs.

> **IMPORTANT**: You may need to install different onnxruntime versions depending on your chosen execution providers. By default, the notebook installs `onnxruntime-gpu` to support CUDA 12.x. For more details and other installation paths, please refer to the [Install ONNX Runtime](https://onnxruntime.ai/docs/install/) guide.

In [6]:
def benchmark_execution_provider(model_path, input_data, provider_config, num_warmup=5, num_runs=20):
    """Benchmark specific execution provider configuration"""
    
    # TODO: Create ONNX Runtime session with specified provider
    # HINT: We are passing the providers as a function input
    # Reference: https://onnxruntime.ai/docs/execution-providers/
    session = ort.InferenceSession(model_path, providers=provider_config)  # Add your code here
    
    # Get input/output names
    input_name = session.get_inputs()[0].name
    output_name = session.get_outputs()[0].name
    
    # Warm up the session
    input_dict = {input_name: input_data}
    for _ in range(num_warmup):
        _ = session.run([output_name], input_dict)
    
    # Benchmark inference
    times = []
    for _ in range(num_runs):
        start_time = time.perf_counter()
        outputs = session.run([output_name], input_dict)
        end_time = time.perf_counter()
        times.append(end_time - start_time)
    
    avg_time = np.mean(times)
    std_time = np.std(times)
    throughput = input_data.shape[0] / avg_time  # samples/sec
    
    # Get actual execution provider used
    used_providers = session.get_providers()
    
    return {
        'provider_config': provider_config,
        'used_providers': used_providers,
        'avg_latency_ms': avg_time * 1000,
        'std_latency_ms': std_time * 1000,
        'throughput_samples_sec': throughput,
        'output_shape': outputs[0].shape
    }

def run_provider_comparison():
    """Compare performance across different execution providers"""
    
    print("=== EXECUTION PROVIDER PERFORMANCE COMPARISON ===")
    
    # Prepare input data
    test_input = sample_input.numpy().astype(np.float32)
    
    # TODO: Define execution provider configurations to test
    # Hint: Each provider configuration is a list of execution providers in priority order. It acts as a fallback chain: if the 1st provider cannot be used, ONNX Runtime falls back to the 2nd, then the 3rd...
    # IMPORTANT: You may need to install specific onnxruntime packages / library versions to test beyond CPUExecutionProvider and CUDAExecutionProvider
    # Reference: https://onnxruntime.ai/docs/execution-providers/
    provider_configs = [
        ['CPUExecutionProvider'],
        ['CUDAExecutionProvider', 'CPUExecutionProvider']  # GPU first, fallback to CPU
    ] # Add your code here
    
    results = {}
    
    for i, provider_config in enumerate(provider_configs):
        provider_name = provider_config[0].replace('ExecutionProvider', '')
        print(f"\nTesting {provider_name}...")
        
        try:
            result = benchmark_execution_provider(onnx_model_path, test_input, provider_config)
            results[provider_name] = result
            
            print(f"  Used: {result['used_providers'][0]}")
            print(f"  Latency: {result['avg_latency_ms']:.1f} ± {result['std_latency_ms']:.1f} ms")
            print(f"  Throughput: {result['throughput_samples_sec']:.1f} samples/sec")
            
        except Exception as e:
            print(f"  ✗ Failed: {str(e)}")
            results[provider_name] = None
    
    return results

# Run provider comparison
provider_results = run_provider_comparison()

=== EXECUTION PROVIDER PERFORMANCE COMPARISON ===

Testing CPU...
  Used: CPUExecutionProvider
  Latency: 457.5 ± 12.6 ms
  Throughput: 69.9 samples/sec

Testing CUDA...
  Used: CUDAExecutionProvider
  Latency: 36.7 ± 0.1 ms
  Throughput: 871.7 samples/sec
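
The benchmark above reports mean latency and standard deviation. For a service, tail latency is often just as informative, so a possible extension is to summarize the same list of timings with a median and a 95th percentile. This is a sketch using only the standard library; `summarize_latencies` is a hypothetical helper and the timings in `example_times` are made up for illustration, not measured in this notebook.

```python
import statistics


def summarize_latencies(times_s, batch_size):
    """Summarize per-run wall-clock times (in seconds) for a fixed batch size."""
    ordered = sorted(times_s)
    mean_s = statistics.fmean(ordered)
    # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile.
    p95_s = statistics.quantiles(ordered, n=20, method="inclusive")[18]
    return {
        "mean_ms": mean_s * 1000,
        "median_ms": statistics.median(ordered) * 1000,
        "p95_ms": p95_s * 1000,
        "throughput_samples_sec": batch_size / mean_s,
    }


# Made-up timings with one slow outlier, for illustration only.
example_times = [0.036, 0.037, 0.037, 0.038, 0.052]
summary = summarize_latencies(example_times, batch_size=32)
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

A single slow run pulls the mean and the 95th percentile up while the median barely moves, which is why reporting both can be useful.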


> **Execution provider deep-dive for NVIDIA GPUs**: ONNX Runtime provides some official guidance on how to [choose execution providers](https://pkreg101.github.io/onnxruntime/docs/performance/choosing-execution-providers.html). For NVIDIA hardware, ONNX Runtime offers three execution providers with different optimization focuses:
> 
> - **CUDA EP**: Basic GPU acceleration using cuDNN/cuBLAS libraries
> - **TensorRT EP**: Advanced optimization with graph analysis and kernel fusion
> - **TensorRT RTX EP**: Consumer RTX-optimized version with faster compile times

## Step 5: Implement ONNX Runtime optimizations across platforms

Let's now implement targeted optimization strategies for each deployment environment. First we’ll optimize each environment separately before unifying into a single cross-platform configuration. Why this order?
- **GPU optimization**: Focus on maximizing expensive GPU instance utilization by eliminating I/O bottlenecks.  
- **CPU optimization**: Focus on threading & memory efficiency where resources are constrained.  
- **Unified strategy**: Combine both approaches into a flexible configuration that adapts to the hardware at runtime.

This progression helps you see *why* GPU and CPU need different strategies, before you merge them into one deployment recipe.

> **IMPORTANT**: This exercise focuses on session-level optimizations, which deliver a large share of the performance benefit with a simple configuration. However, each execution provider offers additional configuration parameters for fine-tuning performance:
> - [**CUDA Provider**](https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html): gpu_mem_limit, cudnn_conv_algo_search, arena_extend_strategy for memory and convolution optimization
> - [**TensorRT Providers**](https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html): Workspace size, precision modes, engine caching for compilation optimization
> - ...

> **EXPERT TIP**: ONNX Runtime provides [automated performance tuning](https://pkreg101.github.io/onnxruntime/docs/performance/performance-tuning-tools.html) tools. Here, you experiment manually to get hands-on understanding of how different configurations utilize specific hardware.
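
The CUDA provider parameters mentioned earlier in this step are passed as a `(provider name, options)` tuple inside the `providers` list. The sketch below only builds that list, so it runs without a GPU. The option names come from the CUDA provider documentation linked above; the 4 GiB limit and the other chosen values are illustrative assumptions, not tuned recommendations, and `cuda_provider_entry` is a hypothetical helper.

```python
GIB = 1024 ** 3


def cuda_provider_entry(device_id=0, mem_limit_gib=4):
    """Build a (provider name, options) tuple for the CUDA execution provider."""
    options = {
        "device_id": device_id,
        "gpu_mem_limit": mem_limit_gib * GIB,         # memory arena cap, in bytes
        "arena_extend_strategy": "kSameAsRequested",  # grow the arena only as needed
        "cudnn_conv_algo_search": "HEURISTIC",        # cheaper than an exhaustive search
    }
    return ("CUDAExecutionProvider", options)


# CPU stays at the end of the list as the fallback provider.
providers = [cuda_provider_entry(), "CPUExecutionProvider"]
print(providers)

# Usage (requires onnxruntime-gpu and the exported model):
# session = ort.InferenceSession(onnx_model_path, providers=providers)
```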


### A. Maximize GPU utilization

GPU utilization can be boosted by implementing I/O binding and GPU-focused optimizations. This is because memory transfers and suboptimal threading can leave GPU cores idle.

In [7]:
def optimize_for_gpu(model_path):
    """Optimize configuration for GPU - focus on maximizing GPU utilization"""
    
    print("=== GPU OPTIMIZATION (Target: >90% GPU utilization) ===")
    
    if 'CUDAExecutionProvider' not in ort.get_available_providers():
        print("CUDA not available - skipping GPU optimization")
        return None
    
    # Prepare input on GPU
    input_tensor = torch.randn(batch_size, 3, 224, 224, device='cuda', dtype=torch.float32)
    
    # TODO: Configure session for maximum GPU utilization
    # Hint: GPU optimization focuses on eliminating CPU↔GPU transfers and leveraging acceleration
    # Reference: https://onnxruntime.ai/docs/api/python/api_summary.html
    sess_options = ort.SessionOptions()
    sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL  # Add your code here
    # Note: ONNX Runtime may fall back to sequential execution when the CUDA provider is used
    sess_options.execution_mode = ort.ExecutionMode.ORT_PARALLEL  # Add your code here
    
    # Create session with inputs and outputs

    # TODO: Define providers for the optimized session
    # HINT: Which execution providers support GPUs? What can you fall back to?
    providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']  # Add your code here
    session = ort.InferenceSession(model_path, sess_options, providers=providers)  # Add your code here
    
    input_name = session.get_inputs()[0].name
    output_name = session.get_outputs()[0].name

    # Forward once to get output shape
    ort_input = input_tensor.cpu().numpy().astype(np.float32)
    dummy_out = session.run([output_name], {input_name: ort_input})
    output_shape = dummy_out[0].shape
    
    # Create input OrtValue on GPU  
    input_ortvalue = ort.OrtValue.ortvalue_from_numpy(input_tensor.cpu().numpy(), 'cuda', 0)
    output_ortvalue = ort.OrtValue.ortvalue_from_shape_and_type(output_shape, np.float32, 'cuda', 0)
    
    # TODO: Implement I/O binding for GPU memory optimization
    # Hint: You need to define binds for both input and output
    # Reference: https://onnxruntime.ai/docs/api/python/api_summary.html#iobinding
    io_binding = session.io_binding()  
    
    # Add your code here
    io_binding.bind_input(input_name, 'cuda', 0, np.float32, input_ortvalue.shape(), input_ortvalue.data_ptr())
    io_binding.bind_output(output_name, 'cuda', 0, np.float32, output_shape, output_ortvalue.data_ptr())
    
    # Warmup
    for _ in range(5):
        session.run_with_iobinding(io_binding)
        torch.cuda.synchronize()

    # Benchmark optimized configuration
    times = []
    for _ in range(15):
        torch.cuda.synchronize()
        start_time = time.perf_counter()
        session.run_with_iobinding(io_binding)
        torch.cuda.synchronize()
        end_time = time.perf_counter()
        times.append(end_time - start_time)
    
    avg_time = np.mean(times)
    throughput = input_tensor.shape[0] / avg_time
    
    print(f"Baseline CUDA latency: {provider_results['CUDA']['avg_latency_ms']:.1f} ms "
        f"({provider_results['CUDA']['throughput_samples_sec']:.1f} samples/sec)")
    print(f"✓ Optimized latency: {avg_time * 1000:.1f} ms "
        f"({throughput:.1f} samples/sec)")
    print(f"✓ Improvement vs baseline: {throughput / provider_results['CUDA']['throughput_samples_sec']:.2f}x throughput")
    
    return {
        'latency_ms': avg_time * 1000,
        'throughput_samples_sec': throughput,
        'optimization': 'I/O binding + GPU-focused configuration'
    }

# Optimize for GPU scenario
gpu_result = optimize_for_gpu(onnx_model_path)

=== GPU OPTIMIZATION (Target: >90% GPU utilization) ===
Baseline CUDA latency: 36.7 ms (871.7 samples/sec)
✓ Optimized latency: 33.6 ms (951.3 samples/sec)
✓ Improvement vs baseline: 1.09x throughput


> **TODO: Examine the GPU benchmark results.** How do I/O binding and session optimization affect throughput compared to the raw CUDA baseline? Did latency improve, worsen, or stay about the same?
> 
> _Write your answer here:_ The GPU optimization using I/O binding and session-level tuning improved throughput from 871.7 to 951.3 samples/sec, a ~9% increase. Latency also decreased (36.7 → 33.6 ms). This shows that keeping inputs and outputs in GPU memory and enabling full graph optimization yields a modest but measurable gain. Note that GPU utilization itself was not measured in this benchmark, only latency and throughput.
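
The percentages in the answer above follow directly from the printed numbers. As a quick check, the sketch below computes the relative change in throughput and latency; `relative_change` is a hypothetical helper written for this check.

```python
def relative_change(before, after):
    """Fractional change from `before` to `after` (positive means an increase)."""
    return (after - before) / before


# Values copied from the baseline and optimized GPU runs above.
throughput_gain = relative_change(871.7, 951.3)   # samples/sec
latency_change = relative_change(36.7, 33.6)      # milliseconds
print(f"Throughput: {throughput_gain:+.1%}")
print(f"Latency:    {latency_change:+.1%}")
```

The baseline run reported a standard deviation of about 0.1 ms, so a difference of roughly 3 ms is unlikely to be run-to-run noise alone, although the two benchmarks used different random inputs and run counts.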

### B. Optimize CPU performance

Address the poor CPU threading configuration on Standard instances through (i) efficient thread utilization without oversubscription and (ii) memory optimization.

In [8]:
def optimize_for_cpu(model_path):
    """Optimize configuration for CPU instances - focus on threading efficiency"""
    
    print("\n=== CPU OPTIMIZATION (Target: Optimal threading for available vCPUs) ===")
    
    # Define session
    sess_options = ort.SessionOptions()

    # Prepare input
    input_data = np.random.randn(batch_size, 3, 224, 224).astype(np.float32)

    # TODO: Configure optimal threading for limited CPU resources
    # Hint: Look for session parameters with the 'thread' pattern
    # Note that thread overhead can exceed benefits when CPU cores are limited. How many vCPUs does your environment have?
    # Key insight: intra-op = threads per operation, inter-op = parallel operations (needs branches in model)
    # Reference: https://onnxruntime.ai/docs/api/python/api_summary.html#sessionoptions

    # Add your code here
    sess_options.intra_op_num_threads = 2
    sess_options.inter_op_num_threads = 1
    sess_options.execution_mode = ort.ExecutionMode.ORT_PARALLEL  # Add your code here
    
    # TODO: Configure memory optimization for CPU efficiency
    # Hint: Look for session parameters with the 'mem' pattern
    # Reference: https://onnxruntime.ai/docs/api/python/api_summary.html#sessionoptions

    # Add your code here
    sess_options.enable_cpu_mem_arena = True
    sess_options.enable_mem_pattern = True
    sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL  # Add your code here
    
    # TODO: Create CPU-optimized session
    # HINT: You only need one provider for this session
    session = ort.InferenceSession(model_path, sess_options, providers=['CPUExecutionProvider'])  # Add your code here
    
    # Benchmark threading configuration
    input_name = session.get_inputs()[0].name
    output_name = session.get_outputs()[0].name
    input_dict = {input_name: input_data}
    
    # Warmup
    for _ in range(5):
        _ = session.run([output_name], input_dict)

    times = []
    for _ in range(15):
        start_time = time.perf_counter()
        outputs = session.run([output_name], input_dict)
        end_time = time.perf_counter()
        times.append(end_time - start_time)
    
    avg_time = np.mean(times)
    throughput = input_data.shape[0] / avg_time
    
    print(f"Baseline CPU latency: {provider_results['CPU']['avg_latency_ms']:.1f} ms "
        f"({provider_results['CPU']['throughput_samples_sec']:.1f} samples/sec)")
    print(f"✓ Optimized latency: {avg_time * 1000:.1f} ms "
        f"({throughput:.1f} samples/sec)")
    print(f"✓ Improvement vs baseline: {throughput / provider_results['CPU']['throughput_samples_sec']:.2f}x throughput")
    
    return {
        'latency_ms': avg_time * 1000,
        'throughput_samples_sec': throughput,
        'optimization': 'Threading + memory optimization for CPU'
    }

# Optimize for CPU scenario  
cpu_result = optimize_for_cpu(onnx_model_path)


=== CPU OPTIMIZATION (Target: Optimal threading for available vCPUs) ===
Baseline CPU latency: 457.5 ms (69.9 samples/sec)
✓ Optimized latency: 456.2 ms (70.1 samples/sec)
✓ Improvement vs baseline: 1.00x throughput


> **TODO: Examine the CPU benchmark results.** How did threading and memory optimizations affect throughput and latency compared to the CPU baseline? Were the improvements significant? Why did you choose these configuration values?
> 
> _Write your answer here_: CPU optimization using threading and memory tuning had a minimal impact: throughput improved only slightly from 69.9 to 70.1 samples/sec, and latency remained nearly unchanged (457.5 → 456.2 ms). This indicates that the default ONNX Runtime configuration was already near-optimal for this CPU environment, and manual threading adjustments provided little benefit for this workload and batch size. 
> 
> The selected conservative threading often outperforms aggressive parallelization due to reduced context switching overhead and thermal management on a CPU with few vCPUs.
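
Rather than hard-coding thread counts, one option is to derive them from the number of CPUs the process can see. The heuristic below is an assumption made for illustration, not an ONNX Runtime recommendation: it gives all usable CPUs to intra-op parallelism and keeps inter-op at 1, on the assumption that the graph is mostly sequential. `suggest_thread_counts` is a hypothetical helper.

```python
import os


def suggest_thread_counts(logical_cpus, reserve=0):
    """Return a conservative (intra_op, inter_op) pair for a CPU-only session."""
    usable = max(1, logical_cpus - reserve)   # never go below one thread
    intra_op = usable   # threads that cooperate inside a single operator
    inter_op = 1        # a mostly sequential graph gains little from more
    return intra_op, inter_op


detected = os.cpu_count() or 1   # os.cpu_count() can return None
intra_op, inter_op = suggest_thread_counts(detected)
print(f"{detected} logical CPUs -> intra_op={intra_op}, inter_op={inter_op}")

# Usage with the session options from this section:
# sess_options.intra_op_num_threads = intra_op
# sess_options.inter_op_num_threads = inter_op
```

`os.cpu_count()` counts logical CPUs and may not reflect container CPU limits, so on a shared or containerized host it is worth checking the value against the vCPUs actually allocated.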

### C. Define a balanced multi-platform strategy

Imagine you tried to create one balanced session configuration that works efficiently across different hardware environments. This **single configuration** would have to find the sweet spot between:
- GPU optimizations (I/O binding)
- CPU optimizations (threading and memory settings)

**TODO: Explain what would happen if you tried to use the same session options for both CPU and GPU.** Consider the impact on CPU throughput, GPU throughput, and overall deployment portability.

_Write your answers here:_ If we tried to use the same session options for both CPU and GPU, the GPU would likely be fine because it can tolerate defaults like sequential execution and memory arena usage. However, CPU performance could degrade significantly because it relies on proper threading and parallel execution to maximize throughput. Using GPU-optimized options on CPU could reduce throughput, increase latency, and underutilize CPU cores.

The best practice is to keep a single ONNX model export for portability, but allow platform-specific session options when performance matters. This ensures the model runs everywhere while still leveraging CPU and GPU optimizations independently.
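
One way to express "one model, per-platform session options" is a small table of profiles keyed by platform, selected at startup from the available providers. This is a sketch under stated assumptions: `SESSION_PROFILES` and `select_profile` are hypothetical names, the provider list passed in is hard-coded here rather than read from `ort.get_available_providers()`, and the values mirror the settings used in sections A and B.

```python
SESSION_PROFILES = {
    "gpu": {
        "providers": ["CUDAExecutionProvider", "CPUExecutionProvider"],
        "use_io_binding": True,
        "intra_op_num_threads": 0,   # 0 leaves the choice to ONNX Runtime
    },
    "cpu": {
        "providers": ["CPUExecutionProvider"],
        "use_io_binding": False,
        "intra_op_num_threads": 2,   # value used in section B
    },
}


def select_profile(available_providers):
    """Pick the GPU profile when the CUDA provider is present, else the CPU profile."""
    key = "gpu" if "CUDAExecutionProvider" in available_providers else "cpu"
    return key, SESSION_PROFILES[key]


# Hard-coded for illustration; in practice pass ort.get_available_providers().
key, profile = select_profile(["CPUExecutionProvider"])
print(key, profile["providers"])
```

The ONNX file stays identical across platforms; only the profile applied when the session is created changes.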

-----

> **TODO: What if we wanted to also deploy on mobile?**
> 
> Reflect on how ONNX Runtime's session should be configured for memory- and power-constrained devices like mobile phones or edge devices. Consider memory usage, optimization level, threading, and power efficiency vs. peak performance.
> 
> _Write your answers here:_ When deploying ONNX Runtime models to mobile devices, memory and power efficiency are critical. To optimize for these constraints:
> 
> - **Disable memory arena** (`enable_cpu_mem_arena=False`) to reduce peak memory usage on devices with limited RAM
> - **Use basic optimization level** to ensure broad operator support and avoid device-specific optimizations that may increase memory or compilation overhead
> - **Run in single-threaded mode** to preserve battery life, reduce thermal impact, and avoid context-switching overhead
> 
> Overall, mobile deployment prioritizes consistency, stability, and energy efficiency over maximum throughput, making conservative ONNX Runtime configurations the optimal choice.
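
The conservative settings listed above map onto `SessionOptions` attributes. The sketch below keeps them in a plain dictionary and copies them onto an options object; it is a Python-side illustration only, since on-device deployments typically go through the platform's own ONNX Runtime bindings. `MOBILE_SETTINGS` and `apply_settings` are hypothetical names, and the `onnxruntime` import is guarded so the helper can be read and tested without the package installed.

```python
MOBILE_SETTINGS = {
    "enable_cpu_mem_arena": False,                    # lower peak memory
    "intra_op_num_threads": 1,                        # single-threaded execution
    "inter_op_num_threads": 1,
    "graph_optimization_level": "ORT_ENABLE_BASIC",   # enum member name
}


def apply_settings(sess_options, settings, level_enum):
    """Copy settings onto a SessionOptions-like object and return it."""
    for name, value in settings.items():
        if name == "graph_optimization_level":
            value = getattr(level_enum, value)   # look up the enum member by name
        setattr(sess_options, name, value)
    return sess_options


try:
    import onnxruntime as ort
except ImportError:
    ort = None   # the helper can still be exercised with stand-in objects

if ort is not None:
    mobile_options = apply_settings(
        ort.SessionOptions(), MOBILE_SETTINGS, ort.GraphOptimizationLevel
    )
    print(mobile_options.intra_op_num_threads)
```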

## Conclusion

In this exercise, you have explored ONNX Runtime's cross-platform optimization capabilities.

You started from a simple CPU vs GPU baseline, and step by step explored how to optimize sessions for platform-specific execution (GPU utilization, threading/memory on CPU).

**Key insight**: A single ONNX model is sufficient for cross-platform deployment; session options should be tuned per platform to achieve optimal performance, while portability is maintained.

##### **Next cross-platform optimization challenges to explore:**

- **Advanced Provider Configuration**: Explore provider-specific optimization options for specialized scenarios.
- **Execution Providers Experimentation**: Explore other execution providers (e.g., TensorRT, OpenVINO, DirectML) for specialized hardware.  
- **On-device Benchmark**: Benchmark on **different hardware** (edge device, larger CPU server, or alternative GPU).