# UdaciMed | Notebook 3: Hardware Acceleration & Production Deployment

Welcome to the final phase of UdaciMed's optimization pipeline! In this notebook, you will implement hardware acceleration techniques and deploy your optimized model using production-grade inference infrastructure.

## Recap: Optimization Journey

In [Notebook 2](02_architecture_optimization.ipynb), you implemented architectural optimizations that brought you closer to your optimization targets.

Now, it is time to unlock further performance opportunities with hardware acceleration.

> **Your mission**: Transform your optimized model into a production-ready deployment that serves multiple healthcare systems simultaneously while maintaining clinical safety standards.

### Hardware acceleration

You will implement and evaluate **2 core deployment techniques\***:

1. **Mixed Precision (FP16)** - GPU tensor core acceleration
2. **Dynamic Batching** - Multi-tenant optimization via NVIDIA Triton

**TensorRT optimization** is recommended alongside both techniques.

Additionally, you will analyze two other deployment scenarios: CPU (OpenVINO) and Edge deployment considerations.

_\* Note that while you are expected to implement both deployment techniques, you can decide whether to keep either or both in your final deployment strategy to best achieve targets._

---

Through this notebook, you will:

- **Convert PyTorch model to ONNX** for cross-platform deployment
- **Deploy via NVIDIA Triton Inference Server** with optional TensorRT optimization
- **Benchmark end-to-end performance** using Triton's metrics APIs and Triton Model Analyzer
- **Validate clinical safety** across the deployment pipeline
- **Analyze alternative deployment strategies** for diverse hardware environments

**Let's create a production-ready, hardware-accelerated diagnostic deployment!**

## Step 1: Set up the environment

First, let's set up the environment and understand our hardware capabilities.

In [None]:
# Make sure that libraries are dynamically re-loaded if changed
%load_ext autoreload
%autoreload 2

In [None]:
# Import core libraries
import torch
import torch.nn as nn
import torch.cuda.amp as amp
import tensorrt as trt
import tritonclient.http as httpclient
from prometheus_client.parser import text_string_to_metric_families
import numpy as np
import onnx
import pickle
from pprint import pprint
import json
import time
import socket
import subprocess
import requests
import docker
import tempfile
import shutil
from pathlib import Path
import psutil
from typing import Dict, List, Optional, Tuple, Any, Literal
import warnings
warnings.filterwarnings('ignore')

# Import project utilities
from utils.data_loader import (
    load_pneumoniamnist,
    get_sample_batch
)
from utils.model import (
    create_baseline_model,
    get_model_info
)
from utils.evaluation import (
    evaluate_with_multiple_thresholds
)
from utils.profiling import (
    PerformanceProfiler,
    measure_time
)
from utils.visualization import (
    plot_performance_profile,
    plot_batch_size_comparison
)
from utils.architecture_optimization import (
    create_optimized_model
)

In [None]:
# Set device and analyze hardware capabilities
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name()}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
    
    # Check tensor core support for mixed precision - crucial for FP16 acceleration
    gpu_compute = torch.cuda.get_device_properties(0).major
    tensor_core_support = gpu_compute >= 7  # Volta+ architecture
    print(f"Tensor Core Support: {tensor_core_support}")
else:
    print("WARNING: CUDA not available - hardware acceleration will be limited")

print("Hardware acceleration environment ready!")

> **Hardware check**: Understanding your GPU's tensor core capabilities is crucial for mixed precision decisions. Tensor cores provide significant FP16 acceleration but require Volta+ architecture (compute capability ≥7.0).
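
The check above reduces to a comparison on the major compute capability. A minimal sketch of that rule as a standalone helper follows; the function name `supports_fp16_tensor_cores` is hypothetical, and in the notebook the tuple would come from `torch.cuda.get_device_capability(0)`.

```python
from typing import Tuple


def supports_fp16_tensor_cores(compute_capability: Tuple[int, int]) -> bool:
    """Return True for Volta or newer GPUs (compute capability >= 7.0)."""
    major, _minor = compute_capability
    return major >= 7


# Example capabilities: a T4 reports (7, 5), a Pascal-era P100 reports (6, 0).
for gpu_name, capability in {"T4": (7, 5), "P100": (6, 0)}.items():
    print(f"{gpu_name}: tensor cores = {supports_fp16_tensor_cores(capability)}")
```

On a GPU without tensor cores, FP16 still halves memory traffic but is unlikely to deliver the same compute speedup.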

## Step 2: Load test data and optimized model with configuration

You need the optimized model for deployment, and the optimization results from Notebook 2 as a baseline for comparison.

Test data is needed for both conversion and final performance testing.

In [None]:
# Define dataset loading parameters
img_size = 64
batch_size = 32

# Load test dataset for final evaluation
test_loader = load_pneumoniamnist(
    split="test", 
    download=True, 
    size=img_size,
    batch_size=batch_size,
    subset_size=None
)

# Get sample batch for profiling
sample_images, sample_labels = get_sample_batch(test_loader)
sample_images = sample_images.to(device)
sample_labels = sample_labels.to(device)

print(f"Test data loaded: {sample_images.shape} batch for hardware acceleration profiling")

> **Batch size strategy**: Your batch size choice impacts memory usage, latency, and throughput. 
> 
> Consider: What batch size balances efficiency gains with memory constraints for multi-tenant deployment? Don't forget to review the batch analysis plot from Notebook 2!
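
To get a feel for the memory side of this trade-off, the sketch below estimates the size of the input tensor alone for a few batch sizes, assuming 3-channel 64x64 images as used in this notebook. The helper name `input_batch_megabytes` is hypothetical, and the figures are only a lower bound: intermediate activations and the model weights usually dominate GPU memory.

```python
def input_batch_megabytes(batch_size: int, channels: int = 3, height: int = 64,
                          width: int = 64, bytes_per_value: int = 4) -> float:
    """Size of one input batch in MiB (4 bytes per value for FP32, 2 for FP16)."""
    return batch_size * channels * height * width * bytes_per_value / 1024 ** 2


for candidate in (1, 8, 32, 128):
    fp32_mb = input_batch_megabytes(candidate, bytes_per_value=4)
    fp16_mb = input_batch_megabytes(candidate, bytes_per_value=2)
    print(f"batch_size={candidate:>3}: FP32 input {fp32_mb:.3f} MiB, FP16 input {fp16_mb:.3f} MiB")
```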

In [None]:
# Load optimized model and results from notebook 2

# TODO: Define the experiment name
experiment_name = # String - Add your value here

with open(f'../results/optimization_results_{experiment_name}.pkl', 'rb') as f:
    optimization_results = pickle.load(f)

print("Loaded optimization results from Notebook 2:")
print(f"   Model: {optimization_results['model_name']}")
print(f"   Clinical Performance: {optimization_results['clinical_performance']['optimized']['sensitivity']:.1%} sensitivity")
print(f"   Architecture Speedup: {optimization_results['performance_improvements']['latency_speedup']:.2f}x")
print(f"   Memory Reduction: {optimization_results['performance_improvements']['memory_reduction_percent']:.1f}%")

> **HINT: Finding your optimization results**
> 
> Your optimization results from Notebook 2 should be saved as:
> - Results file: `../results/optimization_results_{experiment_name}.pkl`
> - Model weights: `../results/optimized_model.pth`
> 
> The experiment name typically combines your optimization techniques, like:
> - `"interpolation-removal_depthwise-separable"`
> - `"channel-reduction_grouped-conv"`
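
If you are unsure which experiment name you used, one option is to list the result files and strip the fixed prefix. This is a minimal sketch that assumes the `optimization_results_{experiment_name}.pkl` naming pattern shown above; `list_experiment_names` is a hypothetical helper, not part of the project utilities.

```python
from pathlib import Path
from typing import List


def list_experiment_names(results_dir: str = "../results") -> List[str]:
    """Return the experiment names found in results_dir, sorted alphabetically."""
    prefix = "optimization_results_"
    directory = Path(results_dir)
    if not directory.is_dir():
        return []
    # Path.stem drops the .pkl suffix; slicing drops the fixed prefix.
    return sorted(path.stem[len(prefix):] for path in directory.glob(f"{prefix}*.pkl"))


print(list_experiment_names())
```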

In [None]:
# Get the optimization configuration
opt_config = optimization_results['optimization_config']
optimized_model = None  

# TODO: Load the optimized model in the optimized_model variable
# HINT: This involves:
# > 1. Recreating the baseline model
# > 2. Applying the same architectural modifications using the saved optimization configuration
# > 3. Loading the trained weights
# See https://docs.pytorch.org/tutorials/beginner/saving_loading_models.html#saving-loading-model-for-inference for inspiration

# Add your code here

## Step 3: Convert model for production deployment

Convert the optimized model to [ONNX (Open Neural Network Exchange)](https://onnx.ai/). ONNX is the industry standard for model deployment because:
 - **Cross-platform compatibility**: Works with different inference engines
 - **Hardware optimization**: Enables automatic optimizations (TensorRT, OpenVINO)
 - **Production readiness**: Stable format for deployment pipelines

Review https://docs.pytorch.org/docs/stable/onnx.html for more details.
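
Dynamic batching on the server only works if the exported graph leaves the batch dimension variable. The sketch below shows one way to express that with the `dynamic_axes` argument of `torch.onnx.export`. The tensor names `"input"` and `"output"`, the opset version, and both helper names are assumptions for illustration; PyTorch is imported lazily so the first helper can be inspected without it.

```python
from typing import Dict


def build_dynamic_axes(input_name: str = "input",
                       output_name: str = "output") -> Dict[str, Dict[int, str]]:
    """Mark dimension 0 of both tensors as a named, variable-size batch axis."""
    return {input_name: {0: "batch_size"}, output_name: {0: "batch_size"}}


def export_with_dynamic_batch(model, sample_input, onnx_path: str,
                              opset_version: int = 17) -> None:
    """Export a model with a variable batch dimension (sketch, not the reference solution)."""
    import torch  # imported here so build_dynamic_axes works without PyTorch

    model.eval()
    torch.onnx.export(
        model,
        sample_input,
        onnx_path,
        input_names=["input"],
        output_names=["output"],
        dynamic_axes=build_dynamic_axes(),
        opset_version=opset_version,
    )


print(build_dynamic_axes())
```

Whatever tensor names you choose here must match the `input` and `output` entries in the Triton model configuration later on.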

In [None]:
# TODO: Define your deployment configuration
use_fp16 =  # Bool; Whether or not to use mixed precision - Add your value here
backend =  # String; One of onnxruntime or tensorrt - Add your value here

> **Mixed precision decision point**: You can implement FP16 mixed precision at multiple stages:
> 1. **PyTorch level**: Convert model weights before ONNX export
> 2. **ONNX level**: Use FP16 data types in ONNX graph
> 3. **TensorRT level**: Apply FP16 during engine optimization
> 
> Each approach has different memory and performance implications. Consider which aligns with your deployment targets.
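
Whichever stage you pick, FP16 stores fewer significant bits than FP32, which matters when a score sits close to a clinical decision threshold. The sketch below illustrates the rounding with Python's standard `struct` module (format code `e` is IEEE 754 half precision); `to_fp16` is a hypothetical helper and no model is involved.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


for original in (0.5, 0.1, 0.9995, 2049.0):
    rounded = to_fp16(original)
    print(f"{original!r:>8} -> {rounded!r} (abs. error {abs(rounded - original):.2e})")
```

This is why the clinical metrics should be re-validated after any FP16 conversion rather than assumed unchanged.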

In [None]:
# Convert PyTorch model to ONNX format (for cross-platform deployment)

def export_model_to_onnx(model: nn.Module, input_tensor: torch.Tensor, 
                        export_path: str, model_name: str = "pneumonia_detection", fp16_mode: bool = use_fp16) -> str:
    """
    Export PyTorch model to ONNX format for production deployment.
    
    Args:
        model: PyTorch model to export
        input_tensor: Sample input tensor for shape inference
        export_path: Directory to save the ONNX model
        model_name: Name for the exported model
        fp16_mode: Whether to convert to fp16
        
    Returns:
        Path to exported ONNX model
    """
    # Define output path, and ensure it exists
    onnx_path = f"{export_path}/{model_name}.onnx"
    Path(export_path).mkdir(parents=True, exist_ok=True)
    
    # TODO: Convert PyTorch model to ONNX format for cross-platform deployment
    # HINT: ONNX provides compatibility with TensorRT, OpenVINO, and other inference engines
    # Use torch.onnx.export with proper input shapes (how do you enable dynamic batching?) and opset version
    # If implementing fp16, think about whether conversion should happen in PyTorch, ONNX, and/or TensorRT

    # Add your code here

    # Verify ONNX model integrity - sanity check
    try:
        onnx_model = onnx.load(onnx_path)
        onnx.checker.check_model(onnx_model)
        print("   ONNX model verification passed")
    except Exception as e:
        print(f"   WARNING: ONNX verification failed: {str(e)}")

    return onnx_path

# Export the optimized model to ONNX (mixed precision only if use_fp16 is set)
onnx_model_path = export_model_to_onnx(
    model=optimized_model,
    input_tensor=sample_images,
    export_path="../results/onnx_models",
    model_name="udacimed_pneumonia_optimized"
)

## Step 4: Deploy with NVIDIA Triton Inference Server

Set up production-grade inference server with dynamic batching for multi-tenant scenarios and [TensorRT](https://developer.nvidia.com/tensorrt) acceleration.

[NVIDIA Triton Inference Server](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html) is the industry standard for AI model deployment because:
 - **Multi-framework support**: ONNX, TensorRT, PyTorch, TensorFlow
 - **Automatic optimization**: Built-in TensorRT acceleration for ONNX models
 - **Production features**: Dynamic batching, model versioning, metrics
 - **Scalability**: Multi-GPU, multi-model serving

Triton provides the infrastructure needed for UdaciMed's multi-tenant requirements, including automatic batching, load balancing, and performance monitoring that individual healthcare systems need.

### 1: Ensure docker is installed

Docker should already be installed in your environment, so you only need to verify this by invoking it.

If **docker** is not installed, you can install it following [official instructions](https://docs.docker.com/engine/install/) but only if you have root access to the environment. 

In [None]:
# Test docker is installed
! docker

### 2: Create the Triton model repository structure

Triton requires a specific directory structure with model configs. See https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html for details.

**Key configuration elements:**
- `platform`: Tells Triton which backend to use
- `input/output`: Tensor specifications (name, data type, dimensions). _Should these be FP32 or FP16?_
- `dynamic_batching`: Enables automatic batching for efficiency
- `optimization`: TensorRT acceleration settings.

Find additional information specific to the chosen platform in the [`Backends`](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/backend/README.html) documentation section of the Triton Inference server docs.

_While it is more common to create the model structure manually, we provide here a guided programming solution._
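
As an illustration of how the key configuration elements fit together, the sketch below renders a minimal `config.pbtxt` for an ONNX model. The tensor names, batch sizes, and queue delay are assumptions you would replace with your own values, `render_triton_config` is a hypothetical helper, and `version_policy` and `model_warmup` are left out for brevity.

```python
from typing import Sequence


def render_triton_config(model_name: str, max_batch_size: int = 32,
                         preferred_batch_sizes: Sequence[int] = (4, 8, 16),
                         max_queue_delay_us: int = 100,
                         data_type: str = "TYPE_FP32") -> str:
    """Return a minimal config.pbtxt; dims exclude the batch axis because max_batch_size > 0."""
    preferred = ", ".join(str(size) for size in preferred_batch_sizes)
    return f'''name: "{model_name}"
platform: "onnxruntime_onnx"
max_batch_size: {max_batch_size}
input [
  {{
    name: "input"
    data_type: {data_type}
    dims: [ 3, 64, 64 ]
  }}
]
output [
  {{
    name: "output"
    data_type: {data_type}
    dims: [ 2 ]
  }}
]
instance_group [
  {{
    count: 1
    kind: KIND_GPU
  }}
]
dynamic_batching {{
  preferred_batch_size: [ {preferred} ]
  max_queue_delay_microseconds: {max_queue_delay_us}
}}
'''


print(render_triton_config("udacimed_pneumonia_production"))
```

A longer queue delay gives the batcher more time to form preferred batch sizes at the cost of added per-request latency.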

In [None]:
# Define the model name in a variable, as it will be required across deployment steps
model_name = "udacimed_pneumonia_production" 

# Create the Triton model repository structure programmatically
def create_triton_model_repository(
    model_name: str = "udacimed_pneumonia",
    source_model_path: str = None,
    backend: str = "tensorrt",
    fp16_mode: bool = use_fp16,
    repository_base: str = "../deployment/"
) -> str:
    """
    Create NVIDIA Triton model repository with ONNX or TensorRT support.
    
    Args:
        model_name: Name for the model in Triton
        source_model_path: Path to ONNX model or TensorRT engine file
        backend: Backend to use ("tensorrt" or "onnxruntime")
        fp16_mode: Whether to convert to fp16
        repository_base: Base path for model repository
        
    Returns:
        Path to created model repository
    """
    print("Creating triton model repository...")
    
    # Validate backend argument
    valid_backends = ["tensorrt", "onnxruntime"]
    if backend not in valid_backends:
        raise ValueError(f"Backend must be one of {valid_backends}, got: {backend}")

    # Check that the model is in the provided repository
    if Path(source_model_path).exists():
        print(f"Model source available: {source_model_path}")
    else:
        raise FileNotFoundError(f"No valid model files found (ONNX or TensorRT) at {source_model_path}")
    
    # Create repository directory structure
    model_file = "model.onnx"
    repo_path = f"{repository_base}/{backend}/triton_model_repository"
    model_path = f"{repo_path}/{model_name}"
    version_path = f"{model_path}/1"
    
    print(f"Creating repository structure: {version_path}")
    Path(version_path).mkdir(parents=True, exist_ok=True)
    
    # Copy model file to repository
    dest_path = f"{version_path}/{model_file}"
    shutil.copy(source_model_path, dest_path)
    
    # Verify copy
    source_size = Path(source_model_path).stat().st_size
    dest_size = Path(dest_path).stat().st_size
    if source_size != dest_size:
        raise RuntimeError(f"Model file copy failed: {source_size} != {dest_size} bytes")
    print(f"Model copied: {source_size / 1024 / 1024:.1f}MB")
    
    # TODO: Complete the Triton configuration file
    # HINT: Find info at https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html.
    # At minimum, the configuration should contain these fields: name, platform, max_batch_size, version_policy, input, output, and instance_group. 
    # Consider advanced settings like dynamic_batching and model_warmup too.
    # For dynamic_batching, the `preferred_batch_size` array should reflect your expected traffic patterns and correctly take advantage of the GPU hardware (remember: power of two). 
    # For model warmup, this can help prevent cold start latency during the first inference requests. Including multiple batch sizes ensures optimal 
    # performance across different traffic patterns in a multi-tenant environment.
    config = f''''''  # Add your code here

    # TensorRT optimization (only for TensorRT backend)
    if backend=="tensorrt":
        optimization_config = f'''
optimization {{
  execution_accelerators {{
    gpu_execution_accelerator [
      {{
        name: "tensorrt"
        parameters {{
          key: "precision_mode"
          value: "{'FP16' if fp16_mode else 'FP32'}"
        }}
        parameters {{
          key: "max_workspace_size_bytes"
          value: "100000000"
        }}
      }}
    ]
  }}
}}
'''
        config+=optimization_config
    
    # Save configuration file
    config_path = f"{model_path}/config.pbtxt"
    with open(config_path, 'w') as f:
        f.write(config)
    print(f"Configuration created at {config_path}")
    
    # Display final configuration
    print(f"\nMODEL REPOSITORY SUMMARY:")
    print(f"   Repository: {repository_base}")
    print(f"   Model: {model_name}")
    print(f"   Backend: {'ONNX Runtime with TensorRT acceleration' if backend == 'tensorrt' else 'ONNX Runtime'}")
    print(f"   Model file: {model_file} ({source_size / 1024 / 1024:.1f}MB)")
    print(f"   Config: config.pbtxt ({Path(config_path).stat().st_size} bytes)")
    
    return repo_path


# Define Triton model repository
print(f"\nCreating repository for Triton...")
repo_path = create_triton_model_repository(
    model_name=model_name,
    source_model_path=onnx_model_path,
    backend=backend
)

### 3: Deploy the Triton server

Deploying the Triton server involves configuring and running an NVIDIA Triton Inference Server container using Docker. The server can handle different types of machine learning models, providing both HTTP and gRPC endpoints for inference requests. 

We provide Python starter code here so that you do not spend too much time on Docker setup and can focus on configuring the Triton server.

IMPORTANT: Pay attention to memory limits, GPU access, and port mappings. The configuration should balance resource allocation with the multi-tenant memory budget requirements.

_Note that you could also perform these actions using CLI._
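
For orientation, the sketch below assembles keyword arguments of the kind accepted by `docker.models.containers.ContainerCollection.run`. The image tag is deliberately left as a placeholder for you to pick from the support matrix, the `4g` memory limit is an arbitrary assumption, and `build_container_config` is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Any, Dict


def build_container_config(repository_path: str, image: str, ports: Dict[str, int],
                           mem_limit: str = "4g") -> Dict[str, Any]:
    """Assemble run() keyword arguments for a Triton container (sketch only)."""
    host_repository = str(Path(repository_path).resolve())  # Docker needs an absolute host path
    return {
        "image": image,
        "command": "tritonserver --model-repository=/models --strict-readiness=true",
        # Map each container port to the same port number on the host.
        "ports": {f"{port}/tcp": port for port in ports.values()},
        "volumes": {host_repository: {"bind": "/models", "mode": "ro"}},
        "runtime": "nvidia",
        "mem_limit": mem_limit,
        "detach": True,
    }


example_ports = {"http_port": 8000, "grpc_port": 8001, "metrics_port": 8002}
example_config = build_container_config(
    "../deployment/onnxruntime/triton_model_repository",
    image="nvcr.io/nvidia/tritonserver:<xx.yy>-py3",  # placeholder tag, choose your own
    ports=example_ports,
)
print(json.dumps(example_config, indent=2))
```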

In [None]:
def deploy_triton_server(
    repository_path: str,
    max_wait_time: int = 120
) -> Tuple[Optional[Any], Dict[str, int]]:
    """
    Deploy NVIDIA Triton Inference Server with production-ready configuration.
    
    Args:
        repository_path: Path to Triton model repository
        max_wait_time: Maximum wait time for server startup, in seconds
        
    Returns:
        container: Docker container object or None if deployment fails
        ports: Dict of port type to value for http_port, grpc_port, and metrics_port
    """
    print("Deploying Triton Inference Server...")
    
    try:
        # Define the docker client
        client = docker.from_env()
    
        # Pre-requisite: Stop any existing Triton containers
        existing_containers = client.containers.list(all=True)
        for container in existing_containers:
            if any('tritonserver' in tag.lower() for tag in container.image.tags):
                try:
                    print(f"   Stopping existing container: {container.name}")
                    container.stop()
                    container.remove()
                except Exception:
                    pass
        
        # TODO: Add missing entries in the configuration below for the container deployment
        # Note that, only for the first run, the function could take up to 5 minutes to download the image
        # HINT: `container_config` defines the configuration for the Docker image of Triton Inference Server. You can find information
        # on available options at https://docker-py.readthedocs.io/en/stable/containers.html#docker.models.containers.ContainerCollection.run.  
        # IMPORTANT: 
        # 1. Choose the right Triton Inference image for the environment by looking at the support matrix [here](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-25-06.html#rel-25-06). You can find out your CUDA version by running `nvidia-smi` from command line.
        # 2. At minimum, you should populate image, command, ports, volumes, runtime, detach: True
        # 3. Consider a good `mem_limit` for the current T4 machine you are on
        ports = {"http_port": 8000, "grpc_port": 8001, "metrics_port": 8002}
        container_config = {}  # Add your code here

        # Start the container with configuration
        print(f"Starting container...")
        container = client.containers.run(**container_config)
        print(f"    Container started: {container.short_id}")
        
        # Wait for the server to be ready by checking the health endpoint
        # Note that if you didn't set `strict-readiness=true`, you'd also need to wait for the `/models` endpoint to become complete
        print(f"Waiting for server to be ready...")
        for i in range(max_wait_time):
            try:
                health_endpoint = f"http://localhost:{ports['http_port']}/v2/health/ready"
                response = requests.get(health_endpoint, timeout=2)
                if response.status_code == 200:
                    print(f"   Server is ready!")
                    return container, ports
            except requests.exceptions.RequestException:
                pass
            time.sleep(1)

            if i % 10 == 9:
                print(f"      Still waiting... ({i+1}s)")
        
        # If we get here, server didn't start properly
        print(f"   ERROR: Server failed to start within {max_wait_time} seconds")
        print(f"   Container logs:")
        print(container.logs().decode('utf-8'))
        
        # Clean up failed attempt
        try:
            container.stop(timeout=5)
            container.remove()
        except:
            pass
            
        return None, ports
            
    except docker.errors.DockerException as e:
        print(f"ERROR: Docker operation failed")
        print(f"DIAGNOSIS: {str(e)}")
        print(f"SOLUTION: ")
        print(f"   1. Ensure Docker is running")
        print(f"   2. Check if you have GPU runtime installed")
        print(f"   3. Try CLI deployment instead")
        return None, {}

# Deploy Triton server
# TODO: Define the maximum wait time for the server to become ready
# Hint: Duration is dependent on the defined kick-off configuration processes - note that if the container errors, you will have to wait this time before the function exits
# So try to select a good wait time which is neither too aggressive nor too lenient (somewhere between 1 and 5 minutes)
max_wait_time =  # Int - Add your value here
container, ports = deploy_triton_server(
    repository_path=repo_path,
    max_wait_time=max_wait_time
)

### 4: Create the Triton client

The `TritonClient` can be used to handle inference and gather metrics for the created server.
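
Inference requests follow the KServe v2 JSON layout: a list of `inputs`, each with a `name`, `shape`, `datatype`, and flattened `data`. The sketch below builds such a payload from nested Python lists to stay dependency-free; with a NumPy array you would use its `shape` and flattened values instead. The tensor name `"input"` and all helper names are assumptions.

```python
import json
from typing import Any, Dict, List


def infer_shape(nested: Any) -> List[int]:
    """Derive the tensor shape from a regular, non-empty nested list."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return shape


def flatten(nested: Any) -> List[float]:
    """Flatten a nested list in row-major order."""
    if not isinstance(nested, list):
        return [nested]
    flat: List[float] = []
    for item in nested:
        flat.extend(flatten(item))
    return flat


def build_infer_request(batch: list, input_name: str = "input",
                        datatype: str = "FP32") -> Dict[str, Any]:
    """Build a KServe v2 style inference request body."""
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": infer_shape(batch),
                "datatype": datatype,
                "data": flatten(batch),
            }
        ]
    }


tiny_batch = [[[[0.0, 1.0], [2.0, 3.0]]]]  # shape [1, 1, 2, 2], illustration only
print(json.dumps(build_infer_request(tiny_batch)))
```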

In [None]:
class TritonClient:
    """Client for communicating with NVIDIA Triton Inference Server."""
    
    def __init__(self, server_url: str = "localhost:8000", model_name: str = "udacimed_pneumonia_production"):
        self.server_url = server_url
        self.model_name = model_name
        self.base_url = f"http://{server_url}/v2"
        self.model_endpoint = f"{self.base_url}/models/{self.model_name}"
        
        # Test connection to Triton server through model endpoint
        try:
            response = requests.get(f"{self.model_endpoint}/ready", timeout=5)
            if response.status_code == 200:
                print(f"Connected to Triton server. Model endpoint ready at {self.model_endpoint}")
            else:
                print(f"WARNING: Triton server responded, but the model endpoint at {self.model_endpoint} is not ready: {response.status_code}")
        except requests.exceptions.RequestException as e:
            print(f"ERROR: Cannot connect to the model endpoint of Triton server")
            print(f"DIAGNOSIS: {str(e)}")
            print(f"SOLUTION: ")
            print(f"   1. Ensure Triton server is running")
            print(f"   2. Check if you have set the right port for the `server_url`")
            print(f"   3. Verify model is loaded, or needs explicit loading: {self.model_name}")
    
    def infer(self, input_data: np.ndarray, fp16_mode: bool = False) -> np.ndarray:
        """
        Perform inference via Triton server.
        
        Args:
            input_data: Input tensor as numpy array [batch_size, 3, 64, 64]
            fp16_mode: Whether to send the input as FP16 instead of FP32
            
        Returns:
            Model predictions as numpy array [batch_size, 2]
        """
        # TODO: Prepare inference request
        # HINT: Triton expects JSON format that follows the KServe community standard inference protocol
        # See: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/protocol/README.html
        request_data = {}  # Add your code here
        
        # Send inference request using POST to the inference endpoint
        try:
            response = requests.post(
                f"{self.model_endpoint}/infer",
                json=request_data,
                headers={"Content-Type": "application/json"}
            )
            
            if response.status_code != 200:
                raise RuntimeError(f"Inference failed: {response.status_code} - {response.text}")

            # Extract results from response
            result = response.json()
            output_data = np.array(result["outputs"][0]["data"])
            output_shape = result["outputs"][0]["shape"]
            
            return output_data.reshape(output_shape)
            
        except requests.exceptions.RequestException as e:
            print(f"ERROR: Inference request failed")
            print(f"DIAGNOSIS: {str(e)}")
            print(f"SOLUTION: ")
            print(f"   1. Check if Triton server is running")
            print(f"   2. Verify model is loaded and ready")
            print(f"   3. Check input data format and dimensions")
            raise
    
    def get_metrics(self) -> Dict[str, Any]:
        """
        Get Triton server performance metrics, automatically served by Triton's metrics endpoint.
        The endpoint provides Prometheus metrics indicating GPU and request statistics.  

        This function returns a raw dictionary mapping each metric name to its value, extracted from the Prometheus format.
        """
        try:
            metrics_endpoint = f"http://localhost:{ports['metrics_port']}/metrics"
            response = requests.get(metrics_endpoint)
            if response.status_code == 200:
                # Parse the raw Prometheus text output
                raw_metrics = {}
                
                for family in text_string_to_metric_families(response.text):
                    for sample in family.samples:
                        metric_name = sample.name
                        raw_metrics[metric_name] = sample.value

                if not raw_metrics:
                    print("WARNING: No metrics available from Triton server")
                    return {}

                return raw_metrics
            else:
                return {}
        except Exception:
            print("WARNING: Metrics endpoint not available! Check the `get_metrics()` method from TritonClient")
            return {}

# Create Triton client
# Uses the same model name as defined in the repository creation
triton_client = TritonClient(model_name=model_name)

## Step 5: Benchmark model performance on all metrics

Now it's time to perform benchmarking of the complete optimization and deployment pipeline.

While the local PyTorch metrics used so far measure pure model inference (GPU compute only), Triton server metrics include network overhead, queueing, batching, and model execution. As such, Triton metrics are more representative of real-world deployment performance and are preferable for benchmarking because they provide:
- **Accuracy**: Server-side measurements are more precise
- **Standardization**: Industry-standard Prometheus format
- **Production monitoring**: Same metrics used for alerting and scaling
- **Comprehensive data**: Includes queue time, compute time, batch efficiency

TODO: Read about all [available metrics from the Triton Server Inference metrics endpoint](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/metrics.html) before continuing.
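
Because Triton's counters are cumulative, a benchmark has to subtract a "before" snapshot from an "after" snapshot. The sketch below parses Prometheus text with the standard library and keys each sample by name plus labels, so that samples from different models do not overwrite each other. The snapshot text is made up for illustration and the helper names are hypothetical. It uses the raw metric names from the endpoint; the `prometheus_client` parser used in this notebook reports counters with a `_total` suffix.

```python
import re
from typing import Dict

# Matches `name{labels} value`; assumes label values contain no `}` character.
SAMPLE_LINE = re.compile(
    r"^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?P<labels>\{[^}]*\})?\s+(?P<value>\S+)"
)


def parse_prometheus_text(text: str) -> Dict[str, float]:
    """Map 'name{labels}' to its value, skipping comment and blank lines."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match:
            key = match.group("name") + (match.group("labels") or "")
            samples[key] = float(match.group("value"))
    return samples


def average_request_latency_ms(before: Dict[str, float], after: Dict[str, float],
                               labels: str) -> float:
    """Average latency over the benchmark window: delta duration / delta requests."""
    success_key = f"nv_inference_request_success{labels}"
    duration_key = f"nv_inference_request_duration_us{labels}"
    delta_requests = after[success_key] - before[success_key]
    delta_duration_us = after[duration_key] - before[duration_key]
    if delta_requests <= 0:
        raise ValueError("No new successful requests between snapshots")
    return delta_duration_us / delta_requests / 1000.0


LABELS = '{model="demo",version="1"}'
snapshot_before = parse_prometheus_text(f"""
# TYPE nv_inference_request_success counter
nv_inference_request_success{LABELS} 10
nv_inference_request_duration_us{LABELS} 50000
""")
snapshot_after = parse_prometheus_text(f"""
nv_inference_request_success{LABELS} 60
nv_inference_request_duration_us{LABELS} 300000
""")
print(average_request_latency_ms(snapshot_before, snapshot_after, LABELS), "ms per request")
```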

In [None]:
# Calculate metrics from Triton Server metrics
def calculate_benchmark_metrics(initial_metrics: Dict, final_metrics: Dict, batch_size: int = 1) -> Dict[str, float]:
    """
    Calculate performance metrics per request from before/after Triton metrics snapshots, i.e., for the current inference run.
    This is necessary because Triton metrics are cumulative across all inference runs.
    
    Args:
        initial_metrics: Raw metrics before benchmark
        final_metrics: Raw metrics after benchmark
        
    Returns:
        Calculated performance metrics for the benchmark period only
    """
    # Calculate deltas between snapshots
    # final_value - initial_value gives metrics for benchmark period only
    # This is because metrics are cumulative from the server across all runs, so you need to extract deltas from before and after benchmark
    delta_requests = final_metrics.get('nv_inference_request_success_total', 0) - initial_metrics.get('nv_inference_request_success_total', 0)
    delta_request_time_us = final_metrics.get('nv_inference_request_duration_us_total', 0) - initial_metrics.get('nv_inference_request_duration_us_total', 0)
    delta_queue_time_us = final_metrics.get('nv_inference_queue_duration_us_total', 0) - initial_metrics.get('nv_inference_queue_duration_us_total', 0)
    delta_compute_input_us = final_metrics.get('nv_inference_compute_input_duration_us_total', 0) - initial_metrics.get('nv_inference_compute_input_duration_us_total', 0)
    delta_compute_infer_us = final_metrics.get('nv_inference_compute_infer_duration_us_total', 0) - initial_metrics.get('nv_inference_compute_infer_duration_us_total', 0)
    delta_compute_output_us = final_metrics.get('nv_inference_compute_output_duration_us_total', 0) - initial_metrics.get('nv_inference_compute_output_duration_us_total', 0)
    
    if delta_requests == 0:
        print("ERROR: No new requests detected between snapshots")
        return {}
    
    # Calculate per-request averages from deltas
    # delta_total_time / delta_requests = average per request for benchmark period (+ transform from us to ms!)
    avg_request_time_ms = (delta_request_time_us / delta_requests) / 1000
    avg_queue_time_ms = (delta_queue_time_us / delta_requests) / 1000
    avg_compute_input_ms = (delta_compute_input_us / delta_requests) / 1000
    avg_compute_infer_ms = (delta_compute_infer_us / delta_requests) / 1000
    avg_compute_output_ms = (delta_compute_output_us / delta_requests) / 1000
    avg_compute_total_ms = avg_compute_input_ms + avg_compute_infer_ms + avg_compute_output_ms
    
    # TODO: Calculate throughput from average latency
    throughput_requests_per_sec =  # Add your code here
    
    # TODO: Get current memory usage
    # HINT: Not delta - this is current state
    gpu_memory_mb =  # Add your code here
    
    metrics = {
        # Request duration metrics (end-to-end latency including network overhead)
        'request_latency_ms': avg_request_time_ms,
        
        # Queue duration (time spent waiting for batching)
        'request_queue_time_ms': avg_queue_time_ms,
        
        # Pure compute time breakdown (closest to PyTorch local measurements)
        'request_compute_input_ms': avg_compute_input_ms,
        'request_compute_infer_ms': avg_compute_infer_ms,
        'request_compute_output_ms': avg_compute_output_ms,
        'request_compute_total_ms': avg_compute_total_ms,
        
        # GPU memory usage (current deployment memory footprint)
        'memory_used_mb': gpu_memory_mb,
        
        # Throughput and request stats
        'throughput_requests_per_sec': throughput_requests_per_sec,
        'total_successful_requests': delta_requests
    }
    
    return metrics

In [None]:
def benchmark_triton_performance(triton_client: TritonClient, 
                                test_data: torch.Tensor,
                                num_single_requests: int = 50,
                                num_batch_requests: int = 25,
                                batch_size: int = None) -> Dict[str, Any]:
    """
    Benchmarking function with separate single-sample and batch performance analysis.
    
    Args:
        triton_client: Triton inference client
        test_data: Test input tensors for benchmarking
        num_single_requests: Number of single-sample requests to send
        num_batch_requests: Number of batch requests to send
        batch_size: Batch size for batch requests (defaults to test_data batch size)
        
    Returns:
        Performance metrics with separate single-sample and batch analysis
    """
    print(f"Benchmarking Triton deployment performance with single/batch analysis...")
    
    # Pre-requisite: Prepare test data for inference

    # TODO: Set up data on the right device and with the right numpy inference type
    # HINT: The single_sample can just be one item from test_batch
    test_batch =  # Add your code here
    single_sample = test_batch[:1]
    
    # Configure batch size
    if batch_size is None:
        batch_size = test_batch.shape[0]
        batch_data = test_batch
    else:
        # Create batch with specified size (repeat if necessary)
        if batch_size <= test_batch.shape[0]:
            batch_data = test_batch[:batch_size]
        else:
            # Repeat samples to reach desired batch size
            repeats = (batch_size + test_batch.shape[0] - 1) // test_batch.shape[0]
            repeated_data = np.tile(test_batch, (repeats, 1, 1, 1))
            batch_data = repeated_data[:batch_size]
    
    print(f"   Single-sample requests: {num_single_requests}")
    print(f"   Batch requests: {num_batch_requests} (batch_size={batch_size})")
    
    # Initial metrics snapshot (baseline for the deltas computed below)
    print("   Taking initial metrics snapshot...")
    initial_metrics = triton_client.get_metrics()
    
    if not initial_metrics:
        return {'benchmark_success': False, 'error': 'Could not collect initial metrics'}
    
    initial_count = initial_metrics.get('nv_inference_request_success_total', 0)
    print(f"   Initial request count: {initial_count}")
    
    try:
        # Phase 1 - Single-sample inference
        print("   Phase 1: Single-sample inference...")
        for i in range(num_single_requests):
            # TODO: Run inference on a single sample
            # HINT: You can use the triton_client object on single_sample
            _ = # Add your code here
            if i % 20 == 19:
                print(f"      Progress: {i+1}/{num_single_requests}")
        time.sleep(2)  # Wait for metrics to update

        print("   Taking post-single-sample metrics snapshot...")
        post_single_metrics = triton_client.get_metrics()
        
        if not post_single_metrics:
            return {'benchmark_success': False, 'error': 'Could not collect post-single metrics'}
        
        post_single_count = post_single_metrics.get('nv_inference_request_success_total', 0)
        single_requests_processed = post_single_count - initial_count
        print(f"   Single-sample requests processed: {single_requests_processed}")
        
        # Phase 2 - Batch inference
        print(f"   Phase 2: Batch inference with batch size {batch_size}...")
        for i in range(num_batch_requests):
            # TODO: Run inference on a batch sample
            # HINT: You can use the triton_client object on batch_data
            _ = # Add your code here
            if i % 10 == 9:
                print(f"      Progress: {i+1}/{num_batch_requests}")
        
        time.sleep(2)  # Wait for metrics to update
        
        print("   Taking final metrics snapshot...")
        final_metrics = triton_client.get_metrics()
        
        if not final_metrics:
            return {'benchmark_success': False, 'error': 'Could not collect final metrics'}
        
        final_count = final_metrics.get('nv_inference_request_success_total', 0)
        batch_requests_processed = final_count - post_single_count
        total_requests_processed = final_count - initial_count
        
        print(f"   Batch requests processed: {batch_requests_processed}")
        print(f"   Total requests processed: {total_requests_processed}")
        
        if single_requests_processed == 0 and batch_requests_processed == 0:
            print("WARNING: No requests detected in either phase")
            return {'benchmark_success': False, 'error': 'No metric updates detected'}
        
        # Calculate performance metrics for both phases
        single_sample_metrics = {}
        batch_metrics = {}
        
        # Calculate single-sample metrics
        if single_requests_processed > 0:
            print("   Calculating single-sample performance...")
            single_sample_metrics = calculate_benchmark_metrics(initial_metrics, post_single_metrics)
            if single_sample_metrics:
                single_sample_metrics['phase'] = 'single_sample'
                single_sample_metrics['samples_per_request'] = 1
                
        # Calculate batch metrics  
        if batch_requests_processed > 0:
            print("   Calculating batch performance...")
            batch_metrics = calculate_benchmark_metrics(post_single_metrics, final_metrics)
            if batch_metrics:
                batch_metrics['phase'] = 'batch'
                batch_metrics['samples_per_request'] = batch_size
                if batch_metrics.get('request_latency_ms', 0) > 0:
                    batch_metrics['request_compute_total_ms'] = batch_metrics['request_compute_total_ms'] / batch_size
                    batch_metrics['per_sample_latency_ms'] = batch_metrics['request_latency_ms'] / batch_size
                    batch_metrics['per_sample_throughput'] = batch_metrics['throughput_requests_per_sec'] * batch_size
        
        # Calculate efficiency comparison
        efficiency_analysis = {}
        if single_sample_metrics and batch_metrics:
            single_latency = single_sample_metrics.get('request_latency_ms', 0)
            batch_per_sample_latency = batch_metrics.get('per_sample_latency_ms', 0)
            
            if single_latency > 0 and batch_per_sample_latency > 0:
                efficiency_analysis = {
                    'batch_efficiency_ratio': single_latency / batch_per_sample_latency,
                    'single_sample_latency_ms': single_latency,
                    'batch_per_sample_latency_ms': batch_per_sample_latency,
                    'batch_size': batch_size,
                    'latency_improvement_percent': ((single_latency - batch_per_sample_latency) / single_latency) * 100,
                    # Single-sample requests carry one sample each, so requests/sec equals samples/sec
                    'throughput_improvement_ratio': batch_metrics.get('per_sample_throughput', 0) / (single_sample_metrics.get('throughput_requests_per_sec') or 1)
                }
        
        print(f"Benchmark completed successfully!")
        return {
            'benchmark_success': True,
            'single_sample_metrics': single_sample_metrics,
            'batch_metrics': batch_metrics,
            'efficiency_analysis': efficiency_analysis,
            'summary': {
                'single_requests_processed': single_requests_processed,
                'batch_requests_processed': batch_requests_processed,
                'total_requests_processed': total_requests_processed,
                'configured_batch_size': batch_size
            }
        }
        
    except Exception as e:
        print(f"ERROR: Benchmark failed - {str(e)}")
        return {'benchmark_success': False, 'error': str(e)}

# Run the benchmark
# TODO: Choose the batch size
# HINT: You may want to start with the optimal batch_size from the batch analysis in notebook 2, and then experiment with other values too!
batch_size = # Int - Add your value here
benchmark_results = benchmark_triton_performance(
    triton_client=triton_client,
    test_data=sample_images,
    num_single_requests=30,       # Focused single-sample testing
    num_batch_requests=20,        # Focused batch testing
    batch_size=batch_size         # Parameterized batch size
)
pprint(benchmark_results)
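
The per-sample figures in the batch phase are derived from request-level metrics by simple arithmetic. The sketch below reproduces that arithmetic in isolation; the function name `batch_efficiency` and all input numbers are hypothetical and only illustrate how the ratios relate to each other.

```python
def batch_efficiency(single_latency_ms: float,
                     batch_latency_ms: float,
                     batch_requests_per_sec: float,
                     batch_size: int) -> dict:
    """Derive per-sample figures from request-level batch metrics."""
    # One batch request carries batch_size samples
    per_sample_latency_ms = batch_latency_ms / batch_size
    per_sample_throughput = batch_requests_per_sec * batch_size
    return {
        "per_sample_latency_ms": per_sample_latency_ms,
        "per_sample_throughput": per_sample_throughput,
        "batch_efficiency_ratio": single_latency_ms / per_sample_latency_ms,
        "latency_improvement_percent":
            (single_latency_ms - per_sample_latency_ms) / single_latency_ms * 100,
    }


# Hypothetical measurements, for illustration only
result = batch_efficiency(single_latency_ms=4.0,
                          batch_latency_ms=16.0,
                          batch_requests_per_sec=60.0,
                          batch_size=32)
print(result)
```

Note that a batch request can take longer than a single-sample request end to end while still being far cheaper per sample, which is why both views are reported.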

> **Performance on latency and throughput is worse than expected?** Make sure to handle cold start in the `config.pbtxt` for the chosen batch_size! Or, run this cell multiple times.
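
Re-running the cell works because the first requests pay one-off initialization costs. A client-side complement to server-side warm-up is to send a few untimed requests before measuring. The sketch below assumes a hypothetical zero-argument callable `infer_fn` that wraps whatever inference call you use; it is not part of the notebook's `TritonClient`.

```python
import time


def timed_latencies_ms(infer_fn, num_warmup=5, num_measured=20):
    """Run untimed warm-up calls, then return per-call latencies in ms."""
    for _ in range(num_warmup):
        infer_fn()  # results and timings are discarded on purpose

    latencies = []
    for _ in range(num_measured):
        start = time.perf_counter()
        infer_fn()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


# Stand-in for a real inference call (hypothetical)
calls = []

def fake_infer():
    calls.append(1)


latencies = timed_latencies_ms(fake_infer, num_warmup=3, num_measured=10)
print(f"{len(calls)} calls made, {len(latencies)} timed")
```

Keep in mind that client-side timing includes network and serialization overhead, so it is not directly comparable to the server-side compute metrics.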

> **Have you noticed something about GPU memory?**
> 
> Your optimized model should be <100MB, but Triton shows a few hundred megabytes for model size (once benchmarking is completed). What do you think is causing this huge difference? Is this ALL model memory, or something else? 
> 
> HINT: Triton metrics endpoint reports allocated GPU memory for the complete Triton process.
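
To see why this number is process-wide, it helps to look at the raw metric. Triton exposes GPU memory as the `nv_gpu_memory_used_bytes` gauge, labelled per GPU rather than per model. The sketch below parses a hand-written sample in Prometheus text format; the sample text and the helper name `gpu_memory_used_mb` are assumptions for illustration, not output from your server.

```python
# Hypothetical excerpt of a metrics response in Prometheus text format
SAMPLE_METRICS = """\
# HELP nv_gpu_memory_used_bytes GPU used memory, in bytes
# TYPE nv_gpu_memory_used_bytes gauge
nv_gpu_memory_used_bytes{gpu_uuid="GPU-hypothetical-0"} 524288000.000000
"""


def gpu_memory_used_mb(metrics_text: str) -> float:
    """Sum nv_gpu_memory_used_bytes over all GPUs and convert to MB."""
    total_bytes = 0.0
    for line in metrics_text.splitlines():
        # Skip comment lines (# HELP, # TYPE) and unrelated metrics
        if line.startswith("nv_gpu_memory_used_bytes"):
            total_bytes += float(line.rsplit(" ", 1)[1])
    return total_bytes / (1024 ** 2)


print(gpu_memory_used_mb(SAMPLE_METRICS))
```

The gauge carries a GPU label but no model label, so it cannot separate model weights from the memory held by the CUDA context, the backend runtime, and its workspace.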

In [None]:
# TODO: Use NVIDIA Triton Model Analyzer to check the memory footprint by model
# Follow the instructions below to install and run model analysis, and manually extract the model memory from the generated reports
benchmark_results['model_memory_mb'] = # Float - Add your value here

**---NVIDIA Triton Model Analyzer---**

1. Open a local terminal window.

2. Start the Triton SDK container, following the recommended approach from https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/model_analyzer/docs/install.html:

```
 # Get the absolute path first
 MODEL_REPO_PATH=$(realpath $(pwd)/../deployment/onnxruntime/triton_model_repository)
 RESULTS_PATH=$(realpath $(pwd)/../deployment/onnxruntime/triton_model_analyzer)

 docker run --rm -it \
     --net=host \
     --gpus all \
     -e MODEL_REPO_PATH="${MODEL_REPO_PATH}" \
     -e RESULTS_PATH="${RESULTS_PATH}" \
     -v /var/run/docker.sock:/var/run/docker.sock \
     -v ${MODEL_REPO_PATH}:${MODEL_REPO_PATH}:ro \
     -v ${RESULTS_PATH}:${RESULTS_PATH} \
     nvcr.io/nvidia/tritonserver:{TODO: Add your chosen triton version}-sdk
```

<br>Note that we are using the same Triton Server image, but with the `-sdk` suffix to ensure `model-analyzer` is available.

3. Inside the container, run the model analyzer with the relevant flags:

```
 pip install --upgrade requests==2.31.0  # Solve the DockerException for http+docker scheme

 model-analyzer profile \
     --model-repository ${MODEL_REPO_PATH} \
     --profile-models udacimed_pneumonia_production \
     --triton-launch-mode=docker \
     --triton-http-endpoint localhost:9000 \
     --triton-grpc-endpoint localhost:9001 \
     --triton-metrics-url http://localhost:9002/metrics \
     --gpus 0 \
     --output-model-repository-path ${RESULTS_PATH}/output \
     --export-path=${RESULTS_PATH} \
     --override-output-model-repository \
     --run-config-search-disable \
     --concurrency 1 \
     --batch-sizes 1,32
```

<br>Note that:

- The ports must differ from those used by the Triton container already running in this notebook, so that the latter can keep running.
- In production, you should put these options in a `config.yaml` file instead, for reproducibility.
<br>

4. Go to `$RESULTS_PATH/results` and compute the model memory usage by subtracting **GPU Memory Usage (MB)** in `metrics-server-only.csv` from **GPU Memory Usage (MB)** in `metrics-model-gpu.csv`.
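
The subtraction in step 4 can also be scripted. The sketch below is a minimal illustration that uses inline CSV text with made-up values; apart from the **GPU Memory Usage (MB)** column named above, the column layout of both files is an assumption, so check it against your generated reports before reusing the code.

```python
import csv
import io

COLUMN = "GPU Memory Usage (MB)"

# Hypothetical file contents; real reports contain more columns and rows
SERVER_ONLY_CSV = (
    "GPU UUID,GPU Memory Usage (MB)\n"
    "GPU-hypothetical-0,310.0\n"
)
MODEL_GPU_CSV = (
    "Model,Batch,Concurrency,GPU Memory Usage (MB)\n"
    "udacimed_pneumonia_production,1,1,372.0\n"
    "udacimed_pneumonia_production,32,1,398.0\n"
)


def read_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by column header."""
    return list(csv.DictReader(io.StringIO(csv_text)))


# Memory held by the Triton server with no model loaded
server_only_mb = float(read_rows(SERVER_ONLY_CSV)[0][COLUMN])

# Model memory per profiled batch size = measured usage minus server baseline
model_memory_mb = {
    int(row["Batch"]): float(row[COLUMN]) - server_only_mb
    for row in read_rows(MODEL_GPU_CSV)
}
print(model_memory_mb)
```

For the real files, replace `io.StringIO(csv_text)` with an opened file handle from `$RESULTS_PATH/results`.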

> **When to use Metrics Endpoint vs Model Analyzer with Triton?**
> 
> Model Analyzer is used for production testing. It identifies bottlenecks under real-world conditions: concurrent clients, varying loads, resource contention. 
> The metrics endpoint is used for live monitoring of operational metrics, for production alerting and scaling decisions. It is well suited to validating our optimization targets.
> 
> For a real-world deployment, we'd use Model Analyzer for thorough production testing, plus dedicated dev/QA testing to validate performance (and targets being met!) under UdaciMed's multi-tenant concurrent client scenarios.

In [None]:
# Validate clinical performance
def validate_clinical_performance_simple(triton_client: TritonClient,
                                        test_loader,
                                        threshold: float = 0.5) -> Dict[str, Any]:
    """
    Clinical validation using test data.
    
    Args:
        triton_client: Triton inference client
        test_loader: Test dataset loader
        threshold: Decision threshold for classification
        
    Returns:
        Clinical performance metrics from Triton deployment
    """ 
    print("Validating clinical performance on test data...")
    
    triton_predictions = []
    true_labels = []
    samples_processed = 0
    
    # TODO: Collect predictions from Triton inference into the variables initialized above
    # HINT: Process batches from test_loader converted to numpy with the right input type, and collect predictions outputs

    # Add your code here
    
    if len(triton_predictions) == 0:
        print("ERROR: No successful predictions collected")
        return {'clinical_validation_success': False}
    
    # Convert to numpy arrays for metric calculation
    predictions = np.array(triton_predictions)
    labels = np.array(true_labels).flatten()
    
    # Calculate clinical metrics
    pred_classes = (predictions > threshold).astype(int)
    
    # Calculate confusion matrix components
    tp = np.sum((pred_classes == 1) & (labels == 1))
    fn = np.sum((pred_classes == 0) & (labels == 1))
    tn = np.sum((pred_classes == 0) & (labels == 0))
    fp = np.sum((pred_classes == 1) & (labels == 0))
    
    # Calculate clinical metrics
    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else 0
    specificity = tn / (tn + fp) if (tn + fp) > 0 else 0
    accuracy = (tp + tn) / (tp + tn + fp + fn) if (tp + tn + fp + fn) > 0 else 0
    
    print(f"Clinical validation completed on {len(labels)} samples")
    return {
        'clinical_validation_success': True,
        'samples_validated': len(labels),
        'sensitivity': sensitivity,
        'specificity': specificity,
        'accuracy': accuracy,
        'true_positives': tp,
        'false_negatives': fn,
        'confusion_matrix': {'tp': tp, 'fn': fn, 'tn': tn, 'fp': fp}
    }

# Validate clinical performance
# TODO: Define threshold for binary classification
clinical_threshold =  # Float - Add your value here
clinical_results = validate_clinical_performance_simple(
    triton_client=triton_client,
    test_loader=test_loader,
    threshold=clinical_threshold
)
pprint(clinical_results)
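
The decision threshold directly trades sensitivity against specificity, which matters when choosing `clinical_threshold`. The sketch below shows the effect on a tiny set of hypothetical scores, using the same confusion-matrix definitions as the validation function; the data and the helper name `sensitivity_specificity` are invented for illustration.

```python
def sensitivity_specificity(probabilities, labels, threshold):
    """Return (sensitivity, specificity) for a binary decision threshold."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    pairs = list(zip(predictions, labels))
    tp = sum(1 for pred, y in pairs if pred == 1 and y == 1)
    fn = sum(1 for pred, y in pairs if pred == 0 and y == 1)
    tn = sum(1 for pred, y in pairs if pred == 0 and y == 0)
    fp = sum(1 for pred, y in pairs if pred == 1 and y == 0)
    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    specificity = tn / (tn + fp) if (tn + fp) > 0 else 0.0
    return sensitivity, specificity


# Hypothetical scores: the first four are pneumonia cases, the last four are normal
probabilities = [0.95, 0.80, 0.45, 0.30, 0.60, 0.35, 0.10, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

for threshold in (0.5, 0.25):
    sens, spec = sensitivity_specificity(probabilities, labels, threshold)
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

On this toy data, lowering the threshold catches every positive case at the cost of more false alarms. Whether that trade is acceptable depends on the clinical requirement, not on the arithmetic.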

## Step 6: Assess if production targets are met

Final evaluation against all production deployment requirements. Meeting all targets demonstrates successful optimization for UdaciMed's deployment requirements.

In [None]:
# Define production targets
# Feel free to skip FLOP reduction analysis: if TensorRT is enabled, simply assume 2-10% relative improvement
PRODUCTION_TARGETS = {
    'memory': 100,
    'throughput': 2000, 
    'latency': 3,    
    'sensitivity': 98, 
}

DEPLOYMENT_VALUES = {
    'memory': benchmark_results['model_memory_mb'] or benchmark_results['batch_metrics'].get('memory_used_mb'),
    'throughput': {
        benchmark_results['batch_metrics'].get('phase', 'batch'): benchmark_results['batch_metrics'].get('per_sample_throughput'),
        benchmark_results['single_sample_metrics'].get('phase', 'single_sample'): benchmark_results['single_sample_metrics'].get('throughput_requests_per_sec'),
    },
    'latency': {
        benchmark_results['batch_metrics'].get('phase', 'batch'): benchmark_results['batch_metrics'].get('request_latency_ms'),
        benchmark_results['single_sample_metrics'].get('phase', 'single_sample'): benchmark_results['single_sample_metrics'].get('request_latency_ms'),
    },
    'sensitivity': clinical_results.get('sensitivity', 0)*100
}

print("Production deployment values vs targets:")
for metric, target in PRODUCTION_TARGETS.items():
    print(f"   {metric.replace('_', ' ').title()}: Target={target} --> @Deployment={DEPLOYMENT_VALUES[metric]}")
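
The targets do not all point in the same direction: memory and latency must stay below their target, while throughput and sensitivity must exceed it. A pass/fail check therefore needs the direction alongside the value. The sketch below is one way to encode that; the `TARGETS` structure, the `check_targets` helper, and the achieved values are hypothetical.

```python
# (target value, direction): "max" means the achieved value must stay below
# the target, "min" means it must exceed the target
TARGETS = {
    "memory": (100, "max"),       # MB
    "throughput": (2000, "min"),  # samples/sec
    "latency": (3, "max"),        # ms
    "sensitivity": (98, "min"),   # percent
}


def check_targets(achieved, targets=TARGETS):
    """Return {metric: True/False}; a missing value counts as not met."""
    status = {}
    for name, (target, direction) in targets.items():
        value = achieved.get(name)
        if value is None:
            status[name] = False
        elif direction == "max":
            status[name] = value < target
        else:
            status[name] = value > target
    return status


# Hypothetical deployment values, for illustration only
achieved = {"memory": 62.0, "throughput": 1920.0, "latency": 0.5, "sensitivity": 98.6}
status = check_targets(achieved)
print(status)
print(f"{sum(status.values())}/{len(status)} targets met")
```

Since `DEPLOYMENT_VALUES` holds separate single-sample and batch numbers for throughput and latency, you would first decide which phase each target applies to before passing a single value per metric.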

## Step 7: TODO: Cross-platform deployment analysis

Now that you have mastered GPU deployment with Triton, it's time to analyze how UdaciMed's pneumonia detection model would perform across different deployment environments. Healthcare systems have diverse infrastructure needs - from hospital workstations (CPU-only) to portable clinic devices and mobile health applications.

> **Use case context**: UdaciMed serves hospitals with varying IT infrastructure. Some have modern GPU workstations, others rely on CPU-only systems, and many need portable solutions for rural clinics or emergency response.

### Step 7.1: Optimization strategy for CPU deployment

Hospital workstations often lack dedicated GPUs but need to maintain clinical performance and multi-tenant efficiency. Let's analyze CPU deployment options for UdaciMed's hospital deployment!

> **Numerical precision opportunities with GPU and CPU**: CPUs generally don't benefit from FP16 (most CPUs only emulate it). But CPUs support another type of numerical optimization, remember?

#### Analyze CPU deployment options

Consider UdaciMed's requirements: <100MB memory budget, >98% sensitivity preservation, multi-tenant hospital deployment, and minimal clinical risk.

_<<TODO: Complete the table below by filling in missing performance expectations, pros/cons for each approach>>_

| Approach | Conversion Path | Memory Footprint | Performance | Clinical Risk | Multi-Tenant Support |
|----------|----------------|------------------|-------------|---------------|---------------------|
| **PyTorch on CPU** | Direct (no conversion) | High | Baseline | **Low** - same model | No batching |
| **ONNX Runtime Default CPU** | | | | | |
| **OpenVINO Runtime CPU** | | | | | |
| **OpenVINO Backend for Triton** | | | | | |

_<<TODO: Briefly answer the questions below based on UdaciMed's hospital deployment requirements>>_

**1. Which approaches meet the memory budget requirement?**

**2. Which approach poses the lowest clinical risk? Why?**

**3. How important is dynamic batching for hospital workstations vs multi-tenant cloud deployment?**

**4. What are the trade-offs between OpenVINO Runtime CPU and Triton+OpenVINO for UdaciMed?**

#### Make your strategic choice

Based on your analysis above, choose the best CPU deployment approach for UdaciMed:

**My recommendation for UdaciMed's hospital CPU deployment:** 

_<<TODO: Choose one approach and justify your decision in 2-3 sentences>>_

#### Define an optimal CPU deployment configuration in OpenVINO

Imagine you are now testing CPU deployment with OpenVINO for UdaciMed. Set up the OpenVINO configuration to balance performance, memory, and clinical safety.

_<<TODO: Complete the OpenVINO configuration below>>_

```yaml
# openvino_hospital_config.yaml
# UdaciMed Hospital Workstation Deployment Configuration

model_optimization:
  input_model: "udacimed_pneumonia_optimized.onnx"
  target_device: "CPU"
  
  # Choose precision strategy
  precision: # TODO - Options: "FP32" (safe), "FP16", or "INT8" (faster, smaller, but clinical risk)
  
  # Set optimization priority  
  optimization_level: # TODO - Options: "ACCURACY" (safe) or "PERFORMANCE" (faster)
  
  # Configure quantization (if using INT8)
  quantization:
    enabled:  # TODO: true/false
    calibration_dataset_size:  # TODO - Number of samples for INT8 calibration (if enabled)

deployment_config:
  # Configure CPU utilization for hospital workstations
  cpu_threads: # TODO - Options: 1, 2, 4, 8 (consider multi-tenancy impact)
  
  # Set memory allocation for multi-tenant deployment
  memory_pool_mb: # TODO - Memory budget per model instance
  
  # Choose batching strategy
  max_batch_size: # TODO - 1 (single patient) or higher (if implementing manual batching)
  
  # Configure for hospital network environment
  inference_timeout_ms: # TODO: Maximum inference time before timeout

clinical_validation:
  # Define validation requirements after CPU deployment
  sensitivity_threshold: # TODO: Minimum acceptable sensitivity (should be >98%)
  validation_dataset_size: # TODO: Number of samples for clinical re-validation
  comparison_baseline: "GPU_Triton_deployment"  # Compare against your GPU results
```
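
When weighing the precision options against the memory budget, a first-order estimate is parameter count times bytes per weight. The sketch below shows that arithmetic; the parameter count is a hypothetical placeholder, and the estimate covers stored weights only, not activations or runtime overhead.

```python
# Bytes needed to store one weight at each precision
BYTES_PER_WEIGHT = {"FP32": 4, "FP16": 2, "INT8": 1}


def estimated_weight_size_mb(num_parameters: int, precision: str) -> float:
    """Rough size of the stored weights in MB (weights only)."""
    return num_parameters * BYTES_PER_WEIGHT[precision] / (1024 ** 2)


NUM_PARAMETERS = 11_000_000  # hypothetical; use your own model's count

for precision in BYTES_PER_WEIGHT:
    size_mb = estimated_weight_size_mb(NUM_PARAMETERS, precision)
    print(f"{precision}: ~{size_mb:.1f} MB")
```

A smaller footprint is only useful if sensitivity survives the conversion, so any reduced-precision choice still has to pass clinical re-validation.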

_<<TODO: Justify each configuration choice with one sentence>>_

**Precision choice (FP32):**

**Optimization level (ACCURACY):**

**CPU threads (4):**

**Memory allocation (80MB):**

**Batch size (1):**

**Clinical validation parameters (98.0% sensitivity, 1000 samples):**

### Step 7.2: Optimization strategy for mobile and edge deployment

UdaciMed's vision extends beyond hospital workstations to portable devices and mobile health applications. This enables pneumonia detection in rural clinics, emergency response, and preventive screening programs where traditional infrastructure is limited.

> **Mobile and edge requirements**: These deployments require lightweight runtimes, offline capability, extended battery life, and often benefit from platform-specific optimizations. However, conversion complexity and clinical validation requirements vary significantly across approaches.

#### Analyze mobile deployment options

Consider UdaciMed's mobile/edge requirements: <50MB app size for developing markets, >98% sensitivity preservation, cross-platform reach for global health, offline capability for rural clinics, and minimal clinical validation burden.

_<<TODO: Complete the table below by analyzing conversion complexity, clinical risk, and UdaciMed suitability for each approach>>_

| Platform | Conversion Path | Model Size | Clinical Risk | Platform Coverage | Development Complexity | Edge Suitability |
|----------|----------------|------------|---------------|-------------------|----------------------|------------------|
| **ExecuTorch** | PyTorch→`torch.export`→ExecuTorch (`.pte`) | ~25MB | **Low** - single conversion | Cross-platform | **Low** - familiar workflow | **High** - offline capable |
| **LiteRT** | | | | | | |
| **Core ML (iOS)** | | | | | | |
| **ONNX Runtime Mobile** | | | | | | |

_<<TODO: Answer the questions below based on UdaciMed's mobile and edge deployment strategy>>_

**1. Which approaches meet the app size requirement (<50MB total) and are suitable for offline edge deployment?**

**2. Which conversion path poses the lowest clinical risk for UdaciMed's >98% sensitivity requirement?**

**3. Should UdaciMed prioritize cross-platform reach or platform-specific optimization for global health impact?**

**4. How do development complexity and regulatory validation burden affect UdaciMed's resource allocation?**

**5. Which approaches best support offline deployment in rural clinics and emergency response scenarios?**

**6. What are the power consumption and battery life implications for portable clinic devices?**

#### Make your mobile/edge strategy choice

**My recommendation for UdaciMed's mobile and edge deployment strategy:**

_<<TODO: Choose one approach and justify your decision in 2-3 sentences, considering clinical risk, development resources, and global health reach>>_

-----

## **Congratulations!**

You have successfully implemented a complete hardware-accelerated deployment pipeline! Let's recap the decisions you have made and results you have achieved while transforming an optimized model into a production-ready healthcare solution.

### **Production deployment scorecard**

**Final GPU deployment performance vs UdaciMed targets:**

_<<TODO: Complete final scorecard based on your benchmarking results:>>_

| Metric | Target | Achieved | Status |
|--------|--------|----------|--------|
| **Memory Usage** | <100MB | | |
| **Throughput** | >2,000 samples/sec | | |
| **Latency** | <3ms | | |
| **FLOP Reduction** | >80% | | |
| **Clinical Safety** | >98% sensitivity | | |

_<<TODO: Give yourself a final production score given the number of targets met>>_

**Overall production score: X/5 targets met!**

### **Strategic deployment insights**

_<<TODO: Reflect on the key decisions you made, and why>>_

#### Mixed Precision Strategy
**Your FP16/FP32 choice:** # _(FP32, FP16)_

**Why you made this decision:**

#### Backend Selection
**Your Triton backend choice:**  # _(ONNX Runtime, TensorRT, etc.)_

**Why this backend aligned with UdaciMed's requirements:**

#### Batching Configuration
**Your dynamic batching setup:** # _(preferred batch sizes, queue delay, etc.)_

**How this supports multi-tenant deployment:** 

### Optimization philosophy
**Meeting targets vs maximizing metrics:**

_<<TODO: What did you learn about when to stop optimizing and why?>>_

---

**You have completed the full journey from architectural optimization to production-ready deployment, demonstrating the technical skills and strategic thinking essential for deploying AI in healthcare. Your UdaciMed pneumonia detection system is now ready to serve hospitals worldwide while maintaining the clinical safety standards that save lives.**