# Exercise 1: Accelerate small language model inference with TensorRT optimization

Most PyTorch models run far below GPU hardware capacity because generic execution patterns don't leverage hardware-specific optimizations. [TensorRT](https://developer.nvidia.com/tensorrt) addresses this on NVIDIA GPUs by building execution engines optimized for your specific model and target GPU.

> **Overview**: An AI-powered platform is struggling with GPU utilization efficiency. While their model provides excellent response quality, the current PyTorch inference pipeline operates at a throughput far below the hardware's theoretical capacity.
> 
> **Scenario**: You work for a customer service platform that processes 50,000+ support inquiries daily across multiple languages. Current infrastructure costs are unsustainable due to poor GPU utilization, and response times during peak hours exceed SLA requirements. Your goal is to achieve 3,000+ samples/sec throughput through TensorRT optimization to reduce infrastructure needs from 10 T4 instances to 3.
> 
> **Goal**: Implement TensorRT's build-time optimization workflow, leverage mixed precision and dynamic batching, and measure performance improvements to understand how hardware acceleration frameworks unlock production deployment efficiency.
> 
> **Tools**: transformers, torch, tensorrt, onnx, pycuda, datasets
> 
> **Estimated Time**: 15 minutes
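
The fleet-size target in the scenario can be sanity-checked with a little arithmetic. The sketch below assumes that the number of instances scales inversely with per-instance throughput (an assumption, not something the scenario states) and derives the required speedup and the implied baseline throughput from the figures above.

```python
# Back-of-the-envelope check of the scenario's targets.
# Assumption: fleet size scales inversely with per-instance throughput.
current_instances = 10      # from the scenario
target_instances = 3        # from the scenario
target_throughput = 3000.0  # samples/sec per instance, from the scenario

# Speedup each instance must deliver to serve the same load with fewer GPUs
required_speedup = current_instances / target_instances

# Baseline throughput implied by the target (derived, not measured)
implied_baseline = target_throughput / required_speedup

print(f"Required per-instance speedup: {required_speedup:.2f}x")
print(f"Implied baseline throughput: {implied_baseline:.0f} samples/sec")
```

Compare the implied baseline with the PyTorch numbers you measure in Step 3; if they differ widely, the required speedup changes accordingly.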

## Step 1: Setup

Let's establish our environment and verify T4 capabilities for TensorRT optimization.

In [None]:
# # Uncomment to install necessary libraries, then comment out and restart notebook
# ! pip install transformers torch tensorrt onnx onnxruntime-gpu pycuda datasets

In [None]:
# Import core libraries
import os
import warnings
warnings.filterwarnings("ignore")
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
os.environ['TF_ENABLE_ONEDNN_OPTS'] = '0'

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForSequenceClassification, set_seed
from datasets import load_dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
from datetime import datetime
import onnx
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
from torch.utils.data import DataLoader, TensorDataset

# Create output directory
output_dir = "assets/exercise1"
os.makedirs(output_dir, exist_ok=True)

In [None]:
# Set seeds for reproducibility  
set_seed(42)
torch.manual_seed(42)
np.random.seed(42)

# Verify T4 setup and TensorRT compatibility
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    gpu_properties = torch.cuda.get_device_properties(0)
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Compute Capability: {gpu_properties.major}.{gpu_properties.minor}")
    print(f"Total Memory: {gpu_properties.total_memory / 1e9:.1f} GB")
    print(f"Multiprocessors: {gpu_properties.multi_processor_count}")
    
    # Check TensorRT compatibility
    tensor_cores_available = gpu_properties.major >= 7
    print(f"Tensor Core Support: {'✓ Available' if tensor_cores_available else '✗ Not Available'}")
    print(f"TensorRT Version: {trt.__version__}")
    
    if tensor_cores_available:
        print("  → Mixed precision (FP16) will provide significant speedup")
else:
    print("CUDA not available - exercise requires GPU")

print("Setup complete!")

> **T4 hardware context:** T4 GPUs feature 2,560 CUDA cores, 320 Tensor Cores, and 320 GB/s memory bandwidth. 
> 
> The Tensor Cores provide specialized acceleration for FP16 matrix operations, which TensorRT leverages through mixed precision optimization.


## Step 2: Load baseline model and data

For this exercise, we'll use the following model and dataset, respectively:
- [DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert) provides a good balance of model complexity and inference speed for demonstrating TensorRT optimizations
- A 2,500-sample subset of the [IMDB movie reviews dataset](https://huggingface.co/datasets/stanfordnlp/imdb) provides realistic text with natural length variability for demonstrating TensorRT's dynamic batching capabilities


In [None]:
# Load tokenizer and model
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = "[PAD]"

# Load model in evaluation mode
pytorch_model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)
pytorch_model.eval()

pytorch_model_params = sum(p.numel() for p in pytorch_model.parameters())
pytorch_model_size = sum(p.numel() * p.element_size() for p in pytorch_model.parameters()) / 1024**2
pytorch_model_memory = torch.cuda.memory_allocated() / 1024**2

print(f"Model loaded: {model_name}")
print(f"Model parameters: {pytorch_model_params:,}")
print(f"Model size: {pytorch_model_size:.1f} MB")
print(f"Model memory: {pytorch_model_memory:.1f} MB")

In [None]:
# Load a subset of IMDB movie reviews for benchmarking
print("Loading IMDB dataset for realistic inference benchmarking...")
dataset = load_dataset("imdb", split="test")
sample_texts = dataset['text'][:2500]  # Subsample from 25K test samples for efficient benchmarking

print(f"Dataset loaded: {len(sample_texts)} movie reviews (subsampled from {len(dataset)} total)")

# Analyze length distribution
review_lengths = [len(tokenizer.encode(text)) for text in sample_texts[:100]]
print(f"Sample length distribution:")
print(f"  Min length: {min(review_lengths)} tokens")
print(f"  Max length: {max(review_lengths)} tokens") 
print(f"  Average length: {np.mean(review_lengths):.1f} tokens")
print(f"  Median length: {np.median(review_lengths):.1f} tokens")

# Tokenize with different sequence lengths to test dynamic batching
def prepare_input_tensors(texts, max_length=64, batch_size=32):
    """Tokenize texts and create batched tensors"""
    encoded = tokenizer(
        texts,
        truncation=True,
        padding='max_length',
        max_length=max_length,
        return_tensors='pt'
    )
    
    # Create dataset and dataloader
    dataset = TensorDataset(
        encoded['input_ids'],
        encoded['attention_mask']
    )
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    
    return dataloader, encoded['input_ids'].shape

# Create dataloaders for different batch sizes
batch_sizes = [16, 32, 64, 128]
dataloaders = {}

for batch_size in batch_sizes:
    dataloaders[batch_size], input_shape = prepare_input_tensors(
        sample_texts, max_length=64, batch_size=batch_size
    )

print(f"Dataset prepared:")
print(f"Total samples: {len(sample_texts)}")
print(f"Input shape per sample: {input_shape}")
print(f"DataLoaders created for batch sizes: {batch_sizes}")

> **Dataset characteristics impact on optimization**: The IMDB reviews show natural length variability (35-1018 tokens). We truncate to 64 tokens to keep the comparison fast and consistent, although 256+ tokens are more common in practice for this use case.  
> 
> Advanced implementations support variable sequence lengths. These benefit further from TensorRT's optimization profiles, which pre-optimize kernels for common lengths while adapting memory allocation dynamically.
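
To make the truncation trade-off concrete, here is a minimal sketch in plain Python. The token counts are hypothetical values chosen to span the range quoted above, and both helper names are illustrative. It reports how much of each fixed-length batch is padding and how much of the original text is lost to truncation.

```python
def padded_token_fraction(lengths, pad_to):
    """Fraction of token slots holding padding when every sequence is
    truncated or padded to exactly `pad_to` tokens."""
    kept = sum(min(n, pad_to) for n in lengths)
    total_slots = pad_to * len(lengths)
    return (total_slots - kept) / total_slots


def truncated_token_fraction(lengths, pad_to):
    """Fraction of the original tokens dropped by truncating to `pad_to`."""
    original = sum(lengths)
    kept = sum(min(n, pad_to) for n in lengths)
    return (original - kept) / original


# Hypothetical token counts (not measured from the dataset)
lengths = [35, 120, 256, 480, 1018]

for pad_to in (64, 256, 512):
    padding = padded_token_fraction(lengths, pad_to)
    dropped = truncated_token_fraction(lengths, pad_to)
    print(f"max_length={pad_to}: {padding:.1%} padding, {dropped:.1%} of tokens dropped")
```

A short `max_length` wastes little compute on padding but discards most of the text, while a long one keeps more text and spends more compute on padding.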

## Step 3: Measure baseline PyTorch performance

Before optimizing with TensorRT, let's establish baseline performance to measure improvements.

In [None]:
def benchmark_pytorch_model(model, dataloader, num_batches=20, warmup_batches=5):
    """Measure PyTorch model performance"""
    model.eval()
    times = []
    memory_usage = []
    
    # Warmup
    with torch.no_grad():
        for i, (input_ids, attention_mask) in enumerate(dataloader):
            if i >= warmup_batches:
                break
            input_ids, attention_mask = input_ids.to(device), attention_mask.to(device)
            _ = model(input_ids, attention_mask=attention_mask)
    
    # Benchmark
    with torch.no_grad():
        for i, (input_ids, attention_mask) in enumerate(dataloader):
            if i >= num_batches:
                break
                
            input_ids, attention_mask = input_ids.to(device), attention_mask.to(device)
            
            # Memory tracking
            torch.cuda.empty_cache()
            torch.cuda.reset_peak_memory_stats()
            baseline_memory = torch.cuda.memory_allocated() / 1024**2
            
            # Time measurement
            torch.cuda.synchronize()
            start_time = time.perf_counter()
            
            outputs = model(input_ids, attention_mask=attention_mask)
            
            torch.cuda.synchronize()
            end_time = time.perf_counter()
            
            # Record metrics
            peak_memory = torch.cuda.max_memory_allocated() / 1024**2
            times.append(end_time - start_time)
            memory_usage.append(peak_memory - baseline_memory)
    
    return times, memory_usage

In [None]:
# Measure baseline performance across batch sizes
baseline_results = {}

print("Measuring PyTorch baseline performance...")
for batch_size in batch_sizes:
    times, memory = benchmark_pytorch_model(pytorch_model, dataloaders[batch_size])
    
    avg_time = np.mean(times)
    throughput = batch_size / avg_time
    avg_memory = np.mean(memory)
    
    baseline_results[batch_size] = {
        'avg_time': avg_time,
        'throughput': throughput,
        'latency_ms': avg_time * 1000,
        'memory_mb': avg_memory
    }
    
    print(f"Batch {batch_size}: {throughput:.1f} samples/sec ({avg_time*1000:.1f}ms) | {avg_memory:.1f} MB")

print(f"\nBaseline established! Best throughput: {max(r['throughput'] for r in baseline_results.values()):.1f} samples/sec")

> **Insights from baseline performance analysis:** The baseline shows typical PyTorch scaling patterns: throughput increases modestly with batch size, while activation memory roughly doubles each time the batch size doubles, until memory bandwidth becomes the bottleneck. 
> 
> Notice how batch 128 actually performs worse than batch 64, indicating we're hitting T4's memory limits and experiencing memory pressure that degrades performance. TensorRT addresses this through optimized memory layouts, layer fusion, and precision management that better utilize T4's architectural capabilities.
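
Mean latency can hide the slow batches that cause SLA violations at peak load. As an optional extension, the sketch below summarizes a list of per-batch timings (such as the `times` list returned by the benchmark function) with throughput plus nearest-rank p50 and p95 latency. The helper name and the example timings are hypothetical.

```python
def summarize_batch_timings(times_s, batch_size):
    """Summarize per-batch wall-clock times (in seconds)."""
    ordered = sorted(times_s)
    count = len(ordered)
    mean_s = sum(ordered) / count

    def percentile(p):
        # Nearest-rank percentile; integer math avoids float rounding surprises
        rank = max(1, (p * count + 99) // 100)
        return ordered[rank - 1]

    return {
        "throughput": batch_size / mean_s,  # samples/sec
        "p50_ms": percentile(50) * 1000,
        "p95_ms": percentile(95) * 1000,
    }


# Hypothetical timings: 18 fast batches and 2 slow outliers
example_times = [0.010] * 18 + [0.050] * 2
summary = summarize_batch_timings(example_times, batch_size=32)
print(summary)
```

In this example the median stays at the fast-batch latency while the p95 lands on the outliers, which is the gap a mean-only report would blur.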


## Step 4: Implement the TensorRT conversion workflow

TensorRT optimization happens in two phases: build-time optimization that analyzes and restructures the model, and runtime execution with the optimized engine.

> _**Note on the ONNX intermediate format:**_ [ONNX (Open Neural Network Exchange)](https://onnx.ai/) serves as a standard model format that enables different optimization engines to understand PyTorch architectures. TensorRT is one such engine that specializes in NVIDIA GPU optimization—it reads ONNX models and generates hardware-specific kernels, memory layouts, and execution strategies optimized for your target GPU's architecture. 

In [None]:
def convert_pytorch_to_onnx(model, input_shape, onnx_path):
    """Convert PyTorch model to ONNX format"""
    print("Step 1: Converting PyTorch → ONNX...")
    
    # TODO: Create dummy inputs for ONNX export
    # HINT: ONNX export needs sample inputs to trace the computation graph on the expected device
    # Use torch.randint to create realistic token IDs (0 to vocab_size) and torch.ones to create attention masks
    vocab_size = model.config.vocab_size
    dummy_input_ids =  # Add your code here
    dummy_attention_mask =  # Add your code here
    
    # Export to ONNX with dynamic axes for batching
    # Only constant folding is applied here; the heavier optimizations are left to TensorRT
    torch.onnx.export(
        model,
        (dummy_input_ids, dummy_attention_mask),
        onnx_path,
        export_params=True,
        opset_version=11,
        do_constant_folding=True,
        input_names=['input_ids', 'attention_mask'],
        output_names=['logits'],
        dynamic_axes={
            'input_ids': {0: 'batch_size', 1: 'sequence_length'},
            'attention_mask': {0: 'batch_size', 1: 'sequence_length'},  
            'logits': {0: 'batch_size'}  # logits are (batch_size, num_labels); only the batch axis varies
        }
    )
    
    # Verify ONNX model
    onnx_model = onnx.load(onnx_path)
    onnx.checker.check_model(onnx_model)
    print(f"✓ ONNX model saved and verified: {onnx_path}")
    return onnx_path

In [None]:
def build_tensorrt_engine(onnx_path, engine_path, precision='fp16', min_batch=1, opt_batch=32, max_batch=128):
    """Build TensorRT engine from ONNX model"""
    print(f"Step 2: Building TensorRT engine with {precision} precision...")
    
    # TODO: Initialize TensorRT logger, builder, network, and parser
    # HINT: The network_flags allows you to add support for dynamic shapes
    # Reference: https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/inference-library/python-api-docs.html
    logger =  # Add your code here
    builder =  # Add your code here
    network_flags =  # Add your code here
    network =  # Add your code here
    parser =  # Add your code here
    
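    # Parse ONNX model and surface parser errors instead of failing silently
    with open(onnx_path, 'rb') as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError(f"Failed to parse ONNX model: {onnx_path}")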
    
    # TODO: Configure builder settings for optimization
    # HINT: BuilderConfig controls optimization strategies and precision settings
    # At minimum, set the following config attributes: memory pool limit (use 4GB for T4) and fp16 precision if requested
    # References: 
    # - https://developer.nvidia.com/docs/drive/drive-os/6.0.7/public/drive-os-tensorrt/api-reference/docs/python/infer/Core/Builder.html
    # - https://developer.nvidia.com/docs/drive/drive-os/6.0.7/public/drive-os-tensorrt/api-reference/docs/python/infer/Core/BuilderConfig.html#tensorrt.IBuilderConfig
    config =  # Add your code here

    # Add your code here (to set up config attributes)
    
    # TODO: Configure dynamic batching with optimization profiles
    # HINT: Optimization profiles tell TensorRT the range of input shapes to optimize for
    # For our model, we need to set the shape for 'input_ids' and 'attention_mask'
    # References: 
    # - https://developer.nvidia.com/docs/drive/drive-os/6.0.7/public/drive-os-tensorrt/api-reference/docs/python/infer/Core/Builder.html
    # - https://docs.nvidia.com/deeplearning/tensorrt/10.8.0/_static/python-api/infer/Core/OptimizationProfile.html#tensorrt.IOptimizationProfile
    profile =  # Add your code here
    
    # Add your code here (to set up profile attributes) 

    config.add_optimization_profile(profile)
    
    # Build engine
    print("Building engine... (this may take 1-2 minutes)")
    start_build = time.time()
    
    # TODO: Build the serialized engine
    # Reference: https://developer.nvidia.com/docs/drive/drive-os/6.0.7/public/drive-os-tensorrt/api-reference/docs/python/infer/Core/Builder.html
    serialized_engine =  # Add your code here
    build_time = time.time() - start_build
    
    # Save engine to disk
    with open(engine_path, 'wb') as f:
        f.write(serialized_engine)
    
    print(f"✓ TensorRT engine built in {build_time:.1f}s: {engine_path}")
    return engine_path, build_time

In [None]:
# Convert model to ONNX and build TensorRT engines
onnx_path = os.path.join(output_dir, "distilbert.onnx")
fp32_engine_path = os.path.join(output_dir, "distilbert_fp32.trt")
fp16_engine_path = os.path.join(output_dir, "distilbert_fp16.trt")

# Conversion pipeline
input_shape = (32, 64)  # (batch_size, sequence_length)
convert_pytorch_to_onnx(pytorch_model, input_shape, onnx_path)

# Build engines for both precisions
fp32_engine, fp32_build_time = build_tensorrt_engine(
    onnx_path, fp32_engine_path, precision='fp32', 
    min_batch=16, opt_batch=32, max_batch=128
)
fp16_engine, fp16_build_time = build_tensorrt_engine(
    onnx_path, fp16_engine_path, precision='fp16',
    min_batch=16, opt_batch=32, max_batch=128  
)

print(f"\n=== Conversion Summary ===")
print(f"PyTorch → ONNX: ✓")
print(f"ONNX → TensorRT FP32: ✓ ({fp32_build_time:.1f}s)")
print(f"ONNX → TensorRT FP16: ✓ ({fp16_build_time:.1f}s)")

> **What happens at TensorRT build-time?** TensorRT's build phase analyzes the model architecture and generates hardware-optimized kernels. 
> 
> This one-time cost _(~60 seconds)_ pays dividends through runtime performance gains. 
> 
> The build process includes layer fusion (combining multiple operations into single GPU kernels), precision optimization (using FP16 where safe), and memory layout planning, all tuned specifically for your target hardware.
> 
> _**About the build warnings:**_ The warnings you see are expected. They show TensorRT applying FP16 where it is fast and retaining FP32 where accuracy matters most:
> - _"Make sure input input_ids has Int64 binding"_ --> TensorRT automatically handles token ID type conversions for optimal GPU execution
> 
> - _"Detected layernorm nodes in FP16"_ --> TensorRT keeps numerically sensitive operations (LayerNorm, softmax) in FP32 while running other layers in FP16
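
To see why some operations are considered numerically sensitive, it helps to look at FP16's limits directly. The sketch below is plain Python and only illustrates the number format, not what TensorRT does internally. It rounds values to half precision with the standard `struct` module; the `to_fp16` helper is a hypothetical name.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # Magnitude exceeds the largest finite FP16 value (65504)
        return math.copysign(math.inf, x)


# Limited precision: small differences near 1.0 are lost
print(to_fp16(0.1))
print(to_fp16(1.0004))

# Limited range: a moderately large exponential already overflows
print(to_fp16(math.exp(12)))
```

Softmax computes exponentials and LayerNorm accumulates sums and variances, so both are prone to this kind of overflow or rounding loss in FP16. That is the usual reason for keeping them in FP32.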

## Step 5: Create the TensorRT inference engine

Now let's implement the TensorRT inference engine with dynamic batch size support.

In [None]:
class TensorRTInference:
    """TensorRT inference engine wrapper"""
    
    def __init__(self, engine_path):
        self.logger = trt.Logger(trt.Logger.WARNING)
        
        # Load and deserialize engine
        with open(engine_path, 'rb') as f:
            engine_data = f.read()
        
        runtime = trt.Runtime(self.logger)
        self.engine = runtime.deserialize_cuda_engine(engine_data)
        self.context = self.engine.create_execution_context()
        
        # Pre-allocate for maximum batch size to avoid memory issues
        max_batch_size = 128
        seq_length = 64
        
        # Allocate input memory
        input_ids_size = max_batch_size * seq_length
        attention_mask_size = max_batch_size * seq_length
        output_size = max_batch_size * 2  # Binary classification
        
        # TODO: Allocate pinned host memory for efficient async GPU transfers
        # HINT: You can use PyCUDA for this
        # Reference: https://documen.tician.de/pycuda/driver.html#pagelocked-host-memory
        self.input_ids_host =  # Add your code here
        self.attention_mask_host =  # Add your code here
        self.output_host =  # Add your code here
        
        # Device memory
        self.input_ids_device = cuda.mem_alloc(self.input_ids_host.nbytes)
        self.attention_mask_device = cuda.mem_alloc(self.attention_mask_host.nbytes)
        self.output_device = cuda.mem_alloc(self.output_host.nbytes)
        
        # Create CUDA stream
        self.stream = cuda.Stream()

        # Get TensorRT's built-in memory reporting
        self.engine_memory_mb = self.engine.device_memory_size / 1024**2
    
    def infer(self, input_ids, attention_mask):
        """Run inference with dynamic batch size"""
        batch_size = input_ids.shape[0]
        
        # TODO: Set dynamic input shapes for current batch
        # HINT: TensorRT needs to know the actual input shape for each inference, for both inputs: 'input_ids' and 'attention_mask'
        # Set it via self.context
        # Reference: https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-dynamic-shapes.html

        # Add your code here
       
        # Copy inputs to pre-allocated host memory
        input_ids_flat = input_ids.cpu().numpy().ravel()
        attention_mask_flat = attention_mask.cpu().numpy().ravel()
        
        self.input_ids_host[:len(input_ids_flat)] = input_ids_flat
        self.attention_mask_host[:len(attention_mask_flat)] = attention_mask_flat

        # Transfer input data to GPU
        cuda.memcpy_htod_async(self.input_ids_device, self.input_ids_host, self.stream)
        cuda.memcpy_htod_async(self.attention_mask_device, self.attention_mask_host, self.stream)
        
        # TODO: Bind tensor memory addresses for TensorRT execution
        # HINT: TensorRT needs to know where to find input/output tensors in GPU memory via their tensor address
        # Set three entries (input_ids, attention_mask, logits) for self.context
        # Reference: https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-1001/api/python_api/infer/Core/ExecutionContext.html

        # Add your code here
        
        # Run inference
        self.context.execute_async_v3(stream_handle=self.stream.handle)
        
        # TODO: Transfer output data from GPU back to host asynchronously
        # HINT: You can use PyCUDA for this; look for the device-to-host (dtoh) transfer with an _async suffix
        # Reference: https://documen.tician.de/pycuda/driver.html#unstructured-memory-transfers

        # Add your code here

        self.stream.synchronize()
        
        # Reshape output
        output_elements = batch_size * 2  # Number of elements
        output = self.output_host[:output_elements].reshape(batch_size, 2)
        
        return torch.tensor(output)

In [None]:
# Initialize TensorRT inference engines
print("Loading TensorRT engines...")
trt_fp32_engine = TensorRTInference(fp32_engine_path)
trt_fp16_engine = TensorRTInference(fp16_engine_path)
print("✓ TensorRT engines loaded and ready")

> **TensorRT runtime engine efficiency**: TensorRT's runtime engine uses pre-allocated memory pools and asynchronous GPU operations to minimize inference overhead. 
> 
> - The dynamic shape setting allows the same engine to handle different batch sizes efficiently without rebuilding. 
> - The pre-allocated memory approach eliminates allocation overhead during inference, which is crucial for achieving consistent low-latency performance in production environments.
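
The allocate-once, reuse-per-call pattern can be illustrated on the host without a GPU. The sketch below uses the standard `array` module as a stand-in for pinned memory. The class name is hypothetical and the 64-bit integer element type is an assumption; the real wrapper allocates its buffers through PyCUDA.

```python
from array import array


class PreallocatedTokenBuffer:
    """Host-side buffer sized once for the largest batch, then reused."""

    def __init__(self, max_batch_size, seq_length):
        self.seq_length = seq_length
        self.capacity = max_batch_size * seq_length
        # "q" = signed 64-bit integers (assumed dtype); allocated once up front
        self.data = array("q", [0]) * self.capacity

    def load(self, batch):
        """Copy a batch (list of token-ID lists) into the front of the buffer."""
        flat = [token for row in batch for token in row]
        if len(flat) > self.capacity:
            raise ValueError("batch exceeds the pre-allocated capacity")
        # Same-length slice assignment overwrites in place; no reallocation
        self.data[:len(flat)] = array("q", flat)
        return len(flat)


buffer = PreallocatedTokenBuffer(max_batch_size=128, seq_length=64)
used = buffer.load([[1] * 64 for _ in range(32)])  # a batch of 32 sequences
reserved_bytes = buffer.data.itemsize * len(buffer.data)
print(f"Reserved {reserved_bytes} bytes, using {used / buffer.capacity:.0%} of capacity")
```

Smaller batches fill only a prefix of the buffer, so the memory for the largest batch stays reserved whatever the actual batch size is.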

## Step 6: Perform comprehensive performance benchmarking

Let's benchmark all configurations to measure TensorRT's optimization impact.

In [None]:
def benchmark_tensorrt_engine(trt_engine, dataloader, num_batches=20, warmup_batches=3):
    """Benchmark TensorRT engine performance"""
    times = []
    memory_usage = []
    
    # Warmup
    for i, (input_ids, attention_mask) in enumerate(dataloader):
        if i >= warmup_batches:
            break
        input_ids, attention_mask = input_ids.to(device), attention_mask.to(device)
        _ = trt_engine.infer(input_ids, attention_mask)
    
    # Benchmark
    for i, (input_ids, attention_mask) in enumerate(dataloader):
        if i >= num_batches:
            break
            
        input_ids, attention_mask = input_ids.to(device), attention_mask.to(device)
        
        # Time measurement  
        torch.cuda.synchronize()
        start_time = time.perf_counter()
        
        outputs = trt_engine.infer(input_ids, attention_mask)
        
        torch.cuda.synchronize()
        end_time = time.perf_counter()
        
        # Record metrics
        times.append(end_time - start_time)
    
    # TODO: Define memory usage
    # HINT: Use TensorRT's built-in memory reporting, and convert to MB 
    # Reference: https://docs.nvidia.com/deeplearning/tensorrt/latest/_static/python-api/infer/Core/Engine.html
    memory_usage_engine =  # Add your code here
    memory_usage = [memory_usage_engine] * len(times)  # Create one entry per time measurement to match sizes
    
    return times, memory_usage

In [None]:
# Benchmark all configurations
configurations = [
    ("PyTorch FP32", "baseline"),
    ("TensorRT FP32", trt_fp32_engine), 
    ("TensorRT FP16", trt_fp16_engine)
]

all_results = {"PyTorch FP32": baseline_results}

print("Benchmarking TensorRT configurations...")

for config_name, engine in configurations[1:]:  # Skip baseline (already measured)
    print(f"\nBenchmarking {config_name}...")
    config_results = {}
    
    for batch_size in batch_sizes:
        times, memory = benchmark_tensorrt_engine(engine, dataloaders[batch_size])
        
        avg_time = np.mean(times)
        throughput = batch_size / avg_time
        avg_memory = np.mean(memory)
        
        config_results[batch_size] = {
            'avg_time': avg_time,
            'throughput': throughput, 
            'latency_ms': avg_time * 1000,
            'memory_mb': avg_memory
        }
        
        print(f"  Batch {batch_size}: {throughput:.1f} samples/sec ({avg_time*1000:.1f}ms) | {avg_memory:.1f} MB")
    
    all_results[config_name] = config_results

In [None]:
# Calculate optimization improvements for each batch size
print(f"\n=== TensorRT Optimization Impact Analysis ===")

# Throughput Analysis
print("\nThroughput Analysis: ")
for batch_size in batch_sizes:
    print(f"\n\t--- Batch Size: {batch_size} ---")

    baseline_batch = baseline_results[batch_size]
    trt_fp32_batch = all_results["TensorRT FP32"][batch_size]
    trt_fp16_batch = all_results["TensorRT FP16"][batch_size]

    fp32_improvement = trt_fp32_batch['throughput'] / baseline_batch['throughput']
    fp16_improvement = trt_fp16_batch['throughput'] / baseline_batch['throughput']

    print(f"\t\tTensorRT FP32 optimization: {fp32_improvement:.2f}x throughput improvement")
    print(f"\t\tTensorRT FP16 optimization: {fp16_improvement:.2f}x throughput improvement") 
    print(f"\t\t→ Mixed precision benefit: {fp16_improvement/fp32_improvement:.2f}x additional gain")

# Memory Analysis
print("\nMemory Analysis: ")
trt_fp32_memory = trt_fp32_engine.engine_memory_mb
trt_fp16_memory = trt_fp16_engine.engine_memory_mb

print(f"\n\tPyTorch FP32 Model (Weights only): {pytorch_model_memory:.1f} MB")
print(f"\tPyTorch peak activation memory: from {baseline_results[batch_sizes[0]]['memory_mb']:.1f} MB for batch_size {batch_sizes[0]} to {baseline_results[batch_sizes[-1]]['memory_mb']:.1f} MB for batch_size {batch_sizes[-1]} (scales with batch size)")
print(f"\tTensorRT FP32 Engine (Weights + Activation Workspace): {trt_fp32_memory:.1f} MB")
print(f"\tTensorRT FP16 Engine (Weights + Activation Workspace): {trt_fp16_memory:.1f} MB")

pytorch_max_memory = pytorch_model_memory + baseline_results[batch_sizes[-1]]['memory_mb']
reduction_percentage = (pytorch_max_memory - trt_fp16_memory) / pytorch_max_memory * 100
print(f"\n\t→ The FP16 engine reduces the static memory footprint by {reduction_percentage:.1f}% for batch_size {batch_sizes[-1]}.")

> **TensorRT memory management**: Unlike PyTorch's dynamic memory allocation that scales with batch size, TensorRT pre-allocates memory based on your optimization profile's maximum batch size (128 in our case). 
> 
> This strategy trades memory efficiency for performance consistency—no allocation overhead during inference, but you pay the memory cost of your largest expected batch size regardless of actual usage.

----

> **TODO: TensorRT Optimization Analysis**
>
> Using your benchmark results from the previous analysis, answer these questions to test your understanding of TensorRT's core optimization mechanisms:
> 
> 
> 1. **Memory Management Understanding**
>    - Given that TensorRT pre-allocates memory for the maximum batch size, what are the production implications if you set max_batch=256 but typically run batches of 32?
>    - In what deployment scenarios would TensorRT's approach be better than PyTorch's dynamic allocation, and when might PyTorch's approach be preferable?
>    - Answer: ________________
> 
> 2. **Mixed Precision Impact** 
>    - Compare the performance improvement from TensorRT FP32 vs TensorRT FP16. Why do you see this difference?
>    - What specific T4 hardware feature enables the larger performance boost with FP16?
>    - Answer: ________________
> 
> 3. **Build-time vs Runtime**
>    - TensorRT required a build step while PyTorch loads instantly. When does this trade-off make business sense?
>    - Why can't PyTorch achieve similar runtime performance without this build step?
>    - Answer: ________________
> 
> 4. **Dynamic Batching Behavior**
>    - Your optimization profile was set to min=16, opt=32, max=128. How does TensorRT handle batch sizes within this range?
>    - What would happen if you tried to run a batch size outside this range?
>    - Answer: ________________
>
> 5. **Bonus challenge: What if `max_length` of the tokenizer changed from 64 to 256?**
>     - How would you expect this to impact the performance *improvement factor* of the TensorRT FP16 engine over the baseline PyTorch model? Would the speedup become larger, smaller, or stay the same?
>     - _HINT:_ Think about how a longer sequence length affects the computational workload (i.e., the size of the matrix multiplications). Which environment—the generic PyTorch framework or the hardware-specific TensorRT engine—is better at capitalizing on more intensive, parallelizable work?
>     - Answer: ________________

## Conclusion

In this exercise, you've experienced TensorRT's complete optimization workflow and measured its real-world performance impact on T4 hardware.

TensorRT demonstrates how hardware acceleration frameworks transform model deployment feasibility through systematic build-time optimization and runtime efficiency gains. TensorRT success factors include:
- Hardware-specific optimization unlocks true GPU potential
- Dynamic batching adapts to variable workload patterns
- One-time build cost scales across all production inferences
- Mixed precision leverages T4 Tensor Cores effectively
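
How quickly the one-time build cost pays off can be estimated with a simple break-even calculation. The sketch below uses the roughly 60-second build time and the 3,000 samples/sec target mentioned earlier. The baseline throughput of 900 samples/sec is a hypothetical placeholder, so substitute your own measurements.

```python
def break_even_samples(build_time_s, baseline_throughput, optimized_throughput):
    """Number of samples after which the time spent building the engine
    has been recovered through faster inference."""
    if optimized_throughput <= baseline_throughput:
        raise ValueError("optimized throughput must exceed the baseline")
    # Seconds saved on every sample processed by the optimized engine
    saved_per_sample = 1.0 / baseline_throughput - 1.0 / optimized_throughput
    return build_time_s / saved_per_sample


samples = break_even_samples(
    build_time_s=60.0,            # approximate build time noted in Step 4
    baseline_throughput=900.0,    # hypothetical; use your Step 3 measurement
    optimized_throughput=3000.0,  # target from the scenario
)
print(f"Build cost recovered after about {samples:,.0f} samples")
```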

##### **Next optimization challenges to explore:**

- Explore **TensorRT-LLM** to leverage LLM-specific optimizations for faster autoregressive generation.

- Implement **dynamic shape optimization** to handle variable sequence lengths efficiently without reallocating memory.

- Optimize **layer fusion** (like attention and activation layers) to minimize kernel launches and improve throughput.

- Use **TensorRT’s auto-tuning** to select the most efficient kernels for each layer type.