# Exercise 1: Combine architecture and hardware optimizations for GPU acceleration

You have learned about hardware-aware architecture design and hardware acceleration tools in theory. Now it's time to apply these concepts by combining architectural modifications with hardware optimizations to discover how they interact in practice.

**Overview:** Explore how architectural choices and GPU acceleration can interact in complex ways, requiring careful coordination rather than simple stacking of optimizations to unlock expected performance gains.

**Scenario:** You work for a visual content moderation platform that processes millions of images in real time. Although your model delivers strong accuracy, its current throughput falls far short of peak demand: you need a nearly 10x throughput increase to keep up. To close the gap, you turn to both TensorRT-based hardware acceleration and architectural adjustments, especially since the DevOps team reports highly unstable GPU utilization (25–60%), which suggests inefficiencies in how the model and hardware interact.

**Goal:** Apply at least one architectural modification and one hardware acceleration technique to DenseNet121, then measure their interaction to see that the combined gains do not simply stack: the interaction can be positive or negative.

**Tools:** PyTorch, TensorRT, ONNX, CUDA tools

**Estimated Time:** 20 minutes

## Step 1: Setup

Let's establish our baseline environment and verify T4 capabilities for integrated optimization.

In [None]:
# # Uncomment to install necessary libraries, then comment out and restart notebook
# ! pip install torchinfo tensorrt onnx onnxruntime-gpu cuda-python datasets

In [None]:
# Import core libraries
import os
import warnings
warnings.filterwarnings("ignore")

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import torchvision.models as models
import torchvision.transforms as transforms
from torchinfo import summary
import tensorrt as trt
import onnx
from cuda import cudart

import numpy as np
import matplotlib.pyplot as plt
import time
import subprocess
from pathlib import Path

# Create output directory
output_dir = "assets/exercise1"
os.makedirs(output_dir, exist_ok=True)

In [None]:
# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Check T4 GPU capabilities
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_properties = torch.cuda.get_device_properties(0)
    
    print(f"GPU: {gpu_name}")
    print(f"Compute Capability: {gpu_properties.major}.{gpu_properties.minor}")
    print(f"Total Memory: {gpu_properties.total_memory / 1e9:.1f} GB")
    print(f"Multiprocessors: {gpu_properties.multi_processor_count}")
    print("Memory Bandwidth: ~320 GB/s (T4 datasheet value, not measured)")
    
    # Check for Tensor Core support (Compute Capability >= 7.0)
    tensor_cores_available = gpu_properties.major >= 7
    print(f"Tensor Core Support: {'✓ Available' if tensor_cores_available else '✗ Not Available'}")
    
    if tensor_cores_available:
        print("  → Mixed precision (FP16) will show significant speedup")
        print("  → Kernel fusion opportunities available")
        
    print("Setup complete!")
else:
    print("CUDA not available - exercise requires GPU")

> **T4 hardware context:** T4 GPUs feature 2,560 CUDA cores, 320 Tensor Cores, and 320 GB/s memory bandwidth. The Tensor Cores are specifically designed to accelerate mixed precision (FP16) operations, but their effectiveness depends on the types of operations in your model architecture.
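
One way to make "effectiveness depends on the types of operations" concrete is a back-of-the-envelope roofline estimate. The sketch below assumes NVIDIA's published T4 peak figures (about 8.1 TFLOPS FP32 and 65 TFLOPS FP16 on Tensor Cores) together with the 320 GB/s bandwidth quoted above. The function names are illustrative, and real kernels will land below these ceilings.

```python
# Back-of-the-envelope roofline estimate for a T4 (assumed datasheet peaks, not measurements).
T4_BANDWIDTH_BYTES_PER_S = 320e9   # memory bandwidth quoted in the note above
T4_PEAK_FP32_FLOPS = 8.1e12        # assumed datasheet peak, FP32
T4_PEAK_FP16_TC_FLOPS = 65e12      # assumed datasheet peak, FP16 on Tensor Cores


def ridge_point(peak_flops, bandwidth):
    """Arithmetic intensity (FLOP/byte) above which a kernel becomes compute-bound."""
    return peak_flops / bandwidth


def attainable_flops(intensity, peak_flops, bandwidth):
    """Roofline ceiling: limited either by compute or by memory traffic."""
    return min(peak_flops, intensity * bandwidth)


fp32_ridge = ridge_point(T4_PEAK_FP32_FLOPS, T4_BANDWIDTH_BYTES_PER_S)
fp16_ridge = ridge_point(T4_PEAK_FP16_TC_FLOPS, T4_BANDWIDTH_BYTES_PER_S)
print(f"FP32 ridge point:             {fp32_ridge:.1f} FLOP/byte")
print(f"FP16 Tensor Core ridge point: {fp16_ridge:.1f} FLOP/byte")

# An element-wise FP16 op (e.g. ReLU): ~1 FLOP per element, 2 bytes read + 2 bytes written.
relu_intensity = 1 / 4
relu_ceiling = attainable_flops(relu_intensity, T4_PEAK_FP16_TC_FLOPS, T4_BANDWIDTH_BYTES_PER_S)
print(f"Element-wise FP16 ceiling:    {relu_ceiling / 1e9:.0f} GFLOP/s")
```

Under these assumptions, low-intensity operations such as activations and concatenations are limited by memory traffic, so Tensor Cores help them far less than they help large convolutions or matrix multiplications.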

## Step 2: Create benchmark dataset and model

Let's establish our baseline DenseNet121 model and prepare consistent benchmarking data.

In [None]:
# Create synthetic ImageNet-like data for consistent benchmarking
def create_benchmark_dataset(num_samples=1000, image_size=224):
    """
    Create synthetic dataset for controlled benchmarking
    
    Args:
        num_samples (int): Number of samples to generate
        image_size (int): Size of square images (224x224 for ImageNet)
    
    Returns:
        TensorDataset: Synthetic images and labels for benchmarking
    """
    # Generate random images (standard normal values, approximating ImageNet inputs after normalization)
    images = torch.randn(num_samples, 3, image_size, image_size)
    labels = torch.randint(0, 1000, (num_samples,))
    
    dataset = TensorDataset(images, labels)
    return dataset

# Create benchmark dataset and dataloaders
benchmark_dataset = create_benchmark_dataset()
print(f"Benchmark dataset created: {len(benchmark_dataset)} samples")

batch_sizes = [16, 32, 64]
dataloaders = {}

for batch_size in batch_sizes:
    dataloaders[batch_size] = DataLoader(
        benchmark_dataset, 
        batch_size=batch_size, 
        shuffle=False,
        pin_memory=True  # Enables faster GPU transfer
    )

print(f"DataLoaders created for batch sizes: {batch_sizes}")

> **Synthetic dataset considerations:** Using synthetic data eliminates dataset loading and preprocessing bottlenecks, allowing us to isolate and measure the pure model optimization effects. Real-world deployments would show additional complexities from data pipeline optimization, but this controlled approach reveals the core architectural and hardware interactions.

In [None]:
# Create baseline DenseNet121 model
baseline_model = models.densenet121()
baseline_model = baseline_model.to(device)
baseline_model.eval()

# Get model information
summary(baseline_model, input_size=(1, 3, 224, 224))

> **Understanding DenseNet's model:** The DenseNet model is based on *dense connectivity*, where each layer receives inputs from ALL preceding layers within a block, so the channel count grows through concatenation instead of staying fixed as it would with element-wise addition. This leads to the following memory usage characteristics:
> 
> - *Memory growth* - feature maps grow linearly with depth (growth_rate × num_layers), making later layers process significantly more channels than earlier ones. 
> - *Memory efficiency trade-offs* - while concatenation enables better gradient flow and feature reuse, it creates higher memory bandwidth requirements and different GPU utilization patterns. 
> 
> This baseline analysis can help you understand how architectural modifications interact with DenseNet's inherent dense connectivity when combined with hardware acceleration techniques. You can find even more details about DenseNet in the original [Densely Connected Convolutional Networks](https://arxiv.org/abs/1608.06993) paper.
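
The linear channel growth described above can be traced with plain arithmetic. This is a minimal sketch that assumes torchvision's DenseNet121 defaults (64 initial features, block configuration `(6, 12, 24, 16)`, and transitions that halve the channel count); `densenet_channel_trace` is an illustrative helper, not a torchvision function.

```python
def densenet_channel_trace(growth_rate=32, block_config=(6, 12, 24, 16), init_features=64):
    """Return (stage, channels) pairs for a DenseNet-style stack of dense blocks."""
    channels = init_features
    trace = []
    for index, num_layers in enumerate(block_config, start=1):
        # Each layer concatenates `growth_rate` new channels onto its input.
        channels += num_layers * growth_rate
        trace.append((f"denseblock{index}", channels))
        if index < len(block_config):
            # Transition layers halve the channel count between blocks.
            channels //= 2
            trace.append((f"transition{index}", channels))
    return trace


baseline_trace = densenet_channel_trace(growth_rate=32)
reduced_trace = densenet_channel_trace(growth_rate=24)

for (stage, base), (_, reduced) in zip(baseline_trace, reduced_trace):
    print(f"{stage:>12}: {base:4d} channels (growth 32) vs {reduced:4d} (growth 24)")
```

The last layers of each block read far more channels than the first ones, which is why concatenation-heavy blocks tend to stress memory bandwidth.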


## Step 3: Choose your hardware<>architecture optimization path

To optimize the model on GPU, you need to implement at least **one architectural modification** and **one hardware acceleration technique**, then measure their interaction effects. Choose your path!

### Architecture modification options (Choose >=1):

**Option A: Early Exit Architecture**
- Add intermediate classifier after dense block 3 for confident predictions
- Reduces average computation by skipping expensive final block for easy samples
- Adaptive computation based on prediction confidence

**Option B: Reduced Dense Layers**  
- Decrease growth rate from 32 to 24 channels per layer
- Reduces memory bandwidth and computation while maintaining architecture
- Direct efficiency improvement through less feature concatenation

**Option C: Low-Rank Classifier Factorization**
- Decompose the final classifier (a 1024×1000 linear layer, about 1M parameters) into two smaller matrices using SVD
- Maintains dense matrix operations while reducing computational load
- Preserves patterns that work well with hardware acceleration

**Option D: Grouped Convolutions**
- Replace standard 1×1 bottleneck convolutions with grouped versions
- Reduces FLOPs while maintaining regular convolution patterns
- Better hardware utilization than depthwise separable approaches

**Option E: Structured Channel Pruning**
- Remove least important channels from dense block connections
- Creates structured sparsity that maintains computational efficiency
- Reduces both parameters and memory bandwidth requirements

### Hardware acceleration options (Choose >=1):

**Option A: Mixed Precision (FP16) Optimization**
- Leverages Tensor Cores for automatic FP16/FP32 selection
- Accelerates compatible operations while maintaining numerical stability
- Most effective with dense matrix operations

**Option B: Dynamic Batching Optimization**
- Optimizes batch processing and memory utilization
- Changes GPU occupancy patterns and memory coalescing
- Can reveal memory bandwidth vs compute trade-offs

**Option C: Kernel Fusion + Graph Optimization**  
- Combines multiple operations into single GPU kernels
- Reduces memory bandwidth requirements
- Effectiveness depends on operation compatibility
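
The "numerical stability" caveat in Option A comes from FP16's narrow range and precision. The following sketch uses only Python's standard `struct` module (format code `e` is IEEE 754 half precision) to round-trip a few values. `to_fp16` is an illustrative helper and not part of PyTorch or TensorRT.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


rounded = to_fp16(0.1)    # precision loss: only about 3 decimal digits survive
tiny = to_fp16(1e-8)      # underflow: too small for FP16, becomes 0.0

try:
    to_fp16(70000.0)      # the largest finite FP16 value is 65504
    overflowed = False
except OverflowError:
    overflowed = True

print(f"0.1 stored as FP16:    {rounded!r}")
print(f"1e-8 stored as FP16:   {tiny!r}")
print(f"70000 overflows FP16:  {overflowed}")
```

Mixed precision keeps range-sensitive operations (for example reductions and normalization statistics) in FP32 for exactly this reason.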

> **TODO: Summarize your optimization strategy**
> 
> Now that you've made your strategic choice on the combination of architecture and hardware optimizations to implement, briefly explain your reasoning _(1-2 sentences)_
> <br>HINT: Consider DenseNet121's characteristics when it comes to memory usage, compute complexity, and fusion opportunities.
> 
> _Add your answer here_: __________________

## Step 4: Implement architecture optimization(s)

Now it's time to translate your chosen architectural strategy into code. 

This step involves modifying the model's structure in PyTorch to change how it computes results. By altering the model's forward pass or its layers, you can introduce efficiencies like adaptive computation or reduced parameter counts, directly impacting its performance profile before any hardware-specific compilation.

In [None]:
def create_optimized_densenet(base_model, device):
    """Create DenseNet with optimizations"""

    # TODO: Create your optimization logic as you wish (modularization is recommended), and define the optimized model in the optimized_model variable
    # HINT: You can refer to the exercises in lesson 2 for some implementations
    # Or find inspiration at discuss.pytorch.org
    optimized_model = # Add your code here

    optimized_model = optimized_model.to(device)
    optimized_model.eval()
    
    return optimized_model

# Create architecture-optimized model
print("Creating optimized DenseNet...")
arch_model = create_optimized_densenet(baseline_model, device)

# Get optimized model information
arch_reduction = sum(p.numel() for p in baseline_model.parameters()) - sum(p.numel() for p in arch_model.parameters())
print(f"\nParameter reduction vs. baseline: {arch_reduction:,} ({arch_reduction/sum(p.numel() for p in baseline_model.parameters())*100:.1f}%)")
summary(arch_model, input_size=(1, 3, 224, 224))

> **On aligning architecture with hardware**: Every architectural modification changes the model's computational pattern in a specific way, whether through parameter reduction, operation reordering, or conditional execution. 
> 
> The key is understanding how these changes will interact with your target hardware's strengths and limitations. Some architectural patterns align naturally with GPU acceleration (dense matrix operations), while others introduce complexity that static compilers cannot optimize (dynamic branching). Measuring the individual architectural effect first provides a baseline to understand interaction effects later.
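
Before benchmarking, it helps to estimate how much an architectural change can remove at all. As an illustration for Option C, the sketch below counts parameters for a rank-`r` factorization of a linear layer, assuming the 1024→1000 classifier shape of torchvision's DenseNet121. The rank of 128 is a hypothetical choice and the helper names are illustrative.

```python
def linear_params(in_features, out_features, bias=True):
    """Parameter count of a dense linear layer."""
    return in_features * out_features + (out_features if bias else 0)


def low_rank_params(in_features, out_features, rank, bias=True):
    """Parameter count after replacing W with two factors of inner dimension `rank`."""
    return in_features * rank + rank * out_features + (out_features if bias else 0)


def break_even_rank(in_features, out_features):
    """Largest rank at which the factorized layer still has fewer weights."""
    return (in_features * out_features - 1) // (in_features + out_features)


IN_FEATURES, OUT_FEATURES = 1024, 1000   # assumed DenseNet121 classifier shape
RANK = 128                               # hypothetical rank, tune against accuracy

dense = linear_params(IN_FEATURES, OUT_FEATURES)
factored = low_rank_params(IN_FEATURES, OUT_FEATURES, RANK)
saved_pct = 100 * (dense - factored) / dense

print(f"Dense classifier:    {dense:,} parameters")
print(f"Rank-{RANK} classifier: {factored:,} parameters ({saved_pct:.1f}% fewer)")
print(f"Factorization only saves weights for rank <= {break_even_rank(IN_FEATURES, OUT_FEATURES)}")
```

Because the classifier is a small share of the whole network, a large relative saving here may translate into a modest end-to-end gain, which is worth keeping in mind when reading the benchmark results.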

## Step 5: Implement hardware acceleration with TensorRT

With the architecture modified, you can now turn to hardware acceleration. 

This step involves converting our high-level PyTorch model into a format that a specialized compiler, [TensorRT](https://developer.nvidia.com/tensorrt), can optimize. [ONNX (Open Neural Network Exchange)](https://onnx.ai/) acts as a bridge: TensorRT takes this ONNX graph and applies powerful GPU-specific optimizations.

**Remember that building the TRT engine could take up to a few minutes.**
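
One detail worth planning for: this notebook traces the model with a single sample input of batch size 1, while the benchmarks feed batches of 32. A common way to handle this is to mark the batch dimension as dynamic during export. The sketch below shows that idea with the `dynamic_axes` argument of `torch.onnx.export`; the helper names are illustrative, and newer exporter versions may prefer a different argument for the same purpose.

```python
def batch_dynamic_axes(input_names=("input",), output_names=("output",), axis_name="batch"):
    """Build a dynamic_axes mapping that marks dimension 0 of every tensor as variable."""
    return {name: {0: axis_name} for name in (*input_names, *output_names)}


def export_with_dynamic_batch(model, sample_input, onnx_path, opset_version=17):
    """Export `model` so that the resulting ONNX graph accepts any batch size."""
    import torch  # imported lazily so the helper above works without PyTorch installed

    torch.onnx.export(
        model,
        sample_input,
        onnx_path,
        export_params=True,
        opset_version=opset_version,
        do_constant_folding=True,
        input_names=["input"],
        output_names=["output"],
        dynamic_axes=batch_dynamic_axes(),
    )


print(batch_dynamic_axes())
```

With a dynamic batch dimension in the ONNX graph, TensorRT additionally needs an optimization profile that states which batch sizes the engine should support.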

In [None]:
# First, let's export the model to the ONNX format preferred by TensorRT
def export_to_onnx(model, sample_input, onnx_path):
    """
    Export PyTorch model to ONNX format for TensorRT optimization
    
    ONNX (Open Neural Network Exchange) is an intermediate representation
    that TensorRT uses to analyze and optimize model architectures.
    """
    try:
        # Export model to ONNX format
        # TODO: Make sure the export supports chosen optimizations, if applicable parameters exist
        # HINT: How does ONNX handle batches? Should you set up fp16 precision here or in TensorRT?
        # Reference: https://docs.pytorch.org/docs/stable/onnx_dynamo.html#torch.onnx.export
        torch.onnx.export(
            model,
            sample_input,
            onnx_path,
            export_params=True,
            opset_version=17,
            do_constant_folding=True,
            input_names=['input'],
            output_names=['output']
        )

        # Verify ONNX model
        onnx_model = onnx.load(onnx_path)
        onnx.checker.check_model(onnx_model)
        print(f"ONNX export successful: {onnx_path}")
        print(f"   Total nodes: {len(onnx_model.graph.node)}")

        return True
    except Exception as e:
        print(f"ONNX export failed: {e}")
        return False

In [None]:
# Secondly, let's create the optimized TensorRT engine 
def optimize_with_tensorrt(onnx_path, engine_path):
    """
    Convert ONNX model to TensorRT engine with optional optimizations
    """
    
    try:
        # Create TensorRT logger and builder
        TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
        builder = trt.Builder(TRT_LOGGER)
        config = builder.create_builder_config()

        # TODO: Implement your TensorRT optimizations
        # HINT: You can refer to the exercise 1 in lesson 3 for an example implementation
        # Reference: https://developer.nvidia.com/docs/drive/drive-os/6.0.7/public/drive-os-tensorrt/api-reference/docs/python/infer/Core/BuilderConfig.html#tensorrt.IBuilderConfig

        # Add your code here
        
        # Parse ONNX model
        network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
        parser = trt.OnnxParser(network, TRT_LOGGER)
        
        with open(onnx_path, 'rb') as model_file:
            if not parser.parse(model_file.read()):
                print("Failed to parse ONNX model")
                for error in range(parser.num_errors):
                    print(parser.get_error(error))
                return False

        # Build optimized engine
        serialized_engine = builder.build_serialized_network(network, config)
        
        if serialized_engine is None:
            print("Failed to build TensorRT engine")
            return False
            
        # Save optimized engine
        with open(engine_path, 'wb') as f:
            f.write(serialized_engine)
            
        print(f"TensorRT engine saved: {engine_path}")
        return True
        
    except Exception as e:
        print(f"TensorRT optimization failed: {e}")
        return False

# Export models to ONNX format
sample_input = torch.randn(1, 3, 224, 224).to(device)

baseline_onnx_path = os.path.join(output_dir, "baseline_densenet121.onnx")
arch_onnx_path = os.path.join(output_dir, "optimized_densenet121.onnx")

print("Exporting models to ONNX...")
baseline_onnx_success = export_to_onnx(baseline_model, sample_input, baseline_onnx_path)
arch_onnx_success = export_to_onnx(arch_model, sample_input, arch_onnx_path)

onnx.checker.check_model(baseline_onnx_path)
onnx.checker.check_model(arch_onnx_path)

# Optimize with TensorRT
baseline_engine_path = os.path.join(output_dir, "baseline_densenet_fused.engine")
arch_engine_path = os.path.join(output_dir, "optimized_densenet_fused.engine")

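# Default to failure so later cells can still check these flags if ONNX export failed
baseline_trt_success = arch_trt_success = False

if baseline_onnx_success and arch_onnx_success: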
    print("\nOptimizing with TensorRT...")
    baseline_trt_success = optimize_with_tensorrt(
        baseline_onnx_path, 
        baseline_engine_path
    )
    arch_trt_success = optimize_with_tensorrt(
        arch_onnx_path,
        arch_engine_path
    )

> **Understanding ONNX export and TensorRT compilation patterns**: 
> 1. The optimized model's _ONNX graph_ is not guaranteed to have fewer nodes than the baseline's. Early exit architectures, parameter sharing, or layer removal can reduce the graph complexity that ONNX captures. However, some modifications, such as added attention mechanisms or complex branching, might actually increase node counts. 
> 
> 2. _TensorRT's compilation_ process typically runs quietly by default, but you might see warnings about unsupported operations, precision conversions, or optimization choices. Some of these warnings can be ignored (e.g., layer fusion opportunities taken) while others require investigation (e.g., unsupported operation).
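
For orientation, here is a hedged sketch of what a builder configuration for FP16 plus a variable batch size could look like. It assumes the ONNX input is named `input` with a dynamic batch dimension and uses the batch sizes from this notebook's dataloaders. The function names and the 1 GiB workspace limit are illustrative choices, and it is one possible configuration, not the reference solution.

```python
def batch_profile(min_batch, opt_batch, max_batch, chw=(3, 224, 224)):
    """Return (min, opt, max) input shapes for a TensorRT optimization profile."""
    return tuple((batch, *chw) for batch in (min_batch, opt_batch, max_batch))


def configure_builder(builder, config, input_name="input", use_fp16=True,
                      workspace_bytes=1 << 30, batches=(1, 32, 64)):
    """Apply FP16 and a dynamic-batch profile to an existing builder config."""
    import tensorrt as trt  # imported lazily; requires a TensorRT installation

    # Cap the scratch memory TensorRT may use while searching for kernels.
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, workspace_bytes)

    if use_fp16:
        # Allow FP16 kernels; TensorRT still picks FP32 where it is faster or required.
        config.set_flag(trt.BuilderFlag.FP16)

    # Only needed when the ONNX graph was exported with a dynamic batch dimension.
    profile = builder.create_optimization_profile()
    min_shape, opt_shape, max_shape = batch_profile(*batches)
    profile.set_shape(input_name, min_shape, opt_shape, max_shape)
    config.add_optimization_profile(profile)
    return config


print(batch_profile(1, 32, 64))
```

TensorRT tunes kernels for the `opt` shape, so choosing it to match the benchmark batch size is a reasonable starting point.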

## Step 6: Perform systematic performance benchmarking

Now let's measure the impact. 

This systematic approach lets us measure individual technique effects, combined effects, and quantify whether optimizations amplify each other, conflict, or remain independent.

A systematic benchmark is crucial to understand not just if our changes made the model faster, but why. You will compare four configurations to isolate the effects of each optimization and, most importantly, to reveal how they interact when combined. _Does 1+1 equal 2, or do you see an unexpected result?_

⎯⎯⎯⎯⎯⎯⎯

You only have **one TODO in step 6.2** to prepare the inputs for inference. <ins>If you want to scroll quickly through the rest of the code</ins>, here's a high-level summary:

- *6.1*: Build the TensorRT inference class for consistent benchmarking
- *6.2*: Create a fair comparison methodology across PyTorch and TensorRT models  
- *6.3*: Execute the core experiment and calculate optimizations' interaction effects

### 6.1 Create the TensorRT inference class

To run systematic comparisons, you need a unified interface for benchmarking TensorRT engines. This helper class handles the low-level GPU memory management and inference execution, allowing us to focus on measuring performance rather than implementation details.

**Note:** For this exercise, we use synchronous CUDA operations (`execute_v2`, `cudaMemcpy`) rather than async variants to ensure accurate timing measurements. Async operations would improve throughput but make it harder to isolate individual inference latency for benchmarking purposes.


In [None]:
# Helper class for running inference with a TensorRT engine
class TensorRTInfer:
    def __init__(self, engine_path):
        self.logger = trt.Logger(trt.Logger.WARNING)
        with open(engine_path, "rb") as f, trt.Runtime(self.logger) as runtime:
            self.engine = runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()

        # Assume 1 input, 1 output
        self.in_name = [self.engine.get_tensor_name(i)
                        for i in range(self.engine.num_io_tensors)
                        if self.engine.get_tensor_mode(self.engine.get_tensor_name(i)) == trt.TensorIOMode.INPUT][0]
        self.out_name = [self.engine.get_tensor_name(i)
                         for i in range(self.engine.num_io_tensors)
                         if self.engine.get_tensor_mode(self.engine.get_tensor_name(i)) == trt.TensorIOMode.OUTPUT][0]

        # Match engine dtypes to numpy
        self.np_in_dtype = np.float16 if self.engine.get_tensor_dtype(self.in_name) == trt.float16 else np.float32
        self.np_out_dtype = np.float16 if self.engine.get_tensor_dtype(self.out_name) == trt.float16 else np.float32

    def infer(self, input_tensor: np.ndarray):
        # Ensure correct dtype and contiguous memory
        inp = np.ascontiguousarray(input_tensor).astype(self.np_in_dtype)

        # Handle dynamic shapes
        if -1 in self.engine.get_tensor_shape(self.in_name):
            self.context.set_input_shape(self.in_name, inp.shape)

        # Get concrete output shape
        out_shape = tuple(self.context.get_tensor_shape(self.out_name))

        # Allocate device memory
        in_bytes = inp.nbytes
        out_bytes = int(np.prod(out_shape)) * np.dtype(self.np_out_dtype).itemsize

        _, d_in = cudart.cudaMalloc(in_bytes)
        _, d_out = cudart.cudaMalloc(out_bytes)

        # Host output buffer
        h_out = np.empty(np.prod(out_shape), dtype=self.np_out_dtype)

        # Bind device pointers
        self.context.set_tensor_address(self.in_name, int(d_in))
        self.context.set_tensor_address(self.out_name, int(d_out))

        # H2D copy
        cudart.cudaMemcpy(d_in, inp.ctypes.data, in_bytes,
                          cudart.cudaMemcpyKind.cudaMemcpyHostToDevice)
        
        # Prepare bindings array
        bindings = [int(d_in), int(d_out)]  # device pointers in the order of engine bindings

        # Run synchronously
        if not self.context.execute_v2(bindings):
            raise RuntimeError("TensorRT inference failed")

        # D2H copy
        cudart.cudaMemcpy(h_out.ctypes.data, d_out, out_bytes,
                          cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost)

        # Free device memory
        cudart.cudaFree(d_in)
        cudart.cudaFree(d_out)

        return h_out.reshape(out_shape)

### 6.2 Define benchmarking logic for each model

With our infrastructure in place, you can now create consistent benchmarking functions that can handle both PyTorch and TensorRT models. The key challenge is ensuring fair comparisons by controlling for differences in input preprocessing, warmup procedures, and timing methodology across all four configurations you'll test.
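
To make the input-format difference concrete: the `TensorRTInfer.infer` helper above expects a host-side NumPy array, while a PyTorch model expects a tensor on the target device. The sketch below illustrates one way to branch on that. It is duck-typed (it only calls tensor methods), the name `prepare_inputs_sketch` and the `use_half` flag are hypothetical, and your own `_prepare_inputs` may need different rules depending on the optimizations you chose.

```python
def prepare_inputs_sketch(inputs, framework="pytorch", device="cuda", use_half=False):
    """Convert a dataloader batch into the format each backend expects.

    `inputs` is expected to behave like a torch.Tensor.
    """
    if framework == "tensorrt":
        # The TensorRT helper copies from host memory, so hand it a NumPy array.
        # It casts to the engine's input dtype itself.
        return inputs.cpu().numpy()

    if framework == "pytorch":
        # Move the batch to the model's device; pinned memory allows async copies.
        inputs = inputs.to(device, non_blocking=True)
        # Only cast when the PyTorch model itself was converted to FP16.
        return inputs.half() if use_half else inputs

    raise ValueError(f"Unknown framework: {framework!r}")
```

Keeping this conversion outside the timed region, as the benchmark functions below do, prevents data preparation from being counted as inference time.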

In [None]:
def _prepare_inputs(inputs, framework="pytorch", variant="baseline", device="cuda"):
    """
    Prepare inputs depending on framework and optimization variant.
    
    Args:
        inputs: batch from dataloader
        framework: "pytorch" or "tensorrt"
        variant: "baseline" or "optimized" (e.g., fp16)
        device: "cuda" or "cpu"
    """
    # TODO: Define your input
    # HINT: Your baseline and optimized models may expect different formats depending on the chosen optimization
    # Don't forget to place the input on the expected device too

    # Add your code here

    return inputs

def benchmark_trt_engine(engine_path, dataloader, num_batches=20, model_name="Model"):
    """Benchmark a TensorRT engine."""
    trt_infer = TensorRTInfer(engine_path)
    times = []
    
    print(f"Benchmarking {model_name}...")

    model_variant = "baseline" if "baseline" in model_name.lower() else "optimized"
    
    # Warmup
    for i, (inputs, _) in enumerate(dataloader):
        if i >= 3: break
        inputs = _prepare_inputs(inputs, framework="tensorrt", variant=model_variant)
        _ = trt_infer.infer(inputs)

    # Benchmark
    for i, (inputs, _) in enumerate(dataloader):
        if i >= num_batches: break
        inputs = _prepare_inputs(inputs, framework="tensorrt", variant=model_variant)
        
        torch.cuda.synchronize()
        start_time = time.perf_counter()
        _ = trt_infer.infer(inputs)
        torch.cuda.synchronize()
        end_time = time.perf_counter()
        
        times.append(end_time - start_time)
        
    return times

def benchmark_pytorch_model(model, dataloader, num_batches=20, model_name="Model"):
    """Benchmark a PyTorch model."""
    model.eval().to(device)
    times = []
    
    print(f"Benchmarking {model_name}...")
    
    model_variant = "baseline" if "baseline" in model_name.lower() else "optimized"

    # Warmup
    with torch.no_grad():
        for i, (inputs, _) in enumerate(dataloader):
            if i >= 3: break
            inputs = _prepare_inputs(inputs, framework="pytorch", variant=model_variant)
            _ = model(inputs)

    # Benchmark
    with torch.no_grad():
        for i, (inputs, _) in enumerate(dataloader):
            if i >= num_batches: break
            inputs = _prepare_inputs(inputs, framework="pytorch", variant=model_variant)
            
            torch.cuda.synchronize()
            start_time = time.perf_counter()
            _ = model(inputs)
            torch.cuda.synchronize()
            end_time = time.perf_counter()
            times.append(end_time - start_time)
            
    return times

### 6.3 Analyze optimizations' interactions and results

This is the core experiment: measuring your different configurations to isolate individual optimization effects and discover how they interact when combined. 

You'll benchmark baseline PyTorch, architecture-only PyTorch, hardware-only TensorRT, and combined TensorRT to calculate the interaction factor and determine whether techniques amplify, conflict, or remain independent.
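
Each configuration's throughput is derived from a list of per-batch latencies. As a small aside, the sketch below shows how a single slow batch shifts a mean-based throughput figure, using made-up latencies. `summarize_latencies` is an illustrative helper, and reporting a median alongside the mean is a suggestion, not something this exercise requires.

```python
import statistics


def summarize_latencies(times_s, batch_size):
    """Summarize per-batch latencies (in seconds) into throughput figures."""
    mean_s = statistics.mean(times_s)
    median_s = statistics.median(times_s)
    # 95th percentile via 20-quantiles; "inclusive" keeps the result within the data range.
    p95_s = statistics.quantiles(times_s, n=20, method="inclusive")[-1]
    return {
        "mean_s": mean_s,
        "median_s": median_s,
        "p95_s": p95_s,
        "throughput_mean": batch_size / mean_s,
        "throughput_median": batch_size / median_s,
    }


# Hypothetical latencies: four steady batches and one outlier.
example_times = [0.10, 0.11, 0.09, 0.10, 0.50]
summary = summarize_latencies(example_times, batch_size=32)

print(f"Mean-based throughput:   {summary['throughput_mean']:.1f} samples/sec")
print(f"Median-based throughput: {summary['throughput_median']:.1f} samples/sec")
print(f"p95 latency:             {summary['p95_s'] * 1000:.0f} ms")
```

When the two throughput figures diverge strongly, outlier batches are dominating the mean, and the resulting interaction factor deserves a second look.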

In [None]:
def analyze_technique_interactions():
    """Systematic analysis of the effect of optimizations."""
    print("\n=== TECHNIQUE INTERACTION ANALYSIS ===")
    results = {}
    batch_size = 32
    dataloader = dataloaders[batch_size]

    # --- 1. Baseline PyTorch ---
    times = benchmark_pytorch_model(baseline_model, dataloader, model_name="Baseline DenseNet (PyTorch)")
    results['baseline'] = {'throughput': batch_size / np.mean(times), 'name': "Baseline (PyTorch)"}
    print(f"  Throughput: {results['baseline']['throughput']:.1f} samples/sec\n")

    # --- 2. Architecture-Optimized PyTorch ---
    # Reuse the shared helper so warmup, batch count, and input handling match the baseline run.
    # The helper ignores the model output, so models that return a tuple (e.g. early exit) also work.
    times = benchmark_pytorch_model(arch_model, dataloader, model_name="Arch-Optimized (PyTorch)")
    results['architecture_only'] = {'throughput': batch_size / np.mean(times), 'name': "Arch-Optimized (PyTorch)"}
    print(f"  Throughput: {results['architecture_only']['throughput']:.1f} samples/sec\n")

    # --- 3. Hardware-Optimized TensorRT ---
    if baseline_trt_success:
        times = benchmark_trt_engine(baseline_engine_path, dataloader, model_name="Baseline DenseNet (TensorRT)")
        results['hardware_only'] = {'throughput': batch_size / np.mean(times), 'name': "Baseline (TensorRT)"}
        print(f"  Throughput: {results['hardware_only']['throughput']:.1f} samples/sec\n")
    else:
        results['hardware_only'] = {'throughput': 0, 'name': "Baseline (TensorRT)"}

    # --- 4. Combined Optimization TensorRT ---
    if arch_trt_success:
        times = benchmark_trt_engine(arch_engine_path, dataloader, model_name="Arch-Optimized (TensorRT)")
        results['combined'] = {'throughput': batch_size / np.mean(times), 'name': "Combined (TensorRT)"}
        print(f"  Throughput: {results['combined']['throughput']:.1f} samples/sec\n")
    else:
        results['combined'] = {'throughput': 0, 'name': "Combined (TensorRT)"}

    return results

interaction_results = analyze_technique_interactions()

In [None]:
# Print analysis results
baseline = interaction_results['baseline']
hardware_only = interaction_results.get('hardware_only', {'throughput': 0})
architecture_only = interaction_results['architecture_only']
combined = interaction_results.get('combined', {'throughput': 0})

# Avoid division by zero if a benchmark failed
if baseline['throughput'] > 0 and hardware_only['throughput'] > 0 and combined['throughput'] > 0:
    architecture_gain = architecture_only['throughput'] / baseline['throughput']
    hardware_gain = hardware_only['throughput'] / baseline['throughput']
    combined_gain = combined['throughput'] / baseline['throughput']

    # Analyze interaction effects
    theoretical_combined = architecture_gain * hardware_gain
    actual_combined = combined_gain
    interaction_factor = actual_combined / theoretical_combined

    print("=== INTEGRATION ANALYSIS ===")
    print("INDIVIDUAL TECHNIQUE EFFECTS:")
    print(f"Architecture optimization: {architecture_gain:.2f}x throughput")
    print(f"Hardware optimization:     {hardware_gain:.2f}x throughput")
    print()

    print("INTEGRATION EFFECTS:")
    print(f"Theoretical combined improvement (if independent): {theoretical_combined:.2f}x over baseline")
    print(f"Actual combined improvement:                       {actual_combined:.2f}x over baseline")
    print(f"Interaction factor: {interaction_factor:.2f}")

    if interaction_factor > 1.05:
        print("→ POSITIVE INTERACTION: Techniques amplify each other beyond expectation.")
    elif interaction_factor < 0.95:
        print("→ NEGATIVE INTERACTION: Combined gain falls short of the product of the individual gains.")
    else:
        print("→ NEUTRAL INTERACTION: Effects are roughly independent.")

else:
    print("\nOne or more benchmarks failed, cannot compute interaction analysis.")

> **Why 1+1 ≠ 2 in AI optimization**: You likely observed that the architecture and hardware gains did not simply multiply. 
> 
> This is because optimizations are not independent variables. As [Amdahl's law](https://en.wikipedia.org/wiki/Amdahl%27s_law) describes, each optimization only speeds up the portion of the runtime it actually touches, so the combined effect generally differs from _architecture_gain × hardware_gain_. Individual techniques create specific performance improvements, but their combination depends on how they interact at the computational level. 
> - Positive interactions occur when one optimization creates opportunities for another (e.g., parameter reduction enabling better cache utilization).
> - Negative interactions happen when optimizations conflict (e.g., dynamic logic defeating static compilation).
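
A small worked example shows why gains do not multiply even when two optimizations cooperate perfectly. It assumes, purely for illustration, that 90% of the runtime is affected by both techniques, that one of them alone makes that portion 2x faster, and that the other alone makes it 3x faster.

```python
def amdahl_speedup(fraction, local_speedup):
    """Overall speedup when `fraction` of the runtime becomes `local_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


AFFECTED_FRACTION = 0.9   # hypothetical share of runtime both techniques act on

architecture_gain = amdahl_speedup(AFFECTED_FRACTION, 2.0)
hardware_gain = amdahl_speedup(AFFECTED_FRACTION, 3.0)

# Best case: the two local speedups compose fully (2x * 3x = 6x on the affected part).
combined_gain = amdahl_speedup(AFFECTED_FRACTION, 2.0 * 3.0)

theoretical_combined = architecture_gain * hardware_gain
interaction_factor = combined_gain / theoretical_combined

print(f"Architecture only: {architecture_gain:.2f}x")
print(f"Hardware only:     {hardware_gain:.2f}x")
print(f"Product of gains:  {theoretical_combined:.2f}x")
print(f"Actual combined:   {combined_gain:.2f}x")
print(f"Interaction factor: {interaction_factor:.2f}")
```

The untouched 10% of the runtime caps the combined speedup, so the interaction factor lands below 1 in this scenario even though nothing conflicts.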

------
> #### **TODO: Test your strategic optimization mindset** 
> 
> Add your answers to these analysis questions in the space below.
> 
> 1. **Analyze the bottleneck**: Based on the individual gains, which optimization (architectural or hardware) was more effective for DenseNet121 on its own? Why do you think that was the primary bottleneck for this model?
> 
> _Add your answer here_:  __________________
> 
> 2. **Explain the optimizations' interaction**: Describe the interaction effect you observed (positive, negative, or neutral). Was it expected?
> 
> _Add your answer here_:  __________________
> 
> 3. **Propose a final combination**: Imagine your goal was purely maximum throughput, and accuracy could drop slightly. Based on the options in Step 3, propose your final combination of architectural and hardware optimizations that you hypothesize would create a positive interaction (synergy). Justify your choice.
> 
> _Add your answer here_:  __________________

## Conclusion

You have now implemented a combined hardware-architecture optimization strategy and measured the interaction effects on the DenseNet121 model on GPU. 

**Key takeaway**: _Optimizations are not composable by default_. Simply applying the best architectural trick and the best hardware tool does not guarantee multiplicative benefits. Effective performance engineering requires a holistic approach, considering how the model’s computational patterns and the hardware’s capabilities interact—sometimes synergistically, sometimes destructively.

##### **Next challenge -> Explore other architectures**

How would these same techniques apply to a different architecture, such as a Vision Transformer (ViT)? ViTs are dominated by large, dense matrix multiplications rather than sequential convolutions and concatenations. This structural difference may lead to very different interactions with mixed precision, kernel fusion, and architectural optimizations.