# Exercise 02: Full End-to-End CNN Lab (ResNet + Gradient Demo)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/shang-vikas/series1-coding-exercises/blob/main/exercises/blog-04/exercise-02.ipynb)

## 🔥 FULL END-TO-END CNN LAB (ResNet + Gradient Demo)

### 0️⃣ Setup

In [None]:
# Install required packages using the kernel's Python interpreter
import sys
import subprocess
import importlib

def install_if_missing(package, import_name=None):
    """Install package if it's not already installed."""
    if import_name is None:
        import_name = package

    try:
        importlib.import_module(import_name)
        print(f"✓ {package} is already installed")
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])
        print(f"✓ {package} installed successfully")

# Install required packages
install_if_missing("torch")
install_if_missing("torchvision")
install_if_missing("matplotlib")
install_if_missing("numpy")

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torchvision import models
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import numpy as np
import time

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)

### 1️⃣ Data Loading + Augmentation

In [None]:
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5),
                         (0.5, 0.5, 0.5))
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5),
                         (0.5, 0.5, 0.5))
])

train_dataset = torchvision.datasets.CIFAR10(
    root="./data",
    train=True,
    download=True,
    transform=transform_train
)

test_dataset = torchvision.datasets.CIFAR10(
    root="./data",
    train=False,
    download=True,
    transform=transform_test
)

train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=128, shuffle=False)

classes = train_dataset.classes

### 2️⃣ Sample Data Check

In [None]:
def show_sample_images():
    images, labels = next(iter(train_loader))
    fig = plt.figure(figsize=(10,5))
    
    for i in range(8):
        ax = fig.add_subplot(2,4,i+1)
        img = images[i] / 2 + 0.5  # undo Normalize(0.5, 0.5): [-1, 1] -> [0, 1]
        npimg = img.numpy()
        ax.imshow(np.transpose(npimg, (1,2,0)))  # CHW -> HWC for matplotlib
        ax.set_title(classes[labels[i]])
        ax.axis("off")
    
    plt.show()

show_sample_images()

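**Never skip this step.** A quick look at a few samples catches broken normalization, wrong labels, and channel-order mistakes before they cost you a training run.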
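The `images[i] / 2 + 0.5` line in `show_sample_images` inverts the `Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))` transform. Here is a minimal pure-Python sketch of that arithmetic for a single pixel value. The helper names `normalize` and `unnormalize` are illustrative only; they are not torchvision functions.

```python
def normalize(value, mean=0.5, std=0.5):
    """Per-channel arithmetic applied by transforms.Normalize: (x - mean) / std."""
    return (value - mean) / std


def unnormalize(value, mean=0.5, std=0.5):
    """Inverse mapping; with mean = std = 0.5 this equals value / 2 + 0.5."""
    return value * std + mean


# Pixel values in [0, 1] (after ToTensor) land in [-1, 1] and map back again.
for pixel in (0.0, 0.5, 1.0):
    z = normalize(pixel)
    print(f"{pixel:.1f} -> {z:+.1f} -> {unnormalize(z):.1f}")
```

If the mean or std in the transform changes, the display code has to change with it, otherwise the preview images look washed out or clipped.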

### 3️⃣ Load ResNet-18

In [None]:
model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, 10)
model = model.to(device)

### 🔹 ASCII-Friendly Architecture Printer

In [None]:
def print_resnet_architecture(model):
    print("\n================ RESNET-18 ARCHITECTURE ================\n")
    
    print("Input: 3 x 32 x 32\n")
    
    print("Initial Stem:")
    print("  Conv1 (7x7, stride=2)")
    print("  BatchNorm")
    print("  ReLU")
    print("  MaxPool\n")
    
    print("Residual Layers:")
    
    for i, layer in enumerate([model.layer1, model.layer2, model.layer3, model.layer4], start=1):
        print(f"  Layer {i}:")
        for j, block in enumerate(layer):
            downsample = " + Downsample" if block.downsample is not None else ""
            print(f"    ├── BasicBlock {j+1}{downsample}")
            print("    │     Conv → BN → ReLU")
            print("    │     Conv → BN")
            print("    │     + Identity Shortcut")
            print("    │     → ReLU")
        print()
    
    print("Head:")
    print("  AdaptiveAvgPool")
    print("  Fully Connected (→ 10 classes)")
    
    print("\n========================================================\n")

### 🔹 Call It

In [None]:
print_resnet_architecture(model)
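The printout lists the layers but not what happens to the spatial resolution. The sketch below traces the feature-map side length through the stages using the standard convolution output-size formula. It assumes the kernel, stride, and padding values of torchvision's stock ResNet-18 (7x7 stride-2 stem, 3x3 stride-2 max pool, stride-2 first conv in layers 2 to 4); `conv_out` and `trace_resnet18` are hypothetical helper names.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a conv or pooling layer along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_resnet18(size=32):
    """Spatial side length after each stage for a square input of the given size."""
    trace = []
    size = conv_out(size, kernel=7, stride=2, padding=3)
    trace.append(("conv1", size))
    size = conv_out(size, kernel=3, stride=2, padding=1)
    trace.append(("maxpool", size))
    # layer1 keeps the resolution; layer2 to layer4 each open with a stride-2 3x3 conv
    for name, stride in [("layer1", 1), ("layer2", 2), ("layer3", 2), ("layer4", 2)]:
        size = conv_out(size, kernel=3, stride=stride, padding=1)
        trace.append((name, size))
    return trace


for stage, side in trace_resnet18(32):
    print(f"{stage:8s} {side} x {side}")
```

For a 32x32 CIFAR-10 image the stem alone reduces the map to 8x8, which is why CIFAR-specific ResNet variants often swap in a smaller stem.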

## 🔥 Visualize Residual Block Computation

We'll:

- Extract one real BasicBlock from your ResNet-18
- Run a real image through it
- Show:
  - F(x)
  - Identity path
  - F(x) + x
- Print the math cleanly in ASCII
- Visualize intermediate tensors

No fake diagrams. Real tensors.

### 🔥 Step 1: Grab One Residual Block

ResNet-18 structure:

```
model.layer1[0]  ← first BasicBlock
```

So:

In [None]:
block = model.layer1[0]
print(block)

This is a real residual block.

### 🔥 Step 2: Define Residual Visualization Function

This shows the math:

```
y = F(x) + x
```

In [None]:
def visualize_residual_block(block, model):
    model.eval()
    
    # Get one sample image
    images, _ = next(iter(test_loader))
    x = images[0].unsqueeze(0).to(device)

    # Pass through stem first
    with torch.no_grad():
        x = model.conv1(x)
        x = model.bn1(x)
        x = model.relu(x)
        x = model.maxpool(x)

    print("Input shape to block:", x.shape)

    with torch.no_grad():
        identity = x

        out = block.conv1(x)
        out = block.bn1(out)
        out = block.relu(out)

        out = block.conv2(out)
        out = block.bn2(out)

        if block.downsample is not None:
            identity = block.downsample(x)

        summed = out + identity
        final = block.relu(summed)

    print("\nTensor Shapes:")
    print("F(x) shape:", out.shape)
    print("Identity shape:", identity.shape)
    print("After Addition shape:", summed.shape)

    return x.cpu(), out.cpu(), identity.cpu(), summed.cpu(), final.cpu()

**Run it:**

In [None]:
x, fx, identity, summed, final = visualize_residual_block(block, model)

### 🔥 Step 3: Print the Math (ASCII View)

In [None]:
def print_residual_math():
    print("""
Residual Block Computation:

Given input x

1) F(x):
   x → Conv → BN → ReLU
     → Conv → BN
     = F(x)

2) Identity path:
   x (optionally downsampled)

3) Addition:
   y = F(x) + x

4) Final activation:
   output = ReLU(y)

Key idea:
Instead of learning H(x),
the block learns F(x) = H(x) - x

So:
H(x) = F(x) + x
""")

print_residual_math()

That's the core insight.

### 🔥 Step 4: Visualize Feature Maps

Let's visualize:

- Identity path
- F(x)
- F(x) + x

In [None]:
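def show_feature_maps(tensor, title, num_maps=8):
    cols = 4
    rows = (num_maps + cols - 1) // cols  # enough rows to hold num_maps panels
    fig = plt.figure(figsize=(12, 2 * rows))
    for i in range(num_maps):
        ax = fig.add_subplot(rows, cols, i+1)
        ax.imshow(tensor[0][i], cmap='viridis')  # channel i of the first sample
        ax.axis("off")
    plt.suptitle(title)
    plt.show()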

**Now:**

In [None]:
show_feature_maps(identity, "Identity (x)")
show_feature_maps(fx, "F(x)")
show_feature_maps(summed, "F(x) + x")

You'll literally see:

- Identity = original activation maps
- F(x) = learned modification
- Summed = refinement

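That's the residual idea in pixels. At this point the network is still untrained, so F(x) reflects random initial weights; rerun these cells after the training section to see the learned version.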

## 🧠 What Engineers Should Notice

A residual block does not replace the representation.

It refines it.

Instead of learning:

```
new_representation = heavy_transform(x)
```

It learns:

```
new_representation = x + small_adjustment
```

Which means:

- If the extra layers aren't needed, F(x) → 0
- The identity then passes through cleanly
- Gradients flow directly along the shortcut
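The "F(x) → 0" case can be checked directly. The sketch below defines a stripped-down residual block (`TinyResidualBlock` is a hypothetical class that mirrors the Conv → BN → ReLU → Conv → BN layout, not torchvision's `BasicBlock`) and zeroes the scale of its last BatchNorm, which forces F(x) = 0 in eval mode. Under that assumption the block's output should equal `ReLU(x)`.

```python
import torch
import torch.nn as nn


class TinyResidualBlock(nn.Module):
    """Minimal residual block for illustration: y = ReLU(F(x) + x)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))  # first half of F(x)
        out = self.bn2(self.conv2(out))           # F(x)
        return self.relu(out + x)                 # shortcut addition, then activation


torch.manual_seed(0)
tiny_block = TinyResidualBlock(4).eval()
nn.init.zeros_(tiny_block.bn2.weight)  # BN scale 0 (bias is 0 by default) -> F(x) = 0

x_demo = torch.randn(1, 4, 8, 8)
with torch.no_grad():
    y_demo = tiny_block(x_demo)

identity_preserved = torch.equal(y_demo, torch.relu(x_demo))
print("Output equals ReLU(x):", identity_preserved)
```

Zero-initializing the last BatchNorm scale in each block is also a known initialization trick, so every block starts out as an identity mapping.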

## 🔥 Why This Mitigates Vanishing Gradients

**Without residual:**

```
x → F1 → F2 → F3 → ...
```

Each layer contributes one factor to the gradient, so small factors multiply and shrink it.

**With residual:**

```
x → F(x) + x
```

Gradient has two paths:

- Through F(x)
- Directly through x

That second path stabilizes training.

**Mathematically:**

```
dL/dx = dL/dy * (dF/dx + 1)
```

That +1 is the stabilizer.

No magic. Just identity shortcut.
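A scalar toy model makes the effect of the +1 concrete. The sketch assumes every layer has the same local derivative `dF/dx` (a strong simplification; real networks have Jacobian matrices that differ per layer) and compares the gradient reaching the input with and without shortcuts. Both function names are illustrative.

```python
def plain_chain_grad(local_grad, depth):
    """Gradient through a plain stack: the product of `depth` identical factors dF/dx."""
    grad = 1.0
    for _ in range(depth):
        grad *= local_grad
    return grad


def residual_chain_grad(local_grad, depth):
    """Same stack with shortcuts: each layer's factor becomes dF/dx + 1."""
    grad = 1.0
    for _ in range(depth):
        grad *= local_grad + 1.0
    return grad


for depth in (1, 5, 10, 20):
    plain = plain_chain_grad(0.1, depth)
    residual = residual_chain_grad(0.1, depth)
    print(f"depth={depth:2d}  plain={plain:.3e}  residual={residual:.3e}")
```

With a local derivative of 0.1 the plain product collapses toward zero as depth grows, while the residual product stays at order one. The toy also shows the flip side: factors above 1 can compound, which is one reason residual networks are usually paired with normalization.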

## 🧠 This Is Deeply Important

Residual blocks are to CNNs what:

- Gates are to LSTMs
- Skip connections are to Transformers

They all introduce:

**A protected path for gradients.**

That architectural motif repeats across deep learning.

### 4️⃣ Parameter Count

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print("ResNet-18 Parameters:", f"{count_parameters(model):,}")

**~11.2 million** trainable parameters (11,181,642 with the 10-class head).
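To see where those parameters come from, the count for individual layers can be worked out by hand. The sketch below uses the usual formulas for convolution, BatchNorm, and linear layers and applies them to the stem, one 64-channel residual block, and the replaced head. It assumes torchvision's convention of bias-free convolutions inside ResNet; the helper names are hypothetical.

```python
def conv2d_params(in_ch, out_ch, kernel, bias=True):
    """Weights (out_ch x in_ch x k x k) plus an optional bias per output channel."""
    return out_ch * in_ch * kernel * kernel + (out_ch if bias else 0)


def batchnorm_params(channels):
    """Learnable scale and shift per channel (running stats are buffers, not parameters)."""
    return 2 * channels


def linear_params(in_features, out_features, bias=True):
    """Weight matrix plus an optional bias per output unit."""
    return in_features * out_features + (out_features if bias else 0)


stem = conv2d_params(3, 64, 7, bias=False) + batchnorm_params(64)
basic_block_64 = 2 * conv2d_params(64, 64, 3, bias=False) + 2 * batchnorm_params(64)
head = linear_params(512, 10)

print(f"stem (conv1 + bn1):      {stem:,}")
print(f"one layer1 BasicBlock:   {basic_block_64:,}")
print(f"fc head (512 -> 10):     {head:,}")
```

Most of the total sits in the later stages, where the channel count grows to 256 and 512 and the 3x3 convolutions scale with the product of input and output channels.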

### 5️⃣ Training Loop

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

def train(model, epochs=5):
    for epoch in range(epochs):
        model.train()
        total_loss = 0
        correct = 0
        total = 0
        
        start = time.time()

        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)

            outputs = model(images)
            loss = criterion(outputs, labels)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            _, preds = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (preds == labels).sum().item()

        print(f"Epoch {epoch+1} | "
              f"Loss: {total_loss/len(train_loader):.4f} | "
              f"Train Acc: {correct/total:.4f} | "
              f"Time: {time.time()-start:.2f}s")

train(model, epochs=5)

### 6️⃣ Evaluation

In [None]:
def evaluate(model):
    model.eval()
    correct = 0
    total = 0

    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, preds = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (preds == labels).sum().item()

    print("Test Accuracy:", correct / total)

evaluate(model)

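**Expect roughly 80–88% after 5 epochs.** Exact numbers vary with the random seed and hardware, and can come in lower because the unmodified ImageNet stem shrinks a 32x32 input to 8x8 before the first residual block.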
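A single accuracy number hides which classes the model struggles with. As a possible follow-up, here is a small pure-Python sketch that breaks accuracy down per class from two lists of integer predictions and labels. `per_class_accuracy` is a hypothetical helper, and the toy inputs below are made up for illustration; they are not results from this lab.

```python
from collections import Counter


def per_class_accuracy(preds, labels, class_names):
    """Return {class_name: fraction correct} for every class present in labels."""
    correct = Counter()
    total = Counter()
    for pred, label in zip(preds, labels):
        total[label] += 1
        if pred == label:
            correct[label] += 1
    return {class_names[c]: correct[c] / total[c] for c in sorted(total)}


# Made-up toy data: three classes, two samples each
toy_names = ["airplane", "automobile", "bird"]
toy_labels = [0, 0, 1, 1, 2, 2]
toy_preds = [0, 0, 1, 2, 2, 0]

toy_result = per_class_accuracy(toy_preds, toy_labels, toy_names)
for name, acc in toy_result.items():
    print(f"{name:12s} {acc:.2f}")
```

To use it with the lab's model, collect `preds.tolist()` and `labels.tolist()` inside the evaluation loop and pass `classes` as `class_names`.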

## 🔎 7️⃣ Visualize Feature Maps (Conv Layer Outputs)

This is where things get interesting.

In [None]:
def visualize_feature_maps(model, image):
    model.eval()
    
    with torch.no_grad():
        x = model.conv1(image.unsqueeze(0).to(device))
        x = model.bn1(x)
        x = model.relu(x)
    
    feature_maps = x.cpu().squeeze(0)

    fig = plt.figure(figsize=(12,6))
    for i in range(16):
        ax = fig.add_subplot(4,4,i+1)
        plt.imshow(feature_maps[i], cmap="viridis")
        ax.axis("off")

    plt.suptitle("First Layer Feature Maps")
    plt.show()

images, _ = next(iter(test_loader))
visualize_feature_maps(model, images[0])

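After training, many of these maps respond to edges and color contrasts.

That's the locality bias of convolution at work: each filter only sees a small neighborhood of pixels.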
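Calling the stem layers by hand works for the first convolution, but deeper layers are easier to inspect with forward hooks. The sketch below uses PyTorch's `register_forward_hook` to capture the output of named submodules during one forward pass. `capture_activations` is a hypothetical helper, and the small `nn.Sequential` model stands in for the ResNet so the example runs on its own.

```python
import torch
import torch.nn as nn


def capture_activations(model, layer_names, x):
    """Run one forward pass and return {layer_name: output tensor} for the named submodules."""
    captured = {}
    handles = []
    modules = dict(model.named_modules())
    for name in layer_names:
        # Bind `name` as a default argument so each hook stores under its own key
        def hook(module, inputs, output, name=name):
            captured[name] = output.detach().cpu()
        handles.append(modules[name].register_forward_hook(hook))
    try:
        model.eval()
        with torch.no_grad():
            model(x)
    finally:
        for handle in handles:
            handle.remove()  # always detach hooks so later passes are unaffected
    return captured


# Stand-in model; submodules of nn.Sequential are named "0", "1", "2", ...
toy_net = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1),
)
acts = capture_activations(toy_net, ["0", "2"], torch.randn(1, 3, 32, 32))
print({name: tuple(t.shape) for name, t in acts.items()})
```

With the lab's ResNet, something like `capture_activations(model, ["layer1", "layer4"], images[:1].to(device))` should return tensors that can be passed to a plotting helper.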

## 💥 8️⃣ Demonstrate Vanishing Gradient (Plain Deep CNN)

Now we build a deep CNN without residual connections.

In [None]:
class DeepPlainCNN(nn.Module):
    def __init__(self):
        super().__init__()
        
        layers = []
        in_channels = 3
        
        for _ in range(10):   # 10 deep conv layers
            layers.append(nn.Conv2d(in_channels, 64, kernel_size=3, padding=1))
            layers.append(nn.ReLU())
            in_channels = 64
            
        layers.append(nn.AdaptiveAvgPool2d((1,1)))
        layers.append(nn.Flatten())
        layers.append(nn.Linear(64, 10))
        
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)

plain_model = DeepPlainCNN().to(device)

**Gradient Inspection Function**

In [None]:
def check_gradients(model):
    model.train()
    model.zero_grad()  # clear gradients left over from training so they don't accumulate
    images, labels = next(iter(train_loader))
    images, labels = images.to(device), labels.to(device)

    outputs = model(images)
    loss = criterion(outputs, labels)
    loss.backward()

    # Report only the first parameter (the earliest layer's weights), then stop
    for name, param in model.named_parameters():
        if param.grad is not None:
            print(f"{name} | Grad Mean: {param.grad.abs().mean().item():.6f}")
            break

**Compare Gradient Strength**

In [None]:
print("Plain Deep CNN Gradients:")
check_gradients(plain_model)

print("\nResNet Gradients:")
check_gradients(model)

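You'll typically see:

- Plain CNN → much smaller early-layer gradients
- ResNet → healthier gradients

That is largely the effect of the residual connections. Note that the comparison is not perfectly controlled: the ResNet has already been trained for 5 epochs and uses BatchNorm, while the plain CNN is freshly initialized and has no normalization.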
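Looking at only the first layer gives one data point per model. The sketch below reports the mean absolute gradient for every weight tensor, so the trend across depth becomes visible. It runs on random inputs with a small plain CNN built by the hypothetical `make_plain_cnn` helper (narrower than `DeepPlainCNN` to keep it fast); `layer_grad_means` is likewise an illustrative name.

```python
import torch
import torch.nn as nn


def layer_grad_means(model, inputs, targets):
    """One forward/backward pass; returns [(param_name, mean |grad|)] for weight tensors."""
    model.zero_grad()
    loss = nn.functional.cross_entropy(model(inputs), targets)
    loss.backward()
    return [
        (name, p.grad.abs().mean().item())
        for name, p in model.named_parameters()
        if p.grad is not None and name.endswith("weight")
    ]


def make_plain_cnn(depth=10, width=16, num_classes=10):
    """Plain Conv -> ReLU stack without shortcuts or normalization."""
    layers, in_ch = [], 3
    for _ in range(depth):
        layers += [nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU()]
        in_ch = width
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(width, num_classes)]
    return nn.Sequential(*layers)


torch.manual_seed(0)
toy_plain = make_plain_cnn()
fake_images = torch.randn(8, 3, 32, 32)    # random stand-in for a CIFAR-10 batch
fake_labels = torch.randint(0, 10, (8,))

grad_stats = layer_grad_means(toy_plain, fake_images, fake_labels)
for name, grad_mean in grad_stats:
    print(f"{name:12s} {grad_mean:.2e}")
```

Passing the lab's `plain_model` and a freshly constructed ResNet-18 with a real batch would give a like-for-like comparison at initialization.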

## 🧠 Why This Works

A residual block computes:

```
output = F(x) + x
```

This creates a shortcut path for gradients.

Instead of multiplying repeatedly through nonlinear layers, gradients can flow directly.

That stabilizes deep training.

## 🔥 What This Lab Demonstrates

- Locality bias
- Hierarchical feature extraction
- Residual gradient preservation
- Deep scaling advantage
- Real-world architecture

This is no longer "CNN explanation".

This is architecture intuition.