# Chapter 2: Convolutional Neural Networks (CNNs)

Welcome to Chapter 2! 🖼️

In this chapter, you'll discover why CNNs are the go-to architecture for anything involving images or spatial data. CNNs power everything from facial recognition on your phone to medical image analysis in hospitals!

**What you'll learn:**
- Why regular neural networks struggle with images
- How convolutions work (it's like using a magnifying glass to scan an image)
- How to build CNNs that recognize patterns in images
- How to apply CNNs to biological images (such as classifying cell types)

**Prerequisites:**
- Chapter 1 (Neural Networks Basics) - we'll build on those concepts
- Basic understanding of images as grids of pixels
- Familiarity with matrix operations (we'll explain as we go)

## 📚 Table of Contents
1. [Why Convolutions?](#why-convolutions)
2. [Understanding Convolution Operation](#convolution-op)
3. [CNN Components](#cnn-components)
4. [Building Your First CNN](#first-cnn)
5. [Popular CNN Architectures](#architectures)
6. [Transfer Learning](#transfer-learning)
7. [Biology Application: Cell Image Classification](#biology-app)

---


In [None]:
# Import libraries
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms, models
from torch.utils.data import DataLoader, TensorDataset
import seaborn as sns
from PIL import Image

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

# Set seeds
np.random.seed(42)
torch.manual_seed(42)

print('Libraries imported successfully!')
print(f'PyTorch version: {torch.__version__}')
print(f'GPU available: {torch.cuda.is_available()}')

## 1. Why Convolutions? <a id="why-convolutions"></a>

### The Big Problem with Regular Neural Networks for Images

Let's understand why we need something special for images:

**Example 1: A tiny MNIST digit (28×28 pixels, grayscale)**
- Total inputs: 28 × 28 = 784 pixels
- With 1000 neurons in the first layer: 784 × 1000 = **784,000 parameters** (weights)
- That's a lot, but manageable...

**Example 2: A small color photo (224×224 pixels, RGB)**
- Total inputs: 224 × 224 × 3 (color channels) = 150,528 values
- With 1000 neurons: 150,528 × 1000 = **150,528,000 parameters!**
- This is getting out of control...

**Example 3: A high-resolution medical image (1024×1024 pixels, RGB)**
- Total inputs: 1024 × 1024 × 3 = 3,145,728 values
- With 1000 neurons: about **3.1 BILLION parameters!**
- At 4 bytes per 32-bit weight, that is over 12 GB for a single layer. Most computers would run out of memory! 💥

### Why So Many Parameters Are a Problem

1. **Too much memory**: Your computer can't store all those weights
2. **Too slow**: Training takes forever (or never finishes)
3. **Overfitting**: The model memorizes training images instead of learning patterns
4. **Ignores structure**: A regular network doesn't know that nearby pixels are related

### The CNN Solution: Three Key Ideas

CNNs address these problems with three ideas:

**1. Local Connectivity (nearby pixels matter more)**
- Analogy: To recognize an eye, you look at the pixels right around it, not at pixels on the other side of the image
- Solution: Each neuron only connects to a small patch of the image (e.g., 3×3 or 5×5 pixels)
- Result: Fewer parameters! Instead of 150 million weights, a layer may need only a few thousand

**2. Parameter Sharing (reuse same detector everywhere)**
- Analogy: If you have a "cat detector," it should work whether the cat is in the top-left or bottom-right of the image
- Solution: Use the same filter (set of weights) across the entire image
- Result: Even fewer parameters! The same 3×3 filter is used everywhere

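**3. Translation Equivariance (patterns are detected wherever they appear)**
- Analogy: A cat is still a cat whether it's in the center or corner of a photo
- Solution: Because the same filter slides over the whole image, shifting the input shifts the feature map by the same amount (equivariance). Pooling layers then add approximate invariance to small shifts
- Result: The model generalizes better to objects in new positions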
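
The parameter counts quoted above can be checked with a few lines of arithmetic. This is a minimal sketch: the helper names `dense_params` and `conv_params` are hypothetical, and the choice of 64 filters of size 3×3 is an illustrative assumption, not a value from this chapter.

```python
def dense_params(n_inputs: int, n_neurons: int, bias: bool = True) -> int:
    """Weights (plus optional biases) of a fully connected layer."""
    return n_inputs * n_neurons + (n_neurons if bias else 0)


def conv_params(in_channels: int, out_channels: int, kernel_size: int,
                bias: bool = True) -> int:
    """Weights (plus optional biases) of a 2D conv layer with square kernels."""
    per_filter = in_channels * kernel_size * kernel_size
    return out_channels * per_filter + (out_channels if bias else 0)


# Fully connected layer on a 224x224 RGB image with 1000 neurons (weights only)
fc = dense_params(224 * 224 * 3, 1000, bias=False)

# Convolutional layer on the same image: 64 filters of size 3x3 (assumed values)
conv = conv_params(in_channels=3, out_channels=64, kernel_size=3)

print(f'Fully connected: {fc:,} parameters')
print(f'Convolutional:   {conv:,} parameters')
```

Note that the convolutional count does not depend on the image size at all, because the same small filters are reused at every position.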

### Real-World Analogy

Think of reading a book:
- **Regular Neural Network**: Memorizing the exact position of every word on every page (impossible!)
- **CNN**: Learning patterns (letter shapes, word structures) that work anywhere on the page (practical!)

Let's visualize what convolution actually does:


### Connections to Signal Processing and Classical Methods

#### Convolution in Mathematics and Signal Processing

The convolution operation used in CNNs is borrowed directly from **signal processing** and **classical applied mathematics** [@strang2016introduction]:

In continuous form:
$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau) g(t - \tau) d\tau$$

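In discrete 2D form:
$$(I * K)(i, j) = \sum_{m} \sum_{n} I(i-m, j-n) \cdot K(m, n)$$

Deep learning libraries typically implement the closely related **cross-correlation**, which skips the kernel flip and uses $I(i+m, j+n)$ instead. Because the kernel weights are learned, the distinction makes no practical difference, and the operation is still called convolution.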

**Historical context:** Convolutions have been used in signal processing since the 1960s for:
- Audio filtering (removing noise from sound)
- Image processing (edge detection, blurring, sharpening)
- Time series analysis (smoothing, trend detection)

**What CNNs innovate:** Instead of using hand-designed filters (like edge detectors), CNNs **learn the optimal filters** from data through backpropagation!

#### Translation Equivariance: A Mathematical Property

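Convolutional layers have a beautiful mathematical property called **translation equivariance**:

If input $I$ is shifted by vector $v$, the feature map is also shifted by $v$:
$$\text{Conv}(I \text{ shifted by } v) = \text{Conv}(I) \text{ shifted by } v$$

This holds exactly for stride-1 convolutions away from the image borders. Strides, padding, and pooling make it only approximate in a full CNN.

**Why this matters:**
- A filter that detects a cat's ear in the top-left also detects it in the bottom-right: the response simply moves with the ear
- This property is built into the architecture, not learned from data
- Classical fully-connected networks must learn to detect the same pattern separately at each location (inefficient!)
- Pooling and the final classification layers turn equivariant feature maps into an approximately translation-*invariant* prediction: a cat is still a cat wherever it appears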
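
Here is a small numerical check of the equivariance property. It is a sketch under stated assumptions: circular padding is used so that the identity holds exactly (with zero padding it only holds away from the borders), and the shift of 2 rows and 3 columns is an arbitrary illustrative choice.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Circular padding removes border effects, so the shift identity is exact
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1,
                 padding_mode='circular', bias=False)

x = torch.randn(1, 1, 8, 8)
shift = (2, 3)  # shift by 2 rows and 3 columns (illustrative values)

with torch.no_grad():
    out_of_shifted = conv(torch.roll(x, shifts=shift, dims=(2, 3)))
    shifted_out = torch.roll(conv(x), shifts=shift, dims=(2, 3))

# Equivariance: convolving a shifted image equals shifting the convolved image
print('Equivariant:', torch.allclose(out_of_shifted, shifted_out, atol=1e-5))

# Invariance only appears after pooling over all positions (global max here)
print('Global max unchanged:',
      torch.allclose(out_of_shifted.max(), shifted_out.max(), atol=1e-5))
```

The feature map moves with the input (equivariance), while a global summary such as the maximum stays the same (invariance).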

#### Comparison with Feature Engineering

**Classical computer vision (pre-CNN)** [@lecun1998gradient]:
1. Hand-design filters (e.g., Gabor filters, SIFT features, HOG features)
2. Apply filters to extract features
3. Use simple classifier (SVM, logistic regression) on extracted features

**CNN approach:**
1. Learn filters automatically from training data
2. Hierarchically compose simple filters into complex ones
3. Integrate feature extraction and classification (end-to-end learning)

**Statistical perspective:** CNNs perform **automatic feature learning** within a constrained hypothesis space (the convolutional structure). This is similar to regularization in classical statistics: we constrain the model to reduce overfitting. Unlike $L_1$ or $L_2$ regularization, which penalize weight magnitudes, CNNs use architectural constraints (weight sharing, local connectivity).
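
To make the contrast concrete, the sketch below applies one hand-designed filter (a Sobel kernel, chosen here as an example of the classical approach) to a synthetic image, and then creates a learnable filter of the same shape. The 6×6 test image is a made-up example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Synthetic 6x6 image: dark left half, bright right half (one vertical edge)
image = torch.zeros(1, 1, 6, 6)
image[..., 3:] = 1.0

# Hand-designed filter: Sobel kernel responding to horizontal intensity changes
sobel_x = torch.tensor([[-1.0, 0.0, 1.0],
                        [-2.0, 0.0, 2.0],
                        [-1.0, 0.0, 1.0]]).reshape(1, 1, 3, 3)

edges = F.conv2d(image, sobel_x)  # fixed weights, nothing is learned
print(edges[0, 0])  # large values only in the columns next to the edge

# Learned filter: same shape, but the weights are trainable parameters
learned = nn.Conv2d(1, 1, kernel_size=3, bias=False)
print('Trainable:', learned.weight.requires_grad,
      '| shape:', tuple(learned.weight.shape))
```

In the classical pipeline the Sobel weights stay fixed. In a CNN the weights of `learned` start out random and are adjusted by backpropagation.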



In [None]:
def visualize_convolution():
    """Visualize a simple 2D convolution operation."""
    
    # Create a simple input (5x5)
    input_img = np.array([
        [1, 1, 1, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 1, 1, 1],
        [0, 0, 1, 1, 0],
        [0, 1, 1, 0, 0]
    ])
    
    # Edge detection kernel (3x3)
    kernel = np.array([
        [-1, -1, -1],
        [-1,  8, -1],
        [-1, -1, -1]
    ])
    
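    # Slide the kernel over the input (no kernel flip, i.e. cross-correlation;
    # the result equals true convolution here because the kernel is symmetric)
    k = kernel.shape[0]
    output_size = input_img.shape[0] - k + 1
    output = np.zeros((output_size, output_size))
    
    for i in range(output_size):
        for j in range(output_size):
            # Extract the k×k region under the kernel
            region = input_img[i:i+k, j:j+k]
            # Multiply element-wise and sum
            output[i, j] = np.sum(region * kernel)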
    
    # Visualize
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    
    # Input
    im1 = axes[0].imshow(input_img, cmap='gray', interpolation='nearest')
    axes[0].set_title('Input Image (5×5)', fontsize=13, weight='bold')
    axes[0].grid(True, which='both', color='red', linewidth=0.5, alpha=0.3)
    axes[0].set_xticks(np.arange(-0.5, 5, 1), minor=True)
    axes[0].set_yticks(np.arange(-0.5, 5, 1), minor=True)
    for i in range(5):
        for j in range(5):
            axes[0].text(j, i, str(int(input_img[i, j])), 
                        ha='center', va='center', color='red', fontsize=11, weight='bold')
    plt.colorbar(im1, ax=axes[0])
    
    # Kernel
    im2 = axes[1].imshow(kernel, cmap='RdBu', interpolation='nearest', vmin=-8, vmax=8)
    axes[1].set_title('Edge Detection Kernel (3×3)', fontsize=13, weight='bold')
    axes[1].grid(True, which='both', color='black', linewidth=0.5, alpha=0.3)
    axes[1].set_xticks(np.arange(-0.5, 3, 1), minor=True)
    axes[1].set_yticks(np.arange(-0.5, 3, 1), minor=True)
    for i in range(3):
        for j in range(3):
            axes[1].text(j, i, str(int(kernel[i, j])), 
                        ha='center', va='center', color='black', fontsize=11, weight='bold')
    plt.colorbar(im2, ax=axes[1])
    
    # Output
    im3 = axes[2].imshow(output, cmap='viridis', interpolation='nearest')
    axes[2].set_title('Output Feature Map (3×3)', fontsize=13, weight='bold')
    axes[2].grid(True, which='both', color='white', linewidth=0.5, alpha=0.3)
    axes[2].set_xticks(np.arange(-0.5, 3, 1), minor=True)
    axes[2].set_yticks(np.arange(-0.5, 3, 1), minor=True)
    for i in range(output_size):
        for j in range(output_size):
            axes[2].text(j, i, f'{output[i, j]:.0f}', 
                        ha='center', va='center', color='white', fontsize=11, weight='bold')
    plt.colorbar(im3, ax=axes[2])
    
    plt.tight_layout()
    plt.show()
    
    print('\n📊 Convolution Operation:')
    print('Input (5×5) * Kernel (3×3) = Output (3×3)')
    print('\nOutput size formula: (input_size - kernel_size + 1)')
    print('In this case: (5 - 3 + 1) = 3')

visualize_convolution()

## 2. Understanding Convolution Operation <a id="convolution-op"></a>

### Mathematical Definition

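For a 2D input $I$ and kernel $K$, the operation that CNN layers compute at position $(i, j)$ is:

$$S(i, j) = (I * K)(i, j) = \sum_{m} \sum_{n} I(i+m, j+n) \cdot K(m, n)$$

Strictly speaking this is **cross-correlation**: a true convolution flips the kernel and uses $I(i-m, j-n)$. Deep learning libraries such as PyTorch implement the unflipped version and still call it convolution, which is harmless because the kernel weights are learned.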
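
The difference between the unflipped form (cross-correlation) and the flipped form (true convolution) can be checked numerically. In this sketch the helper `cross_correlate2d` is a hypothetical name, and the random 5×5 image and 3×3 kernel are illustrative inputs.

```python
import numpy as np
import torch
import torch.nn.functional as F


def cross_correlate2d(image, kernel):
    """Valid-mode S(i, j) = sum_m sum_n I(i+m, j+n) * K(m, n)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


rng = np.random.default_rng(0)
image = rng.normal(size=(5, 5))
kernel = rng.normal(size=(3, 3))  # asymmetric, so flipping matters

unflipped = cross_correlate2d(image, kernel)
flipped = cross_correlate2d(image, kernel[::-1, ::-1])  # true convolution

x = torch.from_numpy(image).reshape(1, 1, 5, 5)
w = torch.from_numpy(kernel).reshape(1, 1, 3, 3)
torch_out = F.conv2d(x, w).squeeze().numpy()

print('F.conv2d matches cross-correlation:', np.allclose(unflipped, torch_out))
print('F.conv2d matches true convolution: ', np.allclose(flipped, torch_out))
```

For symmetric kernels (like the edge detector used earlier) both forms give the same result, which is why the distinction is easy to overlook.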

### Key Parameters

1. **Kernel Size**: Size of the filter (e.g., 3×3, 5×5)
2. **Stride**: Step size when sliding kernel (default: 1)
3. **Padding**: Add zeros around input to control output size
4. **Dilation**: Spacing between kernel elements

### Output Size Calculation

$$O = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1$$

where $\lfloor \cdot \rfloor$ rounds down to the nearest integer and:
- $O$ = output size
- $W$ = input size
- $K$ = kernel size
- $P$ = padding
- $S$ = stride
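
The formula is easy to turn into a small helper. This sketch assumes the division is rounded down when the stride does not divide evenly, which is how PyTorch's `nn.Conv2d` behaves. The function name `conv_output_size` and the example sizes are illustrative.

```python
def conv_output_size(w: int, k: int, p: int = 0, s: int = 1) -> int:
    """Output size along one dimension: floor((W - K + 2P) / S) + 1."""
    return (w - k + 2 * p) // s + 1


# (input size, kernel size, padding, stride)
examples = [
    (7, 3, 0, 1),   # no padding, stride 1
    (7, 3, 0, 2),   # stride 2 downsamples
    (7, 3, 1, 1),   # padding 1 keeps the size for a 3x3 kernel
    (8, 3, 0, 2),   # (8 - 3) / 2 = 2.5 is rounded down to 2
]
for w, k, p, s in examples:
    print(f'W={w}, K={k}, P={p}, S={s} -> O={conv_output_size(w, k, p, s)}')
```

The last example shows why the rounding matters: without it the formula would give a non-integer size.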

### Convolution as a Linear Operator

From a mathematical perspective, convolution is a **linear operator** - it takes an input and produces an output through a linear transformation, just like matrix multiplication.

#### Connection to Structured Matrix Multiplication

Interestingly, the convolution operation can be viewed as a form of **constrained matrix multiplication** where:
1. **Weight sharing:** The same filter weights are applied at every position
2. **Sparse connectivity:** Each output depends only on a local region of the input

In standard matrix multiplication: $Y = WX$ (where $W$ can have any values)

In convolution: We impose structure on $W$:
- Many entries in $W$ are zero (sparse, local connectivity)
- Many entries in $W$ are tied (weight sharing)

**This constraint is a form of regularization**, similar to how in classical statistics we might use:
- Ridge regression ($L_2$ penalty) to prevent overfitting
- Lasso regression ($L_1$ penalty) to induce sparsity

CNNs instead use **architectural constraints** (the structure of $W$) to prevent overfitting.
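
A 1D example makes the structure of $W$ visible. This is a minimal sketch with made-up numbers: a length-5 input and a length-3 kernel, using the unflipped (cross-correlation) form that CNN layers compute.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # input signal (illustrative)
k = np.array([1.0, 2.0, -1.0])            # kernel with 3 free parameters

# Build the structured weight matrix: each row is the kernel, shifted by one
n_out = len(x) - len(k) + 1
W = np.zeros((n_out, len(x)))
for row in range(n_out):
    W[row, row:row + len(k)] = k

print(W)

y_matrix = W @ x                                # ordinary matrix multiplication
y_sliding = np.correlate(x, k, mode='valid')    # sliding-window computation

print('Matrix form: ', y_matrix)
print('Sliding form:', y_sliding)
print('Entries in W:', W.size, '| free parameters:', len(k))
```

The matrix has 15 entries, but the zeros encode local connectivity and the repeated values encode weight sharing, so only 3 numbers are actually learned.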

#### Relationship to Basis Decomposition

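A convolution layer with $k$ filters can be thought of as decomposing the input into $k$ different "views" or "basis representations", one feature map per filter:

$$\text{Output}_i = \text{Input} * \text{Filter}_i, \quad i = 1, \dots, k$$

The $k$ feature maps are stacked along the channel dimension rather than added together (the summation inside each convolution runs over the *input* channels). This is analogous to **basis decomposition** in classical analysis (Fourier series, wavelet transforms), but the bases (filters) are learned from data rather than predefined.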

**Connection to dictionary learning:** In sparse coding [@hastie2009elements], we learn a dictionary of basis functions. CNNs do something similar but enforce additional structure (locality, hierarchy).
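
In PyTorch, a layer with $k$ filters produces $k$ output channels, one per filter. The sketch below checks this for an assumed layer with 8 filters on a random 3-channel input (both values are illustrative).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

k = 8  # number of filters (illustrative choice)
conv = nn.Conv2d(in_channels=3, out_channels=k, kernel_size=3, padding=1)
x = torch.randn(1, 3, 32, 32)

with torch.no_grad():
    out = conv(x)
    # Recompute feature map 0 using only filter 0 and its bias
    single = F.conv2d(x, conv.weight[0:1], conv.bias[0:1], padding=1)

print('Weight shape:', tuple(conv.weight.shape))   # (k, in_channels, 3, 3)
print('Output shape:', tuple(out.shape))           # (1, k, 32, 32)
print('Channel 0 depends on filter 0 alone:',
      torch.allclose(out[:, 0:1], single, atol=1e-5))
```

Each filter spans all input channels, and each output channel is the response of exactly one filter.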



In [None]:
def demonstrate_conv_parameters():
    """Demonstrate effect of different convolution parameters."""
    
    # Create a sample input
    x = torch.randn(1, 1, 7, 7)  # batch=1, channels=1, height=7, width=7
    
    print('Input shape:', x.shape)
    print('Format: (batch_size, channels, height, width)\n')
    
    # Different configurations
    configs = [
        {'kernel_size': 3, 'stride': 1, 'padding': 0, 'name': 'Default'},
        {'kernel_size': 3, 'stride': 2, 'padding': 0, 'name': 'Stride=2'},
        {'kernel_size': 3, 'stride': 1, 'padding': 1, 'name': 'Padding=1 (same)'},
        {'kernel_size': 5, 'stride': 1, 'padding': 0, 'name': 'Kernel=5×5'},
    ]
    
    results = []
    for config in configs:
        conv = nn.Conv2d(in_channels=1, out_channels=1, 
                        kernel_size=config['kernel_size'],
                        stride=config['stride'],
                        padding=config['padding'])
        output = conv(x)
        
        # Calculate output size using formula
        calc_size = (7 - config['kernel_size'] + 2*config['padding']) // config['stride'] + 1
        
        result = f"{config['name']:20s} | Output: {output.shape[2]}×{output.shape[3]} (calculated: {calc_size}×{calc_size})"
        results.append(result)
        print(result)
    
    print('\n💡 Key Insight:')
    print('  - Stride > 1: Reduces spatial dimensions (downsampling)')
    print('  - Padding = (kernel_size-1)/2 with stride 1 and an odd kernel: Maintains input size ("same" padding)')
    print('  - Larger kernels: See more context but need more parameters')

demonstrate_conv_parameters()

## 3. CNN Components <a id="cnn-components"></a>

A typical CNN consists of:

### 1. Convolutional Layers
- Learn spatial hierarchies of features
- Share weights across spatial locations
- Each filter detects a specific pattern

### 2. Activation Functions
- Usually ReLU: $\text{ReLU}(x) = \max(0, x)$
- Introduces non-linearity

### 3. Pooling Layers
- Reduce spatial dimensions
- Provide some invariance to small translations
- Types: Max pooling, Average pooling

### 4. Fully Connected Layers
- Final classification
- Combine all features

### 5. Dropout (Regularization)
- Randomly drop neurons during training
- Prevents overfitting

### Statistical Interpretation of CNN Components

#### Pooling as Downsampling with Robustness

**Max pooling** and **average pooling** serve similar purposes to classical downsampling but with built-in robustness:

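**Average pooling:**
$$\text{Pool}(x) = \frac{1}{k^2} \sum_{(i,j) \in \text{region}} x_{i,j}$$
- Analogous to a **moving average** in time series analysis
- Reduces variance (smoothing effect)
- Dilutes a single extreme activation across the window, so it is less sensitive to outliers than max pooling

**Max pooling:**
$$\text{Pool}(x) = \max_{(i,j) \in \text{region}} x_{i,j}$$
- Provides local **translation invariance** (a small shift that keeps the strongest activation inside the same window doesn't change the max)
- Keeps the strongest response in each window, which preserves sharp features but makes the output depend on a single value
- Statistically, the maximum is an extreme order statistic, not a robust estimator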

**Statistical insight:** Pooling trades spatial resolution for robustness, similar to how in classical statistics we might use:
- Binning continuous variables (reduces variance but loses information)
- Robust estimators (median instead of mean)
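
The two pooling types can be compared on a tiny example. This is a sketch with made-up 4×4 inputs: one clean, one with a single extreme value, plus three single-peak inputs to probe shift behavior.

```python
import torch
import torch.nn.functional as F

clean = torch.tensor([[1., 1., 2., 2.],
                      [1., 1., 2., 2.],
                      [3., 3., 4., 4.],
                      [3., 3., 4., 4.]]).reshape(1, 1, 4, 4)
noisy = clean.clone()
noisy[0, 0, 0, 0] = 100.0  # one extreme activation in the top-left window

for name, x in [('clean', clean), ('noisy', noisy)]:
    avg = F.avg_pool2d(x, kernel_size=2)
    mx = F.max_pool2d(x, kernel_size=2)
    print(f'{name}: avg={avg.flatten().tolist()}  max={mx.flatten().tolist()}')

# Local translation invariance of max pooling: move a single peak around
a = torch.zeros(1, 1, 4, 4)
a[0, 0, 0, 0] = 1.0
b = torch.zeros(1, 1, 4, 4)
b[0, 0, 1, 1] = 1.0   # peak moved, but still in the same 2x2 window
c = torch.zeros(1, 1, 4, 4)
c[0, 0, 2, 2] = 1.0   # peak moved into a different window

same_window = torch.equal(F.max_pool2d(a, 2), F.max_pool2d(b, 2))
other_window = torch.equal(F.max_pool2d(a, 2), F.max_pool2d(c, 2))
print('Unchanged by shift within a window:', same_window)
print('Unchanged by shift across windows: ', other_window)
```

The outlier is passed through unchanged by max pooling and only diluted by average pooling. The invariance of max pooling is local: it holds for shifts inside a window, not across windows.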

#### Dropout as Ensemble Learning

**Dropout** [@srivastava2014dropout] randomly drops neurons during training:

From an ensemble learning perspective, dropout implicitly trains an exponentially large family of "thinned" networks that share weights. At test time, using the full network approximates averaging their predictions. This is similar to:

- **Bagging** in random forests: Train multiple models on subsamples of data
- **Dropout:** Train multiple sub-networks on subsamples of the architecture

**Mathematical connection to Bayesian inference:** Dropout can be interpreted as approximate Bayesian inference, where we're marginalizing over different thinned sub-networks [@bishop2006pattern]. Keeping dropout active at prediction time and averaging several stochastic forward passes gives a rough estimate of predictive uncertainty.
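
The sketch below shows what `nn.Dropout` does in training versus evaluation mode, and then averages many stochastic forward passes of a toy model. The model architecture, the dropout rate of 0.5, and the 100 passes are all illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 10)

drop.train()
y_train = drop(x)   # each entry is either 0 or 1 / (1 - p) = 2
drop.eval()
y_eval = drop(x)    # dropout is a no-op in evaluation mode
print('train:', y_train.tolist())
print('eval: ', y_eval.tolist())

# Averaging over random sub-networks (hypothetical toy model)
model = nn.Sequential(
    nn.Linear(4, 16), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(16, 1)
)
model.train()  # keep dropout active so each pass samples a new sub-network
inputs = torch.randn(1, 4)
with torch.no_grad():
    samples = torch.stack([model(inputs) for _ in range(100)])

print(f'mean={samples.mean().item():.3f}  std={samples.std().item():.3f}')
```

The spread of `samples` reflects how much the prediction varies across sub-networks. Treat it as a rough indicator, not a calibrated uncertainty.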

#### Batch Normalization and Statistical Standardization

**Batch normalization** [@ioffe2015batch] is a direct application of statistical standardization:

$$\hat{x} = \frac{x - \mu_{\text{batch}}}{\sqrt{\sigma^2_{\text{batch}} + \epsilon}}$$

This is **z-score normalization** from statistics, applied per channel using the mean and variance of the current mini-batch!

**Why it helps:**
- Originally motivated by reducing "internal covariate shift" (the distribution of layer inputs changes during training)
- Similar to **feature scaling** in classical ML (normalizing inputs before regression/SVM)
- Makes optimization easier by keeping activations in a reasonable range

The learnable parameters ($\gamma, \beta$) allow the network to undo this normalization if needed:
$$y = \gamma \hat{x} + \beta$$

This is elegant: we provide a good starting point (normalized) but let the network learn the optimal scale and shift.
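
To see the connection to z-scores, the normalization can be computed by hand and compared with `nn.BatchNorm2d` in training mode. This sketch uses a random batch with illustrative dimensions and relies on the layer's default initialization ($\gamma = 1$, $\beta = 0$).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.randn(8, 3, 5, 5)   # (batch, channels, height, width)

bn = nn.BatchNorm2d(num_features=3)  # gamma starts at 1, beta at 0
bn.train()                           # use batch statistics, not running averages
with torch.no_grad():
    out = bn(x)

# Manual z-score per channel: statistics over batch, height and width
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)  # biased variance
manual = (x - mean) / torch.sqrt(var + bn.eps)

print('Matches nn.BatchNorm2d:', torch.allclose(out, manual, atol=1e-5))
print('Per-channel mean after BN:', out.mean(dim=(0, 2, 3)).tolist())
```

Each channel ends up with mean close to 0 and variance close to 1 before $\gamma$ and $\beta$ rescale it.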



In [None]:
def visualize_pooling():
    """Visualize max pooling operation."""
    
    # Create input
    input_data = np.array([
        [1, 3, 2, 4],
        [5, 6, 7, 8],
        [9, 2, 3, 1],
        [0, 4, 5, 2]
    ])
    
    # Max pooling 2x2
    output = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            region = input_data[i*2:(i+1)*2, j*2:(j+1)*2]
            output[i, j] = np.max(region)
    
    # Visualize
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    
    # Input
    im1 = axes[0].imshow(input_data, cmap='YlOrRd', interpolation='nearest')
    axes[0].set_title('Input (4×4)', fontsize=13, weight='bold')
    axes[0].grid(True, which='both', color='black', linewidth=2)
    axes[0].set_xticks(np.arange(-0.5, 4, 1), minor=True)
    axes[0].set_yticks(np.arange(-0.5, 4, 1), minor=True)
    
    # Add pooling windows
    for i in range(2):
        for j in range(2):
            rect = plt.Rectangle((j*2-0.5, i*2-0.5), 2, 2, 
                                fill=False, edgecolor='blue', linewidth=3)
            axes[0].add_patch(rect)
    
    for i in range(4):
        for j in range(4):
            axes[0].text(j, i, str(int(input_data[i, j])), 
                        ha='center', va='center', color='black', fontsize=12, weight='bold')
    plt.colorbar(im1, ax=axes[0])
    
    # Output
    im2 = axes[1].imshow(output, cmap='YlOrRd', interpolation='nearest')
    axes[1].set_title('Max Pooled Output (2×2)', fontsize=13, weight='bold')
    axes[1].grid(True, which='both', color='black', linewidth=2)
    axes[1].set_xticks(np.arange(-0.5, 2, 1), minor=True)
    axes[1].set_yticks(np.arange(-0.5, 2, 1), minor=True)
    for i in range(2):
        for j in range(2):
            axes[1].text(j, i, str(int(output[i, j])), 
                        ha='center', va='center', color='black', fontsize=14, weight='bold')
    plt.colorbar(im2, ax=axes[1])
    
    plt.tight_layout()
    plt.show()
    
    print('\n📊 Max Pooling (2×2, stride=2):')
    print('Takes maximum value from each 2×2 region')
    print('Halves each spatial dimension')
    print('Provides some invariance to small translations')

visualize_pooling()