# CNN End-to-End Example: Theoretical Walkthrough

This notebook provides a comprehensive theoretical analysis of a Convolutional Neural Network, examining both the forward and backward passes with detailed calculations of:
- Output dimensions
- Memory requirements (32-bit floats)
- Parameter counts
- Computational cost (FLOPs)

## Table of Contents

1. [Architecture Overview](#architecture-overview)
2. [Notation and Conventions](#notation-and-conventions)
3. [Forward Pass](#forward-pass)
   - [Input Layer](#input-layer)
   - [Block 1: Conv1 → BatchNorm1 → ReLU1 → MaxPool1](#block-1)
   - [Block 2: Conv2 → BatchNorm2 → ReLU2 → MaxPool2](#block-2)
   - [Block 3: Conv3 → BatchNorm3 → ReLU3 → MaxPool3](#block-3)
   - [Global Average Pooling](#global-average-pooling)
   - [Fully Connected Output Layer](#fully-connected-output-layer)
4. [Backward Pass](#backward-pass)
   - [Fully Connected Layer Gradients](#fc-gradients)
   - [Global Average Pooling Gradients](#gap-gradients)
   - [Block 3 Gradients](#block-3-gradients)
   - [Block 2 Gradients](#block-2-gradients)
   - [Block 1 Gradients](#block-1-gradients)
5. [Summary Tables](#summary-tables)
   - [Forward Pass Summary](#forward-summary)
   - [Backward Pass Summary](#backward-summary)
   - [Total Network Statistics](#total-statistics)

## Architecture Overview


<br>
<div align="center">
<img src="../images/chap8/E2E1.png" width="710"/>
<img src="../images/chap8/E2E2.png" width="600"/>
</div>


We'll analyze a CNN designed for image classification with the following architecture:

```
Input (32×32×3 RGB images)
    ↓
Block 1: Conv(3→32, k=3, s=1, p=1) → BatchNorm → ReLU → MaxPool(k=2, s=2)
    ↓
Block 2: Conv(32→64, k=3, s=1, p=1) → BatchNorm → ReLU → MaxPool(k=2, s=2)
    ↓
Block 3: Conv(64→128, k=3, s=1, p=1) → BatchNorm → ReLU → MaxPool(k=2, s=2)
    ↓
Global Average Pooling
    ↓
Fully Connected (128 → 10 classes)
```

**Task**: Classify images into 10 categories

**Batch Size**: $\textcolor{magenta}{N = 32}$ for all calculations

## Notation and Conventions

Throughout this notebook, we use the following color-coded notation:

| Symbol | Meaning | Color |
|--------|---------|-------|
| $\textcolor{magenta}{N}$ | Batch size | Magenta |
| $\textcolor{blue}{C_{in}}$ | Input channels | Blue |
| $\textcolor{orange}{H_{in}, W_{in}}$ | Input height and width | Orange |
| $\textcolor{green}{C_{out}}$ | Output channels | Green |
| $\textcolor{red}{H_{out}, W_{out}}$ | Output height and width | Red |
| $\textcolor{purple}{k, k_h, k_w}$ | Kernel size | Purple |
| $s$ | Stride | Black |
| $p$ | Padding | Black |

### Key Formulas

**Output Spatial Dimensions**:
$$H_{out} = \left\lfloor \frac{H_{in} + 2p - k}{s} \right\rfloor + 1$$

$$W_{out} = \left\lfloor \frac{W_{in} + 2p - k}{s} \right\rfloor + 1$$

**Memory (32-bit floats)**: Number of elements × 4 bytes

**Convolution FLOPs**: $N \times H_{out} \times W_{out} \times C_{out} \times (C_{in} \times k_h \times k_w \times 2)$
- The factor of 2 counts each multiply-accumulate as two operations (one multiplication and one addition)
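
The formulas above translate directly into a few helper functions. The following is a minimal sketch in plain Python; the function names (`conv_out_size`, `tensor_bytes`, `conv_flops`) are illustrative choices and not part of any library.

```python
def conv_out_size(size_in, k, s=1, p=0):
    """Output height or width of a convolution or pooling layer."""
    return (size_in + 2 * p - k) // s + 1


def tensor_bytes(*shape, bytes_per_element=4):
    """Memory of a dense tensor, assuming 32-bit floats by default."""
    n = 1
    for d in shape:
        n *= d
    return n * bytes_per_element


def conv_flops(n, c_in, h_out, w_out, c_out, k_h, k_w):
    """Convolution FLOPs, counting each multiply-accumulate as 2 FLOPs."""
    return n * h_out * w_out * c_out * (c_in * k_h * k_w * 2)


# Conv1 from the architecture above: 3 -> 32 channels, k=3, s=1, p=1, N=32
h_out = conv_out_size(32, k=3, s=1, p=1)
print(h_out)                                      # 32
print(tensor_bytes(32, 32, h_out, h_out))         # 4194304
print(conv_flops(32, 3, h_out, h_out, 32, 3, 3))  # 56623104
```

The same `conv_out_size` helper covers pooling layers, for example `conv_out_size(32, k=2, s=2)` for a 2×2 max pool with stride 2.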

---
# Forward Pass
---

## Input Layer

### Given
- **Input Shape**: $(\textcolor{magenta}{N}, \textcolor{blue}{C_{in}}, \textcolor{orange}{H_{in}}, \textcolor{orange}{W_{in}}) = (32, 3, 32, 32)$
- **Description**: RGB images of size 32×32 pixels

### Output
- **Output Shape**: $(\textcolor{magenta}{32}, \textcolor{blue}{3}, \textcolor{orange}{32}, \textcolor{orange}{32})$
- **Memory**: $32 \times 3 \times 32 \times 32 = 98,304$ elements = **393,216 bytes** (384 KB)
- **Parameters**: 0
- **FLOPs**: 0

### Analysis
The input layer simply receives the data. No transformations occur here, so there are no parameters to learn or computations to perform.

## Block 1: Conv1 → BatchNorm1 → ReLU1 → MaxPool1

### Conv1: Convolutional Layer

#### Given
- **Input Shape**: $(\textcolor{magenta}{32}, \textcolor{blue}{3}, \textcolor{orange}{32}, \textcolor{orange}{32})$
- **Filters**: $\textcolor{green}{C_{out}} = 32$
- **Kernel Size**: $\textcolor{purple}{k_h = k_w = 3}$
- **Stride**: $s = 1$
- **Padding**: $p = 1$

#### Output Dimensions
$$\textcolor{red}{H_{out}} = \left\lfloor \frac{32 + 2(1) - 3}{1} \right\rfloor + 1 = \left\lfloor \frac{31}{1} \right\rfloor + 1 = 32$$

$$\textcolor{red}{W_{out}} = \left\lfloor \frac{32 + 2(1) - 3}{1} \right\rfloor + 1 = 32$$

- **Output Shape**: $(\textcolor{magenta}{32}, \textcolor{green}{32}, \textcolor{red}{32}, \textcolor{red}{32})$

#### Memory (Forward)
- **Activations**: $32 \times 32 \times 32 \times 32 = 1,048,576$ elements = **4,194,304 bytes** (4 MB)

#### Parameters
- **Weights**: $\textcolor{green}{C_{out}} \times \textcolor{blue}{C_{in}} \times \textcolor{purple}{k_h} \times \textcolor{purple}{k_w} = 32 \times 3 \times 3 \times 3 = 864$
- **Bias**: $\textcolor{green}{C_{out}} = 32$
- **Total**: $864 + 32 = 896$ parameters = **3,584 bytes**

#### FLOPs
For each output element, we perform:
- $C_{in} \times k_h \times k_w = 3 \times 3 \times 3 = 27$ multiplications
- $26$ additions to accumulate the 27 products
- $1$ bias addition

Total FLOPs per output element: $27 + 26 + 1 = 54 = 2 \times 27$, i.e. 27 multiply-accumulates at 2 FLOPs each

$$\text{FLOPs} = N \times H_{out} \times W_{out} \times C_{out} \times (2 \times C_{in} \times k_h \times k_w)$$
$$= 32 \times 32 \times 32 \times 32 \times 54 = 56,623,104 \approx \textbf{56.6 MFLOPs}$$

---

### BatchNorm1: Batch Normalization

#### Given
- **Input Shape**: $(\textcolor{magenta}{32}, \textcolor{green}{32}, \textcolor{red}{32}, \textcolor{red}{32})$
- **Number of channels**: $\textcolor{green}{C = 32}$

#### Output
- **Output Shape**: $(\textcolor{magenta}{32}, \textcolor{green}{32}, \textcolor{red}{32}, \textcolor{red}{32})$ (unchanged)
- **Memory**: $32 \times 32 \times 32 \times 32 = 1,048,576$ elements = **4,194,304 bytes** (4 MB)

#### Parameters
- **Scale ($\gamma$)**: $C = 32$
- **Shift ($\beta$)**: $C = 32$
- **Running mean** (not learned, updated during training): $C = 32$
- **Running variance** (not learned, updated during training): $C = 32$
- **Learnable Total**: $32 + 32 = 64$ parameters = **256 bytes**

#### FLOPs
For each channel, we compute:
1. Mean: $N \times H \times W$ additions
2. Variance: $N \times H \times W$ subtractions and squaring
3. Normalization: $N \times H \times W$ operations (subtract mean, divide by std)
4. Scale and shift: $N \times H \times W \times 2$ operations

Approximate FLOPs per channel: $5 \times N \times H \times W = 5 \times 32 \times 32 \times 32 = 163,840$

$$\text{FLOPs} = C \times 163,840 = 32 \times 163,840 = 5,242,880 \approx \textbf{5.2 MFLOPs}$$
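
The four steps above can be sketched for a single channel in plain Python. This is an illustrative sketch, not a framework implementation: it uses training-mode batch statistics with the biased variance, and the names `batchnorm_channel` and `batchnorm_flops` are assumptions made for this example.

```python
import math


def batchnorm_channel(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize all N*H*W values of one channel using batch statistics."""
    m = len(values)
    mean = sum(values) / m                              # step 1: mean
    var = sum((v - mean) ** 2 for v in values) / m      # step 2: biased variance
    inv_std = 1.0 / math.sqrt(var + eps)
    x_hat = [(v - mean) * inv_std for v in values]      # step 3: normalize
    return [gamma * xh + beta for xh in x_hat]          # step 4: scale and shift


def batchnorm_flops(n, c, h, w, ops_per_element=5):
    """Approximate FLOPs using the 5-operations-per-element estimate."""
    return c * ops_per_element * n * h * w


out = batchnorm_channel([1.0, 2.0, 3.0, 4.0])
print(batchnorm_flops(32, 32, 32, 32))  # 5242880
```

With `gamma=1` and `beta=0`, the output values have a mean of zero and a variance very close to one.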

---

### ReLU1: Activation Function

#### Given
- **Input Shape**: $(\textcolor{magenta}{32}, \textcolor{green}{32}, \textcolor{red}{32}, \textcolor{red}{32})$
- **Function**: $\text{ReLU}(x) = \max(0, x)$

#### Output
- **Output Shape**: $(\textcolor{magenta}{32}, \textcolor{green}{32}, \textcolor{red}{32}, \textcolor{red}{32})$ (unchanged)
- **Memory**: $32 \times 32 \times 32 \times 32 = 1,048,576$ elements = **4,194,304 bytes** (4 MB)

#### Parameters
- **Total**: 0 (ReLU has no learnable parameters)

#### FLOPs
One comparison per element:
$$\text{FLOPs} = N \times C \times H \times W = 32 \times 32 \times 32 \times 32 = 1,048,576 \approx \textbf{1.0 MFLOPs}$$

---

### MaxPool1: Max Pooling

#### Given
- **Input Shape**: $(\textcolor{magenta}{32}, \textcolor{green}{32}, \textcolor{red}{32}, \textcolor{red}{32})$
- **Kernel Size**: $\textcolor{purple}{k = 2}$
- **Stride**: $s = 2$
- **Padding**: $p = 0$

#### Output Dimensions
$$\textcolor{red}{H_{out}} = \left\lfloor \frac{32 + 2(0) - 2}{2} \right\rfloor + 1 = \left\lfloor \frac{30}{2} \right\rfloor + 1 = 16$$

$$\textcolor{red}{W_{out}} = \left\lfloor \frac{32 + 2(0) - 2}{2} \right\rfloor + 1 = 16$$

- **Output Shape**: $(\textcolor{magenta}{32}, \textcolor{green}{32}, \textcolor{red}{16}, \textcolor{red}{16})$

#### Memory
- **Activations**: $32 \times 32 \times 16 \times 16 = 262,144$ elements = **1,048,576 bytes** (1 MB)

#### Parameters
- **Total**: 0 (pooling has no learnable parameters)

#### FLOPs
For each output position, we compare $k \times k = 4$ values:
$$\text{FLOPs} = N \times C \times H_{out} \times W_{out} \times k^2 = 32 \times 32 \times 16 \times 16 \times 4 = 1,048,576 \approx \textbf{1.0 MFLOPs}$$
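
To make the pooling operation concrete, here is a small sketch of 2×2 max pooling on one 4×4 feature map. The helper name `maxpool2d` and the sample values are made up for illustration.

```python
def maxpool2d(x, k=2, s=2):
    """Max pooling over a single 2-D feature map given as a list of rows."""
    h_out = (len(x) - k) // s + 1
    w_out = (len(x[0]) - k) // s + 1
    return [
        [
            max(x[i * s + di][j * s + dj] for di in range(k) for dj in range(k))
            for j in range(w_out)
        ]
        for i in range(h_out)
    ]


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 1, 6],
]
pooled = maxpool2d(feature_map)
print(pooled)  # [[4, 5], [3, 7]]
```

Each output value looks at $k^2 = 4$ inputs, which is where the factor of 4 in the FLOP count comes from.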

---

### Block 1 Summary

| Layer | Output Shape | Memory | Parameters | FLOPs |
|-------|--------------|--------|------------|-------|
| Conv1 | (32, 32, 32, 32) | 4.0 MB | 896 | 56.6 M |
| BatchNorm1 | (32, 32, 32, 32) | 4.0 MB | 64 | 5.2 M |
| ReLU1 | (32, 32, 32, 32) | 4.0 MB | 0 | 1.0 M |
| MaxPool1 | (32, 32, 16, 16) | 1.0 MB | 0 | 1.0 M |
| **Block Total** | — | **13.0 MB** | **960** | **63.8 M** |

## Block 2: Conv2 → BatchNorm2 → ReLU2 → MaxPool2

### Conv2: Convolutional Layer

#### Given
- **Input Shape**: $(\textcolor{magenta}{32}, \textcolor{blue}{32}, \textcolor{orange}{16}, \textcolor{orange}{16})$
- **Filters**: $\textcolor{green}{C_{out}} = 64$
- **Kernel Size**: $\textcolor{purple}{k_h = k_w = 3}$
- **Stride**: $s = 1$
- **Padding**: $p = 1$

#### Output Dimensions
$$\textcolor{red}{H_{out}} = \left\lfloor \frac{16 + 2(1) - 3}{1} \right\rfloor + 1 = \left\lfloor \frac{15}{1} \right\rfloor + 1 = 16$$

$$\textcolor{red}{W_{out}} = 16$$

- **Output Shape**: $(\textcolor{magenta}{32}, \textcolor{green}{64}, \textcolor{red}{16}, \textcolor{red}{16})$

#### Memory
- **Activations**: $32 \times 64 \times 16 \times 16 = 524,288$ elements = **2,097,152 bytes** (2 MB)

#### Parameters
- **Weights**: $64 \times 32 \times 3 \times 3 = 18,432$
- **Bias**: $64$
- **Total**: $18,432 + 64 = 18,496$ parameters = **73,984 bytes**

#### FLOPs
$$\text{FLOPs} = 32 \times 16 \times 16 \times 64 \times (2 \times 32 \times 3 \times 3)$$
$$= 32 \times 16 \times 16 \times 64 \times 576 = 301,989,888 \approx \textbf{302.0 MFLOPs}$$

---

### BatchNorm2: Batch Normalization

#### Given
- **Input Shape**: $(\textcolor{magenta}{32}, \textcolor{green}{64}, \textcolor{red}{16}, \textcolor{red}{16})$
- **Number of channels**: $\textcolor{green}{C = 64}$

#### Output
- **Output Shape**: $(\textcolor{magenta}{32}, \textcolor{green}{64}, \textcolor{red}{16}, \textcolor{red}{16})$
- **Memory**: $32 \times 64 \times 16 \times 16 = 524,288$ elements = **2,097,152 bytes** (2 MB)

#### Parameters
- **Learnable**: $2 \times C = 2 \times 64 = 128$ parameters = **512 bytes**

#### FLOPs
$$\text{FLOPs} = 64 \times (5 \times 32 \times 16 \times 16) = 64 \times 40,960 = 2,621,440 \approx \textbf{2.6 MFLOPs}$$

---

### ReLU2: Activation Function

#### Given
- **Input Shape**: $(\textcolor{magenta}{32}, \textcolor{green}{64}, \textcolor{red}{16}, \textcolor{red}{16})$

#### Output
- **Output Shape**: $(\textcolor{magenta}{32}, \textcolor{green}{64}, \textcolor{red}{16}, \textcolor{red}{16})$
- **Memory**: **2,097,152 bytes** (2 MB)

#### Parameters
- **Total**: 0

#### FLOPs
$$\text{FLOPs} = 32 \times 64 \times 16 \times 16 = 524,288 \approx \textbf{0.5 MFLOPs}$$

---

### MaxPool2: Max Pooling

#### Given
- **Input Shape**: $(\textcolor{magenta}{32}, \textcolor{green}{64}, \textcolor{red}{16}, \textcolor{red}{16})$
- **Kernel Size**: $\textcolor{purple}{k = 2}$
- **Stride**: $s = 2$

#### Output Dimensions
$$\textcolor{red}{H_{out}} = \left\lfloor \frac{16 - 2}{2} \right\rfloor + 1 = 8$$
$$\textcolor{red}{W_{out}} = 8$$

- **Output Shape**: $(\textcolor{magenta}{32}, \textcolor{green}{64}, \textcolor{red}{8}, \textcolor{red}{8})$

#### Memory
- **Activations**: $32 \times 64 \times 8 \times 8 = 131,072$ elements = **524,288 bytes** (0.5 MB)

#### Parameters
- **Total**: 0

#### FLOPs
$$\text{FLOPs} = 32 \times 64 \times 8 \times 8 \times 4 = 524,288 \approx \textbf{0.5 MFLOPs}$$

---

### Block 2 Summary

| Layer | Output Shape | Memory | Parameters | FLOPs |
|-------|--------------|--------|------------|-------|
| Conv2 | (32, 64, 16, 16) | 2.0 MB | 18,496 | 302.0 M |
| BatchNorm2 | (32, 64, 16, 16) | 2.0 MB | 128 | 2.6 M |
| ReLU2 | (32, 64, 16, 16) | 2.0 MB | 0 | 0.5 M |
| MaxPool2 | (32, 64, 8, 8) | 0.5 MB | 0 | 0.5 M |
| **Block Total** | — | **6.5 MB** | **18,624** | **305.6 M** |

## Block 3: Conv3 → BatchNorm3 → ReLU3 → MaxPool3

### Conv3: Convolutional Layer

#### Given
- **Input Shape**: $(\textcolor{magenta}{32}, \textcolor{blue}{64}, \textcolor{orange}{8}, \textcolor{orange}{8})$
- **Filters**: $\textcolor{green}{C_{out}} = 128$
- **Kernel Size**: $\textcolor{purple}{k_h = k_w = 3}$
- **Stride**: $s = 1$
- **Padding**: $p = 1$

#### Output Dimensions
$$\textcolor{red}{H_{out}} = \left\lfloor \frac{8 + 2(1) - 3}{1} \right\rfloor + 1 = 8$$
$$\textcolor{red}{W_{out}} = 8$$

- **Output Shape**: $(\textcolor{magenta}{32}, \textcolor{green}{128}, \textcolor{red}{8}, \textcolor{red}{8})$

#### Memory
- **Activations**: $32 \times 128 \times 8 \times 8 = 262,144$ elements = **1,048,576 bytes** (1 MB)

#### Parameters
- **Weights**: $128 \times 64 \times 3 \times 3 = 73,728$
- **Bias**: $128$
- **Total**: $73,728 + 128 = 73,856$ parameters = **295,424 bytes**

#### FLOPs
$$\text{FLOPs} = 32 \times 8 \times 8 \times 128 \times (2 \times 64 \times 3 \times 3)$$
$$= 32 \times 8 \times 8 \times 128 \times 1,152 = 301,989,888 \approx \textbf{302.0 MFLOPs}$$

---

### BatchNorm3: Batch Normalization

#### Given
- **Input Shape**: $(\textcolor{magenta}{32}, \textcolor{green}{128}, \textcolor{red}{8}, \textcolor{red}{8})$
- **Number of channels**: $\textcolor{green}{C = 128}$

#### Output
- **Output Shape**: $(\textcolor{magenta}{32}, \textcolor{green}{128}, \textcolor{red}{8}, \textcolor{red}{8})$
- **Memory**: **1,048,576 bytes** (1 MB)

#### Parameters
- **Learnable**: $2 \times 128 = 256$ parameters = **1,024 bytes**

#### FLOPs
$$\text{FLOPs} = 128 \times (5 \times 32 \times 8 \times 8) = 128 \times 10,240 = 1,310,720 \approx \textbf{1.3 MFLOPs}$$

---

### ReLU3: Activation Function

#### Given
- **Input Shape**: $(\textcolor{magenta}{32}, \textcolor{green}{128}, \textcolor{red}{8}, \textcolor{red}{8})$

#### Output
- **Output Shape**: $(\textcolor{magenta}{32}, \textcolor{green}{128}, \textcolor{red}{8}, \textcolor{red}{8})$
- **Memory**: **1,048,576 bytes** (1 MB)

#### Parameters
- **Total**: 0

#### FLOPs
$$\text{FLOPs} = 32 \times 128 \times 8 \times 8 = 262,144 \approx \textbf{0.3 MFLOPs}$$

---

### MaxPool3: Max Pooling

#### Given
- **Input Shape**: $(\textcolor{magenta}{32}, \textcolor{green}{128}, \textcolor{red}{8}, \textcolor{red}{8})$
- **Kernel Size**: $\textcolor{purple}{k = 2}$
- **Stride**: $s = 2$

#### Output Dimensions
$$\textcolor{red}{H_{out}} = \left\lfloor \frac{8 - 2}{2} \right\rfloor + 1 = 4$$
$$\textcolor{red}{W_{out}} = 4$$

- **Output Shape**: $(\textcolor{magenta}{32}, \textcolor{green}{128}, \textcolor{red}{4}, \textcolor{red}{4})$

#### Memory
- **Activations**: $32 \times 128 \times 4 \times 4 = 65,536$ elements = **262,144 bytes** (0.25 MB)

#### Parameters
- **Total**: 0

#### FLOPs
$$\text{FLOPs} = 32 \times 128 \times 4 \times 4 \times 4 = 262,144 \approx \textbf{0.3 MFLOPs}$$

---

### Block 3 Summary

| Layer | Output Shape | Memory | Parameters | FLOPs |
|-------|--------------|--------|------------|-------|
| Conv3 | (32, 128, 8, 8) | 1.0 MB | 73,856 | 302.0 M |
| BatchNorm3 | (32, 128, 8, 8) | 1.0 MB | 256 | 1.3 M |
| ReLU3 | (32, 128, 8, 8) | 1.0 MB | 0 | 0.3 M |
| MaxPool3 | (32, 128, 4, 4) | 0.25 MB | 0 | 0.3 M |
| **Block Total** | — | **3.25 MB** | **74,112** | **303.9 M** |

## Global Average Pooling

### Given
- **Input Shape**: $(\textcolor{magenta}{32}, \textcolor{green}{128}, \textcolor{red}{4}, \textcolor{red}{4})$
- **Operation**: Average each channel across spatial dimensions

### Output
For each channel, compute:
$$\text{GAP}_c = \frac{1}{H \times W} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{c,h,w}$$

$$\text{GAP}_c = \frac{1}{4 \times 4} \sum_{h=1}^{4} \sum_{w=1}^{4} x_{c,h,w} = \frac{1}{16} \sum_{h=1}^{4} \sum_{w=1}^{4} x_{c,h,w}$$

- **Output Shape**: $(\textcolor{magenta}{32}, \textcolor{green}{128})$

### Memory
- **Activations**: $32 \times 128 = 4,096$ elements = **16,384 bytes** (16 KB)

### Parameters
- **Total**: 0 (no learnable parameters)

### FLOPs
For each of the 128 channels of each of the 32 samples in the batch:
- Add $4 \times 4 = 16$ values: 15 additions
- Divide by 16: 1 division

$$\text{FLOPs} = N \times C \times (H \times W) = 32 \times 128 \times 16 = 65,536 \approx \textbf{0.07 MFLOPs}$$

### Analysis
Global Average Pooling (GAP) reduces spatial dimensions to 1×1 per channel, effectively converting spatial feature maps into a feature vector. This:
1. Eliminates the need for large fully connected layers
2. Adds no parameters of its own, which lowers the risk of overfitting
3. Makes the network more robust to spatial translations

## Fully Connected Output Layer

### Given
- **Input Shape**: $(\textcolor{magenta}{32}, \textcolor{blue}{128})$
- **Output Classes**: $\textcolor{green}{10}$
- **Operation**: Linear transformation $y = xW + b$

### Output
- **Output Shape**: $(\textcolor{magenta}{32}, \textcolor{green}{10})$
- **Interpretation**: Logits (raw scores) for 10 classes

### Memory
- **Activations**: $32 \times 10 = 320$ elements = **1,280 bytes** (1.25 KB)

### Parameters
- **Weights**: $\textcolor{blue}{128} \times \textcolor{green}{10} = 1,280$
- **Bias**: $\textcolor{green}{10}$
- **Total**: $1,280 + 10 = 1,290$ parameters = **5,160 bytes**

### FLOPs
For each output:
- 128 multiplications and 127 additions (dot product)
- 1 bias addition

$$\text{FLOPs} = N \times \text{output\_dim} \times (2 \times \text{input\_dim}) = 32 \times 10 \times (2 \times 128)$$
$$= 32 \times 10 \times 256 = 81,920 \approx \textbf{0.08 MFLOPs}$$

### Final Output
The output logits are typically passed through a Softmax function during inference:
$$\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{10} e^{z_j}}$$

This converts logits to class probabilities. During training, we use Cross-Entropy Loss:
$$\mathcal{L} = -\frac{1}{N} \sum_{n=1}^{N} \log\left(\text{Softmax}(z^{(n)})_{y_n}\right)$$

where $z^{(n)}$ is the logit vector of sample $n$ and $y_n$ is its true class label.
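
A minimal sketch of the softmax and cross-entropy computation in plain Python follows. Subtracting the maximum logit before exponentiating is a standard stability trick added here, not something the formulas above require; the function names are illustrative.

```python
import math


def softmax(logits):
    """Softmax with the maximum subtracted first for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(batch_logits, labels):
    """Mean negative log-probability of the true class over the batch."""
    losses = [
        -math.log(softmax(logits)[y])
        for logits, y in zip(batch_logits, labels)
    ]
    return sum(losses) / len(losses)


# With all-zero logits every class has probability 1/10, so the loss is ln(10).
uniform_loss = cross_entropy([[0.0] * 10, [0.0] * 10], labels=[3, 7])
print(round(uniform_loss, 4))  # 2.3026
```

A loss near $\ln(10) \approx 2.30$ is what an untrained 10-class classifier with uninformative logits would produce.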

---
# Backward Pass
---

During backpropagation, we compute gradients with respect to:
1. **Parameters** (weights and biases) - used to update the model
2. **Activations** (layer inputs) - used to propagate gradients to previous layers

We work backwards from the loss function through each layer.

## Fully Connected Layer Gradients

### Given
- **Forward**: $y = xW + b$, with $W$ of shape $(128, 10)$
- **Input Shape**: $(\textcolor{magenta}{32}, 128)$
- **Output Shape**: $(\textcolor{magenta}{32}, 10)$
- **Gradient from Loss**: $\frac{\partial \mathcal{L}}{\partial y}$ with shape $(32, 10)$

### Gradients to Compute

#### 1. Gradient w.r.t. Weights ($\frac{\partial \mathcal{L}}{\partial W}$)

$$\frac{\partial \mathcal{L}}{\partial W} = x^T \cdot \frac{\partial \mathcal{L}}{\partial y}$$

- **Shape**: $(128, 10)$ - same as $W$
- **Memory**: $128 \times 10 = 1,280$ elements = **5,120 bytes**

#### 2. Gradient w.r.t. Bias ($\frac{\partial \mathcal{L}}{\partial b}$)

$$\frac{\partial \mathcal{L}}{\partial b} = \sum_{i=1}^{N} \frac{\partial \mathcal{L}}{\partial y_i}$$

Sum across batch dimension.

- **Shape**: $(10,)$ - same as $b$
- **Memory**: $10$ elements = **40 bytes**

#### 3. Gradient w.r.t. Input ($\frac{\partial \mathcal{L}}{\partial x}$)

$$\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y} \cdot W^T$$

- **Shape**: $(32, 128)$ - same as input $x$
- **Memory**: $32 \times 128 = 4,096$ elements = **16,384 bytes**

### FLOPs

1. **Weight gradient**: Matrix multiplication $(128, 32) \times (32, 10)$
   - FLOPs: $128 \times 10 \times 2 \times 32 = 81,920$

2. **Bias gradient**: Sum over batch
   - FLOPs: $10 \times 32 = 320$

3. **Input gradient**: Matrix multiplication $(32, 10) \times (10, 128)$
   - FLOPs: $32 \times 128 \times 2 \times 10 = 81,920$

**Total FLOPs**: $81,920 + 320 + 81,920 = 164,160 \approx \textbf{0.16 MFLOPs}$

### Memory Summary (Backward)
- **Gradient Storage**: $5,120 + 40 + 16,384 = 21,544$ bytes ≈ **21 KB**
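
The three gradient formulas can be checked on a tiny example. This sketch assumes the row-vector convention $y = xW + b$ with $W$ of shape (in, out), which is what the gradient shapes above imply; the helper names and the stand-in sizes are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(row) for row in zip(*a)]


def fc_backward(x, w, grad_y):
    """Gradients of y = xW + b given dL/dy."""
    grad_w = matmul(transpose(x), grad_y)          # shape (in, out)
    grad_b = [sum(col) for col in zip(*grad_y)]    # shape (out,), summed over batch
    grad_x = matmul(grad_y, transpose(w))          # shape (N, in)
    return grad_w, grad_b, grad_x


# Tiny stand-in sizes: N=2, in=3, out=2 (the layer above has N=32, in=128, out=10)
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
w = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
grad_y = [[1.0, 0.0], [0.0, 1.0]]

grad_w, grad_b, grad_x = fc_backward(x, w, grad_y)
print(grad_w)  # [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
print(grad_b)  # [1.0, 1.0]
```

Each gradient has the same shape as the quantity it belongs to, which is a quick sanity check when implementing a backward pass by hand.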

## Global Average Pooling Gradients

### Given
- **Forward**: $y_c = \frac{1}{H \times W} \sum_{h,w} x_{c,h,w}$
- **Input Shape**: $(32, 128, 4, 4)$
- **Output Shape**: $(32, 128)$
- **Gradient from next layer**: $\frac{\partial \mathcal{L}}{\partial y}$ with shape $(32, 128)$

### Gradient Computation

Since each spatial position contributes equally to the average:

$$\frac{\partial \mathcal{L}}{\partial x_{c,h,w}} = \frac{1}{H \times W} \cdot \frac{\partial \mathcal{L}}{\partial y_c}$$

The gradient is **broadcast** from $(32, 128)$ to $(32, 128, 4, 4)$ and scaled by $\frac{1}{16}$.

### Memory
- **Gradient**: $32 \times 128 \times 4 \times 4 = 65,536$ elements = **262,144 bytes** (0.25 MB)

### Parameters
- **Gradients**: 0 (no learnable parameters)

### FLOPs
Broadcasting and scaling:
$$\text{FLOPs} = N \times C \times H \times W = 32 \times 128 \times 4 \times 4 = 65,536 \approx \textbf{0.07 MFLOPs}$$
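
For a single channel, the forward average and its gradient can be sketched as below. The helper names `gap_forward` and `gap_backward` and the sample values are assumptions made for this example.

```python
def gap_forward(x):
    """Average one channel's H x W map (list of rows) to a single value."""
    h, w = len(x), len(x[0])
    return sum(sum(row) for row in x) / (h * w)


def gap_backward(grad_y, h, w):
    """Spread the scalar upstream gradient evenly over the H x W positions."""
    share = grad_y / (h * w)
    return [[share] * w for _ in range(h)]


channel = [[float(4 * r + c) for c in range(4)] for r in range(4)]  # values 0..15
print(gap_forward(channel))   # 7.5
grad_x = gap_backward(1.6, 4, 4)
print(grad_x[0][0])           # 0.1
```

Every position receives the same share, so the 16 entries of `grad_x` add back up to the upstream gradient.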

## Block 3 Gradients: MaxPool3 ← ReLU3 ← BatchNorm3 ← Conv3

### MaxPool3 Gradients

#### Given
- **Forward Input**: $(32, 128, 8, 8)$
- **Forward Output**: $(32, 128, 4, 4)$
- **Gradient from GAP**: $\frac{\partial \mathcal{L}}{\partial y}$ with shape $(32, 128, 4, 4)$

#### Gradient Computation
Max pooling passes gradient only to the position where the maximum occurred:

$$\frac{\partial \mathcal{L}}{\partial x_{h,w}} = \begin{cases} 
\frac{\partial \mathcal{L}}{\partial y_{h',w'}} & \text{if } x_{h,w} = \max(\text{pool region}) \\
0 & \text{otherwise}
\end{cases}$$

- **Gradient Shape**: $(32, 128, 8, 8)$
- **Memory**: $32 \times 128 \times 8 \times 8 = 262,144$ elements = **1,048,576 bytes** (1 MB)
- **FLOPs**: $\approx 0.3$ MFLOPs (routing gradients)
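
The gradient routing can be sketched for one feature map as follows. Sending the gradient to the first maximum when a window contains ties is an assumption of this sketch; frameworks may break ties differently. The name `maxpool2d_backward` is illustrative.

```python
def maxpool2d_backward(x, grad_y, k=2, s=2):
    """Route each upstream gradient to the argmax of its pooling window."""
    grad_x = [[0.0] * len(x[0]) for _ in x]
    for i, grad_row in enumerate(grad_y):
        for j, g in enumerate(grad_row):
            # Position of the maximum inside window (i, j); first maximum wins on ties.
            bi, bj = max(
                ((i * s + di, j * s + dj) for di in range(k) for dj in range(k)),
                key=lambda pos: x[pos[0]][pos[1]],
            )
            grad_x[bi][bj] += g
    return grad_x


x = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 1, 6],
]
grad_y = [[10.0, 20.0], [30.0, 40.0]]
grad_x = maxpool2d_backward(x, grad_y)
for row in grad_x:
    print(row)
# [0.0, 0.0, 0.0, 0.0]
# [10.0, 0.0, 0.0, 20.0]
# [0.0, 0.0, 40.0, 0.0]
# [30.0, 0.0, 0.0, 0.0]
```

Three of every four input positions receive a zero gradient, so the gradient tensor is four times larger than the upstream one but mostly empty.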

---

### ReLU3 Gradients

#### Given
- **Forward**: $y = \max(0, x)$
- **Gradient from MaxPool**: $\frac{\partial \mathcal{L}}{\partial y}$ with shape $(32, 128, 8, 8)$

#### Gradient Computation

$$\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y} \cdot \mathbb{1}_{x > 0}$$

where $\mathbb{1}_{x > 0}$ is an indicator function (1 if $x > 0$, else 0).

- **Gradient Shape**: $(32, 128, 8, 8)$
- **Memory**: **1,048,576 bytes** (1 MB)
- **FLOPs**: $262,144 \approx \textbf{0.3 MFLOPs}$ (element-wise multiplication)

---

### BatchNorm3 Gradients

#### Given
- **Forward**: $y = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$
- **Input Shape**: $(32, 128, 8, 8)$
- **Gradient from ReLU**: $\frac{\partial \mathcal{L}}{\partial y}$ with shape $(32, 128, 8, 8)$

#### Gradient Computation

BatchNorm backward pass is complex. For each channel:

1. **Gradient w.r.t. $\gamma$**:
$$\frac{\partial \mathcal{L}}{\partial \gamma} = \sum_{n,h,w} \frac{\partial \mathcal{L}}{\partial y_{n,c,h,w}} \cdot \frac{x_{n,c,h,w} - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}}$$

2. **Gradient w.r.t. $\beta$**:
$$\frac{\partial \mathcal{L}}{\partial \beta} = \sum_{n,h,w} \frac{\partial \mathcal{L}}{\partial y_{n,c,h,w}}$$

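3. **Gradient w.r.t. input $x$**: Because $\mu_c$ and $\sigma_c^2$ depend on every element of the channel, the chain rule adds two correction terms. With $m = N \times H \times W$ and $\hat{x}_{n,c,h,w} = \frac{x_{n,c,h,w} - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}}$:
$$\frac{\partial \mathcal{L}}{\partial x_{n,c,h,w}} = \frac{\gamma}{\sqrt{\sigma_c^2 + \epsilon}} \left( \frac{\partial \mathcal{L}}{\partial y_{n,c,h,w}} - \frac{1}{m} \frac{\partial \mathcal{L}}{\partial \beta} - \frac{\hat{x}_{n,c,h,w}}{m} \frac{\partial \mathcal{L}}{\partial \gamma} \right)$$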

#### Memory
- **Input gradient**: $(32, 128, 8, 8)$ = **1,048,576 bytes** (1 MB)
- **Parameter gradients**: $\gamma$ and $\beta$ each have 128 elements = **1,024 bytes**

#### FLOPs
Similar to forward pass (computing statistics and applying chain rule):
$$\text{FLOPs} \approx 1.3 \text{ MFLOPs}$$
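
A per-channel sketch of the BatchNorm backward pass is shown below. It assumes training-mode batch statistics with the biased variance and uses the standard closed form for the input gradient, in which the mean and variance dependencies are already folded in. The function name and sample values are illustrative.

```python
import math


def batchnorm_backward(x, grad_y, gamma, eps=1e-5):
    """Backward pass of training-mode batch norm for one channel.

    x and grad_y are flat lists holding the channel's N*H*W values.
    """
    m = len(x)
    mean = sum(x) / m
    var = sum((v - mean) ** 2 for v in x) / m
    inv_std = 1.0 / math.sqrt(var + eps)
    x_hat = [(v - mean) * inv_std for v in x]

    grad_gamma = sum(g * xh for g, xh in zip(grad_y, x_hat))
    grad_beta = sum(grad_y)
    # Input gradient after folding in the mean and variance dependencies.
    grad_x = [
        gamma * inv_std / m * (m * g - grad_beta - xh * grad_gamma)
        for g, xh in zip(grad_y, x_hat)
    ]
    return grad_x, grad_gamma, grad_beta


x = [1.0, 2.0, 3.0, 4.0]
grad_y = [0.1, -0.2, 0.3, 0.4]
grad_x, grad_gamma, grad_beta = batchnorm_backward(x, grad_y, gamma=1.5)
```

One useful property to check is that the entries of `grad_x` sum to zero: shifting every input by the same constant leaves the normalized output unchanged, so the loss cannot depend on that shift.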

---

### Conv3 Gradients

#### Given
- **Forward**: Convolution with input $(32, 64, 8, 8)$, output $(32, 128, 8, 8)$
- **Filters**: $128 \times 64 \times 3 \times 3$
- **Gradient from BatchNorm**: $\frac{\partial \mathcal{L}}{\partial y}$ with shape $(32, 128, 8, 8)$

#### Gradient Computation

1. **Gradient w.r.t. Weights** ($\frac{\partial \mathcal{L}}{\partial W}$):
   - Convolve input with gradient
   - Shape: $(128, 64, 3, 3)$
   - Memory: $73,728$ elements = **294,912 bytes**

2. **Gradient w.r.t. Bias** ($\frac{\partial \mathcal{L}}{\partial b}$):
   - Sum gradient across batch and spatial dimensions
   - Shape: $(128,)$
   - Memory: $128$ elements = **512 bytes**

3. **Gradient w.r.t. Input** ($\frac{\partial \mathcal{L}}{\partial x}$):
   - Convolve the gradient with the filters flipped 180° spatially and with their input and output channel axes swapped ("full" convolution)
   - Shape: $(32, 64, 8, 8)$
   - Memory: $131,072$ elements = **524,288 bytes** (0.5 MB)

#### FLOPs
Backward convolution requires roughly **2× forward FLOPs**:
- Weight gradient convolution: $\approx 302$ MFLOPs
- Input gradient convolution: $\approx 302$ MFLOPs

**Total**: $\approx \textbf{604 MFLOPs}$
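
The 2× estimate can be written as a small helper. This is a rough counting sketch, not a profile of any real kernel: it treats each gradient as costing one forward-sized pass. The `needs_input_grad` flag is a hypothetical addition for layers whose input is raw data that needs no gradient; the tables in this notebook count both terms for every layer.

```python
def conv_forward_flops(n, c_in, c_out, h_out, w_out, k=3):
    """Forward convolution FLOPs with 2 FLOPs per multiply-accumulate."""
    return n * h_out * w_out * c_out * (2 * c_in * k * k)


def conv_backward_flops(n, c_in, c_out, h_out, w_out, k=3, needs_input_grad=True):
    """Estimate backward FLOPs as one forward-sized pass per gradient computed."""
    forward = conv_forward_flops(n, c_in, c_out, h_out, w_out, k)
    weight_grad = forward
    input_grad = forward if needs_input_grad else 0
    return weight_grad + input_grad


print(conv_backward_flops(32, 64, 128, 8, 8))   # 603979776  (Conv3)
print(conv_backward_flops(32, 32, 64, 16, 16))  # 603979776  (Conv2)
print(conv_backward_flops(32, 3, 32, 32, 32))   # 113246208  (Conv1)
```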

---

### Block 3 Backward Summary

| Layer | Gradient Shape | Memory | Param Gradients | FLOPs |
|-------|----------------|--------|-----------------|-------|
| MaxPool3 | (32, 128, 8, 8) | 1.0 MB | 0 | 0.3 M |
| ReLU3 | (32, 128, 8, 8) | 1.0 MB | 0 | 0.3 M |
| BatchNorm3 | (32, 128, 8, 8) | 1.0 MB | 1,024 B | 1.3 M |
| Conv3 | (32, 64, 8, 8) | 0.5 MB | 295,424 B | 604.0 M |
| **Block Total** | — | **3.5 MB** | **296,448 B** | **605.9 M** |

## Block 2 Gradients: MaxPool2 ← ReLU2 ← BatchNorm2 ← Conv2

Following the same pattern as Block 3, but with different dimensions:

### MaxPool2 Gradients
- **Gradient Shape**: $(32, 64, 16, 16)$
- **Memory**: $32 \times 64 \times 16 \times 16 = 524,288$ elements = **2,097,152 bytes** (2 MB)
- **FLOPs**: $\approx \textbf{0.5 MFLOPs}$

### ReLU2 Gradients
- **Gradient Shape**: $(32, 64, 16, 16)$
- **Memory**: **2,097,152 bytes** (2 MB)
- **FLOPs**: $524,288 \approx \textbf{0.5 MFLOPs}$

### BatchNorm2 Gradients
- **Input Gradient Shape**: $(32, 64, 16, 16)$
- **Memory**: **2,097,152 bytes** (2 MB)
- **Parameter Gradients**: $\gamma$ and $\beta$ (64 each) = **512 bytes**
- **FLOPs**: $\approx \textbf{2.6 MFLOPs}$

### Conv2 Gradients
- **Input Gradient Shape**: $(32, 32, 16, 16)$
- **Memory**: $32 \times 32 \times 16 \times 16 = 262,144$ elements = **1,048,576 bytes** (1 MB)
- **Weight Gradient**: $(64, 32, 3, 3)$ = $18,432$ elements = **73,728 bytes**
- **Bias Gradient**: $(64,)$ = **256 bytes**
- **FLOPs**: $\approx \textbf{604 MFLOPs}$

### Block 2 Backward Summary

| Layer | Gradient Shape | Memory | Param Gradients | FLOPs |
|-------|----------------|--------|-----------------|-------|
| MaxPool2 | (32, 64, 16, 16) | 2.0 MB | 0 | 0.5 M |
| ReLU2 | (32, 64, 16, 16) | 2.0 MB | 0 | 0.5 M |
| BatchNorm2 | (32, 64, 16, 16) | 2.0 MB | 512 B | 2.6 M |
| Conv2 | (32, 32, 16, 16) | 1.0 MB | 73,984 B | 604.0 M |
| **Block Total** | — | **7.0 MB** | **74,496 B** | **607.6 M** |

## Block 1 Gradients: MaxPool1 ← ReLU1 ← BatchNorm1 ← Conv1

### MaxPool1 Gradients
- **Gradient Shape**: $(32, 32, 32, 32)$
- **Memory**: $32 \times 32 \times 32 \times 32 = 1,048,576$ elements = **4,194,304 bytes** (4 MB)
- **FLOPs**: $\approx \textbf{1.0 MFLOPs}$

### ReLU1 Gradients
- **Gradient Shape**: $(32, 32, 32, 32)$
- **Memory**: **4,194,304 bytes** (4 MB)
- **FLOPs**: $1,048,576 \approx \textbf{1.0 MFLOPs}$

### BatchNorm1 Gradients
- **Input Gradient Shape**: $(32, 32, 32, 32)$
- **Memory**: **4,194,304 bytes** (4 MB)
- **Parameter Gradients**: $\gamma$ and $\beta$ (32 each) = **256 bytes**
- **FLOPs**: $\approx \textbf{5.2 MFLOPs}$

### Conv1 Gradients
- **Input Gradient Shape**: $(32, 3, 32, 32)$ (back to original input)
- **Memory**: $32 \times 3 \times 32 \times 32 = 98,304$ elements = **393,216 bytes** (384 KB)
- **Weight Gradient**: $(32, 3, 3, 3)$ = $864$ elements = **3,456 bytes**
- **Bias Gradient**: $(32,)$ = **128 bytes**
- **FLOPs**: $\approx \textbf{113 MFLOPs}$

### Block 1 Backward Summary

| Layer | Gradient Shape | Memory | Param Gradients | FLOPs |
|-------|----------------|--------|-----------------|-------|
| MaxPool1 | (32, 32, 32, 32) | 4.0 MB | 0 | 1.0 M |
| ReLU1 | (32, 32, 32, 32) | 4.0 MB | 0 | 1.0 M |
| BatchNorm1 | (32, 32, 32, 32) | 4.0 MB | 256 B | 5.2 M |
| Conv1 | (32, 3, 32, 32) | 384 KB | 3,584 B | 113.0 M |
| **Block Total** | — | **12.4 MB** | **3,840 B** | **120.2 M** |

---
# Summary Tables
---

## Forward Pass Summary

| Layer | Input Shape | Output Shape | Memory | Parameters | FLOPs |
|-------|-------------|--------------|--------|------------|-------|
| **Input** | — | (32, 3, 32, 32) | 384 KB | 0 | 0 |
| Conv1 | (32, 3, 32, 32) | (32, 32, 32, 32) | 4.0 MB | 896 | 56.6 M |
| BatchNorm1 | (32, 32, 32, 32) | (32, 32, 32, 32) | 4.0 MB | 64 | 5.2 M |
| ReLU1 | (32, 32, 32, 32) | (32, 32, 32, 32) | 4.0 MB | 0 | 1.0 M |
| MaxPool1 | (32, 32, 32, 32) | (32, 32, 16, 16) | 1.0 MB | 0 | 1.0 M |
| Conv2 | (32, 32, 16, 16) | (32, 64, 16, 16) | 2.0 MB | 18,496 | 302.0 M |
| BatchNorm2 | (32, 64, 16, 16) | (32, 64, 16, 16) | 2.0 MB | 128 | 2.6 M |
| ReLU2 | (32, 64, 16, 16) | (32, 64, 16, 16) | 2.0 MB | 0 | 0.5 M |
| MaxPool2 | (32, 64, 16, 16) | (32, 64, 8, 8) | 0.5 MB | 0 | 0.5 M |
| Conv3 | (32, 64, 8, 8) | (32, 128, 8, 8) | 1.0 MB | 73,856 | 302.0 M |
| BatchNorm3 | (32, 128, 8, 8) | (32, 128, 8, 8) | 1.0 MB | 256 | 1.3 M |
| ReLU3 | (32, 128, 8, 8) | (32, 128, 8, 8) | 1.0 MB | 0 | 0.3 M |
| MaxPool3 | (32, 128, 8, 8) | (32, 128, 4, 4) | 0.25 MB | 0 | 0.3 M |
| GAP | (32, 128, 4, 4) | (32, 128) | 16 KB | 0 | 0.07 M |
| FC | (32, 128) | (32, 10) | 1.25 KB | 1,290 | 0.08 M |
| **TOTAL** | — | — | **23.1 MB** | **94,986** | **673.4 M** |
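
The totals in this table can be reproduced by walking the architecture layer by layer. The sketch below hard-codes the assumptions used throughout this notebook (3×3 convolutions with stride 1 and padding 1, 2×2 max pooling with stride 2, about 5 FLOPs per element for BatchNorm, 32-bit floats); the function name `forward_summary` is illustrative.

```python
def forward_summary(n=32, in_shape=(3, 32, 32), widths=(32, 64, 128), classes=10):
    """Walk the Conv-BN-ReLU-MaxPool stack and total memory, parameters, FLOPs."""
    c, h, w = in_shape
    elements = n * c * h * w                     # input tensor
    params = flops = 0
    for c_out in widths:
        conv_out = n * c_out * h * w             # 3x3 conv, s=1, p=1 keeps H and W
        params += c_out * c * 9 + c_out          # conv weights + bias
        flops += conv_out * 2 * c * 9            # convolution
        params += 2 * c_out                      # BatchNorm gamma and beta
        flops += 5 * conv_out                    # BatchNorm estimate
        flops += conv_out                        # ReLU
        h, w = h // 2, w // 2                    # 2x2 max pool, stride 2
        pool_out = n * c_out * h * w
        flops += 4 * pool_out                    # max pool
        elements += 3 * conv_out + pool_out      # conv, BN, ReLU and pool outputs
        c = c_out
    flops += n * c * h * w                       # global average pooling
    elements += n * c
    params += c * classes + classes              # fully connected layer
    flops += n * classes * 2 * c
    elements += n * classes
    return elements * 4, params, flops


mem_bytes, params, flops = forward_summary()
print(mem_bytes, params, flops)  # 24265984 94986 673595392
```

Here 24,265,984 bytes is about 23.1 MB when the input tensor is included. The exact FLOP total (about 673.6 M) is slightly higher than the sum of the rounded per-layer entries.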

## Backward Pass Summary

| Layer | Gradient Shape | Memory | Param Grad Memory | FLOPs |
|-------|----------------|--------|-------------------|-------|
| FC | (32, 128) | 16 KB | 5,160 B | 0.16 M |
| GAP | (32, 128, 4, 4) | 0.25 MB | 0 | 0.07 M |
| MaxPool3 | (32, 128, 8, 8) | 1.0 MB | 0 | 0.3 M |
| ReLU3 | (32, 128, 8, 8) | 1.0 MB | 0 | 0.3 M |
| BatchNorm3 | (32, 128, 8, 8) | 1.0 MB | 1,024 B | 1.3 M |
| Conv3 | (32, 64, 8, 8) | 0.5 MB | 295,424 B | 604.0 M |
| MaxPool2 | (32, 64, 16, 16) | 2.0 MB | 0 | 0.5 M |
| ReLU2 | (32, 64, 16, 16) | 2.0 MB | 0 | 0.5 M |
| BatchNorm2 | (32, 64, 16, 16) | 2.0 MB | 512 B | 2.6 M |
| Conv2 | (32, 32, 16, 16) | 1.0 MB | 73,984 B | 604.0 M |
| MaxPool1 | (32, 32, 32, 32) | 4.0 MB | 0 | 1.0 M |
| ReLU1 | (32, 32, 32, 32) | 4.0 MB | 0 | 1.0 M |
| BatchNorm1 | (32, 32, 32, 32) | 4.0 MB | 256 B | 5.2 M |
| Conv1 | (32, 3, 32, 32) | 384 KB | 3,584 B | 113.0 M |
| **TOTAL** | — | **23.1 MB** | **379,944 B** | **1,334.0 M** |

## Total Network Statistics

### Parameters

| Component | Count | Memory (Bytes) |
|-----------|-------|----------------|
| Convolutional layers | 93,248 | 372,992 |
| Batch Normalization | 448 | 1,792 |
| Fully Connected | 1,290 | 5,160 |
| **Total Parameters** | **94,986** | **379,944** |
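
The parameter counts can be recomputed from the layer definitions. The helper names below are illustrative; BatchNorm running statistics are excluded because they are not learned.

```python
def conv_params(c_in, c_out, k=3):
    """Weights plus bias of a k x k convolution."""
    return c_out * c_in * k * k + c_out


def bn_params(c):
    """Learnable gamma and beta (running statistics excluded)."""
    return 2 * c


def fc_params(n_in, n_out):
    """Weights plus bias of a fully connected layer."""
    return n_in * n_out + n_out


channels = [3, 32, 64, 128]
conv_total = sum(conv_params(a, b) for a, b in zip(channels, channels[1:]))
bn_total = sum(bn_params(c) for c in channels[1:])
fc_total = fc_params(128, 10)
total = conv_total + bn_total + fc_total

print(conv_total, bn_total, fc_total, total)  # 93248 448 1290 94986
print(total * 4)                              # 379944 bytes at 4 bytes each
```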

### Memory Requirements (per batch of 32 images)

| Type | Forward | Backward | Total |
|------|---------|----------|-------|
| Activations | 23.1 MB | 23.1 MB | 46.2 MB |
| Parameters | 0.36 MB | 0.36 MB | 0.72 MB |
| Gradients | — | 0.36 MB | 0.36 MB |
| **Total** | **23.5 MB** | **23.8 MB** | **47.3 MB** |

### Computational Cost

| Pass | FLOPs | Percentage |
|------|-------|------------|
| Forward | 673.4 MFLOPs | 33.5% |
| Backward | 1,334.0 MFLOPs | 66.5% |
| **Total per iteration** | **2,007.4 MFLOPs** | **100%** |

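**Note**: The backward pass requires approximately **2× the FLOPs** of the forward pass because each convolutional and fully connected layer computes two gradients, each costing about as much as its forward pass:
1. The gradient with respect to its parameters
2. The gradient with respect to its input

Optimizer weight updates are not included in these counts.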

### Computational Breakdown by Layer Type

| Layer Type | Forward FLOPs | % of Total Forward |
|------------|---------------|--------------------|
| Convolution | 660.6 M | 98.1% |
| Batch Normalization | 9.1 M | 1.4% |
| Activation (ReLU) | 1.8 M | 0.3% |
| Pooling | 1.8 M | 0.3% |
| Fully Connected | 0.08 M | 0.01% |

**Key Insight**: Convolutional layers dominate computation (~98%), making them the primary target for optimization.

## Analysis and Insights

### 1. **Parameter Efficiency**
- Total parameters: **94,986** (~95K)
- This is extremely lightweight compared to fully-connected architectures
- For comparison, a single FC layer from 32×32×3 flattened input to 128 features would require 393,344 parameters alone

### 2. **Memory Requirements**
- Forward pass: **23.5 MB** per batch
- Backward pass: **23.8 MB** per batch
- Training requires storing both, plus optimizer states (e.g., momentum)
- For the Adam optimizer, multiply the parameter memory by ~3× (parameters + first moment + second moment)
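
The effect of the optimizer on parameter-related memory can be estimated with a short helper. This is a simplified sketch: it assumes one gradient copy plus zero, one, or two extra state tensors per parameter for plain SGD, momentum SGD, and Adam respectively, all in 32-bit floats. The function name is illustrative.

```python
def training_state_bytes(num_params, optimizer="adam", bytes_per_element=4):
    """Bytes for parameters, their gradients, and optimizer state."""
    state_copies = {"sgd": 0, "momentum": 1, "adam": 2}[optimizer]
    weights = num_params * bytes_per_element
    grads = num_params * bytes_per_element
    opt_state = state_copies * num_params * bytes_per_element
    return weights + grads + opt_state


print(training_state_bytes(94_986, "sgd"))   # 759888
print(training_state_bytes(94_986, "adam"))  # 1519776
```

Even with Adam this stays around 1.5 MB, so for this network the activations, not the parameters, dominate training memory.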

### 3. **Computational Cost**
- Dominated by convolutional operations (98% of FLOPs)
- Conv2 and Conv3 are the bottlenecks despite smaller spatial dimensions
- This is because FLOPs scale with $C_{in} \times C_{out}$, not just spatial size

### 4. **Design Trade-offs**
- **Early layers** (Block 1): Large spatial dimensions but few channels → moderate compute
- **Later layers** (Blocks 2 and 3): Smaller spatial dimensions but more channels → high compute
- **Global Average Pooling**: Keeps the classifier small; flattening the 128×4×4 output instead would need 2,048 × 10 + 10 = 20,490 FC parameters rather than 1,290

### 5. **Optimization Opportunities**
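- **Depthwise separable convolutions**: Can reduce convolution FLOPs by roughly 8× for the 3×3 layers here (the ratio approaches 9× as the channel count grows), usually at a small accuracy cost
- **Mixed precision training**: Store activations in FP16 instead of FP32 → roughly 2× less activation memory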
- **Gradient checkpointing**: Trade computation for memory by recomputing activations during backward pass
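
The depthwise separable saving can be estimated by counting multiply-accumulates for Conv2 and Conv3. The sketch assumes the usual factorization into a per-channel 3×3 depthwise convolution followed by a 1×1 pointwise convolution; the function names are illustrative.

```python
def standard_conv_macs(c_in, c_out, h, w, k=3):
    """Multiply-accumulates of a standard k x k convolution (one sample)."""
    return h * w * c_in * c_out * k * k


def depthwise_separable_macs(c_in, c_out, h, w, k=3):
    """Multiply-accumulates of a depthwise k x k plus pointwise 1 x 1 convolution."""
    depthwise = h * w * c_in * k * k      # one k x k filter per input channel
    pointwise = h * w * c_in * c_out      # 1 x 1 convolution mixing channels
    return depthwise + pointwise


for name, c_in, c_out, size in [("Conv2", 32, 64, 16), ("Conv3", 64, 128, 8)]:
    standard = standard_conv_macs(c_in, c_out, size, size)
    separable = depthwise_separable_macs(c_in, c_out, size, size)
    print(name, round(standard / separable, 2))
# Conv2 7.89
# Conv3 8.41
```

The ratio equals $1 / (1/C_{out} + 1/k^2)$, so it approaches $k^2 = 9$ as the number of output channels grows.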

### 6. **Scaling Considerations**
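- Doubling input resolution (32×32 → 64×64): **4× activation memory**, **4× FLOPs**
- Doubling channels at each layer: **~4× parameters**, **~4× FLOPs** (Conv1 only doubles, since its 3 input channels are fixed)
- Adding one more conv block (Conv(128→256) on the 4×4 feature maps): **+~300M FLOPs**, **+~296K parameters**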
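
These scaling rules can be checked numerically for the convolutional layers, which dominate the cost. The sketch below counts only convolution FLOPs and parameters and assumes a 2×2 pool after every convolution; the function name `conv_stack_cost` and the variant configurations are illustrative.

```python
def conv_stack_cost(widths, in_channels=3, size=32, n=32, k=3):
    """Conv FLOPs and conv parameters for a stack with 2x2 pooling after each conv."""
    flops = params = 0
    c_in = in_channels
    for c_out in widths:
        flops += n * size * size * c_out * 2 * c_in * k * k
        params += c_out * c_in * k * k + c_out
        c_in, size = c_out, size // 2
    return flops, params


base_flops, base_params = conv_stack_cost((32, 64, 128))
hi_res_flops, _ = conv_stack_cost((32, 64, 128), size=64)
wide_flops, wide_params = conv_stack_cost((64, 128, 256))
deep_flops, deep_params = conv_stack_cost((32, 64, 128, 256))

print(hi_res_flops / base_flops)            # 4.0
print(round(wide_flops / base_flops, 2))    # 3.83
print(round(wide_params / base_params, 2))  # 3.98
print(deep_flops - base_flops, deep_params - base_params)  # 301989888 295168
```

Doubling the channels gives slightly less than 4× because the first convolution still reads only 3 input channels.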