# Convolutional Networks

### Limitations of MLP 

**Problem 1** 

Suppose we train a network on RGB images of size $(3 \times 200 \times 200)$. <br>
Flattening each image gives a single vector of size $120{,}000$. Now consider passing it through a single $\text{Fully Connected Layer}$ with $1000$ hidden units. We'd need $120{,}000 \times 1000 = 120 \text{ million parameters!!}$ This isn't feasible.
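
The count can be checked with a few lines of arithmetic. This is only a sketch: bias terms are ignored, as in the estimate above, and the variable names are illustrative.

```python
# Weights in one fully connected layer applied to a flattened RGB image.
channels, height, width = 3, 200, 200
hidden_units = 1000

input_size = channels * height * width   # length of the flattened image vector
fc_weights = input_size * hidden_units   # one weight per (input value, hidden unit) pair

print(input_size)   # 120000
print(fc_weights)   # 120000000
```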

**Problem 2**

Note that an image contains values that describe structure at several scales: 
1. Small Scale - Color pixels and brightness
2. Intermediate Scale - Basic shapes (line/strokes of colors)
3. Mid Scale - Basic Image context (face/parts of object) 
4. Large Scale - Larger collection of objects
   
A Dense NN flattens the image and connects every pixel to every unit, so it ignores which pixels are neighbors, and this spatial structure is lost before the network learns anything valuable. We need operations that preserve it.

---

## Invariance and equivariance

We first address the second problem mathematically. 

$\text{Let } f \text{ be a function } f: (3 \times m \times n) \to (3 \times m \times n) \text{ which takes an image and outputs an image. } f \text{ is said to be } invariant \text{ to a transformation } t[x] \text{ if:}$ $$\boxed{f[t[x]] = f[x]}$$

In terms of images this means: The network $f[x]$ should identify an image as containing the same object even if it has been translated, rotated, flipped, or warped.

$\text{We say that the function } f \text{ is } equivariant \text{ or } covariant \text{ to a transformation } t[x] \text{ if:}$ $$\boxed{f[t[x]] = t[f[x]]}$$

In terms of images this means: If the image is translated, rotated, or flipped, then the network $f[x]$ should return a segmentation that has been transformed in the same way.

---

There are 2 main layers involved in a Convolutional Network: 

1. $\textbf{Convolutional Layers}$
   - These apply the convolution operation, which is $equivariant$ to $\text{Translation}$
2. $\textbf{Pooling Layers}$
   - These summarize (pool) local regions of the transformed image, which makes the output $partially \ invariant \ to \ translation$
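
Both properties can be sketched in 1D. The sketch makes two simplifying assumptions that are not part of these notes: the signal wraps around at its borders (so a shift loses no values), and pooling is a global max rather than a local window (local windows only give invariance to small shifts). The helper names `conv1d_circular` and `shift` are illustrative.

```python
def conv1d_circular(x, w):
    """Weighted sum of neighbouring values, wrapping around at the borders."""
    n, k = len(x), len(w)
    return [sum(w[j] * x[(i + j) % n] for j in range(k)) for i in range(n)]

def shift(x, t):
    """Translate a signal t positions to the right (circularly)."""
    t %= len(x)
    return x[-t:] + x[:-t]

x = [0, 2, 5, 3, 7, 0, 0, 0]
w = [1, 0, -1]

# Equivariance: convolving a shifted input equals shifting the convolved input.
print(conv1d_circular(shift(x, 2), w) == shift(conv1d_circular(x, w), 2))   # True

# Invariance: a global max over the feature map does not change under the shift.
print(max(conv1d_circular(shift(x, 2), w)) == max(conv1d_circular(x, w)))   # True
```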

### Convolution Operation

| **Aspect** | **1D Convolution** | **2D Convolution** |
|------------|--------------------|--------------------|
| **Simple Explanation** | A convolution is a **weighted sum** of neighboring input values, where the weights form a **kernel** (or filter). The kernel slides across the input, computing the weighted sum at each position. The number of weights in the kernel is called the **kernel size**. | A 2D convolution is a **weighted sum** of neighboring pixel values in a local region, where the weights form a 2D **kernel** (or filter). The kernel slides across the image (both horizontally and vertically), computing the weighted sum at each position. The kernel dimensions (e.g., $3 \times 3$, $5 \times 5$) define the **kernel size**. |
| **Mathematical Formulation** | For a 1D input $\mathbf{x} = [x_1, x_2, ..., x_n]$ and kernel $\boldsymbol{\omega} = [\omega_1, \omega_2, ..., \omega_k]$ of size $k$, the convolution output at position $i$ is: $$z_i = \sum_{j=0}^{k-1} \omega_{j+1} \cdot x_{i+j}$$ For kernel size 3 centered at position $i$: $$z_i = \omega_1 x_{i-1} + \omega_2 x_i + \omega_3 x_{i+1}$$ | For a 2D input $\mathbf{X} \in \mathbb{R}^{H \times W}$ and kernel $\boldsymbol{\Omega} \in \mathbb{R}^{k_h \times k_w}$, the convolution output at position $(i, j)$ is: $$z_{i,j} = \sum_{m=0}^{k_h-1} \sum_{n=0}^{k_w-1} \omega_{m,n} \cdot x_{i+m, j+n}$$ For a $3 \times 3$ kernel centered at $(i,j)$: $$z_{i,j} = \sum_{m=-1}^{1} \sum_{n=-1}^{1} \omega_{m+1,n+1} \cdot x_{i+m, j+n}$$ |
| **Example** | **Input vector**: $\mathbf{x} = [2, 5, 3, 7]$ <br> **Kernel**: $\boldsymbol{\omega} = [1, 0, -1]$ (kernel size = 3) <br><br> **Computation** (with zero padding): <br> $z_1 = 1(0) + 0(2) + (-1)(5) = -5$ <br> $z_2 = 1(2) + 0(5) + (-1)(3) = -1$ <br> $z_3 = 1(5) + 0(3) + (-1)(7) = -2$ <br> $z_4 = 1(3) + 0(7) + (-1)(0) = 3$ <br><br> **Output vector**: $\mathbf{z} = [-5, -1, -2, 3]$ | **Input matrix**: $\mathbf{X} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}$ <br> **Kernel**: $\boldsymbol{\Omega} = \begin{bmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{bmatrix}$ (size $3 \times 3$) <br><br> **Computation** (center position): <br> $z_{2,2} = 1(1) + 0(2) + (-1)(3) +$ <br> $\quad\quad\quad 1(4) + 0(5) + (-1)(6) +$ <br> $\quad\quad\quad 1(7) + 0(8) + (-1)(9)$ <br> $z_{2,2} = -2 - 2 - 2 = -6$ <br><br> **Output** (with valid padding): single value $z = -6$ |
| **Visualization** | <img src="../images/chap8/conv1.png" width="200" /> | <img src="../images/chap8/convol2D1.png" width="400" /> |
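
The two definitions in the table can be written out directly. The following is a minimal pure-Python sketch, not a library implementation. Like the formulas above, it does not flip the kernel. The function names `conv1d_same` and `conv2d_valid` are illustrative, and the 1D version assumes an odd kernel size.

```python
def conv1d_same(x, w):
    """1D convolution with zero padding so the output has the same length as x."""
    k = len(w)
    p = k // 2                                   # zeros added on each side (odd k)
    padded = [0] * p + list(x) + [0] * p
    return [sum(w[j] * padded[i + j] for j in range(k)) for i in range(len(x))]

def conv2d_valid(X, K):
    """2D convolution without padding: the kernel stays fully inside the input."""
    k_h, k_w = len(K), len(K[0])
    out_h, out_w = len(X) - k_h + 1, len(X[0]) - k_w + 1
    return [[sum(K[m][n] * X[i + m][j + n] for m in range(k_h) for n in range(k_w))
             for j in range(out_w)]
            for i in range(out_h)]

print(conv1d_same([2, 5, 3, 7], [1, 0, -1]))   # [-5, -1, -2, 3]

X = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
K = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]
print(conv2d_valid(X, K))   # a 1 x 1 output for a 3 x 3 input and a 3 x 3 kernel
```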

### Padding Strategies

When applying convolutions, we need to handle boundaries where the kernel extends beyond the input. Different **padding strategies** determine the output size and boundary behavior.

| **Padding Type** | **Description** | **1D Example** | **2D Example** | **Visualization**|
|------------------|-----------------|----------------|----------------|------------------|
| **Valid Padding** | **No padding** is added.<br>The kernel only slides over valid positions where it fully overlaps the input.<br> This **reduces output size**. <br><br> **Output size**: $n_{out} = n_{in} - k + 1$ <br> where $n_{in}$ = input size, $k$ = kernel size | **Input**: $\mathbf{x} = [2, 5, 3, 7]$ (size 4) <br> **Kernel**: $\boldsymbol{\omega} = [1, 0, -1]$ (size 3) <br><br> **Valid positions**: 2 positions <br> $z_1 = 1(2) + 0(5) + (-1)(3) = -1$ <br> $z_2 = 1(5) + 0(3) + (-1)(7) = -2$ <br><br> **Output**: $\mathbf{z} = [-1, -2]$ (size 2) | **Input**: $\mathbf{X} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}$ (size $3 \times 3$) <br> **Kernel**: $3 \times 3$ <br><br> **Valid position**: only center <br> $z_{1,1} = 1(1) + ... + (-1)(9)$ <br><br> **Output**: single value (size $1 \times 1$) | <img src="../images/chap8/nopad.png" width="300" />|
| **Same / Half Padding** | Add **zeros** around the input boundary so that **output size equals input size** (when stride = 1).<br> Most common in deep CNNs. <br><br> **Padding amount**: $p = \lfloor k/2 \rfloor$ <br> **Output size**: $n_{out} = n_{in}$ (same as input) | **Input**: $\mathbf{x} = [2, 5, 3, 7]$ (size 4) <br> **Kernel**: $\boldsymbol{\omega} = [1, 0, -1]$ (size 3) <br> **Padding**: $p = \lfloor 3/2 \rfloor = 1$ <br> **Padded**: $[0, 2, 5, 3, 7, 0]$ <br><br> $z_1 = 1(0) + 0(2) + (-1)(5) = -5$ <br> $z_2 = 1(2) + 0(5) + (-1)(3) = -1$ <br> $z_3 = 1(5) + 0(3) + (-1)(7) = -2$ <br> $z_4 = 1(3) + 0(7) + (-1)(0) = 3$ <br><br> **Output**: $\mathbf{z} = [-5, -1, -2, 3]$ (size 4) | **Input**: $\mathbf{X} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}$ (size $3 \times 3$) <br> **Kernel**: $3 \times 3$ <br> **Padding**: $p = 1$ on all sides <br> **Padded**: $\begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 2 & 3 & 0 \\ 0 & 4 & 5 & 6 & 0 \\ 0 & 7 & 8 & 9 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}$ <br><br> **Output**: $3 \times 3$ matrix (same size) |<img src="../images/chap8/halfpad.png" width="300" />|
| **Full Padding** | Add **maximum padding** with zeros so that the kernel is applied at every position where it overlaps the input by at least one element.<br> The kernel can extend beyond the input by up to $k - 1$ elements on either side.<br> This **increases output size**. <br><br> **Padding amount**: $p = k - 1$ <br> **Output size**: $n_{out} = n_{in} + k - 1$ | **Input**: $\mathbf{x} = [2, 5, 3, 7]$ (size 4) <br> **Kernel**: $\boldsymbol{\omega} = [1, 0, -1]$ (size 3) <br> **Padding**: $p = 3 - 1 = 2$ <br> **Padded**: $[0, 0, 2, 5, 3, 7, 0, 0]$ <br><br> $z_1 = 1(0) + 0(0) + (-1)(2) = -2$ <br> $z_2 = 1(0) + 0(2) + (-1)(5) = -5$ <br> $z_3 = 1(2) + 0(5) + (-1)(3) = -1$ <br> $z_4 = 1(5) + 0(3) + (-1)(7) = -2$ <br> $z_5 = 1(3) + 0(7) + (-1)(0) = 3$ <br> $z_6 = 1(7) + 0(0) + (-1)(0) = 7$ <br><br> **Output**: $\mathbf{z} = [-2, -5, -1, -2, 3, 7]$ (size 6) | **Input**: $\mathbf{X} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}$ (size $3 \times 3$) <br> **Kernel**: $3 \times 3$ <br> **Padding**: $p = 2$ on all sides <br> **Padded**: $\begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 2 & 3 & 0 & 0 \\ 0 & 0 & 4 & 5 & 6 & 0 & 0 \\ 0 & 0 & 7 & 8 & 9 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}$ <br><br> **Output**: $5 \times 5$ matrix |<img src="../images/chap8/fullpad.png" width="300" />|

**Key Insights:**
- **Valid**: No boundary artifacts, but loses spatial resolution → shrinking output
- **Same/Half**: Maintains spatial dimensions → most common in CNNs (ResNet, VGG, etc.)
- **Full**: Increases spatial dimensions → useful in transposed convolutions (upsampling)
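
The three strategies differ only in how many zeros are added before the kernel slides. A small sketch that reproduces the 1D examples from the table (the function name `conv1d_padded` and its `padding` argument are illustrative, and the "same" case assumes an odd kernel size):

```python
def conv1d_padded(x, w, padding="valid"):
    """1D convolution with stride 1 under the three padding strategies."""
    k = len(w)
    p = {"valid": 0, "same": k // 2, "full": k - 1}[padding]
    padded = [0] * p + list(x) + [0] * p
    n_out = len(padded) - k + 1                  # n_in + 2p - k + 1
    return [sum(w[j] * padded[i + j] for j in range(k)) for i in range(n_out)]

x, w = [2, 5, 3, 7], [1, 0, -1]
for mode in ("valid", "same", "full"):
    print(mode, conv1d_padded(x, w, mode))
# valid [-1, -2]
# same [-5, -1, -2, 3]
# full [-2, -5, -1, -2, 3, 7]
```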

### Convolution Hyperparameters: Stride, Kernel Size, and Dilation

These three hyperparameters control the **receptive field**, **output size**, and **computational efficiency** of convolutional layers.

| **Parameter** | **Description** | **1D Example** | **2D Example** |
|---------------|-----------------|----------------|----------------|
| **Kernel Size** | The **spatial extent** of the filter/kernel. Determines how many neighboring values contribute to each output. <br><br> Common sizes: $3 \times 3$, $5 \times 5$, $7 \times 7$ <br> Larger kernels = larger receptive field <br> **Output size** (valid padding): $n_{out} = n_{in} - k + 1$ | **Input**: $\mathbf{x} = [1, 2, 3, 4, 5, 6]$ (size 6) <br><br> **Kernel size 3**: $\boldsymbol{\omega} = [1, 2, 1]$ <br> $z_1 = 1(1) + 2(2) + 1(3) = 8$ <br> $z_2 = 1(2) + 2(3) + 1(4) = 12$ <br> $z_3 = 1(3) + 2(4) + 1(5) = 16$ <br> $z_4 = 1(4) + 2(5) + 1(6) = 20$ <br> **Output**: $[8, 12, 16, 20]$ (size 4) <br><br> **Kernel size 5**: $\boldsymbol{\omega} = [1, 1, 1, 1, 1]$ <br> $z_1 = 1 + 2 + 3 + 4 + 5 = 15$ <br> $z_2 = 2 + 3 + 4 + 5 + 6 = 20$ <br> **Output**: $[15, 20]$ (size 2) | **Input**: $\mathbf{X} = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \\ 9 & 10 & 11 & 12 \\ 13 & 14 & 15 & 16 \end{bmatrix}$ (size $4 \times 4$) <br><br> **Kernel $2 \times 2$**: <br> Output size: $(4-2+1) \times (4-2+1) = 3 \times 3$ <br><br> **Kernel $3 \times 3$**: <br> Output size: $(4-3+1) \times (4-3+1) = 2 \times 2$ <br><br> **Kernel $4 \times 4$**: <br> Output size: $(4-4+1) \times (4-4+1) = 1 \times 1$ <br><br> Larger kernel = smaller output, bigger receptive field |
| **Stride** | The **step size** by which the kernel moves across the input. Controls downsampling. <br><br> **Stride = 1**: Kernel moves one position at a time (default) <br> **Stride > 1**: Skip positions, reduce output size <br> **Output size**: $n_{out} = \lfloor \frac{n_{in} - k}{s} \rfloor + 1$ where $s$ = stride | **Input**: $\mathbf{x} = [1, 2, 3, 4, 5, 6, 7, 8]$ (size 8) <br> **Kernel**: $\boldsymbol{\omega} = [1, 0, -1]$ (size 3) <br><br> **Stride = 1** (every position): <br> $z_1 = 1(1) + 0(2) - 1(3) = -2$ <br> $z_2 = 1(2) + 0(3) - 1(4) = -2$ <br> $z_3 = 1(3) + 0(4) - 1(5) = -2$ <br> ... <br> **Output**: 6 values <br><br> **Stride = 2** (every other position): <br> $z_1 = 1(1) + 0(2) - 1(3) = -2$ <br> $z_2 = 1(3) + 0(4) - 1(5) = -2$ <br> $z_3 = 1(5) + 0(6) - 1(7) = -2$ <br> **Output**: $[-2, -2, -2]$ (size 3) <br><br> **Stride = 3**: <br> **Output**: 2 values | **Input**: $4 \times 4$ matrix <br> **Kernel**: $3 \times 3$ <br><br> **Stride = 1**: <br> Output: $\lfloor \frac{4-3}{1} \rfloor + 1 = 2$ <br> Output size: $2 \times 2$ <br> Positions: $(0,0), (0,1), (1,0), (1,1)$ <br><br> **Stride = 2**: <br> Output: $\lfloor \frac{4-3}{2} \rfloor + 1 = 1$ <br> Output size: $1 \times 1$ <br> Position: only $(0,0)$ <br><br> **Stride = (1, 2)** (different per axis): <br> Vertical stride = 1, Horizontal stride = 2 <br> Output size: $2 \times 1$ <br><br> Common: stride=2 for downsampling by 2× |
| **Dilation** | Spacing between kernel elements.<br> Creates an **expanded receptive field** without adding parameters.<br> Also called "atrous convolution". <br><br> **Dilation = 1**: Standard convolution (no gaps) <br> **Dilation = $d$**: Insert $d-1$ zeros between kernel weights <br> **Effective kernel size**: $k_{eff} = k + (k-1)(d-1)$ | **Input**: $\mathbf{x} = [1, 2, 3, 4, 5, 6, 7, 8]$ (size 8) <br> **Kernel**: $\boldsymbol{\omega} = [1, 0, -1]$ (size 3) <br><br> **Dilation = 1** (standard): <br> $z_1 = 1(1) + 0(2) - 1(3) = -2$ <br> Uses positions: $i, i+1, i+2$ <br> Effective size: 3 <br><br> **Dilation = 2** (1 gap between): <br> $z_1 = 1(1) + 0(3) - 1(5) = -4$ <br> Uses positions: $i, i+2, i+4$ <br> Effective size: $3 + (3-1)(2-1) = 5$ <br> **Output**: $[-4, -4, -4, -4]$ <br><br> **Dilation = 3** (2 gaps): <br> $z_1 = 1(1) + 0(4) - 1(7) = -6$ <br> Uses positions: $i, i+3, i+6$ <br> Effective size: $3 + (3-1)(3-1) = 7$ | **Input**: $8 \times 8$ matrix <br> **Kernel**: $3 \times 3$ <br><br> **Dilation = 1**: <br> Standard $3 \times 3$ convolution <br> Receptive field: $3 \times 3 = 9$ pixels <br><br> **Dilation = 2**: <br> Kernel pattern: <br> $\begin{bmatrix} \omega_{0,0} & 0 & \omega_{0,1} & 0 & \omega_{0,2} \\ 0 & 0 & 0 & 0 & 0 \\ \omega_{1,0} & 0 & \omega_{1,1} & 0 & \omega_{1,2} \\ 0 & 0 & 0 & 0 & 0 \\ \omega_{2,0} & 0 & \omega_{2,1} & 0 & \omega_{2,2} \end{bmatrix}$ <br> Effective size: $5 \times 5$ <br> Receptive field: spans a $5 \times 5 = 25$ pixel window, sampling 9 of them <br> Only 9 parameters! <br><br> **Dilation = 4**: <br> Effective size: $9 \times 9$ <br> Spans an 81-pixel window with 9 parameters <br><br> Used in: DeepLab, WaveNet |
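
Stride and dilation only change which input positions each output reads. A minimal 1D sketch with no padding that reproduces the values in the table (the function name `conv1d_strided_dilated` is illustrative):

```python
def conv1d_strided_dilated(x, w, stride=1, dilation=1):
    """1D convolution without padding, with configurable stride and dilation."""
    k = len(w)
    k_eff = k + (k - 1) * (dilation - 1)         # effective kernel size
    n_out = (len(x) - k_eff) // stride + 1
    # Output i starts at input position i * stride and reads every dilation-th value.
    return [sum(w[j] * x[i * stride + j * dilation] for j in range(k))
            for i in range(n_out)]

x = [1, 2, 3, 4, 5, 6, 7, 8]
w = [1, 0, -1]
print(conv1d_strided_dilated(x, w))               # [-2, -2, -2, -2, -2, -2]
print(conv1d_strided_dilated(x, w, stride=2))     # [-2, -2, -2]
print(conv1d_strided_dilated(x, w, dilation=2))   # [-4, -4, -4, -4]
```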

**General Output Size Formula:**

For input size $n_{in}$, kernel size $k$, padding $p$, stride $s$, and dilation $d$:

$$\boxed{n_{out} = \left\lfloor \frac{n_{in} + 2p - d(k-1) - 1}{s} \right\rfloor + 1}$$
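
The formula translates directly into integer arithmetic. A short sketch (the function name `conv_output_size` is illustrative, and it assumes the padded input is at least as large as the effective kernel):

```python
def conv_output_size(n_in, k, p=0, s=1, d=1):
    """Output length along one axis for input n_in, kernel k, padding p, stride s, dilation d."""
    return (n_in + 2 * p - d * (k - 1) - 1) // s + 1

print(conv_output_size(4, 3))        # valid padding: 2
print(conv_output_size(4, 3, p=1))   # same padding: 4
print(conv_output_size(4, 3, p=2))   # full padding: 6
print(conv_output_size(8, 3, s=2))   # stride 2: 3
print(conv_output_size(8, 3, d=2))   # dilation 2: 4
```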

**Key Tradeoffs:**
- **Large kernel size**: More parameters, more computation, bigger receptive field
- **Large stride**: Faster computation, aggressive downsampling, may lose information
- **Large dilation**: Bigger receptive field with no extra parameters (it grows exponentially with depth when the dilation doubles from layer to layer, as in WaveNet), but gaps in coverage


---

### Receptive Field

#### What is the Receptive Field?

The **receptive field** of a neuron in a CNN is the **region of the input image** that influences the output of that neuron.<br> In other words, it's the "field of view" that a particular output pixel can "see" in the original input.

- For a **single convolutional layer**: The receptive field is simply the kernel size
- For **deep networks**: Each layer's receptive field grows, allowing deeper neurons to see larger portions of the input
- With stride 1 the receptive field grows **linearly** with depth; strides greater than 1 in earlier layers **multiply** the growth contributed by every later layer

#### What Does It Tell Us?

The receptive field is crucial because it determines:

1. **Context Understanding**: Larger receptive fields allow the network to capture more global context
   - Small receptive field: Good for detecting **local features** (edges, textures)
   - Large receptive field: Good for understanding **spatial relationships** and **semantic content**

2. **Network Depth Requirements**: To recognize large objects, we need neurons with receptive fields large enough to cover them
   - Face detection: Need receptive field ≥ face size
   - Scene understanding: Need receptive field covering significant portion of image

3. **Design Choices**: 
   - Shallow networks with large kernels vs. deep networks with small kernels
   - Trade-off: $3 \times (3 \times 3)$ convolutions have same receptive field as $1 \times (7 \times 7)$ but with **fewer parameters** and **more non-linearity**
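
The parameter claim can be checked by counting weights. The sketch below assumes, hypothetically, that every layer has the same number of input and output channels `C` and ignores biases; `conv_weights` is an illustrative helper.

```python
def conv_weights(c_in, c_out, k):
    """Weights in a k x k convolutional layer, ignoring biases."""
    return c_out * c_in * k * k

C = 64  # hypothetical channel width, identical for every layer
stacked_3x3 = 3 * conv_weights(C, C, 3)   # three 3 x 3 layers: 27 * C^2
single_7x7 = conv_weights(C, C, 7)        # one 7 x 7 layer:    49 * C^2

print(stacked_3x3)   # 110592
print(single_7x7)    # 200704
```

Both options see a $7 \times 7$ input window, but the stacked version uses roughly 55% of the weights and applies a non-linearity after each of its three layers.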

#### How is it Computed?

| **Case** | **Formula** | **Description** |
|----------|-------------|-----------------|
| **Single Layer** | $$\boxed{r = k}$$ | For one convolutional layer, the receptive field size $r$ equals the kernel size $k$. |
| **Multiple Layers (Iterative)** | $$\boxed{r_l = r_{l-1} + (k_l - 1) \cdot d_l \cdot \prod_{i=1}^{l-1} s_i}$$ | For layer $l$ with kernel size $k_l$, stride $s_l$, and dilation $d_l$. <br> Starting with $r_0 = 1$ (the input pixel itself). <br> The product runs over the strides of the preceding layers only (it equals 1 for $l = 1$). <br> Build receptive field layer by layer. |
| **Simplified (stride=1, dilation=1)** | $$\boxed{r_L = 1 + L \cdot (k - 1)}$$ <br> Or equivalently: <br> $$\boxed{r_L = k + (L-1)(k-1)}$$ | For $L$ layers, each with kernel size $k$ and stride 1. <br> Linear growth with depth. <br> Each layer adds $(k-1)$ to receptive field. |
| **General Closed Form** | $$\boxed{r_L = \sum_{i=1}^{L} \left[(k_i - 1) \prod_{j=1}^{i-1} s_j \right] + 1}$$ | For $L$ layers with kernel sizes $k_1, k_2, ..., k_L$ and strides $s_1, s_2, ..., s_L$. <br> Handles arbitrary kernel sizes and strides (dilation = 1). <br> Strides multiply receptive field growth. |

#### Examples:

| **Example 1** | **Example 2** | **Example 3** | **Example 4** |
|---------------|---------------|---------------|---------------|
| **Three layers, kernel=3, stride=1** | **Three layers, kernel=3, stride=2** | **Mix of kernel sizes** | **With dilation** |
| Layer 1: $r_1 = 3$ <br> Layer 2: $r_2 = 3 + (3-1) \cdot 1 = 5$ <br> Layer 3: $r_3 = 5 + (3-1) \cdot 1 = 7$ <br><br> **Key insight**: Each additional layer adds $(k-1) = 2$ to the receptive field. | Layer 1: $r_1 = 3$ <br> Layer 2: $r_2 = 3 + (3-1) \cdot 2 = 7$ <br> Layer 3: $r_3 = 7 + (3-1) \cdot (2 \cdot 2) = 15$ <br><br> **Key insight**: Strides **amplify** receptive field growth! | Layers: $k_1=7$, $k_2=3$, $k_3=3$ <br> (all stride=1) <br><br> Layer 1: $r_1 = 7$ <br> Layer 2: $r_2 = 7 + (3-1) = 9$ <br> Layer 3: $r_3 = 9 + (3-1) = 11$ <br><br> **Key insight**: Larger initial kernel gives head start | Layer with kernel=3, dilation=2, stride=1: <br><br> Effective kernel size: <br> $k_{eff} = 3 + (3-1)(2-1) = 5$ <br><br> Receptive field increases by $(5-1) = 4$ <br><br> **Key insight**: Dilation expands receptive field without extra parameters |
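
The iterative rule is easy to run layer by layer. A minimal sketch that reproduces the four examples (the function name `receptive_field` and the tuple layout are illustrative choices):

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride, dilation) tuples, first layer first."""
    r, jump = 1, 1                    # r_0 = 1; jump = product of earlier strides
    for k, s, d in layers:
        r += (k - 1) * d * jump       # growth contributed by this layer
        jump *= s                     # this layer's stride affects later layers only
    return r

print(receptive_field([(3, 1, 1)] * 3))                     # 7
print(receptive_field([(3, 2, 1)] * 3))                     # 15
print(receptive_field([(7, 1, 1), (3, 1, 1), (3, 1, 1)]))   # 11
print(receptive_field([(3, 1, 2)]))                         # 5
```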

---


### Channels in Convolutional Layers

#### What are Channels?

**Channels** represent the **depth dimension** of the input/output, separate from spatial dimensions (height and width).

- **Input channels**: Number of "layers" or "features" in the input
  - Grayscale image: 1 channel (intensity)
  - RGB image: 3 channels (Red, Green, Blue)
  - Intermediate layers: arbitrary number of channels (learned features)

- **Output channels**: Number of different filters/feature maps we want to produce
  - Controlled by the number of filters in the layer
  - Each filter produces one output channel

---

#### Why Do We Use Multiple Channels (Filters)?

Adding multiple output channels means applying **multiple different filters** to the same input. Each filter learns to detect **different features**:

**1. Feature Diversity** 
- Different filters learn to detect different patterns:
  - Filter 1 might detect **horizontal edges**
  - Filter 2 might detect **vertical edges**
  - Filter 3 might detect **diagonal lines**
  - Filter 4 might detect **color gradients**
  - Filter 5 might detect **textures**
  
**2. Hierarchical Feature Learning**
- **Early layers** (few channels → many channels):
  - Input: 3 channels (RGB)
  - Output: 64 channels
  - Learn **low-level features**: edges, corners, simple textures
  
- **Middle layers** (many channels → more channels):
  - Input: 64 channels
  - Output: 128/256 channels
  - Learn **mid-level features**: combinations of edges (shapes, parts of objects)
  
- **Deep layers** (many channels → many channels):
  - Input: 256 channels
  - Output: 512 channels
  - Learn **high-level features**: object parts, semantic concepts

**3. Increased Representational Power**
- More filters = more capacity to learn complex patterns
- Each filter adds a new "perspective" or "detector" for analyzing the input
- Network can combine information from multiple channels to make decisions

**4. Why More Weights?**
- **Trade-off**: More parameters vs. better representation
- Example: 64 filters of size $3 \times 3$ on an RGB image
  - Adds: $64 \times 3 \times 3 \times 3 = 1{,}728$ weights (plus 64 biases)
  - Gain: 64 different feature detectors instead of 1
  - Result: Network can detect many patterns **simultaneously**

**Analogy**: 
- **1 channel** = Looking at the world through 1 detector (e.g., only detecting vertical edges)
- **64 channels** = Looking at the world through 64 different detectors simultaneously (edges, textures, colors, patterns)
- The network **learns** what each filter should detect through training

**Key Insight**: More channels ≠ redundancy. Each filter specializes in detecting different features, allowing the network to build a rich, diverse representation of the input.

---

#### Dimension Notation with Batches

**Input dimensions**: $\textcolor{magenta}{N} \times \textcolor{blue}{C_{in}} \times \textcolor{orange}{H_{in}} \times \textcolor{orange}{W_{in}}$
- $\textcolor{magenta}{N}$ = **Batch size** (number of samples processed together)
- $\textcolor{blue}{C_{in}}$ = Number of **input channels** (depth)
- $\textcolor{orange}{H_{in}}$ = Input **height** (spatial)
- $\textcolor{orange}{W_{in}}$ = Input **width** (spatial)

**Kernel/Filter dimensions**: $\textcolor{green}{C_{out}} \times \textcolor{blue}{C_{in}} \times \textcolor{purple}{k_h} \times \textcolor{purple}{k_w}$
- $\textcolor{green}{C_{out}}$ = Number of **output channels** (how many filters)
- $\textcolor{blue}{C_{in}}$ = Number of **input channels** (must match input depth)
- $\textcolor{purple}{k_h}, \textcolor{purple}{k_w}$ = **Kernel size** (height, width)

**Output dimensions**: $\textcolor{magenta}{N} \times \textcolor{green}{C_{out}} \times \textcolor{red}{H_{out}} \times \textcolor{red}{W_{out}}$
- $\textcolor{magenta}{N}$ = **Batch size** (unchanged through convolution)
- $\textcolor{green}{C_{out}}$ = Number of **output channels** (feature maps)
- $\textcolor{red}{H_{out}}$ = Output **height** (spatial)
- $\textcolor{red}{W_{out}}$ = Output **width** (spatial)

**Visual Representation:**

<div align="center">
<img src="../images/chap8/batchConv.png" width="700" />
</div>

**Key Notes:**
- The **batch dimension** $\textcolor{magenta}{N}$ remains constant through convolution
- Each sample in the batch is processed **independently** with the **same filters**
- Batching enables parallel processing and efficient GPU utilization

---

#### How Convolution Works with Channels

**Single Filter (produces 1 output channel):**

1. A single filter has dimensions: $\textcolor{blue}{C_{in}} \times \textcolor{purple}{k_h} \times \textcolor{purple}{k_w}$
2. It convolves across **all input channels** simultaneously
3. Results from all input channels are **summed** to produce one output value
4. This produces **one output channel** (feature map)

**Formula for one output pixel**:
$$z_{h,w} = \sum_{c=1}^{\textcolor{blue}{C_{in}}} \sum_{i=0}^{\textcolor{purple}{k_h}-1} \sum_{j=0}^{\textcolor{purple}{k_w}-1} \omega_{c,i,j} \cdot x_{c, h+i, w+j}$$

**Multiple Filters (produces multiple output channels):**

- Use $\textcolor{green}{C_{out}}$ different filters
- Each filter produces 1 output channel
- Total output: $\textcolor{green}{C_{out}}$ feature maps
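
The per-pixel formula and the "one filter per output channel" rule can be combined in a short sketch. It uses plain nested lists, valid padding, and stride 1, with no bias; the function name `conv2d_multichannel` and the toy all-ones input are illustrative.

```python
def conv2d_multichannel(x, filters):
    """x: [C_in][H][W]; filters: [C_out][C_in][k_h][k_w]. Returns [C_out][H_out][W_out]."""
    c_in, h_in, w_in = len(x), len(x[0]), len(x[0][0])
    k_h, k_w = len(filters[0][0]), len(filters[0][0][0])
    h_out, w_out = h_in - k_h + 1, w_in - k_w + 1
    out = []
    for f in filters:                 # each filter produces one output channel
        fmap = [[sum(f[c][i][j] * x[c][h + i][w + j]
                     for c in range(c_in)        # sum over all input channels
                     for i in range(k_h)
                     for j in range(k_w))
                 for w in range(w_out)]
                for h in range(h_out)]
        out.append(fmap)
    return out

# Toy input: 3 channels of 5 x 5 ones; two 3 x 3 filters (all ones, all twos).
x = [[[1] * 5 for _ in range(5)] for _ in range(3)]
filters = [[[[v] * 3 for _ in range(3)] for _ in range(3)] for v in (1, 2)]
y = conv2d_multichannel(x, filters)

print(len(y), len(y[0]), len(y[0][0]))   # 2 3 3
print(y[0][0][0], y[1][0][0])            # 27 54
```

Each output value sums $3 \times 3 \times 3 = 27$ products, which is why the all-ones filter returns 27 on an all-ones input.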

---

#### Detailed Examples

| **Example** | **Input** | **Filter** | **Computation/Process** | **Output** |
|-------------|-----------|------------|-------------------------|------------|
| **Example 1: RGB → Single Feature Map** | $\textcolor{blue}{3} \times \textcolor{orange}{5} \times \textcolor{orange}{5}$ (RGB image) <br><br> $\textcolor{blue}{C_{in} = 3}$ (R, G, B) <br> $\textcolor{orange}{H_{in} = 5, W_{in} = 5}$ | $\textcolor{blue}{3} \times \textcolor{purple}{3} \times \textcolor{purple}{3}$ <br><br> $\textcolor{blue}{C_{in} = 3}$ (matches input) <br> $\textcolor{purple}{k_h = 3, k_w = 3}$ <br><br> **Parameters**: $3 \times 3 \times 3 = 27$ weights | **At position (1,1)**: <br> $$\begin{aligned} z_{1,1} &= \underbrace{\sum_{i=0}^{2}\sum_{j=0}^{2} \omega_{\text{R},i,j} \cdot x_{\text{R}, 1+i, 1+j}}_{\text{Red}} \\ &+ \underbrace{\sum_{i=0}^{2}\sum_{j=0}^{2} \omega_{\text{G},i,j} \cdot x_{\text{G}, 1+i, 1+j}}_{\text{Green}} \\ &+ \underbrace{\sum_{i=0}^{2}\sum_{j=0}^{2} \omega_{\text{B},i,j} \cdot x_{\text{B}, 1+i, 1+j}}_{\text{Blue}} \end{aligned}$$ <br> Sum across all 3 input channels → 1 value | $\textcolor{green}{1} \times \textcolor{red}{3} \times \textcolor{red}{3}$ <br> (valid padding) <br><br> $\textcolor{green}{C_{out} = 1}$ (single feature map) <br> $\textcolor{red}{H_{out} = 5 - 3 + 1 = 3}$ <br> $\textcolor{red}{W_{out} = 3}$ |
| **Example 2: RGB → Multiple Feature Maps** | $\textcolor{blue}{3} \times \textcolor{orange}{32} \times \textcolor{orange}{32}$ (RGB image) <br><br> $\textcolor{blue}{C_{in} = 3}$ <br> $\textcolor{orange}{H_{in} = 32, W_{in} = 32}$ | $\textcolor{green}{64} \times \textcolor{blue}{3} \times \textcolor{purple}{5} \times \textcolor{purple}{5}$ <br><br> $\textcolor{green}{C_{out} = 64}$ different filters <br> Each filter: $\textcolor{blue}{3} \times \textcolor{purple}{5} \times \textcolor{purple}{5}$ <br><br> **Parameters**: $64 \times 3 \times 5 \times 5 = 4{,}800$ weights | **Process**: <br> 1. Filter 1 convolves with all 3 input channels → output channel 1 <br> 2. Filter 2 convolves with all 3 input channels → output channel 2 <br> 3. ... <br> 4. Filter 64 convolves with all 3 input channels → output channel 64 <br><br> Each filter produces **one** feature map | $\textcolor{green}{64} \times \textcolor{red}{32} \times \textcolor{red}{32}$ <br> (same padding) <br><br> $\textcolor{green}{C_{out} = 64}$ feature maps <br> $\textcolor{red}{H_{out} = 32, W_{out} = 32}$ <br> (spatial size preserved) |
| **Example 3: Deep Layer (Multi-channel → Multi-channel)** | $\textcolor{blue}{128} \times \textcolor{orange}{16} \times \textcolor{orange}{16}$ <br> (from previous layer) <br><br> $\textcolor{blue}{C_{in} = 128}$ <br> $\textcolor{orange}{H_{in} = 16, W_{in} = 16}$ | $\textcolor{green}{256} \times \textcolor{blue}{128} \times \textcolor{purple}{3} \times \textcolor{purple}{3}$ <br><br> $\textcolor{green}{C_{out} = 256}$ different filters <br> Each filter operates on **all** $\textcolor{blue}{128}$ input channels <br> Each filter: $1{,}152$ weights <br><br> **Parameters**: $256 \times 128 \times 3 \times 3 = 294{,}912$ weights | **Process**: <br> Each of the 256 filters: <br> 1. Convolves across all 128 input channels <br> 2. Sums contributions from all channels <br> 3. Produces one output feature map <br><br> Total: 256 different filters → 256 output channels <br><br> Each output pixel depends on $128 \times 3 \times 3 = 1{,}152$ input values | $\textcolor{green}{256} \times \textcolor{red}{16} \times \textcolor{red}{16}$ <br> (same padding) <br><br> $\textcolor{green}{C_{out} = 256}$ feature maps <br> $\textcolor{red}{H_{out} = 16, W_{out} = 16}$ <br> (spatial size preserved) |

---

#### Parameter Count Formula

For a convolutional layer:

$$\boxed{\text{Parameters} = \textcolor{green}{C_{out}} \times \textcolor{blue}{C_{in}} \times \textcolor{purple}{k_h} \times \textcolor{purple}{k_w} + \textcolor{green}{C_{out}}}$$

The $+ \textcolor{green}{C_{out}}$ term accounts for **bias terms** (one per output channel).

Without bias:
$$\boxed{\text{Parameters} = \textcolor{green}{C_{out}} \times \textcolor{blue}{C_{in}} \times \textcolor{purple}{k_h} \times \textcolor{purple}{k_w}}$$
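
As a quick check of both formulas against the numbers used earlier in these notes (the helper name `conv_params` is illustrative):

```python
def conv_params(c_in, c_out, k_h, k_w, bias=True):
    """Learnable parameters in a convolutional layer."""
    weights = c_out * c_in * k_h * k_w
    return weights + (c_out if bias else 0)      # one bias per output channel

print(conv_params(3, 64, 3, 3, bias=False))      # 1728
print(conv_params(3, 64, 5, 5, bias=False))      # 4800
print(conv_params(128, 256, 3, 3, bias=False))   # 294912
print(conv_params(128, 256, 3, 3))               # 295168
```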

---

#### Output Spatial Dimensions

Using the general formula from before:

$$\boxed{H_{out} = \left\lfloor \frac{{H_{in}} + 2p - d({k_h}-1) - 1}{s} \right\rfloor + 1}$$

$$\boxed{{W_{out}} = \left\lfloor \frac{{W_{in}} + 2p - d({k_w}-1) - 1}{s} \right\rfloor + 1}$$

Where $p$ = padding, $s$ = stride, $d$ = dilation.

**Key Insight**: 
- Channels ($\textcolor{blue}{C_{in}} \to \textcolor{green}{C_{out}}$) are controlled by the **number of filters**
- Spatial dimensions ($\textcolor{orange}{H_{in}, W_{in}} \to \textcolor{red}{H_{out}, W_{out}}$) are controlled by **kernel size, stride, padding, and dilation**
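
Putting the channel rule and the spatial formulas together gives the full output shape. A sketch that checks the three detailed examples above (the function name `conv2d_output_shape` is illustrative, the batch size 8 in the last call is an arbitrary choice, and the same `p`, `s`, `d` are assumed on both axes):

```python
def conv2d_output_shape(in_shape, c_out, kernel, p=0, s=1, d=1):
    """in_shape: (N, C_in, H_in, W_in); kernel: (k_h, k_w). Returns (N, C_out, H_out, W_out)."""
    n, _, h_in, w_in = in_shape
    k_h, k_w = kernel
    h_out = (h_in + 2 * p - d * (k_h - 1) - 1) // s + 1
    w_out = (w_in + 2 * p - d * (k_w - 1) - 1) // s + 1
    return (n, c_out, h_out, w_out)              # N is unchanged, channels become C_out

print(conv2d_output_shape((1, 3, 5, 5), 1, (3, 3)))              # (1, 1, 3, 3)
print(conv2d_output_shape((1, 3, 32, 32), 64, (5, 5), p=2))      # (1, 64, 32, 32)
print(conv2d_output_shape((8, 128, 16, 16), 256, (3, 3), p=1))   # (8, 256, 16, 16)
```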