# Batch Normalization



#### What is Batch Normalization?

**Batch Normalization (BatchNorm)** is a technique that normalizes the inputs to each layer during training, making the network easier to train and more stable.

**Key Idea:** Normalize activations to have **mean ≈ 0** and **standard deviation ≈ 1**, then apply learnable scaling and shifting.

---

#### Why Do We Need Batch Normalization?

**Problems it solves:**

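1. **Internal Covariate Shift**: As the network trains, the distribution of each layer's inputs changes, forcing later layers to constantly adapt
   - BatchNorm stabilizes these distributions (this is the motivation given in the original paper; how much of the benefit it actually explains is still debated)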

2. **Vanishing/Exploding Gradients**: Helps maintain reasonable gradient magnitudes throughout the network
   - Allows higher learning rates

3. **Sensitivity to Initialization**: Reduces dependence on careful weight initialization

4. **Regularization Effect**: Acts as a form of regularization, reducing overfitting
   - Can reduce/eliminate need for Dropout

**Benefits:**
- ✅ Faster training (can use higher learning rates)
- ✅ Less sensitive to initialization
- ✅ Acts as regularization
- ✅ Reduces internal covariate shift
- ✅ Can eliminate need for Dropout in some cases

---

#### General Batch Normalization Algorithm

**For a batch of activations**, BatchNorm performs 4 steps:

| **Step** | **Operation** | **Formula** | **Purpose** |
|----------|---------------|-------------|-------------|
| **1. Compute Batch Statistics** | Calculate mean and variance across the batch | $$\mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} x_i$$ $$\sigma^2_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2$$ | Get statistics for normalization <br> $m$ = batch size |
| **2. Normalize** | Subtract mean, divide by std deviation | $$\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma^2_{\mathcal{B}} + \epsilon}}$$ | Force distribution to mean=0, std=1 <br> $\epsilon$ prevents division by zero (typically $10^{-5}$) |
| **3. Scale** | Multiply by learnable parameter | $$y_i = \gamma \cdot \hat{x}_i$$ | Allow network to **learn optimal scale** <br> $\gamma$ is learned during training |
| **4. Shift** | Add learnable parameter | $$y_i = \gamma \cdot \hat{x}_i + \beta$$ | Allow network to **learn optimal mean** <br> $\beta$ is learned during training |

**Complete Formula:**

$$\boxed{y_i = \gamma \cdot \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma^2_{\mathcal{B}} + \epsilon}} + \beta}$$

**Key Insight:** $\gamma$ and $\beta$ allow the network to undo the normalization if needed!
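- If $\gamma = \sqrt{\sigma^2_{\mathcal{B}} + \epsilon}$ and $\beta = \mu_{\mathcal{B}}$, we recover the original values exactly; with $\gamma = \sqrt{\sigma^2_{\mathcal{B}}}$ the recovery is approximate because of $\epsilon$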
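
The four steps above can be sketched in plain Python for a single feature. This is a minimal illustration rather than how a framework implements it: the function name `batch_norm` and its default arguments are assumptions made for this example.

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Apply the four BatchNorm steps to one feature across a batch (training mode)."""
    m = len(xs)
    # Step 1: batch statistics (biased variance, dividing by m)
    mu = sum(xs) / m
    var = sum((x - mu) ** 2 for x in xs) / m
    # Step 2: normalize
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]
    # Steps 3 and 4: scale by gamma, then shift by beta
    return [gamma * xh + beta for xh in x_hat]

batch = [2.0, 4.0, 6.0, 8.0]
print(batch_norm(batch))                        # roughly zero mean, unit std
print(batch_norm(batch, gamma=2.0, beta=3.0))   # roughly mean 3, std 2
```

After the transform, the output mean equals $\beta$ and the output standard deviation is very close to $|\gamma|$, whatever the scale of the input batch was.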

---


### Understanding the Learnable Parameters (γ and β)

**Why Do We Need Learnable Parameters?**

After normalization, each feature's activations have mean ≈ 0 and std ≈ 1 across the batch. <br>
But this might not be optimal for learning! The learnable parameters $\gamma$ (gamma) and $\beta$ (beta) give the network **flexibility** to:

1. **Undo the normalization if needed**
2. **Learn the optimal distribution** for each feature/channel
3. **Preserve representational power** of the network

---

#### The Mathematics Behind γ and β

**After normalization**, we have:
$$\hat{x} = \frac{x - \mu_{\mathcal{B}}}{\sqrt{\sigma^2_{\mathcal{B}} + \epsilon}}$$

where $\hat{x}$ has mean 0 and std ≈ 1 (marginally below 1 because of $\epsilon$).

**Then we apply scale and shift**:
$$\boxed{y = \gamma \cdot \hat{x} + \beta}$$

**Key insight:** If the network learns:
- $\gamma = \sqrt{\sigma^2_{\mathcal{B}}}$ (the original std)
- $\beta = \mu_{\mathcal{B}}$ (the original mean)

Then: $y = \sqrt{\sigma^2_{\mathcal{B}}} \cdot \frac{x - \mu_{\mathcal{B}}}{\sqrt{\sigma^2_{\mathcal{B}} + \epsilon}} + \mu_{\mathcal{B}} \approx x$ (recovers original input!)

**This means:** The network can learn to **disable** batch normalization if it's not helpful!
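
A quick numerical check of this claim, as a sketch: the batch values below are arbitrary and the variable names are chosen only for this example. It compares the choice $\gamma = \sqrt{\sigma^2_{\mathcal{B}}}$ used above with the variant that also includes $\epsilon$.

```python
import math

eps = 1e-5
xs = [3.0, 7.0, 8.0, 12.0]   # arbitrary example batch for one feature

mu = sum(xs) / len(xs)
var = sum((x - mu) ** 2 for x in xs) / len(xs)
x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]

# gamma = sqrt(var), beta = mu: recovers x only approximately
gamma_approx = math.sqrt(var)
approx = [gamma_approx * xh + mu for xh in x_hat]

# gamma = sqrt(var + eps), beta = mu: recovers x up to floating-point error
gamma_exact = math.sqrt(var + eps)
exact = [gamma_exact * xh + mu for xh in x_hat]

print(approx)
print(exact)
```

The gap between the two is on the order of $\epsilon$, so in practice the distinction rarely matters. Note also that the batch statistics change from batch to batch while $\gamma$ and $\beta$ are fixed numbers, so a learned "undo" can only match the typical statistics, not each individual batch.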

---

#### What Do γ and β Learn?

| **Scenario** | **Learned Values** | **Effect** | **Interpretation** |
|--------------|-------------------|------------|-------------------|
| **Standard Normalization** | $\gamma = 1$ <br> $\beta = 0$ | $y = \hat{x}$ | Keep normalized distribution: mean=0, std=1 <br> This is the initialization |
| **Undo Normalization** | $\gamma = \sqrt{\sigma^2_{\mathcal{B}}}$ <br> $\beta = \mu_{\mathcal{B}}$ | $y \approx x$ | Recover original distribution <br> Network decides normalization isn't helpful |
| **Increase Variance** | $\gamma = 2$ <br> $\beta = 0$ | $y = 2\hat{x}$ | Distribution: mean=0, std=2 <br> Wider spread of values |
| **Shift Mean** | $\gamma = 1$ <br> $\beta = 3$ | $y = \hat{x} + 3$ | Distribution: mean=3, std=1 <br> Shift activation threshold |
| **Custom Distribution** | $\gamma = 0.5$ <br> $\beta = -2$ | $y = 0.5\hat{x} - 2$ | Distribution: mean=-2, std=0.5 <br> Network learns optimal values |

---

#### Detailed Example: How γ and β Work

**Setup:**
- Normalized values: $\hat{x} = [-1.5, -0.5, 0, 0.5, 1.5]$ (mean=0, std≈1)
- We'll see different $(\gamma, \beta)$ effects

| **Parameters** | **Computation** | **Output** | **Distribution** |
|----------------|-----------------|------------|------------------|
| $\gamma=1, \beta=0$ <br> (Standard) | $y = 1 \cdot \hat{x} + 0$ | $[-1.5, -0.5, 0, 0.5, 1.5]$ | Mean = 0 <br> Std ≈ 1 <br> (unchanged) |
| $\gamma=2, \beta=0$ <br> (Scale up) | $y = 2 \cdot \hat{x} + 0$ | $[-3, -1, 0, 1, 3]$ | Mean = 0 <br> Std ≈ 2 <br> (doubled std, 4× variance) |
| $\gamma=1, \beta=5$ <br> (Shift up) | $y = 1 \cdot \hat{x} + 5$ | $[3.5, 4.5, 5, 5.5, 6.5]$ | Mean = 5 <br> Std ≈ 1 <br> (shifted right) |
| $\gamma=0.5, \beta=-1$ <br> (Scale & shift) | $y = 0.5 \cdot \hat{x} - 1$ | $[-1.75, -1.25, -1, -0.75, -0.25]$ | Mean = -1 <br> Std ≈ 0.5 <br> (compressed & shifted) |
| $\gamma=3, \beta=10$ <br> (Large scale & shift) | $y = 3 \cdot \hat{x} + 10$ | $[5.5, 8.5, 10, 11.5, 14.5]$ | Mean = 10 <br> Std ≈ 3 <br> (very wide, high mean) |
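
The rows of this table can be reproduced with a few lines of standard-library Python. This is only a sketch for checking the arithmetic; the helper name `scale_shift` is made up for the example.

```python
from statistics import fmean, pstdev

def scale_shift(x_hat, gamma, beta):
    """Apply y = gamma * x_hat + beta element-wise."""
    return [gamma * v + beta for v in x_hat]

x_hat = [-1.5, -0.5, 0.0, 0.5, 1.5]   # normalized values from the setup

for gamma, beta in [(1, 0), (2, 0), (1, 5), (0.5, -1), (3, 10)]:
    y = scale_shift(x_hat, gamma, beta)
    # pstdev is the population standard deviation (divides by the number of values)
    print(f"gamma={gamma}, beta={beta}: y={y}, mean={fmean(y):.2f}, std={pstdev(y):.2f}")
```

Because $\hat{x}$ has mean 0 and std 1, the output mean is always $\beta$ and the output std is always $|\gamma|$; the variance therefore scales with $\gamma^2$.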

---

#### Why This Matters: Activation Functions

Different activation functions work best with different input distributions:

| **Activation** | **Optimal Input Range** | **How γ, β Help** |
|----------------|------------------------|-------------------|
| **ReLU** | Positive values work best <br> (negatives → 0) | Learn $\beta > 0$ to shift distribution positive <br> More neurons stay active |
| **Sigmoid** | Works best around [-2, 2] <br> (saturates outside) | Learn $\gamma$ to compress values <br> Learn $\beta$ to center around 0 |
| **Tanh** | Works best around [-1, 1] <br> (saturates outside) | Learn $\gamma < 1$ to compress <br> Keep $\beta \approx 0$ |
| **Leaky ReLU** | Works for any range | Less sensitive, but can still optimize distribution |

**Example with ReLU:**

```
Scenario 1: Without learnable parameters
  Normalized: [-1.5, -0.5, 0.5, 1.5]  (mean=0)
  After ReLU: [0, 0, 0.5, 1.5]        (50% of activations zeroed)

Scenario 2: With learned β=2
  Shifted: [-1.5, -0.5, 0.5, 1.5] + 2 = [0.5, 1.5, 2.5, 3.5]
  After ReLU: [0.5, 1.5, 2.5, 3.5]   (all activations pass through)
```
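
The same two scenarios as runnable Python, as a small sketch; `relu` and `fraction_zeroed` are helper names introduced here for illustration.

```python
def relu(values):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, v) for v in values]

def fraction_zeroed(values):
    """Share of activations that ReLU maps to zero."""
    out = relu(values)
    return sum(1 for v in out if v == 0.0) / len(out)

x_hat = [-1.5, -0.5, 0.5, 1.5]
shifted = [v + 2.0 for v in x_hat]   # gamma = 1, beta = 2

print(relu(x_hat), fraction_zeroed(x_hat))
print(relu(shifted), fraction_zeroed(shifted))
```

There is a trade-off: when every input is positive, ReLU acts as the identity on that feature and contributes no nonlinearity, so a very large $\beta$ is not automatically better.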

---

#### How Are γ and β Learned?

**During backpropagation**, gradients flow through:

$$\frac{\partial \mathcal{L}}{\partial \gamma} = \sum_{i} \frac{\partial \mathcal{L}}{\partial y_i} \cdot \hat{x}_i$$

$$\frac{\partial \mathcal{L}}{\partial \beta} = \sum_{i} \frac{\partial \mathcal{L}}{\partial y_i}$$

**Intuition:**
- If increasing $\gamma$ reduces loss → $\gamma$ increases (scale up)
- If shifting $\beta$ upward reduces loss → $\beta$ increases (shift up)
- Both are optimized with the same optimizer as the weights (SGD, Adam, etc.)
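
The two gradient formulas can be checked numerically. The sketch below assumes a toy squared-error loss and made-up targets, treats $\hat{x}$ as fixed, and compares the analytic sums with central finite differences.

```python
def loss(gamma, beta, x_hat, targets):
    """Toy squared-error loss on y = gamma * x_hat + beta (illustration only)."""
    return 0.5 * sum((gamma * xh + beta - t) ** 2 for xh, t in zip(x_hat, targets))

def analytic_grads(gamma, beta, x_hat, targets):
    # For the toy loss, dL/dy_i = y_i - t_i
    dy = [gamma * xh + beta - t for xh, t in zip(x_hat, targets)]
    d_gamma = sum(g * xh for g, xh in zip(dy, x_hat))   # sum_i dL/dy_i * x_hat_i
    d_beta = sum(dy)                                     # sum_i dL/dy_i
    return d_gamma, d_beta

x_hat = [-1.5, -0.5, 0.0, 0.5, 1.5]
targets = [-2.0, 0.0, 1.0, 2.0, 4.0]   # hypothetical targets
gamma, beta, h = 1.0, 0.0, 1e-6

d_gamma, d_beta = analytic_grads(gamma, beta, x_hat, targets)
# Central finite differences as an independent check
num_gamma = (loss(gamma + h, beta, x_hat, targets) - loss(gamma - h, beta, x_hat, targets)) / (2 * h)
num_beta = (loss(gamma, beta + h, x_hat, targets) - loss(gamma, beta - h, x_hat, targets)) / (2 * h)
print(d_gamma, num_gamma)
print(d_beta, num_beta)
```

Both gradients come out negative here, so a gradient-descent step would increase $\gamma$ and $\beta$. The gradient with respect to the input $x$ is more involved because $\mu_{\mathcal{B}}$ and $\sigma^2_{\mathcal{B}}$ also depend on $x$; that part is not shown.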

---

#### Number of Learnable Parameters

| **Layer Type** | **Input Shape** | **Normalization Per** | **Parameters** | **Example** |
|----------------|-----------------|----------------------|----------------|-------------|
| **BatchNorm1d** | $(N, D)$ | Feature | $2D$ | $D=512$ features <br> → 512 $\gamma$ + 512 $\beta$ <br> = **1,024 params** |
| **BatchNorm2d** | $(N, C, H, W)$ | Channel | $2C$ | $C=64$ channels <br> → 64 $\gamma$ + 64 $\beta$ <br> = **128 params** |

**Note:** This is **tiny** compared to convolutional or linear layer parameters!

**Example comparison:**
```
Conv2d(3, 64, kernel_size=3):
  Weights: 64 × 3 × 3 × 3 = 1,728
  Biases:  64
  Total:   1,792

BatchNorm2d(64):
  Parameters: 64 × 2 = 128

Ratio: 1,792 / 128 = 14×
BatchNorm adds minimal parameters!
```
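
For other layer sizes, the counts can be computed with two small helpers. The function names are hypothetical and the convolution is assumed to have a square kernel.

```python
def conv2d_params(in_channels, out_channels, kernel_size, bias=True):
    """Learnable parameters of a 2D convolution with a square kernel."""
    weights = out_channels * in_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)

def batchnorm_params(num_features):
    """One gamma and one beta per feature or channel."""
    return 2 * num_features

print(conv2d_params(3, 64, 3, bias=False))   # weights only
print(conv2d_params(3, 64, 3))               # weights plus biases
print(batchnorm_params(64))
```

The running mean and variance that BatchNorm keeps for inference are stored alongside these parameters but are not learned by gradient descent, so they are not counted here.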

---

#### Initialization of γ and β

**Default initialization** (PyTorch, TensorFlow):
- $\gamma = 1$ (initialized as ones)
- $\beta = 0$ (initialized as zeros)

**Why?**
- Starts as **identity transformation**: $y = 1 \cdot \hat{x} + 0 = \hat{x}$
- Keeps normalized distribution initially
- Lets the network **learn** to adjust if needed



#### Batch Normalization for Different Data Types

The **dimension along which we compute statistics** varies by data type:

| **Data Type** | **Input Shape** | **Normalization Dimension** | **Learnable Parameters** | **Use Case** |
|---------------|-----------------|----------------------------|--------------------------|--------------|
| **Fully Connected (1D)** | $(N, D)$ <br> $N$ = batch size <br> $D$ = features | Normalize across **batch dimension** $N$ <br> Each feature has its own $\mu, \sigma$ | $\gamma, \beta \in \mathbb{R}^D$ <br> One pair per feature | Dense/FC layers |
| **Convolutional (2D)** | $(N, C, H, W)$ <br> $N$ = batch <br> $C$ = channels <br> $H, W$ = spatial | Normalize across **batch $N$ and spatial dimensions $H, W$** <br> Each channel has its own $\mu, \sigma$ | $\gamma, \beta \in \mathbb{R}^C$ <br> One pair per channel | CNNs, image data |
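| **Recurrent (1D sequence)** | $(N, T, D)$ <br> $N$ = batch <br> $T$ = time steps <br> $D$ = features | Normalize across **batch dimension** $N$ <br> Separate $\mu, \sigma$ for each feature at each time step (PyTorch's `nn.BatchNorm1d` instead expects $(N, C, L)$ input and pools statistics over $N$ and $L$) | $\gamma, \beta \in \mathbb{R}^D$ <br> One pair per feature | RNNs, LSTMs |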

---

#### 1D Batch Normalization (Fully Connected Layers)

**Input Shape:** $(N, D)$ where $N$ = batch size, $D$ = number of features

**Normalization:** Across the **batch dimension** for each feature independently

| **Example** | **Setup** | **Computation** | **Result** |
|-------------|-----------|-----------------|------------|
| **Simple Example** | **Input**: Batch of 3 samples, 2 features <br> $$X = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}$$ <br> Shape: $(3, 2)$ | **Feature 1** (column 1): <br> $\mu_1 = \frac{1+2+3}{3} = 2$ <br> $\sigma^2_1 = \frac{(1-2)^2+(2-2)^2+(3-2)^2}{3} = \frac{2}{3}$ <br> $\sigma_1 = \sqrt{\frac{2}{3}} \approx 0.816$ <br><br> **Feature 2** (column 2): <br> $\mu_2 = \frac{4+5+6}{3} = 5$ <br> $\sigma^2_2 = \frac{(4-5)^2+(5-5)^2+(6-5)^2}{3} = \frac{2}{3}$ <br> $\sigma_2 = \sqrt{\frac{2}{3}} \approx 0.816$ | **Normalized** (before $\gamma, \beta$): <br> $$\hat{X} = \begin{bmatrix} \frac{1-2}{0.816} & \frac{4-5}{0.816} \\ \frac{2-2}{0.816} & \frac{5-5}{0.816} \\ \frac{3-2}{0.816} & \frac{6-5}{0.816} \end{bmatrix} = \begin{bmatrix} -1.22 & -1.22 \\ 0 & 0 \\ 1.22 & 1.22 \end{bmatrix}$$ <br><br> Each column has mean=0, std=1 |
| **With Learnable Parameters** | Suppose: <br> $\gamma = [2, 0.5]$ <br> $\beta = [1, -1]$ | Apply: $y = \gamma \cdot \hat{x} + \beta$ (using $\hat{x} \approx \pm 1.225$ so that rounding does not accumulate) <br><br> **Feature 1**: <br> $y_1 = 2 \times (-1.225) + 1 = -1.45$ <br> $y_2 = 2 \times 0 + 1 = 1$ <br> $y_3 = 2 \times 1.225 + 1 = 3.45$ <br><br> **Feature 2**: <br> $y_1 = 0.5 \times (-1.225) - 1 \approx -1.61$ <br> $y_2 = 0.5 \times 0 - 1 = -1$ <br> $y_3 = 0.5 \times 1.225 - 1 \approx -0.39$ | **Final Output**: <br> $$Y = \begin{bmatrix} -1.45 & -1.61 \\ 1 & -1 \\ 3.45 & -0.39 \end{bmatrix}$$ <br><br> Each feature is scaled and shifted independently |

**PyTorch Implementation:**

```python
import torch
import torch.nn as nn

# Input: batch_size=3, features=2
x = torch.tensor([[1., 4.], [2., 5.], [3., 6.]])

# BatchNorm1d: normalize across batch dimension
bn = nn.BatchNorm1d(num_features=2)  # 2 features → 2 pairs of (γ, β)

# Forward pass
output = bn(x)
print(output)

# Parameters
print(f"Gamma (scale): {bn.weight.data}")  # Shape: (2,)
print(f"Beta (shift): {bn.bias.data}")     # Shape: (2,)
```

**Key Points:**
- **Parameters**: $2D$ learnable parameters ($D$ gammas + $D$ betas)
- **Statistics**: Computed per feature across the batch
- **Typical placement**: After linear layer, before activation

---

#### 2D Batch Normalization (Convolutional Layers)

**Input Shape:** $(N, C, H, W)$ where:
- $N$ = batch size
- $C$ = number of channels
- $H, W$ = spatial dimensions (height, width)

**Normalization:** Across **batch dimension $N$** and **spatial dimensions $H, W$** for each channel independently

**Key Difference from 1D:** Each channel gets one mean/variance computed across ALL spatial locations and ALL samples in the batch.

| **Example** | **Setup** | **Computation** | **Result** |
|-------------|-----------|-----------------|------------|
| **Simple CNN Example** | **Input**: 2 samples, 2 channels, $2\times2$ spatial <br> $$\text{Sample 1, Channel 1} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$$ $$\text{Sample 1, Channel 2} = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix}$$ $$\text{Sample 2, Channel 1} = \begin{bmatrix} 2 & 3 \\ 4 & 5 \end{bmatrix}$$ $$\text{Sample 2, Channel 2} = \begin{bmatrix} 6 & 7 \\ 8 & 9 \end{bmatrix}$$ <br> Shape: $(2, 2, 2, 2)$ | **Channel 1** (across both samples, all spatial positions): <br> Values: $[1, 2, 3, 4, 2, 3, 4, 5]$ (8 values total) <br> $\mu_1 = \frac{1+2+3+4+2+3+4+5}{8} = 3$ <br> $\sigma^2_1 = \frac{\sum (x_i - 3)^2}{8} = 1.5$ <br> $\sigma_1 = \sqrt{1.5} \approx 1.225$ <br><br> **Channel 2** (across both samples, all spatial positions): <br> Values: $[5, 6, 7, 8, 6, 7, 8, 9]$ (8 values total) <br> $\mu_2 = \frac{5+6+7+8+6+7+8+9}{8} = 7$ <br> $\sigma^2_2 = \frac{\sum (x_i - 7)^2}{8} = 1.5$ <br> $\sigma_2 = \sqrt{1.5} \approx 1.225$ | **Normalized** (before $\gamma, \beta$): <br><br> **Sample 1, Channel 1**: <br> $$\begin{bmatrix} \frac{1-3}{1.225} & \frac{2-3}{1.225} \\ \frac{3-3}{1.225} & \frac{4-3}{1.225} \end{bmatrix} \approx \begin{bmatrix} -1.63 & -0.82 \\ 0 & 0.82 \end{bmatrix}$$ <br><br> All values in channel 1 (across all samples) now have mean=0, std=1 <br><br> Same process for channel 2 |
| **Visualization** | **Input dimensions**: <br> - Batch: $N = 4$ <br> - Channels: $C = 64$ <br> - Spatial: $H = 32, W = 32$ <br><br> Total activations: $4 \times 64 \times 32 \times 32$ | **For Channel 1**: <br> - Collect all $4 \times 32 \times 32 = 4{,}096$ values <br> - Compute $\mu_1, \sigma_1$ from these 4,096 values <br> - Normalize all 4,096 values using this $\mu_1, \sigma_1$ <br><br> **For Channel 2**: <br> - Collect all $4 \times 32 \times 32 = 4{,}096$ values <br> - Compute $\mu_2, \sigma_2$ <br> - Normalize <br><br> Repeat for all 64 channels | **Output dimensions**: Same as input <br> $(4, 64, 32, 32)$ <br><br> **Parameters**: <br> - $\gamma \in \mathbb{R}^{64}$ (one per channel) <br> - $\beta \in \mathbb{R}^{64}$ (one per channel) <br> - Total: $128$ learnable parameters |

**PyTorch Implementation:**

```python
import torch
import torch.nn as nn

# Input: batch_size=2, channels=3, height=4, width=4
x = torch.randn(2, 3, 4, 4)

# BatchNorm2d: normalize across batch and spatial dimensions
bn = nn.BatchNorm2d(num_features=3)  # 3 channels → 3 pairs of (γ, β)

# Forward pass
output = bn(x)

# Parameters
print(f"Gamma (scale): {bn.weight.data}")  # Shape: (3,) - one per channel
print(f"Beta (shift): {bn.bias.data}")     # Shape: (3,) - one per channel

# Each channel is normalized independently
# Channel 0: normalized across 2*4*4 = 32 values
# Channel 1: normalized across 2*4*4 = 32 values
# Channel 2: normalized across 2*4*4 = 32 values
```

**Key Points:**
- **Parameters**: $2C$ learnable parameters ($C$ gammas + $C$ betas)
- **Statistics**: Each channel has one $\mu$ and one $\sigma^2$ computed across:
  - All samples in batch ($N$ dimension)
  - All spatial locations ($H \times W$ dimensions)
- **Typical placement**: After convolution, before activation
- **Spatial sharing**: Same $\gamma$ and $\beta$ applied to all spatial locations within a channel
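
The per-channel pooling over $N$, $H$ and $W$ can be made explicit with nested lists. The sketch below uses the 2-sample, 2-channel, $2 \times 2$ input from the worked example; `channel_stats` and `batch_norm_2d` are hypothetical helper names, and $\gamma = 1$, $\beta = 0$ is assumed.

```python
import math

def channel_stats(x, c):
    """Mean and biased variance of channel c over batch and spatial positions."""
    values = [v for sample in x for row in sample[c] for v in row]
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, var, len(values)

def batch_norm_2d(x, eps=1e-5):
    """Normalize a nested-list tensor of shape (N, C, H, W) per channel."""
    num_channels = len(x[0])
    stats = [channel_stats(x, c) for c in range(num_channels)]
    return [
        [
            [[(v - stats[c][0]) / math.sqrt(stats[c][1] + eps) for v in row]
             for row in sample[c]]
            for c in range(num_channels)
        ]
        for sample in x
    ]

x = [
    [[[1, 2], [3, 4]], [[5, 6], [7, 8]]],   # sample 1: channel 1, channel 2
    [[[2, 3], [4, 5]], [[6, 7], [8, 9]]],   # sample 2: channel 1, channel 2
]
print(channel_stats(x, 0))      # (mean, variance, number of values pooled)
print(batch_norm_2d(x)[0][0])   # normalized sample 1, channel 1
```

Each channel pools $N \times H \times W = 8$ values here, which is why convolutional BatchNorm tends to get usable statistics even from fairly small batches.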

---

#### Comparison of Normalization Dimensions

| **Aspect** | **BatchNorm1d (FC layers)** | **BatchNorm2d (Conv layers)** |
|------------|----------------------------|------------------------------|
| **Input Shape** | $(N, D)$ <br> $N$ = batch, $D$ = features | $(N, C, H, W)$ <br> $N$ = batch, $C$ = channels, $H, W$ = spatial |
| **Normalize Over** | Batch dimension $N$ | Batch dimension $N$ + Spatial dimensions $H, W$ |
| **Statistics per** | Feature (column) | Channel |
| **Number of means/variances** | $D$ (one per feature) | $C$ (one per channel) |
| **Learnable Parameters** | $\gamma, \beta \in \mathbb{R}^D$ <br> Total: $2D$ | $\gamma, \beta \in \mathbb{R}^C$ <br> Total: $2C$ |
| **Values per statistic** | $N$ values <br> (all samples for one feature) | $N \times H \times W$ values <br> (all samples, all spatial locations for one channel) |
| **Example** | Batch of 32, 512 features: <br> - 512 means <br> - 512 variances <br> - 1,024 learnable params | Batch of 32, 64 channels, $28 \times 28$: <br> - 64 means <br> - 64 variances <br> - 128 learnable params <br> - Each statistic computed from $32 \times 28 \times 28 = 25{,}088$ values |

---

#### Training vs. Inference

**During Training:**
- Compute $\mu, \sigma^2$ from **current batch**
- Update **running estimates** of population statistics using exponential moving average:
  $$\mu_{\text{running}} \leftarrow (1 - \text{momentum}) \cdot \mu_{\text{running}} + \text{momentum} \cdot \mu_{\mathcal{B}}$$
  $$\sigma^2_{\text{running}} \leftarrow (1 - \text{momentum}) \cdot \sigma^2_{\text{running}} + \text{momentum} \cdot \sigma^2_{\mathcal{B}}$$
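  - Default momentum: 0.1 in PyTorch (Keras uses the opposite convention, weighting the running value by its `momentum`, with a default of 0.99)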

**During Inference:**
- Use **fixed running statistics** (not batch statistics)
- Reason: Batch size might be 1, or statistics might be unstable
- Formula remains the same:
  $$y = \gamma \cdot \frac{x - \mu_{\text{running}}}{\sqrt{\sigma^2_{\text{running}} + \epsilon}} + \beta$$
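
The switch between the two modes can be sketched with a toy single-feature class. `RunningBatchNorm` is a made-up name, and the starting values (running mean 0, running variance 1) are assumptions of this sketch. It follows the update rule above with the biased batch variance; as far as I know, PyTorch stores an unbiased variance estimate in its running buffer, so its numbers will differ slightly.

```python
import math

class RunningBatchNorm:
    """Toy single-feature BatchNorm that tracks running statistics."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum, self.eps = momentum, eps
        self.gamma, self.beta = 1.0, 0.0
        self.running_mean, self.running_var = 0.0, 1.0
        self.training = True

    def __call__(self, xs):
        if self.training:
            mu = sum(xs) / len(xs)
            var = sum((x - mu) ** 2 for x in xs) / len(xs)
            # Exponential moving average of the batch statistics
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mu, var = self.running_mean, self.running_var   # fixed at inference
        return [self.gamma * (x - mu) / math.sqrt(var + self.eps) + self.beta for x in xs]

bn = RunningBatchNorm()
for _ in range(100):
    bn([8.0, 10.0, 12.0])    # training batches with mean 10 and variance 8/3
bn.training = False          # switch to inference
print(bn.running_mean, bn.running_var)
print(bn([10.0]))            # a single sample is fine at inference
```

After enough training steps the running estimates converge towards the batch statistics, and inference no longer depends on which other samples happen to be in the batch.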

---

#### Where to Place Batch Normalization?

**Original Paper (2015):** Before activation
```
Conv/Linear → BatchNorm → ReLU
```

**Alternative:** After activation (some practitioners report that it works as well or better; results vary by architecture)
```
Conv/Linear → ReLU → BatchNorm
```

**Complete CNN Block with BatchNorm:**

```python
import torch.nn as nn

# Option 1: Before activation (original)
block1 = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU()
)

# Option 2: After activation (alternative)
block2 = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.BatchNorm2d(64)
)

# Option 3: With pooling
block3 = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2)
)
```

---

#### Practical Considerations

| **Aspect** | **Recommendation** | **Reason** |
|------------|-------------------|------------|
| **Batch Size** | ≥ 16, ideally 32+ | Small batches → unstable statistics <br> Use GroupNorm or LayerNorm for small batches |
| **Learning Rate** | Can use higher rates (e.g., 0.01 instead of 0.001) | BatchNorm stabilizes training |
| **Dropout** | Often unnecessary with BatchNorm | BatchNorm already provides regularization |
| **Initialization** | Less critical | BatchNorm reduces sensitivity to initialization |
| **Momentum** | Default 0.1 works well | Controls running statistics update rate |
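
The batch-size recommendation can be illustrated with a small simulation. The sketch assumes activations drawn from a standard normal distribution and measures how much the batch mean fluctuates from batch to batch; the function name and its defaults are made up for this example.

```python
import random
from statistics import fmean, pstdev

def spread_of_batch_means(batch_size, num_batches=2000, seed=0):
    """Standard deviation of the batch mean for activations drawn from N(0, 1)."""
    rng = random.Random(seed)
    means = [
        fmean([rng.gauss(0.0, 1.0) for _ in range(batch_size)])
        for _ in range(num_batches)
    ]
    return pstdev(means)

for batch_size in (2, 8, 32, 128):
    print(batch_size, round(spread_of_batch_means(batch_size), 3))
```

The spread shrinks roughly like $1/\sqrt{N}$, so the estimate of $\mu_{\mathcal{B}}$ from a batch of 2 is about four times noisier than from a batch of 32. That noise is injected into every normalized activation during training.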

---

#### Complete Example: Full Forward Pass

**Setup:**
- Input: $(2, 3, 4, 4)$ (2 samples, 3 channels, $4 \times 4$ spatial)
- BatchNorm2d with 3 channels

**Step-by-step:**

```python
import torch
import torch.nn as nn

# Input
x = torch.randn(2, 3, 4, 4)  # Shape: (N=2, C=3, H=4, W=4)

# BatchNorm layer
bn = nn.BatchNorm2d(num_features=3)

# Set some example parameters in place (no_grad keeps the copy out of autograd)
with torch.no_grad():
    bn.weight.copy_(torch.tensor([2.0, 1.5, 0.5]))  # γ for each channel
    bn.bias.copy_(torch.tensor([1.0, -1.0, 0.0]))   # β for each channel

# Forward pass
bn.train()  # Training mode: use batch statistics
output = bn(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")  # Same: (2, 3, 4, 4)

# For channel 0:
# 1. Compute μ, σ² from 2*4*4 = 32 values
# 2. Normalize: (x - μ) / sqrt(σ² + ε)
# 3. Scale: 2.0 * normalized
# 4. Shift: + 1.0
```

---

#### Key Takeaways

1. **BatchNorm normalizes activations** to stabilize training and allow higher learning rates

2. **Different variants** for different data types:
   - **1D (BatchNorm1d)**: Fully connected layers, normalize across batch per feature
   - **2D (BatchNorm2d)**: Convolutional layers, normalize across batch + spatial dimensions per channel

3. **Learnable parameters** ($\gamma, \beta$) allow network to undo normalization if needed

4. **Training vs. Inference**:
   - Training: Use batch statistics
   - Inference: Use running statistics

5. **Benefits**: Faster training, less sensitive to initialization, acts as regularization

6. **Placement**: After Conv/Linear, typically before activation (but after activation also works)

---
---

## LayerNorm

| **Aspect**         | **LayerNorm**                                      |
|--------------------|----------------------------------------------------|
| **Normalizes over**| Features (per sample, not batch)                   |
| **Input shape**    | Any (e.g. $(N, D)$ for FC, $(N, T, D)$ for RNNs)   |
| **Statistics per** | Each sample (mean, std across features)            |
| **Learnable params**| $\gamma, \beta \in \mathbb{R}^D$ (one per feature)|
| **Batch size needed**| No (works with batch size 1)                     |
| **Typical use**    | RNNs, Transformers, small batch sizes              |
| **Benefit**        | Stable normalization for sequence models           |
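
For contrast with BatchNorm, here is a minimal plain-Python sketch of LayerNorm on an $(N, D)$ input. The function name `layer_norm` and the default `eps` are assumptions for this example; statistics are computed per sample across its features, while $\gamma$ and $\beta$ are applied per feature.

```python
import math

def layer_norm(X, gamma, beta, eps=1e-5):
    """Normalize each row (sample) of X over its features, then scale and shift."""
    out = []
    for row in X:
        mu = sum(row) / len(row)
        var = sum((v - mu) ** 2 for v in row) / len(row)
        out.append([g * (v - mu) / math.sqrt(var + eps) + b
                    for v, g, b in zip(row, gamma, beta)])
    return out

X = [[1.0, 2.0, 3.0, 4.0]]   # batch size 1 works: no batch statistics are used
print(layer_norm(X, gamma=[1.0] * 4, beta=[0.0] * 4))
```

Because each sample is normalized on its own, the output for a sample does not change when other samples are added to the batch, and no running statistics are needed at inference.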