```{contents}
```
## Layer Normalization

**Layer Normalization (LayerNorm)** is a technique used in deep neural networks to **stabilize and accelerate training** by normalizing the activations **across the features of a layer, separately for each data sample**.

It is a core component of **Transformers and the LLMs built on them**, and it is also used in **RNNs** and in some generative models such as **VAEs and GANs**.

---

### **Core Intuition**

Neural networks train faster and more reliably when the scale of activations is controlled.

LayerNorm ensures that:

> **Each layer sees inputs with consistent scale and distribution.**

Instead of the network constantly adjusting to shifting internal values, it receives **well-conditioned signals**.

---

### **What Exactly Does LayerNorm Do?**

For a given input vector $x$ of a single training example, compute the mean and standard deviation over its $H$ features:

$$
\mu = \frac{1}{H} \sum_{i=1}^{H} x_i
$$
$$
\sigma = \sqrt{\frac{1}{H} \sum_{i=1}^{H} (x_i - \mu)^2}
$$

Then normalize:

$$
\hat{x}_i = \frac{x_i - \mu}{\sigma + \epsilon}
$$

Then apply learnable scale and shift:

$$
y_i = \gamma \hat{x}_i + \beta
$$

Where:

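* $H$ = hidden dimension (number of features in $x$)
* $\gamma, \beta$ = learned per-feature scale and shift parameters
* $\epsilon$ = small constant added for numerical stability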
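
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `layer_norm` and the default `eps=1e-5` are assumptions, and the sketch adds $\epsilon$ to the variance inside the square root (the convention used by PyTorch's `nn.LayerNorm`) rather than to $\sigma$ directly. For small $\epsilon$ the two forms differ only negligibly.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each sample over its last (hidden) axis, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)    # per-sample mean over the H features
    var = x.var(axis=-1, keepdims=True)    # per-sample (biased) variance, sigma^2
    x_hat = (x - mu) / np.sqrt(var + eps)  # eps inside the square root (assumed convention)
    return gamma * x_hat + beta            # learnable per-feature scale and shift

H = 4
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])   # two samples, hidden size H = 4
y = layer_norm(x, gamma=np.ones(H), beta=np.zeros(H))

print(y.mean(axis=-1))  # close to 0 for every sample
print(y.std(axis=-1))   # close to 1 for every sample
```

Although the second sample is ten times larger than the first, both rows come out almost identical after normalization, because each sample is normalized with its own statistics.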

---

### **Why Not Batch Normalization?**

| Batch Norm                      | Layer Norm                        |
| ------------------------------- | --------------------------------- |
| Depends on batch statistics     | Independent of batch              |
| Degrades with small batch sizes | Works for any batch size          |
| Awkward for sequence models     | Well suited to sequence models    |
| Difficult with variable lengths | Naturally handles variable length |

These properties are the main reason **Transformers and LLMs use LayerNorm rather than BatchNorm**.
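
The batch dependence in the first row of the table can be checked directly. The sketch below is a simplified comparison under stated assumptions: `batch_norm_train` is a hypothetical helper that uses training-mode batch statistics only, and both functions omit the learnable scale and shift (and BatchNorm's running averages) to keep the contrast visible.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Statistics over the feature axis: one mean/variance per sample.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def batch_norm_train(x, eps=1e-5):
    # Training-mode statistics over the batch axis: one mean/variance per feature.
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

sample = np.array([1.0, 2.0, 3.0])
batch_a = np.stack([sample, np.array([0.0, 0.0, 0.0])])
batch_b = np.stack([sample, np.array([9.0, -4.0, 7.0])])

# The same sample placed in two different batches.
ln_same = np.allclose(layer_norm(batch_a)[0], layer_norm(batch_b)[0])
bn_same = np.allclose(batch_norm_train(batch_a)[0], batch_norm_train(batch_b)[0])
print(ln_same)  # True
print(bn_same)  # False
```

LayerNorm gives the sample the same output regardless of its batch neighbors, while the batch-statistics version changes its output when the rest of the batch changes.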

---

### **Applications**

#### Transformers & LLMs

Applied before or after the attention and feedforward sublayers (Pre-Norm or Post-Norm placement) to stabilize training.

#### Recurrent Neural Networks

Improves training stability for long sequences.

#### Generative Models

Sometimes used in VAEs and GANs to make optimization smoother.

#### Reinforcement Learning

Helps stabilize the training of policy networks.

---

### **Benefits**

| Benefit                      | Explanation                           |
| ---------------------------- | ------------------------------------- |
| Faster convergence           | Well-scaled gradients                 |
| Improved stability           | Reduces exploding/vanishing gradients |
| Model robustness             | Less sensitivity to initialization    |
| Consistent training behavior | Across batch sizes                    |
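
One way to see the "less sensitivity to initialization" row is that LayerNorm's output barely changes when the weights of the preceding layer are rescaled. The snippet below is a small sketch of that property; the weight matrix `W`, the scale factors, and the shapes are made up for illustration, and the learnable scale and shift are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))    # two samples, 8 input features
W = rng.normal(size=(8, 16))   # hypothetical weight matrix of the preceding layer

out_small = layer_norm(x @ (0.1 * W))   # weights initialized "too small"
out_large = layer_norm(x @ (10.0 * W))  # weights initialized "too large"

# The pre-normalization activations differ by a factor of 100,
# yet the normalized outputs are nearly identical.
print(np.abs(out_small - out_large).max())
```

The remaining difference comes only from $\epsilon$, which matters slightly more when the activations are small.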

---

### **Pre-Norm vs Post-Norm**

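Most modern Transformers use **Pre-Norm**, where LayerNorm is applied to the input of each sublayer, inside the residual branch. The original Transformer used **Post-Norm**, where LayerNorm is applied after each residual addition:

```
Pre-Norm:  x → x + Attention(LayerNorm(x)) → x + FFN(LayerNorm(x))
Post-Norm: x → LayerNorm(x + Attention(x)) → LayerNorm(x + FFN(x))
```

Because Pre-Norm leaves the residual path unnormalized, gradients flow more directly through deep stacks, which enables **deeper and more stable models**.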
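
The difference between Pre-Norm and Post-Norm (which normalizes after each residual addition instead of before each sublayer) can be sketched as two small functions. This is an illustrative sketch only: `attention` and `ffn` are toy stand-ins rather than real sublayers, the function names are hypothetical, and LayerNorm's learnable scale and shift are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_norm_block(x, attention, ffn):
    # Normalize the input of each sublayer; the residual path stays untouched.
    x = x + attention(layer_norm(x))
    x = x + ffn(layer_norm(x))
    return x

def post_norm_block(x, attention, ffn):
    # Normalize after each residual addition (the original Transformer layout).
    x = layer_norm(x + attention(x))
    x = layer_norm(x + ffn(x))
    return x

# Toy stand-ins for the real sublayers (hypothetical, for illustration only).
rng = np.random.default_rng(0)
W_attn = rng.normal(scale=0.1, size=(8, 8))
W_ffn = rng.normal(scale=0.1, size=(8, 8))
attention = lambda h: h @ W_attn
ffn = lambda h: np.maximum(h @ W_ffn, 0.0)

x = rng.normal(size=(3, 8))  # 3 tokens, hidden size 8
print(pre_norm_block(x, attention, ffn).shape)   # (3, 8)
print(post_norm_block(x, attention, ffn).shape)  # (3, 8)
```

In the Post-Norm version every block output is renormalized, whereas in the Pre-Norm version the input `x` is carried through the additions unchanged, which is the direct residual path referred to above.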

---

### **Intuitive Summary**

LayerNorm acts like an **automatic volume control** inside the network, keeping every layer's output at a healthy operating level.
