# BERT Backbone

Before reading this note, make sure you have read the following:
* [Embeddings](embeddings.ipynb)
* [Attention](attention.ipynb)
* [FF](feed_forward.ipynb)
* [Encoder](encoder.ipynb)
* [Pooler](pooler.ipynb)

Up to this point, we have built isolated components. Now, we assemble the **Body** of the Transformer.

### Encoder-Only Architecture
It is crucial to understand that BERT is an **Encoder-Only** model.
* **GPT (Decoder):** Unidirectional. Reads left-to-right. Good at *generating* the future.
* **BERT (Encoder):** Bidirectional. Reads the entire sentence at once. Good at *understanding* the context.

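Because every position in BERT can attend to the words that come after it, BERT is never trained to predict the next token and cannot "write" text left-to-right like GPT. Instead, it creates a deep, contextual "mental image" of the text you feed it.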
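The difference between the two reading styles can be pictured as a visibility matrix. The snippet below is a minimal sketch; the helper name `visibility` is hypothetical and not part of any library.

```python
def visibility(seq_len, causal):
    """Return a matrix where row i marks the positions token i may attend to."""
    return [
        [1 if (j <= i or not causal) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

bert_style = visibility(4, causal=False)  # every token sees every token
gpt_style = visibility(4, causal=True)    # token i sees only positions 0..i

for name, matrix in (("encoder", bert_style), ("decoder", gpt_style)):
    print(name)
    for row in matrix:
        print(" ", row)
```

The encoder matrix is full, while the decoder matrix is lower-triangular: that triangle is what lets a decoder generate one token at a time.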

### Modules
BERT holds three specific sub-modules:

1.  **Embeddings:**
    * *Input:* Raw integers (Token IDs).
    * *Role:* The "Dictionary." Converts discrete IDs into continuous vectors.
    * *Analogy:* Looking up words in a massive encyclopedia before reading.

2.  **Encoder Stack:**
    * *Input:* Vectors from Embeddings.
    * *Role:* The "Brain." It loops through N layers of Attention and Feed-Forward processing.
    * *Analogy:* Reading the sentence N times, each time understanding the relationships between words better.

3.  **Pooler:**
    * *Input:* The final state of the `[CLS]` token.
    * *Role:* The "Summary." Squeezes the understanding of the whole sequence into one vector.
    * *Analogy:* Writing a one-sentence summary of the whole paragraph.

### Data Flow Pipeline

When you feed BERT with input IDs, the data travels through a strictly defined pipeline.

#### Step 1: Pre-processing 
The user gives us a mask of `1`s (Keep) and `0`s (Padding).\
The Attention mechanism, however, involves a **Softmax** function:
* If we use `0` for padding in the attention scores, $e^0 = 1$, so the model will "attend" to the padding a little bit.
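* Ideally the padded scores would be $-\infty$, since $e^{-\infty} = 0$. In practice a large negative number is enough, so we turn the mask into a bias that is added to the raw attention scores before the Softmax: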
$$ \text{Mask}_{new} = (1.0 - \text{Mask}_{old}) \times (-10000.0) $$
* Input `1` $\to$ `(1-1)*-10k` = **`0`** (Add nothing to the score).
* Input `0` $\to$ `(1-0)*-10k` = **`-10000`** (Destroy the score).
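The effect of this transformation on the Softmax can be checked with a few lines of plain Python. This is a sketch under stated assumptions: the function names and the toy scores are made up for illustration.

```python
import math

def extend_attention_mask(mask, penalty=-10000.0):
    """Map a 1/0 keep/padding mask to additive biases: 1 -> 0, 0 -> penalty."""
    return [(1.0 - m) * penalty for m in mask]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy attention scores for one query over four key positions (made-up values).
scores = [2.0, 1.0, 0.5, 0.3]
mask = [1, 1, 0, 0]  # the last two positions are padding

additive_mask = extend_attention_mask(mask)
masked_scores = [s + b for s, b in zip(scores, additive_mask)]
weights = softmax(masked_scores)

print([round(w, 4) for w in weights])
```

The padded positions receive a weight that underflows to zero, and the remaining weights still sum to one over the real tokens.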

#### Step 2: Encoder Stack
The data enters the Encoder Stack. It passes through Layer 0, then the output of Layer 0 goes to Layer 1, and so on.
* **Crucial Property:** The shape never changes.
* Input: `(Batch, Seq, 768)` $\to$ Output: `(Batch, Seq, 768)`

#### Step 3: Fork
At the end, the path splits:
1.  **Sequence Output:** The full list of vectors (one for every token), shape `(Batch, Seq, 768)`.
2.  **Pooled Output:** The first token's vector (`[CLS]`) passed through a dense layer with a Tanh activation, shape `(Batch, 768)`.
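The fork can be sketched for a single sequence as follows. This assumes the pooler is a dense layer followed by tanh, as in the reference BERT implementation; the function name `pool_cls` and the tiny weights are hypothetical.

```python
import math

def pool_cls(sequence_output, weight, bias):
    """Dense + tanh applied to the first token's vector of one sequence."""
    cls_vector = sequence_output[0]  # position 0 holds [CLS]
    pooled = []
    for row, b in zip(weight, bias):
        pre_activation = sum(w * x for w, x in zip(row, cls_vector)) + b
        pooled.append(math.tanh(pre_activation))
    return pooled

# Toy encoder output: 3 tokens, hidden size 2 (BERT-base would use 768).
sequence_output = [[0.5, -1.0], [0.1, 0.2], [0.0, 0.3]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical pooler weights (identity here)
bias = [0.0, 0.0]

pooled_output = pool_cls(sequence_output, weight, bias)
print(len(sequence_output), len(pooled_output))
```

The sequence output keeps one vector per token, while the pooled output collapses the sequence dimension to a single vector.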


# BERT Heads 

Up to this point, we have built the universal understanding engine: the **Backbone**.\
However, the backbone only outputs vectors. To do anything useful, we need to attach a **Head**.

* **Body (Backbone):** Heavy (~110M params). Expensive to train. Generic knowledge.
* **Head:** Light (Linear Layers). Cheap to train. Specific task.

### Heads
These are the specialized modules that plug into the backbone.

**A. Masked Language Model (MLM)**
* **Goal:** Predict the hidden word (e.g., `[MASK]` $\to$ "cat").
* **Input:** `sequence_output` (Vectors for every token).
* **Architecture:** `Dense` $\to$ `GELU` $\to$ `LayerNorm` $\to$ `Project to Vocab Size`.
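* **Crucial Feature:** The final projection shares its weight matrix with the word Embeddings, which saves memory (about 23M parameters for a 30K vocabulary and hidden size 768).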
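Weight tying means the same matrix is read in two directions: as a lookup table on the input side and as a scoring matrix on the output side. The sketch below uses a made-up 4-token vocabulary with hidden size 3; `embed` and `mlm_logits` are hypothetical helper names, and the `Dense`, `GELU`, and `LayerNorm` steps are assumed to have already produced `hidden`.

```python
# Toy vocabulary of 4 tokens with hidden size 3 (made-up numbers).
word_embeddings = [
    [0.1, 0.0, -0.2],
    [0.0, 0.3, 0.1],
    [-0.1, 0.2, 0.0],
    [0.2, -0.1, 0.1],
]

def embed(token_id):
    """Input side: look up a token's row in the embedding matrix."""
    return word_embeddings[token_id]

def mlm_logits(hidden, output_bias):
    """Output side: score every vocabulary entry against the same matrix."""
    return [
        sum(h * w for h, w in zip(hidden, row)) + b
        for row, b in zip(word_embeddings, output_bias)
    ]

hidden = [1.0, 2.0, 3.0]  # stand-in for the transformed vector at a [MASK] position
logits = mlm_logits(hidden, output_bias=[0.0] * 4)
print(logits.index(max(logits)))  # index of the highest-scoring token
```

Only the output bias is a new parameter here; the projection itself adds no weights beyond the embedding table.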

**B. Next Sentence Prediction (NSP)**
* **Goal:** Determine if Sentence B logically follows Sentence A (True/False).
* **Input:** `pooled_output` (Summary vector of `[CLS]`).
* **Architecture:** `Linear` $\to$ `2 Classes`.
* **Role:** Forces the backbone to learn sentence-level relationships.

**C. Sequence Classification**
* **Goal:** Classify the entire input text (e.g., Spam vs. Ham).
* **Input:** `pooled_output`.
* **Architecture:** `Dropout` $\to$ `Linear` $\to$ `N Classes`.
* **Role:** The standard head used for Fine-Tuning on downstream tasks.


### Assembled Models
We combine the Backbone and Heads to create the final models used for training.

**1. BertForPreTraining**
* **Composition:** `Backbone` + `MLM Head` + `NSP Head`.
* **Workflow:** Runs both heads simultaneously.
    $$\text{Total Loss} = \text{Loss}_{\text{MLM}} + \text{Loss}_{\text{NSP}}$$
* **Use Case:** Training the model from scratch on large corpora (Wikipedia).
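The combined objective can be sketched in plain Python. Two assumptions are made here: the MLM loss is averaged over the masked positions only, and unmasked positions carry the label `-100` (the default `ignore_index` of PyTorch's cross-entropy loss). All values are toy numbers.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]

IGNORE = -100  # assumed label for positions that were not masked

def mlm_loss(token_logits, labels):
    """Average cross-entropy over the masked positions only."""
    losses = [cross_entropy(z, y) for z, y in zip(token_logits, labels) if y != IGNORE]
    return sum(losses) / len(losses)

# Toy batch: 3 token positions over a 4-word vocabulary; only position 1 was masked.
token_logits = [[0.0, 0.0, 0.0, 0.0], [2.0, 0.5, 0.1, 0.0], [0.0, 1.0, 0.0, 0.0]]
mlm_labels = [IGNORE, 0, IGNORE]

nsp_logits = [1.5, -0.5]  # scores for the two NSP classes
nsp_label = 0

total_loss = mlm_loss(token_logits, mlm_labels) + cross_entropy(nsp_logits, nsp_label)
print(round(total_loss, 4))
```

Both terms are ordinary cross-entropies, so a single backward pass through the sum updates the backbone with signal from both tasks.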

**2. BertForSequenceClassification**
* **Composition:** `Backbone` + `Classification Head`.
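* **Workflow:** Loads pre-trained backbone weights and attaches a freshly initialized head. The head is always trained; the backbone is usually fine-tuned along with it, or frozen when only the head should be updated.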
* **Use Case:** Fine-tuning on specific datasets (Sentiment Analysis, Spam Detection, etc.).

## Weight initialization and gradient clipping

When training deep Transformer architectures like BERT from scratch, relying on default deep learning heuristics, such as standard PyTorch weight initialization, frequently leads to catastrophic failure modes like dead neurons, exploding losses, and mode collapse.

### Weight initialization variance effect
In a standard multi-layer network, PyTorch initializes weights to preserve variance across layers for typical depths. However, in the Masked Language Modeling (MLM) head of BERT, we compute the final logits by taking the dot product of a high-dimensional hidden state vector and the unscaled embedding matrix:
$$ \mathbf{z} = \mathbf{h} \mathbf{W}^T $$

If $\mathbf{W}$ (our vocabulary embeddings) is initialized from a standard normal distribution $\mathcal{N}(0, 1)$, which is the default for PyTorch's `nn.Embedding`, and the components of $\mathbf{h}$ (the 768-dimensional hidden state) have roughly unit variance, then each logit is a sum of $d_{model}$ independent terms and its variance scales linearly with $d_{model}$. For $d_{model} = 768$, the standard deviation of our logits becomes massive ($\sqrt{768} \approx 27.7$).
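The scaling claim above can be checked numerically. The sketch below is a rough simulation under two assumptions: the hidden state has unit-variance Gaussian components, and only 1,000 vocabulary rows are sampled to keep the pure-Python loop fast.

```python
import math
import random

random.seed(0)

d_model = 768
num_tokens = 1000  # a small slice of the vocabulary, for speed

# One hidden state with roughly unit-variance components (assumption).
h = [random.gauss(0.0, 1.0) for _ in range(d_model)]

def logit_std(weight_std):
    """Empirical std of z = h . w over `num_tokens` random embedding rows."""
    logits = []
    for _ in range(num_tokens):
        row = [random.gauss(0.0, weight_std) for _ in range(d_model)]
        logits.append(sum(hi * wi for hi, wi in zip(h, row)))
    mean = sum(logits) / num_tokens
    return math.sqrt(sum((z - mean) ** 2 for z in logits) / num_tokens)

measured = logit_std(1.0)
print(f"measured {measured:.1f}, predicted sqrt(768) = {math.sqrt(d_model):.1f}")
```

The measured value fluctuates with the random draw but stays in the neighborhood of $\sqrt{768}$.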

When these high-variance logits are passed through the Softmax function, the output probability distribution becomes heavily skewed (overconfident) toward a random token.
$$ p_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}} $$

Because the model is confidently *wrong* across a vocabulary of 30,000 tokens, it assigns a near-zero probability to the correct target, and the Cross-Entropy Loss starts out very high ($\mathcal{L} \approx 100$).

To prevent this immediate saturation, we initialize all linear layers and embeddings using a truncated normal distribution with a tightly constrained standard deviation: $\mathcal{N}(0, 0.02^2)$. 

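Why 0.02? For reference, $1 / \sqrt{768} \approx 0.036$ is the standard deviation that would give unit-variance logits when the components of $\mathbf{h}$ have unit variance. Choosing a standard deviation even smaller than $1 / \sqrt{d_{model}}$ keeps the initial logits $\mathbf{z}$ clustered closely around 0: their standard deviation is roughly $0.02 \times \sqrt{768} \approx 0.55$.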

When logits are near zero, the Softmax function outputs a nearly uniform distribution. For a vocabulary size $|V|$, the probability of any token becomes $p \approx 1 / |V|$. This yields a predictable, mathematically sound initial loss:
$$ \mathcal{L} = -\ln\left(\frac{1}{|V|}\right) = \ln |V| \approx 10.31 \quad \text{for } |V| = 30{,}000 $$
This ensures the model begins training with an open, unbiased state, allowing gradients to flow evenly rather than fighting a massive initial error term.
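The expected starting loss is easy to verify. The sketch below evaluates the cross-entropy for all-zero logits and prints the same quantity for two vocabulary sizes: the rounded 30,000 used in this note and the 30,522-token vocabulary of `bert-base-uncased`. It also reproduces the logit spread implied by the 0.02 initialization, assuming unit-variance hidden components.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]

vocab_size = 30_000
uniform_logits = [0.0] * vocab_size  # what near-zero initialization approximates
initial_loss = cross_entropy(uniform_logits, target=123)  # the target index is arbitrary

print(f"loss with uniform softmax: {initial_loss:.2f}")
print(f"ln(30000) = {math.log(30_000):.2f}, ln(30522) = {math.log(30_522):.2f}")

# Spread of the initial logits for weight std 0.02 and unit-variance hidden states.
small_logit_std = 0.02 * math.sqrt(768)
print(f"logit std at init: {small_logit_std:.2f}")
```

If the first logged training loss is far above this value, the initialization is a good first place to look.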

### Gradient clipping
Even with perfect initialization, Transformers are notoriously susceptible to early-stage gradient explosions. The self-attention mechanism computes attention scores by multiplying queries and keys. If a single outlier batch produces a sharp attention distribution, it can result in an enormous gradient spike during backpropagation.


We implement **Gradient Clipping** to bound the $L_2$ norm of the gradient vector. If the norm exceeds our threshold (typically 1.0), we multiply the entire gradient vector by threshold / norm, so that its norm equals the threshold. This preserves the *direction* of the gradient, so the model still moves toward the same update, but restricts the *step size*, preventing the optimizer from blasting the weights into an irrecoverable dead zone.
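Clipping by global norm fits in a few lines. PyTorch provides `torch.nn.utils.clip_grad_norm_` for this purpose; the pure-Python function below is only a hypothetical stand-in that treats all gradients as one flat list, with a small `eps` added as an assumption to avoid division by zero.

```python
import math

def clip_grad_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale a flat list of gradients so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# A gradient spike: the L2 norm of [3, -4] is 5, well above the threshold.
spiky = [3.0, -4.0]
clipped, norm_before = clip_grad_norm(spiky, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, round(norm_after, 3))

# A small gradient passes through unchanged.
calm, _ = clip_grad_norm([0.3, 0.4], max_norm=1.0)
print(calm)
```

Every component is multiplied by the same factor, which is why the direction survives while the length is capped.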