# BERT 

Before reading this note, make sure you have read the following:
* [Embeddings](embeddings.ipynb)
* [Attention](attention.ipynb)
* [FF](feed_forward.ipynb)
* [Encoder](encoder.ipynb)
* [Pooler](pooler.ipynb)

Up to this point, we have built isolated components. Now, we assemble the **Body** of the Transformer.

### Encoder-Only Architecture
It is crucial to understand that BERT is an **Encoder-Only** model.
* **GPT (Decoder):** Unidirectional. Reads left-to-right. Good at *generating* the future.
* **BERT (Encoder):** Bidirectional. Reads the entire sentence at once. Good at *understanding* the context.

Because every position in BERT can see the words that come after it, the model is not trained to predict the next word, so it cannot "write" text the way GPT does. Instead, it creates a deep, contextual "mental image" of the text you feed it.
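The difference between the two reading styles can be pictured as a visibility matrix, where row `i` marks which positions token `i` is allowed to look at. The snippet below is only an illustrative sketch in NumPy; the variable names and the sequence length of 4 are assumptions made for the example.

```python
import numpy as np

seq_len = 4  # illustrative sequence length

# Decoder-style (unidirectional): position i sees only positions <= i.
causal = np.tril(np.ones((seq_len, seq_len), dtype=int))

# Encoder-style (bidirectional): every position sees every position.
bidirectional = np.ones((seq_len, seq_len), dtype=int)

print(causal)
print(bidirectional)
```

In the causal matrix the first token sees only itself, while in the bidirectional matrix it sees the whole sentence, including the words to its right.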

### Modules
BERT holds three specific sub-modules:

1.  **Embeddings:**
    * *Input:* Raw integers (Token IDs).
    * *Role:* The "Dictionary." Converts discrete IDs into continuous vectors.
    * *Analogy:* Looking up words in a massive encyclopedia before reading.

2.  **Encoder Stack:**
    * *Input:* Vectors from Embeddings.
    * *Role:* The "Brain." It loops through N layers of Attention and Feed-Forward processing.
    * *Analogy:* Reading the sentence N times, each time understanding the relationships between words better.

3.  **Pooler:**
    * *Input:* The final state of the `[CLS]` token.
    * *Role:* The "Summary." Squeezes the understanding of the whole sequence into one vector.
    * *Analogy:* Writing a one-sentence summary of the whole paragraph.

### Data Flow Pipeline

When you feed BERT with input IDs, the data travels through a strictly defined pipeline.

#### Step 1: Pre-processing 
The user gives us a mask of `1`s (Keep) and `0`s (Padding).\
The Attention mechanism, however, involves a **Softmax** function:
* If we use `0` for padding in the attention scores, $e^0 = 1$, so the model will "attend" to the padding a little bit.
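* Ideally the padded positions would hold $-\infty$, since $e^{-\infty} = 0$. In practice a large negative number is enough, because $e^{-10000}$ is effectively $0$. So we transform the mask into a term that is added to the attention scores: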
$$ \text{Mask}_{new} = (1.0 - \text{Mask}_{old}) \times (-10000.0) $$
* Input `1` $\to$ `(1-1)*-10k` = **`0`** (Add nothing to the score).
* Input `0` $\to$ `(1-0)*-10k` = **`-10000`** (Destroy the score).
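The transformation above can be checked numerically. What follows is a minimal sketch in NumPy, not the notebook's actual implementation: the helper names `extend_attention_mask` and `softmax`, and the `(batch, 1, 1, seq)` broadcast shape, are assumptions made for illustration.

```python
import numpy as np

def extend_attention_mask(mask):
    """Turn a (batch, seq) mask of 1s (keep) and 0s (padding) into an additive mask."""
    mask = np.asarray(mask, dtype=np.float32)
    # Assumed shape (batch, 1, 1, seq) so it broadcasts over heads and query positions.
    return (1.0 - mask)[:, None, None, :] * -10000.0

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

mask = [[1, 1, 1, 0]]                               # the last position is padding
scores = np.zeros((1, 1, 4, 4), dtype=np.float32)   # (batch, heads, query, key), all equal
weights = softmax(scores + extend_attention_mask(mask))

print(weights[0, 0, 0])  # the three real tokens share the weight; padding gets about 0
```

With all raw scores equal, each real token ends up with about one third of the attention and the padded position with essentially none, which is the behaviour the mask is meant to produce.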

#### Step 2: Encoder Stack
The data enters the Encoder Stack. It passes through Layer 0, then the output of Layer 0 goes to Layer 1, and so on.
* **Crucial Property:** The shape never changes.
* Input: `(Batch, Seq, 768)` $\to$ Output: `(Batch, Seq, 768)`
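The shape-preserving property is what makes stacking possible: any layer's output is a valid input for the next one. The sketch below uses a hypothetical `toy_layer` (a hidden-to-hidden map with a residual connection) as a stand-in for a real encoder layer; it only demonstrates the shapes, not the actual Attention and Feed-Forward computation.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq, hidden, num_layers = 2, 5, 768, 3  # illustrative sizes

def toy_layer(x, weight):
    """Stand-in for one encoder layer: maps (batch, seq, hidden) to the same shape."""
    return x + np.tanh(x @ weight)

# One hypothetical weight matrix per layer.
layer_weights = [rng.normal(scale=0.02, size=(hidden, hidden)) for _ in range(num_layers)]

x = rng.normal(size=(batch, seq, hidden))
for w in layer_weights:
    x = toy_layer(x, w)                      # output of layer i feeds layer i + 1
    assert x.shape == (batch, seq, hidden)   # the shape is unchanged after every layer

print(x.shape)  # (2, 5, 768)
```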

#### Step 3: Fork
At the end, the path splits:
1.  **Sequence Output:** The full list of vectors (one for every token), with shape `(Batch, Seq, 768)`.
2.  **Pooled Output:** The first token's vector (`[CLS]`) passed through a dense layer followed by a Tanh activation, with shape `(Batch, 768)`.
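The fork can be sketched in a few lines of NumPy. This assumes the pooler is a dense layer followed by Tanh, as in the original BERT implementation; the random `sequence_output` and the parameter names `pooler_weight` and `pooler_bias` are placeholders for illustration, not values from a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq, hidden = 2, 5, 768  # illustrative sizes

# Stand-in for what the encoder stack returns.
sequence_output = rng.normal(size=(batch, seq, hidden))

# Hypothetical pooler parameters; a trained model would load these from a checkpoint.
pooler_weight = rng.normal(scale=0.02, size=(hidden, hidden))
pooler_bias = np.zeros(hidden)

cls_vector = sequence_output[:, 0]  # first token of every sequence: (batch, hidden)
pooled_output = np.tanh(cls_vector @ pooler_weight + pooler_bias)

print(sequence_output.shape)  # (2, 5, 768)
print(pooled_output.shape)    # (2, 768)
```

The sequence output suits per-token tasks such as tagging, while the pooled output gives one vector per sequence for tasks such as classification.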
