# The Attention Mechanism

### Query, Key, and Value

The attention mechanism is often explained using a **Retrieval System** analogy (like a database or a search engine).

For every token in the sequence (e.g., "bank"), we generate three vectors:
1.  **Query ($Q$)**: *What am I looking for?* (e.g., "I need context to understand if I mean 'river bank' or 'money bank'.")
2.  **Key ($K$)**: *What do I contain?* (e.g., "I am the word 'river'.")
3.  **Value ($V$)**: *What information do I pass along?* (e.g., "I am a nature-related noun.")


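**Score**: The Query is compared against every Key. If $Q_{\text{bank}}$ matches $K_{\text{river}}$ well, the score is high.

**Weighted sum**: The scores are normalized into weights, and the Values are summed using those weights, so a high score for "river" means $V_{\text{river}}$ contributes heavily.

This results in a new vector for "bank" that is heavily enriched with the concept of "river".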

The formula for Scaled Dot-Product Attention is:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

* $QK^T$: Dot product similarity between Queries and Keys.
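* $\sqrt{d_k}$: Scaling factor, where $d_k$ is the dimension of the Key vectors. Without it, dot products grow with $d_k$ and push the Softmax into saturated regions where gradients become extremely small.
* Softmax: Applied row by row, it converts each Query's raw scores into non-negative weights summing to 1.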
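
The formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: it handles a single unbatched sequence, applies no masking, and the function names `softmax` and `scaled_dot_product_attention`, as well as the random inputs, are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability; the result is unchanged.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: [seq_q, d_k], K: [seq_k, d_k], V: [seq_k, d_v]."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # [seq_q, seq_k] similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted sum of Values

# Hypothetical toy input: 4 tokens with 8-dimensional Q, K, V vectors.
rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)   # (4, 8)
```

Each row of `weights` shows how much one token attends to every other token, and the matching row of `output` is that token's new, context-enriched vector.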

### Efficient Multi-Head Attention

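BERT-base uses **12 attention heads** per layer. This allows the model to focus on different aspects of language simultaneously (e.g., Head 1 could focus on grammar, Head 2 on vocabulary, Head 3 on sentence structure, etc.). These roles are illustrative: what each head attends to is learned during training, not assigned by hand.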

**Naive Implementation (Slow):**
Creating 12 separate `nn.Linear` layers and looping over them is inefficient because GPUs prefer large matrix operations over many small ones.

**Vectorized Implementation (Fast):**
We use **one giant matrix** for all heads and then use tensor reshaping to split them virtually.

1.  **Project:** Multiply input $x$ (size 768 per token) by a large weight matrix $W^Q$ (size $768 \times 768$). Keys and Values are produced the same way with $W^K$ and $W^V$.
2.  **Reshape:** Split the result into 12 chunks of size 64.
    * Shape change: `[Batch, Seq, 768]` $\to$ `[Batch, Seq, 12, 64]`
3.  **Permute:** Swap axes so "Heads" is its own dimension.
    * Shape change: `[Batch, Seq, 12, 64]` $\to$ `[Batch, 12, Seq, 64]`

Now, a single batched `matmul` computes the attention scores for all 12 heads in parallel, and a second one applies the resulting weights to the Values.
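
The project, reshape, and permute steps can be sketched as follows. NumPy is used here as a stand-in so the example runs without a GPU framework; in PyTorch the same steps would typically use `view` or `reshape` and `permute` or `transpose`. The helper name `split_heads`, the random stand-in weights, and the batch and sequence sizes are all assumptions for illustration. The sketch also checks that the vectorized projection matches a naive per-head loop.

```python
import numpy as np

batch, seq, d_model, n_heads = 2, 5, 768, 12
d_head = d_model // n_heads  # 64

rng = np.random.default_rng(0)
x = rng.normal(size=(batch, seq, d_model))
# Stand-in weights; a real model would learn these.
W_q = rng.normal(size=(d_model, d_model)) * 0.02
W_k = rng.normal(size=(d_model, d_model)) * 0.02

def split_heads(t):
    # [batch, seq, 768] -> [batch, seq, 12, 64] -> [batch, 12, seq, 64]
    return t.reshape(batch, seq, n_heads, d_head).transpose(0, 2, 1, 3)

# Vectorized: one large projection, then reshape and permute.
q = split_heads(x @ W_q)
k = split_heads(x @ W_k)

# Naive: one small projection per head, using that head's 64 columns of W_q.
q_naive = np.stack(
    [x @ W_q[:, h * d_head:(h + 1) * d_head] for h in range(n_heads)],
    axis=1,
)
assert np.allclose(q, q_naive)  # both approaches give the same result

# One batched matmul produces the scores for all 12 heads at once.
scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_head)
print(q.shape)       # (2, 12, 5, 64)
print(scores.shape)  # (2, 12, 5, 5)
```

The equivalence holds because splitting the last dimension of the projected output into 12 chunks of 64 is the same as splitting the columns of the weight matrix into 12 blocks of 64, so the large matrix simply stores every head's projection side by side.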