# Homework 4: Attention is All You Need

**Objective:**
In this assignment, you will build a complete Self-Attention block for a single sentence from scratch. Here you will start with the **raw input vectors** and generate the Queries, Keys, and Values yourself using **Matrix Multiplication**. This mimics the actual architecture of a Transformer layer (like in GPT or BERT).

### **Context: What is "Word Embedding"?**
Before we do math, let's talk about data. Computers cannot understand strings like "Apple". To process language, we convert every word into a list of numbers called a **Vector**.
* This conversion is called **Word Embedding**.
* The key idea is that words with similar meanings (like "King" and "Queen") will have vectors that are numerically close to each other.
* For this homework, we assume our sentence has **128 words**, and each word is represented by a vector of **size 16**.
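
To make "numerically close" concrete, here is a minimal sketch that compares vectors with cosine similarity. The three 4-dimensional vectors are invented for illustration (real embeddings are learned from data), and `cosine_similarity` is a helper defined only for this example.

```python
import numpy as np

# Hypothetical embeddings, made up for this illustration only.
king = np.array([0.9, 0.8, 0.1, 0.3])
queen = np.array([0.85, 0.75, 0.2, 0.35])
apple = np.array([-0.4, 0.1, 0.9, -0.7])

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_king_queen = cosine_similarity(king, queen)
sim_king_apple = cosine_similarity(king, apple)

# The related pair should score higher than the unrelated pair.
print("king vs queen:", sim_king_queen)
print("king vs apple:", sim_king_apple)
```

Cosine similarity is one common choice of closeness measure; Euclidean distance would be another.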

---

### **Question 1: Setup and Weight Initialization**
**Context:**
In a real Transformer, we don't just use the input vectors directly. We "project" them into three different spaces: Query, Key, and Value. We do this by multiplying the input by three learnable weight matrices: $W_Q, W_K, W_V$.

**Task:**
1.  Set the random seed to `42`.
2.  Generate an "Input Sentence" matrix $X$ of shape `(128, 16)` by sampling from the standard normal distribution (`np.random.randn`).
    * 128 = Sequence Length (Words)
    * 16 = Embedding Dimension ($d_{model}$)
3.  Generate three **Weight Matrices** ($W_Q, W_K, W_V$).
    * In this simple example, we will project from dimension 16 back to dimension 16.
    * Shape of each weight matrix: `(16, 16)`.
    * You can create these matrices by sampling from the standard normal distribution (`np.random.randn`).
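
NumPy's `rand` and `randn` are easy to mix up because they sample from different distributions. The sketch below contrasts them on a toy shape and shows that re-seeding reproduces the same draws. The seed `0` and the shape `(1000, 4)` are arbitrary choices for this demo, not the values the task asks for.

```python
import numpy as np

np.random.seed(0)  # arbitrary demo seed, not the assignment's seed

uniform = np.random.rand(1000, 4)   # uniform samples on [0, 1)
normal = np.random.randn(1000, 4)   # standard normal samples (mean 0, std 1)

print("rand  min/max:", uniform.min(), uniform.max())
print("randn mean/std:", normal.mean(), normal.std())

# Resetting the seed replays the same sequence of random numbers.
np.random.seed(0)
again = np.random.rand(1000, 4)
print("reproducible:", np.array_equal(uniform, again))
```

Setting the seed before any random draw is what makes your results comparable with everyone else's.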

In [None]:
import numpy as np

# TODO: Set seed

# TODO: Generate Input X (128, 16)

# TODO: Generate Weights W_q, W_k, W_v (16, 16)

### **Question 2: Linear Projections (Creating Q, K, V)**
**Context:**
Now we calculate the Query, Key, and Value matrices using the linear definitions:
$$Q = X \cdot W_Q$$
$$K = X \cdot W_K$$
$$V = X \cdot W_V$$

**Task:**
1.  Perform the matrix multiplication. You can use the `@` operator (standard for matrix multiplication), `np.dot`, or `np.einsum`.
2.  Verify that the resulting shapes of $Q, K, V$ are all `(128, 16)`.
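
The three multiplication options give the same result for 2-D arrays. As a small sketch on toy data (4 "words" with 3 features each, both sizes chosen only for this example, with `X_toy` and `W_toy` as stand-in names):

```python
import numpy as np

np.random.seed(0)  # arbitrary demo seed
X_toy = np.random.randn(4, 3)  # 4 rows (words), 3 features each
W_toy = np.random.randn(3, 3)  # projects 3 features to 3 features

P_at = X_toy @ W_toy                               # matrix-multiply operator
P_dot = np.dot(X_toy, W_toy)                       # same product via np.dot
P_einsum = np.einsum("ij,jk->ik", X_toy, W_toy)    # sum over shared index j

print(P_at.shape)  # one projected row per input row
print(np.allclose(P_at, P_dot), np.allclose(P_at, P_einsum))
```

The inner dimensions must match (`3` and `3` here), and the result keeps the number of rows of the left operand.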

In [None]:
# TODO: Calculate Q, K, V using matrix multiplication

# TODO: Print shapes to verify

### **Question 3: The Attention Scores**
**Context:**
The core of attention is finding out how much every word "cares" about every other word. We do this by calculating the dot product between Queries and Keys.
$$\text{Scores} = \frac{Q \cdot K^T}{\sqrt{d_k}}$$

**Task:**
1.  Multiply $Q$ by the **transpose** of $K$.
    * $Q$ shape: `(128, 16)`
    * $K^T$ shape: `(16, 128)`
    * Result shape should be `(128, 128)` (A relationship map between every word and every other word).
2.  Scale the result by dividing by $\sqrt{16}$ (the dimension of the keys).
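
To see what a single entry of the score matrix means, the following sketch uses toy matrices (4 words, key dimension 3, both assumed only for this example) and checks that entry `[i, j]` is the scaled dot product of query `i` with key `j`.

```python
import numpy as np

np.random.seed(0)  # arbitrary demo seed
Q_toy = np.random.randn(4, 3)
K_toy = np.random.randn(4, 3)
d_k = K_toy.shape[1]  # dimension of each key vector

# (4, 3) @ (3, 4) -> (4, 4): one score per (query word, key word) pair.
scores_toy = (Q_toy @ K_toy.T) / np.sqrt(d_k)

# Entry [1, 2] compares the query of word 1 with the key of word 2.
single = (Q_toy[1] @ K_toy[2]) / np.sqrt(d_k)

print(scores_toy.shape)
print(np.isclose(scores_toy[1, 2], single))
```

Reading `d_k` from the array shape instead of hard-coding it keeps the scaling correct if the dimension changes.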

In [None]:
# TODO: Calculate Scores
# Hint: Use Q @ K.T

# TODO: Scale the scores

### **Question 4: The Softmax (Probability Distribution)**
**Context:**
The scores can be any number (negative, positive, huge). We want to turn them into probabilities that sum to 1.0 for each word.

The formula for Softmax is:
$$ \text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}} $$

**Task:**
1.  Exponentiate the scores using `np.exp()`.
2.  Divide each row by the sum of that row.
    * **Crucial:** Sum along `axis=1` (so each row is summed across its columns) and pass `keepdims=True` so the row sums keep shape `(128, 1)` and broadcast correctly in the division.
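
The effect of `axis=1` and `keepdims=True` is easiest to see on a tiny matrix. This is a sketch on a hand-picked 2x3 example (the values are arbitrary), not a full solution for the scores matrix.

```python
import numpy as np

scores_toy = np.array([[1.0, 2.0, 3.0],
                       [0.0, 0.0, 0.0]])

exp_scores = np.exp(scores_toy)

# keepdims=True keeps the summed axis as size 1: shape (2, 1), not (2,).
row_sums = exp_scores.sum(axis=1, keepdims=True)
flat_sums = exp_scores.sum(axis=1)

# (2, 3) / (2, 1) broadcasts each row sum across that row's columns.
probs = exp_scores / row_sums

print(row_sums.shape, flat_sums.shape)
print(probs.sum(axis=1))  # each row of probabilities sums to 1
```

With a square matrix such as `(128, 128)`, dividing by the flat `(128,)` sums would not raise an error; it would silently divide each column by the wrong row's sum, which is why `keepdims=True` matters.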

In [None]:
def softmax(x):
    # TODO: Implement simple softmax (exp / sum)
    pass

# TODO: Apply to scores matrix

### **Question 5: The Final Representation**
**Context:**
Finally, we create the new representation for each word. This is the weighted sum of the Values ($V$), weighted by the attention probabilities.
$$\text{Output} = \text{AttentionWeights} \cdot V$$

**Task:**
1.  Multiply your `weights` matrix `(128, 128)` by the `V` matrix `(128, 16)`.
2.  Print the final shape. It should be `(128, 16)`, the same as your input $X$, but now every word contains context from the whole sentence!
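
To build intuition for "weighted sum of the Values", here is a sketch with hand-written toy weights (3 words, value dimension 2, all numbers chosen for illustration). Each row of `weights_toy` sums to 1, as softmax output would.

```python
import numpy as np

weights_toy = np.array([[1.0, 0.0, 0.0],     # attends only to word 0
                        [0.5, 0.5, 0.0],     # splits evenly between words 0 and 1
                        [1/3, 1/3, 1/3]])    # attends equally to all words

V_toy = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [2.0, 2.0]])

# (3, 3) @ (3, 2) -> (3, 2): each output row is a weighted average of V's rows.
out_toy = weights_toy @ V_toy
print(out_toy)
```

Row 0 copies the value of word 0, row 1 is the midpoint of the first two values, and row 2 is the plain average of all three.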

In [None]:
# TODO: Calculate Output