[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/shang-vikas/series1-coding-exercises/blob/main/exercises/blog-08/exercise-04.ipynb)

# 🧪 Exercise 4: Build One Transformer Layer from Scratch

## 🎯 Goal

By the end of this notebook, you will:

- Convert tokens into embeddings
- Inject positional information
- Compute Q, K, V projections
- Perform scaled dot-product attention
- Apply residual connections
- Run a feed-forward MLP
- Produce the final output of one Transformer layer

Everything is small. Everything is explicit. Nothing is hidden.

## 🧠 Cell 1: Setup

**Goal:** Import NumPy and configure compact printing

In [None]:
import numpy as np

np.set_printoptions(precision=3, suppress=True)

## 🧠 Cell 2: Define Token Embeddings

**Goal:** Represent tokens as vectors (learned embeddings)

In a trained model these numbers are learned. Here they are hand-picked so you can follow the arithmetic. Not symbols. Not grammar. Geometry.

In [None]:
# Sentence: "the cat sleeps"

X = np.array([
    [0.2, 0.1, 0.0, 0.3],   # "the"
    [0.9, 0.7, 0.1, 0.0],   # "cat"
    [0.8, 0.2, 0.6, 0.4]    # "sleeps"
])

print("Initial token embeddings:\n", X)
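
The matrix above is typed in directly, but in a trained model each row would come from a lookup into an embedding table indexed by token id. A minimal sketch of that lookup, assuming a hypothetical three-word vocabulary (`vocab`) and an `embedding_table` that simply reuses the numbers from this cell:

```python
import numpy as np

# Hypothetical vocabulary: maps each word to an integer token id.
vocab = {"the": 0, "cat": 1, "sleeps": 2}

# Hypothetical embedding table: one row per token id.
# The values reuse the hand-picked numbers from the cell above.
embedding_table = np.array([
    [0.2, 0.1, 0.0, 0.3],   # id 0: "the"
    [0.9, 0.7, 0.1, 0.0],   # id 1: "cat"
    [0.8, 0.2, 0.6, 0.4],   # id 2: "sleeps"
])

sentence = "the cat sleeps".split()
token_ids = np.array([vocab[word] for word in sentence])

# Integer-array indexing selects one table row per token.
X_lookup = embedding_table[token_ids]

print("Token ids:", token_ids)
print("Looked-up embeddings:\n", X_lookup)
```

Reordering the words in `sentence` reorders the rows of `X_lookup`, but each word always maps to the same vector. That is why position has to be injected separately.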

## 🧠 Cell 3: Add Positional Embeddings

**Goal:** Inject order information numerically

Now meaning + position are baked into the vectors.

In [None]:
pos = np.array([
    [0.05, 0.00, 0.00, 0.00],
    [0.00, 0.05, 0.00, 0.00],
    [0.00, 0.00, 0.05, 0.00]
])

X = X + pos

print("After adding positional embeddings:\n", X)
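
The `pos` matrix in this cell is hand-picked. One common alternative, not used in this notebook, is the fixed sinusoidal encoding from the original Transformer paper. A sketch, where `sinusoidal_positions` is an illustrative helper name and an even `d_model` is assumed:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal position encodings (assumes d_model is even)."""
    positions = np.arange(seq_len)[:, None]            # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # shape (1, d_model / 2)
    angles = positions / (10000 ** (dims / d_model))   # shape (seq_len, d_model / 2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even columns get the sine
    pe[:, 1::2] = np.cos(angles)   # odd columns get the cosine
    return pe

pos_sin = sinusoidal_positions(seq_len=3, d_model=4)
print(pos_sin)
```

Either way, the result has the same shape as the token embeddings and is added to them elementwise.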

## Self-Attention

## 🧠 Cell 4: Define Attention Weight Matrices

**Goal:** Define learned projection matrices for Q, K, V

In a real model these are learned during training; here they are fixed by hand.
Change them, change behavior.

In [None]:
W_Q = np.array([
    [0.5, 0.1, 0.0, 0.2],
    [0.0, 0.6, 0.1, 0.0],
    [0.3, 0.0, 0.7, 0.1],
    [0.0, 0.2, 0.0, 0.5]
])

W_K = np.array([
    [0.4, 0.0, 0.1, 0.2],
    [0.1, 0.5, 0.0, 0.1],
    [0.2, 0.1, 0.6, 0.0],
    [0.0, 0.2, 0.1, 0.4]
])

W_V = np.array([
    [0.3, 0.2, 0.1, 0.0],
    [0.0, 0.4, 0.2, 0.1],
    [0.1, 0.0, 0.5, 0.2],
    [0.2, 0.1, 0.0, 0.6]
])

## 🧠 Cell 5: Compute Q, K, V

**Goal:** Project tokens into query, key, and value spaces

Same tokens. Different roles.

In [None]:
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

print("Q:\n", Q)
print("\nK:\n", K)
print("\nV:\n", V)
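
Since all three projections multiply the same input, they can also be computed with one matrix product by placing the weight matrices side by side. A sketch of that equivalence on small random matrices (the `_demo`, `_sep`, and `_fused` names are illustrative and separate from the notebook's variables):

```python
import numpy as np

rng = np.random.default_rng(0)
x_demo = rng.standard_normal((3, 4))
w_q_demo, w_k_demo, w_v_demo = (rng.standard_normal((4, 4)) for _ in range(3))

# Three separate projections, as in the cell above.
q_sep = x_demo @ w_q_demo
k_sep = x_demo @ w_k_demo
v_sep = x_demo @ w_v_demo

# One fused projection: concatenate the weights column-wise, multiply once, split.
w_qkv = np.concatenate([w_q_demo, w_k_demo, w_v_demo], axis=1)   # shape (4, 12)
q_fused, k_fused, v_fused = np.split(x_demo @ w_qkv, 3, axis=1)

print(np.allclose(q_sep, q_fused),
      np.allclose(k_sep, k_fused),
      np.allclose(v_sep, v_fused))
```

The separate form used in this notebook is easier to read; the fused form is an implementation convenience and changes nothing mathematically.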

## 🧠 Cell 6: Compute Scaled Dot-Product Attention

**Goal:** Measure similarity between tokens

Each entry (i, j) answers:
How relevant is token j to token i?

In [None]:
# The scale factor is the key dimension; with a single head it equals d_model.
d_model = 4

scores = Q @ K.T
scaled_scores = scores / np.sqrt(d_model)

print("Scaled attention scores:\n", scaled_scores)
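
The division by `np.sqrt(d_model)` keeps the scores in a range where softmax stays smooth. A small experiment suggests why, assuming query and key entries drawn independently from a standard normal distribution (an assumption for illustration, not a property of the weights above): the spread of raw dot products grows roughly like the square root of the dimension, and scaling brings it back to about 1.

```python
import numpy as np

rng = np.random.default_rng(0)

for dim in (4, 64, 512):
    # 10,000 random query/key pairs with unit-variance entries.
    q = rng.standard_normal((10_000, dim))
    k = rng.standard_normal((10_000, dim))

    raw = np.sum(q * k, axis=1)    # unscaled dot products
    scaled = raw / np.sqrt(dim)    # scaled the same way as in the cell above

    print(f"dim={dim:4d}  std(raw)={raw.std():6.2f}  std(scaled)={scaled.std():5.2f}")
```

At dimension 4 the effect is small, which is why this notebook would still work without scaling. At realistic sizes, unscaled scores push softmax toward near one-hot rows.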

## 🧠 Cell 7: Apply Softmax

**Goal:** Convert similarity scores into attention weights

Each row is now a probability distribution.

Attention never picks one token.
It blends.

In [None]:
def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

A = softmax(scaled_scores)

print("Attention weights:\n", A)
print("Row sums (should equal 1):\n", A.sum(axis=1))
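
Subtracting the row maximum inside `softmax` does not change the result, because softmax is unaffected by adding a constant to every entry in a row. It only prevents `np.exp` from overflowing. A sketch comparing a naive version with the stable form on deliberately large scores (`naive_softmax`, `stable_softmax`, and `big_scores` are illustrative names):

```python
import numpy as np

def naive_softmax(x):
    exp_x = np.exp(x)   # overflows to inf for large inputs
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

def stable_softmax(x):
    # Same formula as the softmax in the cell above.
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

big_scores = np.array([[1000.0, 1001.0, 1002.0]])

# Silence the overflow / invalid-value warnings the naive version triggers.
with np.errstate(over="ignore", invalid="ignore"):
    naive_result = naive_softmax(big_scores)

stable_result = stable_softmax(big_scores)

print("Naive :", naive_result)    # every entry is nan (inf / inf)
print("Stable:", stable_result)   # a valid distribution
```

The stable result is identical to the softmax of `[0, 1, 2]`, since only the differences between scores matter.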

## 🧠 Cell 8: Compute Attention Output

**Goal:** Mix value vectors according to attention weights

Each token is now contextualized.

In [None]:
attention_output = A @ V

print("Attention output:\n", attention_output)
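
Cells 5 through 8 can be collected into a single function. A sketch, where `self_attention` is an illustrative name and the inputs are small random matrices rather than the notebook's values:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (no masking)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = k.shape[-1]

    scores = (q @ k.T) / np.sqrt(d_k)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)

    return weights @ v, weights

rng = np.random.default_rng(0)
x_demo = rng.standard_normal((3, 4))
w_q_demo, w_k_demo, w_v_demo = (rng.standard_normal((4, 4)) * 0.5 for _ in range(3))

out_demo, weights_demo = self_attention(x_demo, w_q_demo, w_k_demo, w_v_demo)

print("Output shape:", out_demo.shape)
print("Row sums of weights:", weights_demo.sum(axis=1))
```

Because every row of the weights is a probability distribution, each output row is a weighted average of the value vectors. That is the precise sense in which a token becomes "contextualized".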

## Residual Connection

## 🧠 Cell 9: Add Residual

**Goal:** Preserve original information while refining it

Depth becomes refinement, not replacement.

In [None]:
residual_1 = X + attention_output

print("After first residual connection:\n", residual_1)

## MLP Block

## 🧠 Cell 10: Define MLP Weights

**Goal:** Define feedforward network weights

The MLP transforms each token independently.

In [None]:
# Random weights stand in for learned ones; the seed makes the output reproducible.
np.random.seed(0)

# Expand to hidden size 8, then project back to 4
W1 = np.random.randn(4, 8) * 0.5
W2 = np.random.randn(8, 4) * 0.5

## 🧠 Cell 11: MLP Forward Pass

**Goal:** Apply nonlinear transformation to each token

Attention decides what to look at.
MLP decides what to become.

In [None]:
def relu(x):
    return np.maximum(0, x)

mlp_hidden = relu(residual_1 @ W1)
mlp_output = mlp_hidden @ W2

print("MLP output:\n", mlp_output)
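
Because the MLP is a pair of matrix products applied row by row, each token passes through it without seeing the others. A sketch that checks this on random inputs (`mlp`, `tokens`, and the seeded `_demo` weights are illustrative, not the notebook's variables):

```python
import numpy as np

def mlp(x, w1, w2):
    # Linear -> ReLU -> linear, the same structure as the cell above.
    return np.maximum(0, x @ w1) @ w2

rng = np.random.default_rng(0)
w1_demo = rng.standard_normal((4, 8)) * 0.5
w2_demo = rng.standard_normal((8, 4)) * 0.5
tokens = rng.standard_normal((3, 4))

# All tokens at once.
batched = mlp(tokens, w1_demo, w2_demo)

# One token at a time, then stacked back together.
one_by_one = np.stack([mlp(row, w1_demo, w2_demo) for row in tokens])

print(np.allclose(batched, one_by_one))   # True
```

Mixing information across tokens is the job of attention alone; the MLP only reshapes what each token already carries.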

## Final Residual

## 🧠 Cell 12: Final Output of One Transformer Layer

**Goal:** Produce final output of the Transformer layer

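You have just executed one Transformer layer in its simplest form: single-head attention and an MLP, each followed by a residual connection. Real layers also add layer normalization and multiple heads.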

In [None]:
final_output = residual_1 + mlp_output

print("Final layer output:\n", final_output)
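
Since the output has the same shape as the input, the whole layer can be wrapped in one function and applied repeatedly. A sketch with the same structure as the cells above; `transformer_layer`, `random_params`, and the random weights are hypothetical stand-ins, not the notebook's values:

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def transformer_layer(x, p):
    """One simplified layer: attention + residual, then MLP + residual."""
    q, k, v = x @ p["W_Q"], x @ p["W_K"], x @ p["W_V"]
    attn = softmax_rows(q @ k.T / np.sqrt(k.shape[-1])) @ v
    h = x + attn                                     # first residual
    mlp_out = np.maximum(0, h @ p["W1"]) @ p["W2"]
    return h + mlp_out                               # second residual

def random_params(rng, d_model=4, d_hidden=8):
    shapes = {
        "W_Q": (d_model, d_model),
        "W_K": (d_model, d_model),
        "W_V": (d_model, d_model),
        "W1": (d_model, d_hidden),
        "W2": (d_hidden, d_model),
    }
    return {name: rng.standard_normal(shape) * 0.5 for name, shape in shapes.items()}

rng = np.random.default_rng(0)
x_in = rng.standard_normal((3, 4))
layers = [random_params(rng) for _ in range(2)]   # two stacked layers

h = x_in
for p in layers:
    h = transformer_layer(h, p)

print(h.shape)   # (3, 4)
```

Each layer gets its own weights, and the output of one becomes the input of the next.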

## 🎓 What You Just Built

- Token embeddings
- Positional embeddings
- Q/K/V projections
- Scaled dot-product attention
- Residual connections
- Feedforward MLP
- Final layer output

No abstractions.
No deep learning framework.
No mythology.

Just linear algebra and nonlinearity arranged carefully.

Run it. Modify weights. Change embeddings. Watch how attention shifts.

The diagram stops being intimidating the moment you realize it's just disciplined matrix multiplication repeated at scale.

And that's the strange elegance of it.