[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/shang-vikas/series1-coding-exercises/blob/main/exercises/blog-08/exercise-04.ipynb)

# 🧪 Exercise 4: Build One Transformer Layer from Scratch

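In this exercise, we build one Transformer layer from scratch using only NumPy. To keep every step visible, the layer is simplified: a single attention head, no layer normalization, and no output projection. We will: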

- Convert tokens into embeddings
- Inject positional information
- Compute Q, K, V projections
- Perform scaled dot-product attention
- Apply residual connections
- Run a feed-forward MLP
- Produce the final output of one Transformer layer

Everything is small. Everything is explicit. Nothing is hidden.

## 🧠 Setup

We start by importing NumPy and setting print options so arrays display with three decimals and without scientific notation.

In [1]:
import numpy as np

np.set_printoptions(precision=3, suppress=True)

## 🧠 Define Token Embeddings

We represent each token as a vector. In a trained model these embeddings are learned; here we hand-pick small values so every step can be checked by hand. Not symbols. Not grammar. Geometry.

In [2]:
# Sentence: "the cat sleeps"

X = np.array([
    [0.2, 0.1, 0.0, 0.3],   # "the"
    [0.9, 0.7, 0.1, 0.0],   # "cat"
    [0.8, 0.2, 0.6, 0.4]    # "sleeps"
])

print("Initial token embeddings:\n", X)

Initial token embeddings:
 [[0.2 0.1 0.  0.3]
 [0.9 0.7 0.1 0. ]
 [0.8 0.2 0.6 0.4]]
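
In practice the rows of `X` usually come from a lookup table indexed by token id. Here is a minimal sketch of that lookup. The `vocab` mapping and the name `embedding_table` are hypothetical and not part of the exercise; the numbers are copied from `X` above.

```python
import numpy as np

# Hypothetical vocabulary: token string -> integer id (assumption for this sketch).
vocab = {"the": 0, "cat": 1, "sleeps": 2}

# One row per vocabulary entry; values copied from X in the exercise.
embedding_table = np.array([
    [0.2, 0.1, 0.0, 0.3],   # id 0: "the"
    [0.9, 0.7, 0.1, 0.0],   # id 1: "cat"
    [0.8, 0.2, 0.6, 0.4],   # id 2: "sleeps"
])

sentence = ["the", "cat", "sleeps"]
token_ids = [vocab[word] for word in sentence]

# Integer-array indexing selects one row per token id.
X_lookup = embedding_table[token_ids]

print(token_ids)        # [0, 1, 2]
print(X_lookup.shape)   # (3, 4)
```

Reordering `sentence` reorders the rows of `X_lookup`, and a repeated word would produce a repeated row.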


## 🧠 Add Positional Embeddings

We inject order information by adding a positional embedding to each token vector. The positional values are hand-picked here, just like the token embeddings. Each row now encodes both what the token is and where it sits.

In [3]:
pos = np.array([
    [0.05, 0.00, 0.00, 0.00],
    [0.00, 0.05, 0.00, 0.00],
    [0.00, 0.00, 0.05, 0.00]
])

X = X + pos

print("After adding positional embeddings:\n", X)

After adding positional embeddings:
 [[0.25 0.1  0.   0.3 ]
 [0.9  0.75 0.1  0.  ]
 [0.8  0.2  0.65 0.4 ]]
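
To see what the addition buys us, consider a sentence that repeats a token. The sketch below uses a hypothetical sentence, "the cat the", which is not in the exercise, together with the same `pos` matrix. Before the addition the two copies of "the" are identical; afterwards they differ.

```python
import numpy as np

# Token embeddings copied from the exercise.
the = np.array([0.2, 0.1, 0.0, 0.3])
cat = np.array([0.9, 0.7, 0.1, 0.0])

# Hypothetical sentence "the cat the": same token at positions 0 and 2.
tokens = np.stack([the, cat, the])

# Positional embeddings copied from the exercise.
pos = np.array([
    [0.05, 0.00, 0.00, 0.00],
    [0.00, 0.05, 0.00, 0.00],
    [0.00, 0.00, 0.05, 0.00],
])

with_pos = tokens + pos

print(np.array_equal(tokens[0], tokens[2]))      # True
print(np.array_equal(with_pos[0], with_pos[2]))  # False
```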


## Self-Attention

## 🧠 Define Attention Weight Matrices

We define the projection matrices for Q, K, and V. In a real model they are learned during training; here they are fixed by hand. Change them, change behavior.

In [4]:
W_Q = np.array([
    [0.5, 0.1, 0.0, 0.2],
    [0.0, 0.6, 0.1, 0.0],
    [0.3, 0.0, 0.7, 0.1],
    [0.0, 0.2, 0.0, 0.5]
])

W_K = np.array([
    [0.4, 0.0, 0.1, 0.2],
    [0.1, 0.5, 0.0, 0.1],
    [0.2, 0.1, 0.6, 0.0],
    [0.0, 0.2, 0.1, 0.4]
])

W_V = np.array([
    [0.3, 0.2, 0.1, 0.0],
    [0.0, 0.4, 0.2, 0.1],
    [0.1, 0.0, 0.5, 0.2],
    [0.2, 0.1, 0.0, 0.6]
])

## 🧠 Compute Q, K, V

We project tokens into query, key, and value spaces. Same tokens. Different roles.

In [5]:
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

print("Q:\n", Q)
print("\nK:\n", K)
print("\nV:\n", V)

Q:
 [[0.125 0.145 0.01  0.2  ]
 [0.48  0.54  0.145 0.19 ]
 [0.595 0.28  0.475 0.425]]

K:
 [[0.11  0.11  0.055 0.18 ]
 [0.455 0.385 0.15  0.255]
 [0.47  0.245 0.51  0.34 ]]

V:
 [[0.135 0.12  0.045 0.19 ]
 [0.28  0.48  0.29  0.095]
 [0.385 0.28  0.445 0.39 ]]
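
The three projections are independent matrix products, so they can also be computed in one multiplication by placing the weight matrices side by side. This is a sketch of that common trick, not something the exercise itself does; the name `W_QKV` is introduced here for illustration.

```python
import numpy as np

# X after adding positional embeddings, copied from the exercise.
X = np.array([[0.25, 0.10, 0.00, 0.30],
              [0.90, 0.75, 0.10, 0.00],
              [0.80, 0.20, 0.65, 0.40]])

W_Q = np.array([[0.5, 0.1, 0.0, 0.2],
                [0.0, 0.6, 0.1, 0.0],
                [0.3, 0.0, 0.7, 0.1],
                [0.0, 0.2, 0.0, 0.5]])
W_K = np.array([[0.4, 0.0, 0.1, 0.2],
                [0.1, 0.5, 0.0, 0.1],
                [0.2, 0.1, 0.6, 0.0],
                [0.0, 0.2, 0.1, 0.4]])
W_V = np.array([[0.3, 0.2, 0.1, 0.0],
                [0.0, 0.4, 0.2, 0.1],
                [0.1, 0.0, 0.5, 0.2],
                [0.2, 0.1, 0.0, 0.6]])

# Stack the three (4, 4) matrices side by side into one (4, 12) matrix.
W_QKV = np.hstack([W_Q, W_K, W_V])

# One matmul, then split the (3, 12) result into three (3, 4) blocks.
Q, K, V = np.split(X @ W_QKV, 3, axis=1)

# The fused result matches the three separate projections.
print(np.allclose(Q, X @ W_Q), np.allclose(K, X @ W_K), np.allclose(V, X @ W_V))
```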


## 🧠 Compute Scaled Dot-Product Attention

We measure similarity between every query and every key with a dot product, then divide by the square root of the key dimension. Each entry (i, j) answers: How relevant is token j to token i?

In [6]:
d_model = 4  # single head, so the key dimension d_k equals d_model

scores = Q @ K.T
scaled_scores = scores / np.sqrt(d_model)

print("Scaled attention scores:\n", scaled_scores)

Scaled attention scores:
 [[0.033 0.083 0.084]
 [0.077 0.248 0.248]
 [0.099 0.279 0.368]]
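
Why divide by the square root of the dimension? A rough numerical sketch: if query and key components are independent with unit variance, the spread of their dot products grows with the dimension, and the division brings it back to about 1. The random vectors and the seed below are assumptions made for this demonstration; they are not the exercise's Q and K.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

results = {}
for d_k in (4, 64, 512):
    # 10,000 random query/key pairs with unit-variance components.
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))

    raw = np.sum(q * k, axis=1)    # unscaled dot products
    scaled = raw / np.sqrt(d_k)    # scaled as in the cell above

    results[d_k] = (raw.std(), scaled.std())
    print(f"d_k={d_k:4d}  raw std={raw.std():6.2f}  scaled std={scaled.std():.2f}")
```

The raw standard deviation should come out close to `sqrt(d_k)`, while the scaled one stays near 1. Keeping scores in a moderate range stops softmax from collapsing onto a single token as the dimension grows.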


## 🧠 Apply Softmax

We convert similarity scores into attention weights using softmax. Each row is now a probability distribution. Attention never picks one token. It blends.

In [7]:
def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

A = softmax(scaled_scores)

print("Attention weights:\n", A)
print("Row sums (should equal 1):\n", A.sum(axis=1))

Attention weights:
 [[0.322 0.339 0.339]
 [0.296 0.352 0.352]
 [0.285 0.342 0.373]]
Row sums (should equal 1):
 [1. 1. 1.]
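
The `softmax` above subtracts each row's maximum before exponentiating. That shift does not change the result, but it avoids overflow. Here is a small sketch comparing it with the direct definition; the large scores in `big` are hypothetical values chosen to trigger the overflow.

```python
import numpy as np

def softmax_naive(x):
    # Direct definition: np.exp overflows once entries get large.
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

def softmax_stable(x):
    # Same approach as the exercise: shift each row so its maximum is 0.
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

# Hypothetical large scores (assumption for this sketch).
big = np.array([[1000.0, 1001.0, 1002.0]])

# Silence the overflow/invalid warnings that the naive version raises.
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(big)

stable = softmax_stable(big)

print("naive: ", naive)    # every entry is nan (inf / inf)
print("stable:", stable)   # a valid distribution that sums to 1
```

The stable version gives the same weights for `big` as for `big - 1000`, because only the differences between scores in a row matter.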


## 🧠 Compute Attention Output

We mix value vectors according to attention weights. Each token is now contextualized.

In [8]:
attention_output = A @ V

print("Attention output:\n", attention_output)

Attention output:
 [[0.269 0.296 0.264 0.226]
 [0.274 0.303 0.272 0.227]
 [0.278 0.303 0.278 0.232]]
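
The attention steps so far can be collected into a single function. This is a sketch that repeats the cells above with the same inputs; the function name `self_attention` is introduced here and is not part of the exercise.

```python
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention, as in the cells above."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]                      # key dimension (4 here)
    A = softmax(Q @ K.T / np.sqrt(d_k))    # (tokens, tokens) attention weights
    return A @ V, A

# Inputs copied from the exercise (X already includes positions).
X = np.array([[0.25, 0.10, 0.00, 0.30],
              [0.90, 0.75, 0.10, 0.00],
              [0.80, 0.20, 0.65, 0.40]])
W_Q = np.array([[0.5, 0.1, 0.0, 0.2],
                [0.0, 0.6, 0.1, 0.0],
                [0.3, 0.0, 0.7, 0.1],
                [0.0, 0.2, 0.0, 0.5]])
W_K = np.array([[0.4, 0.0, 0.1, 0.2],
                [0.1, 0.5, 0.0, 0.1],
                [0.2, 0.1, 0.6, 0.0],
                [0.0, 0.2, 0.1, 0.4]])
W_V = np.array([[0.3, 0.2, 0.1, 0.0],
                [0.0, 0.4, 0.2, 0.1],
                [0.1, 0.0, 0.5, 0.2],
                [0.2, 0.1, 0.0, 0.6]])

out, A = self_attention(X, W_Q, W_K, W_V)

np.set_printoptions(precision=3, suppress=True)
print(out)  # should agree with the attention output printed above
```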


## Residual Connection

## 🧠 Add Residual

We add the residual connection to preserve original information while refining it. Depth becomes refinement, not replacement.

In [9]:
residual_1 = X + attention_output

print("After first residual connection:\n", residual_1)

After first residual connection:
 [[0.519 0.396 0.264 0.526]
 [1.174 1.053 0.372 0.227]
 [1.078 0.503 0.928 0.632]]
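
One way to see why the residual path preserves information is a short derivation. The symbols $x$, $f$, and $y$ are introduced here for illustration: $x$ is the input to the block, $f$ is the attention sub-layer, and $y$ is the result after the residual addition.

$$
y = x + f(x), \qquad \frac{\partial y}{\partial x} = I + \frac{\partial f}{\partial x}
$$

The identity term $I$ gives the input a direct path to the output, so the sub-layer only has to contribute a correction on top of what is already there.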


## MLP Block

## 🧠 Define MLP Weights

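We define the weights of the feed-forward network. The MLP transforms each token independently. These weights are random and no seed is set, so the MLP output and final output printed below will differ from run to run.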

In [10]:
# Expanding to hidden size 8, then projecting back to 4
W1 = np.random.randn(4, 8) * 0.5
W2 = np.random.randn(8, 4) * 0.5

## 🧠 MLP Forward Pass

We apply a nonlinear transformation to each token. Attention decides what to look at. MLP decides what to become.

In [11]:
def relu(x):
    return np.maximum(0, x)

mlp_hidden = relu(residual_1 @ W1)
mlp_output = mlp_hidden @ W2

print("MLP output:\n", mlp_output)

MLP output:
 [[-0.897 -0.223  0.175 -0.338]
 [-0.815 -0.383  0.554 -1.01 ]
 [-1.429 -0.394 -0.16  -0.939]]
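
To check that the MLP really treats each token independently, we can run it on all tokens at once and then one token at a time. The sketch below does that. The seeded generator and the helper name `mlp` are assumptions made here, so the weights (and therefore the numbers) will not match the MLP output printed above.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def mlp(x, W1, W2):
    # Expand to the hidden size, apply ReLU, project back.
    return relu(x @ W1) @ W2

rng = np.random.default_rng(0)           # arbitrary seed for this sketch
W1 = rng.standard_normal((4, 8)) * 0.5
W2 = rng.standard_normal((8, 4)) * 0.5

# residual_1 as printed in the exercise (rounded to 3 decimals).
residual_1 = np.array([[0.519, 0.396, 0.264, 0.526],
                       [1.174, 1.053, 0.372, 0.227],
                       [1.078, 0.503, 0.928, 0.632]])

batched = mlp(residual_1, W1, W2)                                 # all tokens at once
one_by_one = np.stack([mlp(row, W1, W2) for row in residual_1])   # token by token

print(np.allclose(batched, one_by_one))  # True
```

No information moves between rows in this step. Mixing across tokens happens only in attention.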


## Final Residual

## 🧠 Final Output of One Transformer Layer

We produce the final output by adding the residual connection. This completes one full Transformer layer.

In [12]:
final_output = residual_1 + mlp_output

print("Final layer output:\n", final_output)

Final layer output:
 [[-0.378  0.173  0.438  0.188]
 [ 0.359  0.67   0.926 -0.783]
 [-0.351  0.108  0.768 -0.307]]
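
The whole layer fits in one short function, which also makes it easy to stack. This is a sketch under stated assumptions: the name `transformer_layer` and the seeded MLP weights are introduced here, so the numbers will not match the final output printed above. The second call reuses the same weights only for brevity; in a real model each layer has its own.

```python
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

def relu(x):
    return np.maximum(0, x)

def transformer_layer(X, W_Q, W_K, W_V, W1, W2):
    """One simplified layer: attention + residual, then MLP + residual."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    residual_1 = X + A @ V
    return residual_1 + relu(residual_1 @ W1) @ W2

# Inputs copied from the exercise (X already includes positions).
X = np.array([[0.25, 0.10, 0.00, 0.30],
              [0.90, 0.75, 0.10, 0.00],
              [0.80, 0.20, 0.65, 0.40]])
W_Q = np.array([[0.5, 0.1, 0.0, 0.2],
                [0.0, 0.6, 0.1, 0.0],
                [0.3, 0.0, 0.7, 0.1],
                [0.0, 0.2, 0.0, 0.5]])
W_K = np.array([[0.4, 0.0, 0.1, 0.2],
                [0.1, 0.5, 0.0, 0.1],
                [0.2, 0.1, 0.6, 0.0],
                [0.0, 0.2, 0.1, 0.4]])
W_V = np.array([[0.3, 0.2, 0.1, 0.0],
                [0.0, 0.4, 0.2, 0.1],
                [0.1, 0.0, 0.5, 0.2],
                [0.2, 0.1, 0.0, 0.6]])

rng = np.random.default_rng(0)           # arbitrary seed for this sketch
W1 = rng.standard_normal((4, 8)) * 0.5
W2 = rng.standard_normal((8, 4)) * 0.5

out_1 = transformer_layer(X, W_Q, W_K, W_V, W1, W2)
out_2 = transformer_layer(out_1, W_Q, W_K, W_V, W1, W2)  # same weights, for brevity only

print(out_1.shape, out_2.shape)  # (3, 4) (3, 4)
```

Because the output has the same shape as the input, layers compose directly: the output of one becomes the input of the next.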


## 🎓 What We Built

In this exercise, we constructed:

- Token embeddings
- Positional embeddings
- Q/K/V projections
- Scaled dot-product attention
- Residual connections
- Feed-forward MLP
- Final layer output

No abstractions.
No deep learning framework.
No mythology.

Just linear algebra and nonlinearity arranged carefully.

Run it. Modify weights. Change embeddings. Watch how attention shifts.

The diagram stops being intimidating the moment you realize it's just disciplined matrix multiplication repeated at scale.

And that's the strange elegance of it.