[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/shang-vikas/series1-coding-exercises/blob/main/exercises/blog-08/exercise-02.ipynb)

# 🧪 Exercise 3: Show Why MLP Is Necessary

## Goal

Show that attention alone only mixes token representations; it applies no per-token nonlinear transformation.

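Apply the attention weights twice with no MLP in between (for simplicity, the second pass reuses the same weights instead of recomputing them):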

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

batch = 1
seq = 4
dim = 8

x = torch.randn(batch, seq, dim)

Wq = torch.randn(dim, dim)
Wk = torch.randn(dim, dim)
Wv = torch.randn(dim, dim)

Q = x @ Wq
K = x @ Wk
V = x @ Wv

scores = Q @ K.transpose(-2, -1) / dim**0.5
weights = F.softmax(scores, dim=-1)

out1 = weights @ V
out2 = weights @ out1

print("Difference:", (out2 - out1).norm())

Difference: tensor(15.5557)


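The outputs differ, but only because the rows were averaged a second time. `weights` is row-stochastic (each row sums to 1), so every row of `out2` is a weighted average of the rows of `out1`. It remixes; it creates no new features.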
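
A minimal sketch of that weighted-average property, in plain Python so the arithmetic is visible. The random `scores` and `values` are illustrative stand-ins, not the tensors from the cell above. It checks that softmax rows sum to 1 and that every mixed feature stays inside the range of the inputs for that feature.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
seq, dim = 4, 3

# Stand-ins for attention scores and value vectors (random, illustrative).
scores = [[random.gauss(0.0, 1.0) for _ in range(seq)] for _ in range(seq)]
values = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(seq)]

weights = [softmax(row) for row in scores]

# mixed[i][j] = sum_k weights[i][k] * values[k][j]
mixed = [
    [sum(weights[i][k] * values[k][j] for k in range(seq)) for j in range(dim)]
    for i in range(seq)
]

row_sums = [sum(row) for row in weights]

# A weighted average can never leave the range of the values it averages,
# so each output feature lies between the min and max of that input feature.
columns = list(zip(*values))
within_range = all(
    min(columns[j]) - 1e-12 <= mixed[i][j] <= max(columns[j]) + 1e-12
    for i in range(seq)
    for j in range(dim)
)

print("Row sums:", [round(s, 6) for s in row_sums])
print("Every output inside the input range:", within_range)
```

The mixing step can move tokens toward each other, but on its own it cannot push a feature outside the range the tokens already span.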

Now add an MLP:

In [2]:
mlp = nn.Sequential(
    nn.Linear(dim, dim*4),
    nn.ReLU(),
    nn.Linear(dim*4, dim)
)

out_mlp = mlp(out1)
print("After MLP difference:", (out_mlp - out1).norm())

After MLP difference: tensor(11.9028, grad_fn=<LinalgVectorNormBackward0>)
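
A norm difference alone does not show what kind of change happened. One hedged way to see the distinction is an additivity check: mixing with fixed weights is linear, while a ReLU MLP is not. The sketch below uses scalar tokens and a hypothetical hand-built 1 -> 2 -> 1 ReLU network (`tiny_mlp`, which computes `|x|`); neither comes from the exercise. Real attention weights depend on the input, so this only describes the mixing step once the weights are fixed.

```python
def relu(v):
    return max(0.0, v)

def mix(tokens, weights):
    """One attention-style output: a weighted sum of scalar token values."""
    return sum(w * t for w, t in zip(weights, tokens))

def tiny_mlp(x):
    """Hypothetical 1 -> 2 -> 1 ReLU network whose weights make it compute |x|."""
    hidden = [relu(1.0 * x), relu(-1.0 * x)]
    return 1.0 * hidden[0] + 1.0 * hidden[1]

weights = [0.25, 0.75]  # fixed mixing weights, sum to 1
a = [1.0, -2.0]
b = [-1.0, 4.0]
a_plus_b = [p + q for p, q in zip(a, b)]

# Linear: mixing the sum equals summing the mixes.
mix_of_sum = mix(a_plus_b, weights)
sum_of_mix = mix(a, weights) + mix(b, weights)

# Nonlinear: the MLP of a sum differs from the sum of the MLP outputs.
mlp_of_sum = tiny_mlp(1.0 + -1.0)
sum_of_mlp = tiny_mlp(1.0) + tiny_mlp(-1.0)

print("mix(a + b) =", mix_of_sum, "| mix(a) + mix(b) =", sum_of_mix)
print("mlp(1 + -1) =", mlp_of_sum, "| mlp(1) + mlp(-1) =", sum_of_mlp)
```

The mixing step passes the additivity check and the MLP fails it, which is the sense in which the MLP "transforms".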


## 💡 Teaching Point

Attention mixes.

MLP transforms.

Without MLP, depth collapses.
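
A rough sketch of what "collapse" can look like, under a strong simplifying assumption: the same fixed attention weights are reused at every layer, with no projections and no MLP. The helper names (`mix`, `spread`) and the random inputs are illustrative, not from the exercise. Because softmax weights are strictly positive and each row sums to 1, repeated mixing pulls all token vectors toward a common value.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def mix(weights, tokens):
    """Average token vectors with row-stochastic attention weights."""
    n, d = len(tokens), len(tokens[0])
    return [
        [sum(weights[i][k] * tokens[k][j] for k in range(n)) for j in range(d)]
        for i in range(n)
    ]

def spread(tokens):
    """Largest per-feature gap between any two tokens."""
    return max(max(col) - min(col) for col in zip(*tokens))

random.seed(0)
seq, dim = 4, 3

# Assumption: one fixed set of attention weights, reused at every layer.
weights = [softmax([random.gauss(0.0, 1.0) for _ in range(seq)]) for _ in range(seq)]
tokens = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(seq)]

spreads = [spread(tokens)]
for _ in range(20):
    tokens = mix(weights, tokens)  # attention-only "layer", no MLP
    spreads.append(spread(tokens))

for depth in (0, 1, 5, 20):
    print(f"after {depth:2d} mixing layers, token spread = {spreads[depth]:.6f}")
```

The spread never increases from one layer to the next, so extra attention-only depth makes the tokens more alike rather than more expressive.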

# 🧪 Exercise 4: Residual Connection Stability Demo

## Goal

Show that residual connections protect the gradient as depth grows.

Without residual:

In [3]:
layer = nn.Sequential(
    nn.Linear(dim, dim),
    nn.ReLU()
)

x = torch.randn(1, 10, dim, requires_grad=True)
out = layer(layer(layer(x)))
loss = out.sum()
loss.backward()

print("Gradient norm without residual:", x.grad.norm())

Gradient norm without residual: tensor(0.5977)


Now with residual:

In [4]:
x.grad.zero_()

def residual_block(x):
    return x + layer(x)

out = residual_block(residual_block(residual_block(x)))
loss = out.sum()
loss.backward()

print("Gradient norm with residual:", x.grad.norm())

Gradient norm with residual: tensor(12.7411)


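You'll see:

Residual connections preserve gradient magnitude much better: in this run the input gradient norm is 12.7411 with residuals versus 0.5977 without. The identity path in `x + layer(x)` lets gradient reach the input even where the ReLU is inactive.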
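
A scalar sketch, in plain Python, of why the identity path matters. The weight `w`, the input `x`, and both helper functions are illustrative assumptions, not part of the exercise. By the chain rule, each plain layer multiplies the gradient by `w * relu'(pre)`, while each residual layer multiplies it by `1 + w * relu'(pre)`.

```python
def relu(v):
    return max(0.0, v)

def relu_grad(v):
    return 1.0 if v > 0 else 0.0

def plain_chain(x, w, depth):
    """Return (output, d output / d x) for `depth` layers of y = relu(w * y)."""
    y, grad = x, 1.0
    for _ in range(depth):
        pre = w * y
        grad *= w * relu_grad(pre)        # local derivative of relu(w * y)
        y = relu(pre)
    return y, grad

def residual_chain(x, w, depth):
    """Same layers, but each one computes y = y + relu(w * y)."""
    y, grad = x, 1.0
    for _ in range(depth):
        pre = w * y
        grad *= 1.0 + w * relu_grad(pre)  # identity path contributes the 1.0
        y = y + relu(pre)
    return y, grad

for w in (0.5, -0.5):
    for depth in (1, 3, 10):
        _, g_plain = plain_chain(1.0, w, depth)
        _, g_res = residual_chain(1.0, w, depth)
        print(f"w={w:+.1f} depth={depth:2d}  plain={g_plain:.6f}  residual={g_res:.6f}")
```

With `w = 0.5` the plain gradient shrinks as `0.5 ** depth`, and with `w = -0.5` the ReLU is inactive and the plain gradient is exactly zero. In both cases the `1 +` term keeps the residual factor from dropping to zero. In this toy setup the residual gradient can also grow with depth, so the sketch shows protection against vanishing, not a guarantee of constant magnitude.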

## 💡 Teaching Point

Residuals make depth survivable.