# LLM From Scratch — Interactive Student Workbook (v2)

This version of the workbook is visual and interactive. For each exercise:
1. Predict what will happen.
2. Run the code.
3. Change one parameter and re-run.
4. Explain what changed.


In [None]:
import torch
import matplotlib.pyplot as plt
import math


## Exercise 1 — Dot Product Visual


In [None]:
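# Three 2-D vectors: b points roughly the same way as a; c = -b points the opposite way.
a = torch.tensor([2.0, 1.0])
b = torch.tensor([1.0, 3.0])
c = torch.tensor([-1.0, -3.0])

# The dot product is positive when two vectors point in similar directions
# and negative when they point in opposing directions.
vals = [torch.dot(a, b).item(), torch.dot(a, c).item()]
plt.bar(['a·b', 'a·c'], vals)
plt.axhline(0, color='black')  # zero line separates positive from negative
plt.title('Positive vs Negative Similarity')
plt.show()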

# TODO: change vectors and interpret
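
One way to interpret the numbers is to turn each dot product into an angle. The sketch below is only a suggestion: the helper `cosine` and the extra vector `d` are hypothetical additions, not part of the exercise. It divides the dot product by both vector lengths to get the cosine of the angle between them.

```python
import math
import torch

a = torch.tensor([2.0, 1.0])
b = torch.tensor([1.0, 3.0])
c = torch.tensor([-1.0, -3.0])
d = torch.tensor([-1.0, 2.0])  # hypothetical extra vector, perpendicular to a

def cosine(u, v):
    """Dot product divided by both lengths: the cosine of the angle between u and v."""
    return (torch.dot(u, v) / (u.norm() * v.norm())).item()

for name, v in [('b', b), ('c', c), ('d', d)]:
    cos = cosine(a, v)
    # Clamp guards against tiny floating-point overshoot outside [-1, 1].
    angle = math.degrees(math.acos(max(-1.0, min(1.0, cos))))
    print(f'a·{name} = {torch.dot(a, v).item():+.1f}  cos = {cos:+.3f}  angle = {angle:.0f}°')
```

A positive dot product corresponds to an angle under 90°, a negative one to an angle over 90°, and a dot product of zero to perpendicular vectors. Try replacing `d` with your own vector and predict the sign before running.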


## Exercise 2 — Softmax Temperature Explorer


In [None]:
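scores = torch.tensor([2.0, 1.0, 0.1])
# T is the temperature: dividing the scores by T before softmax
# flattens the distribution when T > 1 and sharpens it when T < 1.
for T in [2.0, 1.0, 0.5, 0.2]:
    p = torch.softmax(scores / T, dim=0)
    print(f'T={T}:', p.tolist())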

# TODO: plot these distributions
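
A grouped bar chart is one possible way to plot these distributions. The layout choices here (bar width, labels, the `dists` dictionary) are assumptions made for illustration; any plot that lets you compare the four temperatures side by side works equally well.

```python
import torch
import matplotlib.pyplot as plt

scores = torch.tensor([2.0, 1.0, 0.1])
temperatures = [2.0, 1.0, 0.5, 0.2]

# One probability distribution per temperature.
dists = {T: torch.softmax(scores / T, dim=0) for T in temperatures}

fig, ax = plt.subplots()
bar_width = 0.2
for i, (T, p) in enumerate(dists.items()):
    # Shift each temperature's bars sideways so the groups sit next to each other.
    positions = [j + i * bar_width for j in range(len(scores))]
    ax.bar(positions, p.tolist(), width=bar_width, label=f'T={T}')

# Centre the tick labels under each group of four bars.
ax.set_xticks([j + 1.5 * bar_width for j in range(len(scores))])
ax.set_xticklabels([f'score={s:.1f}' for s in scores.tolist()])
ax.set_ylabel('Probability')
ax.set_title('Softmax at different temperatures')
ax.legend()
plt.show()
```

As the temperature drops, the bar for the highest score should grow while the others shrink. Before running, predict which temperature gives the flattest set of bars.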


## Exercise 3 — Tokenization Timeline


In [None]:
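text = 'students can build tiny llms'
tokens = text.split()  # whitespace tokenization: one token per word
# Build the vocabulary: each unique word gets an integer id, assigned in alphabetical order.
vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}
ids = [vocab[t] for t in tokens]  # the sentence as a sequence of ids
print(tokens)
print(ids)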

# TODO: make a scatter timeline plot token->id
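
The timeline can be sketched as a scatter plot with sentence position on the x-axis and token id on the y-axis. This is one possible layout, not the only valid answer; the `positions` list and the annotation offsets are illustrative choices.

```python
import matplotlib.pyplot as plt

text = 'students can build tiny llms'
tokens = text.split()
vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}
ids = [vocab[t] for t in tokens]

positions = list(range(len(tokens)))  # 0, 1, 2, ... in reading order

fig, ax = plt.subplots()
ax.scatter(positions, ids)
for pos, tok, tok_id in zip(positions, tokens, ids):
    # Write each word next to its point so the id can be read off the y-axis.
    ax.annotate(tok, (pos, tok_id), textcoords='offset points', xytext=(5, 5))
ax.set_xticks(positions)
ax.set_yticks(sorted(vocab.values()))
ax.set_xlabel('Position in sentence')
ax.set_ylabel('Token id')
ax.set_title('Tokenization timeline')
plt.show()
```

Because ids are assigned alphabetically, the points jump up and down rather than rising steadily. Try a sentence with a repeated word and check that both occurrences land on the same height.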


## Exercise 4 — Attention Heatmap


In [None]:
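def attention(Q, K, V):
    """Scaled dot-product attention. Returns the output and the attention weights."""
    d = Q.size(-1)
    # Each row of w is a probability distribution over the key positions.
    w = torch.softmax((Q @ K.transpose(-2, -1)) / math.sqrt(d), dim=-1)
    return w @ V, w

torch.manual_seed(0)  # fixed seed so the heatmap is the same on every run
X = torch.randn(1, 4, 8)  # (batch, tokens, embedding dim)
Wq, Wk, Wv = [torch.randn(8, 8) for _ in range(3)]  # random, untrained projections
out, w = attention(X @ Wq, X @ Wk, X @ Wv)

# Rows are query positions, columns are key positions.
plt.imshow(w[0].detach().numpy(), cmap='viridis')
plt.colorbar()
plt.xlabel('Key position')
plt.ylabel('Query position')
plt.title('Attention heatmap')
plt.show()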

# TODO: label axes with custom tokens
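
Token labels can be attached with matplotlib's tick methods. In the sketch below the four words in `tokens` are made up for illustration, and the embeddings are still random, so the pattern in the heatmap carries no linguistic meaning yet.

```python
import math
import torch
import matplotlib.pyplot as plt

def attention(Q, K, V):
    d = Q.size(-1)
    w = torch.softmax((Q @ K.transpose(-2, -1)) / math.sqrt(d), dim=-1)
    return w @ V, w

tokens = ['the', 'cat', 'sat', 'down']  # hypothetical tokens, one per row of X

torch.manual_seed(0)
X = torch.randn(1, len(tokens), 8)  # random stand-ins for token embeddings
Wq, Wk, Wv = [torch.randn(8, 8) for _ in range(3)]
out, w = attention(X @ Wq, X @ Wk, X @ Wv)

fig, ax = plt.subplots()
im = ax.imshow(w[0].detach().numpy(), cmap='viridis')
fig.colorbar(im, ax=ax)
# Rows are queries (the token doing the looking); columns are keys (the token looked at).
ax.set_xticks(range(len(tokens)))
ax.set_xticklabels(tokens)
ax.set_yticks(range(len(tokens)))
ax.set_yticklabels(tokens)
ax.set_xlabel('Key (attended to)')
ax.set_ylabel('Query (attending)')
ax.set_title('Attention heatmap')
plt.show()
```

Each row sums to 1, so a bright cell in a row means that query token puts most of its attention on that key token.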


## Exercise 5 — Causal Mask Visual


In [None]:
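seq_len = 6  # number of token positions
# Lower-triangular matrix: entry (i, j) is 1 when position i may attend to position j (j <= i).
mask = torch.tril(torch.ones(seq_len, seq_len))
plt.imshow(mask.numpy(), cmap='gray_r')  # dark = allowed, white = blocked
plt.title('Causal mask')
plt.show()

# TODO: explain why the upper-right triangle is blocked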
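
To see what the mask does in practice, it helps to apply it to a matrix of attention scores. The sketch below assumes random queries and keys and uses `masked_fill` to set the blocked scores to negative infinity before softmax, which is a common way to apply a causal mask rather than the only one.

```python
import math
import torch

torch.manual_seed(0)
seq_len, d = 6, 8
Q = torch.randn(seq_len, d)
K = torch.randn(seq_len, d)

scores = (Q @ K.T) / math.sqrt(d)
mask = torch.tril(torch.ones(seq_len, seq_len))

# Blocked entries become -inf, so softmax assigns them exactly zero weight.
masked_scores = scores.masked_fill(mask == 0, float('-inf'))
w = torch.softmax(masked_scores, dim=-1)

print(w.numpy().round(2))
```

Row i of the printed matrix has non-zero weights only in columns 0 through i: each position can look at itself and earlier positions, never at later ones. The first row puts all of its weight on the first token because nothing else is visible to it.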


## Exercise 6 — Mini Training Curve


In [None]:
# TODO: build a tiny model and plot loss over 20+ steps
# Hint: keep a list called losses, then plt.plot(losses)
pass
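
If you are unsure where to start, the following is a minimal sketch of one possible tiny model, assuming a next-token (bigram) setup on the Exercise 3 sentence. The architecture, optimizer, learning rate and step count are all illustrative choices; try your own design first and compare afterwards.

```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

torch.manual_seed(0)

# Reuse the Exercise 3 sentence as a tiny training corpus.
tokens = 'students can build tiny llms'.split()
vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}
ids = torch.tensor([vocab[t] for t in tokens])

inputs, targets = ids[:-1], ids[1:]  # predict each next token from the current one

# Hypothetical tiny model: embedding lookup followed by a linear layer over the vocabulary.
model = nn.Sequential(
    nn.Embedding(len(vocab), 8),
    nn.Linear(8, len(vocab)),
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()

losses = []
for step in range(50):
    logits = model(inputs)           # shape: (4, vocab size)
    loss = loss_fn(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

plt.plot(losses)
plt.xlabel('Training step')
plt.ylabel('Cross-entropy loss')
plt.title('Mini training curve')
plt.show()
```

With only four training pairs the model can memorize the sentence, so the curve should fall quickly. Change the learning rate or the embedding size and explain how the shape of the curve responds.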


## Reflection Questions
1. Which visualization made attention easiest to understand?
2. How does temperature change randomness in generation?
3. Why does a model need both token embeddings and positional information?
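
For question 3, a small experiment may help you form an answer. The sketch below assumes random vectors as stand-ins for token embeddings (`X`) and positional vectors (`P`), and reuses the `attention` function from Exercise 4. It checks whether shuffling the word order changes anything beyond the order of the output rows.

```python
import math
import torch

def attention(Q, K, V):
    d = Q.size(-1)
    w = torch.softmax((Q @ K.transpose(-2, -1)) / math.sqrt(d), dim=-1)
    return w @ V, w

torch.manual_seed(0)
n_tokens, d = 4, 8
X = torch.randn(n_tokens, d, dtype=torch.float64)  # stand-in token embeddings
P = torch.randn(n_tokens, d, dtype=torch.float64)  # stand-in positional vectors (random here)
Wq, Wk, Wv = [torch.randn(d, d, dtype=torch.float64) for _ in range(3)]

def self_attend(H):
    out, _ = attention(H @ Wq, H @ Wk, H @ Wv)
    return out

perm = torch.tensor([2, 0, 3, 1])  # shuffle the word order

# Without positions: shuffling the tokens only shuffles the output rows.
same_without_pos = torch.allclose(self_attend(X[perm]), self_attend(X)[perm])

# With positions: position p keeps vector P[p], whichever token lands there.
same_with_pos = torch.allclose(self_attend(X[perm] + P), self_attend(X + P)[perm])

print('order ignored without positions:', same_without_pos)
print('order ignored with positions:   ', same_with_pos)
```

Without positional vectors, each token gets the same output no matter where it sits in the sentence, so attention alone cannot tell word orders apart. Adding a position-dependent vector breaks that symmetry.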
