[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/shang-vikas/series1-coding-exercises/blob/main/exercises/blog-09/exercise-00.ipynb)

# 🧪 Exercise 1: Tiny next-token pretraining (toy Transformer)

## Goal

Make next-token prediction explicit: train a tiny Transformer on Tiny Shakespeare (or any short corpus) and report the loss and its perplexity.

In [1]:
# minimal dependencies: PyTorch plus the standard library
import math, time, urllib.request
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# --- Data: tiny shakespeare (small slice) ---
# load_dataset("tiny_shakespeare") now fails with
# "RuntimeError: Dataset scripts are no longer supported", so fetch the raw text instead.
# Assumption: this URL (Karpathy's char-rnn copy) is reachable; any short plain-text corpus works.
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
text = urllib.request.urlopen(url).read().decode("utf-8")[:20000]  # keep tiny for speed
chars = sorted(set(text))
stoi = {c:i for i,c in enumerate(chars)}
itos = {i:c for c,i in stoi.items()}

# characters outside the vocabulary are dropped, so unseen prompt characters do not raise KeyError
def encode(s): return torch.tensor([stoi[c] for c in s if c in stoi], dtype=torch.long)
def decode(t): return "".join(itos[int(x)] for x in t)

seq_len = 64
# each example holds seq_len+1 characters: the first seq_len are inputs, the last seq_len are targets
examples = [encode(text[i:i+seq_len+1]) for i in range(0, len(text)-seq_len, seq_len)]
loader = DataLoader(examples, batch_size=32, shuffle=True)

In [None]:
# --- Tiny Transformer LM ---
class TinyLM(nn.Module):
    def __init__(self, vocab, d=128, nhead=4, nlayers=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, d)
        self.pos_emb = nn.Embedding(seq_len, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=nhead, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=nlayers)
        self.ln = nn.LayerNorm(d)
        self.out = nn.Linear(d, vocab)
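    def forward(self, x):
        b, t = x.shape  # t must be <= seq_len (size of pos_emb)
        positions = torch.arange(t, device=x.device).unsqueeze(0)
        h = self.tok_emb(x) + self.pos_emb(positions)
        # causal mask: True above the diagonal blocks attention to future tokens;
        # without it the model can read the very character it is asked to predict
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.enc(h, mask=causal)
        h = self.ln(h)
        return self.out(h)  # logits shape (b, t, vocab)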

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyLM(len(chars)).to(device)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_f = nn.CrossEntropyLoss()

In [None]:
# --- train one epoch quickly ---
model.train()
start=time.time()
avg_loss=0.0
for i,b in enumerate(loader):
    b = b.to(device)
    inp = b[:, :-1]
    tgt = b[:, 1:]
    logits = model(inp)  # (B, T, V)
    # tgt is a non-contiguous slice, so use reshape (view would raise a RuntimeError)
    loss = loss_f(logits.reshape(-1, logits.size(-1)), tgt.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
    avg_loss += loss.item()
    if i>200: break
nll = avg_loss/(i+1)  # mean per-character cross-entropy, in nats
print("avg loss:", nll, "perplexity:", math.exp(nll), "time:", time.time()-start)

## 🔎 What to Observe

Training minimizes cross-entropy (NLL).

Report NLL and perplexity = exp(NLL). Perplexity can be read as the "effective branching factor": the number of equally likely next characters the model is effectively choosing among.
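As a sanity check on the branching-factor reading, here is a minimal sketch in plain Python. The helper name `perplexity`, the vocabulary size of 65, and the toy probabilities are illustrative assumptions, not values taken from the notebook. A model that spreads probability uniformly over V characters has perplexity V, while a model that assigns high probability to the correct character approaches 1.

```python
import math

def perplexity(true_token_probs):
    """exp of the mean negative log-likelihood assigned to the correct tokens."""
    nll = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(nll)

vocab_size = 65  # hypothetical character vocabulary size
uniform = perplexity([1 / vocab_size] * 10)  # every character equally likely
confident = perplexity([0.9] * 10)           # correct character always gets p = 0.9

print("uniform model:  ", round(uniform, 2))
print("confident model:", round(confident, 2))
```

The uniform case comes out at the vocabulary size, which is roughly what an untrained model should report before the first optimizer step.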

This is exactly next-token prediction.
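To make that concrete, the sketch below lists the (context, target) pairs that a single training window produces. It is plain Python, and `make_pairs` is a hypothetical helper rather than part of the notebook. It mirrors the `b[:, :-1]` / `b[:, 1:]` slicing in the training loop: every position is asked to predict the character that follows it.

```python
def make_pairs(window):
    """Return (context, next_char) pairs for one training window."""
    inputs, targets = window[:-1], window[1:]
    # position i may only use inputs[:i+1] and must predict targets[i]
    return [(inputs[: i + 1], targets[i]) for i in range(len(inputs))]

pairs = make_pairs("To be")
for context, target in pairs:
    print(repr(context), "->", repr(target))
```

A window of `seq_len + 1` characters therefore yields `seq_len` supervised predictions in one forward pass.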

## 💡 Teaching Note

Emphasize that the model never "answers"; it learns statistical continuity.

# 🧪 Exercise 2: Sampling & decoding controls (temperature, top-k, top-p)

## Goal

Show how the same probability distribution yields very different answers depending on decoding.

In [None]:
import torch.nn.functional as F

@torch.no_grad()
def sample_logits(model, prompt, length=100, temp=1.0, top_k=0, top_p=0.0):
    model.eval()
    out = encode(prompt).unsqueeze(0).to(device)
    for _ in range(length):
        ctx = out[:, -seq_len:]  # the model has only seq_len positional embeddings
        logits = model(ctx)[0, -1] / (temp if temp>0 else 1e-9)
        # top-k: keep only the k highest logits
        if top_k>0:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[-1]] = float("-inf")
        # top-p (nucleus): keep the smallest set of top tokens whose mass reaches top_p
        if top_p>0.0:
            sorted_logits, sorted_idx = torch.sort(logits, descending=True)
            cumsum = F.softmax(sorted_logits, dim=-1).cumsum(dim=0)
            remove = cumsum > top_p
            remove[1:] = remove[:-1].clone()  # shift right so the token that crosses top_p is kept
            remove[0] = False                 # always keep the most likely token
            sorted_logits[remove] = float("-inf")
            logits = torch.empty_like(logits).scatter_(0, sorted_idx, sorted_logits)
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        out = torch.cat([out, next_id.unsqueeze(0)], dim=1)
    return decode(out[0].tolist())

print("temp=0.2:", sample_logits(model, "To be, or not", length=80, temp=0.2))
print("temp=1.0:", sample_logits(model, "To be, or not", length=80, temp=1.0))
print("top_k=5:", sample_logits(model, "To be, or not", length=80, temp=1.0, top_k=5))
print("top_p=0.9:", sample_logits(model, "To be, or not", length=80, temp=1.0, top_p=0.9))

## 🔎 What to Observe

Low temperature → conservative, repetitive text (closer to argmax).

High temperature → more varied but often incoherent text.

Top-k and top-p cut off the low-probability tail, trading diversity for coherence.
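The effect of each control is easier to see on a fixed toy distribution than on sampled text. The sketch below is plain Python with made-up logits for a four-token vocabulary. The helper names are hypothetical, and it filters probabilities rather than logits, a simplification of what `sample_logits` does.

```python
import math

def apply_temperature(logits, temp):
    """Softmax of logits / temp; a lower temp sharpens the distribution."""
    scaled = [x / temp for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

logits = [2.0, 1.0, 0.5, -1.0]  # toy next-token scores (illustrative values)
cold = apply_temperature(logits, 0.2)
warm = apply_temperature(logits, 1.0)
hot = apply_temperature(logits, 5.0)

for name, dist in [("temp=0.2", cold), ("temp=1.0", warm), ("temp=5.0", hot),
                   ("top_k=2", top_k_filter(warm, 2)), ("top_p=0.9", top_p_filter(warm, 0.9))]:
    print(f"{name:10s}", [round(x, 3) for x in dist])
```

At low temperature nearly all mass sits on the first token, at high temperature the four tokens approach equal probability, and both filters zero out the tail before renormalizing.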

## 💡 Teaching Note

This is why phrasing matters and why LLMs can sound confident but be wrong: sampling draws from the learned distribution.

# 🧪 Exercise 3: Pretrain → Instruction-tune (tiny simulation)

## Goal

Demonstrate how instruction-tuning is just more next-token examples, and how behaviour shifts.

## Plan

Pretrain the toy LM (Exercise 1).

Create a tiny supervised dataset of instruction→response pairs (~100 examples).

Fine-tune the same LM on concatenated "<INST>prompt</INST>response" examples with teacher forcing (maximize likelihood of response tokens given prompt).

Compare model outputs on test prompts before/after.
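"Maximize likelihood of response tokens given prompt" is usually implemented by excluding prompt positions from the loss. The sketch below shows one way to build such labels in plain Python. `build_sft_example` and the toy vocabulary are hypothetical, and the `<INST>` tags are left out because a character vocabulary built from Shakespeare may not contain `<`, `>`, or `/`. The value -100 is the default `ignore_index` of `torch.nn.CrossEntropyLoss`, so labels built this way would be skipped by a loss created with `nn.CrossEntropyLoss()`.

```python
IGNORE = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_sft_example(prompt, response, stoi):
    """Return (input_ids, labels); targets that are still prompt tokens are ignored."""
    prompt_ids = [stoi[c] for c in prompt if c in stoi]
    response_ids = [stoi[c] for c in response if c in stoi]
    ids = prompt_ids + response_ids
    input_ids, targets = ids[:-1], ids[1:]
    # target i is the token at position i+1, so it belongs to the response
    # once i + 1 >= len(prompt_ids)
    first_response_target = len(prompt_ids) - 1
    labels = [t if i >= first_response_target else IGNORE for i, t in enumerate(targets)]
    return input_ids, labels

toy_stoi = {c: i for i, c in enumerate("abcdefgh? ")}  # hypothetical vocabulary
inp, lab = build_sft_example("abc? ", "fed", toy_stoi)
print("inputs:", inp)
print("labels:", lab)
```

The fine-tuning cell in this exercise takes the simpler route and trains on every token, prompt included, which is still valid next-token training but also spends capacity on modeling the prompts.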

In [None]:
# assume `model` from Ex 1 is available and device set
# 1) make toy SFT dataset
sft_pairs = [
    ("Summarize: Why is the sky blue?", "Because air scatters sunlight; blue scatters more."),
    ("Explain like I'm 5: gravity", "Gravity pulls things together; heavy things pull stronger."),
    # ... add ~100 small pairs; keep tiny
]

def encode_pair(prompt, response):
    seq = (prompt + " " + response)
    return torch.tensor([stoi[c] for c in seq if c in stoi], dtype=torch.long)

sft_examples = [encode_pair(p,r) for p,r in sft_pairs]

In [None]:
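# 2) fine-tune with teacher forcing (very short)
test_prompt = "Summarize: Why is the sky blue?"
print("Pre SFT sample:", sample_logits(model, test_prompt, length=60, temp=0.7))

opt = torch.optim.Adam(model.parameters(), lr=1e-4)
model.train()  # sample_logits() leaves the model in eval mode
for epoch in range(2):
    for ex in sft_examples:
        ex = ex[:seq_len+1].to(device)  # truncate: the model has only seq_len positions
        inp = ex[:-1].unsqueeze(0)
        tgt = ex[1:].unsqueeze(0)
        logits = model(inp)
        loss = loss_f(logits.reshape(-1, logits.size(-1)), tgt.reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()

# 3) sample after fine-tuning and compare with the pre-SFT sample above
print("Post SFT sample:", sample_logits(model, test_prompt, length=60, temp=0.7))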

## 🔎 What to Observe

After SFT, the model is more likely to produce helpful continuations matching training style.

It's still next-token prediction; you shifted the distribution of contexts it sees.

## 💡 Teaching Note

SFT reweights model behaviour; it does not implant new internal modules.
