[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/shang-vikas/series1-coding-exercises/blob/main/exercises/blog-09/exercise-02.ipynb)

# 🚀 Notebook: From Pretraining to Tiny RLHF

This notebook includes:

- Real dataset (Tiny Shakespeare)
- Tiny decoder Transformer
- Log-loss & perplexity visualization
- Gradient norm tracking
- Token probability inspection
- Confidence calibration analysis
- Minimal RLHF-style PPO loop (educational, not production)

Everything wired end-to-end.

This is long because it's real.

## 1️⃣ Setup

In [None]:
%pip install -q torch transformers datasets matplotlib numpy

import math
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from datasets import load_dataset
from transformers import AutoTokenizer
import matplotlib.pyplot as plt
import numpy as np

## 2️⃣ Load Real Dataset (Tiny Shakespeare)

In [None]:
dataset = load_dataset("tiny_shakespeare")
text = dataset["train"][0]["text"]

print(text[:500])

## 3️⃣ Tokenization

In [None]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

tokens = tokenizer(text, return_tensors="pt")["input_ids"][0]
print("Total tokens:", len(tokens))

## 4️⃣ Dataset Windows

In [None]:
block_size = 64

class TextDataset(Dataset):
    def __init__(self, tokens, block_size):
        self.tokens = tokens
        self.block_size = block_size

    def __len__(self):
        return len(self.tokens) - self.block_size

    def __getitem__(self, idx):
        x = self.tokens[idx:idx+self.block_size]
        y = self.tokens[idx+1:idx+self.block_size+1]
        return x, y

train_dataset = TextDataset(tokens, block_size)
loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
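
To make the windowing concrete, here is a minimal sketch on a plain Python list. The names `toy_tokens`, `toy_block_size`, and `get_window` are made up for illustration; the slicing mirrors `TextDataset.__getitem__`, and the target `y` is simply `x` shifted one position to the right.

```python
# Toy stand-in for the token tensor; real IDs come from the tokenizer.
toy_tokens = list(range(10, 20))   # [10, 11, ..., 19]
toy_block_size = 4                 # hypothetical, much smaller than 64

def get_window(tokens, idx, block_size):
    """Mirror TextDataset.__getitem__ on a plain list."""
    x = tokens[idx:idx + block_size]
    y = tokens[idx + 1:idx + block_size + 1]  # same window, shifted by one
    return x, y

# Same formula as TextDataset.__len__: the last window still has a full target.
num_windows = len(toy_tokens) - toy_block_size

for idx in (0, num_windows - 1):
    x, y = get_window(toy_tokens, idx, toy_block_size)
    print(idx, x, y)
```

Each position in `x` is trained to predict the token at the same position in `y`, so one window yields `block_size` next-token examples.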

## 5️⃣ Tiny Decoder Transformer

In [None]:
class TinyTransformer(nn.Module):
    def __init__(self, vocab_size, d_model=128, n_heads=4, n_layers=2):
        super().__init__()

        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(block_size, d_model)

        layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=n_heads,
            dim_feedforward=256,
            batch_first=True
        )

        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        B, T = x.size()

        pos = torch.arange(0, T, device=x.device).unsqueeze(0)
        x = self.token_emb(x) + self.pos_emb(pos)

        mask = torch.triu(torch.ones(T, T, device=x.device), diagonal=1).bool()
        x = self.transformer(x, mask)

        logits = self.lm_head(x)
        return logits
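
The causal mask is what turns an encoder stack into a decoder-style model. As a rough sketch in plain Python (the helper `causal_mask` is hypothetical), this reproduces the pattern of `torch.triu(..., diagonal=1).bool()`, where `True` means "blocked":

```python
def causal_mask(T):
    """True where attention is blocked: position i may not look at any j > i."""
    return [[j > i for j in range(T)] for i in range(T)]

mask = causal_mask(4)
for i, row in enumerate(mask):
    visible = [j for j, blocked in enumerate(row) if not blocked]
    print(f"position {i} attends to {visible}")
```

Position 0 sees only itself, and the last position sees everything before it. That is why the logits at the last position can be used for next-token prediction.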

## 6️⃣ Initialize

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

model = TinyTransformer(tokenizer.vocab_size).to(device)
optimizer = optim.AdamW(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss()

## 7️⃣ Pretraining Loop with Full Instrumentation

In [None]:
loss_history = []
perplexity_history = []
grad_history = []

epochs = 2

for epoch in range(epochs):
    model.train()
    total_loss = 0

    for x, y in loader:
        x, y = x.to(device), y.to(device)

        logits = model(x)
        loss = criterion(logits.view(-1, logits.size(-1)), y.view(-1))

        optimizer.zero_grad()
        loss.backward()

        total_norm = 0
        for p in model.parameters():
            if p.grad is not None:
                total_norm += p.grad.data.norm(2).item() ** 2
        grad_norm = total_norm ** 0.5
        grad_history.append(grad_norm)

        optimizer.step()
        total_loss += loss.item()

    avg_loss = total_loss / len(loader)
    loss_history.append(avg_loss)
    perplexity_history.append(math.exp(avg_loss))

    print(f"Epoch {epoch+1} | Loss {avg_loss:.4f} | Perplexity {math.exp(avg_loss):.2f}")
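
The gradient-norm bookkeeping in the loop is a global L2 norm: square each parameter tensor's norm, sum, then take the square root. A minimal sketch with made-up per-parameter norms (`global_grad_norm` and `toy_norms` are illustrative names):

```python
import math

def global_grad_norm(per_param_norms):
    """Combine per-parameter L2 norms into one global L2 norm."""
    return math.sqrt(sum(n ** 2 for n in per_param_norms))

toy_norms = [3.0, 4.0]              # pretend the model has two parameter tensors
print(global_grad_norm(toy_norms))  # 5.0
```

This is the same quantity you would get by flattening all gradients into one vector and taking its norm. As far as I know, `torch.nn.utils.clip_grad_norm_` returns this total norm as well, which is a common way to track and clip it in one call.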

## 8️⃣ Log-Loss & Perplexity Visualization

In [None]:
plt.plot(loss_history)
plt.title("Training Log Loss")
plt.show()

plt.plot(perplexity_history)
plt.title("Perplexity")
plt.show()
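
Perplexity is just the exponential of the average cross-entropy. The sketch below works from the probabilities the model assigned to the true tokens; the function names are illustrative, not part of the notebook.

```python
import math

def cross_entropy(true_token_probs):
    """Average negative log-probability assigned to the correct tokens."""
    return -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)

def perplexity(true_token_probs):
    return math.exp(cross_entropy(true_token_probs))

# A model that always gives the right token probability 1/4
# is as uncertain as a uniform choice among 4 options.
print(perplexity([0.25] * 4))

# Uniform guessing over the GPT-2 vocabulary (50257 tokens):
# loss is about ln(50257), roughly 10.8, and perplexity is about 50257.
print(cross_entropy([1 / 50257] * 8), perplexity([1 / 50257] * 8))
```

An untrained model should start near that uniform baseline, so the loss curve is easiest to read relative to a starting value of roughly 10.8.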

## 9️⃣ Token Probability Inspection

In [None]:
model.eval()

prompt = "To be or not to"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

with torch.no_grad():
    logits = model(input_ids)
    probs = F.softmax(logits[:, -1, :], dim=-1)

topk = torch.topk(probs, 10)

for idx, prob in zip(topk.indices[0], topk.values[0]):
    print(tokenizer.decode(idx.item()), float(prob))

This shows probability mass, not "answer retrieval."
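
What `F.softmax` and `torch.topk` do here can be sketched in a few lines of plain Python. The vocabulary and logits below are invented for illustration.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Return the k (index, probability) pairs with the highest probability."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

toy_vocab = ["be", "die", "sleep", "the"]   # hypothetical 4-token vocabulary
toy_logits = [2.0, 1.0, 0.5, -1.0]          # hypothetical last-position logits

probs = softmax(toy_logits)
for idx, p in top_k(probs, 2):
    print(toy_vocab[idx], round(p, 3))
```

Every token gets some probability. The top-k list is only a view of where most of the mass sits.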

## 🔟 Confidence Calibration Analysis

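We compare the model's predicted confidence (its top-1 probability) against how often that top-1 prediction is actually correct. Note that this cell reuses the training loader, so it measures calibration on training data.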

In [None]:
model.eval()

confidences = []
correctness = []

with torch.no_grad():
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        logits = model(x)
        probs = F.softmax(logits, dim=-1)

        preds = probs.argmax(dim=-1)
        max_probs = probs.max(dim=-1).values

        correct = (preds == y).float()

        confidences.extend(max_probs.cpu().flatten().numpy())
        correctness.extend(correct.cpu().flatten().numpy())

confidences = np.array(confidences)
correctness = np.array(correctness)

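bins = np.linspace(0, 1, 11)  # 11 edges -> 10 equal-width bins
# Clip so that a confidence of exactly 1.0 falls in the last bin
bin_ids = np.clip(np.digitize(confidences, bins), 1, len(bins) - 1)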

bin_acc = []
bin_conf = []

for b in range(1, len(bins)):
    mask = bin_ids == b
    if mask.sum() > 0:
        bin_acc.append(correctness[mask].mean())
        bin_conf.append(confidences[mask].mean())

plt.plot(bin_conf, bin_acc, marker="o")
plt.plot([0,1],[0,1],"--")
plt.xlabel("Confidence")
plt.ylabel("Accuracy")
plt.title("Calibration Curve")
plt.show()

Perfect calibration would lie on the diagonal.

It won't.
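
One way to turn the curve into a single number is to average the per-bin gap between confidence and accuracy, weighted by bin size. This is often called expected calibration error. The sketch below uses invented data and hypothetical helper names, and is not something the notebook itself computes.

```python
def calibration_bins(confidences, correctness, n_bins=10):
    """Group predictions into equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correctness):
        b = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[b].append((conf, hit))
    rows = []
    for members in bins:
        if members:  # skip empty bins, as the plotting cell does
            avg_conf = sum(c for c, _ in members) / len(members)
            acc = sum(h for _, h in members) / len(members)
            rows.append((avg_conf, acc, len(members)))
    return rows

def expected_calibration_error(rows):
    """Size-weighted average of |accuracy - confidence| over non-empty bins."""
    total = sum(n for _, _, n in rows)
    return sum(n / total * abs(acc - conf) for conf, acc, n in rows)

# Invented example: very confident predictions that are right only half the time.
toy_conf = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
toy_hit = [1, 1, 0, 0, 1, 0]

rows = calibration_bins(toy_conf, toy_hit)
print(rows)
print(expected_calibration_error(rows))
```

A perfectly calibrated model would score 0. The overconfident bin at 0.95 dominates the result here because it holds most of the samples.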

## 1️⃣1️⃣ Minimal RLHF-Style PPO Loop (Educational)

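This is a deliberately simplified loop.

We simulate a "reward model" that prefers shorter responses by scoring each sequence with its negative length. Because `generate_ids` always appends exactly `max_new_tokens` tokens, every sample here gets the same reward; the loop shows the mechanics of a reward-weighted update, not a meaningful preference signal.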

### Generate Function

In [None]:
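@torch.no_grad()  # sampling needs no gradients; the update step recomputes log-probs
def generate_ids(model, input_ids, max_new_tokens=20):
    model.eval()
    for _ in range(max_new_tokens):
        # Crop the context so positions never exceed the pos_emb table
        logits = model(input_ids[:, -block_size:])
        probs = F.softmax(logits[:, -1, :], dim=-1)
        next_token = torch.multinomial(probs, 1)  # sample instead of taking the argmax
        input_ids = torch.cat([input_ids, next_token], dim=1)
    return input_ids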
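
The key line in the generate function is the sampling step. As a small sketch using only the standard library (the probabilities and helper names are made up), this contrasts drawing from the distribution with always taking the most likely token:

```python
import random

def sample_next(probs, rng):
    """Draw one index in proportion to its probability (like torch.multinomial)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def greedy_next(probs):
    """Always pick the most likely index (argmax decoding)."""
    return max(range(len(probs)), key=lambda i: probs[i])

rng = random.Random(0)           # fixed seed so the run is repeatable
toy_probs = [0.6, 0.3, 0.1]      # hypothetical next-token distribution

counts = [0, 0, 0]
for _ in range(1000):
    counts[sample_next(toy_probs, rng)] += 1

print("sampled counts:", counts)
print("greedy choice:", greedy_next(toy_probs))
```

Sampling visits lower-probability tokens roughly in proportion to their mass, which gives a policy-gradient update different sequences to learn from. Greedy decoding would return the same sequence every time.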

### Fake Reward Model

In [None]:
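def reward_function(generated_ids):
    # Return a 0-dim tensor so the training loop can call reward.item()
    return torch.tensor(-float(generated_ids.size(1)), device=generated_ids.device)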

### PPO-Like Update (Simplified)

In [None]:
ppo_optimizer = optim.AdamW(model.parameters(), lr=1e-5)

for step in range(5):
    prompt = "To be"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    generated = generate_ids(model, input_ids)

    reward = reward_function(generated)

    logits = model(generated[:, :-1])
    log_probs = F.log_softmax(logits, dim=-1)

    selected_log_probs = log_probs.gather(
        2, generated[:, 1:].unsqueeze(-1)
    ).squeeze(-1)

    policy_loss = -(selected_log_probs.mean() * reward)

    ppo_optimizer.zero_grad()
    policy_loss.backward()
    ppo_optimizer.step()

    print("Step", step, "Reward", reward.item())

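This is not production PPO: there is no clipped probability ratio, no value baseline, and no KL penalty against a reference model. It is closer to a plain REINFORCE update.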

It demonstrates:

- sampling
- reward
- policy gradient
- small update
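
For contrast, here is a rough sketch of the per-sample clipped surrogate that PPO is usually described with, next to the reward-weighted log-probability term used in the loop above. The function names, the `clip_eps=0.2` default, and the numbers are illustrative assumptions, not code from this notebook.

```python
import math

def reinforce_term(logp, reward):
    """What the loop above maximizes: log-probability weighted by reward."""
    return logp * reward

def ppo_clipped_term(new_logp, old_logp, advantage, clip_eps=0.2):
    """Per-sample PPO surrogate: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    ratio = math.exp(new_logp - old_logp)  # pi_new(a) / pi_old(a)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# The new policy made a rewarded action 1.5x more likely: the gain is capped at 1.2.
print(ppo_clipped_term(math.log(1.5), 0.0, advantage=1.0))

# It made a penalized action 1.5x more likely: the full penalty is kept.
print(ppo_clipped_term(math.log(1.5), 0.0, advantage=-1.0))
```

The clip removes the incentive to move far from the policy that generated the samples, which is the stabilizing ingredient the simplified loop leaves out.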

## 🧠 What This Notebook Actually Teaches

You have now seen:

- Cross-entropy minimization
- Perplexity
- Gradient norms
- Calibration gaps
- Probabilistic decoding
- Reward-based fine-tuning

This is the entire LLM training story at educational scale.

Not magic.

Optimization + probability + scaling.
