# Tiny Transformer LM (GPT-style)

This notebook implements a tiny decoder-only Transformer language model with:

1. Configuration & Reproducible Seeding
2. Data Utilities (character- or word-level corpus, vocabulary, batching)
3. Core modules:
    - scaled dot product attention
    - multi-head self attention
    - feedforward block
    - transformer block
4. "TinyTransformerLM" model class
5. Training loop with cross-entropy loss
6. Logging, simple run-tracking and checkpoints
7. Sampling / text generation

# The Process

1. 'TinyTransformerLM' embeds tokens and positions -> produces initial representations.
2. It sends them through a stack of *TransformerBlocks*.
    Each block consists of:
        - *MultiHeadSelfAttention*: tokens communicate with past tokens and gather context
        - *FeedForward*: transforms each token's representation non-linearly
        - Residual + LayerNorm help stability and gradient flow
3. The final hidden states go through a linear projection -> logits
4. Softmax over the logits gives a probability distribution over the *next token* at each position.
5. Training teaches the model to reduce negative log-likelihood (cross entropy).
6. At generation time, the model samples tokens autoregressively using the same logic.

# Imports

In [1]:
import math
import json
import random
import numpy as np
from dataclasses import dataclass, asdict
from datetime import datetime
from pathlib import Path
import re

# PyTorch imports
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor

# Device and Seeding

In [2]:
# device
device = "cuda" if torch.cuda.is_available() else "cpu"
print("using device: ", device)

using device:  cpu


In [3]:
def set_seed(seed: int, deterministic: bool = True) -> None:
    random.seed(seed)
    np.random.seed(seed)   # numpy is imported above, so seed it as well
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    if deterministic:
        torch.use_deterministic_algorithms(True)

# Config and Run Tracking

In [29]:
@dataclass
class Config:
    # Data
    block_size: int = 8
    batch_size: int = 32
    train_val_split: float = 0.9
    level: str = "word"            # "char" or "word"

    # Model
    d_model: int = 128
    num_heads: int = 4
    num_layers: int = 2
    d_ff: int = 256

    # Optimization
    learning_rate: float = 3e-4
    max_steps: int = 3000
    eval_interval: int = 200

    # Reproducibility
    seed: int = 123

@dataclass
class RunRecord:
    config: dict
    created_at: str
    run_dir: str
    final_step: int = 0
    final_train_loss: float = float("nan")
    final_val_loss: float = float("nan")

cfg = Config()
cfg

Config(block_size=8, batch_size=32, train_val_split=0.9, level='word', d_model=128, num_heads=4, num_layers=2, d_ff=256, learning_rate=0.0003, max_steps=3000, eval_interval=200, seed=123)

# Run Directory and JSON helpers

In [30]:
BASE_RUN_DIR = Path("runs")

def make_run_dir(base_dir: Path, cfg: Config) -> Path:
    stamp = datetime.now().strftime("%Y%m%dT%H%M%S")
    name = f"{stamp}_seed{cfg.seed}_tiny_transformer"
    run_dir = base_dir / name
    (run_dir / "checkpoints").mkdir(parents=True, exist_ok=True)
    return run_dir

def save_json(path: Path, data: dict) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

# Set seed and create run directory

In [31]:
set_seed(cfg.seed, deterministic=True)
run_dir = make_run_dir(BASE_RUN_DIR, cfg)
print("Run directory:", run_dir)

# Initial run record
record = RunRecord(
    config=asdict(cfg),
    created_at=datetime.now().isoformat(timespec="seconds"),
    run_dir=str(run_dir),
)

save_json(run_dir / "config.json", asdict(cfg))
save_json(run_dir / "run_record.json", asdict(record))

Run directory: runs/20251207T113926_seed123_tiny_transformer


# Add text of Gekko speech

In [32]:
raw_text = """[Gekko:] Well, I appreciate the opportunity you're giving me, Mr. Cromwell, as the single largest 
shareholder in Teldar Paper, to speak. Well, ladies and gentlemen, we're not here to indulge in fantasy, but 
in political and economic reality. America, America has become a second-rate power. Its trade deficit and 
its fiscal deficit are at nightmare proportions. Now, in the days of the free market, when our country was a 
top industrial power, there was accountability to the stockholder. The Carnegies, the Mellons, the men that 
built this great industrial empire, made sure of it because it was their money at stake. Today, management 
has no stake in the company! All together, these men sitting up here own less than 3 percent of the company. 
And where does Mr. Cromwell put his million-dollar salary? Not in Teldar stock; he owns less than 1 percent. 
You own the company. That's right -- you, the stockholder. And you are all being royally screwed over by these, 
these bureaucrats, with their steak lunches, their hunting and fishing trips, their corporate jets and golden 
parachutes. [Cromwell:] This is an outrage! You're out of line, Gekko! [Gekko:] Teldar Paper, Mr. Cromwell, Teldar 
Paper has 33 different vice presidents, each earning over 200 thousand dollars a year. Now, I have spent the last 
two months analyzing what all these guys do, and I still can't figure it out. One thing I do know is that our paper 
company lost 110 million dollars last year, and I'll bet that half of that was spent in all the paperwork going back 
and forth between all these vice presidents. The new law of evolution in corporate America seems to be survival of 
the unfittest. Well, in my book you either do it right or you get eliminated. In the last seven deals that I've been 
involved with, there were 2.5 million stockholders who have made a pretax profit of 12 billion dollars. Thank you. I am 
not a destroyer of companies. I am a liberator of them!The point is, ladies and gentleman, that greed -- for lack of a better 
word -- Greed is good. Greed is right. Greed works. Greed clarifies, cuts through, and captures the essence of the evolutionary 
spirit. Greed, in all of its forms -- greed for life, for money, for love, knowledge -- has marked the upward surge of mankind. 
And greed -- you mark my words -- will not only save Teldar Paper, but that other malfunctioning corporation called the USA. 
Thank you very much."""

print("Corpus length (characters):", len(raw_text))
print(raw_text[:200])

Corpus length (characters): 2439
[Gekko:] Well, I appreciate the opportunity you're giving me, Mr. Cromwell, as the single largest 
shareholder in Teldar Paper, to speak. Well, ladies and gentlemen, we're not here to indulge in fanta


# Vocabulary and Encode / Decode

In [33]:
print("Tokenization level:", cfg.level)

if cfg.level == "char":
    # ----- Character-level: each character is a token -----
    tokens = list(raw_text)               # e.g. ["T", "o", " ", "b", "e", ...]
    vocab = sorted(set(tokens))
    
    stoi = {ch: i for i, ch in enumerate(vocab)}
    itos = {i: ch for ch, i in stoi.items()}
    
    def encode(s: str):
        """Encode string to list of integer IDs (char-level)."""
        return [stoi[ch] for ch in s]
    
    def decode(ids):
        """Decode list of integer IDs back to string (char-level)."""
        return "".join(itos[i] for i in ids)

elif cfg.level == "word":
    # ----- Word-level: lightweight word + punctuation tokenizer -----
    def tokenize(s: str):
        """
        Split text into words and punctuation.
        Example:
          "To be, or not to be." ->
          ["To", "be", ",", "or", "not", "to", "be", "."]
        """
        # \w+  = one or more word chars (letters/digits/_)
        # \S   = any non-whitespace character (captures punctuation)
        return re.findall(r"\w+|\S", s)
    
    def detokenize(tokens):
        """
        Join tokens back into a string, fixing spaces before punctuation.
        """
        text = " ".join(tokens)
        # Remove space before common punctuation: "word , next" -> "word, next"
        text = re.sub(r"\s+([.,!?;:])", r"\1", text)
        # Optional: "(" + " word" -> "(word"
        text = re.sub(r"([\(\[\{])\s+", r"\1", text)
        return text
    
    tokens = tokenize(raw_text)
    vocab = sorted(set(tokens))
    
    stoi = {tok: i for i, tok in enumerate(vocab)}
    itos = {i: tok for tok, i in stoi.items()}
    
    def encode(s: str):
        """Encode string to list of integer IDs (word-level)."""
        return [stoi[t] for t in tokenize(s)]
    
    def decode(ids):
        """Decode list of integer IDs back to string (word-level)."""
        toks = [itos[i] for i in ids]
        return detokenize(toks)

else:
    raise ValueError(f"Unknown cfg.level: {cfg.level!r}")

vocab_size = len(vocab)
print("Vocab size:", vocab_size)

# Quick sanity test
test = "Greed is good."
encoded = encode(test)
decoded = decode(encoded)
print("test:   ", test)
print("encoded:", encoded)
print("decoded:", decoded)   

Tokenization level: word
Vocab size: 251
test:    Greed is good.
encoded: [22, 126, 111, 4]
decoded: Greed is good.
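
The round trip above works for simple sentences, but the regex tokenizer is lossy for some inputs. The standalone sketch below re-declares the same two helpers from the cell above and runs them on an illustrative sentence (the sample string is an assumption chosen for the demonstration) to show that apostrophes become separate tokens and are not rejoined by `detokenize`.

```python
import re

def tokenize(s: str) -> list[str]:
    # Same pattern as the notebook cell: a run of word characters,
    # or any single non-whitespace character (punctuation).
    return re.findall(r"\w+|\S", s)

def detokenize(tokens: list[str]) -> str:
    text = " ".join(tokens)
    text = re.sub(r"\s+([.,!?;:])", r"\1", text)   # drop the space before punctuation
    text = re.sub(r"([\(\[\{])\s+", r"\1", text)   # drop the space after an opening bracket
    return text

sample = "you're not here, Gekko!"   # illustrative input
tokens = tokenize(sample)
print(tokens)               # ['you', "'", 're', 'not', 'here', ',', 'Gekko', '!']
print(detokenize(tokens))   # you ' re not here, Gekko!
```

Note also that `encode` looks tokens up directly in `stoi`, so any token that does not occur in the corpus raises a `KeyError`.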


# Quick test

In [34]:
test = "Greed is good"
encoded = encode(test)
decoded = decode(encoded)
print("test:", test)
print("encoded:", encoded)
print("decoded:", decoded)

vocab_size

test: Greed is good
encoded: [22, 126, 111]
decoded: Greed is good


251

# Create tensor and Train/Val split

In [35]:
# Turn full corpus into a long 1D tensor of token IDs
data = torch.tensor(encode(raw_text), dtype=torch.long)
print("Data shape (number of tokens):", data.shape)

# Train/val split
n = int(len(data) * cfg.train_val_split)
train_data = data[:n]
val_data   = data[n:]
print("Train tokens:", len(train_data), "Val tokens:", len(val_data))

Data shape (number of tokens): torch.Size([532])
Train tokens: 478 Val tokens: 54


# Batch sampling

In [36]:
def get_batch(split: str):
    source = train_data if split == "train" else val_data
    B, T = cfg.batch_size, cfg.block_size

    if len(source) <= T + 1:
        raise ValueError(
            f"Not enough tokens in {split} split "
            f"for block_size={T}. len(source)={len(source)}."
        )

    max_start = len(source) - T - 1
    ix = torch.randint(0, max_start, (B,))
    x = torch.stack([source[i : i + T] for i in ix])
    y = torch.stack([source[i + 1 : i + 1 + T] for i in ix])
    return x.to(device), y.to(device)
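
To make the input/target relationship in `get_batch` concrete, here is a minimal sketch on stand-in data. The tensor `source`, the window length `T` and the start index `i` are illustrative assumptions, not values from the notebook.

```python
import torch

# Stand-in for the encoded corpus: token IDs 0..9 (hypothetical data).
source = torch.arange(10)
T = 4   # plays the role of cfg.block_size
i = 3   # one sampled start index

x = source[i : i + T]          # inputs:  tokens at positions i .. i+T-1
y = source[i + 1 : i + 1 + T]  # targets: the same window shifted one step right

print(x.tolist())  # [3, 4, 5, 6]
print(y.tolist())  # [4, 5, 6, 7]

# Position t of y is the token that follows position t of x,
# so one window yields T training examples of growing context.
for t in range(T):
    print(f"context {x[: t + 1].tolist()} -> target {y[t].item()}")
```

Because `torch.randint` excludes its upper bound, `ix` in `get_batch` ranges over `0 .. max_start - 1`, so the last valid window (start index `len(source) - T - 1`) is never drawn. This is harmless here.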

# Scaled dot product attention

In [37]:
def scaled_dot_product_attention(q: Tensor, k: Tensor, v: Tensor, mask: Tensor | None = None) -> Tensor:
    """
    q, k, v: [B, H, T, Hd]
    mask:   [T, T] or None (1 = keep, 0 = mask out)
    Returns: [B, H, T, Hd]
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # [B, H, T, T]

    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
        
    attn = torch.softmax(scores, dim=-1) # should be [B, H, T, T]
    out = attn @ v                       # should be [B, H, T, Hd]
    return out
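
The effect of the causal mask can be checked directly. The sketch below repeats the same computation inline on small random tensors (sizes and seed are illustrative assumptions) and verifies two properties: attention weights above the diagonal are zero, and changing the last position's key and value does not change the outputs at earlier positions.

```python
import math
import torch

torch.manual_seed(0)
B, H, T, Hd = 1, 1, 4, 8             # tiny illustrative sizes
q = torch.randn(B, H, T, Hd)
k = torch.randn(B, H, T, Hd)
v = torch.randn(B, H, T, Hd)
mask = torch.tril(torch.ones(T, T))  # 1 = keep, 0 = mask out (lower-triangular)

scores = q @ k.transpose(-2, -1) / math.sqrt(Hd)       # [B, H, T, T]
scores = scores.masked_fill(mask == 0, float("-inf"))
attn = torch.softmax(scores, dim=-1)
out = attn @ v

print(attn[0, 0])        # lower-triangular weight matrix
print(attn.sum(dim=-1))  # every row sums to 1

# Causality check: perturb only the LAST position's key and value.
k2, v2 = k.clone(), v.clone()
k2[..., -1, :] += 1.0
v2[..., -1, :] += 1.0
scores2 = (q @ k2.transpose(-2, -1) / math.sqrt(Hd)).masked_fill(mask == 0, float("-inf"))
out2 = torch.softmax(scores2, dim=-1) @ v2

# Outputs at positions 0 .. T-2 are unaffected by the change at position T-1.
print(torch.allclose(out[..., :-1, :], out2[..., :-1, :]))  # True
```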

# Multi Head Self Attention

This class asks "which tokens should I pay attention to?". Multi-head self attention enables each token in the sequence to look at other tokens, decide which ones are relevant, and gather information from them.

For language modelling, this is how the model:

    - understands dependencies
    - tracks long range relationships
    - carries semantic information forward
    - builds contextual meaning

<b><u>What happens inside this class:</u></b>

At a high level for each position t:
    1. The model creates a *query, key, value* for every token.
    2. It computes *similarity scores* between the query at position t and the keys at positions up to and including t.
    3. Using softmax it converts these scores into *attention weights*.
    4. It uses these weights to take a *weighted sum of the value vectors*.
    5. This becomes the representation of the token in the next layer.

<b><u>Why "multi-head":</u></b>
Each head can learn a different pattern, e.g. it is possible that:
    * One head tracks syntax
    * One tracks names
    * One tracks verb tense
    * One tracks quotation boundaries, and so on...

<b><u>Why Causal Masking:</u></b>
It is important that token t does not see the future (i.e. t+1, t+2,...), otherwise the model would cheat during training. Masking ensures that the model truly learns: *given the past, predict the next token*.

### MultiHeadSelfAttention teaches the model how tokens relate to each other and forms the backbone of reasoning and contextual understanding in LLMs.

In [38]:
class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int, block_size: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)

        mask = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer("causal_mask", mask)

    def forward(self, x: Tensor) -> Tensor:
        B, T, D = x.shape
        H, Hd = self.num_heads, self.head_dim

        q = self.W_q(x)  # [B, T, D]
        k = self.W_k(x)
        v = self.W_v(x)

        # [B, T, D] -> [B, H, T, Hd]
        q = q.view(B, T, H, Hd).transpose(1, 2)
        k = k.view(B, T, H, Hd).transpose(1, 2)
        v = v.view(B, T, H, Hd).transpose(1, 2)

        out = scaled_dot_product_attention(q, k, v, mask=self.causal_mask[:T, :T])

        # [B, H, T, Hd] -> [B, T, D]
        out = out.transpose(1, 2).contiguous().view(B, T, D)
        out = self.W_o(out)
        return out
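
The `view` / `transpose` pair in `forward` is the step that turns one `D`-wide projection into `H` separate heads. The following sketch, with illustrative sizes that are assumptions for the demonstration, shows which features each head receives and that the merge step restores the original layout.

```python
import torch

B, T, H, Hd = 2, 3, 4, 5   # illustrative sizes
D = H * Hd
x = torch.arange(B * T * D, dtype=torch.float32).view(B, T, D)

# Split: [B, T, D] -> [B, T, H, Hd] -> [B, H, T, Hd]
heads = x.view(B, T, H, Hd).transpose(1, 2)
print(heads.shape)   # torch.Size([2, 4, 3, 5])

# Head h receives the feature slice h*Hd : (h+1)*Hd of every token.
print(torch.equal(heads[:, 1], x[:, :, Hd : 2 * Hd]))   # True

# Merge: [B, H, T, Hd] -> [B, T, H, Hd] -> [B, T, D].
# transpose returns a non-contiguous view, so .contiguous() is needed before .view.
merged = heads.transpose(1, 2).contiguous().view(B, T, D)
print(torch.equal(merged, x))   # True
```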

# FeedForward 

### Purpose in an LLM - "Transform the information that I've gathered"

After self attention mixes information across positions, the feedforward network applies a *nonlinear transformation* to each token independently. This step allows the model to:

    - extract higher level features
    - create richer internal representations
    - perform local computations over each token's embedding
    - increase the model's expressive capacity

<b><u>What it does</u></b>
Inside the TransformerBlock, the FeedForward component is implemented using PyTorch's `nn.Sequential`, which applies a small neural network of the form:

$\text{FFN}(x) = W_2\,\mathrm{ReLU}(W_1 x + b_1) + b_2$

This: 

    - Increases dimensionality (*expansion* via W1)
    - Applies a nonlinearity (ReLU)
    - Projects back to the original dimension (*compression* via W2)

This pattern (expand -> nonlinear -> compress) lets the transformer model complex functions that cannot be captured by attention alone.
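
As a small illustration of the expand -> nonlinear -> compress pattern and of the "each token independently" property, here is a sketch with reduced sizes (the dimensions 8 and 16 and the seed are assumptions for the demonstration; the notebook config uses 128 and 256).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 8, 16   # illustrative sizes
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),   # expansion
    nn.ReLU(),                  # nonlinearity
    nn.Linear(d_ff, d_model),   # compression
)

x = torch.randn(2, 5, d_model)   # [B, T, D]
hidden = ffn[1](ffn[0](x))       # [B, T, d_ff], after expansion + ReLU
y = ffn(x)                       # [B, T, D], back to the model dimension
print(hidden.shape, y.shape)

# Position-wise: running one token on its own gives the same result
# as running it inside the full sequence, because no mixing across T happens here.
single = ffn(x[:, 2:3, :])
print(torch.allclose(single, y[:, 2:3, :], atol=1e-5))   # True
```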
                                

In [39]:
class FeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: Tensor) -> Tensor:
        return self.net(x)

# TransformerBlock

A *TransformerBlock* combines several key components that work together to progressively refine token representations as information flows through the model. Each block begins with <i>layer normalisation</i>, followed by <i>multi-head self attention</i>, which allows each token to gather contextual information from earlier tokens in the sequence. A <i>residual connection</i> then preserves the original representation while adding the newly computed attention output, stabilising learning and improving gradient flow. The block then applies a second layer normalisation before passing each token through a <i>feedforward network</i>, which performs a nonlinear transformation that enriches and expands the representation locally. Another residual connection integrates this transformation with the token's prior state. Stacking multiple blocks in depth enables the model to build a representational hierarchy: lower layers tend to capture surface-level patterns such as character sequences, middle layers learn grammatical structure and short-range semantics, and upper layers develop broader, more abstract meaning. Residual pathways throughout the block, mathematically of the form $x_{\text{next}} = x + f(x)$, ensure stable optimisation by allowing gradients to propagate effectively, preventing representational collapse and making deeper transformers trainable.
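
The gradient-flow claim can be made concrete with a short derivation for a single residual step. The loss symbol $\mathcal{L}$ and the Jacobian notation are introduced only for this illustration.

$$
x_{\text{next}} = x + f(x)
\quad\Longrightarrow\quad
\frac{\partial x_{\text{next}}}{\partial x} = I + \frac{\partial f}{\partial x},
\qquad
\frac{\partial \mathcal{L}}{\partial x}
= \left(I + \frac{\partial f}{\partial x}\right)^{\top} \frac{\partial \mathcal{L}}{\partial x_{\text{next}}}.
$$

The identity term passes the upstream gradient through unchanged, so even when $\partial f / \partial x$ is small, the gradient reaching earlier layers does not vanish at this step.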

In [40]:
class TransformerBlock(nn.Module):
    def __init__(self, d_model: int, num_heads: int, d_ff: int, block_size: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = MultiHeadSelfAttention(d_model, num_heads, block_size)
        self.ff = FeedForward(d_model, d_ff)

    def forward(self, x: Tensor) -> Tensor:
        # GPT-style pre-norm + residual
        x = x + self.attn(self.ln1(x))
        x = x + self.ff(self.ln2(x))
        return x      

# TinyTransformerLM Model

`TinyTransformerLM` (token -> meaning -> prediction) wraps the entire transformer architecture and produces logits at each position, which we use for next-token prediction.

<b><u>Workflow inside the forward</u></b>

1. Token embeddings - convert token IDs into vectors
2. Position embeddings - add information about position in the sequence
3. Pass through L Transformer blocks - each block refines and contextualises the embeddings
4. Final `LayerNorm` and `Linear` layer - convert the final hidden states into logits over the vocabulary

<b><u>Why Logits</u></b>

Because `CrossEntropyLoss` takes raw logits (it applies log-softmax internally) and computes $-\log P_\theta(x_t \mid x_{<t})$. This gives next-token prediction because, given input tokens $x_0, x_1, \dots, x_{t-1}$, the model predicts a distribution over $x_t$ for every position $t$ in the sequence simultaneously.
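
The relationship between logits and the negative log-likelihood can be verified by hand. In the sketch below the random `logits` and `targets` are stand-ins for model output and next-token IDs (sizes and seed are assumptions); the manual computation matches `F.cross_entropy` used in the same flattened form as the training loop.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, V = 2, 3, 5                       # illustrative sizes
logits = torch.randn(B, T, V)           # stand-in for model output
targets = torch.randint(0, V, (B, T))   # stand-in for next-token IDs

# Built-in: flatten all positions, then average the per-position losses.
loss = F.cross_entropy(logits.view(B * T, V), targets.view(B * T))

# By hand: log-softmax over the vocabulary, pick the log-probability of the
# correct next token at every position, negate, and average.
log_probs = F.log_softmax(logits, dim=-1)                          # [B, T, V]
picked = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)   # [B, T]
manual = -picked.mean()

print(loss.item(), manual.item())
print(torch.allclose(loss, manual))   # True
```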

In [41]:
class TinyTransformerLM(nn.Module):
    def __init__(self, vocab_size: int, cfg: Config):
        super().__init__()
        self.cfg = cfg
        self.token_emb = nn.Embedding(vocab_size, cfg.d_model)
        self.pos_emb = nn.Embedding(cfg.block_size, cfg.d_model)

        self.blocks = nn.ModuleList([
            TransformerBlock(cfg.d_model, cfg.num_heads, cfg.d_ff, cfg.block_size)
            for _ in range(cfg.num_layers)
        ])

        self.ln_f = nn.LayerNorm(cfg.d_model)
        self.head = nn.Linear(cfg.d_model, vocab_size, bias=False)

        # Optional: weight tying
        # self.head.weight = self.token_emb.weight

    def forward(self, idx: Tensor) -> Tensor:
        B, T = idx.shape
        assert T <= self.cfg.block_size, "Sequence length > block_size"

        tok = self.token_emb(idx)                         # [B, T, D]
        pos_ids = torch.arange(T, device=idx.device)
        pos = self.pos_emb(pos_ids)[None, :, :]           # [1, T, D]
        x = tok + pos                                     # [B, T, D]

        for block in self.blocks:
            x = block(x)

        x = self.ln_f(x)
        logits = self.head(x)                             # [B, T, V]
        return logits

model = TinyTransformerLM(vocab_size, cfg).to(device)
print("Number of parameters:", sum(p.numel() for p in model.parameters()))

Number of parameters: 329472
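
The printed parameter count can be reproduced by hand from the layer definitions above. This is plain arithmetic using the config values and the vocabulary size of 251 reported earlier. The causal mask is a buffer, not a parameter, so it does not contribute.

```python
vocab_size, block_size = 251, 8
d_model, num_layers, d_ff = 128, 2, 256

token_emb = vocab_size * d_model    # 32,128
pos_emb = block_size * d_model      # 1,024
attn = 4 * d_model * d_model        # W_q, W_k, W_v, W_o, all bias-free: 65,536
ff = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)   # two Linear layers with bias: 65,920
layer_norm = 2 * d_model            # weight + bias: 256
block = attn + ff + 2 * layer_norm  # one TransformerBlock: 131,968
head = d_model * vocab_size         # output projection, no bias, not tied: 32,128

total = token_emb + pos_emb + num_layers * block + layer_norm + head
print(total)   # 329472
```

Note that `num_heads` does not appear: splitting `d_model` into heads changes how the projections are used, not how many weights they hold.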


# Loss Estimation Helper

In [42]:
@torch.no_grad()
def estimate_loss():
    model.eval()
    out = {}
    for split in ["train", "val"]:
        losses = []
        for _ in range(10):  # a few mini-batches
            xb, yb = get_batch(split)
            logits = model(xb)
            B, T, V = logits.shape
            loss = F.cross_entropy(
                logits.view(B * T, V),
                yb.view(B * T),
            )
            losses.append(loss.item())
        out[split] = sum(losses) / len(losses)
    model.train()
    return out

# Training Loop with logging & checkpoints

In [43]:
optimizer = torch.optim.Adam(model.parameters(), lr=cfg.learning_rate)

best_val_loss = float("inf")

for step in range(cfg.max_steps):
    # Periodic evaluation
    if step % cfg.eval_interval == 0:
        losses = estimate_loss()
        train_loss, val_loss = losses["train"], losses["val"]
        print(f"step {step:5d}: train loss {train_loss:.4f}, val loss {val_loss:.4f}")

        # Update run record
        record.final_step = step
        record.final_train_loss = float(train_loss)
        record.final_val_loss = float(val_loss)
        save_json(run_dir / "run_record.json", asdict(record))

        # Save checkpoint if best so far
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            ckpt_path = run_dir / "checkpoints" / "best.pt"
            torch.save(model.state_dict(), ckpt_path)
            print(f"  → New best val loss, checkpoint saved to {ckpt_path}")

    # Sample a training batch
    xb, yb = get_batch("train")

    # Forward
    logits = model(xb)                         # [B, T, V]
    B, T, V = logits.shape
    loss = F.cross_entropy(
        logits.view(B * T, V),
        yb.view(B * T),
    )

    # Backward
    optimizer.zero_grad()
    loss.backward()
    # Optional gradient clipping
    # torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()

# Final save of model
final_ckpt = run_dir / "checkpoints" / "last.pt"
torch.save(model.state_dict(), final_ckpt)
print("Final checkpoint saved to:", final_ckpt)

step     0: train loss 5.6555, val loss 5.7886
  → New best val loss, checkpoint saved to runs/20251207T113926_seed123_tiny_transformer/checkpoints/best.pt
step   200: train loss 1.3826, val loss 5.2300
  → New best val loss, checkpoint saved to runs/20251207T113926_seed123_tiny_transformer/checkpoints/best.pt
step   400: train loss 0.3611, val loss 5.8539
step   600: train loss 0.2407, val loss 5.9890
step   800: train loss 0.2000, val loss 6.2337
step  1000: train loss 0.1777, val loss 6.4424
step  1200: train loss 0.1874, val loss 6.5780
step  1400: train loss 0.1748, val loss 6.7443
step  1600: train loss 0.1733, val loss 6.9648
step  1800: train loss 0.1896, val loss 7.0861
step  2000: train loss 0.1766, val loss 7.0600
step  2200: train loss 0.1802, val loss 7.2909
step  2400: train loss 0.1615, val loss 7.1715
step  2600: train loss 0.1646, val loss 7.2619
step  2800: train loss 0.1837, val loss 7.4530
Final checkpoint saved to: runs/20251207T113926_seed123_tiny_transformer/chec
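
A quick sanity check on the log above: an untrained model should score close to a uniform guess over the vocabulary, whose cross-entropy is $\ln V$. The sketch below computes that baseline and converts a few logged validation losses to perplexity. The loss values are copied from the log, and the vocabulary size is the 251 printed earlier.

```python
import math

vocab_size = 251                      # from the vocabulary cell
uniform_loss = math.log(vocab_size)   # cross-entropy of a uniform guess
print(f"{uniform_loss:.4f}")          # 5.5255

# Perplexity is exp(loss); losses copied from the training log.
logged = [("step 0 val", 5.7886), ("step 200 val", 5.2300), ("step 2800 val", 7.4530)]
for name, loss in logged:
    print(f"{name}: perplexity ~ {math.exp(loss):.0f}")
```

The step-0 losses (5.66 train, 5.79 val) sit near this baseline, as expected for random initialisation. The best validation loss (5.23 at step 200) is only slightly below it, while the training loss falls to about 0.17. The model is memorising the training split rather than generalising, which is why `best.pt` is last updated at step 200.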

# Sampling / Text generation

In [53]:
@torch.no_grad()
def generate(model: TinyTransformerLM, idx: Tensor, max_new_tokens: int) -> Tensor:
    model.eval()
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -cfg.block_size:]   # crop context
        logits = model(idx_cond)              # [1, T, V]
        logits = logits[:, -1, :]             # last position [1, V]
        probs = torch.softmax(logits, dim=-1) # [1, V]
        next_id = torch.multinomial(probs, num_samples=1)  # [1, 1]
        idx = torch.cat([idx, next_id], dim=1)
    return idx

# Example: start from the word "accountability" (must be a single token in the vocabulary)
start_gekko_letter="accountability"
generate_gekko_tokens=50
start_ids = torch.tensor([[stoi[start_gekko_letter]]], dtype=torch.long, device=device)
sample_ids = generate(model, start_ids, max_new_tokens=generate_gekko_tokens)
print(decode(sample_ids[0].tolist()))

accountability to the stockholder. The Carnegies, the Mellons, the men that built this great industrial empire, made sure of it because it was their money at stake. Today, management has no stake in the company! All together, these men sitting up here own
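
One iteration of the `generate` loop can be traced in isolation. In the sketch below, `dummy_model` is a hypothetical stand-in that returns random logits (it is not the trained model), and the sizes are illustrative. The point is the sequence of shapes: crop, take the last position, softmax, sample, append.

```python
import torch

torch.manual_seed(0)
block_size, V = 4, 6   # illustrative sizes

def dummy_model(idx: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in for the trained model: random logits of shape [B, T, V]."""
    B, T = idx.shape
    assert T <= block_size   # same constraint as TinyTransformerLM.forward
    return torch.randn(B, T, V)

idx = torch.tensor([[1, 2, 3, 4, 5]])      # 5 tokens so far, longer than block_size
idx_cond = idx[:, -block_size:]            # keep only the last block_size tokens
logits = dummy_model(idx_cond)[:, -1, :]   # logits for the next token, [1, V]
probs = torch.softmax(logits, dim=-1)      # [1, V], sums to 1
next_id = torch.multinomial(probs, num_samples=1)   # sample one token ID, [1, 1]
idx = torch.cat([idx, next_id], dim=1)     # append; the sequence grows by one

print(idx_cond.tolist())   # [[2, 3, 4, 5]]
print(idx.shape)           # torch.Size([1, 6])
```

The crop is required because the position embedding table only has `block_size` rows, so the model cannot be called on a longer context.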


# Conclusion

This toy experiment highlights how strongly a model’s behaviour depends on both the <b><u>tokenisation level and the amount of training data</u></b>. When using word-level tokenisation on an extremely small corpus, the model sees each token only a handful of times (here 251 distinct tokens across 532 in total) and has very limited contextual variety. In this setting, the model cannot learn meaningful generalisations about language; instead, it effectively builds a shallow transition table over the handful of words it has seen. As a result, when prompted with a word such as “Greed” from the Gekko text, the model simply predicts one of the next words it encountered in the tiny training text. The apparent “generation” is therefore little more than a recombination or repetition of the original training sentences. This is expected and entirely appropriate for a dataset and model of this scale.

In contrast, character-level tokenisation uses a much smaller vocabulary and a finer-grained modelling space, and it yields many more training tokens from the same text (2439 characters versus 532 word-level tokens here). Although a tiny character-level model still cannot learn complex semantics, it can capture patterns in spelling, punctuation, and short substrings. This often allows it to generate novel character sequences, sometimes slightly garbled versions of the training text, that reflect an attempt to generalise beyond direct memorisation. In other words, character-level models tend to overfit differently: they memorise local structure rather than whole words, which gives them a little more flexibility.

Together, these observations demonstrate a central principle in language modelling: <b><u>the capacity of the model and the richness of the token vocabulary must be matched to the amount of data available</u></b>. With tiny datasets, both word- and character-level models inevitably overfit, but in distinct ways. Word-level models tend to reproduce the training corpus almost verbatim, while character-level models exhibit more local creativity but remain fundamentally constrained. Understanding this helps build intuition for why modern LLMs rely on subword tokenisation and large-scale corpora, both choices that dramatically expand a model’s ability to generalise beyond memorisation.