<img src="https://theaiengineer.dev/tae_logo_gw_flatter.png" width=35% align=right>

# Building a Large Language Model from Scratch — A Step-by-Step Guide Using Python and PyTorch
## Chapter 13 — Improvements & Extensions
**© Dr. Yves J. Hilpisch**<br>AI-Powered by GPT-5.

## How to Use This Notebook

- Prototype improvement ideas such as LoRA, adapters, or data augmentation.
- Measure the trade-offs between accuracy gains and computational costs.
- Document experiments thoroughly so you can reproduce winners later.

### Roadmap

We experiment with several upgrade paths, each isolated so you can evaluate impact independently.

### Study Tips

Change one variable at a time. Rapid iteration is tempting, but disciplined ablations reveal what truly matters.

This notebook demonstrates the training improvements from Chapter 13:

- Mixed precision (AMP) on CUDA for speed
- Gradient clipping for stability
- Warmup + cosine learning-rate schedule
- Gradient accumulation to emulate larger batches

Each cell creates one object and shows it immediately, matching the book's creation rule. Comments explain why each step matters.


In [None]:
# Torch setup: install the CPU wheel only if torch is missing.
import sys, subprocess
import contextlib  # used later for the no-op autocast context
try:
    import torch  # noqa: F401
except ImportError:
    idx = 'https://download.pytorch.org/whl/cpu'
    subprocess.check_call([sys.executable, '-m', 'pip', 'install',
                           '--index-url', idx, 'torch'])
    import torch  # noqa: F401
torch.manual_seed(0)  # reproducible init and data
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device


In [None]:
# Tiny language model: embedding + linear head.
# Used to demonstrate mechanics without heavy compute.
class TinyLM(torch.nn.Module):
    def __init__(self, V=64, D=64):
        """Create a minimal LM with vocabulary V and hidden size D.
        Embedding maps ids->vectors; linear head maps vectors->logits.
        """
        super().__init__()
        self.emb = torch.nn.Embedding(V, D)
        self.lin = torch.nn.Linear(D, V)
    def forward(self, x, y=None):
        """Return (logits, loss). Loss is CE if targets y are given.
        """
        h = self.emb(x)
        logits = self.lin(h)
        loss = None
        if y is not None:
            B,T,V = logits.shape
            loss = torch.nn.functional.cross_entropy(
                logits.reshape(B*T, V), y.reshape(B*T))
        return logits, loss
TinyLM()


In [None]:
# Data: random ids to exercise the loop
V, T, B = 64, 32, 64
ids = torch.randint(0, V, (B, T))
ids.shape
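
As a reference point for the losses logged later, the following sketch computes the cross-entropy of a uniform guess over the vocabulary, which equals ln(V). The names `uniform_logits`, `targets`, and `baseline` are illustrative and not part of the notebook. Note that the training loop below uses the inputs as targets, so its loss can drop below this value even on random ids.

```python
import math
import torch

V, T, B = 64, 32, 64  # same sizes as the data cell above

# All-zero logits give a uniform softmax over the V tokens.
uniform_logits = torch.zeros(B * T, V)
targets = torch.randint(0, V, (B * T,))
baseline = torch.nn.functional.cross_entropy(uniform_logits, targets).item()

print(f"uniform-guess loss: {baseline:.4f}")
print(f"ln(V):              {math.log(V):.4f}")
```

A freshly initialized model usually starts somewhat above ln(V), because its logits are not exactly uniform.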


In [None]:
# Warmup + cosine schedule: returns a multiplier for the base LR.
# Warmup ramps 1/warmup->1; cosine glides 1->minr over remaining steps.
import math
def warmup_cosine_lambda(warmup, total, minr=0.1):
    """Return a LambdaLR-compatible function.
    warmup: warmup steps; total: total steps; minr: floor ratio.
    """
    def f(step):
        s = step + 1  # 1-based, so the first step has a nonzero LR
        if s <= warmup:
            return s / float(warmup)
        # Clamp so steps past `total` stay at the floor.
        frac = min(1.0, (s - warmup) / max(1, total - warmup))
        return minr + (1 - minr) * 0.5 * (1 + math.cos(math.pi * frac))
    return f
warmup_cosine_lambda(10, 100)(0)
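
To make the shape of the schedule concrete, this sketch probes the multiplier at a few steps. The function is restated so the block runs on its own; the clamp on `frac` is an added safeguard for steps past `total` and does not change the values probed here. `probe_steps` and `factors` are illustrative names.

```python
import math

def warmup_cosine_lambda(warmup, total, minr=0.1):
    def f(step):
        s = step + 1
        if s <= warmup:
            return s / float(warmup)
        # Assumption: clamp so steps past `total` stay at the floor.
        frac = min(1.0, (s - warmup) / max(1, total - warmup))
        return minr + (1 - minr) * 0.5 * (1 + math.cos(math.pi * frac))
    return f

f = warmup_cosine_lambda(10, 100, 0.1)
probe_steps = [0, 4, 9, 54, 99]
factors = {s: round(f(s), 4) for s in probe_steps}
for s, v in factors.items():
    print(f"step {s:3d}: lr factor {v:.4f}")
# step   0: lr factor 0.1000  (start of warmup)
# step   4: lr factor 0.5000
# step   9: lr factor 1.0000  (end of warmup)
# step  54: lr factor 0.5500  (halfway through the cosine)
# step  99: lr factor 0.1000  (floor, minr)
```

The effective learning rate is the optimizer's base `lr` times this factor, so a base of 3e-4 peaks at 3e-4 and ends at 3e-5.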


In [None]:
import contextlib
# Train with AMP (CUDA), clipping, accumulation, and schedule.
# On CPU/MPS, AMP is disabled and training runs in full precision.
steps, accum, clip = 100, 4, 1.0
updates = steps // accum  # optimizer updates; the scheduler counts these
model = TinyLM().to(device)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
# Warmup and horizon are given in optimizer updates, not micro-steps,
# because sched.step() runs once per accumulated update.
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, warmup_cosine_lambda(3, updates, 0.1))
try:
    scaler = torch.amp.GradScaler('cuda', enabled=(device == 'cuda'))
except Exception:
    # Older PyTorch releases only expose the CUDA-specific scaler.
    scaler = torch.cuda.amp.GradScaler(enabled=(device == 'cuda'))
hist = []
model.train()
opt.zero_grad(set_to_none=True)
for step in range(steps):
    # Targets equal inputs here, so the loss can fall on random data.
    x = ids.to(device)
    y = ids.to(device)
    autocast_ctx = (
        torch.amp.autocast('cuda', dtype=torch.float16)
        if device == 'cuda' else contextlib.nullcontext()
    )
    with autocast_ctx:
        _, loss = model(x, y)
    # Divide by accum so the summed gradients average over micro-batches.
    scaled = loss / accum
    if device == 'cuda':
        scaler.scale(scaled).backward()
    else:
        scaled.backward()
    if (step + 1) % accum == 0:
        if device == 'cuda':
            scaler.unscale_(opt)  # clip true gradients, not scaled ones
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        if device == 'cuda':
            scaler.step(opt)
            scaler.update()
        else:
            opt.step()
        opt.zero_grad(set_to_none=True)
        sched.step()
    if step % 10 == 0:
        hist.append(loss.item())  # log the unscaled loss
hist[:5]
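
Gradient accumulation emulates a larger batch only if each micro-batch loss is divided by the number of accumulation steps, so that the summed gradients form an average. The sketch below checks this on a toy problem; the linear model, the tensor shapes, and the names `full_grad` and `accum_grad` are illustrative assumptions, not part of the notebook.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 4)
x = torch.randn(16, 8)
y = torch.randint(0, 4, (16,))
loss_fn = torch.nn.functional.cross_entropy  # mean reduction by default

# Reference: one backward pass over the full batch of 16.
model.zero_grad(set_to_none=True)
loss_fn(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Accumulation: 4 equal micro-batches of 4, each loss divided by accum.
accum = 4
model.zero_grad(set_to_none=True)
for xb, yb in zip(x.chunk(accum), y.chunk(accum)):
    (loss_fn(model(xb), yb) / accum).backward()  # grads add up in .grad
accum_grad = model.weight.grad.clone()

max_diff = (full_grad - accum_grad).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```

The two gradients agree up to floating-point rounding. Without the division, the accumulated gradient would be `accum` times larger, which interacts with both the clipping threshold and the learning rate.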


## Exercises

- Implement LoRA for one transformer layer and compare its training speed with full fine-tuning of the same layer.
- Try a data augmentation technique and report its effect on validation metrics.
- Create a decision matrix that scores each extension by cost, complexity, and expected impact.
