<img src="https://theaiengineer.dev/tae_logo_gw_flatter.png" width=35% align=right>

# Building a Large Language Model from Scratch — A Step-by-Step Guide Using Python and PyTorch
## Chapter 5 — PyTorch Essentials
**© Dr. Yves J. Hilpisch**<br>AI-Powered by GPT-5.

## How to Use This Notebook

- Rehearse core tensor manipulations you will reuse in every subsequent notebook.
- Probe autograd with small, interpretable examples before scaling up.
- Experiment with optimizers and schedulers on toy problems to gauge their behavior.

### Roadmap

We begin with tensors and broadcasting, explore gradients and computational graphs, and wrap up with a compact training loop that mirrors the full attoLLM pipeline.

### Study Tips

Annotate the outputs of each section. Understanding now *why* gradients and shapes look the way they do will save hours later, when you debug the real model.

In [None]:
# Install PyTorch in Colab or when torch is missing.
try:
    import torch  # noqa:F401
    print('torch:', torch.__version__)
except Exception:
    import os
    gpu = os.system('nvidia-smi > /dev/null 2>&1') == 0
    index = 'https://download.pytorch.org/whl/cu121' if gpu else 'https://download.pytorch.org/whl/cpu'
    get_ipython().run_line_magic('pip', f'install -q torch --index-url {index}')
    import torch
    print('torch:', torch.__version__, 'cuda?', torch.cuda.is_available())


In [None]:
# Pick device
import torch
def pick_device():
    if torch.cuda.is_available():
        return torch.device('cuda')
    mps = getattr(torch.backends, 'mps', None)
    if mps is not None and mps.is_available():
        return torch.device('mps')
    return torch.device('cpu')
device = pick_device()
device


## Tensors and Shapes

In [None]:
x = torch.arange(6, dtype=torch.float32).reshape(2, 3)
y = torch.ones_like(x)
x.shape, y.shape, x.device


In [None]:
b = torch.tensor([10.0, 20.0, 30.0])
(x + b).shape, (x + b)[0]
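
The addition above works because broadcasting aligns shapes from the right: the vector of shape `(3,)` is treated as `(1, 3)` and repeated along the first dimension. The sketch below spells that out and shows one shape pair that does not broadcast. The `_demo` names are illustrative only, chosen so the cell does not overwrite notebook variables.

```python
import torch

x_demo = torch.arange(6, dtype=torch.float32).reshape(2, 3)  # shape (2, 3)
b_demo = torch.tensor([10.0, 20.0, 30.0])                    # shape (3,)

# Implicit broadcasting: b_demo is treated as shape (1, 3) and repeated along dim 0.
broadcast_sum = x_demo + b_demo

# The same result with the expansion written out explicitly.
explicit_sum = x_demo + b_demo.unsqueeze(0).expand(2, 3)
same = torch.equal(broadcast_sum, explicit_sum)

# A column vector of shape (2, 1) broadcasts along dim 1 instead.
col_demo = torch.tensor([[100.0], [200.0]])
col_sum = x_demo + col_demo

# Shapes that do not match from the right, (2, 3) vs (2,), raise an error.
try:
    x_demo + torch.ones(2)
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True

print(tuple(broadcast_sum.shape), same, col_sum[1].tolist(), mismatch_raised)
```

When a shape error appears in a later notebook, comparing the trailing dimensions of both operands is usually the quickest first check.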


In [None]:
x = x.to(device)
x.device


## Autograd in a Nutshell

In [None]:
w = torch.tensor([2.0, -3.0, 0.5], requires_grad=True)
v = torch.tensor([1.0, 2.0, 3.0])
loss = (w * v).sum()
loss.backward()
w.grad
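
Because the loss is the sum of the elementwise products of `w` and `v`, its derivative with respect to each `w[i]` is simply `v[i]`, so `w.grad` should equal `v`. The sketch below checks this and also illustrates that gradients accumulate across `backward()` calls until they are reset, which is why training loops call `zero_grad()`. The `_demo` names are illustrative and not part of the notebook.

```python
import torch

w_demo = torch.tensor([2.0, -3.0, 0.5], requires_grad=True)
v_demo = torch.tensor([1.0, 2.0, 3.0])

# First backward pass: d(sum(w * v)) / dw = v.
(w_demo * v_demo).sum().backward()
first_grad = w_demo.grad.clone()

# Second backward pass on a fresh graph: the new gradient is ADDED to the old one.
(w_demo * v_demo).sum().backward()
second_grad = w_demo.grad.clone()

# Reset the accumulated gradient in place.
w_demo.grad.zero_()

print(first_grad.tolist(), second_grad.tolist(), w_demo.grad.tolist())
```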


## Optimizers and Parameters

In [None]:
model = torch.nn.Linear(3, 1).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=3e-3)
crit = torch.nn.MSELoss()
type(model), type(opt), type(crit)


## Training Loop Pattern

In [None]:
# CPU generator for determinism
g_cpu = torch.Generator().manual_seed(0)
g_cpu


In [None]:
# Create inputs X
X = torch.randn(64, 3, generator=g_cpu).to(device)
X


In [None]:
# Ground-truth weights
true_w = torch.tensor([1.0, -2.0, 0.5], device=device)
true_w


In [None]:
# Targets with a touch of noise
y = (X @ true_w) + 0.1 * torch.randn(64, generator=g_cpu).to(device)
y[:8]


In [None]:
# Model, optimizer, loss
model = torch.nn.Linear(3, 1).to(device)
model


In [None]:
opt = torch.optim.AdamW(model.parameters(), lr=5e-3)
opt


In [None]:
loss_fn = torch.nn.MSELoss()
loss_fn


In [None]:
# Train
losses = []
for step in range(201):
    opt.zero_grad()
    pred = model(X).squeeze(-1)
    loss = loss_fn(pred, y)
    loss.backward()
    opt.step()
    losses.append(loss.item())
    if step % 50 == 0:
        print(step, round(losses[-1], 4))
losses[-1]


In [None]:
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8')
%config InlineBackend.figure_format = 'svg'
# Plot loss vs step
plt.figure(figsize=(4.5,3))
plt.plot(losses, label='train loss')
plt.xlabel('step')
plt.ylabel('MSE')
plt.title('Linear regression: loss vs. step')
plt.legend(); plt.tight_layout()


## Linear Regression (End‑to‑End Example)

In [None]:
def linreg_demo(epochs: int = 400, lr: float = 3e-2):
    """Fit a 2-feature linear model to synthetic noisy data; return the learned (weights, bias)."""
    g_cpu = torch.Generator().manual_seed(42)
    w_true = torch.tensor([2.0, -3.5], device=device)
    b_true = torch.tensor(0.5, device=device)
    X = torch.randn(128, 2, generator=g_cpu).to(device)
    y = (X @ w_true) + b_true + 0.1 * torch.randn(128, generator=g_cpu).to(device)
    model = torch.nn.Linear(2, 1).to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for step in range(epochs + 1):
        opt.zero_grad()
        pred = model(X).squeeze(-1)
        loss = loss_fn(pred, y)
        loss.backward()
        opt.step()
        if step % 100 == 0:
            print(step, round(loss.item(), 4))
    return model.weight.detach().squeeze(0), model.bias.detach().squeeze(0)

w, b = linreg_demo()
w.cpu().tolist(), float(b)


## Exercises

- Extend the mini training loop with a learning-rate scheduler from `torch.optim.lr_scheduler`.
- Implement gradient clipping and observe how it affects convergence on the toy dataset.
- Create a unit test that checks whether your custom tensor operation preserves shape invariants.
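
As a possible starting point for the first two exercises, the sketch below shows where a scheduler step and a clipping call would typically sit in the loop. It is not a reference solution: the choice of `StepLR`, the values `step_size=50`, `gamma=0.5`, `max_norm=1.0`, and all `_demo` names are assumptions made for illustration, and the data is noise-free and kept on the CPU for brevity.

```python
import torch

torch.manual_seed(0)
X_demo = torch.randn(64, 3)
y_demo = X_demo @ torch.tensor([1.0, -2.0, 0.5])

model_demo = torch.nn.Linear(3, 1)
opt_demo = torch.optim.AdamW(model_demo.parameters(), lr=5e-2)
# Halve the learning rate every 50 optimizer steps (illustrative values).
sched_demo = torch.optim.lr_scheduler.StepLR(opt_demo, step_size=50, gamma=0.5)
loss_fn_demo = torch.nn.MSELoss()

history = []
for step in range(200):
    opt_demo.zero_grad()
    loss = loss_fn_demo(model_demo(X_demo).squeeze(-1), y_demo)
    loss.backward()
    # Clip after backward() and before step(); returns the pre-clipping total norm.
    grad_norm = torch.nn.utils.clip_grad_norm_(model_demo.parameters(), max_norm=1.0)
    opt_demo.step()
    sched_demo.step()  # scheduler steps after the optimizer
    history.append(loss.item())

final_lr = sched_demo.get_last_lr()[0]
print('final lr:', final_lr)
```

Logging `grad_norm` alongside the loss is one way to see whether clipping is actually active on a given step.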
