<img src="https://theaiengineer.dev/tae_logo_gw_flatter.png" width=35% align=right>

# Building a Large Language Model from Scratch — A Step-by-Step Guide Using Python and PyTorch
## Chapter 7 — Attention & Self-Attention Mechanism
**© Dr. Yves J. Hilpisch**<br>AI-Powered by GPT-5.

## How to Use This Notebook

- Implement scaled dot-product attention step by step before abstracting it away.
- Inspect attention weights on curated toy sequences to build intuition.
- Connect the math to code by tracing shapes and broadcasting rules carefully.

### Roadmap

We derive attention scores, apply masking, and then batch the operation so it scales to transformer blocks.
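
For orientation, the operation built in this chapter can be written compactly as below. The symbols are assumptions chosen for illustration: $Q$, $K$, $V$ are the query, key, and value matrices, $d$ is the size of their last dimension, and $M$ is an additive causal mask. The code in this notebook reaches the same result by filling masked scores with $-\infty$ rather than adding $M$.

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}} + M\right) V,
\qquad
M_{ij} =
\begin{cases}
0 & \text{if } j \le i \\
-\infty & \text{if } j > i
\end{cases}
$$

The softmax is taken over the key index $j$, so each query row $i$ yields weights that sum to one.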

### Study Tips

Print intermediate tensors as you go. Seeing the score matrices and masks makes it easier to reason about what each line of code accomplishes.
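
As a minimal sketch of that habit, the snippet below prints the shape of each intermediate tensor in an unmasked attention computation. The sizes (batch of 2, 3 time steps, 4 features) and the variable names are arbitrary choices for illustration, not values used later in the notebook.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy sizes: batch of 2, 3 time steps, 4 features.
q = torch.randn(2, 3, 4)
k = torch.randn(2, 3, 4)

# Swap the last two axes of k so the matrix product contracts over features:
# [2, 3, 4] @ [2, 4, 3] -> [2, 3, 3]
scores = q @ k.transpose(-2, -1)
scaled = scores / (q.size(-1) ** 0.5)

# Softmax over the last axis normalizes across key positions.
weights = torch.softmax(scaled, dim=-1)

for name, t in [("q", q), ("scores", scores), ("weights", weights)]:
    print(f"{name:8s} shape={tuple(t.shape)}")

# Each query row of the weights should sum to 1.
row_sums = weights.sum(dim=-1)
print(row_sums)
```

Printing `row_sums` is a quick sanity check that the softmax ran over the intended axis.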

In [None]:
# Ensure torch is available (Colab friendly)
try:
    import torch  # noqa
    print('torch:', torch.__version__)
except Exception:
    import os
    gpu = os.system('nvidia-smi > /dev/null 2>&1') == 0
    index = (
        'https://download.pytorch.org/whl/cu121'
        if gpu else 'https://download.pytorch.org/whl/cpu'
    )
    get_ipython().run_line_magic('pip', f'install -q torch --index-url {index}')
    import torch
    print('torch:', torch.__version__)


In [None]:
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8')
%config InlineBackend.figure_format = 'svg'


In [None]:
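import torch
from torch import Tensor


def scaled_dot_product_attention(
    q: Tensor,
    k: Tensor,
    v: Tensor,
    mask: Tensor | None = None,
) -> Tensor:
    """Compute softmax(q @ k^T / sqrt(d)) @ v; positions where mask == 0 are hidden."""
    d = q.size(-1)  # feature dimension used for scaling
    scores = (q @ k.transpose(-2, -1)) / (d ** 0.5)  # [..., T_q, T_k]
    if mask is not None:
        # Scores of -inf receive exactly zero weight after the softmax
        scores = scores.masked_fill(mask == 0, float('-inf'))
    w = torch.softmax(scores, dim=-1)  # each query row sums to 1
    return w @ v


def causal_mask(batch: int, time: int, device=None) -> Tensor:
    """Return a [batch, time, time] lower-triangular mask (1 = may attend)."""
    base = torch.tril(torch.ones(time, time, device=device))
    return base.unsqueeze(0).expand(batch, -1, -1)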


In [None]:
# Set random seed
torch.manual_seed(0)


In [None]:
# Define shapes
B, T, D = 1, 6, 4
(B, T, D)


In [None]:
# Create a toy input
x = torch.randn(B, T, D)
x


In [None]:
# Build a causal mask
mask = causal_mask(B, T)
mask


In [None]:
# Apply attention
y = scaled_dot_product_attention(x, x, x, mask)
y
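
One way to check that the causal mask does its job is to test two properties of the output. The sketch below is self-contained, so it re-declares a compact copy of the attention function under the hypothetical name `attention`; the perturbation size of `10.0` is an arbitrary choice.

```python
import torch


def attention(q, k, v, mask=None):
    # Compact copy of the notebook's scaled dot-product attention.
    scores = (q @ k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


torch.manual_seed(0)
B, T, D = 1, 6, 4
x = torch.randn(B, T, D)
mask = torch.tril(torch.ones(T, T)).unsqueeze(0)  # [1, T, T]

y = attention(x, x, x, mask)

# Position 0 can only attend to itself, so its output is its own value vector.
first_row_ok = torch.allclose(y[:, 0], x[:, 0])

# Perturb only the last token; outputs at earlier positions should not change.
x2 = x.clone()
x2[:, -1] += 10.0
y2 = attention(x2, x2, x2, mask)
past_unchanged = torch.allclose(y[:, :-1], y2[:, :-1])

print(first_row_ok, past_unchanged)
```

Both checks should pass with a correct causal mask. Without the mask, changing the last token would alter every row of the output.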


In [None]:
# Visualize the full matrix of attention weights
with torch.no_grad():
    d = x.size(-1)
    scores = (x @ x.transpose(-2, -1)) / (d ** 0.5)
    scores = scores.masked_fill(mask == 0, float('-inf'))
    w = torch.softmax(scores, dim=-1)[0]  # [T, T]
plt.figure(figsize=(4, 3))
plt.imshow(w, cmap='viridis', aspect='auto')
plt.colorbar(label='weight')
plt.xlabel('key positions')
plt.ylabel('query positions')
plt.title('Causal attention weights (toy)')
plt.tight_layout()


## Exercises

- Implement additive attention and compare its behavior with scaled dot-product attention.
- Visualize attention maps for sequences with padding to confirm masking works as expected.
- Modify the notebook to support multi-head attention and measure the parameter count increase.
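
As a possible starting point for the padding exercise, the sketch below builds a mask that combines causal and padding constraints. It assumes right-padded sequences and a hypothetical `lengths` tensor holding the real length of each sequence; neither appears elsewhere in this notebook.

```python
import torch

# Hypothetical batch: two sequences padded to length 5, with real lengths 5 and 3.
lengths = torch.tensor([5, 3])
T = 5

# key_is_real[b, j] is True where position j of sequence b holds a real token.
key_is_real = torch.arange(T).unsqueeze(0) < lengths.unsqueeze(1)  # [B, T]

# Lower-triangular causal constraint, shared by every sequence in the batch.
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))  # [T, T]

# Broadcast [1, T, T] & [B, 1, T] -> [B, T, T]: a key is visible only if it is
# neither in the future nor padding.
mask = causal.unsqueeze(0) & key_is_real.unsqueeze(1)

print(mask[1].int())
```

The resulting boolean mask can be passed to `scaled_dot_product_attention`, since its `mask == 0` test also works on boolean tensors. With right padding and causal masking, every query row keeps at least one visible key, which avoids rows of all `-inf` scores.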
