<img src="https://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>


# Deep Learning Basics with PyTorch

**Dr. Yves J. Hilpisch with GPT-5**


# Chapter 14: Recurrent and Attention-based Models

## Overview

This notebook is a concise, hands-on companion to Chapter 14 of Deep Learning Basics with PyTorch.
It implements scaled dot-product attention with and without a causal mask, then traces tensor shapes through multi-head attention.
Run each cell, read the short notes, and try small variations to build intuition.

Tips:
- Run cells top to bottom; restart the kernel if state gets confusing.
- Prefer small, fast experiments; iterate quickly and observe outputs.
- Keep an eye on shapes, dtypes, and devices when using PyTorch.


## Scaled dot-product attention (masked vs unmasked)

In [None]:
import math
import torch


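def scaled_dot_product_attention(Q, K, V, mask=None):
    """Apply scaled dot-product attention and return outputs plus weights.

    `mask` is a boolean tensor: True marks positions a query may attend to.
    """
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = (Q @ K.transpose(-2, -1)) / math.sqrt(Q.size(-1))
    if mask is not None:
        # Blocked positions get -inf so softmax assigns them zero weight.
        scores = scores.masked_fill(~mask, float('-inf'))
    # Normalize over the key dimension; each row sums to 1.
    weights = torch.softmax(scores, dim=-1)
    return weights @ V, weights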


T, D = 4, 3
Q = torch.randn(T, D)
K = torch.randn(T, D)
V = torch.randn(T, D)

causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
_, masked_weights = scaled_dot_product_attention(Q, K, V, causal_mask)
_, unmasked_weights = scaled_dot_product_attention(Q, K, V)

masked_weights.shape, unmasked_weights.shape
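
Both weight matrices have the same shape, so the shapes alone do not show what the mask does. The following is a minimal, self-contained sketch that inspects the values instead. The helper name `attention_weights` and the fixed seed are assumptions made for this illustration; they are not part of the cell above.

```python
import math
import torch

torch.manual_seed(0)  # assumption: fixed seed so reruns give the same numbers


def attention_weights(Q, K, mask=None):
    """Return only the softmax weights of scaled dot-product attention."""
    scores = (Q @ K.transpose(-2, -1)) / math.sqrt(Q.size(-1))
    if mask is not None:
        # True = may attend; False positions are set to -inf before softmax.
        scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1)


T, D = 4, 3
Q, K = torch.randn(T, D), torch.randn(T, D)
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))

masked = attention_weights(Q, K, causal_mask)
unmasked = attention_weights(Q, K)

# Every row is a probability distribution over the keys.
rows_sum_to_one = torch.allclose(masked.sum(dim=-1), torch.ones(T))
# With the causal mask, weights on future positions (above the diagonal) are zero.
future_is_zero = bool((masked.triu(diagonal=1) == 0).all())
# Without a mask, every position receives some weight.
all_positive = bool((unmasked > 0).all())

print(masked)
print(rows_sum_to_one, future_is_zero, all_positive)
```

The first query can only see the first key, so the first row of `masked` puts all of its weight on position 0. A row whose mask is entirely False would produce NaNs after the softmax, which is worth keeping in mind when combining causal and padding masks.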

## Multi-head attention shapes

In [None]:
BATCHES, T, D_MODEL, HEADS = 2, 5, 8, 2
D_HEAD = D_MODEL // HEADS
x = torch.randn(BATCHES, T, D_MODEL)

Wq = torch.randn(D_MODEL, D_MODEL)
Wk = torch.randn(D_MODEL, D_MODEL)
Wv = torch.randn(D_MODEL, D_MODEL)

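# Project the inputs; each result has shape (BATCHES, T, D_MODEL).
Q = x @ Wq
K = x @ Wk
V = x @ Wv

# Split D_MODEL into HEADS chunks of size D_HEAD, then move the head axis
# forward: (BATCHES, T, D_MODEL) -> (BATCHES, HEADS, T, D_HEAD).
Q = Q.view(BATCHES, T, HEADS, D_HEAD).transpose(1, 2)
K = K.view(BATCHES, T, HEADS, D_HEAD).transpose(1, 2)
V = V.view(BATCHES, T, HEADS, D_HEAD).transpose(1, 2)

# One (T, T) score matrix per batch element and head.
scores = (Q @ K.transpose(-2, -1)) / (D_HEAD ** 0.5)
scores.shape  # torch.Size([2, 2, 5, 5])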
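
The cell above stops at the score tensor. As a sketch of the remaining steps, the block below applies the softmax, forms the per-head context vectors, and merges the heads back into a `(BATCHES, T, D_MODEL)` tensor. It is self-contained and uses random tensors in place of the projected inputs; the helper `split_heads` and the fixed seed are assumptions for illustration, and the final output projection that a full multi-head layer normally applies is left out.

```python
import torch

torch.manual_seed(0)  # assumption: fixed seed for reproducible values

BATCHES, T, D_MODEL, HEADS = 2, 5, 8, 2
D_HEAD = D_MODEL // HEADS


def split_heads(t):
    """(BATCHES, T, D_MODEL) -> (BATCHES, HEADS, T, D_HEAD)."""
    return t.view(BATCHES, T, HEADS, D_HEAD).transpose(1, 2)


# Random stand-ins for the projected queries, keys, and values.
Q = split_heads(torch.randn(BATCHES, T, D_MODEL))
K = split_heads(torch.randn(BATCHES, T, D_MODEL))
V = split_heads(torch.randn(BATCHES, T, D_MODEL))

scores = (Q @ K.transpose(-2, -1)) / (D_HEAD ** 0.5)  # (BATCHES, HEADS, T, T)
weights = torch.softmax(scores, dim=-1)               # rows sum to 1 per head
context = weights @ V                                 # (BATCHES, HEADS, T, D_HEAD)

# Undo the split: move the head axis back next to D_HEAD and flatten both.
# transpose() returns a non-contiguous view, so contiguous() is needed before view().
merged = context.transpose(1, 2).contiguous().view(BATCHES, T, D_MODEL)

print(weights.shape, context.shape, merged.shape)
```

After merging, the first `D_HEAD` features of each token come from head 0 and the next `D_HEAD` from head 1, which is the concatenation of heads described for multi-head attention.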

## Exercises

1. Change the number of heads or d_model in a tiny transformer; compare training curves.
2. Visualize attention for a few prompts and discuss patterns you see.
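
As a possible starting point for the first exercise, the sketch below varies the number of heads in PyTorch's built-in `torch.nn.MultiheadAttention` layer and prints the output shapes and parameter counts. It is not a tiny transformer and does no training; the sizes, the seed, and the choice of this layer are assumptions made for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)  # assumption: fixed seed for reproducibility

BATCHES, T, D_MODEL = 2, 5, 8
x = torch.randn(BATCHES, T, D_MODEL)

param_counts = {}
for heads in (1, 2, 4):  # D_MODEL must be divisible by the number of heads
    mha = nn.MultiheadAttention(D_MODEL, heads, batch_first=True)
    # Self-attention: the same tensor serves as query, key, and value.
    out, weights = mha(x, x, x)
    param_counts[heads] = sum(p.numel() for p in mha.parameters())
    # By default the returned weights are averaged over the heads.
    print(heads, tuple(out.shape), tuple(weights.shape), param_counts[heads])
```

In this layer the number of heads changes how `D_MODEL` is partitioned, not how many parameters there are, so differences in training curves between head counts come from the partitioning rather than from model size. Changing `D_MODEL` does change the parameter count.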


