<img src="https://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>


# Deep Learning Basics with PyTorch

**Dr. Yves J. Hilpisch with GPT-5**


# Chapter 14 — Recurrent and Attention-based Models

## Scaled dot-product attention (masked vs unmasked)

In [None]:
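import math

import torch


def attn(Q, K, V, mask=None):
    """Scaled dot-product attention; returns (output, attention weights)."""
    # Similarity scores, scaled by sqrt(d_k) to keep the softmax inputs moderate
    S = (Q @ K.transpose(-2, -1)) / math.sqrt(Q.size(-1))
    if mask is not None:
        # Boolean mask: True = may attend, False = blocked (score set to -inf)
        S = S.masked_fill(~mask, float('-inf'))
    A = torch.softmax(S, dim=-1)  # each row sums to 1 over the keys
    return A @ V, A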

T, d = 4, 3  # sequence length, feature dimension
Q = torch.randn(T, d)  # query vectors
K = torch.randn(T, d)  # key vectors
V = torch.randn(T, d)  # value vectors

# Causal mask (lower triangular: prevent attending to future positions)
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
# The attention-weight matrix has the same shape with and without the mask
attn(Q, K, V, causal)[1].shape, attn(Q, K, V, None)[1].shape
# (torch.Size([4, 4]), torch.Size([4, 4]))
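
The shapes alone do not show what the mask changes, since masked and unmasked weights are both `(T, T)`. The following is a minimal sketch, not part of the original notebook, that checks the values instead. It redefines `attn` so that it runs on its own; the fixed seed and the names `rows_sum_to_one`, `future_is_zero`, `first_row_is_v0`, and `last_row_same` are illustrative assumptions.

```python
import math

import torch


def attn(Q, K, V, mask=None):
    # Same scaled dot-product attention as in the cell above
    S = (Q @ K.transpose(-2, -1)) / math.sqrt(Q.size(-1))
    if mask is not None:
        S = S.masked_fill(~mask, float("-inf"))
    A = torch.softmax(S, dim=-1)
    return A @ V, A


torch.manual_seed(0)  # assumption: fixed seed, only for reproducibility
T, d = 4, 3
Q, K, V = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))

out_masked, A_masked = attn(Q, K, V, causal)
out_full, A_full = attn(Q, K, V)

# Every row is still a probability distribution over the allowed keys
rows_sum_to_one = torch.allclose(A_masked.sum(dim=-1), torch.ones(T))
# Entries above the diagonal (future positions) receive zero weight
future_is_zero = bool((A_masked.masked_select(~causal) == 0).all())
# Position 0 can only attend to itself, so its output is just V[0]
first_row_is_v0 = torch.allclose(out_masked[0], V[0])
# The last position sees every key, so masking leaves its row unchanged
last_row_same = torch.allclose(A_masked[-1], A_full[-1])

print(rows_sum_to_one, future_is_zero, first_row_is_v0, last_row_same)
```

All four checks should hold for any random draw, because `exp(-inf)` evaluates to zero inside the softmax and the remaining entries in each row are renormalized.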


## Multi-head attention shapes

In [None]:
B, T, d_model, h = 2, 5, 8, 2  # batch, sequence length, model width, heads
d_head = d_model // h  # per-head dimension
x = torch.randn(B, T, d_model)
# Independent random projection matrices (Wo is the output projection)
Wq, Wk, Wv, Wo = (torch.randn(d_model, d_model) for _ in range(4))
Q = x @ Wq  # query vectors, shape (B, T, d_model)
K = x @ Wk  # key vectors
V = x @ Wv  # value vectors
# Split the model dimension into heads: (B, T, d_model) -> (B, h, T, d_head)
Q = Q.view(B, T, h, d_head).transpose(1, 2)
K = K.view(B, T, h, d_head).transpose(1, 2)
V = V.view(B, T, h, d_head).transpose(1, 2)
# Scaled scores, one (T, T) matrix per batch element and head
((Q @ K.transpose(-2, -1)) / (d_head ** 0.5)).shape
# torch.Size([2, 2, 5, 5])
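
The cell above stops at the per-head score tensor. As a hedged sketch of the remaining steps (softmax, weighted sum over values, merging the heads, and the output projection), the shapes can be traced as follows. The helper `split_heads`, the `1 / sqrt(d_model)` weight scaling, and the fixed seed are assumptions made for illustration; they are not taken from the original notebook.

```python
import torch

torch.manual_seed(0)  # assumption: fixed seed, only for reproducibility
B, T, d_model, h = 2, 5, 8, 2
d_head = d_model // h
x = torch.randn(B, T, d_model)

# Four independent projection matrices, scaled to keep activations moderate
Wq, Wk, Wv, Wo = (torch.randn(d_model, d_model) / d_model ** 0.5 for _ in range(4))


def split_heads(t):
    # Hypothetical helper: (B, T, d_model) -> (B, h, T, d_head)
    return t.view(B, T, h, d_head).transpose(1, 2)


Q, K, V = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)

scores = (Q @ K.transpose(-2, -1)) / d_head ** 0.5  # (B, h, T, T)
weights = torch.softmax(scores, dim=-1)             # (B, h, T, T)
context = weights @ V                               # (B, h, T, d_head)

# Merge heads back: (B, h, T, d_head) -> (B, T, d_model).
# transpose makes the tensor non-contiguous, so contiguous() precedes view.
merged = context.transpose(1, 2).contiguous().view(B, T, d_model)
out = merged @ Wo                                   # (B, T, d_model)

print(scores.shape, context.shape, out.shape)
```

The output has the same shape as the input `x`, which is what allows attention layers to be stacked or wrapped in residual connections.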

