<img src='https://theaiengineer.dev/tae_logo_gw_flat.png' alt='The Python Quants' width='35%' align='right'>


# Python & Mathematics for Data Science and Machine Learning

**© Dr. Yves J. Hilpisch | The Python Quants GmbH**<br>
AI-powered by GPT-5.x.

# Chapter 22 — Embeddings & Attention as Linear Algebra

This notebook mirrors the chapter’s NumPy formulations with tiny, fast checks.

Set up NumPy and readable printing.


In [None]:
import numpy as np  # numerical arrays and linear algebra

np.set_printoptions(precision=3, suppress=True)  # compact, readable array printing


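Stable row-wise softmax: subtracting each row's maximum before exponentiating avoids overflow and leaves the result unchanged, because softmax is invariant to adding a constant to every entry of a row.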


In [None]:
def softmax_rows(S):  # row-wise softmax helper
    S = S - S.max(axis=1, keepdims=True)  # shift by row max for numerical stability
    E = np.exp(S)  # elementwise exponential
    return E / (E.sum(axis=1, keepdims=True) + 1e-12)  # row-normalize probs
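
The effect of the max shift can be seen on a row with large scores. The following is a minimal sketch, not part of the chapter's code: `naive_softmax_rows` and `stable_softmax_rows` are hypothetical names introduced here, and the stable variant omits the small epsilon used in `softmax_rows` above.

```python
import numpy as np

def naive_softmax_rows(S):  # hypothetical helper: no max shift
    E = np.exp(S)  # overflows to inf for large scores
    return E / E.sum(axis=1, keepdims=True)

def stable_softmax_rows(S):  # hypothetical helper: same shift as softmax_rows
    S = S - S.max(axis=1, keepdims=True)  # largest entry per row becomes 0
    E = np.exp(S)  # all exponentials now lie in (0, 1]
    return E / E.sum(axis=1, keepdims=True)

S_small = np.array([[1.0, 2.0, 3.0]])  # moderate scores
S_large = S_small + 1000.0  # same row shifted by a constant

with np.errstate(over='ignore', invalid='ignore'):  # silence expected warnings
    P_naive = naive_softmax_rows(S_large)  # inf / inf gives nan

P_small = stable_softmax_rows(S_small)
P_large = stable_softmax_rows(S_large)

print('naive has nan →', bool(np.isnan(P_naive).any()))
print('shift-invariant →', bool(np.allclose(P_small, P_large)))
```

The naive version fails because `np.exp` overflows in float64 for arguments above roughly 709, while the shifted version only ever exponentiates non-positive numbers.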


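Scaled dot-product attention, $O = \mathrm{softmax}(QK^\top / \sqrt{d})\,V$, with an optional causal mask that stops position $i$ from attending to positions $j > i$.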


In [None]:
def attention(Q, K, V, causal=False):  # scaled dot-product attention
    d = Q.shape[1]  # key/query dimension d
    S = (Q @ K.T) / np.sqrt(d)  # scaled similarity scores
    if causal:  # apply causal mask if requested
        n = S.shape[0]  # sequence length n
        mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # future positions
        S = S.copy()  # defensive copy; the mask below is written into it
        S[mask] = -1e9  # push future positions to ~zero prob
    A = softmax_rows(S)  # attention weights (row-stochastic)
    O = A @ V  # output: weighted sum of values
    return O, A  # return output and weights
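
To make the matrix formulation concrete, it can be compared against a per-query loop that computes one weighted sum of values at a time. This is a sketch under stated assumptions: `attention_loop` and `attention_matrix` are hypothetical names, the matrix version is a compact restatement of `attention` above (without the epsilon in the softmax), and the seed and shapes are arbitrary.

```python
import numpy as np

def attention_loop(Q, K, V, causal=False):  # hypothetical reference version
    n, d = Q.shape
    O = np.zeros((n, V.shape[1]))
    for i in range(n):
        m = i + 1 if causal else K.shape[0]  # number of visible keys for query i
        s = (K[:m] @ Q[i]) / np.sqrt(d)  # scaled scores against visible keys
        w = np.exp(s - s.max())  # stable softmax over one row
        w /= w.sum()
        O[i] = w @ V[:m]  # weighted sum of visible values
    return O

def attention_matrix(Q, K, V, causal=False):  # compact matrix form
    S = (Q @ K.T) / np.sqrt(Q.shape[1])
    if causal:
        future = np.triu(np.ones(S.shape, dtype=bool), k=1)
        S[future] = -1e9  # same masking constant as above
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    A = E / E.sum(axis=1, keepdims=True)
    return A @ V

rng = np.random.default_rng(0)  # arbitrary seed
Q = rng.normal(size=(5, 3))
K = rng.normal(size=(5, 3))
V = rng.normal(size=(5, 2))

ok_full = np.allclose(attention_loop(Q, K, V), attention_matrix(Q, K, V))
ok_causal = np.allclose(
    attention_loop(Q, K, V, causal=True),
    attention_matrix(Q, K, V, causal=True),
)
print('loop == matrix (non-causal) →', bool(ok_full))
print('loop == matrix (causal) →', bool(ok_causal))
```

In the causal case the loop simply truncates keys and values to the first `i + 1` rows, which is what the `-1e9` mask achieves in the matrix form once the masked exponentials underflow to zero.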


Toy example to sanity-check the attention weights: every row sums to 1, and the causal mask leaves essentially no weight on future positions.


In [None]:
rs = np.random.default_rng(22)  # reproducible random generator
n, d, dv = 6, 4, 2  # sequence length, model dim, value dim
X = rs.normal(size=(n, d))  # toy inputs (n×d)
Wq = rs.normal(size=(d, d))  # query weight (d×d)
Wk = rs.normal(size=(d, d))  # key weight (d×d)
Wv = rs.normal(size=(d, dv))  # value weight (d×dv)
Q, K, V = X @ Wq, X @ Wk, X @ Wv  # project to Q, K, V
O_nc, A_nc = attention(Q, K, V, causal=False)  # non-causal attention
O_c, A_c = attention(Q, K, V, causal=True)  # causal attention
print('rowsum(non-causal) →', np.round(A_nc.sum(1)[:3], 6))  # rows in A_nc sum to 1
print(
    'future mass (causal) →',
    float(np.triu(A_c, 1).sum())
)  # ~0 for causal mask
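
A further check on masking, sketched here as an optional extension, is that causal outputs at earlier positions should not change when only the last token is altered. The helper names `attn` and `run`, the seed, and the size of the perturbation are assumptions made for illustration.

```python
import numpy as np

def attn(Q, K, V, causal):  # hypothetical compact restatement of attention
    S = (Q @ K.T) / np.sqrt(Q.shape[1])
    if causal:
        S[np.triu(np.ones(S.shape, dtype=bool), k=1)] = -1e9  # mask future
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return (E / E.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)  # arbitrary seed
n, d, dv = 6, 4, 2  # same toy sizes as above
X = rng.normal(size=(n, d))
Wq = rng.normal(size=(d, d))
Wk = rng.normal(size=(d, d))
Wv = rng.normal(size=(d, dv))

def run(X, causal):  # project inputs, then attend
    return attn(X @ Wq, X @ Wk, X @ Wv, causal)

X2 = X.copy()
X2[-1] += 1.0  # perturb only the last token

# compare outputs at all positions before the perturbed one
same_causal = np.allclose(run(X, True)[:-1], run(X2, True)[:-1])
same_noncausal = np.allclose(run(X, False)[:-1], run(X2, False)[:-1])
print('earlier outputs unchanged (causal) →', bool(same_causal))
print('earlier outputs unchanged (non-causal) →', bool(same_noncausal))
```

With the causal mask the earlier rows never see the last key or value, so they are unaffected by the change. Without the mask, those rows would generally be expected to differ.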


