In [12]:
%pip install pandas numpy scikit-learn

Collecting scikit-learn
  Downloading scikit_learn-1.7.0-cp311-cp311-macosx_10_9_x86_64.whl.metadata (31 kB)
Collecting scipy>=1.8.0 (from scikit-learn)
  Downloading scipy-1.16.0-cp311-cp311-macosx_14_0_x86_64.whl.metadata (61 kB)
Collecting joblib>=1.2.0 (from scikit-learn)
  Downloading joblib-1.5.1-py3-none-any.whl.metadata (5.6 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Using cached threadpoolctl-3.6.0-py3-none-any.whl.metadata (13 kB)
Downloading scikit_learn-1.7.0-cp311-cp311-macosx_10_9_x86_64.whl (11.7 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m11.7/11.7 MB[0m [31m7.4 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hDownloading joblib-1.5.1-py3-none-any.whl (307 kB)
Downloading scipy-1.16.0-cp311-cp311-macosx_14_0_x86_64.whl (23.4 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m23.4/23.4 MB[0m [31m11.5 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hUsing cached threadpoolctl-3.6.0-py3-none-any.whl (18 kB)


In [2]:
import pandas as pd
import numpy as np

# Self-Attention Mechanism for One Head

Self-Attention is a mechanism introduced in the paper "Attention is All You Need" to compute relationships between elements in a sequence. For a simple vocabulary like "I am Max" with dimensions of 768, the process involves:

1. **Input Representation**: Each word is represented as a vector of size 768.
2. **Query, Key, and Value Matrices**: These (Q, K, V) are randomly initialized for the weights based on the sequence length (L) and the dimensions (768).

3. **Scaled Dot-Product Attention**:
   - Compute the dot product between the Query and Key matrices.
   - Scale the result by the square root of the dimension (768).
   - Apply a mask to ensure causality (future words do not influence past words).
   - Use the softmax function to normalize the scores.
4. **Weighted Sum**: Multiply the attention scores with the Value matrix to get the final representation.

This mechanism is important because it allows the model to focus on relevant parts of the input sequence, enabling better understanding and generation of context-aware outputs.

---

*This demonstration of how self-attention works was inspired by [this video](https://www.youtube.com/watch?v=QCJQG4DuHT0&t=3s), [this repository](https://github.com/ajhalthor/Transformer-Neural-Network), and the instructor [Ajay Halthor](https://github.com/ajhalthor).*

In [23]:
L, d_k, d_v = 3, 768, 768

Q = np.random.rand(L, d_k)
K = np.random.rand(L, d_k)
V = np.random.rand(L, d_v)

In [24]:
pd.DataFrame(Q)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,758,759,760,761,762,763,764,765,766,767
0,0.781469,0.864657,0.791608,0.237844,0.840169,0.902662,0.242498,0.66956,0.557935,0.295527,...,0.929783,0.31291,0.194774,0.60739,0.166716,0.129626,0.01084,0.951116,0.423293,0.017081
1,0.940682,0.168691,0.126298,0.102024,0.632703,0.729001,0.271093,0.530755,0.971604,0.200153,...,0.301399,0.895101,0.791637,0.300842,0.33368,0.580032,0.205204,0.086156,0.339412,0.283901
2,0.898093,0.705961,0.558028,0.762148,0.675514,0.103356,0.865074,0.445679,0.593788,0.677311,...,0.922109,0.202598,0.119604,0.824416,0.670968,0.609849,0.037851,0.351445,0.2551,0.128911


In [25]:
pd.DataFrame(K)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,758,759,760,761,762,763,764,765,766,767
0,0.358344,0.071543,0.02497,0.9508,0.041414,0.036947,0.847188,0.158953,0.033655,0.814012,...,0.455313,0.42436,0.406919,0.340354,0.331903,0.265585,0.037457,0.313205,0.417403,0.177898
1,0.372984,0.017191,0.01422,0.149774,0.568816,0.284216,0.847346,0.758527,0.844118,0.553839,...,0.941243,0.716409,0.092678,0.0316,0.341622,0.891035,0.667869,0.125007,0.999342,0.243626
2,0.418939,0.897291,0.38474,0.073313,0.090361,0.357796,0.296533,0.94465,0.092306,0.817969,...,0.705718,0.051845,0.694444,0.880999,0.737336,0.616627,0.110043,0.443907,0.240237,0.720614


In [26]:
pd.DataFrame(V)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,758,759,760,761,762,763,764,765,766,767
0,0.153764,0.530367,0.386699,0.42386,0.966091,0.111646,0.676884,0.672639,0.316213,0.022723,...,0.08143,0.529522,0.219022,0.985391,0.674502,0.720504,0.066238,0.164949,0.053637,0.207311
1,0.161113,0.69301,0.488317,0.712771,0.14025,0.197266,0.25973,0.447841,0.206286,0.801529,...,0.06755,0.487519,0.957209,0.267735,0.869852,0.028212,0.378867,0.722665,0.20155,0.325811
2,0.398923,0.279205,0.000382,0.915593,0.180819,0.311837,0.928588,0.680243,0.257427,0.752605,...,0.643342,0.5781,0.180915,0.32455,0.956626,0.801293,0.226475,0.796885,0.055271,0.886396


Scaling the dot product of Q and K by $\sqrt{d_k}$ keeps the attention scores in a reasonable range, making the softmax more stable and improving training.

In [27]:
import math
scaled = np.matmul(Q, K.T) / math.sqrt(d_k)
pd.DataFrame(scaled)

Unnamed: 0,0,1,2
0,6.750395,6.313665,6.597572
1,7.234502,6.706969,7.232652
2,6.77703,6.590438,6.899066


In [None]:
from scipy.special import softmax

softmax_scores = softmax(scaled, axis=1)
df = pd.DataFrame(softmax_scores)
df['Sum'] = df[[0,1,2]].sum(axis=1)
df.T

Unnamed: 0,0,1,2
0,0.399293,0.386367,0.337886
1,0.258001,0.227979,0.280372
2,0.342706,0.385653,0.381742
3,1.0,1.0,1.0


In the code above, we create a **mask matrix** `M` to enforce causality in the self-attention mechanism. The mask is a lower triangular matrix where the upper diagonal elements are set to $-\infty$. This ensures that, when added to the attention scores before applying the softmax, the softmax output for those positions becomes zero. As a result, each position in the sequence can only attend to itself and previous positions, not to any future positions. This is crucial for tasks like language modeling, where future information should not be accessible.

In [9]:
M = np.tril(np.ones((L,L)))
M[M== 0] = -np.inf
M[M == 1] = 0

In [10]:
pd.DataFrame(M)

Unnamed: 0,0,1,2
0,0.0,-inf,-inf
1,0.0,0.0,-inf
2,0.0,0.0,0.0


After computing the scaled dot product of $Q$ and $K$ (i.e., $(QK^T)/\sqrt{d_k}$), we add the mask matrix $M$ to the result. This mask ensures that each position in the sequence can only attend to itself and previous positions, not future ones. The masked and scaled attention scores are then passed through the softmax function, which converts them into a probability distribution. This distribution determines how much focus (attention) each word should give to every other word in the sequence, while respecting the causality constraint imposed by the mask.

In [13]:
from scipy.special import softmax

attention = softmax(scaled + M)

After obtaining the softmax distribution (the attention weights), we multiply it by the Value matrix $V$ to produce the final output of the self-attention mechanism:

$$\text{Output} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right) \cdot V$$

**Why is this multiplication needed?**

- The softmax distribution tells us how much attention each word in the sequence should pay to every other word (including itself).
- By multiplying these attention weights with the Value matrix $V$, we compute a weighted sum of the value vectors for each position in the sequence.
- This means each output vector is a blend of the value vectors, where the contribution of each value is determined by the attention score.
- This allows the model to aggregate information from relevant positions in the sequence, enabling it to capture dependencies and context effectively.

In summary, this multiplication enables the self-attention mechanism to produce context-aware representations for each word, based on how much attention it gives to other words in the sequence.

In [16]:
new_V = np.matmul(attention, V)
pd.DataFrame(new_V)

Unnamed: 0,0,1,2
0,0.053619,0.038687,0.126166
1,0.093947,0.059735,0.099384
2,0.228177,0.121844,0.180339
