$$\Large\boxed{\text{AME 5202 Deep Learning, Even Semester 2026}}$$

$$\large\text{Theme}: \underline{\text{computational foundations of the self-attention mechanism}}$$

---

Load essential libraries

---

In [2]:
import numpy as np
import pandas as pd
import torch
import matplotlib.pyplot as plt
plt.style.use('dark_background')
%matplotlib inline
import sys
import pickle
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
import nltk
from nltk.tokenize import word_tokenize
import seaborn as sns

---

Mount Google Drive folder if running Google Colab

---

In [None]:
## Mount Google drive folder if running in Colab
if('google.colab' in sys.modules):
    from google.colab import drive
    drive.mount('/content/drive', force_remount = True)
    DIR = '/content/drive/MyDrive/Colab Notebooks/MAHE/MSIS Coursework/EvenSem2026MAHE'
    DATA_DIR = DIR+'/Data/'
else:
    DATA_DIR = 'Data/'

---

Some LLM Magic

---



In [None]:
from transformers import BertTokenizer, BertForMaskedLM

sentence = "I swam across the river to the other [MASK]."
#sentence = "I ran across the street to get cash from the [MASK]."

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased') # about 440MB large

inputs = tokenizer(sentence, return_tensors = "pt")
logits = model(**inputs).logits
predicted_id = logits[0, inputs.input_ids[0] == tokenizer.mask_token_id].argmax(dim=-1)

print(tokenizer.decode(predicted_id))   

---

Here we create a simple sentence in English and tokenize it

---

In [None]:
sentence = 'i swam quickly across the river to get to the other bank'
nltk.download('punkt_tab')
tokens = word_tokenize(sentence)
print(tokens)
print(len(tokens))

---

Download the word embedding model trained on Wikipedia articles, generate the word embeddings for the tokens in the sentence above, and store them in a matrix $\mathbf{X}$ such that each row of the matrix corresponds to a token.

---

In [None]:
# Load the Wikipedia-trained GloVe word vectors (50-dimensional) from the pickle file
with open(DATA_DIR + 'glove_wiki_gigaword_50.pkl', 'rb') as f:
    loaded_word_vectors = pickle.load(f)

# Build X with one row per token; an out-of-vocabulary token raises a KeyError
X = torch.stack([torch.tensor(loaded_word_vectors[token],
                               dtype = torch.float64)
                                 for token in tokens])
print(X.shape)
print(X[1]) # embedding vector for the word "swam"

---

Calculate the similarities between the words (1) *swam* and *bank*, and (2) *river* and *bank*. What do you observe?

$$\texttt{tokens = ['i',
 'swam',
 'quickly',
 'across',
 'the',
 'river',
 'to',
 'get',
 'to',
 'the',
 'other',
 'bank']}$$

---

In [None]:
# Token-1 is 'swam', token-5 is 'river', and token-11 is 'bank'
print(torch.dot(X[1], X[-1])) # similarity between 'swam' and 'bank'
print(torch.dot(X[5], X[-1])) # similarity between 'river' and 'bank'

---

The similarities between all pairs of words in the word embeddings matrix $\mathbf{X}$ are the entries of the matrix-matrix product $\mathbf{X}\mathbf{X}^\mathrm{T}.$ This is because:

- the $(i,j)\text{th}$ element of the matrix $\mathbf{X}\mathbf{X}^\mathrm{T}$ is the dot-product between the $i\text{th}$ row of $\mathbf{X}$ and the $j\text{th}$ column of $\mathbf{X}^\mathrm{T}.$
- But the $j\text{th}$ column of $\mathbf{X}^\mathrm{T}$ is also the $j\text{th}$ row of $\mathbf{X}.$
- Therefore, $\left[\mathbf{X}\mathbf{X}^\mathrm{T}\right]_{i,j} = \mathbf{x}^{(i)}\cdot\mathbf{x}^{(j)},$ which is the similarity between words $i$ and $j$ calculated using their embeddings.
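
The three points above can be checked numerically. The following is a minimal plain-Python sketch, not part of the original notebook: the matrix `X_toy` and the helpers `dot`, `transpose`, and `matmul` are made-up names and values used only for illustration.

```python
# Toy embeddings matrix (hypothetical values): 3 "words", 2 dimensions each
X_toy = [[1.0, 2.0],
         [0.0, 1.0],
         [3.0, -1.0]]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def transpose(M):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    """Matrix product: entry (i, j) is row i of A dotted with column j of B."""
    B_cols = transpose(B)
    return [[dot(row, col) for col in B_cols] for row in A]

# Similarity matrix X X^T
S_toy = matmul(X_toy, transpose(X_toy))

# Entry (i, j) of X X^T equals the dot product of rows i and j of X
for i in range(len(X_toy)):
    for j in range(len(X_toy)):
        assert S_toy[i][j] == dot(X_toy[i], X_toy[j])

print(S_toy)
```

Because the dot product is symmetric, `S_toy` is a symmetric matrix, and its diagonal holds the squared length of each embedding.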

---

In [None]:
S = torch.matmul(X, X.T)
print(S[-1])

---

The similarity matrix can be scaled by applying softmax row-by-row to the matrix $\mathbf{X}\mathbf{X}^\mathrm{T},$ so that every value lies between 0 and 1 and each row sums to 1. The resulting values are called the scaled similarity scores or, more commonly, the attention coefficients.
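
To make the row-by-row softmax concrete, here is a small plain-Python sketch under stated assumptions: `S_toy` is a made-up similarity matrix and `softmax` is a hypothetical helper, not a function from the notebook. Subtracting the row maximum before exponentiating is a common numerical-stability step and does not change the result.

```python
import math

def softmax(row):
    """Softmax of one row: exponentiate, then divide by the sum."""
    m = max(row)                           # shift by the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up similarity matrix (one row per word)
S_toy = [[5.0, 2.0, 1.0],
         [2.0, 1.0, -1.0],
         [1.0, -1.0, 10.0]]

# Apply softmax to each row independently
A_toy = [softmax(row) for row in S_toy]

for row in A_toy:
    print([round(a, 4) for a in row], 'sum =', round(sum(row), 4))
```

The largest raw similarity in a row still receives the largest coefficient, and a large gap between raw similarities (as in the last row) pushes almost all of the weight onto a single word.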

---

In [None]:
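# Scaled similarities: softmax applied to each row of S
S_scaled = torch.nn.functional.softmax(S, dim = -1)

# Transformed or new embeddings for the word swam (token-1)
y_swam = torch.matmul(S_scaled[1], X)
print(y_swam)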

---

The transformed embeddings matrix can be calculated as ${\mathbf{Y}} = \text{softmax}\left(\mathbf{X}\mathbf{X}^\mathrm{T}\right)\mathbf{X}.$

Note that the $i\text{th}$ row of $\mathbf{Y}$ is a linear combination of the rows of $\mathbf{X}$ (the original embeddings for the words). The multipliers are the attention coefficients of the $i\text{th}$ word, which form the $i\text{th}$ row of the matrix $\text{softmax}\left(\mathbf{X}\mathbf{X}^\mathrm{T}\right).$

Thus, the $i\text{th}$ row of $\mathbf{Y}$ represents the new or transformed embeddings for the $i\text{th}$ word.

Intuitively, each row of ${\mathbf{Y}}$ blends a word's own embedding with the embeddings of the words it is most similar to, so ${\mathbf{Y}}$ (a *transformed* version of $\mathbf{X}$) reflects the context of the sentence in a way that $\mathbf{X}$ alone does not.
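
The formula for $\mathbf{Y}$ can be traced step by step on a toy example. This is a sketch in plain Python with made-up values; `X_toy`, `A_toy`, `Y_toy`, and the helper functions are hypothetical names, not part of the notebook.

```python
import math

def softmax(row):
    """Softmax of one row (shifted by the max for numerical stability)."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(A, B):
    """Matrix product of two matrices stored as lists of rows."""
    B_cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in B_cols]
            for row in A]

# Toy embeddings (hypothetical values): 3 "words", 2 dimensions each
X_toy = [[1.0, 2.0],
         [0.0, 1.0],
         [3.0, -1.0]]
X_toy_T = [list(col) for col in zip(*X_toy)]

# Attention coefficients: row-wise softmax of X X^T
A_toy = [softmax(row) for row in matmul(X_toy, X_toy_T)]

# Transformed embeddings: Y = softmax(X X^T) X
Y_toy = matmul(A_toy, X_toy)

# Row i of Y is the weighted sum of the rows of X,
# with the weights taken from row i of the attention coefficients
i = 0
y_manual = [sum(A_toy[i][j] * X_toy[j][d] for j in range(len(X_toy)))
            for d in range(len(X_toy[0]))]

print(Y_toy[i])
print(y_manual)
```

Since the weights in each row are positive and sum to 1, every row of `Y_toy` is a weighted average of the original embeddings.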

---

In [None]:
Y = torch.matmul(torch.nn.functional.softmax(torch.matmul(X, X.T), dim = -1), X)
print(Y.shape)

---

The attention mechanism with learning:

![](https://1drv.ms/i/c/37720f927b6ddc34/IQSRMkyKcxZaRqR5kaD6nSJTAcUM0iMFoJ_z8DvjdkB_olw?width=550)

- When processing a sentence, each word ${\color{red}{\text{queries}}}$ all other words using the learned ${\color{red}{\text{query weights}}}$ ${\color{red}{\mathbf{W^{(q)}}}}.$
- Each word provides ${\color{blue}{\text{keys}}}$ to see if it matches a query, based on the learned ${\color{blue}{\text{key weights}}}$ ${\color{blue}{\mathbf{W^{(k)}}}}.$
- If a match is found, the ${\color{cyan}{\text{value weights}}}$ ${\color{cyan}{\mathbf{W^{(v)}}}}$ determine what information from the word is passed to the final attention output, in proportion to the strength of the match.

This process produces context-aware embeddings for each word. As an example, consider the sentence "*the bank raised interest rates*" with the following initial embeddings:
$$\begin{align*}
\textit{the} &= [0.1, 0.2, 0.1],\\
\textit{bank} &= [0.4, 0.5, 0.3],\\
\textit{raised} &= [0.1, 0.3, 0.4],\\
\textit{interest} &= [0.7, 0.1, 0.3],\\
\textit{rates} &= [0.2, 0.4, 0.5].
\end{align*}$$

1. Calculate the pairwise scaled similarities between all the words without using any query, key, or value weights (that is, with no learning).
What is the scaled similarity between the words *bank* and *interest*? Is the similarity strong or weak given the financial context?
2. Now consider the following query, key and value weights:
$$\begin{align*}
{\color{red}{\mathbf{W^{(q)}}}} &= \begin{bmatrix}
1.8 & 4.5 & 3.9 \\
4.3 & 0.2 & 2.9 \\
5.0 & 0.8 & 4.8
\end{bmatrix},
\ {\color{blue}{\mathbf{W^{(k)}}}} =\begin{bmatrix}
2.7 & 3.7 & 3.2\\
0.2 & 1.9 & 3.6\\
2.6 & 4.9 &1.9
\end{bmatrix},
\ {\color{cyan}{\mathbf{W^{(v)}}}} =\begin{bmatrix}
0.5 & 2.6 & 2.4\\
3.6 & 3.1 & 3.2\\
0.3 & 0.6 & 0.9
\end{bmatrix}{\color{black}.}
\end{align*}$$
Write down the queries, keys, and values associated with the words *bank* and *interest*.
3. Calculate the pairwise scaled similarities between all the words using the query and key weights above (the value weights are needed only for the transformed embeddings).
4. What is the scaled similarity between the words *bank* and *interest* now? Is the similarity strong or weak given the financial context?
5. In calculating the transformed embeddings for the word *bank*, which two words are weighted the most, and what are those weights?
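
As a way to check hand calculations for the steps above, the following plain-Python sketch computes the query for *bank*, the keys for all words, and the attention coefficients of *bank*. It assumes the convention used in this notebook's code: queries and keys are obtained as a row vector times the weight matrix, and the scores are divided by $\sqrt{d_k}$ before the softmax. The names `X_ex`, `vecmat`, and `softmax` are hypothetical helpers introduced only for illustration.

```python
import math

words = ['the', 'bank', 'raised', 'interest', 'rates']

# Initial embeddings from the example (one row per word)
X_ex = [[0.1, 0.2, 0.1],
        [0.4, 0.5, 0.3],
        [0.1, 0.3, 0.4],
        [0.7, 0.1, 0.3],
        [0.2, 0.4, 0.5]]

# Query and key weights from the example
W_q = [[1.8, 4.5, 3.9],
       [4.3, 0.2, 2.9],
       [5.0, 0.8, 4.8]]
W_k = [[2.7, 3.7, 3.2],
       [0.2, 1.9, 3.6],
       [2.6, 4.9, 1.9]]

def vecmat(x, W):
    """Row vector times matrix: entry j is sum_i x[i] * W[i][j]."""
    return [sum(x[i] * W[i][j] for i in range(len(x)))
            for j in range(len(W[0]))]

def softmax(row):
    """Softmax of one row (shifted by the max for numerical stability)."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Query for 'bank' and keys for every word
q_bank = vecmat(X_ex[words.index('bank')], W_q)
keys = [vecmat(x, W_k) for x in X_ex]
print('query(bank)   =', [round(v, 2) for v in q_bank])
print('key(interest) =', [round(v, 2) for v in keys[words.index('interest')]])

# Scaled dot-product scores of 'bank' against every key, then softmax
d_k = len(keys[0])
scores = [sum(q * k for q, k in zip(q_bank, key)) / math.sqrt(d_k)
          for key in keys]
weights = softmax(scores)

# Words ranked by how strongly 'bank' attends to them
for word, weight in sorted(zip(words, weights), key=lambda p: -p[1]):
    print(f'{word:>8s}: {weight:.4f}')
```

The first two lines of the ranking list the two most heavily weighted words for *bank*, and the entry for *interest* is the scaled similarity between *bank* and *interest* with learned weights.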

---

In [None]:
# Original embeddings of the words
the = torch.tensor([0.1, 0.2, 0.1])
bank = torch.tensor([0.4, 0.5, 0.3])
raised = torch.tensor([0.1, 0.3, 0.4])
interest = torch.tensor([0.7, 0.1, 0.3])
rates = torch.tensor([0.2, 0.4, 0.5])

# Stack the embeddings into a matrix with one row per word
X = torch.vstack([the, bank, raised, interest, rates])
print(f'Original embeddings: \n{X}')

# Scaled similarities with no learning
S = torch.nn.functional.softmax(torch.matmul(X, X.T), dim = -1)
print(f'Scaled similarities with no learning:\n {S}')

# Weight matrices of the query, key, and value neurons
W_q = torch.tensor([[1.8, 4.5, 3.9],
                    [4.3, 0.2, 2.9],
                    [5.0, 0.8, 4.8]], dtype = torch.float32)

W_k = torch.tensor([[2.7, 3.7, 3.2],
                    [0.2, 1.9, 3.6],
                    [2.6, 4.9, 1.9]], dtype = torch.float32)

W_v = torch.tensor([[0.5, 2.6, 2.4],
                    [3.6, 3.1, 3.2],
                    [0.3, 0.6, 0.9]], dtype = torch.float32)

# Calculate learned queries, keys, and values
Q = torch.matmul(X, W_q)
K = torch.matmul(X, W_k)
V = torch.matmul(X, W_v)

# Calculate scaled similarities with learning
d_k = K.shape[-1]
S_learned = torch.nn.functional.softmax(torch.matmul(Q, K.T)/torch.sqrt(torch.tensor(d_k, dtype = torch.float32)), dim = -1)
print(f'Scaled similarities with learning:\n {S_learned}')

# Calculate transformed embeddings of the words
Y = torch.matmul(S_learned, V)
print(f'Transformed embeddings: \n{Y}')