### Attention Mechanisms
In this notebook, we will sequentially implement different variants of attention mechanisms. These variants will build on each other, with the goal of finally creating a compact, efficient implementation of an attention mechanism, which we can then plug into our LLM architecture.

#### Why Attention?
In machine translation, it is not possible to merely translate word by word. The translation process requires contextual understanding and grammatical alignment.

- "Kannst du mir helfen diesen Satz zu uebersetzen" should not be translated to "Can you me help this sentence to translate", but rather to "Can you help me translate this sentence".
- Certain words require access to words appearing before or later in the original sentence. For instance, the verb "to translate" should be used in the context of "this sentence", and not independently.

Typically, to overcome this challenge, deep neural networks with two submodules are used:

- **encoder**: first reads in and processes the entire text (already done in the `preprocessing.ipynb` notebook).

- **decoder**: produces the translated text.

Pre-LLM architectures typically involved recurrent neural networks (RNNs), a type of neural network where outputs from previous steps are fed as inputs to the current step, making them well-suited for sequential data. In an encoder-decoder RNN, the input text is fed token by token into the encoder, which processes it sequentially. The final state of the encoder, known as the hidden state, acts as a memory cell that encodes the entire input. This hidden state is then passed to the decoder, which generates the translated sentence one word at a time.

|     ![RNNEncoder-Decoder](images/RNNencoderdecoder.png)     |
|:-----------------------------------------------------------:|
| *RNN Encoder Decoder* (Dive Into Deep Learning Chapter 10.7) |

- While the encoder is many-to-one, the decoder is a one-to-many architecture, since the hidden state is passed at every step of the decoding process.
- While it is not strictly necessary to understand the inner workings of encoder-decoder RNNs to develop an LLM, the `seq2seq.ipynb` notebook in stage 1 aims to more deeply explore these architectures.

**Encoder-decoder RNNs had many shortcomings that motivated the design of attention mechanisms.** Most importantly, the decoder could not access earlier hidden states of the encoder during the decoding phase, since it relied on a single hidden state to contain all the relevant information. Context was lost, especially in complex sentences where dependencies span larger distances.
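
To make this bottleneck concrete, here is a minimal sketch of a toy recurrent encoder in plain Python. The function `encode`, the scalar weights `w_x` and `w_h`, and the input values are all hypothetical and chosen only for illustration; a real RNN would use learned weight matrices and vector-valued hidden states.

```python
import math

def encode(tokens, w_x=0.5, w_h=0.8):
    """Toy scalar RNN encoder: returns only the final hidden state."""
    h = 0.0  # initial hidden state
    for x in tokens:
        # each step mixes the current input with the previous hidden state
        h = math.tanh(w_x * x + w_h * h)
    return h

short_sentence = [0.2, -0.1, 0.4]
long_sentence = [0.2, -0.1, 0.4] * 10  # ten times as many tokens

h_short = encode(short_sentence)
h_long = encode(long_sentence)

# Both sentences are compressed into a hidden state of the same size,
# which is all the decoder would get to see.
print(h_short, h_long)
```

However long the input is, the decoder in this sketch receives the same amount of memory, which is why information from early tokens tends to be lost.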



### Simple Self-Attention

The original *transformer* architecture includes a 'Self-Attention' mechanism inspired by the Bahdanau attention mechanism mentioned in `seq2seq.ipynb`.

A mechanism that uses self-attention allows each position in the input sequence (each word) to consider the importance of all other positions (all other words) in the same sequence when computing the representation of that sequence.

In short, **the goal of a self-attention mechanism is to, for each position in the input sequence, compute a context vector that captures, quantifies, and combines information from all other positions.** For example, given an input sequence $X = \{x^{(1)}, x^{(2)}, ..., x^{(T)}\}$ (which represents a text that has already been transformed into token embeddings), we want to compute the context vector $z^{(3)}$ of the 3rd position, $x^{(3)}$. The importance of each input for computing $z^{(3)}$ is determined by the attention weights $\{\alpha_{3,1}, \alpha_{3,2}, ..., \alpha_{3,T}\}$, so $z^{(3)}$ is a combination of all input vectors, each weighted by its relevance to the input element $x^{(3)}$.

- Our goal is to compute context vectors $z^{(i)}$ for all $x^{(i)}$ in the input sequence. The resulting context vectors, we can say, are enriched, more informational embedding vectors.
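
Written out, the context vector described above is an attention-weighted sum of the inputs. The text does not state this formula explicitly, so read it as a restatement under the assumption that the attention weights are normalized to sum to 1, as in the normalization steps of this notebook:

$$
\begin{align*}
z^{(3)} = \sum_{j=1}^{T} \alpha_{3,j} \, x^{(j)}, \qquad \sum_{j=1}^{T} \alpha_{3,j} = 1
\end{align*}
$$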

#### Why 'Self-Attention'

The 'self' refers to the fact that the mechanism computes weights by assessing and learning the dependencies of different positions **within the same sequence**. The relationships between the various parts of the input itself are considered. This is in contrast to attention mechanisms that assess and learn the relationships between elements of distinct sequences. In sequence-to-sequence models, for instance, attention may be computed between an input sequence and a distinct output sequence, which is not self-attention.




In [3]:
import torch
import numpy as np

inputs = torch.tensor(
    [[0.35, 0.15, 0.89], #These (x^1)
     [0.97, 0.8, 0.3], #are (x^2)
     [0.65, 0.34, 0.24], #random (x^3)
     [0.2, 0.87, 0.34], #words (x^4)
     [0.86, 0.13, 0.05], #and (x^5)
     [0.10, 0.20, 0.30]] #embeddings (x^6)
)

print(f'This is the third token: {inputs[2]}')

print(f'We initialize an empty vector to calculate intermediate attention scores, one for each row (word) of our embeddings: {torch.empty(inputs.shape[0])}')

This is the third token: tensor([0.6500, 0.3400, 0.2400])
We initialize an empty vector to calculate intermediate attention scores, one for each row (word) of our embeddings: tensor([-4.7826e-30,  1.6129e-42,  0.0000e+00,  0.0000e+00,  0.0000e+00,
         0.0000e+00])


In [4]:

def calculate_intermediate_attention_score_for_one_position(inputs):
    query = inputs[2] # we are working with the third input token as the query
    attn_scores_3 = torch.empty(inputs.shape[0])
    for i, x_i in enumerate(inputs):
        attn_scores_3[i] = torch.dot(query, x_i)
    return attn_scores_3


attention_scores_3 = calculate_intermediate_attention_score_for_one_position(inputs)
print(f'The intermediate attention scores for the third token are: {attention_scores_3}')

The intermediate attention scores for the third token are: tensor([0.4921, 0.9745, 0.5957, 0.5074, 0.6152, 0.2050])


#### Dot Product Intuition

In the code above, we have calculated intermediate values for the attention weights of the embedded query token $x^{(3)}$. We have simply initialized an empty, 6-dimensional vector (one entry for each embedded input token) and set each entry to the dot product between input $x^{(i)}$ and query $x^{(3)}$.

Recall that, for two vectors $u = [u_{1}, u_{2}, ..., u_{n}]$ and $v = [v_{1}, v_{2}, ..., v_{n}]$ in $\mathbb{R}^{n}$, the dot product is defined as the sum of the element-wise products:

$$
\begin{align*}
u \bullet v = \sum_{i=1}^{n} u_{i}v_{i} = u_{1}v_{1} + u_{2}v_{2} + ... + u_{n}v_{n}
\end{align*}
$$

Alternatively, the dot product can also be geometrically expressed as:

$$
\begin{align*}
u \bullet v = |u||v|\cos(\theta)
\end{align*}
$$

where $|u|$ and $|v|$ are the magnitudes of the vectors, and $\theta$ is the angle between them.

- When $\theta = 0^\circ$, the vectors are parallel and point in the same direction. $\cos(0^\circ) = 1$, and therefore $u \bullet v = |u||v|$. This is the maximum possible value of the dot product for given magnitudes.
- When $\theta = 90^\circ$, the vectors are perpendicular. $\cos(90^\circ) = 0$, and $u \bullet v = 0$. This indicates no alignment or similarity. The vectors share no directional overlap.
- When $\theta = 180^\circ$, the vectors are parallel but point in opposite directions. $\cos(180^\circ) = -1$ and $u \bullet v = -|u||v|$. This negative value indicates dissimilarity.

**The dot product depends on both alignment and magnitude.** Without normalization, it is possible that two long vectors at a small angle have a larger dot product than two short, perfectly aligned vectors. Therefore, **vectors are often normalized to provide a range from -1 to 1 as a pure measure of directional similarity**:

$$
\begin{align*}
\text{Cosine Similarity} = \frac{u \bullet v}{|u||v|} = \cos(\theta)
\end{align*}
$$

- Even without normalization, however, the dot product captures similarity through component-wise agreement (large values in the same positions yield a large sum). **Scaling and normalization might refine this measure, but this core principle holds.**

It is extremely important to understand how dot products determine the extent to which each position in a sequence gives importance to any other element. The higher the dot product, the higher the similarity, and the higher the attention score between the two positions.
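
The difference between the raw dot product and cosine similarity can be checked with a few lines of plain Python. This is a minimal sketch: the helper names `dot`, `norm` and `cosine_similarity` and the example vectors are assumptions made for illustration, not part of the notebook's code.

```python
import math

def dot(u, v):
    """Sum of the element-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(dot(u, u))

def cosine_similarity(u, v):
    """Dot product divided by both magnitudes, i.e. cos(theta)."""
    return dot(u, v) / (norm(u) * norm(v))

# Two short vectors pointing in exactly the same direction
short_a, short_b = [0.1, 0.1], [0.2, 0.2]
# Two long vectors separated by a large angle
long_a, long_b = [10.0, 1.0], [1.0, 10.0]

print(dot(short_a, short_b), cosine_similarity(short_a, short_b))
print(dot(long_a, long_b), cosine_similarity(long_a, long_b))
```

The long pair has by far the larger dot product even though the short pair is perfectly aligned, which is the effect of magnitude that normalization removes.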


In [5]:
attn_weights_3_norm = attention_scores_3/attention_scores_3.sum()
print(f'Normalized attention weights for the third token are: {attn_weights_3_norm}')
print(f'The sum of the normalized attention weights for the third token is: {attn_weights_3_norm.sum()}')

Normalized attention weights for the third token are: tensor([0.1452, 0.2875, 0.1757, 0.1497, 0.1815, 0.0605])
The sum of the normalized attention weights for the third token is: 0.9999998807907104


#### Why Normalize?

Above, we have implemented a very straightforward method for achieving normalization. We have normalized the attention scores $w_{3,i}$ to obtain the attention weights $\alpha_{3,i}$. The weights are now more interpretable, and they now sum up to 1. Each attention weight now shows the attention that token $x^{(3)}$ (**the query**) should place on token $x^{(i)}$ (**the key**) in a standardized, proportional manner.

Using the example above (although remember they are meaningless, random embeddings), we see that for the query token $x^{(3)}$, weight $\alpha_{3,2}$ indicates that 28.75% of the attention of $x^{(3)}$ is placed on $x^{(2)}$, while weight $\alpha_{3,6}$ indicates that only 6.05% of the attention is placed on $x^{(6)}$.

In practice, it is more common to use a **softmax function** for normalization, as it is better at managing extreme values and has better gradient properties for training.

$$
\begin{align*}
\text{softmax}(w_{i,j}) = \frac{e^{w_{i,j}}}{\sum_k e^{w_{i,k}}}
\end{align*}
$$

- By converting scores into probabilities between 0 and 1 that sum to 1, normalization prevents weights from growing too large or shrinking too small. Unbounded raw scores could otherwise lead to extreme values, causing gradients to explode or vanish during backpropagation, disrupting convergence.
- The softmax function has a well-behaved, bounded derivative that ensures that gradient updates remain consistent, and that they avoid the instability that might occur with raw scores.

$$
\begin{align*}
\frac{\partial \, \text{softmax}(x_{j})}{\partial x_{k}} = \text{softmax}(x_{j})(\delta_{j,k} - \text{softmax}(x_{k}))
\end{align*}
$$

where

$$
\begin{align*}
\delta_{j,k} =
\begin{cases}
1, & j = k \\
0, & j \neq k
\end{cases}
\end{align*}
$$
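
As a sanity check on the derivative above, the analytic Jacobian can be compared against a finite-difference estimate. The sketch below uses plain Python; the function names and the step size `eps` are assumptions chosen for illustration.

```python
import math

def softmax(x):
    exps = [math.exp(v) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(x):
    """Analytic Jacobian: J[j][k] = s_j * (delta_jk - s_k)."""
    s = softmax(x)
    n = len(x)
    return [[s[j] * ((1.0 if j == k else 0.0) - s[k]) for k in range(n)]
            for j in range(n)]

def numerical_jacobian(x, eps=1e-6):
    """Central finite differences, used only as an independent check."""
    n = len(x)
    jac = [[0.0] * n for _ in range(n)]
    for k in range(n):
        x_plus = list(x)
        x_plus[k] += eps
        x_minus = list(x)
        x_minus[k] -= eps
        s_plus, s_minus = softmax(x_plus), softmax(x_minus)
        for j in range(n):
            jac[j][k] = (s_plus[j] - s_minus[j]) / (2 * eps)
    return jac

scores = [0.4921, 0.9745, 0.5957]  # first three attention scores from above
analytic = softmax_jacobian(scores)
numeric = numerical_jacobian(scores)
max_error = max(abs(analytic[j][k] - numeric[j][k])
                for j in range(3) for k in range(3))
print(max_error)
```

Every entry of the analytic Jacobian is a product of values between 0 and 1, so its magnitude never exceeds 0.25. This is one way to see that the derivative is bounded.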

In [6]:
def softmax_naive(x):
    return torch.exp(x)/torch.sum(torch.exp(x))

attn_weights_3_norm_softmax = softmax_naive(attention_scores_3)
print(f'Normalized attention weights for the third token after softmax are: {attn_weights_3_norm_softmax}')
print(f'The sum is {attn_weights_3_norm_softmax.sum()}')

#or, using PyTorch softmax:
attn_weights_3_norm_softmax_ = torch.nn.functional.softmax(attention_scores_3, dim=0)
print(f'Normalized attention weights for the third token after softmax are: {attn_weights_3_norm_softmax_}')
print(f'The sum is {attn_weights_3_norm_softmax_.sum()}')


Normalized attention weights for the third token after softmax are: tensor([0.1509, 0.2445, 0.1674, 0.1532, 0.1707, 0.1133])
The sum is 1.0
Normalized attention weights for the third token after softmax are: tensor([0.1509, 0.2445, 0.1674, 0.1532, 0.1707, 0.1133])
The sum is 1.0000001192092896
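
One caveat about the naive implementation is that exponentiating large scores can overflow. A common remedy, sketched below in plain Python, is to subtract the maximum score before exponentiating. The function `softmax_stable` and the test values are assumptions for illustration; library implementations such as `torch.nn.functional.softmax` typically apply this kind of shift internally.

```python
import math

def softmax_naive(x):
    exps = [math.exp(v) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(x):
    # Subtracting the maximum leaves the result unchanged mathematically,
    # but keeps every exponent <= 0 so math.exp cannot overflow.
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

small = [0.4921, 0.9745, 0.5957, 0.5074, 0.6152, 0.2050]
large = [1000.0, 999.0, 998.0]

# On small scores, both versions agree
print(softmax_naive(small))
print(softmax_stable(small))

try:
    softmax_naive(large)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True  # math.exp(1000.0) exceeds the float range

print(naive_overflowed, softmax_stable(large))
```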


As a final step, we simply multiply all embedded input tokens $x^{(i)}$ with the corresponding attention weights, and sum everything to obtain context vector $z^{(3)}$.

In [7]:
query = inputs[2]
context_vec_3 = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
    context_vec_3 += attn_weights_3_norm_softmax[i] * x_i
print(f'Context vector for the third token is: {context_vec_3}')

Context vector for the third token is: tensor([0.5876, 0.4533, 0.3425])


#### Computing Attention Weights for All Tokens

The code above computes and normalizes the attention weights of all keys $x^{(j)}$ for a single query $x^{(i)}$. We now generalize it to compute the attention weights and context vectors for all inputs.

In [8]:
def calculate_intermediate_attention_score_all_positions(inputs):
    attn_scores = torch.empty(inputs.shape[0], inputs.shape[0])
    for i, x_i in enumerate(inputs):
        for j, x_j in enumerate(inputs):
            attn_scores[i,j] = torch.dot(x_i, x_j)
    return attn_scores


attention_scores = calculate_intermediate_attention_score_all_positions(inputs)
print(f'The intermediate attention scores for all positions are:\n {attention_scores}')

The intermediate attention scores for all positions are:
 tensor([[0.9371, 0.7265, 0.4921, 0.5031, 0.3650, 0.3320],
        [0.7265, 1.6709, 0.9745, 0.9920, 0.9532, 0.3470],
        [0.4921, 0.9745, 0.5957, 0.5074, 0.6152, 0.2050],
        [0.5031, 0.9920, 0.5074, 0.9125, 0.3021, 0.2960],
        [0.3650, 0.9532, 0.6152, 0.3021, 0.7590, 0.1270],
        [0.3320, 0.3470, 0.2050, 0.2960, 0.1270, 0.1400]])


For faster computation, we can avoid using for loops by using matrix multiplication:

In [9]:
attention_scores = inputs @ inputs.T
print(attention_scores)

attention_weights = torch.nn.functional.softmax(attention_scores, dim=1)
print(f'Normalized attention weights for all tokens after softmax are: {attention_weights}')

print(f'Row sums: {attention_weights.sum(dim=1)}')

all_context_vectors = attention_weights @ inputs
print(f'Context vectors for all positions are:\n {all_context_vectors}')

tensor([[0.9371, 0.7265, 0.4921, 0.5031, 0.3650, 0.3320],
        [0.7265, 1.6709, 0.9745, 0.9920, 0.9532, 0.3470],
        [0.4921, 0.9745, 0.5957, 0.5074, 0.6152, 0.2050],
        [0.5031, 0.9920, 0.5074, 0.9125, 0.3021, 0.2960],
        [0.3650, 0.9532, 0.6152, 0.3021, 0.7590, 0.1270],
        [0.3320, 0.3470, 0.2050, 0.2960, 0.1270, 0.1400]])
Normalized attention weights for all tokens after softmax are: tensor([[0.2376, 0.1925, 0.1522, 0.1539, 0.1341, 0.1297],
        [0.1235, 0.3176, 0.1583, 0.1611, 0.1550, 0.0845],
        [0.1509, 0.2445, 0.1674, 0.1532, 0.1707, 0.1133],
        [0.1477, 0.2408, 0.1483, 0.2224, 0.1208, 0.1201],
        [0.1371, 0.2468, 0.1760, 0.1287, 0.2033, 0.1080],
        [0.1818, 0.1846, 0.1601, 0.1754, 0.1481, 0.1500]])
Row sums: tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000])
Context vectors for all positions are:
 tensor([[0.5279, 0.4187, 0.4037],
        [0.6281, 0.5036, 0.3311],
        [0.5876, 0.4533, 0.3425],
        [0.5420, 0.4984, 

### Creating Good Context Vectors with Trainable Weights

Having implemented a basic, simplified attention mechanism, we now add trainable weights to it. We will use three trainable weight matrices: $W_{q}$, $W_{k}$ and $W_{v}$. These project the embedded input tokens $x^{(i)}$ into **query, key and value vectors**.

- Queries (Q) represent the "questions" or "requests" the model is asking about the input. In self-attention, each input token (e.g., a word in a sentence) generates a query vector to "search" for relevant information.
- Keys (K) act as "labels" or "indices" that the queries compare against to find matches. Each input token also generates a key vector, and, as we will see now, the similarity between a query and a key determines the attention weight.
- Values (V) contain the actual "content" or information that the model retrieves. Once attention weights are computed, they are used to form a weighted combination of value vectors.

To better understand this, imagine a normal database of key-value pairs representing first names (keys) and favorite food (values). Database $\mathcal{D}$ consists of tuples {("Rodrigo", "Tacos"), ("Daniel", "Paella"), ("Diego", "Burger")...("Rodolfo", "Pizza")}. We can operate on $\mathcal{D}$ with the exact query for "Rodrigo", which would return the value "Tacos". If ("Rodrigo", "Tacos") did not exist as a record, then there would be no valid answer. If we allowed approximate matches, then maybe we would retrieve ("Rodolfo", "Pizza") instead. Therefore, if $\mathcal{D}$:

$$
\begin{align*}
\mathcal{D} = \{(k_{1}, v_{1}), (k_{2}, v_{2}), ..., (k_{m}, v_{m})\}
\end{align*}
$$

denotes a database of $m$ tuples of keys and values, and $q$ denotes a query against this database:

- We can design $q$ that operates on $(k,v)$ pairs to be valid, no matter the database size or the existence of perfect matches.
- The same $q$ can receive different answers, depending on the contents of the database, and on the way approximate matches are computed.
- There is no need to compress the database to make the operations effective.
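
The database analogy can be sketched as a "soft lookup" in plain Python. Instead of returning the value of one exactly matching key, it returns a similarity-weighted average of all values. The function `soft_lookup`, the 2-dimensional key vectors and the numeric values are hypothetical and serve only to illustrate the idea.

```python
import math

def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def soft_lookup(query, keys, values):
    """Return the match weights and a similarity-weighted average of all values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    blended = sum(w * v for w, v in zip(weights, values))
    return weights, blended

# Hypothetical 2-d key vectors and numeric values (illustrative only)
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
values = [10.0, 20.0, 30.0]

# The query equals none of the keys, yet the lookup still returns an answer
weights, answer = soft_lookup([5.0, 0.0], keys, values)
print(weights, answer)
```

The same query would produce a different answer against a different set of keys and values, and the result is always defined regardless of how many records the database holds.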


> Note: Do not confuse attention weights with weight parameters. As we have already learned, attention weights determine the extent to which a context vector depends on the different parts of the input. Weight parameters are learned, and they define the behavior of the network's connections. Attention weights are dynamic and context-dependent.
>
> The weight parameters, that is, the matrices transforming the input into Q, K and V, are learned during training. This means that a model can fine-tune how it "asks questions" (queries), "indexes information" (keys), and "represents content" (values) based on the task (e.g., translation, classification, etc).

At each step, we compute query $(q)$, key $(k)$ and value $(v)$ vectors via matrix multiplication between the input $x^{(i)}$ and the weight matrix $W_{q}$ (for the query vector), $W_{k}$ (for the key vector) and $W_{v}$ (for the value vector).

|           ![Self-Attention Mechanism Step](images/kqv.png)           |
|:--------------------------------------------------------------------:|
| *Understanding Keys, Queries & Values* (Akshay Pachaar, on LinkedIn) |

Let's compute the context vector $z^{(3)}$ first, and then extend the logic to compute all other context vectors.

In [10]:
x_3 = inputs[2]
d_in = inputs.shape[1] # input embedding size
d_out = 2 #output embedding size

torch.manual_seed(123)
W_q = torch.nn.Parameter(torch.randn(d_in, d_out), requires_grad=False) # in real training scenarios, these would be set to requires_grad=True
W_k = torch.nn.Parameter(torch.randn(d_in, d_out), requires_grad=False) # for now it is just to avoid varying outputs
W_v = torch.nn.Parameter(torch.randn(d_in, d_out), requires_grad=False)

query_3_first = x_3 @ W_q
key_3_first = x_3 @ W_k
value_3_first = x_3 @ W_v

print(query_3_first)

tensor([-0.4854,  0.0467])


In [11]:
print(f'Inputs have shape: {inputs.shape}')

keys = inputs @ W_k
values = inputs @ W_v

print(f'Keys has shape: {keys.shape}')
print(f'Values has shape: {values.shape}')

Inputs have shape: torch.Size([6, 3])
Keys has shape: torch.Size([6, 2])
Values has shape: torch.Size([6, 2])


The next step is to compute an unscaled attention score $w_{3,i}$ for every key $k^{(i)}$ with respect to query $q^{(3)}$ (since we are trying to compute the context vector for the third token). The attention score is simply the dot product between the query and the key vectors. Computing the attention score $w_{3,3}$, for instance:

In [12]:
keys_3 = keys[2]
attn_score_33 = query_3_first.dot(keys_3)
print(attn_score_33)

# generalizing to all attention scores for query_3
attn_scores_3 = query_3_first @ keys.T
print(attn_scores_3)

tensor(0.1998)
tensor([ 0.0214,  0.2577,  0.1998, -0.0948,  0.3484, -0.0249])


|                                                                                     ![Query Intuition](images/querylogic.png)                                                                                     |
|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| *Query $q^{(3)}$ ($q^{(2)}$ in this example) is used to calculate all attention scores $w_{3,i}$* (Understanding and Coding the Self-Attention Mechanism of Large Language Models from Scratch, Sebastian Raschka) |

As before, we now compute the attention weights by normalizing the attention scores through a softmax function. This time, we scale the attention scores by dividing them by the square root of the embedding dimension of the keys. The **scaled dot-product attention weights** can be written as:

$$
\alpha_{i,j} = \text{softmax}\left(\frac{\mathbf{q}_i \bullet \mathbf{k}_j}{\sqrt{d_k}}\right)
$$

where

$$
\begin{align*}
\text{softmax}(x_j) = \frac{e^{x_j}}{\sum_k e^{x_k}}
\end{align*}
$$

> The scaling factor $\frac{1}{\sqrt{d_k}}$ is very important. Without it, the typical magnitude of the dot product grows with the dimension $d_k$, leading to potentially very large values, especially when scaling up the embedding dimension (typically greater than 1000). For example, if the components of the query and the key are independent with zero mean and unit variance, the variance of their dot product is $d_k$, so for $d_k = 64$ a typical score has a magnitude of around $\sqrt{64} = 8$. Large scores push the softmax function into a 'saturated' regime, where one weight approaches 1 while others approach 0, resulting in very tiny gradients for most tokens, which can slow down or completely stagnate learning.
>
> Without scaling, scores might be [8,0,0]:
$$
\begin{align*}
\text{softmax}([8,0,0]) \approx [1,0,0]
\end{align*}
$$

>All gradients but one are nearly 0, stalling learning. With scaling $\frac{8}{\sqrt{64}} = 1$:

$$
\begin{align*}
\text{softmax}([1,0,0]) \approx [0.576, 0.212, 0.212]
\end{align*}
$$
>Now, all tokens receive non-zero gradients.
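
The effect of the scaling factor can be checked empirically. The sketch below, in plain Python, assumes query and key components drawn independently from a standard normal distribution; the function `dot_product_std`, the sample size and the seed are assumptions made for illustration.

```python
import math
import random

def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_std(d_k, n_samples=2000, seed=0):
    """Empirical standard deviation of q . k for random unit-variance vectors."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / n_samples
    return math.sqrt(sum((d - mean) ** 2 for d in dots) / n_samples)

std_4 = dot_product_std(4)    # expected to be close to sqrt(4) = 2
std_64 = dot_product_std(64)  # expected to be close to sqrt(64) = 8

raw_scores = [8.0, 0.0, 0.0]
unscaled = softmax(raw_scores)                              # nearly one-hot
scaled = softmax([s / math.sqrt(64) for s in raw_scores])   # much flatter

print(std_4, std_64)
print(unscaled)
print(scaled)
```

Dividing by $\sqrt{d_k}$ brings the spread of the scores back to roughly 1 regardless of the dimension, which keeps the softmax away from the saturated regime.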



In [14]:
d_k = keys.shape[-1]
print(f'Embedding dimension of keys is: {d_k}')
attn_weights_3 = torch.nn.functional.softmax(attn_scores_3/d_k**0.5, dim=0)
print(f'Attention weights for the third token are: {attn_weights_3}')


Embedding dimension of keys is: 2
Attention weights for the third token are: tensor([0.1547, 0.1828, 0.1755, 0.1425, 0.1949, 0.1497])


As a last step, we multiply all value vectors $v^{(i)}$ with their respective attention weights $\alpha_{3,i}$, and then sum everything to obtain the context vector $z^{(3)}$.

In [15]:
context_vec_3 = attn_weights_3 @ values
print(f'Context vector for the third token is: {context_vec_3}')

Context vector for the third token is: tensor([0.2618, 0.4683])
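
The same steps extend to all positions at once. The following is a minimal, dependency-free sketch of scaled dot-product self-attention over the whole input. The helpers `matmul`, `transpose` and `self_attention` are written for this example, and the projection matrices are hypothetical hand-picked values, not the seeded `W_q`, `W_k` and `W_v` above, so the numbers will differ from the notebook's outputs.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Return attention weights (T x T) and context vectors (T x d_out)."""
    queries, keys, values = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(keys[0])
    scores = matmul(queries, transpose(keys))
    # Scale each score by sqrt(d_k), then normalize every row with softmax
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return weights, matmul(weights, values)

inputs = [[0.35, 0.15, 0.89], [0.97, 0.8, 0.3], [0.65, 0.34, 0.24],
          [0.2, 0.87, 0.34], [0.86, 0.13, 0.05], [0.10, 0.20, 0.30]]

# Hypothetical 3x2 projection matrices (illustrative values only)
w_q = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]
w_k = [[0.2, 0.1], [-0.3, 0.5], [0.4, -0.1]]
w_v = [[0.5, 0.0], [0.0, 0.5], [0.25, 0.25]]

attn_weights, context_vectors = self_attention(inputs, w_q, w_k, w_v)
print(len(context_vectors), len(context_vectors[0]))
```

With the tensors defined in this notebook, the equivalent PyTorch expression would be `torch.softmax(queries @ keys.T / d_k**0.5, dim=-1) @ values`, where `queries = inputs @ W_q`.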
