# Chapter 3: Coding Attention Mechanisms
* We will implement four variations of the attention mechanism, each building on the previous one. The goal is to arrive at a compact and efficient implementation of multi-head attention.

1. Simplified self-attention -> a stripped-down version of self-attention, before adding trainable weights
2. Self-attention -> self-attention with trainable weights
3. Causal attention -> adds a mask to self-attention that allows the LLM to generate one word at a time
4. Multi-head attention -> organizes attention into multiple heads, allowing the model to capture various aspects of the input data in parallel

### 3.3.1 Simple self-attention mechanism without trainable weights

In [8]:
import torch
inputs = torch.tensor(
   [[0.43, 0.15, 0.89], # Your     
    [0.55, 0.87, 0.66], # journey  
    [0.57, 0.85, 0.64], # starts    
    [0.22, 0.58, 0.33], # with
    [0.77, 0.25, 0.10], # one
    [0.05, 0.80, 0.55]] # step
)
inputs.shape

torch.Size([6, 3])

In [13]:
# Compute the attention scores (omega) between the query and every input element
# (including the query itself) as dot products

query = inputs[1]  # second input token ("journey") serves as the query
attn_scores_2 = torch.empty(inputs.shape[0])

for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query)
    print(f"omega 2,{i}: {attn_scores_2[i]}")

omega 2,0: 0.9544000625610352
omega 2,1: 1.4950001239776611
omega 2,2: 1.4754000902175903
omega 2,3: 0.8434000015258789
omega 2,4: 0.7070000171661377
omega 2,5: 1.0865000486373901
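
The loop above computes one dot product per input row, which is the same thing as a single matrix-vector product. The following is a minimal sketch of that equivalence; the names `scores_loop` and `scores_matmul` are introduced here for illustration and are not part of the original notebook.

```python
import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your
     [0.55, 0.87, 0.66],  # journey
     [0.57, 0.85, 0.64],  # starts
     [0.22, 0.58, 0.33],  # with
     [0.77, 0.25, 0.10],  # one
     [0.05, 0.80, 0.55]]  # step
)
query = inputs[1]

# Loop version, mirroring the cell above
scores_loop = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
    scores_loop[i] = torch.dot(x_i, query)

# Vectorized version: (6, 3) @ (3,) -> (6,)
scores_matmul = inputs @ query

print(scores_matmul)
print(torch.allclose(scores_loop, scores_matmul))
```

The vectorized form avoids the Python-level loop, which usually matters once the sequence is longer than a handful of tokens.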


In [15]:
# Normalize these attention scores and obtain attention weights (alpha) that sum to 1

attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()
print("Attention weights: ", attn_weights_2_tmp)
print("Sum: ", attn_weights_2_tmp.sum())

Attention weights:  tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
Sum:  tensor(1.0000)


In [16]:
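# Softmax is preferred for normalization: every weight stays positive and the
# weights sum to 1, even when some attention scores are negative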
def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

attn_weights_2_naive = softmax_naive(attn_scores_2)
print("Attention weights: ", attn_weights_2_naive)
print("Sum: ", attn_weights_2_naive.sum())

Attention weights:  tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum:  tensor(1.)
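
To see why dividing by the sum can be a poor normalization, it may help to try it on scores that include a negative value. The sketch below uses made-up scores (`scores` is a hypothetical example, not data from the notebook) and compares sum-normalization with softmax.

```python
import torch

# Hypothetical attention scores, one of them negative
scores = torch.tensor([2.0, -1.0, 0.5])

# Dividing by the sum still yields values that add up to 1,
# but individual "weights" can be negative or larger than 1
sum_normalized = scores / scores.sum()

# Softmax exponentiates first, so every weight is strictly positive
softmax_weights = torch.softmax(scores, dim=0)

print("Sum-normalized:", sum_normalized, "sum =", sum_normalized.sum())
print("Softmax:       ", softmax_weights, "sum =", softmax_weights.sum())
```

With the inputs used in this chapter all scores happen to be positive, so both approaches give valid-looking weights; the difference only shows up for less friendly values.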


In [17]:
# In practice, use PyTorch's softmax: it is optimized for performance and avoids
# the overflow problems the naive version runs into with large scores

attn_weights_2 = torch.softmax(attn_scores_2, dim=0)
print("Attention weights: ", attn_weights_2)
print("Sum: ", attn_weights_2.sum())

Attention weights:  tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum:  tensor(1.)
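
The naive and built-in versions agree on the small scores above, but they can diverge on large ones. The sketch below illustrates this with made-up scores (`large_scores` is hypothetical). `softmax_stable` shows the common max-subtraction trick as one way to avoid overflow; it is an illustration, not a claim about how PyTorch implements `torch.softmax` internally.

```python
import torch

def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

def softmax_stable(x):
    # Subtracting the max leaves the result unchanged mathematically,
    # but keeps every exponent <= 0 so exp() cannot overflow
    shifted = x - x.max()
    exp_shifted = torch.exp(shifted)
    return exp_shifted / exp_shifted.sum(dim=0)

# Hypothetical scores large enough that exp() overflows in float32
large_scores = torch.tensor([1000.0, 1001.0, 1002.0])

print("Naive:  ", softmax_naive(large_scores))   # inf / inf produces nan
print("Stable: ", softmax_stable(large_scores))
print("PyTorch:", torch.softmax(large_scores, dim=0))
```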


In [24]:
# Now compute the context vector z(2): the sum of all input vectors,
# each weighted by its attention weight
query = inputs[1]  # still the second input token
context_vector = torch.zeros(query.shape)

for i, x_i in enumerate(inputs):
    context_vector += attn_weights_2[i] * x_i
    
print(context_vector)

tensor([0.4419, 0.6515, 0.5683])
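
The weighted sum in the loop above can also be written as a single vector-matrix product. The following is a minimal sketch that recomputes the weights from scratch so it runs on its own; `context_loop` and `context_matmul` are names introduced here for illustration.

```python
import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your
     [0.55, 0.87, 0.66],  # journey
     [0.57, 0.85, 0.64],  # starts
     [0.22, 0.58, 0.33],  # with
     [0.77, 0.25, 0.10],  # one
     [0.05, 0.80, 0.55]]  # step
)
query = inputs[1]

# Attention weights for the second token, as computed earlier
attn_weights_2 = torch.softmax(inputs @ query, dim=0)

# Loop version, mirroring the cell above
context_loop = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
    context_loop += attn_weights_2[i] * x_i

# Vectorized version: (6,) @ (6, 3) -> (3,)
context_matmul = attn_weights_2 @ inputs

print(context_matmul)
print(torch.allclose(context_loop, context_matmul))
```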
