The goal is to code a sample attention block that will later be used inside a transformer block to build an LLM. The attention mechanism is built up in steps:

1. Simple Self-Attention
2. Self-Attention with trainable weights
3. Causal Attention and Dropout
4. Single-Head to Multi-Head Attention

Why Attention?

We start with an input embedding of the sentence "Your journey starts with one step", using an embedding dimension of 3 (emb_dim=3):

In [3]:
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

<pre>To calculate the context vector for a token, say the 3rd token, 'starts':
    1. calculate the attention scores of this token with respect to every input token (including itself) using the dot product
    2. normalize the attention scores to get the attention weights
    3. sum (attention weight * input token embedding) over all tokens -> context vector of this token
</pre>
![Context Vector of a token](self-attention-1.png)
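
The three steps above can be written compactly for the 3rd token. This is only a sketch: the symbols ω (attention score), α (attention weight), and z (context vector) are notation assumed here, not taken from the notebook, and the normalization is assumed to be the softmax used in the code.

$$
\omega_{3,i} = x^{(3)} \cdot x^{(i)}, \qquad
\alpha_{3,i} = \frac{\exp(\omega_{3,i})}{\sum_{j=1}^{6} \exp(\omega_{3,j})}, \qquad
z^{(3)} = \sum_{i=1}^{6} \alpha_{3,i}\, x^{(i)}
$$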


For a single input token

In [31]:
# use the 3rd token ("starts") as the query
query = inputs[2]

# attention scores: one score per input token, so the vector has length inputs.shape[0]
attention_scores_3 = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
    attention_scores_3[i] = torch.dot(x_i, query)

# attention weights: softmax-normalize the scores so they are positive and sum to 1
# (plain sum normalization would be: attention_scores_3 / attention_scores_3.sum())
attention_weights_3 = torch.softmax(attention_scores_3, dim=0)

# context vector: weighted sum of all input embeddings
context_vector_3 = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
    context_vector_3 += attention_weights_3[i] * x_i

print(context_vector_3)


tensor([0.4431, 0.6496, 0.5671])
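
The cell above keeps plain sum normalization as a commented-out alternative to softmax. The sketch below compares the two; it is illustrative only, and the `toy_scores` tensor is a made-up example (not from the notebook) chosen to show what happens when some scores are negative.

```python
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

query = inputs[2]
scores = inputs @ query  # same dot products as the loop, computed in one call

# both normalizations produce weights that sum to 1 for these scores
weights_sum = scores / scores.sum()
weights_softmax = torch.softmax(scores, dim=0)
print(weights_sum.sum(), weights_softmax.sum())

# hypothetical scores with a negative entry (assumption, for illustration only)
toy_scores = torch.tensor([-1.0, 1.0, 0.5])
toy_sum = toy_scores / toy_scores.sum()          # can contain negative "weights"
toy_softmax = torch.softmax(toy_scores, dim=0)   # always strictly positive
print(toy_sum)
print(toy_softmax)
```

With all-positive scores the two approaches differ only in how sharply they weight the tokens. Once a score is negative, sum normalization can yield negative weights, while softmax still gives a valid probability distribution.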


For the whole input

In [37]:
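# all pairwise dot products: row i holds the scores for query token i
attention_scores = inputs @ inputs.T
# normalize each row so its attention weights sum to 1
attention_weights = torch.softmax(attention_scores, dim=-1)
# row i is the context vector for token i
context_vectors = attention_weights @ inputs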

print(context_vectors)

tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])
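
As a quick sanity check, the matrix version should reproduce the single-token result for 'starts' in its third row, and every row of the attention weights should sum to 1. A minimal sketch of that check, re-declaring `inputs` so the cell runs on its own:

```python
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

# matrix form, as in the cell above
attention_scores = inputs @ inputs.T
attention_weights = torch.softmax(attention_scores, dim=-1)
context_vectors = attention_weights @ inputs

# single-token form for the 3rd token, written without the explicit loops
query = inputs[2]
single_context = torch.softmax(inputs @ query, dim=0) @ inputs

row_sums = attention_weights.sum(dim=-1)
print(row_sums)                                            # each entry should be ~1
print(torch.allclose(context_vectors[2], single_context))  # rows should agree
```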


Calculating Self-Attention with trainable weights

![Self-Attention](self-attention-2.png)