# Self Attention

These personal learning notes draw on [Understanding and Coding the Self-Attention Mechanism of Large Language Models From Scratch](https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html).

In [None]:
#|hide
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(123)
torch.set_printoptions(precision=1, sci_mode=False, profile='short')

## What is self-attention?

Self-attention grew out of research on machine translation, where it was introduced to give a model access to all elements of a sequence at each time step. In language tasks, the meaning of a word can depend on its context within a larger text. Attention lets the model weigh the importance of the different elements in the input sequence and adjust their influence on the output.

## Embedding an Input Sentence

Our input is: "Playing music makes me very happy". We'll create an embedding for this entire sentence first.

In [None]:
sentence = "Playing music makes me very happy"

sentence_words = sentence.split()
sentence_words

['Playing', 'music', 'makes', 'me', 'very', 'happy']

In [None]:
sentence_words_sorted = sorted(sentence_words)
sentence_words_sorted

['Playing', 'happy', 'makes', 'me', 'music', 'very']

In [None]:
vocab = {word_str: word_idx for word_idx, word_str in enumerate(sentence_words_sorted)}
vocab

{'Playing': 0, 'happy': 1, 'makes': 2, 'me': 3, 'music': 4, 'very': 5}

`vocab` is our dictionary, conveniently restricted to just the words we're using here. Every word we're using has a number associated with it: its index in our dictionary.

We can now translate our sentence into an array of integers:

In [None]:
sentence_int = torch.tensor([vocab[word] for word in sentence_words])
sentence_int

tensor([0, 4, 2, 3, 5, 1])
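
As a quick sanity check on the word-to-index mapping, the sketch below inverts it and decodes the integer tensor back into words. The names `word_to_idx`, `idx_to_word`, `encoded` and `decoded` are introduced here for illustration only and are not part of the original notebook.

```python
import torch

sentence = "Playing music makes me very happy"
sentence_words = sentence.split()

# Same construction as above: index = position in the sorted word list.
word_to_idx = {word: idx for idx, word in enumerate(sorted(sentence_words))}

# Hypothetical inverse mapping for this illustration: index -> word.
idx_to_word = {idx: word for word, idx in word_to_idx.items()}

encoded = torch.tensor([word_to_idx[word] for word in sentence_words])
decoded = [idx_to_word[idx] for idx in encoded.tolist()]

print(encoded)   # tensor([0, 4, 2, 3, 5, 1])
print(decoded)   # the original list of words
```

If the round trip returns the original words, the mapping is one-to-one, which is what the embedding layer relies on.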

Now that our sentence is translated into a list of integers, we can feed those to an embedding layer to encode the inputs as real-valued vectors. Let's use 16 dimensions, so that each word is mapped onto an embedding of 16 floats.

If our sentence is 6 words long (or whatever context length we end up choosing), the output of the embedding layer will have shape $6 \times 16$. We'll create a PyTorch embedding layer with 6 possible indices and a 16-dimensional embedding vector for each index.

In [None]:
embed = torch.nn.Embedding(6,16)
sentence_embedded = embed(sentence_int).detach()
print(sentence_embedded)
print(sentence_embedded.shape)

tensor([[-1.0, -2.1, -0.3, -0.4,  1.1, -0.6, -2.3, -1.4,  1.2, -0.4, -0.4,  0.7,
          1.0, -0.0, -0.1, -0.1],
        [ 0.2,  0.1, -1.3, -2.9,  0.1, -1.2, -0.3,  0.1, -1.3, -0.5, -2.1,  0.9,
         -0.6, -0.1,  0.7, -2.8],
        [ 1.4, -0.1, -0.8,  0.2, -0.6, -1.2,  1.0,  0.1,  1.1, -0.1, -0.1,  0.4,
         -1.4,  0.1,  0.1, -1.0],
        [ 1.3, -0.0,  0.2, -0.0,  1.9,  2.1, -0.5, -0.8, -1.1, -1.0, -0.5,  1.2,
         -0.9,  1.3,  0.1, -0.1],
        [ 1.1,  0.9,  0.9,  0.0, -0.9,  0.7, -0.4, -0.2, -0.9,  0.3, -0.0,  0.3,
          0.2, -0.5, -0.6,  0.2],
        [-0.1,  0.2, -0.9, -1.0,  1.3,  0.4, -0.5, -0.6, -1.1,  1.8, -0.1, -0.3,
          0.9,  0.5, -1.3,  0.8]])
torch.Size([6, 16])


So, we gave the embedding layer a tensor of 6 integers and got back a $6 \times 16$ tensor: each index, representing a word, is translated into an array of 16 floats.
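
An embedding layer is essentially a lookup table: each index selects one row of the layer's weight matrix. The sketch below uses illustrative names (`embed_demo`, `idx`) and its own arbitrary seed, so its random values will not match the output printed above. It checks that calling the layer gives the same result as indexing `weight` directly.

```python
import torch

torch.manual_seed(0)  # arbitrary seed for this sketch

embed_demo = torch.nn.Embedding(6, 16)   # 6 indices, 16 floats per index
idx = torch.tensor([0, 4, 2, 3, 5, 1])   # the encoded sentence

via_layer = embed_demo(idx).detach()            # forward pass through the layer
via_indexing = embed_demo.weight[idx].detach()  # plain row selection

print(embed_demo.weight.shape)               # torch.Size([6, 16])
print(via_layer.shape)                       # torch.Size([6, 16])
print(torch.equal(via_layer, via_indexing))  # True
```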

## Defining weight matrices

Self-attention has three weight matrices, each of which is adjusted during training like any other model parameter.

- $W_{q}$: projects our input to the *query*
- $W_{k}$: projects our input to the *key*
- $W_{v}$: projects our input to the *value*

Each of the *query* $q$, *key* $k$ and *value* $v$ is a vector belonging to one input element. We calculate them by matrix multiplication between the $W$ matrices and the embedded inputs $x$. Our sequence has length $T$.

- $q^{(i)} = W_{q} x^{(i)}$ for every element at index $i$, ranging from $1$ to $T$
- $k^{(i)} = W_{k} x^{(i)}$ for every element at index $i$, ranging from $1$ to $T$
- $v^{(i)} = W_{v} x^{(i)}$ for every element at index $i$, ranging from $1$ to $T$

This will give us three vectors for each input element (token) in our sequence.
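
The three projections can be sketched in PyTorch as follows. This is a minimal illustration, assuming random stand-ins for the embedded sentence and for the weight matrices; the projection sizes `d_q`, `d_k` and `d_v` are placeholder values chosen only for this example. It computes the vectors one token at a time, as in the formulas, and then checks that a single matrix product over the whole sequence gives the same queries.

```python
import torch

torch.manual_seed(0)  # arbitrary seed for this sketch

T, d = 6, 16                 # sequence length and embedding size
d_q, d_k, d_v = 8, 8, 10     # placeholder projection sizes (assumption)

x = torch.randn(T, d)        # random stand-in for the embedded sentence
W_q = torch.randn(d_q, d)    # random stand-ins for the weight matrices
W_k = torch.randn(d_k, d)
W_v = torch.randn(d_v, d)

# One token at a time: q^(i) = W_q x^(i), and likewise for k and v.
queries = torch.stack([W_q @ x[i] for i in range(T)])
keys = torch.stack([W_k @ x[i] for i in range(T)])
values = torch.stack([W_v @ x[i] for i in range(T)])

# Whole sequence at once: row i of x @ W_q.T is q^(i).
queries_batched = x @ W_q.T

print(queries.shape, keys.shape, values.shape)
print(torch.allclose(queries, queries_batched, atol=1e-4))
```

The batched form is what one would normally use in practice, since it avoids the Python loop over tokens.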

Let's assume that $d$ is the size (number of dimensions) of each embedded word vector $x^{(i)}$. The vector $q^{(i)}$ is the query vector for the word at index $i$ and has a dimension we can choose. We'll call this $d_q$. In the same way, we'll call $d_k$ the dimension of $k^{(i)}$.

We'll calculate the dot product between the query and key vectors, which means that they need to have the same dimension: $d_q = d_k$. Let's choose $d_q = d_k = 24$ in this case.

If $q^{(i)} = W_{q} x^{(i)}$, then $q^{(i)}$ has dimension $d_q \times 1$, $W_{q}$ has dimension $d_q \times d$, and $x^{(i)}$ has dimension $d \times 1$.
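
To make the shapes concrete, the sketch below builds a random $W_q$ and $W_k$ with $d = 16$ and $d_q = d_k = 24$, treats one embedded word as a column vector, and prints the resulting dimensions. The tensors are random placeholders rather than trained weights, and the variable names are chosen for this illustration.

```python
import torch

torch.manual_seed(0)  # arbitrary seed for this sketch

d, d_q, d_k = 16, 24, 24

W_q = torch.randn(d_q, d)    # d_q x d
W_k = torch.randn(d_k, d)    # d_k x d
x_i = torch.randn(d, 1)      # one embedded word as a column vector, d x 1

q_i = W_q @ x_i              # (d_q x d) @ (d x 1) -> d_q x 1
k_i = W_k @ x_i              # (d_k x d) @ (d x 1) -> d_k x 1

# The dot product is only defined because d_q == d_k.
score = (q_i.T @ k_i).squeeze()

print(q_i.shape)    # torch.Size([24, 1])
print(k_i.shape)    # torch.Size([24, 1])
print(score.shape)  # torch.Size([])
```

The dot product collapses the two 24-dimensional vectors into a single scalar, which is why the query and key sizes must agree.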

## Resources

- [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
- [Thinking Like Transformers](https://arxiv.org/abs/2106.06981)