# Self-Attention

These personal learning notes draw on [Understanding and Coding the Self-Attention Mechanism of Large Language Models From Scratch](https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html).

In [None]:
#|hide
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torch.nn import functional as F

## What is self-attention?

Self-attention grew out of research on machine translation, where it was introduced to give the model access to every element of a sequence at each time step. In language tasks, the meaning of a word can depend on its context within a larger text. Attention lets the model weigh the importance of the different elements in the input sequence and adjust how much each one influences the output.
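To make the idea of "weighing" concrete, here is a minimal sketch rather than the full mechanism. It skips the learned query, key and value projections that real self-attention uses, and the toy vectors in `x` are made-up values chosen only for illustration. Each element is scored against every other element with a dot product, a softmax turns the scores into weights that sum to 1, and each output is a weighted average of all inputs.

```python
import torch
import torch.nn.functional as F

# Toy input: 3 sequence elements, each a 2-dimensional vector (made-up values).
x = torch.tensor([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])

# Similarity score between every pair of elements (dot products), shape (3, 3).
scores = x @ x.T

# Softmax over each row turns the scores into weights that sum to 1.
weights = F.softmax(scores, dim=-1)

# Each output row is a weighted average of all input rows, shape (3, 2).
context = weights @ x

print(weights.shape, context.shape)
```

Row `i` of `weights` says how much each input element contributes to output `i`, which is the "importance" the paragraph above refers to.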

## Embedding an Input Sentence

Our input is the sentence "Music makes me happy". We'll first create an embedding for this entire sentence.

In [None]:
sentence = "Music makes me happy"

sentence_words = sentence.split()
sentence_words

['Music', 'makes', 'me', 'happy']

In [None]:
sentence_words_sorted = sorted(sentence_words)
sentence_words_sorted

['Music', 'happy', 'makes', 'me']

In [None]:
vocab = {word_str: word_idx for word_idx, word_str in enumerate(sentence_words_sorted)}
vocab

{'Music': 0, 'happy': 1, 'makes': 2, 'me': 3}

`vocab` is our dictionary, conveniently restricted to just the words we're using here. Every word has an associated number: its index in the dictionary.

We can now translate our sentence into a list of integers:

In [None]:
sentence_int = [vocab[word] for word in sentence_words]
sentence_int

[0, 2, 3, 1]
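The integer indices are not the embedding yet. One common next step, sketched here under assumptions rather than taken from these notes, is to look each index up in a `torch.nn.Embedding` table. The embedding size of 3 and the seed are arbitrary illustrative choices, and the table is randomly initialized (untrained), so the resulting numbers carry no meaning on their own.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only to make the run repeatable

sentence_int = [0, 2, 3, 1]   # indices for "Music makes me happy" from above
vocab_size = 4                # number of distinct words in our tiny dictionary
embedding_dim = 3             # assumed size, chosen only for illustration

# Lookup table with one randomly initialized row per word in the dictionary.
embedding = torch.nn.Embedding(vocab_size, embedding_dim)

# Look up one vector per word; detach() drops gradient tracking for display.
embedded_sentence = embedding(torch.tensor(sentence_int)).detach()

print(embedded_sentence.shape)  # torch.Size([4, 3])
```

Each row of `embedded_sentence` is the vector for one word, in sentence order, so the second row is the table entry for index 2 ("makes").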

## Resources

- [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
- [Thinking Like Transformers](https://arxiv.org/abs/2106.06981)