### Attention Mechanisms
In this notebook, we will sequentially implement different variants of attention mechanisms. These variants will build on each other, with the goal of finally creating a compact, efficient implementation of an attention mechanism, which we can then plug into our LLM architecture.

**Simple Self-Attention**: Introduces the broader idea behind attention.

**Self-Attention**: Adds the trainable weights that form the basis of the mechanisms used in LLMs.

**Causal Attention**: A self-attention variant allowing a model to consider only previous and current inputs in a sequence, ensuring temporal order during text generation.

**Multi-head Attention**: An extension of self-attention and causal attention that enables the model to simultaneously attend to information from different representation subspaces.

#### Why Attention?
In machine translation, it is not possible to merely translate word by word. The translation process requires contextual understanding and grammatical alignment.

- "Kannst du mir helfen diesen Satz zu uebersetzen" should not be translated to "Can you me help this sentence to translate", but rather to "Can you help me translate this sentence".
- Certain words require access to words appearing earlier or later in the original sentence. For instance, the verb "to translate" should be used in the context of "this sentence", and not independently.

Typically, to overcome this challenge, deep neural networks with two submodules are used:

- **encoder**: reads in and processes the entire input text (already done in the `preprocessing.ipynb` notebook).

- **decoder**: produces the translated text.

Pre-LLM architectures typically relied on recurrent neural networks (RNNs), a type of neural network in which outputs from previous steps are fed as inputs to the current step, making them well suited to sequential data. In an encoder-decoder RNN, the input text is fed token by token into the encoder, which processes it sequentially. The encoder's final state, known as the hidden state, acts as a memory cell that encodes the entire input. This hidden state is then passed to the decoder, which generates the translated sentence one word at a time.

|     ![RNNEncoder-Decoder](images/RNNencoderdecoder.png)     |
|:-----------------------------------------------------------:|
| *RNN Encoder Decoder* (Dive Into Deep Learning Chapter 10.7) |

- While the encoder is many-to-one, the decoder is a one-to-many architecture, since the hidden state is passed at every step of the decoding process.
- While it is not strictly necessary to understand the inner workings of encoder-decoder RNNs to develop an LLM, the `seq2seq.ipynb` notebook in stage 1 aims to more deeply explore these architectures.

**Encoder-decoder RNNs had shortcomings that motivated the design of attention mechanisms.** Chiefly, the decoder could not access earlier hidden states of the encoder, because it relied on a single hidden state to hold all the relevant information. Context was lost, especially in complex sentences where dependencies span long distances.
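The bottleneck described above can be made concrete with a small sketch. It is only an illustration, not the architecture from `seq2seq.ipynb`: the use of `nn.GRU`, the layer sizes, and the random stand-in embeddings are all assumptions chosen for brevity. The point is that the encoder produces one hidden state per input token, yet only the last one reaches the decoder.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
embed_dim, hidden_dim, src_len, tgt_len = 8, 16, 5, 4

encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
# The decoder sees its own input concatenated with the encoder context.
decoder = nn.GRU(embed_dim + hidden_dim, hidden_dim, batch_first=True)

src = torch.randn(1, src_len, embed_dim)  # stand-in for an embedded source sentence
tgt = torch.randn(1, tgt_len, embed_dim)  # stand-in for embedded decoder inputs

# The encoder yields one hidden state per source token, plus the final state.
all_states, final_state = encoder(src)

# Only the final state is kept as context; it is repeated at every decoding step.
context = final_state[-1].unsqueeze(1).expand(-1, tgt_len, -1)
decoder_in = torch.cat([tgt, context], dim=-1)
decoder_out, _ = decoder(decoder_in, final_state)

print(all_states.shape)   # per-token encoder states, never seen by the decoder
print(final_state.shape)  # the single vector the decoder depends on
print(decoder_out.shape)
```

Everything in `all_states` except its last time step is discarded here, which is the information an attention mechanism would let the decoder look at again.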



## Simple Self-Attention

The original *transformer* architecture includes a 'Self-Attention' mechanism inspired by the Bahdanau attention mechanism mentioned in `seq2seq.ipynb`.

A mechanism that uses self-attention allows each position in the input sequence (each word) to consider the importance of all other positions (all other words) in the same sequence when computing the representation of that sequence.

In short, **the goal of a self-attention mechanism is to compute, for each position in the input sequence, a context vector that captures, quantifies, and combines information from all other positions.** For example, given an input sequence $X = (x^{(1)}, x^{(2)}, \dots, x^{(T)})$ (a text that has already been transformed into token embeddings), we want to compute the context vector $z^{(3)}$ of the 3rd position, $x^{(3)}$. The importance of each input for computing $z^{(3)}$ is determined by the attention weights $\alpha_{3,1}, \alpha_{3,2}, \dots, \alpha_{3,T}$, so that $z^{(3)}$ is a combination of all input vectors, weighted with respect to the input element $x^{(3)}$.

- Our goal is to compute context vectors $z^{(i)}$ for all $x^{(i)}$ in the input sequence. The resulting context vectors can be seen as enriched, more informative embedding vectors.
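Written out, the "combination of all input vectors" for the 3rd position can be read as a weighted sum. The normalization condition on the right is an assumption here (it is the usual convention, but it has not been stated so far in this notebook):

$$
z^{(3)} = \sum_{j=1}^{T} \alpha_{3,j}\, x^{(j)}, \qquad \sum_{j=1}^{T} \alpha_{3,j} = 1
$$

The same form applies to any position $i$, with $\alpha_{i,j}$ in place of $\alpha_{3,j}$.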

#### Why 'Self-Attention'?

The 'self' refers to the fact that the mechanism computes weights by assessing and learning the dependencies between different positions **within the same sequence**. The relationships among the various parts of the input itself are considered. This is in contrast to attention mechanisms that assess and learn the relationships between elements of distinct sequences. In a sequence-to-sequence model, for instance, attention computed between an input sequence and a distinct output sequence is not self-attention.




In [2]:
import torch
import numpy as np

# Toy input sequence: 6 tokens, each represented by a 3-dimensional embedding.
inputs = torch.tensor(
    [[0.35, 0.15, 0.89],  # These      (x^1)
     [0.97, 0.80, 0.30],  # are        (x^2)
     [0.65, 0.34, 0.24],  # random     (x^3)
     [0.20, 0.87, 0.34],  # words      (x^4)
     [0.86, 0.13, 0.05],  # and        (x^5)
     [0.10, 0.20, 0.30]]  # embeddings (x^6)
)
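As a rough sketch of how a context vector such as $z^{(3)}$ could be obtained from these embeddings, the block below makes two assumptions that this section has not yet stated: the unnormalized scores are taken to be dot products between $x^{(3)}$ and every input, and the scores are turned into attention weights with a softmax. The names `query`, `scores`, `weights`, and `context` are illustrative only.

```python
import torch

# Same toy embeddings as the `inputs` tensor above (6 tokens, 3 dimensions each).
inputs = torch.tensor(
    [[0.35, 0.15, 0.89],
     [0.97, 0.80, 0.30],
     [0.65, 0.34, 0.24],
     [0.20, 0.87, 0.34],
     [0.86, 0.13, 0.05],
     [0.10, 0.20, 0.30]]
)

query = inputs[2]  # x^(3); index 2 because Python is 0-indexed

# Assumption: score each input by its dot product with the query.
scores = inputs @ query                 # shape (6,)

# Assumption: softmax turns scores into positive weights that sum to 1.
weights = torch.softmax(scores, dim=0)  # alpha_{3,1} ... alpha_{3,6}

# Context vector z^(3): the weighted sum of all input vectors.
context = weights @ inputs              # shape (3,)

print(weights)
print(weights.sum())
print(context)
```

Because the weights are positive and sum to one, each component of `context` lies between the smallest and largest value of that component across the six inputs.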