### Attention Mechanisms
In this notebook, we will sequentially implement different variants of attention mechanisms. These variants will build on each other, with the goal of finally creating a compact, efficient implementation of an attention mechanism, which we can then plug into our LLM architecture.

- **Simple Self-Attention**: Introduces the broader idea behind attention.
- **Self-Attention**: Adds trainable weights, forming the basis of the mechanisms used in LLMs.
- **Causal Attention**: A self-attention variant that lets the model consider only previous and current inputs in a sequence, preserving temporal order during text generation.
- **Multi-head Attention**: An extension of self-attention and causal attention that enables the model to attend simultaneously to information from different representation subspaces.

#### Why Attention?
In machine translation, translating a text word by word is not enough. The translation process requires contextual understanding and grammatical alignment.

- "Kannst du mir helfen diesen Satz zu uebersetzen" should not be translated to "Can you me help this sentence to translate", but rather to "Can you help me translate this sentence".
- Certain words require access to words that appear earlier or later in the original sentence. For instance, the verb "to translate" should be used in the context of "this sentence", and not independently.

Typically, to overcome this challenge, deep neural networks with two submodules are used:

- **encoder**: reads in and processes the entire text (already done in the `preprocessing.ipynb` notebook).
- **decoder**: produces the translated text.

Pre-LLM architectures typically relied on recurrent neural networks (RNNs), a type of neural network in which outputs from previous steps are fed as inputs to the current step, making them well suited to sequential data. In an encoder-decoder RNN, the input text is fed token by token into the encoder, which processes it sequentially. The encoder's final state, known as the hidden state, acts as a memory cell that encodes the entire input. This hidden state is then passed to the decoder, which generates the translated sentence one word at a time.

- While the encoder is many-to-one, the decoder is a one-to-many architecture, since the hidden state is passed at every step of the decoding process.
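
The encoder-decoder flow described above can be sketched in a few lines. This is a minimal illustration under several assumptions that are not part of the original notebook: toy sizes, untrained weights, an embedding shared between encoder and decoder, a hypothetical start-of-sequence id of `0`, and greedy (argmax) decoding.

```python
import torch

torch.manual_seed(0)
vocab_size, hidden_size = 10, 8          # hypothetical toy sizes

# Assumption: one embedding table shared by encoder and decoder, for brevity.
embedding = torch.nn.Embedding(vocab_size, hidden_size)
encoder_gru = torch.nn.GRU(hidden_size, hidden_size)
decoder_gru = torch.nn.GRU(hidden_size, hidden_size)
out_proj = torch.nn.Linear(hidden_size, vocab_size)

# Encoder (many-to-one): read the whole source, keep only the final hidden state.
source = torch.tensor([1, 4, 2, 7])                   # (seq_len,)
embedded = embedding(source).unsqueeze(1)             # (seq_len, 1, hidden_size)
encoder_outputs, context = encoder_gru(embedded)      # context: (1, 1, hidden_size)

# Decoder (one-to-many): start from the single context vector and
# generate one token per step, carrying the hidden state forward.
token = torch.tensor([0])                             # hypothetical start-of-sequence id
hidden = context
generated = []
for _ in range(5):
    step_input = embedding(token).unsqueeze(0)        # (1, 1, hidden_size)
    output, hidden = decoder_gru(step_input, hidden)
    logits = out_proj(output.squeeze(0))              # (1, vocab_size)
    token = logits.argmax(dim=-1)                     # greedy choice, shape (1,)
    generated.append(token.item())

print(context.shape)   # torch.Size([1, 1, 8])
print(len(generated))  # 5
```

Note that the decoder loop only ever sees `context`; the per-token `encoder_outputs` are computed but never used during decoding.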

**Encoder-decoder RNNs had shortcomings that motivated the design of attention mechanisms**: the decoder could not access earlier hidden states of the encoder, because it relied on a single hidden state to hold all the relevant information. Context was lost, especially in complex sentences where dependencies span long distances.

### *Bahdanau Attention Mechanism*

In this modification, the decoder can selectively access different parts of the input sequence at each decoding step.

- When generating an output token, the model has a way to access all input tokens.
- Each input token fed to the encoder is assigned a weight that measures how important it is for the respective output token.
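
One way to picture these importance weights is the additive scoring form commonly associated with Bahdanau attention: each encoder hidden state is scored against the current decoder state, the scores are normalized with a softmax, and the weighted sum of encoder states becomes the context for that decoding step. The sketch below is a minimal illustration, not the notebook's implementation; the layer names (`W_enc`, `W_dec`, `v`), the toy sizes, and the random inputs are all assumptions.

```python
import torch

torch.manual_seed(0)
seq_len, hidden_size = 4, 8                           # hypothetical toy sizes

# Stand-ins for real model states: one encoder hidden state per input token,
# and the decoder's hidden state at the current decoding step.
encoder_outputs = torch.randn(seq_len, hidden_size)
decoder_state = torch.randn(hidden_size)

# Learnable projections of the additive score v^T tanh(W_enc h_i + W_dec s).
W_enc = torch.nn.Linear(hidden_size, hidden_size, bias=False)
W_dec = torch.nn.Linear(hidden_size, hidden_size, bias=False)
v = torch.nn.Linear(hidden_size, 1, bias=False)

# One score per input token; the decoder projection broadcasts over seq_len.
scores = v(torch.tanh(W_enc(encoder_outputs) + W_dec(decoder_state))).squeeze(-1)

# Normalize scores into importance weights that sum to 1.
attn_weights = torch.softmax(scores, dim=0)           # (seq_len,)

# Context vector: weighted sum of all encoder hidden states.
context = attn_weights @ encoder_outputs              # (hidden_size,)

print(attn_weights.shape)  # torch.Size([4])
print(context.shape)       # torch.Size([8])
```

Because the weights are recomputed from the decoder state at every step, each output token can draw on a different mix of input tokens.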



In [None]:
import torch
import torch.autograd
from torch import optim
import torch.nn.functional as F

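class EncoderRNN(torch.nn.Module):
    """GRU encoder that processes one sequence (batch size 1) at a time."""

    def __init__(self, input_size, hidden_size, num_layers=1):
        super().__init__()
        self.input_size = input_size      # vocabulary size
        self.hidden_size = hidden_size    # embedding and GRU hidden dimension
        self.num_layers = num_layers
        self.embedding = torch.nn.Embedding(input_size, hidden_size)
        self.gru = torch.nn.GRU(hidden_size, hidden_size, num_layers)

    def forward(self, word_inputs, hidden):
        # word_inputs: 1-D tensor of token indices, shape (seq_len,)
        seq_len = len(word_inputs)
        # Reshape to (seq_len, batch=1, hidden_size), the default GRU input layout
        embedded = self.embedding(word_inputs).view(seq_len, 1, -1)
        # output: (seq_len, 1, hidden_size); hidden: (num_layers, 1, hidden_size)
        output, hidden = self.gru(embedded, hidden)
        return output, hidden

    def init_hidden(self):
        # Zero initial hidden state of shape (num_layers, batch=1, hidden_size)
        return torch.zeros(self.num_layers, 1, self.hidden_size)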

The `EncoderRNN` class above shows how a plain encoder RNN, without attention, works. The relevant building blocks are:

- `torch.nn.Embedding`: As described before, it is simply a lookup table that stores the embeddings for a dictionary of fixed size. The input to this module is a tensor of indices, and the output is the corresponding word embeddings.

    - Note that the module has learnable weights: a tensor of shape `(num_embeddings, embedding_dim)`, initialized by default from N(0, 1).

- `torch.nn.GRU`: Applies a multi-layer gated recurrent unit (GRU) RNN to an input sequence.
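
To make the shapes of these two building blocks concrete, the following sketch runs a short index sequence through an embedding and a two-layer GRU. The sizes and token ids are hypothetical toy values chosen only for illustration.

```python
import torch

torch.manual_seed(0)
vocab_size, hidden_size, num_layers = 6, 3, 2         # hypothetical toy sizes

embedding = torch.nn.Embedding(vocab_size, hidden_size)
gru = torch.nn.GRU(hidden_size, hidden_size, num_layers)

token_ids = torch.tensor([2, 5, 2])                   # the id 2 appears twice
vectors = embedding(token_ids)                        # (3, hidden_size)

# Lookup-table behaviour: equal ids return the same row of the weight matrix.
same_row = torch.equal(vectors[0], vectors[2])
from_table = torch.equal(vectors[0], embedding.weight[2])

# By default the GRU expects input of shape (seq_len, batch, input_size).
inputs = vectors.unsqueeze(1)                         # (3, 1, hidden_size)
h0 = torch.zeros(num_layers, 1, hidden_size)
output, hidden = gru(inputs, h0)

# The last time step of `output` is the final hidden state of the top layer.
last_matches = torch.allclose(output[-1], hidden[-1])

print(embedding.weight.shape)  # torch.Size([6, 3])
print(output.shape)            # torch.Size([3, 1, 3])
print(hidden.shape)            # torch.Size([2, 1, 3])
print(same_row, from_table, last_matches)  # True True True
```

`output` holds the top layer's hidden state at every time step, while `hidden` holds the final state of every layer, which is why their leading dimensions differ.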

