# Transformer layer

A transformer layer takes a sequence of $T$ word vectors and returns another sequence of $T$ vectors. Intuitively, it computes a contextual re-representation of each word vector, where the context is all the other word vectors in the sequence.

Given $T$ input word vectors $x_{1:T}$, it computes $T$ weighted averages $z_{1:T}$ of the inputs. Each $z_t$ is a representation of $x_t$ in the context of all other $x_{\ne t}$. It does this by fitting $T$ attention functions $a_{1:T}$ (before we only had one).

<img src="img/5_neural_networks_transformer.drawio.svg" width="500">

That's the idea, but more is going on. Here is a more detailed depiction of an encoder-decoder transformer. The image is taken from Jay Alammar's awesome blog post, [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/).
In the image: the thicker dotted lines are skip-connections.

<img src="img/8_transformers/transformer_architecture.png" width="650">

Here is another awesome visualisation.

<img src="img/8_transformers/transformer_decoding_2.gif" width="550">

Let's look at each component. In what follows:
* We use row vectors.
* We assume $T$ input word vectors $x$ stacked into a matrix $X \in \mathbb{R}^{\text{doc len} \times \text{embed dim}}$. For instance, the first image under the Self-Attention section below illustrates $X \in \mathbb{R}^{2 \times 4}$.

# Tokenization

Limitations of Word (space) tokenization:
* Large vocabularies. Even more likely to occur for languages with a rich inflectional morphology, where different suffixes attached to word roots express different genders, numbers, or cases (accusative, genitive, dative).
* We will impose a limit on vocabulary size. But then we'll have out-of-vocabulary tokens.
* Going back to inflectional morphology: loss of meaning between very similar words. E.g. "phone" and "phones" will be two separate vocabulary entries.

Limitations of character-based tokenization:
* Very long token sequences, each token being only one character long;
* When breaking words into individual characters, we miss the meaning that results from gluing characters together.

Subword tokenisation aims to increase coverage of dictionaries. Combining subword tokens seen during training, we can represent words unseen during training.
Here are popular subword tokenization algorithms.

Byte-pair encoding:
* Keep frequent words in their original form.
* Break down infrequent words.

Wordpiece vs BPE:
* At each step, BPE merges the most frequent pair of adjacent tokens and adds the result to the vocabulary; Wordpiece merges the pair that most increases the likelihood of the training data, i.e. it merges $t_1$ and $t_2$ if $p(t_1 t_2) / [p(t_1) p(t_2)]$ is the largest among all pairs of tokens.
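
The two merge criteria can be contrasted on a toy example. The sketch below is a simplification in plain Python, not the implementation of any tokenizer library: the corpus and its word frequencies are made up, and raw counts stand in for the probabilities (which only rescales every Wordpiece score by the same constant).

```python
from collections import Counter

# Hypothetical toy corpus: each word is pre-split into characters, with its frequency.
corpus = {("h", "u", "g"): 10, ("p", "u", "g"): 5, ("p", "u", "n"): 12,
          ("b", "u", "n"): 4, ("h", "u", "g", "s"): 5}

def count_tokens_and_pairs(corpus):
    """Count how often each token and each adjacent token pair occurs."""
    tokens, pairs = Counter(), Counter()
    for word, freq in corpus.items():
        for tok in word:
            tokens[tok] += freq
        for left, right in zip(word, word[1:]):
            pairs[(left, right)] += freq
    return tokens, pairs

tokens, pairs = count_tokens_and_pairs(corpus)

# BPE: merge the most frequent adjacent pair.
bpe_pair = max(pairs, key=pairs.get)

# Wordpiece-style: merge the pair with the largest count(t1 t2) / (count(t1) * count(t2)).
wp_pair = max(pairs, key=lambda p: pairs[p] / (tokens[p[0]] * tokens[p[1]]))

print(bpe_pair)  # ('u', 'g')
print(wp_pair)   # ('g', 's')
```

BPE picks the pair that is simply most common, while the Wordpiece-style score favours a pair whose parts rarely occur apart from each other.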

Wordpiece and BPE assume the input string is already tokenised, e.g. space tokenised.
Sentencepiece does not. This matters for languages such as Japanese and Chinese, which do not separate words with explicit spaces.

After tokenization, we can add special tokens, such as the start of sequence token, often denoted as `<s>`.

# Input encoding

The output of tokenization is a list of integers of length (doc_len), each being a vocabulary index, e.g. `[1, 8, 3, 2]`. Given two or more lists, e.g. `[1, 8, 3, 2]` and `[1, 10, 2]`, we can batch them into a tensor of dimension `(batch_size, max batch doc_len)` `[[1, 8, 3, 2], [1, 10, 2, 0]]`. Note we have padded the shorter list with the index of the padding token in the vocabulary, i.e. 0 in our example. This is the input tensor.
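
A minimal sketch of the padding step in plain Python. The helper name `pad_batch` is made up for illustration, and `PAD_ID = 0` follows the example above.

```python
PAD_ID = 0  # assumed index of the padding token in the vocabulary

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

batch = pad_batch([[1, 8, 3, 2], [1, 10, 2]])
print(batch)  # [[1, 8, 3, 2], [1, 10, 2, 0]]
```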

Next, we have an embedding layer; this is parametrised by an embedding matrix $E$ of dimension `(vocab_len, embed_dim)`.
Row $e_i$ of this matrix, of dimension `embed_dim`, contains the vector representation, i.e. embedding, of word at index $i$ in the vocabulary.
We replace each vocabulary index in the input tensor with the corresponding embedding: $[[e_1, e_8, e_3, e_2], [e_1, e_{10}, e_2, e_0]]$.
This replacement can be simulated by: computing the one-hot (row) vector representation of each position; and multiplying that vector with the embedding matrix.
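
The equivalence between a row lookup and a one-hot multiplication can be checked directly. This is a sketch with a made-up embedding matrix (`vocab_len = 4`, `embed_dim = 2`), written in plain Python rather than with a tensor library.

```python
# Hypothetical embedding matrix E with vocab_len = 4 and embed_dim = 2.
E = [[0.0, 0.0],
     [0.1, 0.2],
     [0.3, 0.4],
     [0.5, 0.6]]

def one_hot(index, size):
    """Row vector with a single 1 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_mat(vec, mat):
    """Multiply a row vector with a matrix."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

token_ids = [1, 3, 2]
looked_up = [E[i] for i in token_ids]                             # direct row lookup
multiplied = [vec_mat(one_hot(i, len(E)), E) for i in token_ids]  # one-hot times E

print(looked_up == multiplied)  # True
```

In practice the lookup is preferred, since the one-hot product does `vocab_len` multiplications per output entry only to select a single row.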

Next, we have a positional encoding layer; this is parametrised by a matrix $P$ of dimension `(max allowed doc_len, embed_dim)`. We create the position tensor $[[p_0, p_1, p_2, p_3], [p_0, p_1, p_2, p_3]]$, i.e. simply take as many position vectors as we have tokens.

Finally, we sum the embedding tensor and the position tensor. The result is the input to the first transformer layer (encoder or decoder).

# Encoder

## Self-attention

Here are the steps performed **for each** input word vector $x_t$.

First, compute three vectors: a query $q_t$, a key $k_t$, and a value $v_t$. To create the query, multiply $x_t$ by a query matrix that we learn (i.e. the params in this matrix are adapted during training): $q_t = x_t W^Q$. Proceed analogously for the key and value vectors. Note: we can stack all $T$ queries into a matrix $Q = X W^Q$. In our example, $Q \in \mathbb{R}^{2 \times 3}$, $X \in \mathbb{R}^{2 \times 4}$, and $W^Q \in \mathbb{R}^{4 \times 3}$.

<img src="img/8_transformers/self-attention-matrix-calculation.png" width="250">

Second, compute $T$ scores $\sigma_{1:T}$, where $\sigma_{i} = q_t k_i^\top$ (dot product). Intuitively, $\sigma_{i}$ measures how much focus to place on word $i$ when encoding the current word $t$. For instance, in "I ate a pizza, it was delicious", when encoding "it", the score corresponding to its referent "pizza" should be the largest, while the score corresponding to "I" should be small.

Third, divide each $\sigma_i$ by 8, i.e. the square root of 64, the dimension of the key vectors used in the "Attention is all you Need" paper. Then compute the softmax over $\sigma_{1:T}$. Dividing by 8 leads to a more even distribution of the probability mass, so fewer weights end up close to 0, which gives more stable gradients.

Finally, multiply each value vector by the corresponding (softmax-normalised) score and sum the results to compute a representation $z_t$ of the current word: $z_t = \sum_{i=1}^T \sigma_i v_i$.

<img src="img/8_transformers/self-attention-output.png" width="450">
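
The steps above can be sketched for a single query in plain Python. The key and value vectors are made-up numbers and are assumed to be already computed (the first step); the function name `attend` is only illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(q_t, keys, values):
    """Scaled dot-product attention for one query (row vectors as lists)."""
    d_k = len(keys[0])
    scores = [dot(q_t, k) / math.sqrt(d_k) for k in keys]    # score, then scale
    weights = softmax(scores)                                # normalise
    z_t = [sum(w * v[j] for w, v in zip(weights, values))    # weighted sum of values
           for j in range(len(values[0]))]
    return weights, z_t

# Hypothetical toy keys/values for T = 3 tokens; the query matches the second key best.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
weights, z_t = attend([0.0, 4.0], keys, values)

print([round(w, 3) for w in weights])
print([round(z, 3) for z in z_t])
```

The weights sum to 1, and most of the mass lands on the second token, so `z_t` is dominated by the second value vector.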

## Intuition

Self-attention:
* Given a token, call it the current token, compute an affinity score between the current token and every token in the sequence, resulting in T affinity scores.
* To accomplish this, every token emits three vectors: a query, a key, and a value:
  * query of a token: intuitively expresses what that token is looking for when considering other tokens, e.g. "it" looking for a referent.
  * key: information content that other queries can match against.
  * value: what the token communicates about itself to other tokens.
* The affinity score between the current token and another token is the dot product between query(current token) and key(other token).
* When re-representing the current token, we compute the affinity between the current token and all previous tokens, including the current token. We then represent the current token as the weighted average of the values of all these tokens, where the weights are the (softmax-normalised) affinity scores.
* Summary: key (what I contain), query (what I am looking for), value (what I will communicate about myself).

Analogy:
* Directed graph. Every node has some value vector. This expresses the information content it communicates to nodes that it points to.
* Given a node, compute a weighted average of the nodes that point to this node. We want the weights to be data dependent.

When re-representing a specific token:
* Query: A representation of that specific token; representation that we score against all other tokens.
* Keys: For each token, a representation that we match the query to.
* (We match the query to each key to compute a relevance score; dot product and softmax).
* Values: The representations we average (weighted by the scores above) to compute a representation of the current word.

Analogy: searching in a filing cabinet.
* Query: A sticky note summarising the information we are looking for.
* Keys: For each folder, the key is the label summarising the information in that folder.
* (We match the query to each key to compute a relevance score for each folder)
* Values: For each folder, the information unit in the folder. We merge all information units, each being assigned a "trust" score computed before.


## Implementation

Assume we are given a tensor like the one below, of size (batch_size, doc_len, emb_dim).

In [1]:
import torch

x = torch.tensor([
    [[1, 3],
     [2, 1],
     [0, 1]],

    [[0, 1],
     [5, 4],
     [0, 0]]
]).float()

We would like to produce another tensor where the representation of word $t \in \{1, \ldots, \text{doc\_len}\}$ is the sum of the representations of words $1, \ldots, t$.

In [2]:
target = torch.tensor([
    [[1, 3],
     [3, 4],
     [3, 5]],

    [[0, 1],
     [5, 5],
     [5, 5]]
]).float()


Here is the inefficient way.

In [None]:
x_bow = torch.zeros_like(x)
batch_size, doc_len, emb_dim = x.size()

for b in range(batch_size):
    for t in range(doc_len):
        x_bow[b, t] = x[b, :t + 1].sum(dim=0)
x_bow.allclose(target)

True

And now for the efficient way.

In [None]:
mask = torch.tril(torch.ones(3,3))
print(mask)
print(x)
mask @ x

tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
tensor([[[1., 3.],
         [2., 1.],
         [0., 1.]],

        [[0., 1.],
         [5., 4.],
         [0., 0.]]])


tensor([[[1., 3.],
         [3., 4.],
         [3., 5.]],

        [[0., 1.],
         [5., 5.],
         [5., 5.]]])

We are multiplying mask of size (3, 3) with x of size (2, 3, 2). PyTorch broadcasts the mask across the batch dimension, i.e. it multiplies mask with each (3, 2) matrix in x.

Now let's do averaging instead of summing.

In [None]:
mask = torch.tril(torch.ones(3, 3))
mask[mask == 0] = float("-inf")
mask = torch.softmax(mask, dim=1)
print(mask)

mask @ x

tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])


tensor([[[1.0000, 3.0000],
         [1.5000, 2.0000],
         [1.0000, 1.6667]],

        [[0.0000, 1.0000],
         [2.5000, 2.5000],
         [1.6667, 1.6667]]])

At this point, when computing the representation of the current token, all previous tokens are given the same weight.
This is not great. Consider "I had a pizza, it was very tasty". When representing "it", (the representation of) "pizza" should have a higher weight than that of "had", for instance.

For a character-level model, the representation of, e.g., a vowel should depend on specific previous consonants.

We want such weights to be data dependent. This is the problem that self-attention solves.

In [None]:
batch_size, doc_len, emb_dim = x.size()
x, x.size()

(tensor([[[1., 3.],
          [2., 1.],
          [0., 1.]],
 
         [[0., 1.],
          [5., 4.],
          [0., 0.]]]),
 torch.Size([2, 3, 2]))

In [None]:
B, T, C = x.size()  # batch size, doc_len, emb_dim
head_size = 16  # the dimension of the key, query, and value vectors

key_layer = torch.nn.Linear(C, head_size, bias=False)
query_layer = torch.nn.Linear(C, head_size, bias=False)
value_layer = torch.nn.Linear(C, head_size, bias=False)

query = query_layer(x)  # (B, T, head_size)
key = key_layer(x)      # (B, T, head_size)
value = value_layer(x)  # (B, T, head_size)

# (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
wei = query @ key.transpose(2, 1)

# causal mask: position t may only attend to positions <= t
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float("-inf"))
wei = torch.softmax(wei, dim=-1)  # each row sums to 1

# (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
out = wei @ value

out.size()

torch.Size([2, 3, 16])

## Other self attention layers

* Self-attention has no notion of space (position). This is why we need positional encoding: the meaning of a token might differ with its position, e.g. verb particles in German are moved to the end of the sentence. Compare with convolution, where a notion of space does exist.
* Above we implemented a _decoder_ / _masked_ self-attention block: when representing the current token, we do NOT consider future tokens. By contrast, in an _encoder_ block this restriction is lifted by removing the masking, i.e. removing `wei = wei.masked_fill(tril == 0, float("-inf"))`. The decoder is appropriate for language modelling; the encoder for e.g. classification tasks, such as sentiment analysis.
* In self-attention, the keys, queries, and values are computed from the same source. In _cross-attention_, also referred to as _encoder-decoder attention_ when using an encoder-decoder architecture, the queries are computed from the current source, but the keys and values come from another source from which we would like to pull information; for instance, from the (output of the) encoder blocks, which represent some context we'd like to condition on.
* _Scaled_ attention divides `wei` by `sqrt(head_size)`. Assume the entries of the queries and keys are independent unit Gaussians, i.e. zero mean and unit variance. Each entry of `wei` is a sum of `head_size` products, so without the division its variance is about `head_size`; with the division it stays about 1. Stated differently, without scaling the values in `wei` are simply larger in magnitude. As a result, the softmax becomes sharper: it moves away from a uniform distribution and converges towards a one-hot vector at the maximum.
* _Multi-head_ attention: apply multiple attentions in parallel (using different query, key, and value matrices) and concatenate their results. Because the heads are initialised differently, they end up in different local minima and capture different interactions.

In [None]:
Q = torch.randn(B, T, head_size)
K = torch.randn(B, T, head_size)
wei_1 = Q @ K.transpose(1, 2)
wei_2 = Q @ K.transpose(1, 2) / head_size**0.5

print(Q.var())
print(K.var())
print(wei_1.var())
print(wei_2.var())

tensor(0.8659)
tensor(0.7215)
tensor(8.2634)
tensor(0.5165)
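
To see what the larger variance does to the softmax, here is a small sketch in plain Python. The four scores are made-up numbers, and `head_size = 64` is chosen to match the paper's key dimension rather than the 16 used in the cell above.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

head_size = 64
scaled = [0.5, 1.0, -0.5, 2.0]                          # hypothetical scores after scaling
unscaled = [s * math.sqrt(head_size) for s in scaled]   # the same scores before dividing

probs_scaled = softmax(scaled)
probs_unscaled = softmax(unscaled)

print([round(p, 3) for p in probs_scaled])
print([round(p, 3) for p in probs_unscaled])
```

With the scaled scores the largest weight is around 0.6; with the unscaled ones nearly all of the mass sits on a single position, which is the one-hot behaviour described above.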


## Multiple heads

We can stack all $z_{1:T}$ into a matrix $Z$ computed as $\text{softmax}\left( \frac{QK^\top}{\sqrt{64}} \right) V$, where softmax is applied row-wise.<br/>
<img src="img/8_transformers/self-attention-matrix-calculation-2.png" width="400">

We can perform the same operations $H$ times, using $H$ sets of query, key, and value matrices, resulting in different outputs $z_{t,h}$. Each corresponds to what is referred to as a *head*. Intuitively, we want each $z_{t,h}$ to capture different interactions of word $t$ with its contextual words.
To this end, we initialise the query, key, and value matrices **to different values** across heads. Finally, we concatenate all $z_{t, h}$ into a large vector of dimension $H|z_{t,h}|$ and map it to a vector $z_t$ using a linear layer, i.e. multiplying by a parameter matrix $W^O$.
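
A minimal sketch of the concatenate-then-project step, in plain Python with randomly initialised matrices. All sizes are made up for illustration, and `W_O` is the name used here for the final linear layer's parameter matrix.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def head(X, Wq, Wk, Wv):
    """One attention head: softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_t)]
    return matmul(softmax_rows(scores), V)

rng = random.Random(0)
T, embed_dim, head_dim, H = 2, 4, 3, 2   # hypothetical sizes
X = rand_matrix(T, embed_dim, rng)

# Each head gets its own, differently initialised, query, key, and value matrices.
heads = [head(X, *(rand_matrix(embed_dim, head_dim, rng) for _ in range(3)))
         for _ in range(H)]

# Concatenate the H head outputs per token, then project back to embed_dim.
Z_concat = [sum((h[t] for h in heads), []) for t in range(T)]
W_O = rand_matrix(H * head_dim, embed_dim, rng)
Z = matmul(Z_concat, W_O)

print(len(Z_concat), len(Z_concat[0]))  # 2 6
print(len(Z), len(Z[0]))                # 2 4
```

The projection brings the output back to `embed_dim`, so the result can be fed to the next layer regardless of how many heads are used.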

# Decoder

Different from the encoder, the decoder:
* replaces self-attention with masked self-attention; and
* has an extra encoder-decoder attention layer, also referred to as a cross-attention layer.

## Masked self-attention

Say we are at decoding step $t$. When computing the scores $\sigma_{1:S}$ ($S$ denotes the maximum number of decoding steps), make sure $\sigma_{>t} = 0$. This is achieved by setting the scores on positions $>t$ to `-inf` before applying the softmax.

Intuitively, a self-attention layer inputs a sequence of word vectors, i.e. embeddings; for each embedding, call it the current embedding, the layer computes its contextual re-representation, where the context is all the other embeddings in the sequence.
In a masked self-attention layer, the context is all the embeddings that occur before the current embedding in the sequence (and the current embedding itself).

<img src="img/8_transformers/masked-self-attention-2.png" width="500">


## Encoder-decoder attention

The output of the top encoder is transformed into key and value vectors, respectively; assume they are stacked into key and value matrices $K_{\text{enc-dec}}$ and $V_{\text{enc-dec}}$, respectively.

These are used in the encoder-decoder attention layer in the decoder. This works like self-attention, except:
* The key and value matrices are $K_{\text{enc-dec}}$ and $V_{\text{enc-dec}}$; while the query matrix is computed from the output of previous decoder blocks.
* No causal mask is applied to the scores. The keys and values come from the encoder, whose input sequence is fully available, so at decoding step $t$ the query can attend to all $T$ encoder positions. Masking of future positions happens only in the decoder's masked self-attention layer.
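
A minimal sketch of the first point, in plain Python: the queries come from the decoder side, while the keys and values come from the encoder output. The matrices are made-up and assumed to be already projected; the encoder has $T = 3$ positions and the decoder $S = 2$, so the weight matrix has shape $S \times T$.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(Q_dec, K_enc, V_enc):
    """Each decoder query attends over the encoder positions."""
    d_k = len(K_enc[0])
    out, all_weights = [], []
    for q in Q_dec:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K_enc]
        weights = softmax(scores)
        all_weights.append(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V_enc))
                    for j in range(len(V_enc[0]))])
    return out, all_weights

# Hypothetical, already-projected matrices: 3 encoder positions, 2 decoder positions.
K_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V_enc = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
Q_dec = [[1.0, 0.0], [0.0, 1.0]]

Z, W = cross_attention(Q_dec, K_enc, V_enc)
print(len(W), len(W[0]))  # 2 3
```

The output `Z` has one row per decoder position, so the sequence lengths of the encoder and decoder do not need to match.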


## Output

At each decoding step, the top decoder produces a vector; this is the final contextual re-representation of the input token at that step (where the context are the current and previous tokens).

Next, using a linear layer followed by a softmax, we map this final vector to a vector of probabilities over the vocabulary.
The linear layer can be parametrised by the original embedding matrix. As such, before the softmax, the score (logit) at position $i$ is the dot product between the final vector and embedding $e_i$.

We then sample from this vector of probabilities to produce an output token at this step.
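
A minimal sketch of this output step with the embedding matrix reused as the linear layer, in plain Python. The matrix `E` and the final vector `h` are made-up numbers, and the seeded `random.Random` is only there to make the sampling repeatable.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical embedding matrix E (vocab_len = 3, embed_dim = 2) and final decoder vector h.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h = [0.5, 2.0]

# Score i is the dot product between h and embedding e_i.
logits = [sum(a * b for a, b in zip(h, e_i)) for e_i in E]
probs = softmax(logits)

# Sample the next token index from the resulting distribution.
rng = random.Random(0)
next_token = rng.choices(range(len(E)), weights=probs, k=1)[0]

print(logits)  # [0.5, 2.0, 2.5]
print([round(p, 3) for p in probs])
```

Taking the index of the largest probability instead of sampling would give greedy decoding.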

<img src="img/8_transformers/gpt2-output.png" width="700">

Decoding continues until:
* an end-of-sequence token is produced; or
* the maximum number of tokens has been generated, e.g. 1024.

# Helpful tips for optimising deep neural networks

## Skip-connections

Also known as residual connections. The idea originates in the paper "Deep Residual Learning for Image Recognition" by [He et al. (2015)](https://arxiv.org/pdf/1512.03385).

There is a residual pathway. We branch off, perform some computation (e.g. self-attention), and integrate the result back into the residual pathway via addition.

During backprop, addition passes the incoming gradient unchanged to both branches, so gradients flow from the loss to the input via the residual path. This helps optimisation: it prevents vanishing gradients and, importantly, every parameter is optimised directly in terms of the supervision signal.
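
The effect on gradients can be illustrated with a toy scalar example in plain Python. The `branch` function is a made-up stand-in for a layer with a very small slope, and the gradient is estimated numerically rather than by backprop.

```python
def branch(x):
    """Hypothetical branch computation with a very small slope."""
    return 0.001 * x

def block_without_skip(x):
    return branch(x)

def block_with_skip(x):
    return x + branch(x)   # residual pathway plus branch output

def stack(block, depth):
    """Compose `depth` copies of a block into one function."""
    def f(x):
        for _ in range(depth):
            x = block(x)
        return x
    return f

def numerical_grad(f, x, eps=1e-6):
    """Central finite-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

depth = 10
grad_plain = numerical_grad(stack(block_without_skip, depth), 1.0)
grad_skip = numerical_grad(stack(block_with_skip, depth), 1.0)

print(grad_plain)  # vanishingly small
print(grad_skip)   # close to 1
```

Without the skip, the gradient is the product of ten small slopes; with it, each block contributes a factor of one plus the small slope, so the signal reaches the input almost unchanged.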

## Layer Normalisation

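Batch normalisation: across the batch dimension, every individual neuron output is normalised to have 0 mean and unit standard deviation.

Layer normalisation: the same normalisation, but computed across the features (neuron outputs) of each individual example, so it does not depend on the other examples in the batch.

Pre-norm formulation: layer normalisation is applied before the transformations (self-attention and linear), as opposed to after, as in the original transformer paper.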
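
The difference between the two normalisations is only the axis over which the statistics are computed. The sketch below shows this in plain Python on made-up activations; it leaves out the learnable scale and shift parameters and the running statistics that batch normalisation keeps for inference.

```python
import math

def normalise(values, eps=1e-5):
    """Shift to zero mean and scale to (approximately) unit standard deviation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

# Hypothetical activations: 3 examples (rows) with 4 features (columns).
acts = [[1.0, 2.0, 3.0, 4.0],
        [2.0, 4.0, 6.0, 8.0],
        [0.0, 0.0, 1.0, 1.0]]

# Layer norm: normalise each row over its features.
layer_normed = [normalise(row) for row in acts]

# Batch norm (training-time statistics): normalise each column over the batch.
cols = [normalise(list(col)) for col in zip(*acts)]
batch_normed = [list(row) for row in zip(*cols)]

print([round(v, 3) for v in layer_normed[0]])
print([round(v, 3) for v in batch_normed[0]])
```

After layer normalisation every row has zero mean; after batch normalisation every column does.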


# Objectives

## Language modelling

Training data:
* Start with a corpus of text, e.g. "robots must protect humans" - in our example, this is the entire corpus.
* Generate training examples: (input=robots, target=must), (input=robots must, target=protect), (input=robots must protect, target=humans), (input=robots must protect humans, target=`</s>`).
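
Generating these examples amounts to pairing every prefix with the token that follows it. A minimal sketch in plain Python, using space tokenization and a hypothetical helper `make_examples`:

```python
def make_examples(tokens, eos="</s>"):
    """Turn one token sequence into (input prefix, next-token target) pairs."""
    targets = tokens[1:] + [eos]
    return [(tokens[:i + 1], target) for i, target in enumerate(targets)]

examples = make_examples("robots must protect humans".split())
for prefix, target in examples:
    print(" ".join(prefix), "->", target)
```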

In a maximum likelihood approach to parameter estimation (i.e. to training):
* At each training step, provide the input to the model; compare its output to the target to compute the loss; update the model parameters to minimise the loss.
* In more detail, the target is one-hot encoded; the model output is a vector of probabilities over the vocabulary.

The loss could be the sum of cross-entropies, one for each position.
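
With a one-hot target, the cross-entropy at one position reduces to the negative log-probability the model assigns to the target token. A small sketch in plain Python, with made-up model outputs over a 4-word vocabulary:

```python
import math

def cross_entropy(probs, target_index):
    """Cross-entropy against a one-hot target: -log p(target)."""
    return -math.log(probs[target_index])

# Hypothetical model outputs, one probability distribution per position.
outputs = [[0.7, 0.1, 0.1, 0.1],
           [0.25, 0.25, 0.25, 0.25],
           [0.1, 0.1, 0.1, 0.7]]
targets = [0, 2, 3]  # vocabulary index of the target token at each position

loss = sum(cross_entropy(p, t) for p, t in zip(outputs, targets))
print(round(loss, 4))  # 2.0996
```

The uniform prediction at the second position contributes the most to the loss, since it assigns only 0.25 to the target.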

<img src="img/8_transformers/output_target_probability_distributions.png" width="400">
<img src="img/8_transformers/output_trained_model_probability_distributions.png" width="400">

# Specific Models

Models:
* BERT: stack of encoders
* GPT-2: stack of decoders, byte-pair encoding
* GPT-3: 175B params in 96 transformer layers (about 1.8B params each), trained on 300B tokens, context window (maximum input length) of 2048 tokens.

A stack of encoders can be used for masked language modelling.

A decoder in a stack of decoders does not have the encoder-decoder attention layer, since there is no encoder to attend to.
A stack of decoders can be used for autoregressive language modelling: output one token at a time; and token generated at step $t$ becomes an input for generation at step $t+1$.
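
The autoregressive loop can be sketched in plain Python. The function `toy_next_token` is a made-up lookup table standing in for a trained decoder stack followed by sampling; the loop stops on the end-of-sequence token or after a maximum number of new tokens.

```python
def toy_next_token(tokens):
    """Hypothetical stand-in for a trained model: a fixed lookup on the last token."""
    table = {"<s>": "robots", "robots": "must", "must": "protect",
             "protect": "humans", "humans": "</s>"}
    return table[tokens[-1]]

def generate(next_token_fn, prompt, max_new_tokens=1024, eos="</s>"):
    """Generate one token at a time, feeding each output back in as input."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token_fn(tokens)   # the model sees everything generated so far
        tokens.append(token)            # the output at step t is an input at step t + 1
        if token == eos:
            break
    return tokens

generated = generate(toy_next_token, ["<s>"])
print(generated)  # ['<s>', 'robots', 'must', 'protect', 'humans', '</s>']
```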

<img src="img/8_transformers/gpt-2-autoregression-2.gif" width="600">

# Fine-tuning

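Fine-tuning examples can be written as a single sequence: the input, a task marker, and the target output. For instance, for translation:

salut ce mai faci [toEnglish] hello how are you

Or for generating code based on the input prompt:

[example] an input that says "search" [toCode] Class App extends React Component ... }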


# RLHF

Terminology:
* There is an agent interacting with the environment by taking an action. It uses a policy (e.g. LLM) to map a state to an action.
* In response, the environment returns a state (the state of the world as a result of the action) and a reward (to maximise).

1. Pre-train language model on a large corpus of text. Optionally, include text generated by humans; for instance, given a popular question, employ humans to write a high quality answer.
2. Start with a dataset of prompts, such as questions asked by ChatGPT users. Provide each prompt to one or more language models to generate multiple answers. Employ human annotators to rank the answers, giving each answer a score. The result is a dataset of triples (prompt, answer, score).
3. Train a reward model on the dataset above that, given a prompt and an answer, predicts the score.

Pipeline:
* Terminology:
  * policy: an LLM;
  * state: model inputs, i.e. natural language;
  * action: model output, i.e. natural language.
* policy(state, e.g. question) = action e.g. answer.
* reward model(state, action) = reward.
* The policy can overfit the reward model, i.e. its params can be adapted so as to always maximise the reward, while the coherence of the generated text suffers. To prevent this, we limit the distance (KL divergence) between the vocabulary distributions generated by the policy and those of the initial language model (before RLHF).
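
One common way to apply such a constraint is as a penalty subtracted from the reward rather than as a hard limit; that choice, the coefficient `beta`, and all the numbers below are assumptions made for this sketch in plain Python.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def penalised_reward(reward, policy_probs, reference_probs, beta=0.1):
    """Reward-model score minus a KL penalty; beta is an assumed hyperparameter."""
    return reward - beta * kl_divergence(policy_probs, reference_probs)

reference = [0.5, 0.3, 0.2]     # hypothetical next-token distribution before RLHF
close = [0.45, 0.35, 0.2]       # policy that stayed near the reference
drifted = [0.98, 0.01, 0.01]    # policy that collapsed onto one token

r_close = penalised_reward(1.0, close, reference)
r_drifted = penalised_reward(1.0, drifted, reference)
print(round(r_close, 4), round(r_drifted, 4))
```

For the same reward-model score, the policy that drifted far from the initial language model ends up with the lower penalised reward.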