# Introduction to transformers

The transformer {cite}`vaswani2017attention` is a deep learning architecture that has powered many recent advances across a range of applications, including text modelling, image modelling {cite}`dosovitskiy2021image`, and many others.
This is an overview of the transformer architecture, including a self-contained mathematical description of the architectural details, and a concise implementation.
This exposition is based on the excellent introductory paper on transformers by Rich Turner {cite}`turner2023introduction`.

## Modelling with tokens
__One architecture, many applications.__
The purpose of the transformer architecture was, originally, to model sequence data such as text.
The transformer achieves this by first converting individual words, or characters, into one-dimensional arrays called _tokens_, and then operating on these tokens with a neural network.
This approach extends far beyond word modelling.
For example, the transformer can be applied to tasks as diverse as modelling of images and video, proteins, or weather.
In all these applications, the data are first converted into sets of tokens.
After this step, the transformer architecture can be applied in roughly the same way, irrespective of the original representation of the data.
This versatility, together with its strong empirical performance, is one of the main appealing features of the transformer.

__Inputs as tokens.__
For the moment, we assume that the input data have already been converted into tokens, and defer the details of this tokenisation to later.
Specifically, we assume that each data example, e.g. a sentence or an image, has been converted into a set of tokens $\{x_n\}_{n=1}^N,$ where each $x_n$ is a $D$-dimensional array $x_n \in \mathbb{R}^D.$
We can collect these tokens into a single $D \times N$ array $X^{(0)} \in \mathbb{R}^{D \times N},$ forming a single data input for the transformer.
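
As a minimal sketch of this layout, the snippet below stacks $N$ tokens as the columns of a $D \times N$ array using NumPy. The sizes and the random token values are placeholders chosen purely for illustration, since tokenisation itself is deferred to later.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
D, N = 4, 3  # token dimension, number of tokens

rng = np.random.default_rng(0)
# Placeholder tokens x_1, ..., x_N, each a D-dimensional array.
tokens = [rng.normal(size=D) for _ in range(N)]

# Stack the tokens as columns, giving X0 with shape (D, N).
X0 = np.stack(tokens, axis=1)
print(X0.shape)
```

With this convention, column `X0[:, n]` holds the token $x_{n+1}$, because NumPy indexes from zero.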

## Transformer block
Much like in other deep architectures, such as residual networks, the transformer architecture maintains a representation of the input data, and progressively refines it using a sequence of so-called transformer blocks.
In particular, given an initial representation $X^{(0)},$ the architecture comprises $M$ transformer blocks, i.e. for each $m = 1, \dots, M,$ it computes

$$X^{(m)} = \texttt{TransformerBlock}(X^{(m-1)}).$$
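
The recursion above amounts to a simple loop over blocks. The sketch below illustrates only this control flow: `apply_transformer` is a hypothetical helper name, and the blocks are trivial stand-ins rather than real transformer blocks.

```python
import numpy as np

def apply_transformer(X0, blocks):
    """Apply the blocks in order, so that X^(m) = block_m(X^(m-1))."""
    X = X0
    for block in blocks:
        X = block(X)  # each block maps a (D, N) array to a (D, N) array
    return X

# Stand-in blocks for illustration only: each one just adds 1 to its input.
# In a real transformer, each block has its own learnable parameters.
M = 3
blocks = [lambda X: X + 1.0 for _ in range(M)]

X0 = np.zeros((4, 5))  # D = 4, N = 5
XM = apply_transformer(X0, blocks)
print(XM.shape)
```

The representation keeps the same $D \times N$ shape throughout, which is what allows the blocks to be stacked.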

Each of these blocks consists of two main operations, namely a self-attention operation and a pointwise multi-layer perceptron (MLP) operation.
The self-attention operation has the role of combining the representations of different tokens in a sequence, in order to model dependencies between the tokens.
It is applied collectively to all tokens within the transformer block.
The MLP operation has the role of refining the representation of each token.
It is applied separately to each token and is shared across all tokens within a transformer block.
Let's look at these two operations in detail.

### Self-attention

__Attention:__
The role of the first operation in a transformer block is to combine the representations of different tokens in order to model dependencies between them.
Given a $D \times N$ input array $X^{(m-1)} = (x_1^{(m-1)}, \dots, x_N^{(m-1)}),$ the output of the self-attention layer is another $D \times N$ array $Y^{(m)} = (y_1^{(m)}, \dots, y_N^{(m)}),$ where each column is simply a weighted average of the input features, that is

$$y^{(m)}_n = \sum_{n' = 1}^N x^{(m - 1)}_{n'} A_{n', n}^{(m)}.$$

The weighting array $A^{(m)}$ is of size $N \times N$ and has the property that its columns normalise to one, that is $\sum_{n'=1}^N A_{n', n}^{(m)} = 1.$
It is referred to as the attention matrix because it weights the extent to which the feature $y^{(m)}_n$ should depend on each $x^{(m-1)}_{n'},$ i.e. it determines the extent to which each $y^{(m)}_n$ should attend to each $x^{(m-1)}_{n'}.$
For compactness, we can collect these equations into a single linear operation, that is

$$Y^{(m)} = X^{(m - 1)} A^{(m)}.$$
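
To make the weighted average concrete, here is a small worked example with a hand-picked attention matrix. The numbers are arbitrary and chosen only so that the result is easy to check by hand.

```python
import numpy as np

# Three tokens of dimension two, stored as the columns of X (D = 2, N = 3).
X = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 0.0]])

# A hand-picked attention matrix whose columns each sum to one.
A = np.array([[1.0, 0.5, 0.0],
              [0.0, 0.5, 0.0],
              [0.0, 0.0, 1.0]])

Y = X @ A  # column n of Y is a weighted average of the columns of X

# The second output column averages the first two tokens equally:
# 0.5 * [1, 0] + 0.5 * [2, 1] = [1.5, 0.5]
print(Y[:, 1])
```

The first and third output columns simply copy the corresponding input tokens, because those columns of `A` put all their weight on a single token.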

Clearly, the value of the attention weights is central in this step.
In fact, a host of different operations, such as convolution layers in convolutional neural networks (CNNs), can be written as weighted sums of input arrays.
Let's next look at the specifics of how the attention weights are themselves computed in the transformer.

__Self-attention:__
One of the innovations within the transformer architecture is that the attention weights are adaptive, meaning that they are computed based on the input itself.
This is in contrast with other deep learning architectures such as CNNs, where weighted sums are also used, but the weights are fixed and shared across all inputs.
One straightforward way to compute attention weights would be to compare the tokens using a simple similarity metric, such as an inner product.
For example, given two tokens $x_i$ and $x_j,$ we can compute the dot product between them, which acts as a similarity metric, exponentiate the result to make it positive, and then normalise each column so that the resulting weights lie between $0$ and $1$ and sum to one, that is

$$A^{(m)}_{ij} = \frac{\exp(x_i^\top x_j)}{\sum_{n' = 1}^N \exp(x_{n'}^\top x_j)}.$$
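
A minimal NumPy sketch of these dot-product attention weights follows, normalising each column so that it sums to one. The helper names `column_softmax` and `dot_product_attention_weights` are illustrative, and subtracting the column maximum before exponentiating is a standard numerical-stability trick that leaves the weights unchanged.

```python
import numpy as np

def column_softmax(S):
    """Exponentiate S and normalise each column to sum to one."""
    S = S - S.max(axis=0, keepdims=True)  # stability only; result is unchanged
    E = np.exp(S)
    return E / E.sum(axis=0, keepdims=True)

def dot_product_attention_weights(X):
    """Map tokens X of shape (D, N) to attention weights A of shape (N, N).

    Entry A[i, j] is proportional to exp(x_i . x_j), normalised over i.
    """
    scores = X.T @ X  # scores[i, j] = x_i . x_j
    return column_softmax(scores)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))  # D = 4, N = 5, placeholder tokens
A = dot_product_attention_weights(X)
Y = X @ A                    # attention output, shape (D, N)
print(A.sum(axis=0))         # each column sums to one
```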

An alternative, slightly more flexible approach is to transform each token in the sequence by a linear map, say by applying a matrix $U \in \mathbb{R}^{K \times D}$ to each token first, that is

$$A^{(m)}_{ij} = \frac{\exp(x_i^\top U^\top U x_j)}{\sum_{n' = 1}^N \exp(x_{n'}^\top U^\top U x_j)}.$$

This allows the tokens to be compared in a different space.
For example, if $K < D$ this approach automatically projects out some of the components of the tokens, comparing them in a lower-dimensional space.
However, this approach still has an important limitation, namely symmetry.
Specifically, the similarity scores $x_i^\top U^\top U x_j$ are symmetric in $i$ and $j,$ which means that, before normalisation, any two tokens attend to each other with equal strength.
This might be undesirable because, for example, we could imagine that one token might be important for informing the representation of another token, but not the other way around.
To address this, we can apply different linear maps, say $U_k$ and $U_q,$ to each of the two tokens being compared, and instead compute

$$A^{(m)}_{ij} = \frac{\exp(x_i^\top U_k^\top U_q x_j)}{\sum_{n' = 1}^N \exp(x_{n'}^\top U_k^\top U_q x_j)}.$$

In this way, the similarity between two tokens is no longer necessarily symmetric, so tokens no longer have to attend to each other with the same strength, and the overall architecture is more expressive.
This type of weighting is known as self-attention, since each token in the sequence attends to all tokens of the same sequence.
It is also possible to generalise this to attention between different sequences, which might be useful for applications such as joint modelling of text and images.
This generalisation is called cross-attention, and we defer its discussion to later.
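
The difference between a single shared map $U$ and separate maps $U_k$ and $U_q$ can be checked numerically. The sketch below uses random placeholder matrices, and the helper names are illustrative. It computes the similarity scores $x_i^\top U_k^\top U_q x_j$ and shows that they are symmetric when the two maps coincide, but generally not otherwise.

```python
import numpy as np

def column_softmax(S):
    """Exponentiate S and normalise each column to sum to one."""
    S = S - S.max(axis=0, keepdims=True)  # stability only; result is unchanged
    E = np.exp(S)
    return E / E.sum(axis=0, keepdims=True)

def attention_scores(X, U_k, U_q):
    """Return scores with scores[i, j] = x_i^T U_k^T U_q x_j."""
    return (U_k @ X).T @ (U_q @ X)

rng = np.random.default_rng(0)
D, K, N = 6, 3, 4                 # hypothetical sizes, with K < D
X = rng.normal(size=(D, N))       # placeholder tokens
U = rng.normal(size=(K, D))       # single shared map
U_k = rng.normal(size=(K, D))     # separate maps
U_q = rng.normal(size=(K, D))

S_shared = attention_scores(X, U, U)
S_separate = attention_scores(X, U_k, U_q)

print(np.allclose(S_shared, S_shared.T))      # shared map: symmetric scores
print(np.allclose(S_separate, S_separate.T))  # separate maps: generally not

A = column_softmax(S_separate)  # self-attention weights, columns sum to one
```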

__Multi-head self-attention.__
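In order to increase the capacity of the self-attention layer, the transformer applies several self-attention operations, called heads, in parallel, each with its own parameters, and combines their outputs.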

Now, let's look at the other central operation in the transformer architecture, namely the MLP.

### Multi-layer perceptron
The self-attention layer has the role of aggregating information across tokens in a sequence to model joint dependencies.
In order to refine the representations themselves, an MLP is applied to each token in isolation, that is

$$x^{(m)}_n = \texttt{MLP}(y^{(m)}_n).$$
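
As a hedged sketch of this token-wise operation, the snippet below applies the same small MLP to every column of a $D \times N$ array. The single hidden layer, the ReLU nonlinearity and the hidden width `H` are assumptions made for illustration, since the text does not fix the form of the MLP.

```python
import numpy as np

def mlp(Y, W1, b1, W2, b2):
    """Apply the same MLP to every token (column) of Y, shape (D, N).

    Assumed form: one hidden layer with a ReLU nonlinearity.
    """
    hidden = np.maximum(0.0, W1 @ Y + b1[:, None])  # shape (H, N)
    return W2 @ hidden + b2[:, None]                # shape (D, N)

rng = np.random.default_rng(0)
D, H, N = 4, 8, 5  # hypothetical sizes
W1, b1 = rng.normal(size=(H, D)), np.zeros(H)
W2, b2 = rng.normal(size=(D, H)), np.zeros(D)

Y = rng.normal(size=(D, N))      # placeholder self-attention output
X_next = mlp(Y, W1, b1, W2, b2)  # same weights are used for every token
print(X_next.shape)
```

Because the weights are shared and no information is exchanged between columns, applying `mlp` to the whole array gives the same result as applying it to each token separately.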


### Skip connections and normalisation

### Positional embeddings

### Putting it together

In summary, we can collect the operations of a single transformer block as

$$\begin{align}
\bar{X}^{(m-1)} &= \texttt{LayerNorm}\left(X^{(m-1)}\right) \\
Y^{(m)} &= X^{(m-1)} + \texttt{MHSA}\left(\bar{X}^{(m-1)}\right) \\
\bar{Y}^{(m)} &= \texttt{LayerNorm}\left(Y^{(m)}\right) \\
X^{(m)} &= Y^{(m)} + \texttt{MLP}(\bar{Y}^{(m)})
\end{align}$$
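
The block can be sketched end to end in NumPy under some loudly stated assumptions. Single-head self-attention stands in for `MHSA`. The layer normalisation standardises each token over its $D$ features and has no learnable scale or shift. The MLP has one hidden layer with a ReLU. Both residual connections add the unnormalised input of the corresponding sub-layer, which is the usual pre-normalisation convention. All parameter names and sizes are hypothetical.

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    """Standardise each token (column) over its D features (no learnable parameters)."""
    mean = X.mean(axis=0, keepdims=True)
    var = X.var(axis=0, keepdims=True)
    return (X - mean) / np.sqrt(var + eps)

def self_attention(X, U_k, U_q):
    """Single-head self-attention, used here as a stand-in for MHSA."""
    S = (U_k @ X).T @ (U_q @ X)           # S[i, j] = x_i^T U_k^T U_q x_j
    S = S - S.max(axis=0, keepdims=True)  # stability only
    A = np.exp(S)
    A = A / A.sum(axis=0, keepdims=True)  # columns sum to one
    return X @ A

def mlp(Y, W1, b1, W2, b2):
    """Token-wise MLP with one hidden ReLU layer (assumed form)."""
    return W2 @ np.maximum(0.0, W1 @ Y + b1[:, None]) + b2[:, None]

def transformer_block(X, p):
    X_bar = layer_norm(X)
    Y = X + self_attention(X_bar, p["U_k"], p["U_q"])  # residual from X
    Y_bar = layer_norm(Y)
    return Y + mlp(Y_bar, p["W1"], p["b1"], p["W2"], p["b2"])

rng = np.random.default_rng(0)
D, K, H, N = 4, 3, 8, 5  # hypothetical sizes
params = {
    "U_k": rng.normal(size=(K, D)), "U_q": rng.normal(size=(K, D)),
    "W1": rng.normal(size=(H, D)), "b1": np.zeros(H),
    "W2": rng.normal(size=(D, H)), "b2": np.zeros(D),
}
X_in = rng.normal(size=(D, N))
X_out = transformer_block(X_in, params)
print(X_out.shape)
```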

## Implementation

## Complementary view: keys, queries and values

## Extensions

## Relations to other architectures

## References

```{bibliography}
:filter: docname in docnames
```