# Introduction to transformers

The transformer architecture {cite}`vaswani2017attention` is a deep learning architecture that has powered many recent advances across a range of applications, including text modelling, image modelling {cite}`dosovitskiy2021image`, and many others.
This is an overview of the transformer architecture, including a self-contained mathematical description of the architectural details, and a concise implementation.
This exposition is based on an excellent introductory paper on transformers by Rich Turner {cite}`turner2023introduction`.

## Sequence modelling

## Transformer block

Much like in other deep architectures, such as residual networks, the transformer architecture maintains a representation of the input data, and progressively refines it using a sequence of so-called transformer blocks.
In particular, given an initial representation $X^{(0)},$ the architecture comprises $M$ transformer blocks, applied in sequence for $m = 1, \dots, M$:

$$X^{(m)} = \texttt{TransformerBlock}(X^{(m-1)}).$$

Each of these blocks consists of two main operations, namely a self-attention operation and a pointwise multi-layer perceptron (MLP) operation.
The self-attention operation has the role of combining the representations of different tokens in a sequence, in order to model dependencies between the tokens.
It is applied collectively to all tokens within the transformer block.
The MLP operation has the role of refining the representation of each token.
It is applied separately to each token and is shared across all tokens within a transformer block.


### Self-attention

__Attention:__
The role of the first operation in a transformer block is to combine the representations of different tokens in order to model dependencies between the tokens.
Given a $D \times N$ input array $X^{(m-1)} = (x^{(m-1)}_1, \dots, x^{(m-1)}_N),$ the output of the self-attention layer is another $D \times N$ array $Y^{(m)} = (y^{(m)}_1, \dots, y^{(m)}_N),$ where each column is a weighted average of the input features, that is

$$y^{(m)}_n = \sum_{n' = 1}^N x^{(m - 1)}_{n'} A_{n', n}^{(m)}.$$

The weighting array $A^{(m)}$ is of size $N \times N$ and has the property that each of its columns sums to one, that is $\sum_{n'=1}^N A_{n', n}^{(m)} = 1.$
It is referred to as the attention matrix because, intuitively speaking, it weights the extent to which the feature $y^{(m)}_n$ should depend on each $x^{(m-1)}_{n'},$ i.e. it determines the extent to which each $y^{(m)}_n$ should attend to each $x^{(m-1)}_{n'}.$
For compactness, we can collect these equations into a single matrix multiplication, that is

$$Y^{(m)} = X^{(m - 1)} A^{(m)}.$$
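
As a quick check of the notation, the following NumPy sketch builds an arbitrary attention matrix whose columns sum to one and confirms that the matrix form agrees with the per-token weighted sum.
The sizes `D` and `N`, the random inputs, and the way `A` is generated are illustrative assumptions only; in a transformer the attention matrix is computed from the input rather than drawn at random.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 4, 3  # feature dimension and number of tokens (arbitrary choices)

# Columns of X are the token features x_1, ..., x_N.
X = rng.normal(size=(D, N))

# Any non-negative N x N matrix with columns summing to one is a valid
# attention matrix; here we simply normalise random positive entries.
A = rng.uniform(size=(N, N))
A = A / A.sum(axis=0, keepdims=True)

# Matrix form: Y = X A.
Y = X @ A

# Per-token form: y_n = sum over n' of x_{n'} * A[n', n].
Y_loop = np.stack(
    [sum(X[:, k] * A[k, n] for k in range(N)) for n in range(N)],
    axis=1,
)
```

Because each column of `A` is non-negative and sums to one, each column of `Y` is a convex combination of the columns of `X`.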

__Self-attention:__
Now we turn to how the attention weights are themselves computed.
One of the innovations within the transformer architecture is that the attention weights are adaptive, meaning that they are computed based on the input itself.
This is in contrast with other deep learning architectures such as convolutional neural networks (CNNs) where weighted sums are also used, but these weights are fixed and shared across all inputs.
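
To make the idea of input-dependent weights concrete, here is a minimal sketch of one possible choice: similarities between tokens are taken to be dot products of their features, and a softmax over each column turns them into weights that sum to one.
This particular construction is an assumption made for illustration and is not specified in the text above; the helper names `softmax_columns` and `simple_self_attention` are hypothetical, and no learned parameters are involved.

```python
import numpy as np

def softmax_columns(S):
    """Softmax over axis 0, so that every column of the result sums to one."""
    S = S - S.max(axis=0, keepdims=True)  # subtract column max for stability
    E = np.exp(S)
    return E / E.sum(axis=0, keepdims=True)

def simple_self_attention(X):
    """Illustrative input-dependent attention for a D x N array X."""
    S = X.T @ X             # S[n', n] is the dot product of x_{n'} and x_n
    A = softmax_columns(S)  # N x N attention matrix, columns sum to one
    return X @ A, A

rng = np.random.default_rng(0)
X1 = rng.normal(size=(4, 3))
X2 = rng.normal(size=(4, 3))

# Different inputs give different attention matrices, unlike fixed CNN filters.
Y1, A1 = simple_self_attention(X1)
Y2, A2 = simple_self_attention(X2)
```

The point of the sketch is only that `A1` and `A2` differ because they are functions of `X1` and `X2` respectively.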


### Multi-layer perceptron

### Skip connections and normalisation

### Positional embeddings

### Putting it together

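In summary, we can collect the operations of a single transformer block, writing $\texttt{MHSA}$ for multi-head self-attention, as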

$$\begin{align}
\bar{X}^{(m-1)} &= \texttt{LayerNorm}\left(X^{(m-1)}\right) \\
Y^{(m)} &= X^{(m-1)} + \texttt{MHSA}\left(\bar{X}^{(m-1)}\right) \\
\bar{Y}^{(m)} &= \texttt{LayerNorm}\left(Y^{(m)}\right) \\
X^{(m)} &= Y^{(m)} + \texttt{MLP}(\bar{Y}^{(m)})
\end{align}$$
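
The four steps can be sketched in NumPy as follows.
This is a minimal sketch under stated assumptions rather than a full implementation: `toy_mhsa` and `toy_mlp` are hypothetical stand-ins (a single parameter-free attention head and a one-hidden-layer ReLU network), `layer_norm` omits the learned scale and shift, and both skip connections add the unnormalised input of their step, mirroring the skip connection around the MLP.

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    """Normalise each token (column) of a D x N array across its D features."""
    mean = X.mean(axis=0, keepdims=True)
    var = X.var(axis=0, keepdims=True)
    return (X - mean) / np.sqrt(var + eps)

def toy_mhsa(X):
    """Hypothetical stand-in for MHSA: one head, no learned parameters."""
    S = X.T @ X
    S = S - S.max(axis=0, keepdims=True)  # numerical stability
    A = np.exp(S)
    A = A / A.sum(axis=0, keepdims=True)  # columns sum to one
    return X @ A

def toy_mlp(X, W1, W2):
    """Hypothetical stand-in for the MLP: same weights applied to every token."""
    return W2 @ np.maximum(W1 @ X, 0.0)

def transformer_block(X, W1, W2):
    X_bar = layer_norm(X)                # normalise the block input
    Y = X + toy_mhsa(X_bar)              # attention with skip connection
    Y_bar = layer_norm(Y)                # normalise again
    return Y + toy_mlp(Y_bar, W1, W2)    # MLP with skip connection

rng = np.random.default_rng(0)
D, N, H, M = 4, 3, 8, 2  # features, tokens, hidden width, number of blocks

# One (W1, W2) pair per block; small random values for illustration.
params = [
    (0.1 * rng.normal(size=(H, D)), 0.1 * rng.normal(size=(D, H)))
    for _ in range(M)
]

X0 = rng.normal(size=(D, N))
X_out = X0
for W1, W2 in params:
    X_out = transformer_block(X_out, W1, W2)
```

The loop at the end applies the $M$ blocks in sequence, and the representation keeps its $D \times N$ shape throughout.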

## Implementation

## Extensions

## References

```{bibliography}
:filter: docname in docnames
```