# Transformers

This task is split into 2 parts:

 - Some high level questions
 - Implement a decoder-only transformer from scratch

## Questions

### 1. What is different architecturally from the Transformer, vs a normal RNN, like an LSTM? (Specifically, how are recurrence and time managed?)

 - __RNN:__ At each time step the model combines one input vector (e.g. a word) with the previous state vector, to get an output. This can be run recurrently (lots of times) so that it takes e.g. one word from a sentence at a time.
 - __LSTM:__ A problem with RNNs is that they suffer from decay of information, as the one state vector must compress information from all previous runs (each time with a weighting between the input vector and the previous state vector). LSTMs help solve this by having gates that let the model be more selective about which information enters the state vector, and which information it retrieves on each run. This lets it store information over longer time spans.
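 - __Transformer:__ Can also be run repeatedly (autoregressively) to generate a new series of tokens, but each time it's run it typically takes all previous tokens (e.g. words) as input. It then uses attention (each token's query is compared against the keys of all tokens, and the resulting weights are used to mix their values) to select the most relevant information. Attention by itself ignores token order, so position information is usually added to the inputs explicitly (positional encodings).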
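
The contrast above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the RNN uses scalar weights (`w_x`, `w_h`, both hypothetical) instead of weight matrices, and the attention step uses the raw token vectors as queries, keys and values rather than learned projections.

```python
import math

def rnn_step(x, h, w_x, w_h):
    """One simplified RNN step: mix the current input with the previous state."""
    return [math.tanh(w_x * xi + w_h * hi) for xi, hi in zip(x, h)]

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Weight every value by how well its key matches the query."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# RNN: tokens are consumed one at a time; only `h` carries information forward.
h = [0.0, 0.0]
for x in tokens:
    h = rnn_step(x, h, w_x=0.5, w_h=0.5)

# Attention: the last token looks at every token directly in a single step.
context = attend(tokens[-1], tokens, tokens)
```

In the RNN loop, anything the model needs from the first token has to survive two further state updates, whereas `attend` reads the first token directly.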

### 2. Attention is defined as, $\text{Attention}(Q,K,V) = \text{softmax}(QK^T/\sqrt{d_k})V$. What are the dimensions for $Q$, $K$, and $V$? Why do we use this setup? What other combinations could we do with $(Q,K)$ that also output weights?

There are 2 key dimensions: $d_\text{model}$ (the size of each layer) and $d_\text{head}$ (the size of each head's output). Usually $d_\text{model} = d_\text{head} \times \text{number of heads}$, because the attention sub-layer then outputs the same number of activations as it receives ($d_\text{model}$). This keeps things simple, although it doesn't have to be this way!
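
As a quick sanity check of that relationship, here is a small sketch with assumed example sizes (the values of `d_model` and `n_heads` are illustrative, not taken from this task):

```python
d_model = 512   # assumed example value
n_heads = 8     # assumed example value

# The usual convention only works if the heads divide the layer evenly.
assert d_model % n_heads == 0
d_head = d_model // n_heads

# Each head emits d_head numbers per token (zeros here as placeholders);
# concatenating all heads gives back a vector of size d_model.
head_outputs = [[0.0] * d_head for _ in range(n_heads)]
concatenated = [value for head in head_outputs for value in head]
```

Here `d_head` works out to 64, and `concatenated` has the same length as the layer input.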

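$$
\begin{aligned}
d_\text{vocab} &= \text{Number of tokens in the input (e.g. words)} \\
d_\text{head} &= \text{Dimension of each head} \\
d_\text{model} &= \text{Dimension of each model layer}
\end{aligned}
$$

Note that $d_\text{vocab}$ here counts the tokens in the input sequence, not the size of the tokenizer's vocabulary.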

$$
\begin{aligned}
Q &\in \mathbb{R}^{d_\text{vocab} \times d_\text{head}} \\
K &\in \mathbb{R}^{d_\text{vocab} \times d_\text{head}} \\
V &\in \mathbb{R}^{d_\text{vocab} \times d_\text{head}}
\end{aligned}
$$
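
To make these shapes concrete, the sketch below builds $Q$, $K$ and $V$ for one head by multiplying the layer input by three projection matrices. The names `X`, `W_Q`, `W_K`, `W_V` and the toy sizes are assumptions for illustration, and the weights are placeholders rather than learned values. `n_tokens` plays the role of $d_\text{vocab}$ above.

```python
def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def shape(m):
    """Return (rows, columns) of a nested-list matrix."""
    return (len(m), len(m[0]))

n_tokens, d_model, d_head = 3, 4, 2   # assumed toy sizes

# One d_model-sized activation vector per token (placeholder values).
X = [[0.1 * (i + j) for j in range(d_model)] for i in range(n_tokens)]

# Hypothetical projection matrices, each of shape d_model x d_head.
W_Q = [[0.5] * d_head for _ in range(d_model)]
W_K = [[0.25] * d_head for _ in range(d_model)]
W_V = [[1.0] * d_head for _ in range(d_model)]

Q = matmul(X, W_Q)
K = matmul(X, W_K)
V = matmul(X, W_V)
```

Each of `Q`, `K` and `V` ends up with one row per token and `d_head` columns.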

The steps are as follows:

 1. We start with $QK^T \in \mathbb{R}^{d_\text{vocab} \times d_\text{vocab}}$.
 
The underlying calculation is that, for each token, we take the dot product of that token's query with the key of every token (including itself). This gives a set of scores that determine how much importance to place on each token's value. For a single token, the result is a vector of size $1 \times d_\text{vocab}$.
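
A minimal sketch of this score calculation, using small hand-picked `Q` and `K` matrices (assumed values, one row per token). It first computes the scores for one query row, then the full $QK^T$ matrix:

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Assumed toy inputs: 3 tokens, d_head = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Scores for the first query only: one number per key, so 1 x n_tokens.
row0 = [dot(Q[0], k) for k in K]

# The same thing for every query: entry (i, j) is dot(Q[i], K[j]), i.e. QK^T.
scores = [[dot(q, k) for k in K] for q in Q]
```

The first row of `scores` is identical to `row0`, and the full matrix has one row and one column per token.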

Rather than doing one at a time, however, we do this for all of them 