# Transformers

## Processing text data

Consider the following review of a restaurant:

```
"The restaurant refused to serve me a ham sandwich because it only cooks vegetarian food.
In the end, they just gave me two slices of bread.
Their ambiance was just as good as the food and service"
```

We'd like to process this text into a representation suitable for downstream tasks, such as:

- Classify if this review is positive or negative.
- Answer context based questions such as "Does this restaurant serve steak?"

Notice the following three observations: 

1. Depending on how we encode this text, its representation can become very large very quickly:
   - In words = 37
   - Embedding space = 37 x 1024 = 37888
2. This was a single review of that size; other reviews will have different lengths, so the input size is not fixed.
3. Syntax alone isn't enough to resolve some of the tasks at hand. We need context, and we need to identify the words that are important and should therefore be paid $attention$.
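
The sizes quoted in the first observation can be checked with a few lines of Python. This is only a sanity check: it counts words with a plain whitespace split, which is an assumption and not how a real tokenizer works.

```python
# Count the words in the review and the size of its embedded representation.
review = (
    "The restaurant refused to serve me a ham sandwich because it only cooks "
    "vegetarian food. In the end, they just gave me two slices of bread. "
    "Their ambiance was just as good as the food and service"
)

embedding_dim = 1024               # D, the embedding size quoted above
num_words = len(review.split())    # whitespace split: a crude word count
num_values = num_words * embedding_dim

print(num_words, num_values)       # 37 37888
```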
---

## Dot Product and Self Attention

A standard NN layer $f[\mathbf{x}]$ takes a $D \times 1$ input $\mathbf{x}$ and applies a linear transformation followed by an activation function such as ReLU:

$$f[\mathbf{x}] = \mathbf{ReLU}[\beta + \Omega \mathbf{x}]$$
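
As a minimal NumPy sketch of this layer (the size `D = 4`, the random parameters and the function names are illustrative assumptions, not part of the original notes):

```python
import numpy as np

def relu(z):
    """Element-wise ReLU: max(0, z)."""
    return np.maximum(z, 0.0)

def dense_layer(x, beta, omega):
    """f[x] = ReLU[beta + omega @ x] for a single D x 1 input."""
    return relu(beta + omega @ x)

rng = np.random.default_rng(0)
D = 4                                   # illustrative input size
x = rng.normal(size=(D, 1))
beta = rng.normal(size=(D, 1))          # bias
omega = rng.normal(size=(D, D))         # weights

y = dense_layer(x, beta, omega)
print(y.shape)  # (4, 1)
```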

A self-attention block $\mathbf{sa}$ takes $N$ inputs $\mathbf{x_1}, \dots , \mathbf{x_N}$, each of dimension $D \times 1$ and returns $N$ outputs, each of size $D \times 1$. 

Each input can represent a word or a word fragment (discussed later; the choice relates to information theory).

**Creating Self-Attention**

1. For each input $\mathbf{x_m}$, compute its $value$: 
   $$\mathbf{v_m} = \beta_v + \Omega_v\mathbf{x_m}$$
2. The $n^{th}$ output $\mathbf{sa}_n\left[\mathbf{x_1}, \dots, \mathbf{x_N}\right]$ is a weighted sum of all $N$ values:
   $$\mathbf{sa}_n = \sum_{i=1}^Na[\mathbf{x}_i, \mathbf{x}_n]\mathbf{v}_i \quad | \quad a[\mathbf{x}_i, \mathbf{x}_n] \in \mathbb{R}^+ \cup \{0\}, \ \sum_{i=1}^N a[\mathbf{x}_i, \mathbf{x}_n] = 1$$

**Note**

As in RNNs and LSTMs, the bias $\beta_v$ and the weights $\Omega_v$ are independent of the input: the same parameters are applied to every $\mathbf{x_m}$.

### Computing attention weights

To compute the attention, we apply two linear transformations to the inputs: 

$$\text{Queries: }\mathbf{q}_n = \beta_q + \Omega_q \mathbf{x}_n$$

$$\text{Keys: } \mathbf{k}_m = \beta_k + \Omega_k \mathbf{x}_m$$

We then compute the dot products between the **queries** and **keys** and pass them through the **Softmax** function, which defines our **attention**:

$$\begin{align}
a[\mathbf{x}_m, \mathbf{x}_n] &= \text{softmax}_m[\mathbf{k}_{\bullet} \cdot \mathbf{q}_n] \\ 
&= \frac{\exp[\mathbf{k}_{m} \cdot \mathbf{q}_n]}{\sum_{i=1}^N \exp[\mathbf{k}_{i} \cdot \mathbf{q}_n]}
\end{align}$$

The dot product measures the similarity between two vectors, so we are determining the relative similarity between the $n^{th}$ query and each of the keys.

Applying the softmax to these similarities turns them into non-negative weights that sum to one. Each weight can be read as how strongly the respective key matches the current query, relative to all the other keys.
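
The steps so far (values, queries, keys, softmax weights, weighted sum) can be sketched one vector at a time in NumPy. The sizes and the random parameters below are illustrative assumptions; in a real model the parameters are learned.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
N, D = 3, 4                              # illustrative: 3 inputs of dimension 4
xs = [rng.normal(size=D) for _ in range(N)]

# Shared parameters (the same for every input position).
beta_v, omega_v = rng.normal(size=D), rng.normal(size=(D, D))
beta_q, omega_q = rng.normal(size=D), rng.normal(size=(D, D))
beta_k, omega_k = rng.normal(size=D), rng.normal(size=(D, D))

values = [beta_v + omega_v @ x for x in xs]
queries = [beta_q + omega_q @ x for x in xs]
keys = [beta_k + omega_k @ x for x in xs]

outputs, attention = [], []
for n in range(N):
    # a[x_m, x_n] for m = 1..N: softmax over the keys for the n-th query.
    scores = np.array([keys[m] @ queries[n] for m in range(N)])
    a_n = softmax(scores)
    attention.append(a_n)
    # sa_n is the attention-weighted sum of all N values.
    outputs.append(sum(a_n[m] * values[m] for m in range(N)))

print([round(float(a.sum()), 6) for a in attention])  # each set of weights sums to 1
```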

---

### Dimension Check 

**Input dimension**

$ \mathbf{x_i} \in \mathbb{R}^{D_{in}}$

**Value Dimension**

You can freely choose the dimension of the values, $\mathbf{v_m} \in \mathbb{R}^{D_1}$. This means: 

$$\boxed{\mathbf{\Omega_v} \in \mathbb{R}^{D_1 \times D_{in}} \quad \text{and} \quad \mathbf{\beta_v} \in \mathbb{R}^{D_1}}$$

This makes the following well defined: $\mathbf{v_m} = \beta_v + \Omega_v\mathbf{x_m}$

**Self Attention**

You can freely choose the dimension of the queries and the keys, provided that both have the same dimension.

$$\boxed{\mathbf{\Omega_k}, \mathbf{\Omega_q} \in \mathbb{R}^{D_2 \times D_{in}} \quad \text{and} \quad \mathbf{\beta_k}, \mathbf{\beta_q} \in \mathbb{R}^{D_2}}$$

Mathematically allowing for:

$\text{Queries: }\mathbf{q}_n = \beta_q + \Omega_q \mathbf{x}_n$

$\text{Keys: } \mathbf{k}_m = \beta_k + \Omega_k \mathbf{x}_m$

The shared dimension $D_2$ is what makes the dot product $\mathbf{q}_n \cdot  \mathbf{k}_m$ well defined.
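
A quick shape check in NumPy makes these constraints concrete. The sizes are arbitrary illustrative choices, and each weight matrix is stored as (output dimension) x (input dimension) so that the product with $\mathbf{x}$ is defined.

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D_1, D_2 = 5, 3, 2                 # illustrative sizes

x_m = rng.normal(size=D_in)
x_n = rng.normal(size=D_in)

omega_v, beta_v = rng.normal(size=(D_1, D_in)), rng.normal(size=D_1)
omega_q, beta_q = rng.normal(size=(D_2, D_in)), rng.normal(size=D_2)
omega_k, beta_k = rng.normal(size=(D_2, D_in)), rng.normal(size=D_2)

v_m = beta_v + omega_v @ x_m             # shape (D_1,)
q_n = beta_q + omega_q @ x_n             # shape (D_2,)
k_m = beta_k + omega_k @ x_m             # shape (D_2,)

score = k_m @ q_n                        # scalar: only defined because both have size D_2
print(v_m.shape, q_n.shape, k_m.shape)   # (3,) (2,) (2,)
```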

| **Component**   | **Dimension**                  | **Freedom**                |
|-----------------|-------------------------------|----------------------------|
| Input           | $\mathbf{x}_i \in \mathbb{R}^{D_{in}}$   | $D_{in}$: arbitrary       |
| Value           | $\mathbf{v}_m \in \mathbb{R}^{D_1}$      | $D_1$: freely chosen      |
| Query/Key       | $\mathbf{q}_n, \mathbf{k}_m \in \mathbb{R}^{D_2}$ | $D_2$: freely chosen (must match for both) |

---

### Matrix Form

We now present the above in a more compact form. Here $\mathbf{1}$ is an $N \times 1$ vector of ones, so $\mathbf{\beta}\mathbf{1}^T$ copies the bias into every column, and the softmax is applied to each column separately.

<div align="center">

|**Term** | **Notation** | **Matrix Dimension** | **Calculation** | 
|---------|--------------|----------------------|-------------------|
| **Input** | $\mathbf{X}$ | $\mathbb{R}^{D_{in} \times N}$| **Design Matrix** | 
| **Value** | $\mathbf{V}[X]$ |  $\mathbb{R}^{D_1 \times N}$ | $\mathbf{\beta_v}\mathbf{1}^T + \mathbf{\Omega_v}\mathbf{X}$ |
| **Queries** | $\mathbf{Q}[X]$ |  $\mathbb{R}^{D_2 \times N}$ | $\mathbf{\beta_q}\mathbf{1}^T + \mathbf{\Omega_q}\mathbf{X}$ |
| **Keys** | $\mathbf{K}[X]$ |  $\mathbb{R}^{D_2 \times N}$ | $\mathbf{\beta_k}\mathbf{1}^T + \mathbf{\Omega_k}\mathbf{X}$ |
| **Self Attention** | $\mathbf{SA}[X]$ |  $\mathbb{R}^{D_1 \times N}$ | $\mathbf{V}[\mathbf{X}] \cdot \mathbf{Softmax}\left[\mathbf{K}[X]^T\mathbf{Q}[X]\right]$|

</div>
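
The matrix form translates almost line for line into NumPy. The sketch below assumes one input per column of `X`, uses random stand-ins for the learned parameters, and relies on broadcasting to add each bias to every column; the function names are illustrative.

```python
import numpy as np

def softmax_columns(S):
    """Softmax applied independently to each column of S."""
    E = np.exp(S - S.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def self_attention(X, beta_v, omega_v, beta_q, omega_q, beta_k, omega_k):
    """SA[X] = V[X] @ Softmax[K[X]^T Q[X]], with one input per column of X."""
    V = beta_v + omega_v @ X      # (D_1, N); the (D_1, 1) bias broadcasts over columns
    Q = beta_q + omega_q @ X      # (D_2, N)
    K = beta_k + omega_k @ X      # (D_2, N)
    A = softmax_columns(K.T @ Q)  # (N, N); column n holds the weights for output n
    return V @ A, A

rng = np.random.default_rng(0)
D_in, D_1, D_2, N = 5, 3, 2, 4
X = rng.normal(size=(D_in, N))
params = (
    rng.normal(size=(D_1, 1)), rng.normal(size=(D_1, D_in)),   # beta_v, omega_v
    rng.normal(size=(D_2, 1)), rng.normal(size=(D_2, D_in)),   # beta_q, omega_q
    rng.normal(size=(D_2, 1)), rng.normal(size=(D_2, D_in)),   # beta_k, omega_k
)
SA, A = self_attention(X, *params)
print(SA.shape, A.shape)  # (3, 4) (4, 4)
```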

#### Scaled self-attention

One issue with the softmax is that large values in its argument can dominate the result. The smaller values then have little effect on the output, which can hamper the learning process. We can mitigate this effect by scaling the dot products by $\sqrt{D_q}$, where $D_q$ is the dimension of the queries and keys ($D_2$ above):

$$\mathbf{SA}[X] = \mathbf{V}[\mathbf{X}] \cdot \mathbf{Softmax}\left[\frac{\mathbf{K}[X]^T\mathbf{Q}[X]}{\sqrt{D_q}}\right]$$
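
A small numerical example shows the effect of the scaling. The three dot products and the dimension `D_q = 16` below are made-up values chosen only to make the contrast visible.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()

D_q = 16                                  # illustrative query/key dimension
scores = np.array([8.0, 4.0, 0.0])        # hypothetical dot products k_m . q_n

unscaled = softmax(scores)
scaled = softmax(scores / np.sqrt(D_q))   # scores become [2, 1, 0]

print(unscaled.round(3))  # approximately [0.982, 0.018, 0.000]: one weight dominates
print(scaled.round(3))    # approximately [0.665, 0.245, 0.090]: weights are more spread out
```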



<div align="center">
<img src="../images/chap10/SAHead.png" width="710"/>
</div>

---

### Positional Encoding

The order of the inputs matters when processing sequences, but the self-attention computation above has no notion of position, so we now present two methods to incorporate positional information.

**Absolute positional encodings:**

A matrix $\Pi$ that encodes the positional information is added to the input $\mathbf{X}$. <br> Each column of $\Pi$ is unique and hence contains information about the absolute position in the input sequence.

- This matrix can be chosen in advance.
- This matrix can be learned. 

To get a better grasp of this, consider the following example: 

Suppose our input matrix is $\mathbf{X} \in \mathbb{R}^{3 \times 4}$, and define the positional encoding $\Pi \in \mathbb{R}^{3 \times 4}$ as follows:

```
X = [                            π = [                   
  [x11, x12, x13, x14],             [0,   1,   2,   3  ],
  [x21, x22, x23, x24],             [0.1, 0.1, 0.1, 0.1],
  [x31, x32, x33, x34]              [0.5, 0.5, 0.5, 0.5] 
]                                    ] 

X_pos = X + π                                            
```
In practice the actual encodings are more complex, but the important property is that each column is unique.
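
The same toy example can be written in NumPy, together with a check that every column of the encoding is unique. The random `X` is a placeholder for real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))               # D = 3 features, N = 4 positions

# The illustrative encoding from above: one column per position.
Pi = np.array([
    [0.0, 1.0, 2.0, 3.0],
    [0.1, 0.1, 0.1, 0.1],
    [0.5, 0.5, 0.5, 0.5],
])

X_pos = X + Pi

# Every column of Pi must differ from every other column.
num_unique_columns = len({tuple(col) for col in Pi.T})
print(X_pos.shape, num_unique_columns)    # (3, 4) 4
```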


**Relative positional encodings:**

The input to a self-attention mechanism can be any of: 
- Sentence
- Many sentences
- Partial sentence
- Word
- Letter

In these settings the absolute position of a word isn't as important; we care about how an **earlier input** affects a **future input**.<br>
To capture this, each element of the attention matrix corresponds to a particular offset between key position $a$ and query position $b$.<br>
Relative positional encodings learn a parameter $\pi_{a,b}$ for each offset and use it to modify the attention matrix by adding these values, multiplying by them, or altering the attention in some other way.

Example:
Suppose we have 4 embedded inputs: $x_1, x_2, x_3, x_4$ 

1. **Compute Relative Positions**
    - For each pair of positions $(a,b)$, compute the offset: $r  = a-b$
    - With 4 inputs, the possible offsets are $-3, -2, -1, 0, 1, 2, 3$
2. **Learnable Relative Encoding Table**
    - Create a table of learnable parameters $\pi_r$, for each possible offset $r$:

|$\text{offset:}$| $-3$ | $-2$ | $-1$| $0$ |  $1$ |  $2$ |  $3$|
|----------------|------|------|----|-----|------|-----|------|
|$\pi_r$ |$v_{-3}$| $v_{-2}$| $v_{-1}$| $v_{0}$|  $v_1$ | $v_2$|  $v_3$ |

3. **Modify Attention Score**
    Either add a learned scalar $\pi_{a-b}$ to the score, or add a learned vector $\pi_{a-b}$ (with the same dimension as $k_a$) to the key:
    $$\text{score}(a,b) = q_b \cdot k_a + \pi_{a-b} \quad \text{or} \quad \text{score}(a,b) = q_b \cdot (k_a + \pi_{a-b}) $$

    $$\text{Matrix Form:} \quad  \mathbf{SA}[X] = \mathbf{V}[\mathbf{X}] \cdot \mathbf{Softmax}\left[\mathbf{K}[X]^T\mathbf{Q}[X] + \Pi\right]$$
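
A minimal NumPy sketch of the additive variant follows. It assumes one scalar parameter per offset, random stand-ins for the learned table and for $\mathbf{V}$, $\mathbf{Q}$, $\mathbf{K}$, and an index shift of $N-1$ so that negative offsets map to valid array positions.

```python
import numpy as np

def softmax_columns(S):
    """Softmax applied independently to each column of S."""
    E = np.exp(S - S.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
N, D_1, D_2 = 4, 3, 2
V = rng.normal(size=(D_1, N))
Q = rng.normal(size=(D_2, N))
K = rng.normal(size=(D_2, N))

# One scalar per offset r = a - b in {-(N-1), ..., N-1}; learned in practice,
# random here. pi_table[r + N - 1] holds pi_r.
pi_table = rng.normal(size=2 * N - 1)

a = np.arange(N)[:, None]                # key positions (rows)
b = np.arange(N)[None, :]                # query positions (columns)
Pi = pi_table[(a - b) + (N - 1)]         # Pi[a, b] = pi_{a - b}, shape (N, N)

A = softmax_columns(K.T @ Q + Pi)        # additive variant of the score
SA = V @ A
print(Pi.shape, SA.shape)                # (4, 4) (3, 4)
```

Because `Pi` depends only on `a - b`, all entries on the same diagonal share one parameter, which is what makes the encoding relative.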

---

### Multiple Heads 

Since we can produce a single self-attention output, we can also run several self-attention mechanisms in parallel, each called a head. Each head has its own parameters, so the purpose of doing this is to provide a **Richer Representation**, **Capture Diverse Relationships** and **Improve Expressiveness**.

Let $H$ be the number of heads we produce. A single head $h$ will now be denoted as: 

$$ \mathbf{SA}_h[X] = \mathbf{V}_h[\mathbf{X}] \cdot \mathbf{Softmax}\left[\frac{\mathbf{K}_h[X]^T\mathbf{Q}_h[X]}{\sqrt{D_q}} + \Pi\right]$$

**Parameters**

$\{\beta_{vh}, \Omega_{vh}\} \ \{\beta_{qh}, \Omega_{qh}\} \ \{\beta_{kh}, \Omega_{kh}\}$

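The outputs of these heads are concatenated and passed through a final linear transformation, where $\mathbf{1}$ is an $N \times 1$ vector of ones so that the bias is added to every column: 

$$\mathbf{MhSa}[\mathbf{X}] = \mathbf{\Omega_c}\left[\mathbf{SA}_1[X]^T,\mathbf{SA}_2[X]^T, \dots, \mathbf{SA}_H[X]^T \right]^T + \mathbf{\beta_c}\mathbf{1}^T$$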
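
As a rough NumPy sketch of multi-head self-attention: the helper names, the random parameters and the choice of head size `D // H` are assumptions made for illustration, and the positional term $\Pi$ is left out to keep the example short.

```python
import numpy as np

def softmax_columns(S):
    """Softmax applied independently to each column of S."""
    E = np.exp(S - S.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def init_head(rng, D_in, D_head):
    """Random stand-ins for the learned (bias, weights) pairs of one head."""
    return {name: (rng.normal(size=(D_head, 1)), rng.normal(size=(D_head, D_in)))
            for name in ("v", "q", "k")}

def head_attention(X, p):
    """Scaled dot-product self-attention for one head (no positional term)."""
    V, Q, K = (beta + omega @ X for beta, omega in (p["v"], p["q"], p["k"]))
    return V @ softmax_columns(K.T @ Q / np.sqrt(Q.shape[0]))

def multi_head_self_attention(X, heads, omega_c, beta_c):
    """Stack the head outputs vertically, then recombine them linearly."""
    stacked = np.vstack([head_attention(X, p) for p in heads])   # (H * D_head, N)
    return omega_c @ stacked + beta_c                            # bias broadcasts over columns

rng = np.random.default_rng(0)
D, N, H = 8, 5, 2
D_head = D // H                          # a common choice, assumed here
X = rng.normal(size=(D, N))
heads = [init_head(rng, D, D_head) for _ in range(H)]
omega_c = rng.normal(size=(D, H * D_head))
beta_c = rng.normal(size=(D, 1))

out = multi_head_self_attention(X, heads, omega_c, beta_c)
print(out.shape)  # (8, 5)
```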



<div align="center">
<img src="../images/chap10/MhSA.png" width="510"/>
</div>


## Transformer Layers

The self-attention layer is a single component in the transformer layer.<br>
It's followed by a fully connected network (Multi-Layer Perceptron) that is applied to each embedding separately.<br>
Both are wrapped in residual connections, and after each of these two layers it's typical to add a Layer-Norm, which normalises each embedding in each batch element separately.

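$$\begin{align}
\mathbf{X'} &= \mathbf{X} + \mathbf{MhSa}[\mathbf{X}] \\
\mathbf{X_{norm1}} &= \mathbf{LayerNorm}[\mathbf{X'}] \\
\mathbf{x_{mlp,n}} &= \mathbf{x_{norm1,n}} + \mathbf{mlp[x_{norm1,n}]} \quad \forall n \in \{1, \dots, N\} \\
\mathbf{X_{norm2}} &= \mathbf{LayerNorm}[\mathbf{X_{mlp}}]
\end{align}$$

Here $\mathbf{x_{norm1,n}}$ and $\mathbf{x_{mlp,n}}$ are the $n^{th}$ columns of $\mathbf{X_{norm1}}$ and $\mathbf{X_{mlp}}$ respectively.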
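
The data flow of one transformer layer can be sketched in NumPy as below. To keep it short, this sketch makes several simplifying assumptions: a single attention head without biases stands in for $\mathbf{MhSa}$, the MLP has one hidden ReLU layer, LayerNorm omits its learned scale and shift, and all weights are random.

```python
import numpy as np

def softmax_columns(S):
    """Softmax applied independently to each column of S."""
    E = np.exp(S - S.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def layer_norm(X, eps=1e-5):
    """Normalise each column (one token embedding) to zero mean, unit std."""
    mu = X.mean(axis=0, keepdims=True)
    sigma = X.std(axis=0, keepdims=True)
    return (X - mu) / (sigma + eps)

def self_attention(X, W):
    """Single-head stand-in for MhSa[X]; biases omitted for brevity."""
    V, Q, K = W["v"] @ X, W["q"] @ X, W["k"] @ X
    return V @ softmax_columns(K.T @ Q / np.sqrt(Q.shape[0]))

def mlp(X, W):
    """One-hidden-layer ReLU network applied to every column independently."""
    return W["out"] @ np.maximum(W["in"] @ X, 0.0)

def transformer_layer(X, W):
    X_prime = X + self_attention(X, W)      # residual around attention
    X_norm1 = layer_norm(X_prime)
    X_mlp = X_norm1 + mlp(X_norm1, W)       # residual around the MLP
    return layer_norm(X_mlp)                # X_norm2

rng = np.random.default_rng(0)
D, N, D_hidden = 6, 4, 12
W = {
    "v": rng.normal(size=(D, D)), "q": rng.normal(size=(D, D)),
    "k": rng.normal(size=(D, D)),
    "in": rng.normal(size=(D_hidden, D)), "out": rng.normal(size=(D, D_hidden)),
}
X = rng.normal(size=(D, N))
Y = transformer_layer(X, W)
print(Y.shape)  # (6, 4)
```

The output has the same shape as the input, which is what allows transformer layers to be stacked.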

### Layer Normalisation

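Normalise across the features of each individual sample (i.e. each token embedding). Here $\gamma$ and $\beta$ are a learned scale and shift, and $\epsilon$ is a small constant for numerical stability:

$$\mathbf{LayerNorm(x)} = \gamma \odot \frac{x - \mu}{\sigma + \epsilon} + \beta$$

$$\mu = \frac{1}{D}\sum_{i=1}^D x_i$$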

$$\sigma = \sqrt{\frac{1}{D} \sum_{i=1}^D (x_i - \mu)^2}$$

**LayerNorm vs. BatchNorm**
- LayerNorm normalises across the **FEATURES** of each individual sample, independent of the batch.
- BatchNorm normalises across the **BATCH** for each feature.

We use LayerNorm since BatchNorm is sensitive to batch size and less effective for sequence models like transformers.
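
The formula and the contrast with BatchNorm can be illustrated in NumPy. Note that this sketch stores one sample per row (shape `(B, D)`), unlike the column-per-token layout used earlier, and the BatchNorm line only computes training-time statistics without learned parameters or running averages. All sizes and values are illustrative.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm over the last axis (the D features of each embedding)."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)    # population std, i.e. 1/D inside the root
    return gamma * (x - mu) / (sigma + eps) + beta

rng = np.random.default_rng(0)
B, D = 4, 8                                  # illustrative: 4 samples, 8 features
x = rng.normal(loc=3.0, scale=2.0, size=(B, D))
gamma, beta = np.ones(D), np.zeros(D)        # identity scale and shift

ln = layer_norm(x, gamma, beta)

# BatchNorm statistics, for contrast: one mean/std per feature, across the batch.
bn = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-5)

print(np.abs(ln.mean(axis=1)).max() < 1e-9)  # True: each sample (row) has mean ~0
print(np.abs(bn.mean(axis=0)).max() < 1e-9)  # True: each feature (column) has mean ~0
```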


<div align="center">
<img src="../images/chap10/transform.png" width="710"/>
</div>

## Tokenization

A typical NLP pipeline starts with a $tokenizer$ that splits the text into words or word fragments; each token is then mapped to a learned embedding.

Here are some of the difficulties with tokenization:
1. We may define a vocabulary, but it's very likely that new words outside it will come up.
2. How do we handle punctuation, given that it's critical to these models?
3. Are words that share the same root the same or different? 
   - Walk, Walks, Walked, walking
   - Journée, journal, journaliste
   - שלם, תשלום, השלמה, משלם

**Subword Tokenization**

Break words into smaller units, allowing the model to handle unknown words and variations of the same word.

**Character-Level Tokenization**

Split the text into individual characters, though this makes the sequence very long.

**Unigram or Word-Level Tokenization**

Uses a fixed vocabulary of words, though as mentioned earlier it would struggle with unknown words.
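
The three granularities can be compared on a tiny example. The subword splitter below is a deliberately simple greedy longest-match over a hand-written, hypothetical vocabulary; real subword tokenizers learn their vocabulary from data, so treat this only as an illustration of the idea.

```python
text = "walking walked"

# Word-level: split on whitespace (punctuation handling is ignored here).
word_tokens = text.split()

# Character-level: every character, including the space, is a token.
char_tokens = list(text)

# Subword-level: greedy longest-match against a tiny hypothetical vocabulary.
vocab = {"walk", "ing", "ed"}

def subword_tokenize(word, vocab):
    tokens, start = [], 0
    while start < len(word):
        # Try the longest remaining piece first and shrink until one matches.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                tokens.append(word[start:end])
                start = end
                break
        else:
            tokens.append(word[start])   # fall back to a single character
            start += 1
    return tokens

subword_tokens = [t for w in word_tokens for t in subword_tokenize(w, vocab)]

print(word_tokens)       # ['walking', 'walked']
print(len(char_tokens))  # 14
print(subword_tokens)    # ['walk', 'ing', 'walk', 'ed']
```

The subword version reuses the shared root `walk` across both words, while an unseen form such as `walks` still decomposes into `walk` plus a single-character fallback.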

### Embedding

Each token in the vocabulary $\mathbb{V}$ is mapped to a unique $word \ embedding$, and the embeddings for the whole vocabulary are stored in a matrix $\Omega_e \in \mathbb{R}^{D \times |\mathbb{V}|}$.

The $N$ input tokens are first encoded in the matrix $\mathbf{T} \in \mathbb{R}^{|\mathbb{V}| \times N}$, where the $n^{th}$ column corresponds to the $n^{th}$ token and is a $|\mathbb{V}| \times 1$ *one-hot vector*.

The input embeddings are computed as $\mathbf{X} = \mathbf{\Omega_e }\mathbf{T}$, where $\mathbf{\Omega_e}$ is a learned parameter.

Typically, the embedding size is $D=1024$ and the vocabulary size is $|\mathbb{V}| = 30{,}000$.
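
The product $\mathbf{\Omega_e}\mathbf{T}$ can be sketched in NumPy with tiny stand-in sizes. The token indices and the random embedding matrix are made up for illustration; the point is that multiplying by a one-hot column simply selects one column of $\mathbf{\Omega_e}$.

```python
import numpy as np

rng = np.random.default_rng(0)
D, vocab_size = 4, 10                    # tiny stand-ins for D = 1024, |V| = 30,000
token_ids = np.array([3, 7, 3, 1])       # hypothetical token indices, N = 4
N = len(token_ids)

omega_e = rng.normal(size=(D, vocab_size))   # learned in practice, random here

# One-hot matrix T: column n has a single 1 in the row of the n-th token.
T = np.zeros((vocab_size, N))
T[token_ids, np.arange(N)] = 1.0

X = omega_e @ T                          # (D, N) input embeddings

# Multiplying by a one-hot column just selects a column of omega_e.
X_lookup = omega_e[:, token_ids]
print(X.shape, np.allclose(X, X_lookup))  # (4, 4) True
```

This equivalence is why implementations usually store the embeddings as a lookup table instead of materialising the one-hot matrix.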


<div align="center">
<img src="../images/chap10/embedding.png" width="710"/>
</div>