### Transformer (cont'd)

#### Transformer: Multi-Head Attention

> - The input word vectors are the queries, keys and values
> - In other words, the word vectors themselves select each other
> - **Problem of single attention**
    - Only one way for words to interact with one another
> - **Solution**
    - Multi-head attention maps $Q, K, V$ into $h$ lower-dimensional spaces via $W$ matrices
> - Then apply attention in each space, concatenate the outputs, and pipe them through a linear layer
> - $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) W^O$
> - where $\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$
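
The two formulas above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the helper names (`matmul`, `softmax`, `attention`, `multi_head`, `rand_matrix`), the sizes, and the random weights are all hypothetical, and `attention` is assumed to be the standard scaled dot-product form $\mathrm{softmax}(QK^T / \sqrt{d_k})V$.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Assumed scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head(q, k, v, w_q, w_k, w_v, w_o):
    """Concat(head_1, ..., head_h) W^O with head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)."""
    heads = [attention(matmul(q, wq), matmul(k, wk), matmul(v, wv))
             for wq, wk, wv in zip(w_q, w_k, w_v)]
    # Concatenate the h head outputs along the feature axis, row by row.
    concat = [sum((head[i] for head in heads), []) for i in range(len(q))]
    return matmul(concat, w_o)

def rand_matrix(rows, cols, rng):
    """Hypothetical helper: random weights standing in for learned ones."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d_model, h = 4, 8, 2            # illustrative sizes, not from the notes
d_head = d_model // h              # each head works in a lower-dimensional space
x = rand_matrix(n, d_model, rng)   # self-attention: Q = K = V = word vectors
w_q = [rand_matrix(d_model, d_head, rng) for _ in range(h)]
w_k = [rand_matrix(d_model, d_head, rng) for _ in range(h)]
w_v = [rand_matrix(d_model, d_head, rng) for _ in range(h)]
w_o = rand_matrix(h * d_head, d_model, rng)

out = multi_head(x, x, x, w_q, w_k, w_v, w_o)
print(len(out), len(out[0]))       # one d_model-sized vector per input position
```

Because each head has its own projection matrices, each head can learn a different way for words to interact, which is the motivation given above.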

> - Maximum path lengths, per-layer complexity and minimum number of sequential operations for different layer types
> - $n$ is the sequence length
> - $d$ is the dimension of representation
> - $k$ is the kernel size of convolutions
> - $r$ is the size of the neighborhood in restricted self-attention

| Layer Type | Complexity per Layer | Sequential Operations | Maximum Path Length |
| :--- | :---: | :---: | :---: |
| **Self-Attention** | $O(n^2 \cdot d)$ | $O(1)$ | $O(1)$ |
| **Recurrent** | $O(n \cdot d^2)$ | $O(n)$ | $O(n)$ |
| **Convolutional** | $O(k \cdot n \cdot d^2)$ | $O(1)$ | $O(\log_k(n))$ |
| **Self-Attention (restricted)** | $O(r \cdot n \cdot d)$ | $O(1)$ | $O(n/r)$ |
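
To get a feel for the complexity column, the leading-order terms can be evaluated for concrete sizes. The sketch below drops constant factors, and the function name `per_layer_cost` and the example values of $n$, $d$, $k$, $r$ are illustrative assumptions rather than figures from the notes.

```python
def per_layer_cost(n, d, k=3, r=None):
    """Leading-order operation counts per layer, constant factors dropped."""
    costs = {
        "self-attention": n * n * d,
        "recurrent": n * d * d,
        "convolutional": k * n * d * d,
    }
    if r is not None:
        costs["self-attention (restricted)"] = r * n * d
    return costs

# Illustrative values: a 50-token sentence with 512-dimensional representations.
costs = per_layer_cost(n=50, d=512, k=3, r=10)
for name, cost in costs.items():
    print(f"{name}: {cost:,}")
```

With these numbers self-attention is cheaper per layer than the recurrent layer, which holds whenever the sequence length $n$ is smaller than the representation dimension $d$.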

#### Transformer: Block-Based Model

> - Each block has two sub-layers
    - Multi-head attention
    - Two-layer feed-forward NN (with ReLU)
> - Each of these two sub-layers also has
    - Residual connection and layer normalization
    - $\mathrm{LayerNorm}(x + \mathrm{sublayer}(x))$
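
A rough sketch of the wrapper $\mathrm{LayerNorm}(x + \mathrm{sublayer}(x))$ applied to the feed-forward sub-layer, for a single position vector. The function names, the tiny weight matrices, and the `eps` term are assumptions made for illustration, and the normalization here omits the learnable gain and bias to keep the example short.

```python
import math

def layer_norm(x, eps=1e-5):
    """Simplified normalization of one vector (learnable gain and bias omitted)."""
    mu = sum(x) / len(x)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x) + eps)
    return [(v - mu) / sigma for v in x]

def feed_forward(x, w1, b1, w2, b2):
    """Two-layer feed-forward network with ReLU: W2 relu(W1 x + b1) + b2."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]

def residual_sublayer(x, sublayer):
    """LayerNorm(x + sublayer(x)): residual connection, then normalization."""
    y = sublayer(x)
    return layer_norm([xi + yi for xi, yi in zip(x, y)])

# Hypothetical weights: d_model = 2, hidden size = 3.
w1, b1 = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]], [0.0, 0.0, 0.0]
w2, b2 = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]], [0.0, 0.0]

x = [1.0, 2.0]
block_out = residual_sublayer(x, lambda v: feed_forward(v, w1, b1, w2, b2))
print(block_out)
```

The same `residual_sublayer` wrapper would be applied to the multi-head attention sub-layer; only the `sublayer` callable changes.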
    
#### Transformer: Layer Normalization

> - Layer normalization rescales its input to have zero mean and unit variance, per layer and per training point (and adds two learnable parameters, the gain $g$ and the bias $b$)
    - Normalization variants: Batch, Layer, Instance, Group Norm
> - $\mu^l = \frac{1}{H} \sum_{i=1}^H a_i^l$
> - $\sigma^l = \sqrt{\frac{1}{H} \sum_{i=1}^H (a_i^l - \mu^l)^2}$
> - $h_i^l = f(\frac{g_i}{\sigma^l}(a_i^l - \mu^l) + b_i)$

> - **Layer normalization consists of two steps**
    - Normalization of each word vector to have a mean of zero and a variance of one
    - Affine transformation of each sequence vector with learnable parameters
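
The two steps can be sketched for a single vector as below. The small `eps` added under the square root is an assumption commonly made for numerical stability and is not part of the formulas above; the nonlinearity $f$ is left out, and the `gain` and `bias` values are placeholders for learned parameters.

```python
import math

def layer_norm(a, gain, bias, eps=1e-5):
    """Step 1: normalize to zero mean and unit variance. Step 2: apply gain and bias."""
    H = len(a)
    mu = sum(a) / H
    sigma = math.sqrt(sum((x - mu) ** 2 for x in a) / H + eps)
    return [g * (x - mu) / sigma + b for x, g, b in zip(a, gain, bias)]

a = [1.0, 2.0, 3.0, 4.0]
# Gain of one and bias of zero leave only the normalization step visible.
normed = layer_norm(a, gain=[1.0] * 4, bias=[0.0] * 4)

mean = sum(normed) / len(normed)
var = sum((v - mean) ** 2 for v in normed) / len(normed)
print(round(mean, 6), round(var, 4))
```

Note that the statistics are computed over the $H$ features of one vector, so the result does not depend on the other examples in the batch.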
    
#### Transformer: Positional Encoding

> - Use sinusoidal functions of different frequencies
> - $\mathrm{PE}_{(pos, 2i)} = \sin{(pos / 10000^{2i / d_{\mathrm{model}}})}$
> - $\mathrm{PE}_{(pos, 2i+1)} = \cos{(pos / 10000^{2i / d_{\mathrm{model}}})}$
> - The model can easily learn to attend by relative position, since for any fixed offset $k$, $\mathrm{PE}_{(pos+k)}$ can be represented as a linear function of $\mathrm{PE}_{(pos)}$
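
The encoding and the linearity claim can be checked numerically. In the sketch below, `positional_encoding` follows the two formulas above (assuming an even $d_{\mathrm{model}}$), and the hypothetical helper `shift` rebuilds $\mathrm{PE}_{(pos+k)}$ from $\mathrm{PE}_{(pos)}$ using only the angle-addition identities, so its coefficients depend on $k$ but not on $pos$.

```python
import math

def positional_encoding(pos, d_model):
    """Interleaved sin/cos values for one position (d_model assumed even)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def shift(pe, k, d_model):
    """Linear map taking PE(pos) to PE(pos + k): a rotation per sin/cos pair."""
    out = []
    for i in range(d_model // 2):
        w = 1.0 / (10000 ** (2 * i / d_model))
        s, c = pe[2 * i], pe[2 * i + 1]
        out.append(s * math.cos(w * k) + c * math.sin(w * k))  # sin(a + b)
        out.append(c * math.cos(w * k) - s * math.sin(w * k))  # cos(a + b)
    return out

d_model = 8                      # illustrative size
direct = positional_encoding(8, d_model)
via_shift = shift(positional_encoding(5, d_model), 3, d_model)
max_err = max(abs(x - y) for x, y in zip(direct, via_shift))
print(max_err < 1e-9)
```

Each sin/cos pair is rotated by an angle that depends only on the offset, which is what makes relative positions easy for a linear layer to pick up.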

#### Transformer: Decoder

> - Two sub-layer changes in decoder
> - Masked decoder self-attention on previously generated outputs
> - Encoder-Decoder attention, where queries come from previous decoder layer and keys and values come from output of encoder

#### Transformer: Masked Self-Attention

> - Words that have not yet been generated cannot be accessed at inference time
> - Zeroing the attention weights on ungenerated positions and renormalizing the softmax output prevents the model from accessing those words
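
A minimal sketch of the masking step, assuming a square matrix of raw attention scores in which row $i$ belongs to the $i$-th generated position. The function name `causal_attention_weights` and the example scores are hypothetical. Taking the softmax over only the visible positions gives the same result as zeroing the masked weights and renormalizing, or as setting the masked scores to $-\infty$ before the softmax.

```python
import math

def causal_attention_weights(scores):
    """Row i attends only to columns 0..i; later columns get weight 0."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Softmax over visible positions, zero weight on ungenerated ones.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

# Equal raw scores make the effect of the mask easy to read off.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
masked = causal_attention_weights(scores)
for row in masked:
    print([round(w, 3) for w in row])
```

Each row still sums to one, but the weight is spread only over the current and earlier positions.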