<a href="https://colab.research.google.com/github/rahiakela/transformers-for-natural-language-processing/blob/main/1-model-architecture-of-the-transformer/3_architecture_of_transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## The Architecture of Transformer

The original Transformer model is a stack of 6 layers. The output of layer $l$ is the input of layer $l+1$ until the final prediction is reached. There is a 6-layer encoder stack on the left and a 6-layer decoder stack on the right:

<img src='https://github.com/rahiakela/img-repo/blob/master/transformers-for-natural-language-processing/transformer-architecture.png?raw=1' width='800'/>

On the left, the inputs enter the encoder side of the Transformer through an attention sub-layer and FeedForward Network (FFN) sub-layer. 

On the right, the target outputs go into the decoder side of the Transformer through two attention sub-layers and an FFN sub-layer. 

**We immediately notice that there is no RNN, LSTM, or CNN. Recurrence has been abandoned.**

Attention has replaced recurrence, which requires an increasing number of
operations as the distance between two words increases. The attention mechanism
is a "word-to-word" operation. The attention mechanism will find how each word
is related to all other words in a sequence, including the word being analyzed itself.

Let's examine the following sequence:

```
The cat sat on the mat.
```

Attention will run dot products between word vectors and determine the strongest
relationships of a word among all the other words, including itself ("cat" and "cat"):

<img src='https://github.com/rahiakela/img-repo/blob/master/transformers-for-natural-language-processing/attending-words.png?raw=1' width='800'/>

The attention mechanism will provide a deeper relationship between words and
produce better results.

**For each attention sub-layer, the original Transformer model runs not one but eight attention mechanisms in parallel to speed up the calculations.**

We just looked at the Transformer from the outside. Let's now go into each
component of the Transformer. We will start with the encoder.


## Setup

In [1]:
# Transformer Installation
!pip -qq install transformers



In [2]:
import numpy as np
from scipy.special import softmax
from transformers import pipeline

## The encoder stack

Both the encoder and the decoder of the original Transformer model are stacks of layers. Each layer of the encoder stack has the following structure:

<img src='https://github.com/rahiakela/img-repo/blob/master/transformers-for-natural-language-processing/encoder-stack.png?raw=1' width='800'/>

The original encoder layer structure remains the same for all of the N=6 layers of the Transformer model. Each layer contains two main sub-layers: 

- a multi-headed attention mechanism 
- a fully connected position-wise feedforward network

Notice that a residual connection surrounds each main sub-layer, $Sublayer(x)$, in the Transformer model. **These connections carry the unprocessed input $x$ of a sub-layer to a layer normalization function, which helps ensure that key information such as positional encoding is not lost on the way.**

The normalized output of each layer is thus:

$$ LayerNormalization (x + Sublayer(x)) $$

> Though the structure of each of the N=6 layers of the encoder is identical, the content of each layer is not strictly identical to the previous layer.

For example, **the embedding sub-layer is only present at the bottom level of the stack. The other five layers do not contain an embedding layer, and this guarantees that the encoded input is stable through all the layers.**

Also, **the multi-head attention mechanisms perform the same functions from layer 1 to 6. However, they do not perform the same tasks. Each layer learns from the previous layer and explores different ways of associating the tokens in the sequence.**
It looks for various associations of words, just like how we look for different
associations of letters and words when we solve a crossword puzzle.

The designers of the Transformer introduced a very efficient constraint. The output of every sub-layer of the model has a constant dimension, including the embedding layer and the residual connections. This dimension is $d_{model}$ and can be set to another value depending on your goals. **In the original Transformer architecture, $d_{model} =512$.**

**$d_{model}$ has a powerful consequence. Practically all the key operations are dot products. Because the dimensions remain stable, fewer operations are needed, fewer machine resources are consumed, and it is easier to trace the information as it flows through the model.**

This global view of the encoder shows the highly optimized architecture of the
Transformer.

## Input embedding

The input embedding sub-layer converts the input tokens to vectors of dimension
$d_{model} = 512$ using learned embeddings in the original Transformer model. The structure of the input embedding is classical:

<img src='https://github.com/rahiakela/img-repo/blob/master/transformers-for-natural-language-processing/input-embedding.png?raw=1' width='800'/>

The embedding sub-layer works like other standard transduction models. A
tokenizer transforms a sentence into tokens. Each tokenizer has its own methods,
but the results are similar.

For example, a tokenizer applied to the sequence "the Transformer is a revolutionary NLP model!" will produce the following tokens in one type of model:

```
['the', 'transform', 'er', 'is', 'a', 'revolutionary', 'n', 'l', 'p', 'model', '!']
```

You will notice that this tokenizer normalized the string to lowercase and split it into subword units. A tokenizer will generally also provide an integer representation of each token, which is used for the embedding process:

```
Text = "The cat slept on the couch. It was too tired to get up."

tokenized text= [1996, 4937, 7771, 2006, 1996, 6411, 1012, 2009, 2001,
2205, 5458, 2000, 2131, 2039, 1012]
```
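As a rough illustration of how tokens become integers, the sketch below looks up each token in a tiny hand-made vocabulary. The vocabulary, its IDs, and the `encode` helper are hypothetical and exist only for this example; a real tokenizer ships its own vocabulary and subword rules, so its IDs will differ from these.

```python
import re

# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
toy_vocab = {"[UNK]": 0, "the": 1, "cat": 2, "slept": 3, "on": 4, "couch": 5, ".": 6}

def encode(text, vocab):
    """Lowercase the text, split it into words and punctuation, and look up each token ID."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    return tokens, ids

tokens, ids = encode("The cat slept on the couch.", toy_vocab)
print(tokens)
print(ids)
```

Tokens that are missing from the vocabulary fall back to the `[UNK]` ID here, which is one common convention rather than the only one.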

There is not enough information in the tokenized text at this point to go further. The tokenized text must be embedded.

The Transformer contains a learned embedding sub-layer. Many embedding
methods can be applied to the tokenized input.

A skip-gram model focuses on a center word in a window of words and predicts the context words around it. For example, if $word(i)$ is the center word in a two-step window, a skip-gram model will analyze $word(i-2), word(i-1), word(i+1)$, and $word(i+2)$. Then the window will slide and repeat the process. A skip-gram model generally contains an input layer, weights, a hidden layer, and an output containing the word embeddings of the tokenized input words.
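The sliding window can be sketched in a few lines. This only builds the (center, context) training pairs that a skip-gram model would learn from; it does not train any embeddings. The `skipgram_pairs` helper is a name made up for this illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every position, using `window` words on each side."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the black cat sat on the couch".split()
pairs = skipgram_pairs(sentence, window=2)

# Context words seen when "cat" is the center word
print([context for center, context in pairs if center == "cat"])
```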

Suppose we need to perform embedding for the following sentence:

```
The black cat sat on the couch and the brown dog slept on the rug.
```

We will focus on two words, black and brown. The word embedding vectors of these
two words should be similar.

Since we must produce a vector of size $d_{model} = 512$ for each word, each word embedding is a vector of 512 values:

```
black=[[-0.01206071 0.11632373 0.06206119 0.01403395 0.09541149
0.10695464 0.02560172 0.00185677 -0.04284821 0.06146432 0.09466285
0.04642421 0.08680347 0.05684567 -0.00717266 -0.03163519 0.03292002
-0.11397766 0.01304929 0.01964396 0.01902409 0.02831945 0.05870414
0.03390711 -0.06204525 0.06173197 -0.08613958 -0.04654748 0.02728105
-0.07830904
…
0.04340003 -0.13192849 -0.00945092 -0.00835463 -0.06487109 0.05862355
-0.03407936 -0.00059001 -0.01640179 0.04123065
-0.04756588 0.08812257 0.00200338 -0.0931043 -0.03507337 0.02153351
-0.02621627 -0.02492662 -0.05771535 -0.01164199
-0.03879078 -0.05506947 0.01693138 -0.04124579 -0.03779858
-0.01950983 -0.05398201 0.07582296 0.00038318 -0.04639162
-0.06819214 0.01366171 0.01411388 0.00853774 0.02183574
-0.03016279 -0.03184025 -0.04273562]]
```

The word black is now represented by 512 dimensions. Other embedding methods
could be used and $d_{model}$ could have a higher number of dimensions.

The word embedding of brown is also represented by 512 dimensions:

```
brown=[[ 1.35794589e-02 -2.18823571e-02 1.34526128e-02 6.74355254e-02
1.04376070e-01 1.09921647e-02 -5.46298288e-02 -1.18385479e-02
4.41223830e-02 -1.84863899e-02 -6.84073642e-02 3.21860164e-02
4.09143828e-02 -2.74433400e-02 -2.47369967e-02 7.74542615e-02
9.80964210e-03 2.94299088e-02 2.93895267e-02 -3.29437815e-02
…
7.20389187e-02 1.57317147e-02 -3.10291946e-02 -5.51304631e-02
-7.03861639e-02 7.40829483e-02 1.04319192e-02 -2.01565702e-03
2.43322570e-02 1.92969330e-02 2.57341694e-02 -1.13280728e-01
8.45847875e-02 4.90090018e-03 5.33546880e-02 -2.31553353e-02
3.87288055e-05 3.31782512e-02 -4.00604047e-02 -1.02028981e-01
3.49597558e-02 -1.71501152e-02 3.55573371e-02 -1.77437533e-02
-5.94457164e-02 2.21221056e-02 9.73121971e-02 -4.90022525e-02]]
```

**To verify the word embedding produced for these two words, we can use cosine
similarity to see if the word embeddings of the words black and brown are similar.**

Cosine similarity normalizes each vector by its Euclidean (L2) norm so that both vectors lie on the unit sphere. The dot product of the normalized vectors is then the cosine of the angle between them.
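A minimal sketch of that computation with numpy follows. The short vectors `u`, `v`, and `w` are made-up stand-ins for the 512-dimensional embeddings, chosen so that the result is easy to check by hand.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors: dot product over the product of L2 norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional vectors standing in for 512-dimensional embeddings.
u = np.array([1.0, 2.0, 3.0, 4.0])
v = np.array([2.0, 4.0, 6.0, 8.0])    # same direction as u
w = np.array([-4.0, 3.0, -2.0, 1.0])  # orthogonal to u

print(cosine_similarity(u, v))  # close to 1: same direction
print(cosine_similarity(u, w))  # close to 0: unrelated directions
```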

The cosine similarity between the black vector of size $d_{model} = 512$ and brown vector of size $d_{model} = 512$ in the embedding of the example is:

```
cosine_similarity(black, brown)= [[0.9998901]]
```

The skip-gram produced two vectors that are very close to each other. It detected that `black` and `brown` form a color subset of the dictionary of words.

The Transformer's subsequent layers do not start empty-handed. They have learned
word embeddings that already provide information on how the words can be
associated.

**However, a big chunk of information is missing because no additional vector or
information indicates a word's position in a sequence.**

The designers of the Transformer came up with yet another innovative feature:
**positional encoding**.


## Positional encoding

We enter this positional encoding function of the Transformer with no idea of the position of a word in a sequence:

<img src='https://github.com/rahiakela/img-repo/blob/master/transformers-for-natural-language-processing/position-encoding.png?raw=1' width='800'/>

We cannot create independent positional vectors, because they would slow down the training of the Transformer and make the attention sub-layers very complex to work with. Instead, the idea is to add a positional encoding value to the input embedding rather than use additional vectors to describe the position of a token in a sequence.
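The sketch below illustrates the "add, don't append" idea using the sine and cosine formulation from the original Transformer paper. The random `embeddings` array is only a placeholder for learned embeddings, and details such as scaling may differ from the linked notebook.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sine on even dimensions and cosine on odd dimensions, for each position."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]    # (1, d_model / 2), holds the values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

seq_len, d_model = 6, 512
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(seq_len, d_model))  # placeholder for learned embeddings

pe = sinusoidal_positional_encoding(seq_len, d_model)
encoder_input = embeddings + pe  # same shape as the embeddings, no extra vectors
print(encoder_input.shape)
```

Because the positional values are added element by element, the result keeps the $d_{model}$ dimension that the rest of the model expects.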

Please refer to this notebook for [positional encoding](https://github.com/rahiakela/transformers-for-natural-language-processing/blob/main/1-model-architecture-of-the-transformer/1_positional_encoding.ipynb).



## Sub-layer 1: Multi-head attention

**The multi-head attention sub-layer contains eight heads and is followed by post-layer normalization, which adds the residual connection to the output of the sub-layer and normalizes the result.**

<img src='https://github.com/rahiakela/img-repo/blob/master/transformers-for-natural-language-processing/sub-layer-1.png?raw=1' width='800'/>

The input of the multi-head attention sub-layer of the first layer of the encoder stack is a vector that contains the embedding and the positional encoding of each word. The next layers of the stack do not start these operations over.

The dimension of the vector of each word $x_n$ of an input sequence is $d_{model} = 512$:

$$
pe(x_n) = [d_1=9.09297407e-01, d_2=9.09297407e-01, .., d_{512}=1.00000000e+00]
$$

The representation of each word $x_n$ has become a vector of $d_{model} = 512$ dimensions.

**Each word is mapped to all the other words to determine how it fits in a sequence.**

In the following sentence, we can see that "it" could be related to "cat" and "rug" in the sequence:

```
Sequence =The cat sat on the rug and it was dry-cleaned.
```

**The model will train to find out if "it" is related to "cat" or "rug."** We could run a huge calculation by training the model using the $d_{model} = 512$ dimensions as they are now.

However, we would only get one point of view at a time by analyzing the sequence
with one $d_{model}$ block. Furthermore, it would take quite some calculation time to find other perspectives.

**A better way is to divide the $d_{model} = 512$ dimensions of each word $x_n$ of $x$ (all of the words of a sequence) into 8 segments of $d_k = 64$ dimensions each.**

**We then can run the 8 "heads" in parallel to speed up the training and obtain 8 different representation subspaces of how each word relates to another:**

<img src='https://github.com/rahiakela/img-repo/blob/master/transformers-for-natural-language-processing/multi-head-representations.png?raw=1' width='800'/>

**You can see that there are now 8 heads running in parallel.** One head might decide that "it" fits well with "cat" and another that "it" fits well with "rug" and another that "rug" fits well with "dry-cleaned."

The output of each head is a matrix $z_i$ with a shape of $x \times d_k$, where $x$ here stands for the number of words in the sequence. The output of the multi-head attention is $Z$, defined as:

$$ Z = (z_0, z_1, z_2, z_3, z_4, z_5, z_6, z_7) $$

**However, $Z$ must be concatenated so that the output of the multi-head sub-layer is not a sequence of eight matrices but a single $x \times d_{model}$ matrix.**

Before exiting the multi-head attention sub-layer, the elements of $Z$ are concatenated:

$$ MultiHead(output) = Concat(z_0, z_1, z_2, z_3, z_4, z_5, z_6, z_7) \text{ with shape } x \times d_{model} $$

**Notice that the heads are concatenated into $Z$, which has a dimension of $d_{model} = 512$. The output of the multi-headed layer respects the constraint of the original Transformer model.**
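The shape bookkeeping can be checked with a short numpy sketch. It only splits and re-joins the feature axis; in a real model each 64-dimensional slice would pass through its own attention computation before concatenation. The random `x_seq` array is a placeholder for the per-word vectors.

```python
import numpy as np

seq_len, d_model, num_heads = 3, 512, 8
d_k = d_model // num_heads  # 64

rng = np.random.default_rng(0)
x_seq = rng.normal(size=(seq_len, d_model))  # placeholder for the word vectors of a sequence

# Split the feature axis into 8 heads of 64 dimensions: (num_heads, seq_len, d_k).
heads = x_seq.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)
print(heads.shape)

# Concatenating the heads along the feature axis restores (seq_len, d_model).
Z = np.concatenate([heads[i] for i in range(num_heads)], axis=-1)
print(Z.shape)
```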

Inside each head $h_n$ of the attention mechanism, each word vector has three
representations:

- A query vector $(Q)$ that has a dimension of $d_q = 64$, which is activated and trained when a word vector $x_n$ seeks all of the key-value pairs of the other word vectors, including itself in self-attention
- A key vector $(K)$ that has a dimension of $d_k = 64$, which will be trained to provide an attention value
- A value vector $(V)$ that has a dimension of $d_v = 64$, which will be trained to provide another attention value


Attention is defined as **Scaled Dot-Product Attention** which is represented in the following equation in which we plug $Q$, $K$ and $V$:

$$
Attention(Q,K,V) = softmax \begin{pmatrix} \frac{QK^T}{\sqrt{d_k}} \end{pmatrix} V
$$

**The vectors all have the same dimension making it relatively simple to use a scaled dot product to obtain the attention values for each head and then concatenate the output Z of the 8 heads.**

To obtain $Q$, $K$, and $V$, we must train the model with their respective weight matrices $Q_w$, $K_w$, and $V_w$, which have $d_k = 64$ columns and $d_{model} = 512$ rows. For example, $Q$ is obtained by a dot product between $x$ and $Q_w$. $Q$ will have a dimension of $d_k = 64$.
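As a minimal sketch of one attention head, the code below projects a placeholder input with randomly initialized matrices and applies scaled dot-product attention. The random weights stand in for trained $Q_w$, $K_w$, and $V_w$, so the output values are meaningless; only the shapes and the flow of the computation are the point.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

seq_len, d_model, d_k = 3, 512, 64
rng = np.random.default_rng(42)
x = rng.normal(size=(seq_len, d_model))  # placeholder for embeddings + positional encoding

# Randomly initialized stand-ins for the trained weight matrices.
Q_w = rng.normal(size=(d_model, d_k)) * 0.05
K_w = rng.normal(size=(d_model, d_k)) * 0.05
V_w = rng.normal(size=(d_model, d_k)) * 0.05

Q, K, V = x @ Q_w, x @ K_w, x @ V_w
z, weights = scaled_dot_product_attention(Q, K, V)
print(z.shape)                # one 64-dimensional output per word
print(weights.sum(axis=-1))   # each row of attention weights sums to 1
```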

Hugging Face and Google Brain Trax, among others, provide ready-to-use
frameworks, libraries, and modules. However, let's open the hood of the Transformer model and get our hands dirty in Python to illustrate the architecture we just explored, visualizing the model in code with intermediate images.

We will use basic Python code with only numpy and a softmax function, in 10 steps, to run the key aspects of the attention mechanism and explore the inner workings of an attention head at a low level.


Please refer to this notebook for the [implementation of multi-head attention](https://github.com/rahiakela/transformers-for-natural-language-processing/blob/main/1-model-architecture-of-the-transformer/2_architecture_of_multi_head_attention.ipynb).







## Post-layer normalization

Each attention sub-layer and each feedforward sub-layer of the Transformer is
followed by post-layer normalization (Post-LN):

<img src='https://github.com/rahiakela/img-repo/blob/master/transformers-for-natural-language-processing/post-layer-normalization.png?raw=1' width='800'/>

**The Post-LN contains an add function and a layer normalization process. The add function processes the residual connection that comes from the input of the sub-layer. The goal of the residual connections is to make sure critical information is not lost.**

$$ LayerNorm(x+Sublayer(x)) $$

$Sublayer(x)$ is the sub-layer itself. $x$ is the information available at the input step of $Sublayer(x)$.

The input of $LayerNorm$ is a vector $v$ resulting from $x + Sublayer(x)$. $d_{model} = 512$ for every input and output of the Transformer, which standardizes all the processes.

Many layer normalization methods exist, and variations exist from one model to
another. The basic concept for $v= x + Sublayer(x)$ can be defined by $LayerNorm(v)$:

$$ LayerNorm(v)=\gamma \frac{v - \mu}{\sigma} + \beta $$

The variables are: 

- $\mu$ is the mean of $v$ of dimension $d$. As such:

$$ \mu = \frac{1}{d}\sum_{k=1}^{d}v_k $$

- $\sigma$ is the standard deviation of $v$ of dimension $d$. As such:

$$ \sigma^2 = \frac{1}{d}\sum_{k=1}^{d} (v_k - \mu)^2 $$

- $\gamma$ is a scaling parameter.

- $\beta$ is a bias vector.

This version of $LayerNorm(v)$ shows the general idea of the many possible Post-LN methods.
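A minimal numpy sketch of this Post-LN step is shown below. The small `eps` term is an assumption added to avoid division by zero and is not part of the formula above; the `post_ln` helper and the stand-in sub-layer are hypothetical names used only for illustration.

```python
import numpy as np

def layer_norm(v, gamma, beta, eps=1e-6):
    """Normalize the last axis to zero mean and unit variance, then scale and shift."""
    mu = v.mean(axis=-1, keepdims=True)
    var = v.var(axis=-1, keepdims=True)  # population variance: 1/d * sum((v_k - mu)^2)
    return gamma * (v - mu) / np.sqrt(var + eps) + beta

def post_ln(x, sublayer, gamma, beta):
    """Add the residual connection, then apply layer normalization."""
    return layer_norm(x + sublayer(x), gamma, beta)

d_model = 512
rng = np.random.default_rng(0)
x = rng.normal(size=(4, d_model))
gamma, beta = np.ones(d_model), np.zeros(d_model)

out = post_ln(x, lambda t: 0.5 * t, gamma, beta)  # hypothetical stand-in sub-layer
print(out.shape)
print(out.mean(axis=-1))  # close to 0 for every position
print(out.std(axis=-1))   # close to 1 for every position
```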

The next sub-layer can now process the output of the Post-LN or $LayerNorm(v)$. In this case, the sub-layer is a feedforward network.

## Sub-layer 2: Feedforward network

The input of the FFN is the $d_{model} = 512$ output of the Post-LN of the previous sublayer:

<img src='https://github.com/rahiakela/img-repo/blob/master/transformers-for-natural-language-processing/feedforward-sub-layer.png?raw=1' width='800'/>

The FFN sub-layer can be described as follows:

- The FFNs in the encoder and decoder are fully connected.
- The FFN is a position-wise network. Each position is processed separately
and in an identical way.
- The FFN contains two layers and applies a ReLU activation function.
- The input and output of the FFN layers are $d_{model} = 512$, but the inner layer is larger with $d_{ff} = 2048$.
- The FFN can be viewed as performing two kernel size 1 convolutions.

Taking this description into account, we can describe the optimized and
standardized FFN as follows:

$$ FFN(x) = max(0, xW_1 + b_1)W_2 + b_2 $$
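The FFN can be sketched directly in numpy: a linear layer expanding to $d_{ff}$, a ReLU, and a linear layer projecting back to $d_{model}$. The weights below are random stand-ins for trained parameters, so only the shapes and the position-wise behavior are meaningful.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feedforward network: ReLU(x W1 + b1) W2 + b2."""
    hidden = np.maximum(0.0, x @ W1 + b1)  # (seq_len, d_ff)
    return hidden @ W2 + b2                # (seq_len, d_model)

seq_len, d_model, d_ff = 5, 512, 2048
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))

# Randomly initialized stand-ins for trained parameters.
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

y = ffn(x, W1, b1, W2, b2)
print(y.shape)

# Position-wise: one position processed alone matches its row in the full result.
print(np.allclose(ffn(x[2:3], W1, b1, W2, b2), y[2:3]))
```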

**The output of the FFN goes to the Post-LN, as described in the previous section. Then the output is sent to the next layer of the encoder stack and the multi-head attention layer of the decoder stack.**



## The decoder stack

The decoder of the Transformer model is a stack of layers, like the encoder. Each layer of the decoder stack has the following structure:

<img src='https://github.com/rahiakela/img-repo/blob/master/transformers-for-natural-language-processing/decoder-stack.png?raw=1' width='800'/>

As in the encoder, the structure of the decoder layer remains the same for all of the N=6 layers of the Transformer model. Each layer contains three sub-layers: 

- a masked multi-headed attention mechanism 
- a multi-headed attention mechanism
- a fully connected position-wise feedforward network

**Compared with the encoder, the decoder has a third main sub-layer, the masked multi-head attention mechanism. In this sub-layer, at a given position, the following words are masked, so the Transformer must base its predictions on what it has inferred so far, without seeing the future parts of the sequence.**

As in the encoder stack, a residual connection surrounds each of the three main sub-layers, $Sublayer(x)$, in the Transformer model:

$$ LayerNormalization(x + Sublayer(x)) $$

The embedding sub-layer is only present at the bottom level of the stack, as in the encoder stack. The output of every sub-layer of the decoder stack has a constant dimension, $d_{model}$, as in the encoder stack, including the embedding layer and the output of the residual connections.

**The structure of each sub-layer and function of the decoder is similar to the encoder. So, we can refer to the encoder for the same functionality when we need to. We will only focus on the differences between the decoder and the encoder.**

## Output embedding and position encoding

The structure of the sub-layers of the decoder is mostly the same as the sub-layers of the encoder. The output embedding layer and position encoding function are the same as in the encoder stack.

In the use case we are exploring, the output is a translation the model needs to learn. I chose to use a French translation:

```
Output=Le chat noir était assis sur le canapé et le chien marron
dormait sur le tapis
```

This output is the French translation of the English input sentence:

```
Input=The black cat sat on the couch and the brown dog slept on the
rug.
```

The output words go through the word embedding layer, and then the positional
encoding function, like in the first layer of the encoder stack.

## The attention layers

**The Transformer is an auto-regressive model. It uses the previous output sequences as an additional input. The multi-head attention layers of the decoder use the same process as the encoder.**

However, the masked multi-head attention sub-layer 1 only lets attention apply to the positions up to and including the current position. The future words are hidden from the Transformer, and this forces it to learn how to predict.
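One common way to hide future positions, sketched below, is to set their attention scores to negative infinity before the softmax so that their weights become zero. The random `scores` matrix is a placeholder for $QK^T / \sqrt{d_k}$, and `masked_attention_weights` is a hypothetical helper name.

```python
import numpy as np

def masked_attention_weights(scores):
    """Softmax over scores after hiding future positions (column j > row i)."""
    seq_len = scores.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    masked = masked - masked.max(axis=-1, keepdims=True)  # the diagonal keeps each row finite
    exp = np.exp(masked)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))  # placeholder for Q K^T / sqrt(d_k)
weights = masked_attention_weights(scores)
print(np.round(weights, 3))  # upper triangle is zero: no attention to future words
```

The first row attends only to the first position, the second row to the first two positions, and so on.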

A post-layer normalization process follows the masked multi-head attention sub-layer 1, as in the encoder.

The multi-head attention sub-layer 2 takes its queries from the masked sub-layer 1, so each position only carries information from the positions up to the current position the Transformer is predicting and does not see the sequence it must predict.

The multi-head attention sub-layer 2 draws information from the encoder by taking $encoder(K, V)$ into account during the dot-product attention operations. This sub-layer also draws information from the masked multi-head attention sub-layer 1 by taking $\text{sub-layer 1}(Q)$ into account during the dot-product attention operations. The decoder thus uses the trained information of the encoder.

We can define the input of the multi-head attention sub-layer 2 of a decoder as:

```
Input_Attention=(Output_decoder_sub_layer-1(Q), Output_encoder_layer(K,V))
```
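The pairing of decoder queries with encoder keys and values can be sketched for a single head as follows. All three arrays are random placeholders for already-projected states, and `encoder_decoder_attention` is a hypothetical name; the point is that the source and target lengths may differ.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def encoder_decoder_attention(Q_dec, K_enc, V_enc):
    """One head: queries from decoder sub-layer 1, keys and values from the encoder output."""
    d_k = Q_dec.shape[-1]
    weights = softmax(Q_dec @ K_enc.T / np.sqrt(d_k))  # (target_len, source_len)
    return weights @ V_enc, weights

source_len, target_len, d_k = 7, 4, 64
rng = np.random.default_rng(1)
Q_dec = rng.normal(size=(target_len, d_k))  # placeholder for projected decoder states
K_enc = rng.normal(size=(source_len, d_k))  # placeholder for projected encoder output
V_enc = rng.normal(size=(source_len, d_k))

out, weights = encoder_decoder_attention(Q_dec, K_enc, V_enc)
print(out.shape, weights.shape)
```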

A post-layer normalization process also follows the multi-head attention sub-layer 2.

The Transformer then goes to the FFN sub-layer, followed by a Post-LN and the
linear layer.


## The FFN sub-layer, the Post-LN, and the linear layer

The FFN sub-layer has the same structure as the FFN of the encoder stack. The Post-LN of the FFN works as the layer normalization of the encoder stack.

The Transformer produces an output sequence of only one element at a time:

$$ OutputSequence = (y_1, y_2, \ldots, y_n) $$

The linear layer produces an output sequence with a linear function that varies per model but relies on the standard method:

$$ y = w * x + b $$

$w$ and $b$ are learned parameters, and $x$ is the input of the linear layer.

**The linear layer thus produces a score for each candidate next element of the sequence, and a softmax function converts these scores into probabilities from which the next element is chosen.**
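A minimal sketch of this last step is shown below. The five-word vocabulary and the random weights are hypothetical stand-ins, and picking the highest probability (greedy decoding) is only one of several possible selection strategies.

```python
import numpy as np

d_model = 512
vocab = ["le", "chat", "noir", "chien", "tapis"]  # hypothetical toy target vocabulary

rng = np.random.default_rng(0)
w = rng.normal(size=(d_model, len(vocab))) * 0.02  # stand-in for trained weights
b = np.zeros(len(vocab))                           # stand-in for trained bias

x = rng.normal(size=d_model)   # placeholder for the decoder output at the last position
logits = x @ w + b             # linear layer: one score per vocabulary entry
probs = np.exp(logits - logits.max())
probs /= probs.sum()           # softmax turns the scores into probabilities

next_token = vocab[int(np.argmax(probs))]  # greedy choice of the next element
print(probs.round(3), next_token)
```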

The decoder, like the encoder, goes from layer $l$ to layer $l+1$, up to the top layer of the N=6-layer Transformer stack.

## Transformer in Action

The original Transformer was trained on a 4.5-million-sentence-pair English-German dataset and a 36-million-sentence English-French dataset.

The original Transformer base models took 12 hours to train for 100,000 steps on a machine with 8 NVIDIA P100 GPUs. The big models took 3.5 days for 300,000 steps.

The original Transformer outperformed all the previous machine translation models with a BLEU score of 41.8. The result was obtained on the WMT English-to-French dataset.

With Hugging Face, you can implement machine translation in three lines of code!

We use the Hugging Face pipeline, which wraps several ready-to-use transformer
tasks. In our case, to illustrate the Transformer model, we load the translation pipeline and enter a
sentence to translate from English to French:

In [3]:
translator = pipeline("translation_en_to_fr")

# One line of code!
print(translator("It is easy to translate languages with transformers", max_length=40))

[{'translation_text': "Il est facile de traduire des langues à l'aide de transformateurs"}]


And voilà! The translation is displayed.

Hugging Face shows how transformer architectures can be used in ready-to-use
models.