# An introduction to seq2seq models, Attention and Transformers

This presentation is heavily inspired by the blogs of [Jay Alammar](http://jalammar.github.io/) and [Christopher Olah](http://colah.github.io/).

## Introduction

* Sequence-to-sequence (seq2seq) models are `Deep Learning` models used in many tasks:
    * Machine Translation
    * Text Summarization
    * Text Generation
* They take in a sequence of items and output another sequence of items
    * Here we focus on words as input and output

**Here is how a trained seq2seq model works for the task of machine translation**

<video controls src="https://jalammar.github.io/images/seq2seq_2.mp4" alt="Seq2seq machine translation" width="80%"/>

## Digging into the black box

The model is composed of an **encoder** and a **decoder**.

#### Encoder
* Takes each input item (word) one by one
* Processes them and captures their information
* Outputs a *Context* vector as its result of processing the entire input

#### Decoder
* Takes the entire *Context* vector as its input
* Processes it and decodes the information into the desired output (for machine translation, a sentence in another language)
* Outputs items (words) one by one

**Machine translation task, step by step**
<video controls src="https://jalammar.github.io/images/seq2seq_4.mp4" alt="Seq2seq machine translation step by step" width="80%"/>

* Context is a vector of numbers, representing the information captured by the encoder from the input
    * It's a matter of choice what size it has
* Both encoder and decoder are Recurrent Neural Networks under the hood
    * We introduced RNNs, and LSTMs specifically, in the previous series

**This is what the context vector looks like**

<img src="https://jalammar.github.io/images/context.png" alt="Context Vector" width="80%"/>

### Word Embedding

We discussed the word embedding methods `Word2Vec` and `GloVe` in the previous series of tutorials. To summarize, word embedding converts words and sentences into numbers so that we can feed them to neural networks.

Seq2seq models, and specifically their encoders, are no exception: we must embed the words of a document before feeding them to the network.

**This is what the embedded vectors for that sentence look like**


<img src="https://jalammar.github.io/images/embedding.png" alt="Embedded Vector" width="80%"/>
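
The embedding step can be sketched as a simple table lookup. This is a minimal illustration, not the method used in the figure: the vocabulary, the example sentence, the embedding size of 4, and the random values (standing in for trained `Word2Vec` or `GloVe` vectors) are all assumptions made for the example.

```python
import numpy as np

# Hypothetical toy vocabulary; a real model would have thousands of words.
vocab = {"je": 0, "suis": 1, "étudiant": 2}
embedding_dim = 4  # arbitrary size chosen for this illustration

# Random values stand in for trained Word2Vec/GloVe vectors.
rng = np.random.default_rng(seed=0)
embedding_matrix = rng.normal(size=(len(vocab), embedding_dim))

def embed(sentence):
    """Map a whitespace-separated sentence to a (num_words, embedding_dim) array."""
    indices = [vocab[word] for word in sentence.split()]
    return embedding_matrix[indices]

embedded = embed("je suis étudiant")
print(embedded.shape)  # one row of embedding_dim numbers per word
```

Each row of `embedded` is the vector that would be fed to the encoder at one time step.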

### Recap of RNN

<video controls src="https://jalammar.github.io/images/RNN_1.mp4" alt="RNNs step by step" width="80%"/>

1. Hidden state 0 and input vector 1 (current word) are fed to the RNN
2. The result of that would be hidden state 1 and output vector 1

The unrolled version of RNNs may help to understand their operation better

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png" alt="Unrolled RNN" width="80%"/>


3. Similarly, the hidden state 1 and the input vector 2 (next word) are fed to the RNN
4. Hidden state 2 and output vector 2 are the outputs
5. This process continues until no further input is left

The math behind the scenes is a series of matrix products followed by nonlinearities such as tanh and softmax:

<img src="https://datascience-enthusiast.com/figures/rnn_step_forward.png" alt="Behind the scenes RNN" width="80%"/>

* The W matrices are the weights of the RNN to be trained and optimized
* The x vector is the embedded word vector (input feature vector)
* The a vector is the hidden state
* The y vector is the output
* t and t-1 denote the current and the previous time step, respectively
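
The recurrence described above can be sketched in NumPy as follows. This is a rough sketch under assumptions: it uses tanh for the hidden state and softmax for the output, the names `W_ax`, `W_aa`, `W_ya`, `b_a`, `b_y` and all dimensions are chosen for the example, and the weights are random rather than trained.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_step(x_t, a_prev, W_ax, W_aa, W_ya, b_a, b_y):
    """One time step: combine the current input with the previous hidden state."""
    a_t = np.tanh(W_ax @ x_t + W_aa @ a_prev + b_a)  # new hidden state
    y_t = softmax(W_ya @ a_t + b_y)                  # output vector
    return a_t, y_t

# Assumed sizes for the illustration.
input_dim, hidden_dim, output_dim = 4, 3, 5
rng = np.random.default_rng(0)
W_ax = rng.normal(size=(hidden_dim, input_dim))
W_aa = rng.normal(size=(hidden_dim, hidden_dim))
W_ya = rng.normal(size=(output_dim, hidden_dim))
b_a = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

inputs = rng.normal(size=(3, input_dim))  # three embedded words
a = np.zeros(hidden_dim)                  # hidden state 0
hidden_states, outputs = [], []
for x_t in inputs:                        # the unrolled loop over time steps
    a, y = rnn_step(x_t, a, W_ax, W_aa, W_ya, b_a, b_y)
    hidden_states.append(a)
    outputs.append(y)

print(len(hidden_states), outputs[-1].shape)
```

The same weights are reused at every time step; only the hidden state `a` changes as it carries information forward.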

**Note that there is also a backpropagation process for training the network and adjusting the weights, but we don't discuss it here**

### Back to encoder-decoder architecture

Now that we know how RNNs work, we can continue with the encoder-decoder network.

<video controls src="https://jalammar.github.io/images/seq2seq_5.mp4" alt="En-De step by step" width="80%"/>

At each pulse, the RNN in the encoder or decoder processes its input and generates the output and hidden state for that time step.

The hidden states in the encoder RNN keep propagating from one time step to the next until they reach the last step of the encoder. The final hidden state vector is the `Context Vector` that is passed to the decoder as its input.

Now let's unroll the process even more.


<video controls src="https://jalammar.github.io/images/seq2seq_6.mp4" alt="En-De step by step unrolled" width="80%"/>

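The decoder works the same way as the encoder, since its architecture is very similar. However, it does not take the source words as input: everything it knows about the input sentence comes from the context vector. (In practice, the decoder is usually also fed its own previously generated word at each step.)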
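
The whole encoder-decoder flow can be sketched with two small RNNs. This is a toy sketch with untrained random weights, so the generated token ids are meaningless. All names (`make_rnn`, `encode`, `decode`, `W_out`, `target_embeddings`) and sizes are hypothetical, and the decoder is assumed to receive the embedding of its previously generated word at each step, which is a common choice rather than something shown in the animation.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim, vocab_size = 4, 6, 5  # assumed toy sizes

def make_rnn(in_dim):
    """Random (untrained) weights for a simple tanh RNN."""
    return {"W_x": rng.normal(scale=0.5, size=(hidden_dim, in_dim)),
            "W_h": rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))}

def step(params, x, h):
    """One RNN time step: returns the new hidden state."""
    return np.tanh(params["W_x"] @ x + params["W_h"] @ h)

encoder, decoder = make_rnn(embed_dim), make_rnn(embed_dim)
W_out = rng.normal(size=(vocab_size, hidden_dim))            # hidden state -> word scores
target_embeddings = rng.normal(size=(vocab_size, embed_dim)) # target-language embeddings

def encode(source_vectors):
    """Run the encoder over all source words; the last hidden state is the context."""
    h = np.zeros(hidden_dim)
    for x in source_vectors:
        h = step(encoder, x, h)
    return h

def decode(context, max_len=4, start_id=0):
    """Start from the context vector and emit one token id per time step."""
    h, token, result = context, start_id, []
    for _ in range(max_len):
        h = step(decoder, target_embeddings[token], h)
        token = int(np.argmax(W_out @ h))  # pick the highest-scoring word
        result.append(token)
    return result

source_vectors = rng.normal(size=(3, embed_dim))  # three embedded source words
context = encode(source_vectors)
translation = decode(context)
print(context.shape, translation)
```

Note that the only link between the two networks is `context`, whose size equals the number of hidden units chosen for the encoder.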

### Encoder-decoder weakness and the concept of Attention

The `Context vector` tends to be the bottleneck of this model. With long sentences, by the time the encoder reaches the later words, the hidden state has already forgotten much about the earlier words, because it is overwritten again and again as it propagates through the RNN cells.


#### Attention

Attention helps with the `context vector` bottleneck problem by providing context for **each word** rather than the whole sentence. This helps the decoder to focus on relevant and important parts of the encoded input data at each step of decoding.

So the **encoder** with attention sends more information to the decoder by providing **all** of the hidden states, not just the last one.

The **decoder** with attention takes all of the hidden states and does the following:
1. Processes the hidden state for each word and gives it a score
2. Amplifies the important hidden states for each time step and drowns out the less informative and less important ones
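
The scoring and amplifying steps can be illustrated with a few hand-picked numbers. This is only a sketch: the text above does not specify the scoring function, so a dot product between each encoder hidden state and the current decoder hidden state is assumed here, followed by a softmax. The values in `encoder_states` and `decoder_state` are made up.

```python
import numpy as np

# Made-up encoder hidden states: one row per source word.
encoder_states = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [1.0, 1.0]])
# Made-up decoder hidden state at the current time step.
decoder_state = np.array([2.0, 0.0])

# 1. Score each encoder hidden state (assumed: dot product).
scores = encoder_states @ decoder_state        # [2., 0., 2.]

# 2. Softmax turns the scores into weights that sum to 1.
weights = np.exp(scores - scores.max())
weights = weights / weights.sum()              # approx. [0.468, 0.063, 0.468]

# Amplify or drown each hidden state by multiplying it with its weight.
weighted_states = weights[:, None] * encoder_states

# Summing the weighted states gives the context vector for this time step.
context = weighted_states.sum(axis=0)
print(weights.round(3), context.round(3))
```

The second hidden state gets a low score, so it contributes little to the context vector, while the first and third dominate it.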

**Here is how encoder-decoder with attention works for the task of machine translation**

<video controls src="https://jalammar.github.io/images/seq2seq_7.mp4" alt="En-De with attention step by step" width="80%"/>


**Now let's see how the hidden states pass along decoder cells and how they are scored**

<video controls src="https://jalammar.github.io/images/attention_process.mp4" alt="Decoder with attention step by step" width="80%"/>


To summarize what happens in the decoder:
1. At each time step, the previous decoder hidden state is fed to the decoder RNN cell (the decoder has no source words to read; in the animation above its input is the `<END>` token)
2. The output of the RNN is calculated as the new hidden state
3. The encoder hidden states are scored against this new decoder hidden state and amplified or drowned out according to their scores, giving a weighted context vector for that time step
4. The results of steps 2 and 3 are concatenated to form the final decoder cell hidden state at that time step
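
The four steps above can be put together into a single decoder time step. Again, this is a sketch under assumptions: the function name `attention_decoder_step`, the dot-product scoring, the dimensions, and the random weights and encoder states are all chosen for illustration only.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def attention_decoder_step(x_t, h_prev, encoder_states, W_x, W_h):
    # Steps 1-2: feed the input and previous hidden state to the RNN cell.
    h_t = np.tanh(W_x @ x_t + W_h @ h_prev)
    # Step 3: score every encoder hidden state against h_t (assumed: dot product),
    # normalize the scores, and build the weighted context vector.
    scores = encoder_states @ h_t
    weights = softmax(scores)
    context = weights @ encoder_states
    # Step 4: concatenate the context vector with the new hidden state.
    combined = np.concatenate([context, h_t])
    return h_t, combined, weights

# Assumed toy sizes and random (untrained) values.
rng = np.random.default_rng(0)
embed_dim, hidden_dim, num_source_words = 4, 3, 5
W_x = rng.normal(size=(hidden_dim, embed_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
encoder_states = rng.normal(size=(num_source_words, hidden_dim))
end_embedding = rng.normal(size=embed_dim)  # stands in for the <END> token embedding
h_prev = encoder_states[-1]                 # decoder starts from the last encoder state

h_t, combined, weights = attention_decoder_step(
    end_embedding, h_prev, encoder_states, W_x, W_h)
print(combined.shape, weights.round(2))
```

In a full model, `combined` would then pass through a further layer that predicts the output word, and `h_t` would be carried to the next decoder step.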

**To visualize how the encoder hidden states are scored, let's look at this example**
<video controls src="https://jalammar.github.io/images/seq2seq_9.mp4" alt="Translation encoder hidden states scored" width="80%"/>

*Note that hidden states are not weighted by their position but by their importance, which does not necessarily follow the word order*

<img src="https://jalammar.github.io/images/attention_sentence.png" alt="Encoder hidden state amplification" width="80%"/>


#### Long Short-Term Memory Networks (LSTM)

LSTMs are a variation of RNNs that improves performance. Specifically, they are better at preserving the context of previously seen words across later time steps. We introduced them in more detail in the previous series of tutorials.
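
As a reminder of how an LSTM keeps context around, here is a minimal sketch of one LSTM time step in NumPy. It follows the standard gate formulation (forget, input, and output gates plus a cell state); the function name `lstm_step`, the stacked weight layout `W`, and the sizes are assumptions for this example and are not the Keras implementation used in the demo.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4*hidden, input+hidden); b has shape (4*hidden,)."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    f = sigmoid(z[:hidden])                # forget gate: what to keep from c_prev
    i = sigmoid(z[hidden:2 * hidden])      # input gate: how much new info to write
    o = sigmoid(z[2 * hidden:3 * hidden])  # output gate: what to expose as h_t
    g = np.tanh(z[3 * hidden:])            # candidate values for the cell state
    c_t = f * c_prev + i * g               # cell state carries long-term context
    h_t = o * np.tanh(c_t)                 # hidden state passed to the next step
    return h_t, c_t

# Assumed toy sizes and random (untrained) weights.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
W = rng.normal(size=(4 * hidden_dim, input_dim + hidden_dim))
b = np.zeros(4 * hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):  # five embedded words
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)
```

The separate cell state `c` is updated additively, which is what lets information from early words survive more easily than in a plain RNN.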

We use LSTM here to implement a demo. We won't implement the attention mechanism for the sake of time.


The example here is heavily inspired by content from the [Keras blog](https://blog.keras.io/).