# What is Seq2Seq?

***Seq2Seq (Sequence to Sequence)*** is a method that generates an output sequence from a given input sequence. It was first proposed in 2014 in two papers that lay out its main idea: "Sequence to Sequence Learning with Neural Networks" by the Google Brain team and "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation" by Yoshua Bengio's team. The two papers independently arrived at similar solutions, and Seq2Seq emerged from them.

As a simple example, in machine translation: input (Hello) ---> output (hello). As another example, in human-machine dialogue we ask the machine "Who are you?", and the machine returns the answer "I am XX".

<img src="photos/fig1.png">

<center>This figure shows a simple email conversation</center>

The Encoder and Decoder in the figure each show only one layer of ordinary LSTM cells (to avoid any misunderstanding: the cell used by Seq2Seq is not limited to LSTM). From this structure we can see that the overall model is quite simple. The state of the EncoderCell at the last time step is the intermediate semantic vector C, which becomes the initial state of the DecoderCell. In the DecoderCell, the output at each time step is used as the input at the next time step, and so on until the DecoderCell predicts the special end symbol < END >.

Here we call the Encoder stage the encoding stage, and the corresponding Decoder stage the decoding stage. The intermediate semantic vector C can be regarded as a summary of the whole input: all of the input content is encoded into C. The details will be explained later. Let's first look at the specific application scenarios of the Seq2Seq technology we will learn.

## Application scenarios of Seq2Seq

Driven by advances in computing, artificial intelligence and algorithm research, as well as by practical demand, Seq2Seq has found applications in many fields.

- 1. **Machine translation** (the well-known *Google Translate* has been built on Seq2Seq + Attention Mechanism)
- 2. **Chatbots** (**Microsoft Xiaobing** also uses Seq2Seq technology)
- 3. **Automatic text summarization** (this technology is used by **Toutiao**, "Today's Headlines")
- 4. **Automatic image description generation**
- 5. **Machine-written poetry, code completion, commit message generation, story style rewriting, etc.**


## Seq2Seq principle analysis

First, we need to be clear that the main idea of seq2seq is to map an input sequence to an output sequence through a deep neural network model (commonly an LSTM). The process consists of two stages: encoding the input and decoding the output.

Here we must emphasize that, once a Seq2Seq implementation has been designed, ***the lengths of its input sequence and output sequence*** are ***immutable***.
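
One common way to work with such a fixed-length design is to pad short sequences and truncate long ones. The text does not say how the lengths are fixed, so this is only a minimal sketch under that assumption; `PAD_ID` and `pad_or_truncate` are hypothetical names.

```python
PAD_ID = 0  # hypothetical token id reserved for padding


def pad_or_truncate(token_ids, length, pad_id=PAD_ID):
    """Return exactly `length` ids: cut long inputs, pad short ones."""
    clipped = list(token_ids[:length])
    return clipped + [pad_id] * (length - len(clipped))


# Two toy sequences of token ids with different lengths.
batch = [[5, 8, 2], [7, 3, 9, 4, 6, 1]]
fixed = [pad_or_truncate(seq, 5) for seq in batch]
print(fixed)
```

After this step every sequence in the batch has the same length, which is what a fixed-length implementation expects.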

## Basic seq2seq model 

__The Notation of sequence__

The seq2seq model converts an input sequence into an output sequence. Let the input sequence and the output sequence be **X** and **Y** respectively. The *i-th* element of the input sequence is represented as $x_i$, and the *j-th* element of the output sequence is represented as $y_j$. Generally, each $x_i$ and $y_j$ is the one-hot vector of a symbol. For example, in Natural Language Processing (NLP), the one-hot vector represents a word and its size is the vocabulary size.

Let's think of the seq2seq model in terms of NLP. Let the vocabularies of the inputs and outputs be $\nu^{(s)}$ and $\nu^{(t)}$; all the elements $x_i$ and $y_j$ satisfy <img src="photos/fig3.png"> . The input sequence **X** and the output sequence **Y** are represented by the following equations: <img src="photos/fig4.png"> 

***I*** and ***J*** are the lengths of the input sequence and the output sequence. Using the NLP notation, $y_0$ is the one-hot vector of BOS (beginning of sentence), the virtual word representing the beginning of the sentence, and $y_{J+1}$ is the one-hot vector of EOS (end of sentence), the virtual word representing the end of the sentence.
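
The notation above can be made concrete with a toy example. This is a minimal sketch: the vocabulary, its ordering, and the `<BOS>`/`<EOS>` spellings are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical toy target vocabulary; the indices are arbitrary.
vocab = ["<BOS>", "<EOS>", "i", "am", "xx"]
word_to_id = {word: index for index, word in enumerate(vocab)}


def one_hot(word):
    """Return a vector of vocabulary size with a single 1 at the word's index."""
    vector = np.zeros(len(vocab))
    vector[word_to_id[word]] = 1.0
    return vector


sentence = ["i", "am", "xx"]  # y_1 .. y_J, so J = 3

# Rows are y_0 (BOS), y_1 .. y_J, and the final EOS vector.
Y = np.stack([one_hot(w) for w in ["<BOS>"] + sentence + ["<EOS>"]])
print(Y.shape)  # (J + 2, vocabulary size)
```

Each row contains exactly one 1, and the row length equals the vocabulary size.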

### The Notation of the Conditional Probability ***P*** **(Y|X)**

Now, let's think about the conditional probability ***P*** **(Y|X)** of generating the output sequence **Y** when the input sequence **X** is given. The purpose of the seq2seq model is to model the probability ***P*** **(Y|X)**. However, the seq2seq model does not model ***P*** **(Y|X)** directly. Instead, it models ***P*** **$(y_j | Y_{<j}, X)$**, the probability of generating the j-th element of the output sequence **$y_j$** given **$Y_{<j}$** and **X**. **$Y_{<j}$** means the output sequence from 1 to ***j - 1***. We can write the model ***$P_\theta$ (Y|X)*** as the product of *$P_\theta$* **$(y_j | Y_{<j}, X)$**: <img src="photos/fig5.png">
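
As a small numeric illustration of this factorization, the sequence probability is the product of the per-step conditional probabilities. The three values below are made up for the example; they are not produced by any model.

```python
import math

# Hypothetical per-step probabilities P(y_j | Y_<j, X), one per output step.
step_probs = [0.5, 0.8, 0.25]

# P(Y | X) is the product of the step probabilities.
p_sequence = math.prod(step_probs)

# In practice the product is usually accumulated as a sum of logs,
# which avoids numerical underflow for long sequences.
log_p_sequence = sum(math.log(p) for p in step_probs)

print(p_sequence, math.exp(log_p_sequence))
```

Both routes give the same probability; the log form is the one typically used as a training objective.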

## Processing steps in seq2seq model 

Now, let's think about the processing steps in the seq2seq model. The characteristic feature of the seq2seq model is that it consists of two processes:

1. The process that generates the fixed-size vector **z** from the input sequence **X**.
2. The process that generates the output sequence **Y** from **z**.

In other words, the information of ***X*** is conveyed by **z**, and *$P_\theta$* **$(y_j | Y_{<j}, X)$** is actually calculated as *$P_\theta$* **$(y_j | Y_{<j}, z)$**.

First, we represent the process that generates **z** from **X** by the function $\Lambda$:

**$z = \Lambda(X)$**

The function $\Lambda$ may be a recurrent neural network such as an LSTM.

Second, we represent the process that generates **Y** from **z** by the following formula:

<img src="photos/fig6.png">

$\Psi$ is the function that generates the hidden vectors **$h^{(t)}_{j}$**, and **$\Upsilon$** is the function that calculates the generative probability of the one-hot vector **$y_j$**. When j = 1, **$h^{(t)}_{j-1}$**, that is **$h^{(t)}_{0}$**, is **z** generated by **$\Lambda(X)$**, and **$y_{j-1}$**, that is **$y_{0}$**, is the one-hot vector of BOS.
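
The two processes can be sketched as a greedy decoding loop. This is only an illustration under several assumptions: the weights are random and untrained, the recurrences use a plain tanh RNN, both vocabularies have the same toy size, and `word_distribution` returns the probabilities of all words at once rather than of a single $y_j$. The names `encode`, `next_hidden`, `word_distribution` and `greedy_decode` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 6, 4          # toy vocabulary size and hidden size (assumptions)
BOS, EOS = 0, 1      # hypothetical ids of the virtual words BOS and EOS

# Untrained random parameters, for illustration only.
E_s = rng.normal(size=(D, V))         # encoder embedding matrix
E_t = rng.normal(size=(D, V))         # decoder embedding matrix
W_enc = rng.normal(size=(D, 2 * D))   # encoder recurrence weights
W_dec = rng.normal(size=(D, 2 * D))   # decoder recurrence weights
W_out = rng.normal(size=(V, D))       # output projection


def one_hot(index):
    vector = np.zeros(V)
    vector[index] = 1.0
    return vector


def encode(X):
    """Lambda: fold the whole input sequence into one fixed-size vector z."""
    h = np.zeros(D)
    for x in X:
        h = np.tanh(W_enc @ np.concatenate([h, E_s @ x]))
    return h


def next_hidden(h_prev, y_prev):
    """Psi: hidden vector h_j from h_{j-1} and the previous word y_{j-1}."""
    return np.tanh(W_dec @ np.concatenate([h_prev, E_t @ y_prev]))


def word_distribution(h):
    """Upsilon, evaluated for every word at once via a softmax."""
    scores = W_out @ h
    e = np.exp(scores - scores.max())
    return e / e.sum()


def greedy_decode(X, max_len=10):
    h, y = encode(X), one_hot(BOS)    # h_0 = z and y_0 = BOS
    words = []
    for _ in range(max_len):          # cap the length in case EOS is never chosen
        h = next_hidden(h, y)
        word = int(np.argmax(word_distribution(h)))
        if word == EOS:
            break
        words.append(word)
        y = one_hot(word)             # feed the output back as the next input
    return words


X = [one_hot(i) for i in (2, 3, 4)]
decoded = greedy_decode(X)
print(decoded)
```

Because the parameters are random, the decoded ids carry no meaning; the point is the data flow: **X** is compressed into **z**, and the decoder then consumes only **z** and its own previous outputs.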

### Model architecture of seq2seq Model

Now, we will discuss the architecture of the seq2seq model. For ease of explanation, we use the most basic architecture. The architecture of the seq2seq model can be separated into five major roles.

1. Encoder embedding layer
2. Encoder Recurrent layer
3. Decoder embedding layer
4. Decoder Recurrent layer
5. Decoder Output layer

<img src="photos/fig7.png">

The encoder consists of two layers: the embedding layer and the recurrent layer. The decoder consists of three layers: the embedding layer, the recurrent layer and the output layer.

In the explanation, we use the following symbols:

<img src="photos/fig8.png">

1. Encoder embedding layer

The first layer, the encoder embedding layer, converts each word in the input sentence to an embedding vector. When processing the i-th word in the input sentence, the input and output of the layer are the following:

   - The input is **$x_i$**: the one-hot vector which represents the i-th word.
   - The output is **$ \bar{x}_{i}$**: the embedding vector which represents the i-th word.
   
Each embedding vector is calculated by this equation: **$ \bar{x}_i = E^{(s)}x_i$**

**$E^{(s)} \in \mathbb{R}^{D \times |\nu^{(s)}|}$** is the embedding matrix of the encoder.
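
Because $x_i$ is one-hot, the product $E^{(s)}x_i$ simply selects one column of the embedding matrix. A minimal sketch with toy sizes and random values (all numbers here are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
D, V = 3, 5                       # embedding size and source vocabulary size (toy values)
E_s = rng.normal(size=(D, V))     # encoder embedding matrix E^(s), shape D x |V|

x_i = np.zeros(V)
x_i[2] = 1.0                      # one-hot vector of the word with index 2

x_bar = E_s @ x_i                 # embedding vector: E^(s) times x_i
same_as_column = np.allclose(x_bar, E_s[:, 2])
print(x_bar.shape, same_as_column)
```

This is why practical implementations usually store word indices and look up the column directly instead of multiplying by a one-hot vector.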

2. Encoder recurrent layer 

The encoder recurrent layer generates the hidden vectors from the embedding vectors. When processing the i-th embedding vector, the input and output of the layer are the following:
      
   - The input is **$ \bar{x}_{i}$**: the embedding vector which represents the i-th word.
   - The output is **$h^{(s)}_{i}$**: the hidden vector of the i-th position.
   
For example, when using the uni-directional RNN of one layer, the process can be represented as the following function **$ \Psi^{(s)}$**:
<img src="photos/fig9.png">

In this case, we have used **tanh** as the activation function.
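
A single-layer, uni-directional tanh recurrence can be sketched as follows. The exact parameterization is an assumption: the code uses the common form $h_i = \tanh(W[h_{i-1}; \bar{x}_i] + b)$ with a zero initial hidden vector, and the names `W_s`, `b_s` and `psi_s` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D, I = 4, 3                                # hidden/embedding size and input length (toy values)
W_s = rng.normal(size=(D, 2 * D)) * 0.5    # recurrence weights (random, untrained)
b_s = np.zeros(D)                          # bias


def psi_s(x_bar, h_prev):
    """One encoder step: combine the previous hidden vector with the embedding."""
    return np.tanh(W_s @ np.concatenate([h_prev, x_bar]) + b_s)


x_bars = rng.normal(size=(I, D))   # stand-ins for the I embedding vectors
h = np.zeros(D)                    # initial hidden vector, assumed to be zero
hidden = []
for x_bar in x_bars:
    h = psi_s(x_bar, h)
    hidden.append(h)

z = hidden[-1]                     # hidden vector of the last position
print(len(hidden), z.shape)
```

Each hidden vector depends on all earlier inputs through `h_prev`, so the last one can serve as a summary of the whole input sequence.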


3. Decoder Embedding layer

The decoder embedding layer converts each word in the output sentence to an embedding vector. When processing the j-th word in the output sentence, the input and output of the layer are the following:
  
  - The input is **$y_{j-1}$**: the one-hot vector which represents the (j - 1)-th word generated by the decoder output layer.
  - The output is **$\bar{y}_{j}$**: the embedding vector which represents the (j - 1)-th word.
  
Each embedding vector is calculated by the following equation:
 **$\bar{y}_{j} = E^{(t)}y_{j-1}$** 
 
**$E^{(t)} \in \mathbb{R}^{D \times |\nu^{(t)}|}$** is the embedding matrix of the decoder.

4. Decoder Recurrent layer

The decoder recurrent layer generates the hidden vectors from the embedding vectors. When processing the j-th embedding vector, the input and output of the layer are the following:

 - The input is **$\bar{y}_{j}$**: the embedding vector.
 - The output is **$h^{(t)}_{j}$** : the hidden vector of j-th position.
 
For example, when using the uni-directional RNN of one layer, the process can be represented as the following function **$ \Psi ^{(t)}:$** <img src="photos/fig10.png"> 

In this case we used **tanh** as the activation function. We must also use the encoder's hidden vector at the last position as the decoder's initial hidden vector, as follows:

  **$h^{(t)}_{0} = z = h^{(s)}_{I}$**
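
The first decoder step can be sketched like this: the decoder starts from the encoder's last hidden vector and the BOS word. The tanh parameterization, the random values, and the choice of index 0 for BOS are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, V = 4, 5                              # toy hidden size and target vocabulary size
E_t = rng.normal(size=(D, V))            # decoder embedding matrix E^(t)
W_t = rng.normal(size=(D, 2 * D))        # decoder recurrence weights (assumed form)
b_t = np.zeros(D)

h_s_last = np.tanh(rng.normal(size=D))   # stand-in for the encoder's last hidden vector
h_t = h_s_last.copy()                    # decoder's initial hidden vector is z

y_prev = np.zeros(V)
y_prev[0] = 1.0                          # one-hot vector of BOS (index 0 is hypothetical)

y_bar = E_t @ y_prev                     # decoder embedding layer
h_1 = np.tanh(W_t @ np.concatenate([h_t, y_bar]) + b_t)   # decoder recurrent layer
print(h_1.shape)
```

Later steps repeat the same two lines with the previously generated word in place of BOS and the previous hidden vector in place of `h_t`.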
  
5. Decoder output layer

The decoder output layer generates the probability of the j-th word of the output sentence from the hidden vector. When processing the j-th hidden vector, the input and output of the layer are the following:

  - The input is **$h^{(t)}_{j}$** : the hidden vector of the j-th position.
  - The output is **$p_{j}$**: the probability of generating the j-th word $y_j$.
  
  <img src="photos/fig11.png"> 
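
One common way to turn a hidden vector into word probabilities is a linear projection followed by a softmax. The sketch below assumes that form; the names `W_o` and `b_o`, the random values, and the chosen word index are hypothetical and may differ from the parameterization in the figure.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(scores - np.max(scores))
    return e / e.sum()


rng = np.random.default_rng(0)
D, V = 4, 5                        # toy hidden size and target vocabulary size
W_o = rng.normal(size=(V, D))      # output projection (random, untrained)
b_o = np.zeros(V)                  # output bias
h_j = rng.normal(size=D)           # stand-in for the j-th decoder hidden vector

dist = softmax(W_o @ h_j + b_o)    # probabilities of every word in the vocabulary

y_j = np.zeros(V)
y_j[3] = 1.0                       # one-hot vector of the word actually generated
p_j = float(dist @ y_j)            # probability assigned to that word
print(p_j)
```

The dot product with the one-hot vector picks out a single entry of the distribution, which is the probability of the j-th word.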
  
