# Introduction to Encoder-Decoder Sequence-to-Sequence Architectures
Jun 2, 2019

Guorui Shen, guorui233@outlook.com

## 1 - LSTM
References 
- https://en.wikipedia.org/wiki/LSTM
- Picture source: https://www.cnblogs.com/pinking/p/9362966.html

The input and output of an LSTM cell are
+ input: $x_1, x_2, \cdots, x_T$, where $T$ is the number of time steps the cell is unrolled for. The initial states are $c_0 = 0, h_0 = 0$.
+ output: $c_1, c_2, \cdots, c_T$ and $h_1, h_2, \cdots, h_T$.

The mathematical formulation of LSTM is
\begin{align}
&f_t = \sigma_g(W_fx_t+U_fh_{t-1}+b_f)\cr
&i_t = \sigma_g(W_ix_t+U_ih_{t-1}+b_i)\cr
&o_t = \sigma_g(W_ox_t+U_oh_{t-1}+b_o)\cr
&g_t = \sigma_c(W_cx_t+U_ch_{t-1}+b_c)
\end{align}
and then
\begin{align}
&c_t = f_t\circ c_{t-1}+i_t\circ g_t\cr
&h_t = o_t\circ \sigma_h(c_t)
\end{align}

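where the initial values are $c_{0}=0$ and $h_{0}=0$, $c_t\in R^h, h_t\in R^h$, and $f_t, i_t, o_t$ are the forget, input, and output gates at time $t$, respectively, while $g_t$ is the candidate cell state (labelled the "gate gate" in the figures below). Following the Wikipedia reference, $\sigma_g$ is the sigmoid function, $\sigma_c$ and $\sigma_h$ are the hyperbolic tangent, and $\circ$ denotes the element-wise product. The LSTM cell is unrolled $T$ times, which means $t=1, 2, \cdots, T$. The structure of an LSTM unit is shown below.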

<p align="center">
  <img src="http://suzyi.github.io/images/lstm.png" alt="structure of an LSTM unit" width="500" height="500">
</p>

### Forget Gate
<p align="center">
  <img src="http://suzyi.github.io/images/forget-gate.png" alt="forget gate" width="500" height="500">
</p>

### Input Gate and Candidate State (Gate Gate)
<p align="center">
  <img src="http://suzyi.github.io/images/input-gate.png" alt="input gate" width="500" height="500">
</p>
<p align="center">
  <img src="http://suzyi.github.io/images/gate-gate.png" alt="gate gate" width="500" height="500">
</p>

### Output Gate
<p align="center">
  <img src="http://suzyi.github.io/images/output-gate.png" alt="output gate" width="500" height="500">
</p>

Finally, once the LSTM unit has been unrolled $T$ times, the information from all inputs $x_1, x_2, \cdots, x_T$ is accumulated in $h_T$.
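
The equations above can be sketched directly in NumPy. This is a minimal illustration, not a reference implementation: it assumes $\sigma_g$ is the sigmoid and $\sigma_c, \sigma_h$ are $\tanh$ (the Wikipedia convention), and the helper names `lstm_step`, `init_params`, and `lstm_forward`, as well as the small random weights, are made up for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; gate names follow the equations in the text."""
    f_t = sigmoid(params["W_f"] @ x_t + params["U_f"] @ h_prev + params["b_f"])  # forget gate
    i_t = sigmoid(params["W_i"] @ x_t + params["U_i"] @ h_prev + params["b_i"])  # input gate
    o_t = sigmoid(params["W_o"] @ x_t + params["U_o"] @ h_prev + params["b_o"])  # output gate
    g_t = np.tanh(params["W_c"] @ x_t + params["U_c"] @ h_prev + params["b_c"])  # candidate state
    c_t = f_t * c_prev + i_t * g_t   # element-wise products
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

def init_params(input_dim, hidden_dim, seed=0):
    """Hypothetical initialisation: small random weights, zero biases."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in "fioc":
        params["W_" + gate] = 0.1 * rng.standard_normal((hidden_dim, input_dim))
        params["U_" + gate] = 0.1 * rng.standard_normal((hidden_dim, hidden_dim))
        params["b_" + gate] = np.zeros(hidden_dim)
    return params

def lstm_forward(xs, params, hidden_dim):
    """Unroll the cell over xs of shape (T, input_dim), starting from c_0 = h_0 = 0."""
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    hs, cs = [], []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, params)
        hs.append(h)
        cs.append(c)
    return np.stack(hs), np.stack(cs)

T, input_dim, hidden_dim = 5, 3, 4
params = init_params(input_dim, hidden_dim)
xs = np.random.default_rng(1).standard_normal((T, input_dim))
hs, cs = lstm_forward(xs, params, hidden_dim)
print(hs.shape, cs.shape)  # (5, 4) (5, 4)
```

The last row `hs[-1]` plays the role of $h_T$, the state that has seen every input.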

## 2 - Encoder-Decoder Sequence-to-Sequence Model
The inputs and outputs of the seq2seq model are listed below (LSTM cells are used here for both the encoder and the decoder).
+ Input of LSTM encoder: $x_1, x_2, \cdots, x_T$, $c_0 = 0, h_0 = 0$.
+ Output of LSTM encoder: $c_T, h_T$.
+ Input & Output of LSTM decoder: 
    - step 1: the inputs are $c_0^{'} = c_T$, $h_0^{'} = h_T$, and a manually defined start token $y_0$, such as `<start>`, `<go>`, or `<eos>` (eos, end of the input sequence); the outputs are $c_1^{'}, h_1^{'}, y_1$.
    - step 2: input $c_1^{'}, h_1^{'}, y_1$, output $c_2^{'}, h_2^{'}, y_2$.
    - $\vdots$
    - step $T^{'}$: input $c_{T^{'}-1}^{'}, h_{T^{'}-1}^{'}, y_{T^{'}-1}$, output $c_{T^{'}}^{'}, h_{T^{'}}^{'}, y_{T^{'}}$.

<p align="center">
  <img src="http://suzyi.github.io/images/basic_seq2seq.png" width="900" height="500">
</p>

The encoder-decoder sequence-to-sequence model, seq2seq for short, is essentially a many-to-one model followed by a one-to-many model. In the many-to-one stage, the encoder encodes the input sequence into a single vector. In the one-to-many stage, the decoder produces the output sequence from that single vector. The encoder can be an LSTM cell, and the decoder can be another LSTM cell.

**References**
+ TensorFlow's official seq2seq code - https://github.com/tensorflow/tensorflow/blob/r1.0/tensorflow/contrib/legacy_seq2seq/python/ops/seq2seq.py
+ Neural Machine Translation (seq2seq) Tutorial - https://github.com/tensorflow/nmt
+ What is the input of decoder? - https://www.jianshu.com/p/c0c5f1bdbb88

### Encoder
The encoder converts the input sequence $x_1, x_2, \cdots, x_T$ into a fixed-length vector $c$, called the context vector or thought vector. In the picture below, $T=4$. At the very beginning, the initial hidden state $h_0$ is given. The hidden state at time $t$ is then calculated as
\begin{align}
h_t = f_e(x_t, h_{t-1}), t = 1, 2, \cdots, T.
\end{align}
Finally, the context vector $c$ is a function of all hidden states, namely, $c = q(h_1, h_2, \cdots, h_T)$. Usually $c$ is chosen as either the final hidden state or the average over all hidden states:
\begin{align}
&c = h_T,\quad\text{or}\cr
&c = \frac{h_1+h_2+\cdots + h_T}{T}.
\end{align}
<p align="center">
  <img src="http://suzyi.github.io/images/seq2seq.jpg" width="500" height="500">
</p>
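
As a rough sketch of the encoder recurrence and the two common choices of $q$, the snippet below uses a plain $\tanh$ RNN as a stand-in for $f_e$. That choice, the function names, and the random weights are assumptions made only for illustration; an LSTM would work the same way.

```python
import numpy as np

def encode(xs, W, U, b):
    """Compute h_t = f_e(x_t, h_{t-1}) for t = 1..T, with a tanh RNN standing in for f_e."""
    h = np.zeros(U.shape[0])          # h_0 = 0
    hs = []
    for x_t in xs:
        h = np.tanh(W @ x_t + U @ h + b)
        hs.append(h)
    return np.stack(hs)               # shape (T, hidden_dim)

def context_last(hs):
    """c = h_T: the final hidden state."""
    return hs[-1]

def context_mean(hs):
    """c = (h_1 + ... + h_T) / T: the average over all hidden states."""
    return hs.mean(axis=0)

rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 4, 3, 5
W = 0.1 * rng.standard_normal((hidden_dim, input_dim))
U = 0.1 * rng.standard_normal((hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)
xs = rng.standard_normal((T, input_dim))

hs = encode(xs, W, U, b)
c_last = context_last(hs)
c_mean = context_mean(hs)
print(hs.shape, c_last.shape, c_mean.shape)  # (4, 5) (5,) (5,)
```

Either way, $c$ has a fixed length (`hidden_dim`) regardless of the input length $T$.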

When both the encoder and the decoder are LSTM cells, the seq2seq model becomes
<p align="center">
  <img src="http://suzyi.github.io/images/seq2seq-example.png" width="700" height="600">
</p>

### Decoder
The encoder outputs a context vector $c$, which contains the accumulated information about the input sequence $x_1, x_2, \cdots, x_T$. Assume the expected outputs of the decoder are $y_1, y_2, \cdots, y_{T'}$, where $T'$ need not equal $T$. In the picture above, $T' = 3$. Given the initial hidden state of the decoder, the hidden state and the output at time $t$ are calculated as
\begin{align}
h^{'}_t = f_d(y_{t-1}, h^{'}_{t-1}, c),\cr
y_t = g(y_{t-1}, h^{'}_{t}, c), \cr
t = 1, 2, \cdots, T^{'}.
\end{align}

<p align="center">
  <img src="http://suzyi.github.io/images/encoder-decoder.png" width="400" height="400">
</p>
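
The decoder recurrence can be sketched as a greedy decoding loop. Everything specific here is an assumption for illustration: $f_d$ is a $\tanh$ RNN, $g$ is a linear scoring layer followed by an argmax over a toy vocabulary, tokens are one-hot vectors, the initial state is set to $h^{'}_0 = c$ (assuming equal dimensions), and all names and token ids are hypothetical.

```python
import numpy as np

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

def init_decoder_params(vocab_size, hidden_dim, seed=0):
    """Hypothetical random weights for f_d (W_y, U, W_c) and g (V_y, V_h, V_c)."""
    rng = np.random.default_rng(seed)
    shapes = {
        "W_y": (hidden_dim, vocab_size),
        "U": (hidden_dim, hidden_dim),
        "W_c": (hidden_dim, hidden_dim),
        "V_y": (vocab_size, vocab_size),
        "V_h": (vocab_size, hidden_dim),
        "V_c": (vocab_size, hidden_dim),
    }
    return {name: 0.5 * rng.standard_normal(shape) for name, shape in shapes.items()}

def greedy_decode(c, params, start_id, eos_id, max_steps):
    """Generate token ids y_1, y_2, ... until eos_id appears or max_steps is reached."""
    vocab_size = params["V_y"].shape[0]
    h = c.copy()                               # h'_0 = c (assumed initialisation)
    y_prev = one_hot(start_id, vocab_size)     # y_0 = start token
    tokens = []
    for _ in range(max_steps):
        # h'_t = f_d(y_{t-1}, h'_{t-1}, c)
        h = np.tanh(params["W_y"] @ y_prev + params["U"] @ h + params["W_c"] @ c)
        # y_t = g(y_{t-1}, h'_t, c): score every vocabulary entry, pick the best one
        scores = params["V_y"] @ y_prev + params["V_h"] @ h + params["V_c"] @ c
        token = int(np.argmax(scores))
        tokens.append(token)
        if token == eos_id:
            break
        y_prev = one_hot(token, vocab_size)    # feed the prediction back in
    return tokens

vocab_size, hidden_dim = 6, 4
params = init_decoder_params(vocab_size, hidden_dim)
c = np.random.default_rng(1).standard_normal(hidden_dim)   # stand-in for the encoder output
tokens = greedy_decode(c, params, start_id=0, eos_id=1, max_steps=3)
print(tokens)
```

Note that the same $c$ enters every step, which is exactly the property the attention mechanism relaxes.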

### Attention-based Mechanism
A traditional decoder uses the same context vector $c$ to calculate $y_t$ for all $t=1, 2, \cdots, T^{'}$. A natural extension is to use a different context vector $c_t$ at each time step, which is the main idea of the attention mechanism.
+ TensorFlow's officially recommended tutorial: https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb
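
One simple way to obtain a step-dependent context vector is dot-product attention, sketched below. The text above does not fix a scoring function, so the dot-product score is an assumption chosen for brevity (the linked tutorial uses a learned additive score instead), and `attention_context` is a hypothetical helper name.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)        # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_context(decoder_state, encoder_states):
    """Return the context vector c_t and the attention weights over h_1..h_T.

    decoder_state:  shape (hidden_dim,), the current decoder hidden state
    encoder_states: shape (T, hidden_dim), the encoder hidden states h_1..h_T
    """
    scores = encoder_states @ decoder_state    # one score per encoder step, shape (T,)
    weights = softmax(scores)                  # non-negative, sums to 1
    context = weights @ encoder_states         # weighted average of h_1..h_T
    return context, weights

encoder_states = np.eye(3)                     # toy case: T = 3, hidden_dim = 3
decoder_state = np.array([10.0, 0.0, 0.0])     # strongly aligned with h_1
context, weights = attention_context(decoder_state, encoder_states)
print(weights.argmax())  # 0
```

With a zero decoder state every score is equal, so the weights are uniform and $c_t$ reduces to the plain average of the hidden states, one of the fixed choices of $c$ in the Encoder section.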

## 3 - Coding in TensorFlow

| Status | LSTM cell | RNN wrapper |
|---|---|---|
| deprecated | `tf.nn.rnn_cell.LSTMCell()` | `tf.nn.dynamic_rnn()` |
| deprecated | `tf.contrib.rnn.LSTMCell()` |  |
| new | `tf.keras.layers.LSTMCell()` | `tf.keras.layers.RNN()` |
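
A minimal sketch of the Keras pairing in the last row, assuming TensorFlow 2.x is installed. The sizes (`units = 8`, batch of 2, $T = 5$, input dimension 3) are arbitrary values picked for the example.

```python
import tensorflow as tf

units = 8                                    # hidden size (illustrative value)
cell = tf.keras.layers.LSTMCell(units)       # a single LSTM step
# RNN unrolls the cell along the time axis; return_state=True also returns h_T and c_T.
encoder = tf.keras.layers.RNN(cell, return_sequences=True, return_state=True)

x = tf.random.normal((2, 5, 3))              # (batch, T, input_dim)
hs, h_T, c_T = encoder(x)
# Shapes: hs is (batch, T, units); h_T and c_T are (batch, units).
print(hs.shape, h_T.shape, c_T.shape)
```

Here `h_T` and `c_T` are the states a seq2seq encoder would hand to the decoder, and `hs[:, -1, :]` equals `h_T`.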