# Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation

Kyunghyun Cho et al.
3 Sep 2014

https://arxiv.org/pdf/1406.1078.pdf

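## Summary
1. The RNN Encoder–Decoder uses two RNNs, one as the encoder and one as the decoder.
1. The encoder maps a variable-length sequence to a fixed-length vector.
1. The decoder maps that fixed-length vector back to a variable-length sequence.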

## Introduction
The proposed neural network architecture, which we will refer to as an RNN Encoder–Decoder, consists of two recurrent neural networks (RNN) that act as an encoder and a decoder pair. The encoder maps a variable-length source sequence to a fixed-length vector, and the decoder maps the vector representation back to a variable-length target sequence.

## RNN Encoder–Decoder
### Preliminary: Recurrent Neural Networks
A recurrent neural network (RNN) operates on a variable-length sequence $x = (x_1, \cdots, x_T)$ and consists of:
- a hidden state h
- an optional output y

At each time step t, the hidden state h⟨t⟩ of the RNN is updated by

$$h_{<t>}=f(h_{<t-1>},x_t)\ (1)$$

- f is a non-linear activation function

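The output at each time step t is the conditional distribution $p(x_t|x_{t-1},\cdots,x_1)$. For a 1-of-K coded symbol, this is a softmax over all K possible symbols j:

$$p(x_{t,j}=1|x_{t-1},\cdots,x_1)=\frac{\exp(w_j h_{<t>})}{\sum_{j'=1}^K \exp(w_{j'} h_{<t>})}\ (2)$$

where $w_j$ are the rows of a weight matrix W.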

By combining these probabilities, we can compute the probability of the sequence x using

$$p(x)=\prod_{t=1}^T p(x_t|x_{t-1},\cdots,x_1)\ (3)$$
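Eqs. (1) to (3) can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: f is assumed to be an elementwise tanh of an affine map, the matrices `W`, `U`, `V` are random placeholders, and the hidden state held before reading $x_t$ is used to predict $x_t$.

```python
import math
import random

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def softmax(scores):
    """Eq. (2): normalise scores into a distribution over K symbols."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(h_prev, x_onehot, W, U):
    """Eq. (1), with f assumed to be tanh(W x + U h_prev)."""
    return [math.tanh(a + b)
            for a, b in zip(matvec(W, x_onehot), matvec(U, h_prev))]

def sequence_log_prob(symbols, W, U, V):
    """Eq. (3) in log space: sum of log p(x_t | x_{t-1}, ..., x_1)."""
    K, H = len(V), len(U)
    h = [0.0] * H
    log_p = 0.0
    for s in symbols:
        probs = softmax(matvec(V, h))  # rows of V play the role of w_j
        log_p += math.log(probs[s])
        x = [1.0 if k == s else 0.0 for k in range(K)]  # 1-of-K coding
        h = rnn_step(h, x, W, U)
    return log_p

# Hypothetical toy sizes and random weights, for illustration only.
rng = random.Random(0)
K, H = 4, 3

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

W, U, V = rand_matrix(H, K), rand_matrix(H, H), rand_matrix(K, H)
print(math.exp(sequence_log_prob([1, 3, 2], W, U, V)))
```

Working in log space avoids underflow when many per-step probabilities are multiplied together.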

### RNN Encoder–Decoder
The encoder is an RNN that reads each symbol of an input sequence x sequentially. As it reads each symbol, the hidden state of the RNN changes according to Eq. (1). After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary c of the whole input sequence.

The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol $y_t$ given the hidden state $h_{<t>}$. However, unlike the plain RNN described above, both $y_t$ and $h_{<t>}$ are also conditioned on $y_{t-1}$ and on the summary c of the input sequence. Hence, the hidden state of the decoder at time t is computed by

$$h_{<t>}=f(h_{<t-1>},y_{t-1},c)$$

and similarly, the conditional distribution of the next symbol is

$$P(y_t|y_{t-1},y_{t-2},\cdots,y_1,c)=g(h_{<t>},y_{t-1},c)$$

for given activation functions f and g (the latter must produce valid probabilities, e.g. with a softmax).
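The encoder and decoder recurrences can be sketched as follows. This is a hedged toy version in plain Python: f is assumed to be tanh of a sum of linear maps, g a softmax, the decoder state is initialised to c, symbol 0 is treated as the end-of-sequence marker and also as the first "previous symbol", and all parameter names (`W`, `U`, `C`, `Oh`, `Oy`, `Oc`) are hypothetical.

```python
import math
import random

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(index, size):
    return [1.0 if k == index else 0.0 for k in range(size)]

def tanh_sum(*vectors):
    """Elementwise tanh of the sum of several equally sized vectors."""
    return [math.tanh(sum(parts)) for parts in zip(*vectors)]

def encode(source, enc, K):
    """Read the source symbol by symbol; the final state is the summary c."""
    h = [0.0] * len(enc["U"])
    for symbol in source:
        h = tanh_sum(matvec(enc["W"], one_hot(symbol, K)), matvec(enc["U"], h))
    return h

def decoder_step(h_prev, y_prev, c, dec, K):
    """h_t = f(h_{t-1}, y_{t-1}, c) and p(y_t | ...) = g(h_t, y_{t-1}, c)."""
    y = one_hot(y_prev, K)
    h = tanh_sum(matvec(dec["W"], y), matvec(dec["U"], h_prev), matvec(dec["C"], c))
    scores = [a + b + d for a, b, d in zip(
        matvec(dec["Oh"], h), matvec(dec["Oy"], y), matvec(dec["Oc"], c))]
    return h, softmax(scores)

def greedy_decode(source, enc, dec, K, eos=0, max_len=10):
    """Pick the most probable symbol at each step until EOS or max_len."""
    c = encode(source, enc, K)
    h, y_prev, output = c, eos, []  # assumed initial state and start symbol
    for _ in range(max_len):
        h, probs = decoder_step(h, y_prev, c, dec, K)
        y_prev = max(range(K), key=probs.__getitem__)
        if y_prev == eos:
            break
        output.append(y_prev)
    return output

# Hypothetical toy sizes and random (untrained) weights.
rng = random.Random(1)
K, H = 5, 4

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

enc = {"W": rand_matrix(H, K), "U": rand_matrix(H, H)}
dec = {"W": rand_matrix(H, K), "U": rand_matrix(H, H), "C": rand_matrix(H, H),
       "Oh": rand_matrix(K, H), "Oy": rand_matrix(K, K), "Oc": rand_matrix(K, H)}
print(greedy_decode([1, 2, 3], enc, dec, K))
```

Note that c is fed into every decoder step, not only the first one, which matches the two equations above. Greedy decoding is used here only to keep the sketch short.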

![1](http://ou8qjsj0m.bkt.clouddn.com//17-8-8/16608032.jpg)

The two components of the proposed RNN Encoder–Decoder are jointly trained to maximize the conditional log-likelihood

$$\max_{\theta} \frac{1}{N}\sum_{n=1}^N \log p_{\theta}(y_n|x_n)\ (4)$$

where θ is the set of the model parameters and each $(x_n, y_n)$ is an (input sequence, output sequence) pair from the training set.
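Since the decoder defines each output symbol's distribution conditionally, the per-pair term in Eq. (4) can be expanded with the chain rule. As a sketch, write $T_n$ for the length of $y_n$, $y_{n,t}$ for its t-th symbol, and $c_n$ for the encoder summary of $x_n$ (this notation is introduced here for illustration and is not the paper's):

$$\log p_{\theta}(y_n|x_n)=\sum_{t=1}^{T_n} \log p_{\theta}(y_{n,t}|y_{n,t-1},\cdots,y_{n,1},c_n)$$

Maximizing Eq. (4) therefore amounts to maximizing the summed log-probability that the decoder assigns to each correct next symbol.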

### Hidden Unit that Adaptively Remembers and Forgets
![2](http://ou8qjsj0m.bkt.clouddn.com//17-8-8/60099754.jpg)

The paper proposes a new type of hidden unit (f in Eq. (1)) that is motivated by the LSTM unit but is much simpler to compute and implement.

First, the reset gate $r_j$ is computed by

$$r_j=\sigma([W_r x]_j + [U_r h_{<t-1>}]_j)\ (5)$$

where σ is the logistic sigmoid function, and $[.]_j$ denotes the j-th element of a vector. x and $h_{<t-1>}$ are the input and the previous hidden state, respectively. $W_r$ and $U_r$ are learned weight matrices.

The update gate $z_j$ is computed by

$$z_j=\sigma([W_z x]_j + [U_z h_{<t-1>}]_j)\ (6)$$

The actual activation of the proposed unit $h_j$ is then computed by

$$h_j^{<t>}=z_j h_j^{<t-1>}+(1-z_j) \tilde{h}_j^{<t>}\ (7)$$

where

$$\tilde{h}_j^{<t>}=\phi([Wx]_j + [U(r \odot h_{<t-1>})]_j)\ (8)$$

In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to **drop** any information that later turns out to be irrelevant, allowing a more compact representation.
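Eqs. (5) to (8) translate almost line by line into code. The following is a minimal sketch of one update of the gated unit in plain Python, assuming φ is tanh and omitting bias terms as the equations do. The function and argument names are illustrative only.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def gru_step(x, h_prev, Wr, Ur, Wz, Uz, W, U):
    """One update of the gated hidden unit, following Eqs. (5)-(8)."""
    # Eq. (5): reset gate
    r = [sigmoid(a + b) for a, b in zip(matvec(Wr, x), matvec(Ur, h_prev))]
    # Eq. (6): update gate
    z = [sigmoid(a + b) for a, b in zip(matvec(Wz, x), matvec(Uz, h_prev))]
    # Eq. (8): candidate state; the reset gate masks the previous state
    gated = [rj * hj for rj, hj in zip(r, h_prev)]
    h_tilde = [math.tanh(a + b)
               for a, b in zip(matvec(W, x), matvec(U, gated))]
    # Eq. (7): interpolate between the old state and the candidate
    return [zj * hj + (1.0 - zj) * htj
            for zj, hj, htj in zip(z, h_prev, h_tilde)]

# With all-zero weights both gates equal 0.5 and the candidate is 0,
# so each unit simply halves its previous value.
zeros_hx = [[0.0, 0.0] for _ in range(3)]       # hidden size 3, input size 2
zeros_hh = [[0.0, 0.0, 0.0] for _ in range(3)]
print(gru_step([1.0, -1.0], [0.2, -0.4, 0.6],
               zeros_hx, zeros_hh, zeros_hx, zeros_hh, zeros_hx, zeros_hh))
```

The zero-weight case is only a sanity check: it shows how Eq. (7) blends the old state with the candidate when the update gate sits at 0.5.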

## Statistical Machine Translation
The goal of the system (decoder, specifically) is to find a translation f given a source sentence e, which maximizes

$$p(f|e) \propto p(e|f)p(f)$$

In practice, however, most SMT systems model $\log p(f|e)$ as a log-linear model with additional features and corresponding weights:

$$\log p(f|e)=\sum_{n=1}^N w_n f_n(f,e)+\log Z(e)\ (9)$$

where $f_n$ and $w_n$ are the n-th feature and weight, respectively. Z(e) is a normalization constant that does not depend on the weights. The weights are often optimized to maximize the BLEU score on a development set.
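The weighted sum in Eq. (9) can be illustrated with a toy ranking of candidate translations. This is a sketch only: the feature names (`tm`, `lm`, `rnn`), their values, and the weights are all made up. Because Z(e) is the same for every candidate translation of a given e, it can be ignored when picking the best one.

```python
def loglinear_score(features, weights):
    """Sum over n of w_n * f_n(f, e), for one candidate's feature values."""
    return sum(weights[name] * value for name, value in features.items())

def best_translation(candidates, weights):
    """Return the candidate text with the highest log-linear score."""
    text, _ = max(candidates, key=lambda c: loglinear_score(c[1], weights))
    return text

# Hypothetical weights and precomputed feature values (log-probabilities).
weights = {"tm": 1.0, "lm": 0.5, "rnn": 0.8}
candidates = [
    ("candidate A", {"tm": -2.0, "lm": -3.0, "rnn": -1.0}),
    ("candidate B", {"tm": -1.5, "lm": -4.0, "rnn": -2.5}),
]
print(best_translation(candidates, weights))
```

Here candidate A scores -4.3 and candidate B scores -5.5, so A is selected even though B has the better `tm` value. In a real system the weights would be tuned on a development set, as described above.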
