# Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

    Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, Yoshua Bengio
    11 Dec 2014

https://arxiv.org/abs/1412.3555

## Summary
1. RNN
    1. computes a weighted sum of the input signal and applies a nonlinear function.
    1. always replaces the activation, or the content of a unit with a new value computed from the current input and the previous hidden state.
1. LSTM
    1. Each j-th LSTM unit maintains a memory $c_t^j$ at time t
    1. The memory cell $c_t^j$ is updated by partially forgetting the existing memory and adding a new memory content $\tilde{c}_t^j$
1. GRU: makes each recurrent unit adaptively capture dependencies of different time scales.

## Background: Recurrent Neural Network
More formally, given a sequence $x = (x_1, x_2, \cdots, x_T)$, the RNN updates its recurrent hidden state $h_t$ by

$$
h_t=\left\{\begin{matrix}
 0, & \text{if } t = 0 \\
 \phi(h_{t-1},x_t), & \text{otherwise}
\end{matrix}\right.\ (1)
$$

where $\phi$ is a nonlinear function, such as the composition of a logistic sigmoid with an affine transformation. Optionally, the RNN may have an output $y = (y_1, y_2, \cdots, y_T)$, which may again be of variable length.

Traditionally, the update of the recurrent hidden state in Eq. (1) is implemented as

$$h_t=g(Wx_t+Uh_{t-1})\ (2)$$

where $g$ is a smooth, bounded function such as a logistic sigmoid or a hyperbolic tangent.
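As a rough illustration of Eqs. (1) and (2), the recurrence can be sketched in plain Python. The choice of tanh for $g$, the helper names (`matvec`, `rnn_step`, `rnn_forward`) and the toy weights are assumptions made for this example, not something taken from the paper.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v_k for m, v_k in zip(row, v)) for row in M]

def rnn_step(W, U, x_t, h_prev):
    """One update h_t = g(W x_t + U h_{t-1}), with g = tanh (assumed choice)."""
    pre = [a + b for a, b in zip(matvec(W, x_t), matvec(U, h_prev))]
    return [math.tanh(a) for a in pre]

def rnn_forward(W, U, xs):
    """Run the recurrence over a sequence, starting from h_0 = 0 as in Eq. (1)."""
    h = [0.0] * len(W)
    states = []
    for x_t in xs:
        h = rnn_step(W, U, x_t, h)
        states.append(h)
    return states

# Toy, hand-picked weights: 2 hidden units, 1-dimensional inputs.
W = [[0.5], [-0.3]]
U = [[0.1, 0.2], [0.0, 0.4]]
states = rnn_forward(W, U, [[1.0], [0.0], [-1.0]])
```

Each call to `rnn_step` overwrites the previous state entirely, which is the behavior the gated units below are designed to soften.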

The sequence probability can be decomposed into

$$p(x_1,\cdots,x_T)=p(x_1)p(x_2|x_1)p(x_3|x_1,x_2) \cdots p(x_T|x_1,\cdots,x_{T-1})\ (3)$$

where the last element is a special end-of-sequence value. We model each conditional probability distribution with

$$p(x_t|x_1,\cdots,x_{t-1})=g(h_t)$$

where $h_t$ is from Eq. (1).
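The factorization in Eq. (3) means the log-probability of a sequence is the sum of the per-step conditional log-probabilities. A minimal sketch follows, assuming the model's per-step output distributions are already available as dictionaries; the three-symbol vocabulary and the probability values are made up for illustration and stand in for $g(h_t)$.

```python
import math

def sequence_log_prob(conditionals, xs):
    """log p(x_1..x_T) = sum_t log p(x_t | x_1..x_{t-1}), the log of Eq. (3)."""
    return sum(math.log(p_t[x_t]) for p_t, x_t in zip(conditionals, xs))

# Hypothetical per-step distributions over the vocabulary {a, b, <eos>}.
conditionals = [
    {"a": 0.5, "b": 0.3, "<eos>": 0.2},    # p(x_1)
    {"a": 0.1, "b": 0.4, "<eos>": 0.5},    # p(x_2 | x_1 = a)
    {"a": 0.05, "b": 0.05, "<eos>": 0.9},  # p(x_3 | x_1 = a, x_2 = b)
]
xs = ["a", "b", "<eos>"]  # the last element is the end-of-sequence symbol
log_p = sequence_log_prob(conditionals, xs)
```

Exponentiating `log_p` gives the product 0.5 × 0.4 × 0.9 of the three conditionals picked out by the sequence.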

## Gated Recurrent Neural Networks
![1](http://ou8qjsj0m.bkt.clouddn.com//17-8-9/57870199.jpg)

### Long Short-Term Memory Unit
Unlike the traditional recurrent unit, which simply computes a weighted sum of the input signal and applies a nonlinear function, each j-th LSTM unit maintains a memory $c_t^j$ at time t. The output $h_t^j$, or the activation, of the LSTM unit is then

$$h_t^j=o_t^j \tanh(c_t^j)$$

where $o_t^j$ is an output gate that modulates the amount of memory content exposure. The output gate is computed by

$$o_t^j=\sigma(W_o x_t+U_o h_{t-1} + V_o c_t)^j$$

where $\sigma$ is a logistic sigmoid function and $V_o$ is a diagonal matrix.

The memory cell $c_t^j$ is updated by partially forgetting the existing memory and adding a new memory content $\tilde{c}_t^j$:

$$c_t^j=f_t^j c_{t-1}^j + i_t^j \tilde{c}_t^j\ (4)$$

where the new memory content is

$$\tilde{c}_t^j=\tanh(W_c x_t + U_c h_{t-1})^j$$

The extent to which the existing memory is forgotten is modulated by a forget gate $f_t^j$, and the degree to which the new memory content is added to the memory cell is modulated by an input gate $i_t^j$. Gates are computed by

$$f_t^j=\sigma(W_f x_t + U_f h_{t-1} + V_f c_{t-1})^j$$

$$i_t^j=\sigma(W_i x_t + U_i h_{t-1} + V_i c_{t-1})^j$$

Note that $V_f$ and $V_i$ are diagonal matrices.
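Putting the LSTM equations of this section together, one time step might look like the sketch below. It is written in plain Python for readability. The parameter dictionary `p`, its key names, the helper functions and the toy weights are all assumptions for this example; the diagonal matrices $V_f$, $V_i$, $V_o$ are stored as vectors.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def affine(W, U, x, h):
    """Return the vector W x + U h, with W and U given as lists of rows."""
    return [
        sum(w * x_k for w, x_k in zip(W_row, x))
        + sum(u * h_k for u, h_k in zip(U_row, h))
        for W_row, U_row in zip(W, U)
    ]

def gate(p, name, x, h, c):
    """sigma(W x + U h + V c), with the diagonal V stored as a vector."""
    pre = affine(p["W" + name], p["U" + name], x, h)
    return [sigmoid(a + v * c_j) for a, v, c_j in zip(pre, p["V" + name], c)]

def lstm_step(p, x_t, h_prev, c_prev):
    """One LSTM update; returns the new activation h_t and memory c_t."""
    f = gate(p, "f", x_t, h_prev, c_prev)  # forget gate, uses c_{t-1}
    i = gate(p, "i", x_t, h_prev, c_prev)  # input gate, uses c_{t-1}
    c_new = [math.tanh(a) for a in affine(p["Wc"], p["Uc"], x_t, h_prev)]
    # Eq. (4): keep part of the old memory and add part of the new content.
    c = [f_j * c_old + i_j * c_til
         for f_j, c_old, i_j, c_til in zip(f, c_prev, i, c_new)]
    o = gate(p, "o", x_t, h_prev, c)       # output gate, uses the updated c_t
    h = [o_j * math.tanh(c_j) for o_j, c_j in zip(o, c)]
    return h, c

# Toy, hand-picked parameters: 2 units, 1-dimensional input.
params = {
    "Wf": [[0.1], [0.2]], "Uf": [[0.0, 0.1], [0.1, 0.0]], "Vf": [0.5, 0.5],
    "Wi": [[0.3], [-0.1]], "Ui": [[0.2, 0.0], [0.0, 0.2]], "Vi": [0.1, 0.1],
    "Wo": [[-0.2], [0.4]], "Uo": [[0.1, 0.1], [0.1, 0.1]], "Vo": [0.3, 0.3],
    "Wc": [[0.7], [-0.5]], "Uc": [[0.0, 0.3], [0.3, 0.0]],
}
h, c = lstm_step(params, [1.0], [0.0, 0.0], [0.0, 0.0])
```

Note the ordering: the forget and input gates read the previous memory $c_{t-1}$, while the output gate reads the freshly updated $c_t$, matching the equations above.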

### Gated Recurrent Unit
A gated recurrent unit (GRU) was proposed by Cho et al. [2014] to make each recurrent unit adaptively capture dependencies of different time scales.

The activation $h_t^j$ of the GRU at time t is a linear interpolation between the previous activation $h_{t-1}^j$ and the candidate activation $\tilde{h}_t^j$:

$$h_t^j=(1-z_t^j)h_{t-1}^j+z_t^j\tilde{h}_t^j\ (5)$$

where an update gate $z_t^j$ decides how much the unit updates its activation, or content. The update gate is computed by

$$z_t^j=\sigma(W_z x_t + U_z h_{t-1})^j$$

The candidate activation $\tilde{h}_t^j$ is computed similarly to that of the traditional recurrent unit (see Eq. (2)) and as in [Bahdanau et al., 2014],

$$\tilde{h}_t^j=\tanh(W x_t + U(r_t \odot h_{t-1}))^j$$

where $r_t$ is a set of reset gates and ⊙ is an element-wise multiplication. When off ($r_t^j$ close to 0), the reset gate effectively makes the unit act as if it is reading the first symbol of an input sequence, allowing it to forget the previously computed state.

The reset gate $r_t^j$ is computed similarly to the update gate:

$$r_t^j=\sigma(W_r x_t + U_r h_{t-1})^j$$
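The GRU equations of this section can be combined into a single step function. The following is a minimal sketch in plain Python; the parameter dictionary `p`, its key names, the helpers and the toy weights are assumptions for illustration only.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v_k for m, v_k in zip(row, v)) for row in M]

def gru_step(p, x_t, h_prev):
    """One GRU update; returns the new activation h_t."""
    # Update gate z_t and reset gate r_t, both computed from x_t and h_{t-1}.
    z = [sigmoid(a + b)
         for a, b in zip(matvec(p["Wz"], x_t), matvec(p["Uz"], h_prev))]
    r = [sigmoid(a + b)
         for a, b in zip(matvec(p["Wr"], x_t), matvec(p["Ur"], h_prev))]
    # Candidate activation: the reset gate scales h_{t-1} element-wise first.
    reset_h = [r_j * h_j for r_j, h_j in zip(r, h_prev)]
    h_cand = [math.tanh(a + b)
              for a, b in zip(matvec(p["W"], x_t), matvec(p["U"], reset_h))]
    # Eq. (5): interpolate between the previous and candidate activations.
    return [(1.0 - z_j) * h_j + z_j * hc_j
            for z_j, h_j, hc_j in zip(z, h_prev, h_cand)]

# Toy, hand-picked parameters: 2 units, 1-dimensional input.
params = {
    "Wz": [[0.2], [-0.1]], "Uz": [[0.1, 0.0], [0.0, 0.1]],
    "Wr": [[0.3], [0.3]], "Ur": [[0.0, 0.2], [0.2, 0.0]],
    "W": [[0.6], [-0.4]], "U": [[0.1, 0.3], [0.3, 0.1]],
}
h = gru_step(params, [1.0], [0.5, -0.5])
```

When a reset gate value is close to 0, the corresponding entry of `reset_h` vanishes, so the candidate activation is driven by the current input alone, as described above.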

### Discussion
The most prominent feature shared between these units is the additive component of their update from t to t + 1, which is lacking in the traditional recurrent unit. The traditional recurrent unit always replaces the activation, or the content of a unit, with a new value computed from the current input and the previous hidden state. On the other hand, both the LSTM unit and the GRU keep the existing content and add the new content on top of it (see Eqs. (4) and (5)).

This additive nature has two advantages. First, it is easy for each unit to remember the existence of a specific feature in the input stream for a long series of steps. Any important feature, decided by either the forget gate of the LSTM unit or the update gate of the GRU, will not be overwritten but be maintained as it is.

Second, and perhaps more importantly, this addition effectively creates shortcut paths that bypass multiple temporal steps. These shortcuts allow the error to be back-propagated easily without too quickly vanishing (if the gating unit is nearly saturated at 1) as a result of passing through multiple, bounded nonlinearities, thus reducing the difficulty due to vanishing gradients [Hochreiter, 1991, Bengio et al., 1994].
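To make the shortcut argument concrete, the sketch below compares two scalar chain-rule products over 50 steps. The first is the gradient of a tanh recurrent unit with respect to its initial state. The second is the gradient along the direct memory path of Eq. (4), under the simplifying assumption that the gates are treated as constants (their own dependence on earlier states is ignored). The function names and numeric values are illustrative assumptions.

```python
import math

def rnn_gradient_factor(u, w, xs):
    """d h_T / d h_0 for the scalar unit h_t = tanh(w x_t + u h_{t-1})."""
    h, grad = 0.0, 1.0
    for x_t in xs:
        h = math.tanh(w * x_t + u * h)
        grad *= u * (1.0 - h * h)  # tanh derivative times the recurrent weight
    return grad

def additive_gradient_factor(forget_gates):
    """d c_T / d c_0 along the direct path of c_t = f_t c_{t-1} + i_t c~_t.

    Assumption: gate values are held fixed, so each step contributes f_t.
    """
    grad = 1.0
    for f_t in forget_gates:
        grad *= f_t
    return grad

xs = [0.5] * 50
rnn_grad = rnn_gradient_factor(u=0.9, w=1.0, xs=xs)
lstm_grad = additive_gradient_factor([0.99] * 50)
```

With these toy values, every factor of the first product is at most 0.9 in magnitude, so `rnn_grad` is bounded by 0.9^50 (about 0.005), whereas `lstm_grad` equals 0.99^50 (about 0.6) because the nearly saturated forget gate passes the error through almost unchanged.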

These two units, however, have a number of differences as well. One feature of the LSTM unit that is missing from the GRU is the controlled exposure of the memory content. In the LSTM unit, the amount of the memory content that is seen, or used by other units in the network, is controlled by the output gate. On the other hand, the GRU exposes its full content without any control.

Another difference is in the location of the input gate, or the corresponding reset gate. The LSTM unit computes the new memory content without any separate control of the amount of information flowing from the previous time step. Rather, the LSTM unit controls the amount of the new memory content being added to the memory cell independently from the forget gate. On the other hand, the GRU controls the information flow from the previous activation when computing the new, candidate activation, but does not independently control the amount of the candidate activation being added (the control is tied via the update gate).
