# A Critical Review of Recurrent Neural Networks for Sequence Learning

These are reading notes on this [paper](https://arxiv.org/pdf/1506.00019.pdf).

**Recurrent Neural Networks (RNNs)** are connectionist models that process sequential data one element at a time and can selectively pass information across sequence steps.

### Why Not Use Markov Models?

- Their states must be drawn from a modestly sized discrete state space $S$, because inference scales in time as $O(|S|^2)$
- The transition matrix also grows in size as $|S|^2$
- Each state depends only on the immediately previous state; conditioning on a longer window of previous states makes the state space (and the matrices) grow exponentially with the window size
- This renders Markov models computationally impractical for modeling long-range dependencies **(NILM?)**
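
To make the growth concrete, here is a small sketch that counts transition-table entries when a Markov model conditions on the previous `order` states. The function name and the example value of 100 states are illustrative assumptions, not from the paper.

```python
def transition_table_entries(num_states: int, order: int) -> int:
    """Count entries in the transition table of an order-`order` Markov chain.

    There are num_states**order possible contexts, and each context needs
    a distribution over num_states next states.
    """
    return (num_states ** order) * num_states


# With |S| = 100, each extra step of context multiplies the table by |S|.
for order in (1, 2, 3):
    print(order, transition_table_entries(100, order))
```

For `order = 1` this reduces to the $|S|^2$ transition matrix from the list above.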

### Neural Networks

$v_j=l_j(\sum_{j'}w_{jj'}v_{j'})$  
- $jj'$ denotes "to-from"
- $v_j$: output of node $j$
- $l_j$: activation function of node $j$
- $w_{jj'}$: weight
- $v_{j'}$: output of node $j'$

Call the weighted sum inside the parentheses the _incoming activation_, denoted $a_j$.
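
As a minimal sketch of the node equation, the snippet below computes $a_j$ and then $v_j$ for a single node. The helper name `node_output` and the sample weights are assumptions made for illustration.

```python
import math


def node_output(weights, inputs, activation):
    """Compute v_j = l_j(sum over j' of w_jj' * v_j') for one node j."""
    # Incoming activation a_j: weighted sum of the outputs feeding into node j.
    a_j = sum(w * v for w, v in zip(weights, inputs))
    return activation(a_j)


# Three incoming edges; a_j = 0.5*1.0 + (-1.0)*2.0 + 2.0*0.5 = -0.5
v_j = node_output([0.5, -1.0, 2.0], [1.0, 2.0, 0.5], math.tanh)
print(v_j)
```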

Common activation functions:
- Sigmoid: $\delta(z)=\frac{1}{1+e^{-z}}$
- tanh: $\phi (z)=\frac{e^z-e^{-z}}{e^z+e^{-z}}$
- Rectified Linear Unit (ReLU): $l_j(z)=\max(0,z)$
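
The three functions can be written directly from their definitions. This is a plain-Python sketch for scalar inputs; the naive `sigmoid` here is not guarded against overflow for very negative `z`.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid: squashes z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z: float) -> float:
    """Hyperbolic tangent: squashes z into (-1, 1)."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))


def relu(z: float) -> float:
    """Rectified linear unit: passes positive values, clips negatives to 0."""
    return max(0.0, z)


for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z))
```

Note that tanh is a rescaled sigmoid: $\phi(z) = 2\delta(2z) - 1$.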

Output activation function: softmax
- $\hat{y}_k=\frac{e^{a_k}}{\sum_{k'=1}^Ke^{a_{k'}}}$ for $k=1$ to $k=K$
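
A short sketch of the softmax over a list of output activations $a_1, \dots, a_K$. Subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the formula above; it leaves the result unchanged because the common factor cancels in the ratio.

```python
import math


def softmax(a):
    """Map K real-valued activations to K probabilities that sum to 1."""
    m = max(a)  # shift for numerical stability (an addition, not in the notes)
    exps = [math.exp(a_k - m) for a_k in a]
    total = sum(exps)
    return [e / total for e in exps]


y_hat = softmax([2.0, 1.0, 0.1])
print(y_hat, sum(y_hat))
```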

### Recurrent Neural Networks

RNN: a feedforward NN augmented with edges that span adjacent time steps  
At time $t$, nodes with recurrent edges receive input from both the current data point $x^{(t)}$ and the previous hidden state $h^{(t-1)}$

$h^{(t)}=\delta(W^{hx}x^{(t)}+W^{hh}h^{(t-1)}+b_h)$

$\hat{y}^{(t)}=\mathrm{softmax}(W^{yh}h^{(t)}+b_y)$

![](pics/RNN_Survey_1.png)
![](pics/RNN_Survey_2.png)
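
The two update equations can be sketched as a forward pass over a sequence. This is a plain-Python illustration with matrices stored as lists of rows; the function names, the toy weights, and the zero initial hidden state are assumptions, not taken from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(a):
    m = max(a)
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_step(x, h_prev, W_hx, W_hh, b_h, W_yh, b_y):
    """One time step: returns the new hidden state and the output distribution."""
    pre = [a + r + b for a, r, b in zip(matvec(W_hx, x), matvec(W_hh, h_prev), b_h)]
    h = [sigmoid(p) for p in pre]
    y_hat = softmax([a + b for a, b in zip(matvec(W_yh, h), b_y)])
    return h, y_hat


def rnn_forward(xs, h0, W_hx, W_hh, b_h, W_yh, b_y):
    """Run the recurrence over a whole sequence, carrying h across steps."""
    h, outputs = h0, []
    for x in xs:
        h, y_hat = rnn_step(x, h, W_hx, W_hh, b_h, W_yh, b_y)
        outputs.append(y_hat)
    return outputs, h


# Toy example: 2 inputs, 2 hidden nodes, 2 output classes.
W_hx = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [0.0, 0.2]]
W_yh = [[1.0, -1.0], [-1.0, 1.0]]
outputs, h_final = rnn_forward(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0],
    W_hx, W_hh, [0.0, 0.0], W_yh, [0.0, 0.0],
)
print(outputs)
```

The only thing linking one step to the next is `h`, which is what lets earlier inputs influence later outputs.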

### Training RNN

Training is extremely difficult because gradients propagated across many time steps tend to be either:
- vanishing gradients, or
- exploding gradients
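
A rough way to see both effects is to take a single hidden node with a linear recurrent edge of weight $w$, so that $h^{(t)} = w\,h^{(t-1)}$ and the gradient of $h^{(T)}$ with respect to $h^{(0)}$ is $w^T$. The sketch below is a simplification that ignores the nonlinearity and the inputs; the function name is illustrative.

```python
def gradient_through_time(w: float, steps: int) -> float:
    """d h^(T) / d h^(0) for the linear recurrence h^(t) = w * h^(t-1)."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # chain rule: each step contributes a factor of w
    return grad


print(gradient_through_time(0.9, 100))  # |w| < 1: shrinks toward 0 (vanishing)
print(gradient_through_time(1.1, 100))  # |w| > 1: grows without bound (exploding)
print(gradient_through_time(1.0, 100))  # w = 1: passes through unchanged
```

Only a recurrent weight of exactly 1 keeps the gradient stable over arbitrarily many steps.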

**Truncated backpropagation through time (TBPTT)**: one solution to the exploding gradient problem, which caps the number of time steps the error can be propagated through -> sacrifices the ability to learn long-range dependencies

### Long Short-Term Memory (LSTM)

![](pics/RNN_survey_3.png)

Each ordinary node in the hidden layer is replaced by a memory cell. Each cell has a self-connected recurrent edge with fixed weight 1, which ensures the gradient can pass across many time steps without vanishing or exploding.

![](pics/RNN_survey_4.png)

![](pics/RNN_survey_5.png)

Full algorithm of LSTM with forget gates:
- $g^{(t)}=\phi(W^{gx}x^{(t)}+W^{gh}h^{(t-1)}+b_g)$
- $i^{(t)}=\delta(W^{ix}x^{(t)}+W^{ih}h^{(t-1)}+b_i)$
- $f^{(t)}=\delta(W^{fx}x^{(t)}+W^{fh}h^{(t-1)}+b_f)$
- $o^{(t)}=\delta(W^{ox}x^{(t)}+W^{oh}h^{(t-1)}+b_o)$
- $s^{(t)}=g^{(t)}\odot i^{(t)}+s^{(t-1)}\odot f^{(t)}$
- $h^{(t)}=\phi(s^{(t)})\odot o^{(t)}$
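
The six updates can be sketched as one step function, where $g$ is the input node, $i$ the input gate, $f$ the forget gate, $o$ the output gate, and $s$ the internal state. This is a plain-Python illustration; the parameter dictionary and its key names (`W_gx`, `b_g`, and so on) are a hypothetical layout chosen to mirror the symbols above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W_x, x, W_h, h, b):
    """Return W_x x + W_h h + b, with matrices stored as lists of rows."""
    return [
        sum(w * v for w, v in zip(row_x, x))
        + sum(w * v for w, v in zip(row_h, h))
        + bias
        for row_x, row_h, bias in zip(W_x, W_h, b)
    ]


def lstm_step(x, h_prev, s_prev, p):
    """One LSTM step with forget gates; returns (h, s) for the current time step."""
    def pre(name):
        return affine(p["W_" + name + "x"], x, p["W_" + name + "h"], h_prev, p["b_" + name])

    g = [math.tanh(a) for a in pre("g")]  # input node (candidate values)
    i = [sigmoid(a) for a in pre("i")]    # input gate
    f = [sigmoid(a) for a in pre("f")]    # forget gate
    o = [sigmoid(a) for a in pre("o")]    # output gate
    # Internal state: gated new input plus gated previous state.
    s = [g_k * i_k + s_k * f_k for g_k, i_k, s_k, f_k in zip(g, i, s_prev, f)]
    h = [math.tanh(s_k) * o_k for s_k, o_k in zip(s, o)]
    return h, s


# One cell with all-zero parameters: g = 0 and every gate is 0.5,
# so the cell writes nothing and keeps half of its previous state.
p = {}
for name in "gifo":
    p["W_" + name + "x"] = [[0.0]]
    p["W_" + name + "h"] = [[0.0]]
    p["b_" + name] = [0.0]
h, s = lstm_step([1.0], [0.0], [2.0], p)
print(h, s)  # s is [1.0]
```

When the forget gate is close to 1 and the input gate close to 0, `s` is carried forward almost unchanged, which is the constant-error path described above.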

### Bidirectional Recurrent Neural Networks (BRNNs)

![](pics/RNN_survey_6.png)

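A BRNN has two hidden layers, both connected to the input and the output: $h$ has recurrent connections from past time steps, and $z$ has recurrent connections from future time steps.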

- $h^{(t)}=\delta(W^{hx}x^{(t)}+W^{hh}h^{(t-1)}+b_h)$
- $z^{(t)}=\delta(W^{zx}x^{(t)}+W^{zz}z^{(t+1)}+b_z)$
- $\hat{y}^{(t)}=\mathrm{softmax}(W^{yh}h^{(t)}+W^{yz}z^{(t)}+b_y)$
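
A sketch of the forward computation follows: one pass runs left to right to build $h$, a second runs right to left to build $z$, and each output combines the two states at the same time step. The function names, the parameter dictionary keys, and the zero initial states at both ends of the sequence are assumptions made for this illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(a):
    m = max(a)
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def hidden_pass(xs, W_in, W_rec, b, reverse=False):
    """Run one recurrent hidden layer over xs, forward or backward in time."""
    state = [0.0] * len(b)  # assumed zero initial state
    states = [None] * len(xs)
    order = range(len(xs) - 1, -1, -1) if reverse else range(len(xs))
    for t in order:
        pre = [a + r + c for a, r, c in zip(matvec(W_in, xs[t]), matvec(W_rec, state), b)]
        state = [sigmoid(v) for v in pre]
        states[t] = state
    return states


def brnn_forward(xs, p):
    """Return one output distribution per time step."""
    hs = hidden_pass(xs, p["W_hx"], p["W_hh"], p["b_h"])                # uses the past
    zs = hidden_pass(xs, p["W_zx"], p["W_zz"], p["b_z"], reverse=True)  # uses the future
    return [
        softmax([a + c + b for a, c, b in
                 zip(matvec(p["W_yh"], h), matvec(p["W_yz"], z), p["b_y"])])
        for h, z in zip(hs, zs)
    ]


p = {
    "W_hx": [[1.0]], "W_hh": [[0.5]], "b_h": [0.0],
    "W_zx": [[1.0]], "W_zz": [[0.5]], "b_z": [0.0],
    "W_yh": [[1.0], [-1.0]], "W_yz": [[1.0], [-1.0]], "b_y": [0.0, 0.0],
}
print(brnn_forward([[1.0], [0.0], [0.0]], p)[0])
print(brnn_forward([[1.0], [0.0], [5.0]], p)[0])  # first output shifts with the last input
```

Because the backward layer needs the whole sequence before it can start, this only works when the full input is available up front.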