# Recurrent neural network

## Why not a standard network

- inputs and outputs can have different lengths in different examples
- a standard network doesn't share features learned across different positions in the text

## Recurrent neural network

- $a^{<0>} = \overrightarrow{0}$
- $a^{<1>} = g_{1}(W_{aa}a^{<0>} + W_{ax}X^{<1>} + b_{a})$ (tanh/relu)
- $\hat{y}^{<1>} = g_{2}(W_{ya}a^{<1>} + b_{y})$ (sigmoid)
- $a^{<t>} = g(W_{aa}a^{<t-1>} + W_{ax}X^{<t>} + b_{a}) = g(W_{a}[a^{<t-1>}, X^{<t>}] + b_{a})$ where $W_{a} = [W_{aa} \vdots W_{ax}]$ and $[a^{<t-1>}, X^{<t>}] = \begin{bmatrix} a^{<t-1>} \\ X^{<t>} \end{bmatrix}$
- $\hat{y}^{<t>} = g(W_{ya}a^{<t>} + b_{y}) = g(W_{y}a^{<t>} + b_{y})$ 
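
The recurrence above can be sketched in NumPy. This is a minimal sketch, not a reference implementation: the sizes `n_a`, `n_x`, `n_y`, the random weights, and the helper names `sigmoid` and `rnn_step` are all assumptions made for illustration. It also checks numerically that the stacked form $W_{a}[a^{<t-1>}, X^{<t>}]$ gives the same activation as the two separate products.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(a_prev, x_t, W_aa, W_ax, W_ya, b_a, b_y):
    """One forward step: tanh for the activation, sigmoid for the output."""
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)
    y_hat_t = sigmoid(W_ya @ a_t + b_y)
    return a_t, y_hat_t

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
n_a, n_x, n_y, T_x = 4, 3, 1, 5
W_aa = rng.standard_normal((n_a, n_a))
W_ax = rng.standard_normal((n_a, n_x))
W_ya = rng.standard_normal((n_y, n_a))
b_a = np.zeros(n_a)
b_y = np.zeros(n_y)

xs = rng.standard_normal((T_x, n_x))  # one input vector per time step
a = np.zeros(n_a)                     # a^{<0>} is the zero vector
y_hats = []
for x_t in xs:
    a_prev = a
    a, y_hat = rnn_step(a_prev, x_t, W_aa, W_ax, W_ya, b_a, b_y)
    y_hats.append(y_hat)

# Stacked form: W_a = [W_aa | W_ax] applied to the concatenation [a_prev; x_t].
W_a = np.hstack([W_aa, W_ax])
stacked = np.tanh(W_a @ np.concatenate([a_prev, x_t]) + b_a)
assert np.allclose(stacked, a)
```

The same weights are reused at every time step, which is what lets the network handle sequences of any length.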

## Backpropagation through time

- $L^{<t>}(\hat{y}^{<t>}, y^{<t>}) = -y^{<t>}\log\hat{y}^{<t>} - (1-y^{<t>})\log(1-\hat{y}^{<t>})$
- $L(\hat{y},y) = \displaystyle\sum_{t=1}^{T_{y}}L^{<t>}(\hat{y}^{<t>}, y^{<t>})$
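
As a small worked example of the two formulas above, the sketch below computes the per-step binary cross-entropy and sums it over time. The function name `sequence_loss`, the clipping constant `eps`, and the sample predictions are assumptions added for illustration.

```python
import numpy as np

def sequence_loss(y_hat, y, eps=1e-12):
    """Return (total loss, per-step losses) for binary labels."""
    # Clipping avoids log(0); eps is an illustrative choice.
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y, dtype=float)
    per_step = -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))
    return per_step.sum(), per_step

# Hypothetical predictions and labels for a sequence with T_y = 3.
total, per_step = sequence_loss([0.9, 0.2, 0.7], [1, 0, 1])
print(per_step, total)
```

Backpropagation through time differentiates this summed loss, so gradients flow from each step's loss back through all earlier activations.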

## RNN types

- one to many (ex. music generation)
- many to one (ex. sentiment classification)
- many to many, $T_{x} = T_{y}$ (ex. named entity recognition)
- many to many, $T_{x} \neq T_{y}$ (ex. machine translation)

## Language modelling

- ex. P(The apple and pair salad) = $3.2 \times 10^{-13}$, P(The apple and pear salad) = $5.7 \times 10^{-13}$
- ex. "cats average 15 hours of sleep a day. (EOS)"
    - $L^{<t>}(\hat{y}^{<t>}, y^{<t>}) = -\displaystyle\sum_{i}y_{i}^{<t>}\log\hat{y}_{i}^{<t>}$
    - $L = \displaystyle\sum_{t}L^{<t>}(\hat{y}^{<t>}, y^{<t>})$
    - $P(y^{<1>}, y^{<2>}, y^{<3>}) = P(y^{<1>})P(y^{<2>}|y^{<1>})P(y^{<3>}|y^{<1>},y^{<2>})$
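
The link between the loss and the sentence probability can be checked numerically. In the sketch below, the 4-word vocabulary, the softmax outputs `y_hat`, and the target indices are made-up values for illustration. Each row of `y_hat` stands for the model's distribution over the next word given the previous words.

```python
import numpy as np

# Hypothetical softmax outputs: 3 time steps over a 4-word vocabulary.
y_hat = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.5, 0.2, 0.1],
                  [0.1, 0.1, 0.2, 0.6]])
targets = np.array([0, 1, 3])        # index of the true word at each step
y = np.eye(4)[targets]               # one-hot labels y^{<t>}

# Per-step cross-entropy, then the sum over time steps.
per_step = -(y * np.log(y_hat)).sum(axis=1)
L = per_step.sum()

# Chain rule: multiply the probability assigned to each true word.
sentence_prob = np.prod(y_hat[np.arange(3), targets])

# The total loss is the negative log of the sentence probability.
assert np.isclose(np.exp(-L), sentence_prob)
```

Because the labels are one-hot, only the probability of the true word survives in each per-step sum, so minimizing $L$ maximizes the probability of the training sentence.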
    
## Vanishing gradients with RNNs

- ex. "the cats, which, ..., were full" vs "the cat, which, ..., was full"
- capturing long-term dependencies is hard
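
To see why long-term dependencies are hard, consider a toy scalar RNN $a^{<t>} = \tanh(w\,a^{<t-1>} + x^{<t>})$. This is a simplified sketch under assumed values (`w = 0.9`, random inputs), not the full matrix case. The gradient of $a^{<t>}$ with respect to $a^{<0>}$ is a product of per-step factors $w(1 - (a^{<t>})^{2})$, each with magnitude at most $|w|$, so with $|w| < 1$ it shrinks at least geometrically.

```python
import numpy as np

def gradient_magnitudes(w, xs, a0=0.0):
    """Track |d a_t / d a_0| for a scalar tanh RNN, step by step."""
    a, grad = a0, 1.0
    magnitudes = []
    for x in xs:
        a = np.tanh(w * a + x)
        grad *= w * (1.0 - a ** 2)   # chain rule factor d a_t / d a_{t-1}
        magnitudes.append(abs(grad))
    return magnitudes

# Hypothetical weight and inputs, chosen for illustration.
rng = np.random.default_rng(0)
xs = rng.standard_normal(50)
grads = gradient_magnitudes(w=0.9, xs=xs)
print(grads[0], grads[-1])
```

After 50 steps the gradient is bounded by $0.9^{50}$, so an error at the end of the sequence barely influences parameters through the early time steps.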
    
## Gated recurrent unit

- RNN unit
    - $a^{<t>} = g(W_{a}[a^{<t-1>}, x^{<t>}] + b_{a})$ (where $g$ is tanh)
- GRU
    - let $c$ = memory cell
    - $c^{<t>} = a^{<t>}$
    - $\tilde{c}^{<t>} = \tanh(W_{c}[c^{<t-1>},x^{<t>}] + b_{c})$
    - $\Gamma_{u} = \sigma(W_{u}[c^{<t-1>},x^{<t>}] + b_{u})$
    - $c^{<t>} = \Gamma_{u}\tilde{c}^{<t>} + (1-\Gamma_{u}){c}^{<t-1>}$ (if vectors, multiplications are element-wise)
- Full GRU
    - $\tilde{c}^{<t>} = \tanh(W_{c}[\Gamma_{r}c^{<t-1>},x^{<t>}] + b_{c})$
    - $\Gamma_{u} = \sigma(W_{u}[c^{<t-1>},x^{<t>}] + b_{u})$
    - $\Gamma_{r} = \sigma(W_{r}[c^{<t-1>},x^{<t>}] + b_{r})$
    - $c^{<t>} = \Gamma_{u}\tilde{c}^{<t>} + (1-\Gamma_{u}){c}^{<t-1>}$
    - $a^{<t>} = c^{<t>}$
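
The full GRU equations can be sketched as a single step function. The parameter dictionary layout, the sizes, and the names `sigmoid` and `gru_step` are assumptions for illustration. The final check shows the property that helps with long-range memory: when the update gate is close to 0, the cell value is carried forward almost unchanged.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, p):
    """One full-GRU step; all gate multiplications are element-wise."""
    concat = np.concatenate([c_prev, x_t])
    gamma_u = sigmoid(p["W_u"] @ concat + p["b_u"])   # update gate
    gamma_r = sigmoid(p["W_r"] @ concat + p["b_r"])   # relevance gate
    gated = np.concatenate([gamma_r * c_prev, x_t])
    c_tilde = np.tanh(p["W_c"] @ gated + p["b_c"])    # candidate value
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev
    return c_t                                        # a_t equals c_t

# Hypothetical sizes and random parameters.
rng = np.random.default_rng(0)
n_c, n_x = 4, 3
params = {f"W_{k}": rng.standard_normal((n_c, n_c + n_x)) for k in "urc"}
params.update({f"b_{k}": np.zeros(n_c) for k in "urc"})

c_prev = rng.standard_normal(n_c)
x_t = rng.standard_normal(n_x)
c_t = gru_step(c_prev, x_t, params)

# A very negative update-gate bias drives gamma_u toward 0,
# so the memory cell keeps its previous value.
closed = dict(params, b_u=np.full(n_c, -50.0))
assert np.allclose(gru_step(c_prev, x_t, closed), c_prev)
```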
    
## Long short-term memory (LSTM)

- $\tilde{c}^{<t>} = \tanh(W_{c}[a^{<t-1>},x^{<t>}] + b_{c})$
- $\Gamma_{u} = \sigma(W_{u}[a^{<t-1>},x^{<t>}] + b_{u})$ (update)
- $\Gamma_{f} = \sigma(W_{f}[a^{<t-1>},x^{<t>}] + b_{f})$ (forget)
- $\Gamma_{o} = \sigma(W_{o}[a^{<t-1>},x^{<t>}] + b_{o})$ (output)
- $c^{<t>} = \Gamma_{u}\tilde{c}^{<t>} + \Gamma_{f}{c}^{<t-1>}$
- $a^{<t>} = \Gamma_{o}\tanh(c^{<t>})$
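
A minimal sketch of one LSTM step follows, assuming the standard formulation in which every gate reads $[a^{<t-1>}, x^{<t>}]$ and a separate forget gate scales $c^{<t-1>}$. The parameter dictionary, sizes, and function names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, p):
    """One LSTM step; gate multiplications are element-wise."""
    concat = np.concatenate([a_prev, x_t])
    c_tilde = np.tanh(p["W_c"] @ concat + p["b_c"])   # candidate value
    gamma_u = sigmoid(p["W_u"] @ concat + p["b_u"])   # update gate
    gamma_f = sigmoid(p["W_f"] @ concat + p["b_f"])   # forget gate
    gamma_o = sigmoid(p["W_o"] @ concat + p["b_o"])   # output gate
    c_t = gamma_u * c_tilde + gamma_f * c_prev
    a_t = gamma_o * np.tanh(c_t)
    return a_t, c_t

# Hypothetical sizes and random parameters.
rng = np.random.default_rng(0)
n_a, n_x = 4, 3
params = {f"W_{k}": rng.standard_normal((n_a, n_a + n_x)) for k in "cufo"}
params.update({f"b_{k}": np.zeros(n_a) for k in "cufo"})

a_prev = np.zeros(n_a)
c_prev = rng.standard_normal(n_a)
x_t = rng.standard_normal(n_x)
a_t, c_t = lstm_step(a_prev, c_prev, x_t, params)

# Update gate near 0 and forget gate near 1: the cell state passes through.
keep = dict(params, b_u=np.full(n_a, -50.0), b_f=np.full(n_a, 50.0))
_, c_kept = lstm_step(a_prev, c_prev, x_t, keep)
assert np.allclose(c_kept, c_prev)
```

Unlike the GRU, the update and forget gates are independent, so the cell can both keep its old value and add a new candidate in the same step.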

## Bidirectional RNN

- getting information from the future
    - ex. He said, "Teddy bears are on sale!"
    - ex. He said, "Teddy Roosevelt was a great President"
- $\hat{y}^{<t>} = g(W_{y}[\overrightarrow{a}^{<t>}, \overleftarrow{a}^{<t>}] + b_{y})$
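
The output formula above can be sketched by running one basic RNN left to right and a second one right to left, then concatenating their activations at each time step. The helper `run_rnn`, the sizes, and the random weights are assumptions for illustration. A plain tanh cell is used here, though GRU or LSTM cells would slot in the same way.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run_rnn(xs, W_aa, W_ax, b_a):
    """Return the activations of a basic tanh RNN for every step of xs."""
    a = np.zeros(W_aa.shape[0])
    states = []
    for x_t in xs:
        a = np.tanh(W_aa @ a + W_ax @ x_t + b_a)
        states.append(a)
    return np.stack(states)

# Hypothetical sizes and random parameters.
rng = np.random.default_rng(0)
T, n_a, n_x, n_y = 6, 4, 3, 1
xs = rng.standard_normal((T, n_x))
fwd = (rng.standard_normal((n_a, n_a)), rng.standard_normal((n_a, n_x)), np.zeros(n_a))
bwd = (rng.standard_normal((n_a, n_a)), rng.standard_normal((n_a, n_x)), np.zeros(n_a))

a_fwd = run_rnn(xs, *fwd)              # reads x^{<1>} ... x^{<T>}
a_bwd = run_rnn(xs[::-1], *bwd)[::-1]  # reads x^{<T>} ... x^{<1>}, re-aligned to time order

W_y = rng.standard_normal((n_y, 2 * n_a))
b_y = np.zeros(n_y)
y_hat = sigmoid(np.concatenate([a_fwd, a_bwd], axis=1) @ W_y.T + b_y)
print(y_hat.shape)
```

The prediction at step $t$ now depends on the whole sequence, which is why the full input must be available before any output can be produced.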

## Example

### Packages