# Recurrent neural network

## Why not a standard network

- Inputs and outputs can have different lengths in different examples.
- A standard network doesn't share features learned across different positions of the text.

## Recurrent neural network

- $a^{<0>} = \overrightarrow{0}$
- $a^{<1>} = g_{1}(W_{aa}a^{<0>} + W_{ax}X^{<1>} + b_{a})$ (tanh/relu)
- $\hat{y}^{<1>} = g_{2}(W_{ya}a^{<1>} + b_{y})$ (sigmoid)
- $a^{<t>} = g(W_{aa}a^{<t-1>} + W_{ax}X^{<t>} + b_{a}) = g(W_{a}[a^{<t-1>}, X^{<t>}] + b_{a})$, where $W_{a} = [W_{aa} \vdots W_{ax}]$ and $[a^{<t-1>}, X^{<t>}] = \begin{bmatrix} a^{<t-1>} \\ X^{<t>} \end{bmatrix}$
- $\hat{y}^{<t>} = g(W_{ya}a^{<t>} + b_{y}) = g(W_{y}a^{<t>} + b_{y})$ (writing $W_{y}$ for $W_{ya}$)
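
The stacked form $W_{a}[a^{<t-1>}, X^{<t>}]$ is only a rewrite of the two separate products. A minimal NumPy check of that equivalence might look like the following; the sizes `n_a = 4` and `n_x = 3` and the random values are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_x = 4, 3  # hypothetical layer sizes

W_aa = rng.standard_normal((n_a, n_a))
W_ax = rng.standard_normal((n_a, n_x))
b_a = rng.standard_normal((n_a, 1))
a_prev = rng.standard_normal((n_a, 1))
x_t = rng.standard_normal((n_x, 1))

# Separate form: W_aa a^{<t-1>} + W_ax x^{<t>} + b_a
a_separate = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)

# Stacked form: W_a = [W_aa | W_ax] side by side, inputs stacked vertically
W_a = np.hstack([W_aa, W_ax])          # shape (n_a, n_a + n_x)
stacked = np.vstack([a_prev, x_t])     # shape (n_a + n_x, 1)
a_stacked = np.tanh(W_a @ stacked + b_a)

forms_match = np.allclose(a_separate, a_stacked)
print(forms_match)
```

Stacking keeps a single weight matrix per gate, which is the notation the GRU and LSTM sections rely on.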

## Backpropagation through time

- $L^{<t>}(\hat{y}^{<t>}, y^{<t>}) = -y^{<t>}log\hat{y}^{<t>} - (1-y^{<t>})log(1-\hat{y}^{<t>})$
- $L(\hat{y},y) = \displaystyle\sum_{t=1}^{T_{y}}L^{<t>}(\hat{y}^{<t>}, y^{<t>})$
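
As a rough sketch of the two formulas above, the sequence loss is the per-step binary cross-entropy summed over time. The function names and the three example predictions below are hypothetical.

```python
import math

def step_loss(y_hat, y):
    """Binary cross-entropy for one time step."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

def sequence_loss(y_hats, ys):
    """Sum of the per-step losses over t = 1..T_y."""
    return sum(step_loss(y_hat, y) for y_hat, y in zip(y_hats, ys))

# Hypothetical predictions and labels for a sequence with T_y = 3
y_hats = [0.9, 0.2, 0.7]
ys = [1, 0, 1]
total = sequence_loss(y_hats, ys)
print(total)
```

Backpropagation through time differentiates this summed loss, so gradients flow from each $L^{<t>}$ back through all earlier time steps.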

## RNN types

- One to many (Ex. music generation)
- Many to one (Ex. sentiment classification)
- Many to many, $T_{x} = T_{y}$ (Ex. named entity recognition)
- Many to many, $T_{x} \neq T_{y}$ (Ex. machine translation)

## Language modelling

- Ex. P(The apple and pair salad) = $3.2 \times 10^{-13}$, P(The apple and pear salad) = $5.7 \times 10^{-13}$
- Ex. "cats average 15 hours of sleep a day. (EOS)"
    - $L^{<t>}(\hat{y}^{<t>}, y^{<t>}) = -\displaystyle\sum_{i}y_{i}^{<t>}log\hat{y}_{i}^{<t>}$
    - $L = \displaystyle\sum_{t}L^{<t>}(\hat{y}^{<t>}, y^{<t>})$
    - $P(y^{<1>}, y^{<2>}, y^{<3>}) = P(y^{<1>})P(y^{<2>}|y^{<1>})P(y^{<3>}|y^{<1>},y^{<2>})$
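
The loss and the chain-rule factorisation are two views of the same quantity: the summed cross-entropy equals the negative log of the sequence probability. The sketch below illustrates this under assumed values. The 4-word vocabulary, the softmax outputs, and the target indices are all made up, and time steps are laid out as rows here for readability.

```python
import numpy as np

# Hypothetical softmax outputs: one row per time step, one column per word
y_hat = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.1, 0.2, 0.6],
])
targets = np.array([0, 1, 3])  # index of the true word at each step

# Per-step cross-entropy with one-hot targets: L^{<t>} = -sum_i y_i log y_hat_i
y = np.eye(4)[targets]
step_losses = -np.sum(y * np.log(y_hat), axis=1)
total_loss = step_losses.sum()

# Chain rule: multiply the probability assigned to each true word
step_probs = y_hat[np.arange(len(targets)), targets]
sequence_prob = step_probs.prod()

print(total_loss, -np.log(sequence_prob))
```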
    
## Vanishing gradients with RNNs

- Ex. "the cats, which, ..., were full" vs "the cat, which, ..., was full"
- Capturing long-term dependencies is hard
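
One way to see why long-term dependencies are hard is to track how the gradient of a late hidden state with respect to the initial state shrinks as it is multiplied by one Jacobian per step. This is only an illustrative sketch: it assumes a bias-free, input-free recurrence $a^{<t>} = tanh(W_{aa}a^{<t-1>})$ and a weight matrix deliberately scaled so that its spectral norm is 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, T = 5, 30  # hypothetical hidden size and number of time steps

# Orthogonal matrix scaled by 0.5, so the spectral norm of W_aa is 0.5
Q, _ = np.linalg.qr(rng.standard_normal((n_a, n_a)))
W_aa = 0.5 * Q

a = rng.standard_normal((n_a, 1))
grad = np.eye(n_a)  # d a^{<0>} / d a^{<0>}
norms = []
for t in range(T):
    a = np.tanh(W_aa @ a)
    # Jacobian of one step: diag(1 - a^2) @ W_aa (row scaling via broadcasting)
    jac = (1 - a ** 2) * W_aa
    grad = jac @ grad  # chain rule: d a^{<t>} / d a^{<0>}
    norms.append(np.linalg.norm(grad, 2))

print(norms[0], norms[-1])
```

Each factor has norm at most 0.5 in this setup, so the product decays at least geometrically; with larger weights the same product can instead blow up.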
    
## Gated recurrent unit

- RNN unit
    - $a^{<t>} = g(W_{a}[a^{<t-1>}, x^{<t>}] + b_{a})$ (where $g$ is tanh)
- GRU
    - let $c$ = memory cell
    - $c^{<t>} = a^{<t>}$
    - $\tilde{c}^{<t>} = tanh(W_{c}[c^{<t-1>},x^{<t>}] + b_{c})$
    - $\Gamma_{u} = \sigma(W_{u}[c^{<t-1>},x^{<t>}] + b_{u})$
    - $c^{<t>} = \Gamma_{u}\tilde{c}^{<t>} + (1-\Gamma_{u}){c}^{<t-1>}$ (if vectors, multiplications are element-wise)
- Full GRU
    - $\tilde{c}^{<t>} = tanh(W_{c}[\Gamma_{r}c^{<t-1>},x^{<t>}] + b_{c})$
    - $\Gamma_{u} = \sigma(W_{u}[c^{<t-1>},x^{<t>}] + b_{u})$
    - $\Gamma_{r} = \sigma(W_{r}[c^{<t-1>},x^{<t>}] + b_{r})$
    - $c^{<t>} = \Gamma_{u}\tilde{c}^{<t>} + (1-\Gamma_{u}){c}^{<t-1>}$
    - $a^{<t>} = c^{<t>}$
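
The full GRU equations translate fairly directly into NumPy. The following is a minimal sketch rather than a reference implementation: the function name `gru_step`, the parameter dictionary keys, and the layer sizes are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, params):
    """One full-GRU step; c_prev has shape (n_c, m), x_t has shape (n_x, m)."""
    concat = np.vstack([c_prev, x_t])
    gamma_u = sigmoid(params["Wu"] @ concat + params["bu"])  # update gate
    gamma_r = sigmoid(params["Wr"] @ concat + params["br"])  # relevance gate
    # The candidate uses the relevance-gated previous cell state
    gated = np.vstack([gamma_r * c_prev, x_t])
    c_tilde = np.tanh(params["Wc"] @ gated + params["bc"])
    # Element-wise blend of candidate and previous memory
    c_next = gamma_u * c_tilde + (1 - gamma_u) * c_prev
    return c_next  # a^{<t>} = c^{<t>}

rng = np.random.default_rng(0)
n_c, n_x, m = 4, 3, 2  # hypothetical sizes
params = {k: rng.standard_normal((n_c, n_c + n_x)) for k in ("Wu", "Wr", "Wc")}
params.update({k: np.zeros((n_c, 1)) for k in ("bu", "br", "bc")})
c_prev = rng.standard_normal((n_c, m))
x_t = rng.standard_normal((n_x, m))

c_next = gru_step(c_prev, x_t, params)
print(c_next.shape)
```

When $\Gamma_{u}$ is close to 0 the cell simply copies $c^{<t-1>}$ forward, which is what lets the memory survive across many time steps.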
    
## Long short term memory (LSTM)

- $\tilde{c}^{<t>} = tanh(W_{c}[a^{<t-1>},x^{<t>}] + b_{c})$
- $\Gamma_{u} = \sigma(W_{u}[a^{<t-1>},x^{<t>}] + b_{u})$ (update)
- $\Gamma_{f} = \sigma(W_{f}[a^{<t-1>},x^{<t>}] + b_{f})$ (forget)
- $\Gamma_{o} = \sigma(W_{o}[a^{<t-1>},x^{<t>}] + b_{o})$ (output)
- $c^{<t>} = \Gamma_{u}\tilde{c}^{<t>} + \Gamma_{f}c^{<t-1>}$
- $a^{<t>} = \Gamma_{o}tanh(c^{<t>})$
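
A minimal NumPy sketch of one LSTM step is given below. It assumes the standard formulation, in which all three gates read $[a^{<t-1>}, x^{<t>}]$ and the forget gate scales $c^{<t-1>}$. The function name `lstm_step`, the dictionary keys, and the sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, params):
    """One LSTM step; a_prev and c_prev are (n_a, m), x_t is (n_x, m)."""
    concat = np.vstack([a_prev, x_t])
    c_tilde = np.tanh(params["Wc"] @ concat + params["bc"])  # candidate
    gamma_u = sigmoid(params["Wu"] @ concat + params["bu"])  # update gate
    gamma_f = sigmoid(params["Wf"] @ concat + params["bf"])  # forget gate
    gamma_o = sigmoid(params["Wo"] @ concat + params["bo"])  # output gate
    c_next = gamma_u * c_tilde + gamma_f * c_prev
    a_next = gamma_o * np.tanh(c_next)
    return a_next, c_next

rng = np.random.default_rng(0)
n_a, n_x, m = 4, 3, 2  # hypothetical sizes
params = {k: rng.standard_normal((n_a, n_a + n_x)) for k in ("Wc", "Wu", "Wf", "Wo")}
params.update({k: np.zeros((n_a, 1)) for k in ("bc", "bu", "bf", "bo")})
a_prev = np.zeros((n_a, m))
c_prev = rng.standard_normal((n_a, m))
x_t = rng.standard_normal((n_x, m))

a_next, c_next = lstm_step(a_prev, c_prev, x_t, params)
print(a_next.shape, c_next.shape)
```

Unlike the GRU, the update and forget gates are independent here, so the cell can add new content without being forced to discard the same fraction of the old content.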

## Bidirectional RNN

- Getting information from the future
    - Ex. He said, "Teddy bears are on sale!"
    - Ex. He said, "Teddy Roosevelt was a great President"
- $\hat{y}^{<t>} = g(W_{y}[\overrightarrow{a}^{<t>}, \overleftarrow{a}^{<t>}] + b_{y})$
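
The output formula can be sketched as two independent recurrent passes, one left to right and one right to left, whose states are stacked before the output layer. Everything named below (`run_direction`, the weight names, the sizes, and the choice of a sigmoid for $g$) is an assumption made for illustration.

```python
import numpy as np

def run_direction(x, W_a, b_a, reverse=False):
    """Simple tanh RNN over x of shape (n_x, m, T_x), optionally right to left."""
    n_x, m, T_x = x.shape
    n_a = W_a.shape[0]
    a = np.zeros((n_a, m, T_x))
    a_prev = np.zeros((n_a, m))
    steps = range(T_x - 1, -1, -1) if reverse else range(T_x)
    for t in steps:
        a_prev = np.tanh(W_a @ np.vstack([a_prev, x[:, :, t]]) + b_a)
        a[:, :, t] = a_prev
    return a

rng = np.random.default_rng(0)
n_x, n_a, n_y, m, T_x = 3, 4, 2, 5, 6  # hypothetical sizes
x = rng.standard_normal((n_x, m, T_x))
W_fwd = 0.5 * rng.standard_normal((n_a, n_a + n_x))
W_bwd = 0.5 * rng.standard_normal((n_a, n_a + n_x))
b_a = np.zeros((n_a, 1))
W_y = rng.standard_normal((n_y, 2 * n_a))
b_y = np.zeros((n_y, 1))

a_fwd = run_direction(x, W_fwd, b_a)                # uses x^{<1>} .. x^{<t>}
a_bwd = run_direction(x, W_bwd, b_a, reverse=True)  # uses x^{<T_x>} .. x^{<t>}

# Stack both directions and apply the output layer at every time step
a_both = np.concatenate([a_fwd, a_bwd], axis=0)     # (2 * n_a, m, T_x)
z = np.einsum("ya,amt->ymt", W_y, a_both) + b_y[:, :, None]
y_hat = 1.0 / (1.0 + np.exp(-z))
print(y_hat.shape)
```

Because the backward states depend on later inputs, the prediction for "Teddy" can use the word that follows it, at the cost of needing the whole sequence before any output is produced.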

## Example

### Packages

```python
import numpy as np
```

### Forward prop

<img src="img/RNN.png" style="width:500;height:300px;">

Input
- $x^{(i) \langle t \rangle }$ is a one-dimensional input vector.
    - Ex. A language with a 5000-word vocabulary can be one-hot encoded into vectors with 5000 units, so $x^{(i)\langle t \rangle}$ has the shape (5000,).
- Let $n_x$ be the number of units in a single timestep of a single training example.

Time steps
- Ex. if there are 10 time steps, $T_{x} = 10$

Batches
- Assume mini-batches with $m=20$ training examples.
- Stack the $20$ examples $x^{(i)}$ as columns. With the 5000-word vocabulary and 10 time steps above, the resulting tensor has the shape $(5000,20,10)$.
- So the shape of mini-batch is $(n_x,m,T_x)$

2D slice of each step
- For each time step $t$, take a 2D slice of the mini-batch with shape $(n_x,m)$.
- Let this 2D slice be $x^{\langle t \rangle}$.

Hidden state
- For a single training example, the hidden state has length $n_{a}$.
- For a mini-batch of $m$ training examples, the hidden state has shape $(n_{a},m)$.
- Including the time steps, the shape of the hidden state is $(n_{a}, m, T_x)$.
- At each time step, use a 2D slice of shape $(n_{a}, m)$.
- Let this 2D slice be $a^{\langle t \rangle}$.

Prediction
- $\hat{y}$ is 3D tensor of shape $(n_{y}, m, T_{y})$.
    - $n_{y}$: number of units in the vector representing the prediction.
    - $m$: number of examples in mini-batch.
    - $T_{y}$: number of time steps in the prediction.
- For a single time step $t$, the 2D slice $\hat{y}^{\langle t \rangle}$ has shape $(n_{y}, m)$.
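
The shapes above can be checked with a small NumPy sketch. It reuses the numbers from the text ($n_x = 5000$, $m = 20$, $T_x = 10$); the hidden size `n_a = 64`, the random word indices, and the time index `t = 3` are assumptions for this example.

```python
import numpy as np

n_x, m, T_x = 5000, 20, 10  # vocabulary size, batch size, time steps
n_a = 64                    # hypothetical hidden-state size

# Hypothetical word index for every example and time step
rng = np.random.default_rng(0)
word_ids = rng.integers(0, n_x, size=(m, T_x))

# One-hot encode into a tensor of shape (n_x, m, T_x)
x = np.zeros((n_x, m, T_x))
x[word_ids, np.arange(m)[:, None], np.arange(T_x)[None, :]] = 1.0

a = np.zeros((n_a, m, T_x))  # hidden states for all time steps

t = 3
x_t = x[:, :, t]  # 2D slice of shape (n_x, m)
a_t = a[:, :, t]  # 2D slice of shape (n_a, m)
print(x.shape, x_t.shape, a_t.shape)
```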

#### RNN cell

<img src="img/rnn_step_forward_figure2_v3a.png" style="width:700px;height:300px;">


```python
def softmax(z):
    """Column-wise softmax (assumed helper: each column is one example)."""
    e_z = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e_z / e_z.sum(axis=0, keepdims=True)


def rnn_cell_forward(xt, a_prev, parameters):
    """
    Implements a single forward step of the RNN-cell as described in Figure (2)

    Arguments:
    xt -- input data at timestep "t", numpy array of shape (n_x, m)
    a_prev -- hidden state at timestep "t-1", numpy array of shape (n_a, m)
    parameters -- python dictionary containing:
                        Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
                        Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
                        Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        ba --  Bias, numpy array of shape (n_a, 1)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
    Returns:
    a_next -- next hidden state, of shape (n_a, m)
    yt_pred -- prediction at timestep "t", numpy array of shape (n_y, m)
    cache -- tuple of values needed for the backward pass, contains (a_next, a_prev, xt, parameters)
    """

    # Retrieve parameters from "parameters"
    Wax = parameters["Wax"]
    Waa = parameters["Waa"]
    Wya = parameters["Wya"]
    ba = parameters["ba"]
    by = parameters["by"]

    # Next hidden state: a<t> = tanh(Waa a<t-1> + Wax x<t> + ba)
    a_next = np.tanh(np.dot(Waa, a_prev) + np.dot(Wax, xt) + ba)
    # Prediction of the current cell: y_hat<t> = softmax(Wya a<t> + by)
    yt_pred = softmax(np.dot(Wya, a_next) + by)

    # Store the values needed for backward propagation
    cache = (a_next, a_prev, xt, parameters)

    return a_next, yt_pred, cache
```

#### RNN forward pass

<img src="img/rnn_forward_sequence_figure3_v3a.png" style="width:800px;height:180px;">

Inputs each step:
- $a^{\langle t-1 \rangle}$: Hidden state from the previous cell.
- $x^{\langle t \rangle}$: Current time-step's input data.

Outputs each step:
- Hidden state ($a^{\langle t \rangle}$)
- Prediction ($\hat{y}^{\langle t \rangle}$)

The weights and biases $(W_{aa}, W_{ax}, W_{ya}, b_{a}, b_{y})$ are re-used at every time step.

```python
def rnn_forward(x, a0, parameters):
    """
    Implement the forward propagation of the recurrent neural network described in Figure (3).

    Arguments:
    x -- Input data for every time-step, of shape (n_x, m, T_x).
    a0 -- Initial hidden state, of shape (n_a, m)
    parameters -- python dictionary containing:
                        Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
                        Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
                        Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        ba --  Bias numpy array of shape (n_a, 1)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)

    Returns:
    a -- Hidden states for every time-step, numpy array of shape (n_a, m, T_x)
    y_pred -- Predictions for every time-step, numpy array of shape (n_y, m, T_x)
    caches -- tuple of values needed for the backward pass, contains (list of caches, x)
    """

    # List of the per-step caches
    caches = []

    # Retrieve dimensions from the shapes of x and parameters["Wya"]
    n_x, m, T_x = x.shape
    n_y, n_a = parameters["Wya"].shape

    # Initialize "a" and "y_pred" with zeros
    a = np.zeros((n_a, m, T_x))
    y_pred = np.zeros((n_y, m, T_x))

    # The first cell receives the initial hidden state
    a_next = a0

    # Loop over all time-steps
    for t in range(T_x):
        # Feed the previous hidden state (a_next), not the still-empty a[:, :, t]
        a_next, yt_pred, cache = rnn_cell_forward(x[:, :, t], a_next, parameters)
        # Save the new hidden state
        a[:, :, t] = a_next
        # Save the prediction
        y_pred[:, :, t] = yt_pred
        # Keep the cache for the backward pass
        caches.append(cache)

    # Store the values needed for backward propagation
    caches = (caches, x)

    return a, y_pred, caches
```

### LSTM (Long Short-Term Memory)

- Addresses vanishing gradient.

<img src="img/LSTM_figure4_v3a.png" style="width:500;height:400px;">

#### Forget gate $\mathbf{\Gamma}_{f}$

- Tensor containing values between 0 and 1.
    - If a unit in the forget gate has a value close to 0, the LSTM will "forget" the corresponding value stored in the previous cell state.
    - If a unit in the forget gate has a value close to 1, the LSTM will mostly remember the corresponding value in the stored state.
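
A tiny numeric illustration of this behaviour, assuming the forget gate is applied to the previous cell state by element-wise multiplication as in the standard LSTM update. The three gate and state values are made up.

```python
import numpy as np

# Hypothetical previous cell state and forget-gate activations (3 units, 1 example)
c_prev = np.array([[2.0], [-1.5], [0.8]])
gamma_f = np.array([[0.01], [0.99], [0.5]])

# Element-wise product: the part of the old state that survives
kept = gamma_f * c_prev
print(kept.ravel())
```

The first unit is almost erased, the second is almost fully kept, and the third is halved.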
