# Recurrent neural network

- Useful for sequence data such as natural-language text (NLP), because the hidden state acts as a "memory" of earlier time steps.
- Notation
    - $[l]$: $l^{th}$ layer
    - $(i)$: $i^{th}$ example
    - $\langle t \rangle$: $t^{th}$ time step
    - $i$: $i^{th}$ entry of a vector
- Input $x$ is represented by a 3-D tensor $(n_{x},m,T_{x})$
    - $n_{x}$: number of input units, i.e. the length of the input vector for a single example at a single time step. For example, a vocabulary of 5000 words one-hot encoded gives $n_{x} = 5000$.
    - $m$: number of training examples.
    - $T_{x}$: number of time steps.
- Hidden state $a$ is similarly represented by a 3-D tensor $(n_{a},m,T_{x})$
- Prediction $\hat{y}$ is similarly represented by a 3-D tensor $(n_{y},m,T_{y})$
    - $T_{y}$: number of time steps in the prediction.
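
To make the $(n_{x}, m, T_{x})$ layout concrete, here is a minimal sketch that one-hot encodes a toy batch. The sizes and the `word_ids` array are made up for illustration and are not part of the original notes.

```python
import numpy as np

# Hypothetical toy sizes: vocabulary of 5 words, 2 examples, 3 time steps.
n_x, m, T_x = 5, 2, 3

# word_ids[i, t] is the vocabulary index of the word at time step t of example i.
word_ids = np.array([[0, 3, 1],
                     [4, 4, 2]])

# One-hot encode into a tensor of shape (n_x, m, T_x).
x = np.zeros((n_x, m, T_x))
for i in range(m):
    for t in range(T_x):
        x[word_ids[i, t], i, t] = 1.0

# Slicing at a fixed time step gives a matrix of shape (n_x, m):
# one column per example, which is what a single RNN cell consumes.
x_t = x[:, :, 0]
print(x.shape, x_t.shape)
```

Each column of `x[:, :, t]` contains exactly one `1`, at the row given by the word index.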

## Single RNN cell

- Input
    - $x^{\langle t \rangle}$: current input.
    - $a^{\langle t-1 \rangle}$: previous hidden state.
- Output
    - $a^{\langle t \rangle}$: current hidden state, passed to the RNN cell at the next time step.
    - $\hat{y}^{\langle t \rangle}$: current prediction. 
- Parameters
    - The weights and biases $(W_{aa}, W_{ax}, W_{ya}, b_{a}, b_{y})$ are reused at every time step.
    
$$a^{\langle t \rangle} = \tanh(W_{aa} a^{\langle t-1 \rangle} + W_{ax} x^{\langle t \rangle} + b_a)$$

$$\hat{y}^{\langle t \rangle} = \textrm{softmax}(W_{ya} a^{\langle t \rangle} + b_y)$$

In [1]:
import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

In [4]:
def rnn_cell_forward(xt, a_prev, parameters):
    """
    Implements a single forward step of RNN-cell.

    Arguments:
    xt -- Input data at timestep "t", numpy array of shape (n_x, m)
    a_prev -- Hidden state at timestep "t-1", numpy array of shape (n_a, m)
    parameters -- Python dictionary containing:
        Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
        Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
        Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
        ba --  Bias, numpy array of shape (n_a, 1)
        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
        
    Returns:
    a_next -- Next hidden state, numpy array of shape (n_a, m)
    yt_pred -- Prediction at timestep "t", numpy array of shape (n_y, m)
    cache -- Tuple of values needed for the backward pass, contains (a_next, a_prev, xt, parameters)
    """
    
    # Retrieve parameters from "parameters"
    Wax = parameters["Wax"]
    Waa = parameters["Waa"]
    Wya = parameters["Wya"]
    ba = parameters["ba"]
    by = parameters["by"]
    
    a_next = np.tanh(np.dot(Waa, a_prev) + np.dot(Wax, xt) + ba)
    yt_pred = softmax(np.dot(Wya, a_next) + by)   
    
    # store values you need for backward propagation in cache
    cache = (a_next, a_prev, xt, parameters)
    
    return a_next, yt_pred, cache
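
As a quick shape check on the two cell equations, the following sketch evaluates them once with random toy parameters. The dimensions and the random values are assumptions chosen only for illustration; the softmax is written inline and normalizes each column (example) separately.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions.
n_x, n_a, n_y, m = 3, 5, 2, 4

xt = rng.standard_normal((n_x, m))       # input at time step t
a_prev = rng.standard_normal((n_a, m))   # hidden state at time step t-1
Wax = rng.standard_normal((n_a, n_x))
Waa = rng.standard_normal((n_a, n_a))
Wya = rng.standard_normal((n_y, n_a))
ba = rng.standard_normal((n_a, 1))
by = rng.standard_normal((n_y, 1))

# (n_a, n_a) @ (n_a, m) + (n_a, n_x) @ (n_x, m) -> (n_a, m).
# The bias ba has shape (n_a, 1) and is broadcast across the m columns.
a_next = np.tanh(Waa @ a_prev + Wax @ xt + ba)

# Column-wise softmax over the n_y classes, shifted by the column max for stability.
z = Wya @ a_next + by                    # shape (n_y, m)
e = np.exp(z - z.max(axis=0, keepdims=True))
yt_pred = e / e.sum(axis=0, keepdims=True)

print(a_next.shape, yt_pred.shape)
```

Every column of `yt_pred` sums to 1, and every entry of `a_next` lies in $[-1, 1]$ because of the $\tanh$.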

## RNN forward pass

In [3]:
def rnn_forward(x, a0, parameters):
    """
    Implement the forward propagation of RNN.

    Arguments:
    x -- Input data for every time-step, numpy array of shape (n_x, m, T_x)
    a0 -- Initial hidden state, numpy array of shape (n_a, m)
    parameters -- Python dictionary containing:
        Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
        Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
        Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
        ba -- Bias numpy array of shape (n_a, 1)
        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)

    Returns:
    a -- Hidden states for every time-step, numpy array of shape (n_a, m, T_x)
    y_pred -- Predictions for every time-step, numpy array of shape (n_y, m, T_x)
    caches -- Tuple of values needed for the backward pass, contains (list of caches, x)
    """
    
    # Initialize "caches" which will contain the list of all caches
    caches = []
    
    # Retrieve dimensions from shapes of x and parameters["Wya"]
    n_x, m, T_x = x.shape
    n_y, n_a = parameters["Wya"].shape
        
    # Initialize "a" and "y_pred" with zeros
    a = np.zeros((n_a, m, T_x))
    y_pred = np.zeros((n_y, m, T_x))
    
    # Initialize a_next with the initial hidden state
    a_next = a0
    
    # loop over all time-steps
    for t in range(T_x):
        # Update next hidden state, compute the prediction, get the cache.
        # The previous hidden state is a_next (a0 on the first iteration), not a[:,:,t], which is still zero here.
        a_next, yt_pred, cache = rnn_cell_forward(x[:,:,t], a_next, parameters)
        # Save the value of the new "next" hidden state in a
        a[:,:,t] = a_next
        # Save the value of the prediction in y
        y_pred[:,:,t] = yt_pred
        # Append "cache" to "caches"
        caches.append(cache)
    
    # store values needed for backward propagation in cache
    caches = (caches, x)
    
    return a, y_pred, caches
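
The "memory" of an RNN comes from threading the hidden state through the loop. The sketch below is a condensed version of the recurrence (hidden state only, no prediction or cache) with made-up toy sizes and random weights; it changes only the first input and checks that the last hidden state changes too.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions and small random weights.
n_x, n_a, m, T_x = 3, 5, 1, 4
Wax = 0.5 * rng.standard_normal((n_a, n_x))
Waa = 0.5 * rng.standard_normal((n_a, n_a))
ba = np.zeros((n_a, 1))

def hidden_states(x, a0):
    """Condensed recurrence: returns hidden states of shape (n_a, m, T_x)."""
    a_prev, states = a0, []
    for t in range(x.shape[2]):
        # The state from the previous step is fed back in at every step.
        a_prev = np.tanh(Waa @ a_prev + Wax @ x[:, :, t] + ba)
        states.append(a_prev)
    return np.stack(states, axis=2)

x = rng.standard_normal((n_x, m, T_x))
a0 = np.zeros((n_a, m))
a = hidden_states(x, a0)

# Perturb only the input at the first time step.
x_changed = x.copy()
x_changed[:, :, 0] += 1.0
a_changed = hidden_states(x_changed, a0)

# Largest change in the hidden state at the last time step.
diff_last = np.abs(a_changed[:, :, -1] - a[:, :, -1]).max()
print(diff_last)
```

A nonzero `diff_last` shows that information from the first time step reaches the last one through the hidden state.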

## RNN Limitations

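- Vanishing gradient: gradients propagated back through many time steps tend to shrink toward zero, so a plain RNN struggles to learn long-range dependencies.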
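
A rough numerical illustration of the effect, under simplifying assumptions that are not in the original notes: inputs and bias are dropped, and $W_{aa}$ is a random matrix rescaled to spectral norm 0.9. The Jacobian of one step is $\mathrm{diag}(1 - (a^{\langle t \rangle})^2)\, W_{aa}$, and the product over many steps shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, T = 8, 50

# Assumption for this sketch: recurrent weights with spectral norm 0.9.
R = rng.standard_normal((n_a, n_a))
Waa = 0.9 * R / np.linalg.norm(R, 2)

a = rng.standard_normal((n_a, 1))
J = np.eye(n_a)          # accumulates d a<t> / d a<0>
norms = []
for t in range(T):
    a = np.tanh(Waa @ a)                          # inputs and bias omitted
    # One-step Jacobian: diag(1 - a<t>^2) @ Waa, chained onto the product so far.
    J = np.diag(1.0 - a[:, 0] ** 2) @ Waa @ J
    norms.append(np.linalg.norm(J, 2))

print(norms[0], norms[-1])
```

Because the $\tanh$ derivative is at most 1, the norm after $t$ steps is bounded by $0.9^{t}$ in this setup, so the gradient with respect to early hidden states becomes negligible for long sequences.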

## LSTM 

$$\mathbf{\Gamma}_{f}^{\langle t \rangle} = \sigma(\mathbf{W}_{f}[\mathbf{a}^{\langle t-1 \rangle}, \mathbf{x}^{\langle t \rangle}] + \mathbf{b}_{f})$$

- Forget gate
    - A tensor containing values between $0$ and $1$. (Sigmoid function $\sigma$)
        - If a unit is close to $0$, the LSTM forgets the corresponding value in the previous cell state.
        - If a unit is close to $1$, the LSTM keeps the corresponding value in the previous cell state.
    - $a^{\langle t-1 \rangle}$ and $x^{\langle t \rangle}$ are concatenated together, then multiplied by $\mathbf{W}_{f}$.
    - $\mathbf{\Gamma}_{f}$ has the same dimension as the previous cell state $c^{\langle t-1 \rangle}$.
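
The concatenation is only a notational shortcut: multiplying $\mathbf{W}_{f}$ by the stacked vector is the same as applying one block of $\mathbf{W}_{f}$ to $a^{\langle t-1 \rangle}$ and another to $x^{\langle t \rangle}$. A minimal sketch with made-up toy sizes (the names `Wfa` and `Wfx` are introduced here for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy dimensions.
n_a, n_x, m = 3, 4, 2
a_prev = rng.standard_normal((n_a, m))
xt = rng.standard_normal((n_x, m))
Wf = rng.standard_normal((n_a, n_a + n_x))
bf = rng.standard_normal((n_a, 1))

# Stack a_prev on top of xt: shape (n_a + n_x, m).
concat = np.concatenate((a_prev, xt), axis=0)
z_concat = Wf @ concat + bf

# Equivalent form: split Wf into the block acting on a_prev and the block acting on xt.
Wfa, Wfx = Wf[:, :n_a], Wf[:, n_a:]
z_split = Wfa @ a_prev + Wfx @ xt + bf

# Forget gate: sigmoid squashes every entry into (0, 1).
gamma_f = 1.0 / (1.0 + np.exp(-z_concat))
print(concat.shape, gamma_f.shape, np.allclose(z_concat, z_split))
```

`gamma_f` has shape `(n_a, m)`, the same as the cell state it is multiplied with.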
    
$$\mathbf{\tilde{c}}^{\langle t \rangle} = \tanh\left( \mathbf{W}_{c} [\mathbf{a}^{\langle t-1 \rangle}, \mathbf{x}^{\langle t \rangle}] + \mathbf{b}_{c} \right)$$

- Candidate value
    - A tensor containing values between $-1$ and $1$. ($\tanh$ function)
    - A tensor containing information that may be stored in the current cell state $\mathbf{c}^{\langle t \rangle}$.
    
$$\mathbf{\Gamma}_{i}^{\langle t \rangle} = \sigma(\mathbf{W}_{i}[\mathbf{a}^{\langle t-1 \rangle}, \mathbf{x}^{\langle t \rangle}] + \mathbf{b}_{i})$$

- Update gate
    - A tensor containing values between $0$ and $1$. (Sigmoid function $\sigma$)
        - If a unit is close to $1$, the candidate $\tilde{\mathbf{c}}^{\langle t \rangle}$ is passed on to the cell state $\mathbf{c}^{\langle t \rangle}$.
        - If a unit is close to $0$, the candidate $\tilde{\mathbf{c}}^{\langle t \rangle}$ is not passed on to the cell state $\mathbf{c}^{\langle t \rangle}$.
        
$$\mathbf{c}^{\langle t \rangle} = \mathbf{\Gamma}_{f}^{\langle t \rangle}* \mathbf{c}^{\langle t-1 \rangle} + \mathbf{\Gamma}_{i}^{\langle t \rangle} *\mathbf{\tilde{c}}^{\langle t \rangle}$$

- Cell state
    - "memory" that gets passed to future time steps.
    - Previous cell state is weighted by forget gate.
    - Candidate value is weighted by update gate.
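
A small worked example of the cell-state update with hand-picked gate values (all numbers are made up for illustration). The three units show the three typical cases: keep the old value, overwrite it with the candidate, and blend the two.

```python
import numpy as np

c_prev = np.array([[2.0], [-1.0], [0.5]])   # previous cell state
cct = np.array([[0.3], [0.9], [-0.7]])      # candidate value (c tilde)
ft = np.array([[1.0], [0.0], [0.5]])        # forget gate
it = np.array([[0.0], [1.0], [0.5]])        # update gate

# Element-wise gating, as in the cell-state equation.
c_next = ft * c_prev + it * cct
print(c_next.ravel())
# unit 0: ft=1, it=0   -> keeps the old value 2.0
# unit 1: ft=0, it=1   -> replaced by the candidate 0.9
# unit 2: ft=it=0.5    -> 0.5*0.5 + 0.5*(-0.7) = -0.1
```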
    
$$\mathbf{\Gamma}_{o}^{\langle t \rangle} = \sigma(\mathbf{W}_{o}[\mathbf{a}^{\langle t-1 \rangle}, \mathbf{x}^{\langle t \rangle}] + \mathbf{b}_{o})$$ 

- Output gate
    - A tensor containing values between $0$ and $1$. (Sigmoid function $\sigma$)
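    - Decides which parts of the cell state are exposed in the hidden state $\mathbf{a}^{\langle t \rangle}$, and therefore what is used for the prediction.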
    
$$ \mathbf{a}^{\langle t \rangle} = \mathbf{\Gamma}_{o}^{\langle t \rangle} * \tanh(\mathbf{c}^{\langle t \rangle})$$

- Hidden state
    - Used to determine the three gates ($\mathbf{\Gamma}_{f}, \mathbf{\Gamma}_{i}, \mathbf{\Gamma}_{o}$) and the candidate value of the next time step.
    - Also used for the prediction $\mathbf{y}^{\langle t \rangle}_{pred}$.
    
$$\mathbf{y}^{\langle t \rangle}_{pred} = \textrm{softmax}(\mathbf{W}_{y} \mathbf{a}^{\langle t \rangle} + \mathbf{b}_{y})$$

- Prediction
    - Since this is classification, softmax is used.
    

In [1]:
def lstm_cell_forward(xt, a_prev, c_prev, parameters):
    """
    Implement a single forward step of the LSTM-cell.

    Arguments:
    xt -- Input data at timestep "t", numpy array of shape (n_x, m)
    a_prev -- Hidden state at timestep "t-1", numpy array of shape (n_a, m)
    c_prev -- Memory state at timestep "t-1", numpy array of shape (n_a, m)
    parameters -- Python dictionary containing:
        Wf -- Weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
        bf -- Bias of the forget gate, numpy array of shape (n_a, 1)
        Wi -- Weight matrix of the update gate, numpy array of shape (n_a, n_a + n_x)
        bi -- Bias of the update gate, numpy array of shape (n_a, 1)
        Wc -- Weight matrix of the first "tanh", numpy array of shape (n_a, n_a + n_x)
        bc -- Bias of the first "tanh", numpy array of shape (n_a, 1)
        Wo -- Weight matrix of the output gate, numpy array of shape (n_a, n_a + n_x)
        bo -- Bias of the output gate, numpy array of shape (n_a, 1)
        Wy -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
                        
    Returns:
    a_next -- next hidden state, numpy array of shape (n_a, m)
    c_next -- next memory state, numpy array of shape (n_a, m)
    yt_pred -- prediction at timestep "t", numpy array of shape (n_y, m)
    cache -- tuple of values needed for the backward pass, contains (a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, parameters)
    
    Note: ft/it/ot stand for the forget/update/output gates, cct stands for the candidate value (c tilde),
          c stands for the cell state (memory)
    """

    # Retrieve parameters from "parameters"
    Wf = parameters["Wf"] # forget gate weight
    bf = parameters["bf"]
    Wi = parameters["Wi"] # update gate weight 
    bi = parameters["bi"] 
    Wc = parameters["Wc"] # candidate value weight
    bc = parameters["bc"]
    Wo = parameters["Wo"] # output gate weight
    bo = parameters["bo"]
    Wy = parameters["Wy"] # prediction weight
    by = parameters["by"]
    
    # Retrieve dimensions from shapes of xt and Wy
    n_x, m = xt.shape
    n_y, n_a = Wy.shape

    # Concatenate a_prev and xt along the feature axis, shape (n_a + n_x, m)
    concat = np.concatenate((a_prev, xt), axis=0)

    # Compute values for ft, it, cct, c_next, ot, a_next
    ft = sigmoid(np.dot(Wf, concat) + bf)        # forget gate
    it = sigmoid(np.dot(Wi, concat) + bi)        # update gate
    cct = np.tanh(np.dot(Wc, concat) + bc)       # candidate value
    c_next = np.multiply(ft, c_prev) + np.multiply(it, cct)    # cell state
    ot = sigmoid(np.dot(Wo, concat) + bo)        # output gate
    a_next = np.multiply(ot, np.tanh(c_next))    # hidden state
    
    # Compute prediction of the LSTM cell
    yt_pred = softmax(np.dot(Wy, a_next) + by)

    # store values needed for backward propagation in cache
    cache = (a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, parameters)

    return a_next, c_next, yt_pred, cache

In [2]:
def lstm_forward(x, a0, parameters):
    """
    Implement the forward propagation of the recurrent neural network using an LSTM-cell.

    Arguments:
    x -- Input data for every time-step, numpy array of shape (n_x, m, T_x)
    a0 -- Initial hidden state, numpy array of shape (n_a, m)
    parameters -- Python dictionary containing:
        Wf -- Weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
        bf -- Bias of the forget gate, numpy array of shape (n_a, 1)
        Wi -- Weight matrix of the update gate, numpy array of shape (n_a, n_a + n_x)
        bi -- Bias of the update gate, numpy array of shape (n_a, 1)
        Wc -- Weight matrix of the first "tanh", numpy array of shape (n_a, n_a + n_x)
        bc -- Bias of the first "tanh", numpy array of shape (n_a, 1)
        Wo -- Weight matrix of the output gate, numpy array of shape (n_a, n_a + n_x)
        bo -- Bias of the output gate, numpy array of shape (n_a, 1)
        Wy -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
                        
    Returns:
    a -- Hidden states for every time-step, numpy array of shape (n_a, m, T_x)
    y -- Predictions for every time-step, numpy array of shape (n_y, m, T_x)
    c -- The value of the cell state, numpy array of shape (n_a, m, T_x)
    caches -- Tuple of values needed for the backward pass, contains (list of all the caches, x)
    """

    # Initialize "caches", which will track the list of all the caches
    caches = []
    
    # Retrieve dimensions from shapes of x and parameters['Wy']
    n_x, m, T_x = x.shape
    n_y, n_a = parameters['Wy'].shape
    
    # initialize "a", "c" and "y" with zeros
    a = np.zeros((n_a, m, T_x))
    c = np.zeros((n_a, m, T_x))
    y = np.zeros((n_y, m, T_x))
    
    # Initialize a_next and c_next
    a_next = a0
    c_next = np.zeros(a_next.shape)
    
    # loop over all time-steps
    for t in range(T_x):
        # Update next hidden state, next memory state, compute the prediction, get the cache.
        # Pass the states from the previous step (a_next, c_next); a[:,:,t] and c[:,:,t] are still zero here.
        a_next, c_next, yt, cache = lstm_cell_forward(x[:,:,t], a_next, c_next, parameters)
        # Save the value of the new "next" hidden state in a
        a[:,:,t] = a_next
        # Save the value of the prediction in y
        y[:,:,t] = yt
        # Save the value of the next cell state
        c[:,:,t] = c_next
        # Append the cache into caches
        caches.append(cache)
    
    # store values needed for backward propagation in cache
    caches = (caches, x)

    return a, y, c, caches
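
To call `lstm_forward`, the `parameters` dictionary needs the ten arrays listed in the docstring. The helper below is a hypothetical sketch, not part of the original notebook: the name `init_lstm_parameters`, the scale 0.1, and the zero biases are assumptions chosen only to produce arrays of the right shapes.

```python
import numpy as np

def init_lstm_parameters(n_a, n_x, n_y, seed=0):
    """Hypothetical helper: small random weights and zero biases."""
    rng = np.random.default_rng(seed)
    params = {}
    # Forget gate (f), update gate (i), candidate (c), output gate (o)
    # all act on the concatenation [a_prev, xt] of length n_a + n_x.
    for gate in ("f", "i", "c", "o"):
        params["W" + gate] = 0.1 * rng.standard_normal((n_a, n_a + n_x))
        params["b" + gate] = np.zeros((n_a, 1))
    # Prediction layer maps the hidden state to n_y classes.
    params["Wy"] = 0.1 * rng.standard_normal((n_y, n_a))
    params["by"] = np.zeros((n_y, 1))
    return params

# Toy sizes, made up for illustration.
n_x, n_a, n_y, m, T_x = 3, 5, 2, 10, 7
parameters = init_lstm_parameters(n_a, n_x, n_y)
x = np.random.default_rng(1).standard_normal((n_x, m, T_x))
a0 = np.zeros((n_a, m))

for name in sorted(parameters):
    print(name, parameters[name].shape)
```

With the functions above defined in the same session, `a, y, c, caches = lstm_forward(x, a0, parameters)` should return `a` and `c` of shape `(n_a, m, T_x)` and `y` of shape `(n_y, m, T_x)`.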