### Introduction

<img src="../images/RNN Diagram.png">

In [1]:
import numpy as np

In [2]:
def sigmoid(x):
    return 1/(1 + np.exp(-np.clip(x, -500, 500)))

In [3]:
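def softmax(x):
    # Subtract the max along the last axis for numerical stability, then normalize.
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)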

An RNN Unit at time-step $t$ takes as input: <br/>
* a minibatch of 'words' denoted by $x^{(t)}$, of dimensions $N \times d$, and <br/>
* the 'hidden-state' vector $h^{(t-1)}$ from the previous unit, of dimensions $N \times D_h$.

**Note: $d$ and $D_h$ are hyperparameters, i.e. we _chose_ to represent each hidden state using a vector of length $D_h$ and we _chose_ to use 'word embedding' vectors of length $d$.**

The code below implements a single RNN Unit's computation. The output is the 'hidden-state' vector $h^{(t)}$ for this unit, of dimensions $N \times D_h$. (In this notebook, $h^{(t)}$ for time-step $t$ is always referred to as $h\_next$).

In [3]:
def rnn_step_forward(x_t, h_prev, Wh, We, b1):
    h_next = sigmoid(np.matmul(h_prev, Wh.T) + np.matmul(x_t, We.T) + b1)
    cache = h_prev, h_next, x_t
    return h_next, cache
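As a quick shape check, the sketch below runs one step on random data. The sizes (`N=4`, `d=3`, `Dh=5`) and the random initialization are illustrative assumptions, not values from this notebook. The shapes of `Wh` ($D_h \times D_h$) and `We` ($D_h \times d$) are inferred from the transposes in `rnn_step_forward`. The two functions are repeated from the cells above only so that the snippet runs on its own.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

def rnn_step_forward(x_t, h_prev, Wh, We, b1):
    h_next = sigmoid(np.matmul(h_prev, Wh.T) + np.matmul(x_t, We.T) + b1)
    cache = h_prev, h_next, x_t
    return h_next, cache

# Illustrative sizes (assumed, not taken from the notebook).
N, d, Dh = 4, 3, 5
rng = np.random.default_rng(0)

x_t = rng.standard_normal((N, d))     # minibatch of word embeddings
h_prev = np.zeros((N, Dh))            # previous hidden state
Wh = rng.standard_normal((Dh, Dh))    # hidden-to-hidden weights
We = rng.standard_normal((Dh, d))     # input-to-hidden weights
b1 = np.zeros(Dh)                     # bias, broadcast over the minibatch

h_next, cache = rnn_step_forward(x_t, h_prev, Wh, We, b1)
print(h_next.shape)  # (4, 5)
```

Every entry of `h_next` lies strictly between 0 and 1 because the unit uses a sigmoid activation.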

Each RNN Unit depends on the hidden state of the previous RNN Unit, much as each layer of a plain feedforward network depends on the output of the layer before it. We therefore run the $rnn\_step\_forward$ method implemented above sequentially, once for each time-step.

**Note: One crucial difference from a plain feedforward network is that each RNN Unit uses the same parameters $W_h$, $W_e$, and $b_1$. This point of difference will have a significant bearing on how we backprop through an RNN.**
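One way to state the consequence of this parameter sharing: because the same $W_h$ is used at every time-step, its gradient is the sum of the contributions from each time-step. The subscript $(t)$ below is notation introduced here, meaning "the contribution through the copy of the parameter used at time-step $t$":

$$\frac{\partial J}{\partial W_h} = \sum_{t=1}^{T} \left.\frac{\partial J}{\partial W_h}\right|_{(t)}$$

The same holds for $W_e$ and $b_1$.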

The code below implements the forward pass through an RNN. We are given as inputs:
* a minibatch of 'word sequences' denoted by $x$, of dimensions $N \times T \times d$, where $N$ is the number of sequences in the minibatch and $T$ is the length of each sequence,
* an initial state vector denoted by $h^{(0)}$, of dimensions $N \times D_h$.

In [4]:
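def rnn_forward(T, x, h_0, Wh, We, b1):
    # T is re-derived from x.shape below; the argument is kept so existing calls still work.
    N, T, d = x.shape
    _, Dh = h_0.shape
    h = np.zeros((N, T, Dh))   # hidden states for every time-step
    h_prev = h_0
    cache = {}                 # per-time-step caches, keyed by t
    for t in range(T):
        h[:, t, :], cache_step = rnn_step_forward(x[:, t, :], h_prev, Wh, We, b1)
        h_prev = h[:, t, :]
        cache.update({t : cache_step})
    return h, cache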

The code below implements the backward pass through a single RNN Unit. At each time-step $t$, we are given as inputs:
* the cache for this time-step saved during our forward pass, which stores $h^{(t-1)}$, $h^{(t)}$, and $x^{(t)}$,
* the gradient of total loss $J$ with respect to $h^{(t)}$, denoted by $dh\_next$, of dimensions $N \times D_h$.
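The per-step gradients follow from the chain rule applied to the forward computation in $rnn\_step\_forward$. The symbol $z^{(t)}$ for the pre-activation and the shorthand $dz$ are introduced here for convenience and do not appear elsewhere in this notebook:

$$z^{(t)} = h^{(t-1)} W_h^\top + x^{(t)} W_e^\top + b_1, \qquad h^{(t)} = \sigma\left(z^{(t)}\right)$$

$$dz = dh\_next \odot h^{(t)} \odot \left(1 - h^{(t)}\right)$$

$$dW_h = dz^\top h^{(t-1)}, \qquad dW_e = dz^\top x^{(t)}, \qquad db_1 = \sum_{n=1}^{N} dz_{n,:}, \qquad dh\_prev = dz \, W_h$$

With these conventions, $dW_h$ has the same shape as $W_h$ ($D_h \times D_h$) and $dW_e$ has the same shape as $W_e$ ($D_h \times d$).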

In [5]:
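def rnn_step_backward(dh_next, cache, Wh):
    # Wh is passed in explicitly because the forward cache does not store it.
    h_prev, h_next, x_t = cache
    # Gradient w.r.t. the pre-activation: sigmoid'(z) = h_next * (1 - h_next).
    dz = dh_next * h_next * (1 - h_next)
    # Forward used h_prev @ Wh.T and x_t @ We.T, so each weight gradient is dz.T @ input.
    dWh = np.matmul(dz.T, h_prev)
    dWe = np.matmul(dz.T, x_t)
    db1 = dz.sum(axis=0)
    dh_prev = np.matmul(dz, Wh)
    return dh_prev, dWh, dWe, db1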

In [2]:
def rnn_backward():
    pass
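The notebook leaves $rnn\_backward$ unimplemented. The block below is one possible sketch of backpropagation through time, not the author's implementation. It assumes an upstream gradient `dh` of shape $N \times T \times D_h$ (one slice per time-step, as would come from the layer above), and a step-backward function that receives `Wh` as an extra argument. The sizes, the random data, and the linear test loss `sum(h * G)` are all made up for the check. The forward functions are compact restatements of the cells above so that the snippet is self-contained.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

def rnn_step_forward(x_t, h_prev, Wh, We, b1):
    h_next = sigmoid(h_prev @ Wh.T + x_t @ We.T + b1)
    return h_next, (h_prev, h_next, x_t)

def rnn_forward(T, x, h_0, Wh, We, b1):
    N, T, d = x.shape
    h = np.zeros((N, T, h_0.shape[1]))
    h_prev, cache = h_0, {}
    for t in range(T):
        h[:, t, :], cache[t] = rnn_step_forward(x[:, t, :], h_prev, Wh, We, b1)
        h_prev = h[:, t, :]
    return h, cache

def rnn_step_backward(dh_next, cache, Wh):
    # Assumed signature: Wh is passed in because the cache does not hold it.
    h_prev, h_next, x_t = cache
    dz = dh_next * h_next * (1 - h_next)
    return dz @ Wh, dz.T @ h_prev, dz.T @ x_t, dz.sum(axis=0)

def rnn_backward(dh, cache, Wh):
    """Hypothetical BPTT loop; dh has shape (N, T, Dh)."""
    N, T, Dh = dh.shape
    dWh, dWe, db1 = 0.0, 0.0, 0.0
    dh_prev = np.zeros((N, Dh))
    for t in reversed(range(T)):
        # Gradient reaching h^(t): from the layer above plus from time-step t+1.
        dh_total = dh[:, t, :] + dh_prev
        dh_prev, dWh_t, dWe_t, db1_t = rnn_step_backward(dh_total, cache[t], Wh)
        # Shared parameters: accumulate the contribution of every time-step.
        dWh, dWe, db1 = dWh + dWh_t, dWe + dWe_t, db1 + db1_t
    return dh_prev, dWh, dWe, db1  # dh_prev is now the gradient w.r.t. h_0

# Numerical check on one entry of Wh, using a made-up loss J = sum(h * G).
rng = np.random.default_rng(0)
N, T, d, Dh = 2, 3, 4, 5
x = rng.standard_normal((N, T, d))
h_0 = rng.standard_normal((N, Dh))
Wh = rng.standard_normal((Dh, Dh))
We = rng.standard_normal((Dh, d))
b1 = rng.standard_normal(Dh)
G = rng.standard_normal((N, T, Dh))

def loss(Wh_):
    h, _ = rnn_forward(T, x, h_0, Wh_, We, b1)
    return np.sum(h * G)

h, cache = rnn_forward(T, x, h_0, Wh, We, b1)
dh_0, dWh, dWe, db1_grad = rnn_backward(G, cache, Wh)

eps = 1e-6
E = np.zeros_like(Wh)
E[1, 2] = eps
numeric = (loss(Wh + E) - loss(Wh - E)) / (2 * eps)
print(abs(numeric - dWh[1, 2]))  # should be close to zero
```

The loop walks the time-steps in reverse because the gradient on $h^{(t)}$ needs the gradient that flows back from time-step $t+1$.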

Atop each RNN Unit sits an Affine layer, which takes the vector $h^{(t)}$ as input, applies an Affine transformation, and computes the Softmax Probability. The parameters of this layer are $U \space (Dim: D_{h} \times V)$ and $b_2$ (a vector of length $V$, so that it broadcasts across rows when added in the code below).

We do not have to implement a separate Affine layer for each time-step. Unlike an RNN Unit, whose computation depends on the output of the previous unit, the Affine computations at each time-step are independent of each other. Therefore, once we have computed $h^{(t)}$ for each time-step, we perform the Affine computation for all $T$ time-steps in one go, taking into account the contributions from all $N$ sequences in the minibatch.

In [8]:
def affine_forward(h, U, b2):
    N, T, Dh = h.shape
    V = b2.shape[0]
    theta = (np.matmul(h.reshape(N*T, Dh), U) + b2).reshape(N, T, V)
    cache = U, b2, h
    return theta, cache
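To illustrate that the single reshaped matrix product gives the same result as a per-time-step loop, the sketch below compares the two on random data. The sizes and the random inputs are illustrative assumptions; `affine_forward` is repeated from the cell above so that the snippet runs on its own.

```python
import numpy as np

def affine_forward(h, U, b2):
    N, T, Dh = h.shape
    V = b2.shape[0]
    theta = (np.matmul(h.reshape(N*T, Dh), U) + b2).reshape(N, T, V)
    cache = U, b2, h
    return theta, cache

# Illustrative sizes (assumed, not taken from the notebook).
rng = np.random.default_rng(0)
N, T, Dh, V = 2, 3, 4, 6
h = rng.standard_normal((N, T, Dh))
U = rng.standard_normal((Dh, V))
b2 = rng.standard_normal(V)

theta, _ = affine_forward(h, U, b2)

# The same computation, one time-step at a time.
theta_loop = np.stack([h[:, t, :] @ U + b2 for t in range(T)], axis=1)

print(np.allclose(theta, theta_loop))  # True
```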

In [7]:
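def affine_backward(dtheta, cache):
    U, b2, h = cache
    Dh = U.shape[0]
    N, T, V = dtheta.shape
    # theta = h @ U + b2 with U of shape (Dh, V), so dh = dtheta @ U.T and dU = h.T @ dtheta.
    dh = np.matmul(dtheta.reshape(N*T, V), U.T).reshape(N, T, Dh)
    dU = np.matmul(h.reshape(N*T, Dh).T, dtheta.reshape(N*T, V))
    # The bias is shared across the minibatch and all time-steps, so sum over both axes.
    db2 = dtheta.sum(axis=(0,1))
    return dh, dU, db2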
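A finite-difference check is a simple way to sanity-test the Affine gradients. The sketch below writes the gradients out for $U$ of shape $D_h \times V$ and compares single entries against central differences. The test loss `sum(theta * dtheta)`, the sizes, and the random data are assumptions made up for this check.

```python
import numpy as np

# Illustrative sizes (assumed, not taken from the notebook).
rng = np.random.default_rng(0)
N, T, Dh, V = 2, 3, 4, 6
h = rng.standard_normal((N, T, Dh))
U = rng.standard_normal((Dh, V))
b2 = rng.standard_normal(V)
dtheta = rng.standard_normal((N, T, V))  # stand-in for the upstream gradient

def loss(h_, U_, b2_):
    # Linear test loss whose gradient w.r.t. theta is exactly dtheta.
    theta = (h_.reshape(N*T, Dh) @ U_ + b2_).reshape(N, T, V)
    return np.sum(theta * dtheta)

# Analytic gradients for theta = h @ U + b2.
dh = (dtheta.reshape(N*T, V) @ U.T).reshape(N, T, Dh)
dU = h.reshape(N*T, Dh).T @ dtheta.reshape(N*T, V)
db2 = dtheta.sum(axis=(0, 1))

eps = 1e-6

# Central difference on one entry of U.
E_U = np.zeros_like(U)
E_U[1, 2] = eps
num_dU = (loss(h, U + E_U, b2) - loss(h, U - E_U, b2)) / (2 * eps)

# Central difference on one entry of h.
E_h = np.zeros_like(h)
E_h[0, 1, 2] = eps
num_dh = (loss(h + E_h, U, b2) - loss(h - E_h, U, b2)) / (2 * eps)

print(abs(num_dU - dU[1, 2]), abs(num_dh - dh[0, 1, 2]))  # both close to zero
```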
