### Introduction

<img src="../images/RNN Diagram.png">

In [16]:
import numpy as np

In [17]:
def sigmoid(x):
    return 1/(1 + np.exp(-np.clip(x, -500, 500)))

In [18]:
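def softmax(x):
    # Numerically stable softmax over the last axis (subtract the row max before exponentiating)
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)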

An RNN Unit at time-step $t$ takes as input: <br/>
* a minibatch of 'words' denoted by $x^{(t)}$, of dimensions $N \times d$, and <br/>
* the 'hidden-state' vector $h^{(t-1)}$ from the previous unit, of dimensions $N \times D_h$.

**Note: $d$ and $D_h$ are hyper-parameters, i.e. we _chose_ to represent each hidden state using a vector of length $D_h$ and we _chose_ to use 'word embedding' vectors of length $d$.**

The code below implements a single RNN Unit's computation. The output is the 'hidden-state' vector $h^{(t)}$ for this unit, of dimensions $N \times D_h$. (In this notebook, $h^{(t)}$ for time-step $t$ is always referred to as $h\_next$).
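Written out as an equation, the computation in the code below is roughly the following. The shapes are read off the test cells later in this notebook ($W_h$ is $D_h \times D_h$, $W_e$ is $D_h \times d$, and $b_1$ has length $D_h$ and is broadcast across the $N$ rows); $\sigma$ denotes the element-wise sigmoid.

$$
h^{(t)} = \sigma\left(h^{(t-1)} W_h^{\top} + x^{(t)} W_e^{\top} + b_1\right)
$$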

In [36]:
def rnn_step_forward(x_t, h_prev, Wh, We, b1):
    h_next = sigmoid(np.matmul(h_prev, Wh.T) + np.matmul(x_t, We.T) + b1)
    cache = h_prev, h_next, x_t
    return h_next, cache

An RNN Unit depends on the previous RNN Unit's hidden state, much as each layer of a plain feedforward network depends on the layer before it. We therefore run the $rnn\_step\_forward$ function implemented above sequentially, once per time-step.

**Note: One crucial difference from a plain feedforward network is that each RNN Unit uses the same parameters $W_h$, $W_e$, and $b_1$. This point of difference will have a significant bearing on how we backprop through an RNN.**
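One way to state the consequence for backprop: because the same $W_h$, $W_e$, and $b_1$ appear in every time-step, the gradient of the total loss $J$ with respect to each of them is the sum of the per-step contributions. The notation $\left.\cdot\right|_{(t)}$ for "contribution of time-step $t$" is introduced here only for illustration; it corresponds to the `+=` accumulation in $rnn\_backward$.

$$
\frac{\partial J}{\partial W_h} = \sum_{t=1}^{T} \left.\frac{\partial J}{\partial W_h}\right|_{(t)}, \qquad
\frac{\partial J}{\partial W_e} = \sum_{t=1}^{T} \left.\frac{\partial J}{\partial W_e}\right|_{(t)}, \qquad
\frac{\partial J}{\partial b_1} = \sum_{t=1}^{T} \left.\frac{\partial J}{\partial b_1}\right|_{(t)}
$$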

The code below implements the forward pass through an RNN. We are given as inputs:
* a minibatch of 'word sequences' denoted by $x$, of dimensions $N \times T \times d$, where $N$ is the number of sequences in the minibatch and $T$ is the length of each sequence,
* an initial state vector denoted by $h^{(0)}$, of dimensions $N \times D_h$.

In [20]:
def rnn_forward(T, x, h_0, Wh, We, b1):
    N, T, d = x.shape    # T is re-read from x here, so the T argument is effectively unused
    _, Dh = h_0.shape
    h = np.zeros((N, T, Dh))
    h_prev = h_0
    cache_dict = {}
    for t in range(T):
        h[:, t, :],  cache_step = rnn_step_forward(x[:,t,:], h_prev, Wh, We, b1) 
        h_prev = h[:, t, :]
        cache_dict.update({t : cache_step})
    return h, cache_dict

For the backward pass through a single RNN Unit at time-step $t$, we are given as inputs:
* the cache for this time-step saved during our forward pass (the cache stores $h^{(t-1)}$, $h^{(t)}$, and $x^{(t)}$),
* the gradient of total loss $J$ with respect to $h^{(t)}$, denoted by $dh\_next$, of dimensions $N \times D_h$.
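The step-level gradients computed in the code below can be summarised as follows. The symbol $\delta^{(t)}$ is shorthand introduced here (it plays the role of `interim_dot_prod`), $\odot$ is the element-wise product, and $\delta^{(t)}_n$ is the row of $\delta^{(t)}$ belonging to the $n$-th sequence in the minibatch. Each weight gradient is the contribution of time-step $t$ only.

$$
\begin{aligned}
\delta^{(t)} &= \frac{\partial J}{\partial h^{(t)}} \odot h^{(t)} \odot \left(1 - h^{(t)}\right) \\
\left.\frac{\partial J}{\partial W_h}\right|_{(t)} &= {\delta^{(t)}}^{\top} h^{(t-1)} \\
\left.\frac{\partial J}{\partial W_e}\right|_{(t)} &= {\delta^{(t)}}^{\top} x^{(t)} \\
\left.\frac{\partial J}{\partial b_1}\right|_{(t)} &= \sum_{n=1}^{N} \delta^{(t)}_{n} \\
\frac{\partial J}{\partial h^{(t-1)}} &= \delta^{(t)} W_h
\end{aligned}
$$

The factor $h^{(t)} \odot (1 - h^{(t)})$ is the derivative of the sigmoid expressed through its own output, which is why the cache only needs to keep $h^{(t)}$ rather than the pre-activation.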

In [87]:
def rnn_step_backward(dh_next, cache, Wh):
    h_prev, h_next, x_t = cache
    dsigmoid = h_next*(1 - h_next)
    interim_dot_prod = dh_next*dsigmoid       # This will be used in all equations below
    dWh_step = np.matmul(interim_dot_prod.T, h_prev)
    dWe_step = np.matmul(interim_dot_prod.T, x_t)
    db1_step = np.sum(interim_dot_prod, axis=0)
    dh_prev = np.matmul(interim_dot_prod, Wh)
    return  dWh_step, dWe_step, db1_step, dh_prev 

In [88]:
def rnn_backward(dh, cache_dict, Wh, We, b1):
    N, T, Dh = dh.shape
    dWh = np.zeros_like(Wh)
    dWe = np.zeros_like(We)
    db1 = np.zeros_like(b1)
    dh_next = np.zeros((N, Dh))
    for t in range(T, 0, -1):
        dh_next += dh[:, t-1, :]
        dWh_step, dWe_step, db1_step, dh_prev = rnn_step_backward(dh_next, cache_dict[t-1], Wh)
        dh_next = dh_prev
        dWh += dWh_step
        dWe += dWe_step
        db1 += db1_step
    return dWh, dWe, db1

Atop each RNN Unit sits an Affine layer which takes the vector $h^{(t)}$ as input, applies an Affine transformation, and computes the Softmax Probability. The parameters of this layer are $U \space (Dim: V \times D_{h})$ and $b_2 \space (Dim: V)$, matching the shapes used in the code below.
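In symbols, and assuming $U$ is stored with shape $V \times D_h$ as $affine\_forward$ expects, the layer computes the scores $\theta^{(t)}$ and then normalises them. The name $\hat{y}^{(t)}$ for the resulting probabilities is an assumption made for this sketch; the notebook itself only names $\theta$.

$$
\theta^{(t)} = h^{(t)} U^{\top} + b_2, \qquad
\hat{y}^{(t)}_{k} = \frac{\exp\left(\theta^{(t)}_{k}\right)}{\sum_{j=1}^{V} \exp\left(\theta^{(t)}_{j}\right)}
$$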

We do not have to implement a separate Affine layer for each time-step. Unlike an RNN Unit, whose computation depends on the output of the previous unit, the Affine computations at different time-steps are independent of each other. Therefore, once we have computed $h^{(t)}$ for every time-step, we perform the Affine computation for ALL $T$ time-steps in one go, taking into account the contribution from ALL $N$ sequences in the minibatch.
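As a quick sanity check of this claim, the sketch below compares one batched matrix multiply over all $N \cdot T$ hidden states against a per-time-step loop. The sizes and random inputs are illustrative only, and $U$ is assumed to have shape $(V, D_h)$ as in $affine\_forward$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, Dh, V = 3, 4, 5, 7            # illustrative sizes only
h = rng.standard_normal((N, T, Dh))
U = rng.standard_normal((V, Dh))    # assumed (V, Dh) layout, as in affine_forward
b2 = rng.standard_normal(V)

# One matmul over all N*T hidden states at once
theta_batched = (h.reshape(N * T, Dh) @ U.T + b2).reshape(N, T, V)

# Reference: a separate affine transform for each time-step
theta_loop = np.stack([h[:, t, :] @ U.T + b2 for t in range(T)], axis=1)

print(np.allclose(theta_batched, theta_loop))  # True
```

Flattening the first two axes works because each row of the reshaped matrix is still a single hidden-state vector, and the affine map treats rows independently.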

In [116]:
def affine_forward(h, U, b2):
    N, T, Dh = h.shape
    V = b2.shape[0]
    theta = (np.matmul(h.reshape(N*T, Dh), U.T) + b2).reshape(N, T, V)
    cache = U, b2, h
    return theta, cache

In [118]:
def affine_backward(dtheta, cache):
    U, b2, h = cache
    Dh = U.shape[1]
    N, T, V = dtheta.shape
    dh = np.matmul(dtheta.reshape(N*T, V), U).reshape(N, T, Dh)
    dU = np.matmul((dtheta.reshape(N*T, V).T), h.reshape(N*T, Dh))
    db2 = dtheta.sum(axis=(0,1))
    return dh, dU, db2

### Numerical Evaluation of Gradients to Check Correctness of our Implementation
If you are familiar with how to numerically check gradients for a network, you can skip this section, and move on to [New Section](#sec_id)

Below, the function $eval\_grad$ numerically evaluates the gradient of a given function $f$ at a point $x$. This point $x$ can be multidimensional; for example, I will use the $2D$ matrix $W_h$ as a 'point'. The 'gradient' is the rate of change of the Loss with respect to each entry of $x$, estimated by applying a very small perturbation to that entry and measuring how the output changes.
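The estimate used in the code is the symmetric (central) difference. Here $e_i$ denotes a perturbation that is $1$ at entry $i$ of $x$ and $0$ elsewhere (notation introduced for this sketch), and $\epsilon = 10^{-5}$ as in the code below.

$$
\frac{\partial f}{\partial x_i} \approx \frac{f(x + \epsilon\, e_i) - f(x - \epsilon\, e_i)}{2\epsilon}
$$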

Notice in the code below that I have multiplied by $df$ to calculate the gradient. This is because we are going to pass $rnn\_forward$ and $rnn\_step\_forward$ for the argument $f$. Both these functions return the vector $h$ and not the scalar Loss which we need in order to compute the gradient w.r.t. the point $x$. Therefore we need to multiply by the upstream gradient $dh$, which is what we pass for the argument $df$.
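To see why the multiplication by the upstream gradient works, here is a minimal sketch on a toy function whose gradient is known in closed form. The helper `numeric_grad_sketch` and the choice of `np.tanh` are hypothetical stand-ins for illustration; they are not part of the notebook's own code.

```python
import numpy as np

def numeric_grad_sketch(f, x, df, eps=1e-5):
    """Central-difference estimate of d(sum(f(x) * df)) / dx (hypothetical helper)."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        orig = x[idx]
        x[idx] = orig + eps
        f_plus = f(x)
        x[idx] = orig - eps
        f_minus = f(x)
        x[idx] = orig                      # restore the entry before moving on
        grad[idx] = np.sum((f_plus - f_minus) * df) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3))
df = rng.standard_normal((2, 3))           # stands in for the upstream gradient dJ/df

f = lambda x: np.tanh(x)                   # toy vector-valued function
analytic = df * (1 - np.tanh(x) ** 2)      # chain rule: dJ/dx = dJ/df * df/dx
numeric = numeric_grad_sketch(f, x, df)

print(np.allclose(numeric, analytic, atol=1e-7))  # True
```

Weighting the output difference by `df` and summing is the chain rule applied numerically: it turns a vector-valued output into the scalar change in Loss that each perturbed entry would cause.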

In [112]:
def eval_grad(f, x, df):
    grad = np.zeros_like(x)
    epsilon = 1e-5
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        orig_val = x[idx]
        x[idx] = orig_val + epsilon
        fwd_fx = f(x)
        x[idx] = orig_val - epsilon
        bck_fx = f(x)
        grad[idx] = np.sum((fwd_fx - bck_fx)*df/(epsilon*2))
        x[idx] = orig_val
        it.iternext()
    return grad

def rel_error(x, y):
    # Returns relative error
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))

In [113]:
""" Check Gradients for single RNN Unit """

np.random.seed(10151)
N, Dh, d, V = 2, 5, 3, 50
# Parameters
Wh = np.random.randn(Dh, Dh)
b1 = np.random.randn(Dh)
We = np.random.randn(Dh, d)

# Inputs
x_t = np.random.randn(N, d)
h_prev = np.random.randn(N, Dh)

# Test functions
fWh = lambda Wh: rnn_step_forward(x_t, h_prev, Wh, We, b1)[0]
fWe = lambda We: rnn_step_forward(x_t, h_prev, Wh, We, b1)[0]
fb1 = lambda b1: rnn_step_forward(x_t, h_prev, Wh, We, b1)[0]
fh_prev = lambda h_prev: rnn_step_forward(x_t, h_prev, Wh, We, b1)[0]

# Evaluate test functions
h_next, cache_step = rnn_step_forward(x_t, h_prev, Wh, We, b1)
dh_next = np.random.randn(*h_next.shape)
dWh, dWe, db1, dh_prev = rnn_step_backward(dh_next, cache_step, Wh)
dWh_num = eval_grad(fWh, Wh, dh_next)
dWe_num = eval_grad(fWe, We, dh_next)
db1_num = eval_grad(fb1, b1, dh_next)
dh_prev_num = eval_grad(fh_prev, h_prev, dh_next)
print('Error in dWh: {}'.format(rel_error(dWh_num, dWh)))
print('Error in dWe: {}'.format(rel_error(dWe_num, dWe)))
print('Error in db1: {}'.format(rel_error(db1_num, db1)))
print('Error in dh_prev: {}'.format(rel_error(dh_prev_num, dh_prev)))

Error in dWh: 3.699756155582621e-11
Error in dWe: 2.8456523039254035e-10
Error in db1: 2.1963488162818666e-11
Error in dh_prev: 2.3951187440793865e-11


In [114]:
""" Check Gradients for the entire RNN """

np.random.seed(10151)
T, N, Dh, d, V = 10, 2, 5, 3, 50
# Parameters
Wh = np.random.randn(Dh, Dh)
b1 = np.random.randn(Dh)
We = np.random.randn(Dh, d)

# Inputs
x = np.random.randn(N, T, d)
h_0 = np.random.randn(N, Dh)

# Test functions
fWh = lambda Wh: rnn_forward(T, x, h_0, Wh, We, b1)[0]
fWe = lambda We: rnn_forward(T, x, h_0, Wh, We, b1)[0]
fb1 = lambda b1: rnn_forward(T, x, h_0, Wh, We, b1)[0]

# Evaluate test functions
h, cache_dict = rnn_forward(T, x, h_0, Wh, We, b1)
dh = np.random.randn(*h.shape)
dWh, dWe, db1 = rnn_backward(dh, cache_dict, Wh, We, b1)
dWh_num = eval_grad(fWh, Wh, dh)
dWe_num = eval_grad(fWe, We, dh)
db1_num = eval_grad(fb1, b1, dh)
print('Error in dWh: {}'.format(rel_error(dWh_num, dWh)))
print('Error in dWe: {}'.format(rel_error(dWe_num, dWe)))
print('Error in db1: {}'.format(rel_error(db1_num, db1)))

Error in dWh: 9.831792282221475e-10
Error in dWe: 1.0017182281175608e-10
Error in db1: 9.769693584166271e-11


In [120]:
""" Check Gradients for the Affine layer """

np.random.seed(10151)
N, T, Dh, V = 5, 10, 6, 50

# Parameters
U = np.random.randn(V, Dh)
b2 = np.random.rand(V)

# Inputs
h = np.random.randn(N, T, Dh)

# Test Functions
fU = lambda U: affine_forward(h, U, b2)[0]
fb2 = lambda b2: affine_forward(h, U, b2)[0]
fh = lambda h: affine_forward(h, U, b2)[0]

# Evaluate test functions
theta, cache = affine_forward(h, U, b2)
dtheta = np.random.randn(*theta.shape)
dh, dU, db2 = affine_backward(dtheta, cache)

dU_num = eval_grad(fU, U, dtheta)
db2_num = eval_grad(fb2, b2, dtheta)
dh_num = eval_grad(fh, h, dtheta)

print('Error in dU: {}'.format(rel_error(dU_num, dU)))
print('Error in db2: {}'.format(rel_error(db2_num, db2)))
print('Error in dh: {}'.format(rel_error(dh_num, dh)))

Error in dU: 7.956108235981939e-10
Error in db2: 1.578675630521908e-09
Error in dh: 1.0117441876769132e-09


<a id='sec_id'></a>
### New Section