### Introduction
This notebook builds an LSTM model without using any deep learning libraries. It accompanies my [blog post](https://talwarabhimanyu.github.io/blog/2018/08/12/lstm-backprop), which explains the mathematics behind the forward pass and, more importantly, the backpropagation involved in training an LSTM.

To organize the various Python functions involved in training, I have used the same layout as in [Assignment 3 of Stanford's CS231n course](http://cs231n.github.io/assignments2016/assignment3/). The code for the forward pass and backpropagation is my own.

### Figure: A Single Time-Step in an LSTM
<img src="../../images/LSTM Diagram.png">

In [1]:
import numpy as np

In [2]:
def sigmoid(x):
    return 1/(1 + np.exp(-np.clip(x, -500, 500)))

def tanh(x):
    expo = np.exp(-2*np.clip(x, -500, 500))
    return (1 - expo)/(1 + expo)

### Computation for LSTM Units
An LSTM Unit at time-step $t$ takes as input: <br/>
* a minibatch of 'words' denoted by $x^{(t)}$, of dimensions $N \times d$, and <br/>
* the 'hidden-state' vector $h^{(t-1)}$ from the previous unit, of dimensions $N \times D$, and <br/>
* the 'internal-state' vector $s^{(t-1)}$ from the previous unit, of dimensions $N \times D$.

**Note: $d$ and $D$ are hyper-parameters, i.e. we _chose_ to represent each hidden/internal state using a vector of length $D$ and we _chose_ to use 'word embedding' vectors of length $d$.**

The code below implements a single LSTM Unit's computation. The output is two vectors, the 'hidden-state' $h^{(t)}$ and the 'internal-state' $s^{(t)}$, each of dimensions $N \times D$. (In this notebook, $h^{(t)}$ for time-step $t$ is always referred to as $h\_next$. Similar notation is used for $s^{(t)}$.)
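
For reference, the step can be written as the equations below. They are read directly from the implementation in this notebook rather than copied from the blog post, so treat them as a summary of the code. The notation is an assumption made here for compactness: $\sigma$ is the logistic sigmoid, $\odot$ is element-wise multiplication, each row of a matrix is one example in the minibatch, $U_k$ has dimensions $D \times d$, $W_k$ has dimensions $D \times D$, and $b_k$ is a vector of length $D$.

$$
\begin{aligned}
e^{(t)} &= \sigma\left(b_e + x^{(t)} U_e^{\top} + h^{(t-1)} W_e^{\top}\right) \\
f^{(t)} &= \sigma\left(b_f + x^{(t)} U_f^{\top} + h^{(t-1)} W_f^{\top}\right) \\
g^{(t)} &= \sigma\left(b_g + x^{(t)} U_g^{\top} + h^{(t-1)} W_g^{\top}\right) \\
q^{(t)} &= \sigma\left(b_q + x^{(t)} U_q^{\top} + h^{(t-1)} W_q^{\top}\right) \\
s^{(t)} &= f^{(t)} \odot s^{(t-1)} + g^{(t)} \odot e^{(t)} \\
h^{(t)} &= q^{(t)} \odot \tanh\left(s^{(t)}\right)
\end{aligned}
$$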

In [3]:
def lstm_step_forward(x_t, h_prev, s_prev, params):
    # Load parameters
    We = params['We']
    Wf = params['Wf']
    Wg = params['Wg']
    Wq = params['Wq']
    Ue = params['Ue']
    Uf = params['Uf']
    Ug = params['Ug']
    Uq = params['Uq']
    be = params['be']
    bf = params['bf']
    bg = params['bg']
    bq = params['bq']
    # Compute gate values
    e_t = sigmoid(be + np.matmul(x_t, Ue.T) + np.matmul(h_prev, We.T))
    f_t = sigmoid(bf + np.matmul(x_t, Uf.T) + np.matmul(h_prev, Wf.T))
    g_t = sigmoid(bg + np.matmul(x_t, Ug.T) + np.matmul(h_prev, Wg.T))
    q_t = sigmoid(bq + np.matmul(x_t, Uq.T) + np.matmul(h_prev, Wq.T))
    # Compute signals
    s_next = f_t*s_prev + g_t*e_t
    h_next = q_t*tanh(s_next)
    cache = {'s_prev' : s_prev, 
             's_next' : s_next, 
             'x_t' : x_t, 
             'e_t' : e_t, 
             'f_t' : f_t, 
             'g_t' : g_t, 
             'q_t' : q_t, 
             'h_prev' : h_prev}
    return h_next, s_next, cache

### Forward Pass through All Time-Steps
An LSTM Unit depends on the previous LSTM Unit's hidden/internal states (much as a layer in a plain feedforward network depends on the output of the previous layer). Therefore we sequentially run the $lstm\_step\_forward$ function implemented above, once for each time-step.

**Note: One crucial difference from a plain feedforward network is that each LSTM Unit uses the same parameters ($W_f$, $U_f$, $b_f$ etc.). This point of difference will have a significant bearing on how we backprop through an LSTM.**
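
To see why sharing matters, here is a minimal sketch that uses a scalar recurrence $h_t = \tanh(w \, h_{t-1})$ instead of an LSTM. The names `unroll` and `grad_w` are illustrative and not part of this notebook. Because the same weight $w$ is used at every time-step, its gradient is the sum of the contributions from each step, which is the same accumulation pattern we need when backpropagating through an LSTM.

```python
import numpy as np

def unroll(w, h0, T):
    """Run h_t = tanh(w * h_{t-1}) for T steps and return all states."""
    hs = [h0]
    for _ in range(T):
        hs.append(np.tanh(w * hs[-1]))
    return hs

def grad_w(w, h0, T):
    """Gradient of the final state h_T w.r.t. the shared weight w."""
    hs = unroll(w, h0, T)
    dh = 1.0                              # d h_T / d h_T
    dw = 0.0
    for t in range(T, 0, -1):
        dpre = dh * (1.0 - hs[t] ** 2)    # back through tanh at step t
        dw += dpre * hs[t - 1]            # this step's contribution to dw
        dh = dpre * w                     # pass the gradient on to h_{t-1}
    return dw

w, h0, T, eps = 0.7, 0.5, 4, 1e-6
numeric = (unroll(w + eps, h0, T)[-1] - unroll(w - eps, h0, T)[-1]) / (2 * eps)
analytic = grad_w(w, h0, T)
print(abs(numeric - analytic))            # should be tiny
```

Dropping the `+=` in favour of `=` would keep only the contribution from the first time-step, and the numerical check would no longer agree.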

The code below implements the forward pass through an LSTM. We are given as inputs:
* a minibatch of 'word sequences' denoted by $x$, of dimensions $N \times T \times d$, where $N$ is the number of sequences in the minibatch and $T$ is the length of each sequence,
* initial hidden/internal state vectors denoted by $h^{(0)}$ and $s^{(0)}$, each of dimensions $N \times D$.

In [4]:
def lstm_forward(T, x, h_0, s_0, params):
    N, T, d = x.shape
    _, D = h_0.shape
    h = np.zeros((N, T, D))
    h_prev = h_0
    s_prev = s_0
    cache_dict = {}
    for t in range(T):
        h[:, t, :], s_next, cache_step = lstm_step_forward(x[:,t,:], h_prev, s_prev, params) 
        h_prev = h[:, t, :]
        s_prev = s_next
        cache_dict.update({t : cache_step})
    return h, cache_dict

### Backpropagation
At each time-step $t$, we are given as inputs:
* the cache for this time-step saved during our forward pass - cache stores a bunch of quantities which were computed during the forward pass,
* the gradients of total loss $J$ with respect to $h^{(t)}$ and $s^{(t)}$, denoted by $dh\_next$ and $ds\_next$ respectively, each of dimensions $N \times D$.
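
The backward step can be summarised by the equations below. They are read off the implementation in this notebook; the symbols $\delta h$, $\delta s$, $\bar{s}$ and $a_k$ are shorthand introduced here (they are not taken from the blog post), with $\delta h = dh\_next$ and $\delta s = ds\_next$ as passed into the function, and $\odot$ denoting element-wise multiplication.

$$
\begin{aligned}
\bar{s} &= \delta h \odot q^{(t)} \odot \left(1 - \tanh^2 s^{(t)}\right) + \delta s \\
a_f &= \bar{s} \odot s^{(t-1)} \odot f^{(t)} \odot \left(1 - f^{(t)}\right) \\
a_g &= \bar{s} \odot e^{(t)} \odot g^{(t)} \odot \left(1 - g^{(t)}\right) \\
a_e &= \bar{s} \odot g^{(t)} \odot e^{(t)} \odot \left(1 - e^{(t)}\right) \\
a_q &= \delta h \odot \tanh s^{(t)} \odot q^{(t)} \odot \left(1 - q^{(t)}\right)
\end{aligned}
$$

For each gate $k \in \{e, f, g, q\}$, with $\sum_n$ running over the $N$ rows of the minibatch:

$$
\frac{\partial J}{\partial U_k} = a_k^{\top} x^{(t)}, \qquad
\frac{\partial J}{\partial W_k} = a_k^{\top} h^{(t-1)}, \qquad
\frac{\partial J}{\partial b_k} = \sum_{n} (a_k)_{n,:}
$$

The gradients passed back to the previous time-step are:

$$
\frac{\partial J}{\partial h^{(t-1)}} = \sum_{k \in \{e, f, g, q\}} a_k W_k, \qquad
\frac{\partial J}{\partial s^{(t-1)}} = f^{(t)} \odot \bar{s}
$$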

In [5]:
def lstm_step_backward(dh_next, ds_next, cache, params):
    """
    dh_next is of shape (N, D)
    ds_next is of shape (N, D)
    """
    # Load parameters
    We = params['We']
    Wf = params['Wf']
    Wg = params['Wg']
    Wq = params['Wq']
    Ue = params['Ue']
    Uf = params['Uf']
    Ug = params['Ug']
    Uq = params['Uq']
    be = params['be']
    bf = params['bf']
    bg = params['bg']
    bq = params['bq']
    # Load cached quantities
    s_prev = cache['s_prev']
    s_next = cache['s_next']
    x_t = cache['x_t']
    e_t = cache['e_t']
    f_t = cache['f_t']
    g_t = cache['g_t']
    q_t = cache['q_t']
    h_prev = cache['h_prev']
    
    # Compute frequently used quantities
    tanh_s = tanh(s_next)
    
    # Internal state s
    ds_next = dh_next*q_t*(1-tanh_s**2) + ds_next
    
    # Forget gate f
    df_step = ds_next*s_prev
    dsigmoid_f = f_t*(1 - f_t)
    f_temp = df_step*dsigmoid_f
    dUf_step = np.matmul(f_temp.T, x_t) 
    dWf_step = np.matmul(f_temp.T, h_prev)
    dbf_step = np.sum(f_temp, axis=0)
    
    # Input gate g
    dg_step = ds_next*e_t
    dsigmoid_g = g_t*(1 - g_t)
    g_temp = dg_step*dsigmoid_g
    dUg_step = np.matmul(g_temp.T, x_t) 
    dWg_step = np.matmul(g_temp.T, h_prev)
    dbg_step = np.sum(g_temp, axis=0)
    
    # Output gate q
    dq_step = dh_next*tanh_s
    dsigmoid_q = q_t*(1 - q_t)
    q_temp = dq_step*dsigmoid_q
    dUq_step = np.matmul(q_temp.T, x_t) 
    dWq_step = np.matmul(q_temp.T, h_prev)
    dbq_step = np.sum(q_temp, axis=0)
    
    # Input transform e
    de_step = ds_next*g_t
    dsigmoid_e = e_t*(1 - e_t)
    e_temp = de_step*dsigmoid_e
    dUe_step = np.matmul(e_temp.T, x_t) 
    dWe_step = np.matmul(e_temp.T, h_prev)
    dbe_step = np.sum(e_temp, axis=0)
    
    # Gradient w.r.t previous state h_prev
    dh_prev = np.matmul(dh_next*tanh_s*dsigmoid_q, Wq) \
                    + np.matmul(ds_next*s_prev*dsigmoid_f, Wf) \
                    + np.matmul(ds_next*g_t*dsigmoid_e, We) \
                    + np.matmul(ds_next*e_t*dsigmoid_g, Wg)           
    ds_prev = f_t*ds_next
    grads = {'We' : dWe_step, 'Wf' : dWf_step, 'Wg' : dWg_step, 'Wq' : dWq_step,
              'Ue' : dUe_step, 'Uf' : dUf_step, 'Ug' : dUg_step, 'Uq' : dUq_step,
              'be' : dbe_step, 'bf' : dbf_step, 'bg' : dbg_step, 'bq' : dbq_step
            }
    return dh_prev, ds_prev, grads

In [6]:
def lstm_backward(dh, cache_dict, params):
    N, T, D = dh.shape
    _, d = params['Ue'].shape
    all_grads = {key: np.zeros_like(params[key]) for key in params} 
    dh_next = np.zeros((N, D))
    ds_next = np.zeros((N, D))
    for t in range(T, 0, -1):
        dh_next += dh[:, t-1, :]
        dh_prev, ds_prev, step_grads = lstm_step_backward(dh_next, ds_next, cache_dict[t-1], params)
        dh_next = dh_prev
        ds_next = ds_prev
        # Accumulate gradients
        for key in step_grads:
            all_grads[key] += step_grads[key]
    return all_grads

### Computation for the Affine Layer 
Atop each LSTM Unit sits an Affine layer, which takes the vector $h^{(t)}$ as input and applies an Affine transformation whose output feeds the Softmax layer. The parameters of this layer are $U \space (Dim: V \times D)$ and $b_2$ (a vector of length $V$).

We do not have to implement a separate Affine layer for each time-step. Inside an LSTM Unit, the computation depends on the output of the previous LSTM Unit; by contrast, once $h^{(t)}$ has been computed for all time-steps, the Affine computations at different time-steps are independent of each other. Therefore, we will perform the Affine computation for ALL $T$ time-steps in one go, taking into account the contribution from ALL $N$ sequences in the minibatch.
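
As a quick illustration of this point, the sketch below checks that one matrix product over all $N \times T$ hidden vectors gives the same result as a loop over time-steps. The sizes and random inputs are arbitrary choices made for the example; $U$ uses the same $V \times D$ layout as the rest of this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, D, V = 2, 3, 4, 5              # small illustrative sizes
h = rng.standard_normal((N, T, D))
U = rng.standard_normal((V, D))
b2 = rng.standard_normal(V)

# One matrix product over all N*T hidden vectors
theta_batched = (h.reshape(N * T, D) @ U.T + b2).reshape(N, T, V)

# Equivalent computation, one time-step at a time
theta_loop = np.zeros((N, T, V))
for t in range(T):
    theta_loop[:, t, :] = h[:, t, :] @ U.T + b2

print(np.allclose(theta_batched, theta_loop))
```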

In [7]:
def affine_forward(h, U, b2):
    N, T, Dh = h.shape
    V = b2.shape[0]
    theta = (np.matmul(h.reshape(N*T, Dh), U.T) + b2).reshape(N, T, V)
    cache = U, b2, h
    return theta, cache

In [8]:
def affine_backward(dtheta, cache):
    U, b2, h = cache
    Dh = U.shape[1]
    N, T, V = dtheta.shape
    dh = np.matmul(dtheta.reshape(N*T, V), U).reshape(N, T, Dh)
    dU = np.matmul((dtheta.reshape(N*T, V).T), h.reshape(N*T, Dh))
    db2 = dtheta.sum(axis=(0,1))
    return dh, dU, db2

### Computation for the Softmax Layer
We will compute probabilities for all $T$ time-steps of all $N$ sequences in the minibatch in one go. This follows from the same reasoning as I described for the Affine layer above.

**Inputs:**
* Matrix $\theta$ of dimensions $N \times T \times V$ which stores the output of Affine Layers, and
* Matrix $y$ of dimensions $N \times T$ which stores the Vocabulary index of the true 'word' at each time-step of each sequence in the minibatch.

**Outputs:**
* Loss for the minibatch (a single floating point number: the sum over all unmasked time-steps, divided by $N$), and
* Matrix $dtheta$ of same dimensions as $\theta$, and which stores gradients of Loss w.r.t $\theta$.

**Notes:** 
* I have directly lifted the $softmax\_loss$ function from the starter code of Assignment 3 of Stanford's CS231n's [Winter 2016 edition](http://cs231n.stanford.edu/2016/).
* This version of $softmax\_loss$ uses a 'mask', an array of dimensions $N \times T$ whose zero entries mark the time-steps that should not be counted towards the loss. This is used to handle sequences whose length is less than $T$: we pad them with zeros (in my implementation) to increase their length to $T$, which makes for easy code elsewhere.
* I have previously discussed maths for backprop through a Softmax layer in a [blog post](https://talwarabhimanyu.github.io/blog/2017/05/20/softmax-backprop). You may refer to $Equation \space 1.3$ in that post which derives the gradient of Loss w.r.t $\theta$.
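
The effect of the mask can be seen in a small sketch. The function `masked_softmax_loss` below is a compact restatement written for this example (it is not the CS231n starter code itself), and the sizes and random inputs are arbitrary. It checks that a masked position receives zero gradient and that changing the scores at that position leaves the loss unchanged.

```python
import numpy as np

def masked_softmax_loss(theta, y, mask):
    """Cross-entropy summed over unmasked positions, divided by N."""
    N, T, V = theta.shape
    flat = theta.reshape(N * T, V)
    y_flat = y.reshape(N * T)
    m = mask.reshape(N * T)
    rows = np.arange(N * T)
    probs = np.exp(flat - flat.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.sum(m * np.log(probs[rows, y_flat])) / N
    dtheta = probs.copy()
    dtheta[rows, y_flat] -= 1.0          # probs minus one-hot targets
    dtheta *= m[:, None] / N             # zero out masked rows
    return loss, dtheta.reshape(N, T, V)

rng = np.random.default_rng(0)
N, T, V = 2, 3, 4
theta = rng.standard_normal((N, T, V))
y = rng.integers(0, V, size=(N, T))
mask = np.ones((N, T))
mask[1, 2] = 0                           # treat the last step of sequence 1 as padding

loss, dtheta = masked_softmax_loss(theta, y, mask)

theta_changed = theta.copy()
theta_changed[1, 2] = rng.standard_normal(V)   # change only the padded position
loss_changed, _ = masked_softmax_loss(theta_changed, y, mask)

print(np.allclose(dtheta[1, 2], 0.0), np.isclose(loss, loss_changed))
```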

In [9]:
def softmax(theta, y, mask):
    N, T, V = theta.shape
    theta_flat = theta.reshape(N*T, V)
    y_flat = y.reshape(N*T)
    mask_flat = mask.reshape(N*T)
    
    probs = np.exp(theta_flat - np.max(theta_flat, axis=1, keepdims=True))
    probs /= np.sum(probs, axis=1, keepdims=True)
    loss = -np.sum(mask_flat * np.log(probs[np.arange(N * T), y_flat])) / N
    dtheta_flat = probs.copy()
    dtheta_flat[np.arange(N * T), y_flat] -= 1
    dtheta_flat /= N
    dtheta_flat *= mask_flat[:, None]
    
    dtheta = dtheta_flat.reshape(N, T, V)
    return loss, dtheta

### Numerical Evaluation of Gradients to Check Correctness of our Implementation
If you are familiar with how to numerically check gradients for a network, you can skip this section and move on to [Training our LSTM](#sec_id).

Below, the function $eval\_grad$ numerically evaluates the gradient of a given function $f$ at a point $x$. This point $x$ can be multidimensional; for example, I will use the $2D$ matrix $W_f$ as a 'point'. Each entry of the 'gradient' is the rate of change in Loss with respect to a small perturbation of the corresponding entry of $x$, which we estimate with a centered difference.

Notice in the code below that I have multiplied by $df$ to calculate the gradient. This is because we are going to pass $lstm\_forward$ and $lstm\_step\_forward$ for the argument $f$. Both these functions return the vector $h$ and not the scalar Loss, which is what we need in order to compute the gradient w.r.t. the point $x$. Therefore we need to multiply by $dh$, the gradient of Loss w.r.t. $h$, to get our gradient; $dh$ is what we pass for the argument $df$.
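
A minimal sketch of this idea on a function whose gradient is known in closed form may help. The helper `numeric_grad` is a simplified stand-in written for this example, and the element-wise square is an arbitrary choice of $f$: since $f(x) = x^2$ element-wise, the gradient of Loss w.r.t. $x$ is $2x \odot df$, where $df$ plays the role of the upstream gradient.

```python
import numpy as np

def numeric_grad(f, x, df, eps=1e-5):
    """Centered differences, contracted with the upstream gradient df."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        orig = x[idx]
        x[idx] = orig + eps
        f_plus = f(x)
        x[idx] = orig - eps
        f_minus = f(x)
        x[idx] = orig                        # restore the entry
        grad[idx] = np.sum((f_plus - f_minus) * df) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 2))
df = rng.standard_normal((3, 2))             # stand-in for dLoss/df

f = lambda z: z ** 2                         # element-wise square
analytic = 2 * x * df                        # chain rule: f'(x) * df
numeric = numeric_grad(f, x, df)
print(np.max(np.abs(numeric - analytic)))    # should be tiny
```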

**Note: Below, I have used the numerical gradient evaluation functions provided for assignments of Stanford's course CS231n, _"Convolutional Neural Networks for Visual Recognition"_. Specifically, I have used code from the [Winter 2016 edition](http://cs231n.stanford.edu/2016/).**

In [10]:
def eval_grad(f, x, df):
    grad = np.zeros_like(x)
    epsilon = 1e-5
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        orig_val = x[idx]
        x[idx] = orig_val + epsilon
        fwd_fx = f(x)
        x[idx] = orig_val - epsilon
        bck_fx = f(x)
        grad[idx] = np.sum((fwd_fx - bck_fx)*df/(epsilon*2))
        x[idx] = orig_val
        it.iternext()
    return grad
def rel_error(x, y):
    # Returns relative error
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))

In [11]:
""" Check Gradients for a single LSTM Unit """

np.random.seed(10151)
N, D, d = 10, 5, 3
# Parameters
We = np.random.randn(D, D)
Wf = np.random.randn(D, D)
Wg = np.random.randn(D, D)
Wq = np.random.randn(D, D)

be = np.random.randn(D)
bf = np.random.randn(D)
bg = np.random.randn(D)
bq = np.random.randn(D)

Ue = np.random.randn(D, d)
Uf = np.random.randn(D, d)
Ug = np.random.randn(D, d)
Uq = np.random.randn(D, d)

params = {'We' : We, 'Wf' : Wf, 'Wg' : Wg, 'Wq' : Wq,
          'Ue' : Ue, 'Uf' : Uf, 'Ug' : Ug, 'Uq' : Uq,
          'be' : be, 'bf' : bf, 'bg' : bg, 'bq' : bq
         }

# Inputs
x_t = np.random.randn(N, d)
h_prev = np.random.randn(N, D)
s_prev = np.random.randn(N, D)

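# Test functions
# For the parameter checks the lambda argument is unused: eval_grad perturbs
# each array in place, and params holds references to those same arrays.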
fWe_h = lambda We: lstm_step_forward(x_t, h_prev, s_prev, params)[0]
fWf_h = lambda Wf: lstm_step_forward(x_t, h_prev, s_prev, params)[0]
fWg_h = lambda Wg: lstm_step_forward(x_t, h_prev, s_prev, params)[0]
fWq_h = lambda Wq: lstm_step_forward(x_t, h_prev, s_prev, params)[0]

fUe_h = lambda Ue: lstm_step_forward(x_t, h_prev, s_prev, params)[0]
fUf_h = lambda Uf: lstm_step_forward(x_t, h_prev, s_prev, params)[0]
fUg_h = lambda Ug: lstm_step_forward(x_t, h_prev, s_prev, params)[0]
fUq_h = lambda Uq: lstm_step_forward(x_t, h_prev, s_prev, params)[0]

fbe_h = lambda be: lstm_step_forward(x_t, h_prev, s_prev, params)[0]
fbf_h = lambda bf: lstm_step_forward(x_t, h_prev, s_prev, params)[0]
fbg_h = lambda bg: lstm_step_forward(x_t, h_prev, s_prev, params)[0]
fbq_h = lambda bq: lstm_step_forward(x_t, h_prev, s_prev, params)[0]

fWe_s = lambda We: lstm_step_forward(x_t, h_prev, s_prev, params)[1]
fWf_s = lambda Wf: lstm_step_forward(x_t, h_prev, s_prev, params)[1]
fWg_s = lambda Wg: lstm_step_forward(x_t, h_prev, s_prev, params)[1]
fWq_s = lambda Wq: lstm_step_forward(x_t, h_prev, s_prev, params)[1]

fUe_s = lambda Ue: lstm_step_forward(x_t, h_prev, s_prev, params)[1]
fUf_s = lambda Uf: lstm_step_forward(x_t, h_prev, s_prev, params)[1]
fUg_s = lambda Ug: lstm_step_forward(x_t, h_prev, s_prev, params)[1]
fUq_s = lambda Uq: lstm_step_forward(x_t, h_prev, s_prev, params)[1]

fbe_s = lambda be: lstm_step_forward(x_t, h_prev, s_prev, params)[1]
fbf_s = lambda bf: lstm_step_forward(x_t, h_prev, s_prev, params)[1]
fbg_s = lambda bg: lstm_step_forward(x_t, h_prev, s_prev, params)[1]
fbq_s = lambda bq: lstm_step_forward(x_t, h_prev, s_prev, params)[1]

fh_prev_h = lambda h_prev: lstm_step_forward(x_t, h_prev, s_prev, params)[0]
fh_prev_s = lambda h_prev: lstm_step_forward(x_t, h_prev, s_prev, params)[1]

# Evaluate test functions
h_next, s_next, cache_step = lstm_step_forward(x_t, h_prev, s_prev, params)
dh_next = np.random.randn(*h_next.shape)
ds_next = np.random.randn(*s_next.shape)
dh_prev, ds_prev, grads = lstm_step_backward(dh_next, ds_next, cache_step, params)

dWe_num = eval_grad(fWe_h, We, dh_next) + eval_grad(fWe_s, We, ds_next)
dWf_num = eval_grad(fWf_h, Wf, dh_next) + eval_grad(fWf_s, Wf, ds_next)
dWg_num = eval_grad(fWg_h, Wg, dh_next) + eval_grad(fWg_s, Wg, ds_next)
dWq_num = eval_grad(fWq_h, Wq, dh_next) + eval_grad(fWq_s, Wq, ds_next)

dUe_num = eval_grad(fUe_h, Ue, dh_next) + eval_grad(fUe_s, Ue, ds_next)
dUf_num = eval_grad(fUf_h, Uf, dh_next) + eval_grad(fUf_s, Uf, ds_next)
dUg_num = eval_grad(fUg_h, Ug, dh_next) + eval_grad(fUg_s, Ug, ds_next)
dUq_num = eval_grad(fUq_h, Uq, dh_next) + eval_grad(fUq_s, Uq, ds_next)

dbe_num = eval_grad(fbe_h, be, dh_next) + eval_grad(fbe_s, be, ds_next)
dbf_num = eval_grad(fbf_h, bf, dh_next) + eval_grad(fbf_s, bf, ds_next)
dbg_num = eval_grad(fbg_h, bg, dh_next) + eval_grad(fbg_s, bg, ds_next)
dbq_num = eval_grad(fbq_h, bq, dh_next) + eval_grad(fbq_s, bq, ds_next)

dh_prev_num = eval_grad(fh_prev_h, h_prev, dh_next) + eval_grad(fh_prev_s, h_prev, ds_next)

print('Error in dWe: {}'.format(rel_error(dWe_num, grads['We'])))
print('Error in dWf: {}'.format(rel_error(dWf_num, grads['Wf'])))
print('Error in dWg: {}'.format(rel_error(dWg_num, grads['Wg'])))
print('Error in dWq: {}'.format(rel_error(dWq_num, grads['Wq'])))
print('\n')
print('Error in dUe: {}'.format(rel_error(dUe_num, grads['Ue'])))
print('Error in dUf: {}'.format(rel_error(dUf_num, grads['Uf'])))
print('Error in dUg: {}'.format(rel_error(dUg_num, grads['Ug'])))
print('Error in dUq: {}'.format(rel_error(dUq_num, grads['Uq'])))
print('\n')
print('Error in dbe: {}'.format(rel_error(dbe_num, grads['be'])))
print('Error in dbf: {}'.format(rel_error(dbf_num, grads['bf'])))
print('Error in dbg: {}'.format(rel_error(dbg_num, grads['bg'])))
print('Error in dbq: {}'.format(rel_error(dbq_num, grads['bq'])))
print('\n')
print('Error in dh_prev: {}'.format(rel_error(dh_prev_num, dh_prev)))

Error in dWe: 2.859711404201946e-10
Error in dWf: 2.7817214490791844e-09
Error in dWg: 8.596699600007139e-10
Error in dWq: 1.7290091323325574e-09


Error in dUe: 3.807242366831024e-09
Error in dUf: 3.3221298997976607e-08
Error in dUg: 1.6343065069759545e-10
Error in dUq: 4.376333502803696e-10


Error in dbe: 6.885219031776487e-11
Error in dbf: 6.716352031190419e-10
Error in dbg: 2.949556129459655e-10
Error in dbq: 1.848974550920749e-11


Error in dh_prev: 1.940746476877571e-09


In [12]:
""" Check Gradients for the entire LSTM """

np.random.seed(10151)
T, N, D, d = 5, 7, 5, 10
# Parameters
We = np.random.randn(D, D)
Wf = np.random.randn(D, D)
Wg = np.random.randn(D, D)
Wq = np.random.randn(D, D)

Ue = np.random.randn(D, d)
Uf = np.random.randn(D, d)
Ug = np.random.randn(D, d)
Uq = np.random.randn(D, d)

be = np.random.randn(D)
bf = np.random.randn(D)
bg = np.random.randn(D)
bq = np.random.randn(D)

params = {'We' : We, 'Wf' : Wf, 'Wg' : Wg, 'Wq' : Wq,
          'Ue' : Ue, 'Uf' : Uf, 'Ug' : Ug, 'Uq' : Uq,
          'be' : be, 'bf' : bf, 'bg' : bg, 'bq' : bq
         }

# Inputs
x = np.random.randn(N, T, d)
h_0 = np.random.randn(N, D)
s_0 = np.random.randn(N, D)

# Test functions
fWe = lambda We: lstm_forward(T, x, h_0, s_0, params)[0]
fWf = lambda Wf: lstm_forward(T, x, h_0, s_0, params)[0]
fWg = lambda Wg: lstm_forward(T, x, h_0, s_0, params)[0]
fWq = lambda Wq: lstm_forward(T, x, h_0, s_0, params)[0]

fUe = lambda Ue: lstm_forward(T, x, h_0, s_0, params)[0]
fUf = lambda Uf: lstm_forward(T, x, h_0, s_0, params)[0]
fUg = lambda Ug: lstm_forward(T, x, h_0, s_0, params)[0]
fUq = lambda Uq: lstm_forward(T, x, h_0, s_0, params)[0]

fbe = lambda be: lstm_forward(T, x, h_0, s_0, params)[0]
fbf = lambda bf: lstm_forward(T, x, h_0, s_0, params)[0]
fbg = lambda bg: lstm_forward(T, x, h_0, s_0, params)[0]
fbq = lambda bq: lstm_forward(T, x, h_0, s_0, params)[0]

# Evaluate test functions
h, cache_dict = lstm_forward(T, x, h_0, s_0, params)
dh = np.random.randn(*h.shape)
all_grads = lstm_backward(dh, cache_dict, params)

dWe_num = eval_grad(fWe, We, dh)
dWf_num = eval_grad(fWf, Wf, dh)
dWg_num = eval_grad(fWg, Wg, dh)
dWq_num = eval_grad(fWq, Wq, dh)

dUe_num = eval_grad(fUe, Ue, dh)
dUf_num = eval_grad(fUf, Uf, dh)
dUg_num = eval_grad(fUg, Ug, dh)
dUq_num = eval_grad(fUq, Uq, dh)

dbe_num = eval_grad(fbe, be, dh)
dbf_num = eval_grad(fbf, bf, dh)
dbg_num = eval_grad(fbg, bg, dh)
dbq_num = eval_grad(fbq, bq, dh)

print('Error in dWe: {}'.format(rel_error(dWe_num, all_grads['We'])))
print('Error in dWf: {}'.format(rel_error(dWf_num, all_grads['Wf'])))
print('Error in dWg: {}'.format(rel_error(dWg_num, all_grads['Wg'])))
print('Error in dWq: {}'.format(rel_error(dWq_num, all_grads['Wq'])))
print('\n')
print('Error in dUe: {}'.format(rel_error(dUe_num, all_grads['Ue'])))
print('Error in dUf: {}'.format(rel_error(dUf_num, all_grads['Uf'])))
print('Error in dUg: {}'.format(rel_error(dUg_num, all_grads['Ug'])))
print('Error in dUq: {}'.format(rel_error(dUq_num, all_grads['Uq'])))
print('\n')
print('Error in dbe: {}'.format(rel_error(dbe_num, all_grads['be'])))
print('Error in dbf: {}'.format(rel_error(dbf_num, all_grads['bf'])))
print('Error in dbg: {}'.format(rel_error(dbg_num, all_grads['bg'])))
print('Error in dbq: {}'.format(rel_error(dbq_num, all_grads['bq'])))

Error in dWe: 1.301420817525277e-08
Error in dWf: 7.115199565178917e-10
Error in dWg: 9.268257581061839e-09
Error in dWq: 1.2087991093064742e-09


Error in dUe: 7.708884312846093e-09
Error in dUf: 1.0165199498294216e-09
Error in dUg: 5.196960122411291e-08
Error in dUq: 2.4545406556922528e-09


Error in dbe: 7.696329593761408e-10
Error in dbf: 4.398797000208122e-10
Error in dbg: 1.2224354996155846e-10
Error in dbq: 1.4806447099098493e-10


In [13]:
""" Check Gradients for the Affine layer """

np.random.seed(10151)
N, T, D, V = 5, 10, 6, 50

# Parameters
U = np.random.randn(V, D)
b2 = np.random.rand(V)

# Inputs
h = np.random.randn(N, T, D)

# Test Functions
fU = lambda U: affine_forward(h, U, b2)[0]
fb2 = lambda b2: affine_forward(h, U, b2)[0]
fh = lambda h: affine_forward(h, U, b2)[0]

# Evaluate test functions
theta, cache = affine_forward(h, U, b2)
dtheta = np.random.randn(*theta.shape)
dh, dU, db2 = affine_backward(dtheta, cache)

dU_num = eval_grad(fU, U, dtheta)
db2_num = eval_grad(fb2, b2, dtheta)
dh_num = eval_grad(fh, h, dtheta)

print('Error in dU: {}'.format(rel_error(dU_num, dU)))
print('Error in db2: {}'.format(rel_error(db2_num, db2)))
print('Error in dh: {}'.format(rel_error(dh_num, dh)))

Error in dU: 7.956108235981939e-10
Error in db2: 1.578675630521908e-09
Error in dh: 1.0117441876769132e-09


<a id='sec_id'></a>
### Training our LSTM
We will train our LSTM model on the [Dinosaur Names Dataset](https://github.com/brunoklein99/deep-learning-notes/blob/master/dinos.txt). The training data contains names of real dinosaurs - once trained, we will use the LSTM to sample some made-up dinosaur names!

In [15]:
from tqdm import tqdm_notebook as tqdm
import random

N = 512
T = 20
D = 256

loss_freq = 2
num_epochs = 300

train_file = 'dinos.txt'
encoding = 'utf-8'
with open(train_file, encoding=encoding) as f:
    data = f.read().lower()
chars = list(set(data))
data_size, V = len(data), len(chars)
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }

# Split text into strings of length T+1
data_list = [data[i*(T+1):(i+1)*(T+1)].strip() for i in range(len(data)//(T+1))]

# Parameters initialization
We = np.random.randn(D, D)
Wf = np.random.randn(D, D)
Wg = np.random.randn(D, D)
Wq = np.random.randn(D, D)

Ue = np.random.randn(D, V)
Uf = np.random.randn(D, V)
Ug = np.random.randn(D, V)
Uq = np.random.randn(D, V)

be = np.zeros(D)
bf = np.zeros(D)
bg = np.zeros(D)
bq = np.zeros(D)

U = np.random.randn(V, D)
b2 = np.zeros(V)

lstm_params = {'We' : We, 'Wf' : Wf, 'Wg' : Wg, 'Wq' : Wq,
              'Ue' : Ue, 'Uf' : Uf, 'Ug' : Ug, 'Uq' : Uq,
              'be' : be, 'bf' : bf, 'bg' : bg, 'bq' : bq
             }

# Mems for Adagrad
lstm_mems = {'We' : None, 'Wf' : None, 'Wg' : None, 'Wq' : None,
              'Ue' : None, 'Uf' : None, 'Ug' : None, 'Uq' : None,
              'be' : None, 'bf' : None, 'bg' : None, 'bq' : None
             }

for param in lstm_params:
    lstm_mems[param] = np.zeros_like(lstm_params[param])
mU = np.zeros_like(U)
mb2 = np.zeros_like(b2)

# Other variables' initialization
h_0 = np.zeros((N, D))
s_0 = np.zeros((N, D))

def str_to_idx(st):
    idx_arr = np.array([char_to_ix[ch] for ch in st])
    return idx_arr
lr = 0.1

prog_bar = tqdm(total=num_epochs)

for epoch in range(num_epochs):
    start = 0
    iter_count = 0
    running_loss = 0
    while True:
        iter_count += 1
        batch_str = data_list[start:(start + N)]
        batch_idx = [str_to_idx(st) for st in batch_str if len(st) == (T+1)]
        batch_size = len(batch_idx)
        if batch_size < N:
            batch_idx.extend([(T+1)*[0] for i in range(N - batch_size)])
        x = np.array([np.eye(V)[indices[0:len(indices)-1]] for indices in batch_idx])
        y = np.array([indices[1:] for indices in batch_idx])
        mask = np.ones((N, T))
        mask[batch_size:,:] = 0

        # forward pass
        h, cache_dict = lstm_forward(x.shape[1], x, h_0, s_0, lstm_params)
        theta, cache = affine_forward(h, U, b2)
        loss, dtheta = softmax(theta, y, mask)
        running_loss += loss
        if iter_count % loss_freq == 0:
            prog_bar.set_postfix(epoch='{}/{}'.format(epoch+1, num_epochs), \
                                 loss='{:.3f}'.format(running_loss/(loss_freq*N)))
            running_loss = 0
        # backprop
        dh, dU, db2 = affine_backward(dtheta, cache)
        for dz in [dh, dU, db2]: np.clip(dz, -5, 5, out=dz)
        all_grads = lstm_backward(dh, cache_dict, lstm_params)
        
        # update parameters with Adagrad
        for param in lstm_params:
            all_grads[param] = np.clip(all_grads[param], -5, 5)
            lstm_mems[param] += all_grads[param]*all_grads[param]
            lstm_params[param] += -lr*all_grads[param]/ \
                                np.sqrt(lstm_mems[param] + 1e-8)
        
        for z, dz, m in zip([U, b2], 
                            [dU, db2],
                            [mU, mb2]):
            m += dz*dz
            z += -lr*dz / np.sqrt(m + 1e-8)
        start += N
        if start >= len(data_list): break
    prog_bar.update(1)
    # Shuffle data
    random.shuffle(data_list)
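
To make the construction of `x` and `y` in the loop above concrete, here is a small sketch on a toy string. The string and vocabulary are made up for the example. Each training string of length $T+1$ yields $T$ one-hot inputs and $T$ targets, where the target at each position is the next character.

```python
import numpy as np

text = "rex\n"                               # toy string of length T + 1
vocab = sorted(set(text))                    # illustrative vocabulary
to_ix = {ch: i for i, ch in enumerate(vocab)}
V, T = len(vocab), len(text) - 1

idx = np.array([to_ix[ch] for ch in text])
x = np.eye(V)[idx[:-1]]                      # (T, V) one-hot inputs
y = idx[1:]                                  # (T,) next-character targets

print(x.shape, y.shape)
```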




In [15]:
# save weights
for param in lstm_params:
    lstm_params[param].dump(param + '.dat')
    lstm_mems[param].dump(param + '_mem.dat')
U.dump('U.dat')
b2.dump('b2.dat')
mU.dump('U_mem.dat')
mb2.dump('b2_mem.dat')

In [16]:
# sample text
def sampleText(length=20, seed_ch='t'):
    # The training text is lowercased, so seed_ch must be a lowercase character
    x_t = np.eye(V)[char_to_ix[seed_ch]]
    h_prev = np.zeros(D)
    s_prev = np.zeros(D)
    str_out = seed_ch
    for t in range(length):
        h_next, s_next, _ = lstm_step_forward(x_t, h_prev, s_prev, lstm_params)
        theta = np.matmul(U, h_next) + b2
        theta = theta - np.max(theta)  # shift logits for numerical stability
        p = np.exp(theta)/np.sum(np.exp(theta))
        idx = np.random.choice(range(V), p=p.ravel())
        h_prev = h_next
        s_prev = s_next
        x_t = np.eye(V)[idx]
        str_out += ix_to_char[idx]
    return str_out

In [19]:
num_samples = 15
for ch in ['t', 'v']:
    for i in range(num_samples):
        s = sampleText(15, ch)
        if not s: continue
        pos = s.find('\n')
        if pos != -1: s = s[0:pos]
        if len(s) > 5: print(s.capitalize())

Tholin
Topsaurus
Trosaurucrosauru
Tthunntin
Totltinptyrus
Tylinnnosaurosau
Tyrntasaurus
Teosaurus
Tyrantrosaur
Tosaurus
Thangeng
Tatassaurus
Tharaerophybrosa
Voteanadus
Veintor
Venator
Vinsaurus
Vontioa
Venator
Venator
Venator
Veterypasaururaa
Venator
Venatiror
Vetirus
Venaroeltratyyps
Venelnites
