# Recurrent Neural Network Tutorial




In [1]:
import nltk
import itertools
from collections import Counter
import numpy as np
import sys


## Introduction

I would like to talk about a special implmentation of neural network in this tutorial where we introduce recursions to it. And I am building a natural language processing model using RNN with the shakespear data we used in our lecture. So here comes the first question...

## Why Would I Bother To Have Recursions In My Neural Network?
### When both normal neural net and recursions themselves are already making my head hurts

Have you asked yourself this question, what do I do when the patterns in my data changes with time? Or what do I do when my next data points is dependent of the previous one? For a traditional neural network, we assume that all our input data (each row of the input X matrix to be specific) are independent from each other. But that could be a really bad idea sometimes right? Taking our language model building problem for example, the previous word appears in a text must have some influence on the word appears next. So for building that language model, your best bet is to use a recurrent neural network or RNN.


## Okay, Okay, Tell Me What Is An RNN Then

An RNN is a deep learning model which has a simple structure with a built-in feedback loop, allowing it to model sequential data and act as a forecasting engine. You can also think of it as a neural network with some memory abilities. 

Wait, what do you mean by memories? Does a traditional neural network has memories of what it saw in previous time steps? If you think about this question, normally a neural network's information flow would look like this:

\begin{aligned}  input -> hidden -> output  \end{aligned}

Memory changes this framework. All memory does is that the hidden layer is now decided by both the current input and the hidden layer from previous time step.

\begin{aligned}  (input + prevHidden) -> hidden -> output  \end{aligned}

In a Recurrent Net the output of layer (value of hidden units) is added to the next input, and feed back into the same layer.


We all love pictures so here is what a typical RNN looks like:

![title](rnn.jpg)

> A recurrent neural network and the unfolding in time of the computation involved in its forward computation.

> Source: Nature

As shown in the picture above, an RNN stores and feeds the previous hidden layer output to the next data point.



Let me introduce the notations here in more details:

- x_t is the input at time step t. 
- s_t is the hidden unit value at time step t. It’s the “memory” of the network. It captures information about what happened in all the previous time steps. s_t is calculated based on the previous hidden state and the input at the current step: \begin{aligned}  s_t=f(Ux_t + Ws_{t-1})  \end{aligned}. The first hidden state is typically initialized to all zeroes.
- o_t is the output at step t.  \begin{aligned}  o_t = \mathrm{sigmoid}(Vs_t)  \end{aligned}.

Let's see a color demonstration of what "memory" means for an RNN.

![title](rnn_color.png)

> Using color to show how recurrent neural network memories information from previous input data.

> Source: https://iamtrask.github.io/2015/11/15/anyone-can-code-lstm/

For the first time step, the outputs are only influenced by the input from time step 1. For the second time step, the outputs are influenced by both the input from time step 1 and time step 2. So on and so forth.

## You Know Enough To Build A Natural Language Model Using RNN Now!

Given a sequence of words we would like to predict the probability of each word given the previous words. Language Models allow us to measure how likely a sentence is, which is widely used in machine translation.


### 1. What Is A Language Model


Recall from our homework and course on NLP, a language model allows us to predict the probability of observing the sentence (in a given dataset) as:

\begin{aligned}  P(w_1,...,w_m) = \prod_{i=1}^{m} P(w_i \mid w_1,..., w_{i-1})  \end{aligned}

The possibility of a sentence is given by the word that comes before it. For example, the probability of "I love data science homework" would be the probability of "homework" given "I love data science", multiplied by the probability of “science” given “I love data”, and so on.


### 2. Introduce RNN To A Language Model

In Language Modeling our input is typically a sequence of words (encoded as one-hot vectors for example), and our output is the sequence of predicted words. When training the network we set
\begin{aligned}  o_t = x_{t+1}   \end{aligned} 
since we want the output at step t to be the actual next word.


### 3. How Does RNN Learn?

The basic learning procedure is similar as a normal neural network. We do forward propogation and then backpropogation. The slight difference is that we do forward propogation from 1 to 4 and then backpropagating all the derivatives from 4 back to 1. So we are using the same weights (U, V, W) for these 4 steps. This means that we are performing the same operation at each step, just with different inputs. Other than that, it's just normal backpropagation. We will get to that in more details later. 

So hopefully you get the idea that under the fancy name, Recurrent Neural Network is nothing but a neural network which takes multiple inputs at each time step, propagates them to the hidden layer and then generates output based on all the information the hidden layer chooses to remember.

## Let's Do Some Fun Preprocessing For Our Data First.

### 1. Text Tokenizing

That's exactly what we did for our homework 3. We tokenize our input text by making them all to lower case and remove punctuations. As we are not removing punctuations this time, we don't need to write a process function ourselves, simple use the nltk package to do the job.

### 2. Get Rare Words

Recall from our homework 3 again, really rare words are either typos or not important. So we would like to exclude them from our text. In this special case, as we would like to keep our vocabulary size (the words that we care about modeling) relatively small, we delete words that appeared equal or less than 5 times.

In [2]:
def get_rare_words(processed_text):
    all_words = [pp for p in processed_text for pp in p]
    word_dict = dict(Counter(all_words))
    rare = [k for k, v in word_dict.iteritems() if v <= 5]
    return sorted(rare)

### 3. Prepend Special START And END Tokens

We also want to learn which words tend start and end a sentence. To do this we prepend a special SENTENCE_START token, and append a special SENTENCE_END token to each sentence. 

### 4. It's Time To Read Our Shakespeare Data In And Build The Training Data Matrices!

#### (1) Read Data
First let's read in our Shakespeare data that is already downloaded into a txt file.

In [3]:
# Read in local txt file containing all Shakespeare work
f = open("shakespear.txt", "r" )
sentence = []
for line in f:
    new_line = "SENTENCE_START" + " " + line.strip().lower() + " " + "SENTENCE_END"
    sentence.append(new_line)

Let's take a look at the sentences we have read in

In [4]:
print "Example sentence"
for i in range(10, 15):
    print sentence[i]
print "--------------------------------------------"
print "Cheers! We parsed %d sentences in total!" % (len(sentence))

Example sentence
SENTENCE_START his tender heir might bear his memory: SENTENCE_END
SENTENCE_START but thou contracted to thine own bright eyes, SENTENCE_END
SENTENCE_START feed'st thy light's flame with self-substantial fuel, SENTENCE_END
SENTENCE_START making a famine where abundance lies, SENTENCE_END
SENTENCE_START thy self thy foe, to thy sweet self too cruel: SENTENCE_END
--------------------------------------------
Cheers! We parsed 124192 sentences in total!


#### (2) Parse Sentence Into Words, And Tokenize The Words

Now we would like to parse each sentence into separate words and at the same time, tokenize them using the nltk package.

In [5]:
tokenized_sent = [nltk.word_tokenize(sent) for sent in sentence]
print "Example sentence after tokenizing: "
for i in range(5):
    print tokenized_sent[i]

Example sentence after tokenizing: 
['SENTENCE_START', 'the', 'sonnets', 'SENTENCE_END']
['SENTENCE_START', 'SENTENCE_END']
['SENTENCE_START', 'by', 'william', 'shakespeare', 'SENTENCE_END']
['SENTENCE_START', 'SENTENCE_END']
['SENTENCE_START', 'SENTENCE_END']


#### (3) Remove Rare Words From Our Vocabulary

Now it's time to find the rare words and replace them from the text data with a unknown token symbol. The word UNKNOWN_TOKEN will become part of our vocabulary and we will predict it just like any other word. When we generate new text, we will replace the UNKNOWN_TOKEN with it.

In [6]:
# Find the rare words, ...
rare = get_rare_words(tokenized_sent)
print "Found %d rare words." % len(rare)

Found 21384 rare words.


We remove the rare words from our vocabulary list.

In [7]:
# And Remove them forever from our vocabulary...
new_word = [b for a in tokenized_sent for b in a]
unique_word = set(new_word)
vocab = [a for a in unique_word if a not in rare]
# Move "SENTENCE_START" to the 0th position and "SENTENCE_END" to the 1st position
vocab.insert(0, vocab.pop(vocab.index('SENTENCE_START')))
vocab.insert(1, vocab.pop(vocab.index('SENTENCE_END')))
vocab_size = len(vocab)
print "%d unique words left. This is our vocabulary size." % vocab_size

7678 unique words left. This is our vocabulary size.


We build a dictionary that maps from our unique words in our vocabulary to their indices. And a dictionary that maps from indices to the words.

In [8]:
word_to_index = dict([(w,i) for i, w in enumerate(vocab)])
word_to_index["UNKNOWN_TOKEN"] = len(word_to_index) - 1

index_to_word = dict((i, w) for i, w in enumerate(vocab))
index_to_word["UNKNOWN_TOKEN"] = len(word_to_index) - 1


And replace the rare words by 'UNKNOWN_TOKEN' in our word list. 

In [35]:
# And replace them with the unknown token in the training example
for i, sent in enumerate(tokenized_sent):
    tokenized_sent[i] = [w if w in word_to_index else "UNKNOWN_TOKEN" for w in sent]
print "Example sentence after replacing rare words with unknown token: "
for i in range(100, 103):
    print tokenized_sent[i]

Example sentence after replacing rare words with unknown token: 
['SENTENCE_START', 'ten', 'times', 'thy', 'self', 'were', 'happier', 'than', 'thou', 'art', ',', 'SENTENCE_END']
['SENTENCE_START', 'if', 'ten', 'of', 'thine', 'ten', 'times', 'UNKNOWN_TOKEN', 'thee', ':', 'SENTENCE_END']
['SENTENCE_START', 'then', 'what', 'could', 'death', 'do', 'if', 'thou', 'shouldst', 'depart', ',', 'SENTENCE_END']


#### (4) Map Input To Indices
Now the last step for input processing before we feed them to our recurrent neural network is to map our input words to indices. 

In [10]:
# Create the training data
X_train = np.asarray([[word_to_index[w] for w in sent[:-1]] for sent in tokenized_sent])
y_train = np.asarray([[word_to_index[w] for w in sent[1:]] for sent in tokenized_sent])

Let's take a look at a sample training example that we generated.

In [11]:
# Print an training data example
x_example, y_example = X_train[100], y_train[100]
print "x example:\n%s\n%s" % (" ".join([index_to_word[x] for x in x_example]), x_example)
print "\ny example:\n%s\n%s" % (" ".join([index_to_word[x] for x in y_example]), y_example)

x example:
SENTENCE_START ten times thy self were happier than thou art ,
[0, 3618, 840, 2396, 308, 3798, 1727, 6383, 4114, 7503, 6708]

y example:
ten times thy self were happier than thou art , SENTENCE_END
[3618, 840, 2396, 308, 3798, 1727, 6383, 4114, 7503, 6708, 1]


## Here Comes the Recurrent Neural Network!

### RNN Infrastructure Again

Recall the infrastructure of a RNN we saw before.


![title](rnn.jpg)

> A recurrent neural network and the unfolding in time of the computation involved in its forward computation. 

> Source: Nature

There are only three weight matrices, U, V, and W. U and V maps from input to hidden layer and from hidden layer to output respectively. The new matrix W propagates from the hidden layer to the hidden layer at the next timestep.

Let's get concrete and see what the RNN for our language model looks like. 

#### Input

The input $x$ will be a sequence of words (just like the example printed above) and each $x_t$ would be a single word. Note here we are using a one-hot vector of size vocabulary. 

So, each $x_t$ will become a vector, and $x$ will be a matrix, with each row representing a word. 

#### Output

The output of our network $o$ has a similar format as the input. Each $o_t$ is a vector as the same size as our vocabulary, and each element represents the probability of that word being the next word in the sentence.



### How To Train AN RNN

Training a RNN is similar to training a traditional Neural Network. We also use the backpropagation algorithm, but with some changes. We can't simply calculate the gradient because the gradient depends not only on the current time step but also the previous time steps. In order to calculate the gradient at t=4 we would need to backpropagate 3 steps and sum up the gradients. This is called Backpropagation Through Time (BPTT).

We will have to deal with that later. Now let's first initialize some parameters for our RNN class. 



### 1. Initialization

For the weights, we initialize them to be in the interval from $\left[-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}\right]$ where $n$ is the number of incoming connections from the previous layer. 

In [12]:
class RNN:
    def __init__(self, word_dim, hidden_dim = 100, bptt_truncate = 4):
        # Assign instance variables
        self.word_dim = word_dim # the input and output size
        self.hidden_dim = hidden_dim
        self.bptt_truncate = bptt_truncate
        # Randomly initialize the network parameters
        np.random.seed(0)
        self.V = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (word_dim, hidden_dim))
        self.U = np.random.uniform(-np.sqrt(1./word_dim), np.sqrt(1./word_dim), (hidden_dim, word_dim)) 
        self.W = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (hidden_dim, hidden_dim)) 
        # Initialize the gradient to be all 0
        self.V_grad = np.zeros(self.V.shape) 
        self.U_grad = np.zeros(self.U.shape) 
        self.W_grad = np.zeros(self.W.shape) 

Let's also define the sigmoid function here. It introduces nonlinearity to our model.

In [13]:
# Compute sigmoid nonlinearity
def sigmoid(x):
    output = 1 / (1 + np.exp(-x))
    return output

### 2. Forward Propagation

Now it's time for forward propagation.

In [14]:
def forward_propagation(self, X):
    # The total number of time steps
    t = len(X)
    # Create a ndarray to store all the hidden states from the previous input. We have to use it later.
    hidden_prev = np.zeros((t + 1, self.hidden_dim))
    hidden_prev[-1] = np.zeros(self.hidden_dim)
    # The outputs at each time step. Again, we save them for later.
    o = np.zeros((t, self.word_dim))
    # For each time step...
    for time_step in np.arange(t):
        # Hidden_layer = sigmoid(input + prvious_hidden_layer)
        # Note that we are indxing U by x[t]. This is the same as multiplying U with a one-hot vector.
        hidden_prev[time_step] = np.tanh(self.U[:, X[time_step]] + self.W.dot(hidden_prev[time_step - 1]))
        # Output_layer = sigmoid(hidden_layer)
        o[time_step] = sigmoid(self.V.dot(hidden_prev[time_step]))
    return o, hidden_prev

RNN.forward_propagation = forward_propagation

Please make sure you understand the line of code to calculate the hidden_prev[t]. That's all the magic about RNN. We first propagate from the input to the hidden layer (self.U[:, X[t]]). Then, we propagate from the previous hidden layer to the current hidden layer (self.W.dot(hidden_prev[t - 1])). Then we sum these 2 vectors and pass the result to the sigmoid function to generate our hidden layer value. That's where our model actually "memorize" things.

### 3. Making Predictions And Calculating The Loss

After forward propagation, we get out output. Now it's time to calculate the difference between our output and the true y. This is the loss we want to minimize when we train our model later.

In [15]:
def predict(self, X):
    # Perform forward propagation and return index of the highest score
    o, _ = self.forward_propagation(X)
    return np.argmax(o, axis = 1)
RNN.predict = predict

Our prediction for each input X is a vector of 14714 possibilities which represent the possibilities of next word in our vocabulary.

In [16]:
def loss(self, X, y):
    L = 0
    # For each sentence...
    for i in np.arange(len(y)):
        o, hidden_prev = self.forward_propagation(X[i])
        # We only care about our prediction of the "correct" words
        correct_word_predictions = o[np.arange(len(y[i])), y[i]]
        # Add to the loss based on how off we were
        L += -1 * np.sum(np.log(correct_word_predictions))
    return L
RNN.loss = loss

### 4. Check Our Forward Propagation Before Moving On

We will print our prediction using one single input (one sentence in X_train) to make predictions.

In [17]:
model = RNN(vocab_size)
o, s = model.forward_propagation(X_train[100])
print o.shape
pred = model.predict(X_train[100])
print pred.shape
print pred
print y_train[100]
print " ".join([index_to_word[x] for x in pred])
print " ".join([index_to_word[x] for x in y_train[100]])

(11, 7678)
(11,)
[5246 3219 5132 4138 2514 3691  602 3661  306  517 7234]
[3618, 840, 2396, 308, 3798, 1727, 6383, 4114, 7503, 6708, 1]
woodville infected ornaments befall'n friend exhibition cinna inc. waft guilt broker
ten times thy self were happier than thou art , SENTENCE_END


We could see that our prediction is totally random. That's what we expected because the parameters are randomly initialized. Seems that our forward propagation method is okay. Let's move on to BPTT!



### 5. BackPropagation Through Time (BPTT)

Calculating the derivatives and gradient for each weight, so we can use it to do gradient descent in our training.

In [24]:
def bptt(self, X, y):
    T = len(y)
    # Perform forward propagation
    delta_o, s = self.forward_propagation(X)
    # We accumulate the gradients in these variables
    V_grad = self.V_grad
    U_grad = self.U_grad
    W_grad = self.W_grad
    delta_o[np.arange(len(y)), y] -= 1.
    # For each output backwards...
    for t in np.arange(T)[::-1]:
        V_grad += np.outer(delta_o[t], s[t].T)
        # Initial delta calculation
        delta_t = self.V.T.dot(delta_o[t]) * (1 - (s[t] ** 2))
        # Backpropagation through time (for at most self.bptt_truncate steps)
        for bptt_step in np.arange(max(0, t - self.bptt_truncate), t + 1)[::-1]:
            # print "Backpropagation step t=%d bptt step=%d " % (t, bptt_step)
            W_grad += np.outer(delta_t, s[bptt_step-1])              
            U_grad[:,X[bptt_step]] += delta_t
            # Update delta for next step
            delta_t = self.W.T.dot(delta_t) * (1 - s[bptt_step-1] ** 2)
    return U_grad, V_grad, W_grad
RNN.bptt = bptt

### 6. Updating The Weights By Their Gradient Descent



In [32]:
# Performs one step of SGD.
def gd(self, x, y, learning_rate):
    # Calculate the gradients
    U_grad, V_grad, W_grad = self.bptt(x, y)
    # Change parameters according to gradients and learning rate
    self.U -= learning_rate * U_grad
    self.V -= learning_rate * V_grad
    self.W -= learning_rate * W_grad
RNN.gd = gd

def train(model, X_train, y_train, learning_rate = 1e-4, iteration = 100):
    # We keep track of the losses so we can plot them later
    losses = []
    num_examples_seen = 0
    N = np.sum((len(y) for y in y_train))
    for i in range(iteration):
        # Output the loss every 5 iterations
        if (i % 5 == 0 and i > 0):
            total_loss = model.loss(X_train, y_train)
            loss = total_loss / float(N)
            losses.append((num_examples_seen, loss))
            print "Average loss after num_examples_seen = %d i = %d: %f" % (num_examples_seen, i, loss)
            # Adjust the learning rate if loss increases
            if (len(losses) >= 2 and losses[-1][1] >= losses[-2][1]):
                learning_rate = learning_rate * 0.5 
                print "Setting learning rate to %f" % learning_rate
        # For each training example...
        for j in range(len(y_train)):
            # One Gradient Descent step
            model.gd(X_train[j], y_train[j], learning_rate)
            num_examples_seen += 1

Let's try to run the gradient descent on one single input and see if our implementation is efficient.

In [26]:
%timeit model.gd(X_train[100], y_train[100], 0.001)

10 loops, best of 3: 40.3 ms per loop


Oops, it takes about 3 milliseconds to run a single step of gradien descent. Seems that we can't process our whole dataset which has hundreds of thousands of rows. Let's just run with the first 100 input.

In [33]:
model = RNN(vocab_size)
losses = train(model, X_train[:100], y_train[:100], iteration = 100)

Loss after num_examples_seen = 500 i = 5: 72.666206
Loss after num_examples_seen = 1000 i = 10: 110.926938
Setting learning rate to 0.000050
Loss after num_examples_seen = 1500 i = 15: 118.704145
Setting learning rate to 0.000025


  app.launch_new_instance()


Loss after num_examples_seen = 2000 i = 20: 119.315965
Setting learning rate to 0.000013
Loss after num_examples_seen = 2500 i = 25: 115.688422
Loss after num_examples_seen = 3000 i = 30: 110.990778
Loss after num_examples_seen = 3500 i = 35: 106.110767
Loss after num_examples_seen = 4000 i = 40: 104.444395
Loss after num_examples_seen = 4500 i = 45: 100.539793
Loss after num_examples_seen = 5000 i = 50: 96.820646
Loss after num_examples_seen = 5500 i = 55: 93.683399
Loss after num_examples_seen = 6000 i = 60: 90.088033
Loss after num_examples_seen = 6500 i = 65: 89.402487
Loss after num_examples_seen = 7000 i = 70: 89.262406
Loss after num_examples_seen = 7500 i = 75: 83.562442
Loss after num_examples_seen = 8000 i = 80: 84.422838
Setting learning rate to 0.000006
Loss after num_examples_seen = 8500 i = 85: 78.061657
Loss after num_examples_seen = 9000 i = 90: 74.321762
Loss after num_examples_seen = 9500 i = 95: 71.959543


### 7. Solutions

Sorry this is not a perfect ending. Our model starts to converge when learning rate is around 1e-5 though.

In the research field, when it comes to training a recurrent net, GPUs are an obvious choice over an ordinary CPU as it is able to train the nets 250 times faster. 

## Reference

http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/

https://iamtrask.github.io/2015/11/15/anyone-can-code-lstm/

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

https://www.tensorflow.org/versions/r0.11/tutorials/recurrent/index.html