## Introduction

### What is an RNN
A recurrent neural network (RNN) is a class of artificial neural network in which connections between units form a directed cycle. This gives the network an internal state, which allows it to exhibit dynamic temporal behavior. RNNs have been applied to machine translation, speech recognition, and generating image descriptions. They differ from traditional feedforward neural networks, which assume that all inputs and outputs are independent of each other. An RNN performs the same task for every element of a sequence, with the output depending on the previous computations. Another way to think about RNNs is that they have a "memory" which captures information about what has been calculated so far. In theory RNNs can make use of information in arbitrarily long sequences, but in practice they are limited to looking back only a few steps.
[<img src="http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/09/rnn.jpg">](http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/09/rnn.jpg)

The above diagram shows an RNN being unrolled into a full network. By unrolling we simply mean that we write out the network for the complete sequence. For example, if the sequence we care about is a sentence of 5 words, the network would be unrolled into a 5-layer neural network, one layer for each word. The quantities and formulas that govern the computation happening in an RNN are as follows:
1. x_t is the input at time step t. For example, x_1 could be a one-hot vector corresponding to the second word of a sentence.

2. s_t is the hidden state at time step t. It’s the “memory” of the network. s_t is calculated based on the previous hidden state and the input at the current step: s_t=f(Ux_t + Ws_{t-1}). The function f is usually a nonlinearity such as tanh or ReLU. s_{-1}, which is required to calculate the first hidden state, is typically initialized to all zeroes.

3. o_t is the output at step t. For example, if we wanted to predict the next word in a sentence it would be a vector of probabilities across our vocabulary. o_t = \mathrm{softmax}(Vs_t).
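
The three definitions above can be traced by hand on a tiny network. The following is a minimal sketch in plain Python (standard library only), not the implementation built later in this tutorial. The sizes and weight values are made up for illustration, and the helper names `matvec`, `softmax` and `rnn_step` are our own assumptions.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def softmax(z):
    """Turn a list of scores into probabilities that sum to 1."""
    shift = max(z)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z_i - shift) for z_i in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(U, W, V, x_t, s_prev):
    """One time step: s_t = tanh(U x_t + W s_{t-1}), o_t = softmax(V s_t)."""
    Ux = matvec(U, x_t)
    Ws = matvec(W, s_prev)
    s_t = [math.tanh(a + b) for a, b in zip(Ux, Ws)]
    o_t = softmax(matvec(V, s_t))
    return s_t, o_t

# Hypothetical toy sizes: a vocabulary of 3 words and a hidden state of size 2.
U = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]      # 2 x 3
W = [[0.5, -0.3], [0.2, 0.1]]                 # 2 x 2
V = [[0.3, -0.1], [0.2, 0.2], [-0.4, 0.1]]    # 3 x 2

s = [0.0, 0.0]                       # initial hidden state, all zeros
sequence = [[1, 0, 0], [0, 0, 1]]    # two one-hot inputs: word 0, then word 2
states, outputs = [], []
for x_t in sequence:
    # The same U, W and V are reused at every time step.
    s, o = rnn_step(U, W, V, x_t, s)
    states.append(s)
    outputs.append(o)
```

Each entry of `outputs` is a probability distribution over the 3-word toy vocabulary, and the second hidden state depends on the first, which is the "memory" described above.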


### Tutorial content

In this tutorial, we will show how to implement a full Recurrent Neural Network from scratch using Python.

We will cover the following topics in this tutorial:
- [Training data and preprocessing](#Training-data-and-preprocessing)
- [Building the RNN](#Building-the-RNN)
- [Initialization](#Initialization)
- [Forward propagation](#Make-prediction)
- [Calculating the loss](#Implement-loss-function)
- [Training the RNN with SGD and backpropagation through time (BPTT)](#Training-the-RNN-with-SGD-and-Backpropagation-Through-Time-(BPTT))
- [Gradient checking](#Gradient-Checking)



### Import Libraries

Before building the network we need to process the raw text data. To split the text into sentences and words we use the NLTK library, in particular its `sent_tokenize` and `word_tokenize` functions.

In [21]:
import csv
import itertools
import operator
import numpy as np
import nltk
import sys
from datetime import datetime
from utils import *
import matplotlib.pyplot as plt
%matplotlib inline

In [22]:
# Download NLTK model data (you only need to do this once)
nltk.download("book")

[nltk_data] Downloading collection u'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /Users/hongjunliu/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     /Users/hongjunliu/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package chat80 to
[nltk_data]    |     /Users/hongjunliu/nltk_data...
[nltk_data]    |   Package chat80 is already up-to-date!
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     /Users/hongjunliu/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package conll2000 to
[nltk_data]    |     /Users/hongjunliu/nltk_data...
[nltk_data]    |   Package conll2000 is already up-to-date!
[nltk_data]    | Downloading package conll2002 to
[nltk_data]    |     /Users/hongjunliu/nltk_data...
[nltk_data]    |   Package conll2002 is already up-to-dat

True

## Training data and preprocessing

To train our language model we need text to learn from, and we first need to pre-process it to get the data into the right format. The libraries imported above help us here.

In [99]:
vocabulary_size = 10000
unknown = "UNKNOWN"

In [100]:
# Read the data and append SENTENCE_START and SENTENCE_END tokens
print "Reading CSV file..."
with open('reddit-comments-2015-08.csv', 'rb') as f:
    reader = csv.reader(f, skipinitialspace=True)
    reader.next()
    # Split full comments into sentences
    sentences = itertools.chain(*[nltk.sent_tokenize(x[0].decode('utf-8').lower()) for x in reader])
    # Append SENTENCE_START and SENTENCE_END
    sentences = ["%s %s %s" % ("SENTENCE_START", x, "SENTENCE_END") for x in sentences]
    
print "Parsed %d sentences." % (len(sentences))

Reading CSV file...
Parsed 79170 sentences.


#### Tokenize Text

We have raw text, but since the prediction is word-based, we need to split the sentences into words. Words should be separated on whitespace and punctuation, which NLTK's `word_tokenize` does for us.

In [101]:
# Tokenize the sentences into words
processed_sentences = []
for sent in sentences:
    processed_sentences.append(nltk.word_tokenize(sent))
print "The length of processed words is ", len(processed_sentences)

print "finish"

The length of processed words is  79170
finish


#### Remove infrequent words
As you may have guessed, the number of possible words is prohibitively large, and not all of them are useful for our language model. Our first sub-task is to determine which words to retain and which to omit. A common heuristic is to construct a frequency distribution of the words in the corpus and prune its head, its tail, or both. The intuition is as follows. Very common words (e.g. stopwords) add almost no information about the similarity of two pieces of text, while very rare words tend to be typos and appear too seldom for the model to learn anything about them. Because a language model has to predict common words as well, here we only prune the tail: we keep the `vocabulary_size - 1` most frequent words and replace every other word with the `UNKNOWN` token.
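
To make the effect of pruning concrete, here is a small sketch on a made-up corpus. It uses `collections.Counter` from the standard library instead of NLTK, and the function names `build_vocabulary` and `replace_rare_words` are hypothetical helpers for illustration, not part of this tutorial's code.

```python
from collections import Counter

def build_vocabulary(tokenized_sentences, vocabulary_size, unknown="UNKNOWN"):
    """Keep the (vocabulary_size - 1) most frequent words plus an unknown token."""
    counts = Counter(w for sent in tokenized_sentences for w in sent)
    index_to_word = [w for w, _ in counts.most_common(vocabulary_size - 1)]
    index_to_word.append(unknown)
    word_to_index = {w: i for i, w in enumerate(index_to_word)}
    return index_to_word, word_to_index

def replace_rare_words(tokenized_sentences, word_to_index, unknown="UNKNOWN"):
    """Replace every word outside the vocabulary with the unknown token."""
    return [[w if w in word_to_index else unknown for w in sent]
            for sent in tokenized_sentences]

# Toy corpus, made up for illustration: "dog" and "ran" each occur only once.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
index_to_word, word_to_index = build_vocabulary(corpus, vocabulary_size=4)
pruned = replace_rare_words(corpus, word_to_index)
# pruned[1] is now ["the", "UNKNOWN", "sat"]
```

With a vocabulary of 4 entries, the three most frequent words survive and the two rare ones collapse into the single `UNKNOWN` token.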


In [102]:
# Count the word frequencies
word_freq = nltk.FreqDist(itertools.chain(*processed_sentences))
print "Found %d unique words tokens." % len(word_freq.items())

# Get the most common words and build index_to_word and word_to_index vectors
vocab = word_freq.most_common(vocabulary_size-1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown)
word_to_index = dict([(w,i) for i,w in enumerate(index_to_word)])

print "Using vocabulary size %d." % vocabulary_size
print "The least frequent word in our vocabulary is '%s' and appeared %d times." % (vocab[-1][0], vocab[-1][1])

# Replace all words not in our vocabulary with the unknown tokens
for i, sent in enumerate(processed_sentences):
    processed_sentences[i] = [w if w in word_to_index else unknown for w in sent]

print "\nExample sentence: '%s'" % sentences[0]
print "\nExample sentence after Pre-processing: '%s'" % processed_sentences[0]

Found 65751 unique words tokens.
Using vocabulary size 10000.

Example sentence: 'SENTENCE_START i joined a new league this year and they have different scoring rules than i'm used to. SENTENCE_END'

Example sentence after Pre-processing: '[u'SENTENCE_START', u'i', u'joined', u'a', u'new', u'league', u'this', u'year', u'and', u'they', u'have', u'different', u'scoring', u'rules', u'than', u'i', u"'m", u'used', u'to', u'.', u'SENTENCE_END']'


#### Build training data matrices

The input to our recurrent neural network is a vector, not a string. So we create a mapping between words and indices, `index_to_word` and `word_to_index`. For example, the word "friendly" may be at index 2001. A training example $x$ may look like [0, 179, 341, 416], where 0 corresponds to SENTENCE_START. The corresponding label $y$ would be [179, 341, 416, 1]. Our goal is to predict the next word, so $y$ is just the $x$ vector shifted by one position, with the last element being the SENTENCE_END token. In this example, the correct prediction for word 179 would be 341, the actual next word.
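
The shift between $x$ and $y$ can be checked on the example indices from the paragraph above. In this sketch the three words assigned to indices 179, 341 and 416 are invented for illustration, and `make_training_pair` is a hypothetical helper name.

```python
def make_training_pair(sentence, word_to_index):
    """Map a tokenized sentence to (x, y), where y is x shifted by one word."""
    indices = [word_to_index[w] for w in sentence]
    x = indices[:-1]   # everything except SENTENCE_END
    y = indices[1:]    # everything except SENTENCE_START
    return x, y

# Hypothetical vocabulary entries; only the indices match the text above.
word_to_index = {"SENTENCE_START": 0, "SENTENCE_END": 1,
                 "what": 179, "is": 341, "this": 416}
sentence = ["SENTENCE_START", "what", "is", "this", "SENTENCE_END"]

x, y = make_training_pair(sentence, word_to_index)
# x is [0, 179, 341, 416] and y is [179, 341, 416, 1]
```

At every position `t`, `y[t]` is the word that follows `x[t]`, which is exactly what the network is trained to predict.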

In [103]:
# Create the training data using processed_sentences
X_train = np.asarray([[word_to_index[w] for w in sent[:-1]] for sent in processed_sentences])
y_train = np.asarray([[word_to_index[w] for w in sent[1:]] for sent in processed_sentences])

Here is an actual training example from our text:

In [104]:
# Print a training data example
print "x:\n%s\n%s" % (" ".join([index_to_word[x] for x in X_train[20]]), X_train[20])
print "\ny:\n%s\n%s" % (" ".join([index_to_word[x] for x in y_train[20]]), y_train[20])

x:
SENTENCE_START & gt ; you 're just supporting it to make a gesture towards background check reform in bad faith .
[0, 55, 85, 44, 10, 75, 38, 2037, 11, 5, 102, 7, 8350, 707, 1013, 333, 5216, 14, 199, 1664, 2]

y:
& gt ; you 're just supporting it to make a gesture towards background check reform in bad faith . SENTENCE_END
[55, 85, 44, 10, 75, 38, 2037, 11, 5, 102, 7, 8350, 707, 1013, 333, 5216, 14, 199, 1664, 2, 1]


## Building the RNN
As in the figure at the top of the tutorial, the input will be a sequence of words and each $x_t$ is a single word. But there's one more thing: because of how matrix multiplication works, we can't simply use a word index as an input. Instead, we represent each word as a one-hot vector of size `vocabulary_size`. So each $x_t$ becomes a vector, and $x$ becomes a matrix with each row representing a word. We will perform this transformation in our neural network code rather than in pre-processing. The output of our network $o$ has a similar format. Each $o_t$ is a vector of `vocabulary_size` elements, and each element represents the probability of that word being the next word in the sentence.

$
\begin{aligned}
s_t &= \tanh(Ux_t + Ws_{t-1}) \\
o_t &= \mathrm{softmax}(Vs_t)
\end{aligned}
$

I always find it useful to write down the dimensions of the matrices and vectors. Let's assume we pick a vocabulary size $C = 1000$ and a hidden layer size $H = 100$. You can think of the hidden layer size as the "memory" of our network. Making it bigger allows us to learn more complex patterns, but also results in additional computation. Then we have:

$
\begin{aligned}
x_t & \in \mathbb{R}^{1000} \\
o_t & \in \mathbb{R}^{1000} \\
s_t & \in \mathbb{R}^{100} \\
U & \in \mathbb{R}^{100 \times 1000} \\
V & \in \mathbb{R}^{1000 \times 100} \\
W & \in \mathbb{R}^{100 \times 100} \\
\end{aligned}
$

This is valuable information. Remember that $U,V$ and $W$ are the parameters of our network that we want to learn from data. Thus, we need to learn a total of $2HC + H^2$ parameters. With $C=1000$ and $H=100$ as above that is 210,000; with a vocabulary of $C=10000$ and $H=100$ it is 2,010,000. The dimensions also tell us the bottleneck of our model. Note that because $x_t$ is a one-hot vector, multiplying it with $U$ is essentially the same as selecting a column of $U$, so we don't need to perform the full multiplication. Then, the biggest matrix multiplication in our network is $Vs_t$. That's why we want to keep our vocabulary size small if possible.
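
Both claims in the paragraph above, the parameter count and the column-selection shortcut, are easy to verify. The sketch below uses plain Python lists and a made-up $2 \times 4$ matrix; the helper names are our own.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1.0 at the given index."""
    v = [0.0] * size
    v[index] = 1.0
    return v

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def column(M, j):
    """Select column j of matrix M."""
    return [row[j] for row in M]

def parameter_count(C, H):
    """U is H x C, V is C x H and W is H x H, so 2*H*C + H^2 in total."""
    return 2 * H * C + H * H

# Toy U with H = 2 rows and C = 4 columns (values made up).
U = [[0.1, 0.2, 0.3, 0.4],
     [0.5, 0.6, 0.7, 0.8]]
word_index = 2

full_product = matvec(U, one_hot(word_index, 4))  # C multiplications per row
selected = column(U, word_index)                  # no multiplications at all

small_model = parameter_count(C=1000, H=100)
large_model = parameter_count(C=10000, H=100)
```

Here `full_product` and `selected` hold the same values, which is why the forward pass can index into $U$ instead of multiplying by a one-hot vector.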

### Implementation

#### Initialization

We start by declaring an RNN class and initializing our parameters. Initializing the parameters $U,V$ and $W$ properly is very important. We cannot simply set all of them to the same fixed value such as 0, because then every hidden unit would perform the same computation and receive the same updates. Initializing them randomly breaks this symmetry. One recommended approach is to initialize the weights randomly in the interval $\left[-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}\right]$, where $n$ is the number of incoming connections from the previous layer.

In [95]:
class RNNNumpy:
    # hidden_dim defaults to 100, the hidden layer size H chosen above. bptt_truncate is explained later.
    def __init__(self, word_dim, hidden_dim=100, bptt_truncate=4):
        self.word_dim = word_dim
        self.hidden_dim = hidden_dim
        self.bptt_truncate = bptt_truncate
        # Randomly initialize the network parameters, each uniformly in
        # [-1/sqrt(n), 1/sqrt(n)], where n is the number of incoming connections.
        self.U = np.random.uniform(-np.sqrt(1./word_dim), np.sqrt(1./word_dim), (hidden_dim, word_dim))
        self.V = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (word_dim, hidden_dim))
        self.W = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (hidden_dim, hidden_dim))
        

#### Make prediction
Using the equation above, we can implement the prediction function.
During forward propagation we save all hidden states in `s` because we need them later. We add one additional element for the initial hidden state, which we set to 0.
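
Storing the initial hidden state as an extra last element relies on Python's negative indexing: at the first time step, `s[t-1]` is `s[-1]`. The following scalar sketch isolates that trick. The weights `w_in` and `w_rec` are hypothetical stand-ins for $U$ and $W$, and the function is an illustration, not the tutorial's `forward_propagation`.

```python
import math

def hidden_states(inputs, w_in, w_rec):
    """Scalar RNN: s[t] = tanh(w_in * x[t] + w_rec * s[t-1])."""
    T = len(inputs)
    s = [0.0] * (T + 1)   # T real states plus one extra slot at the end
    s[-1] = 0.0           # the extra slot holds the initial hidden state
    for t in range(T):
        # For t == 0, s[t - 1] is s[-1], the initial state stored last.
        s[t] = math.tanh(w_in * inputs[t] + w_rec * s[t - 1])
    return s

states = hidden_states([1.0, 0.5], w_in=0.3, w_rec=0.8)
# states[0] and states[1] are the real hidden states;
# states[-1] is still the untouched initial state 0.0
```

Because the loop only writes to indices `0` to `T-1`, the last slot is never overwritten and remains available as the initial state.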

In [96]:
def forward_propagation(self, x):
    # The total number of time steps
    T = len(x)
    s = np.zeros((T + 1, self.hidden_dim))
    s[-1] = np.zeros(self.hidden_dim)
    # The outputs at each time step. Again, we save them for later.
    o = np.zeros((T, self.word_dim))
    # For each time step...
    for t in np.arange(T):
        # Note that we are indexing U by x[t]. This is the same as multiplying U with a one-hot vector.
        s[t] = np.tanh(self.U[:,x[t]] + self.W.dot(s[t-1]))
        o[t] = softmax(self.V.dot(s[t]))
    return [o, s]



def predict(self, x):
    # Perform forward propagation and return index of the highest score
    o, s = self.forward_propagation(x)
    return np.argmax(o, axis=1)


print "finish"

finish


In [97]:
RNNNumpy.forward_propagation = forward_propagation
RNNNumpy.predict = predict
# Fix the random seed so that the randomly initialized parameters are reproducible
np.random.seed(10)
model = RNNNumpy(vocabulary_size)
o, s = model.forward_propagation(X_train[10])
print o.shape
print o

# Predict the most likely next word at each time step
predictions = model.predict(X_train[10])
print predictions.shape
print predictions

(11, 1000)
[[ 0.00101317  0.00099654  0.00099927 ...,  0.00099873  0.0010151
   0.00100557]
 [ 0.00099106  0.00099863  0.00097568 ...,  0.00100935  0.00099581
   0.00099827]
 [ 0.0009939   0.00099987  0.00097949 ...,  0.00100389  0.00100047
   0.0009945 ]
 ..., 
 [ 0.00097222  0.00100205  0.00099743 ...,  0.00101555  0.00100605
   0.00097576]
 [ 0.0009983   0.00099256  0.00098318 ...,  0.00099973  0.00099021
   0.00100587]
 [ 0.00099783  0.00099938  0.00098342 ...,  0.0010036   0.00100369
   0.0009917 ]]
(11,)
[458 820 978 388 661 978 730 609 583 412 978]


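Here we fix the random seed with `np.random.seed(10)` so that the randomly initialized parameters are the same on every run. Without a fixed seed, new random weights would be generated each time you start, and the network's results would differ from run to run. We then call `forward_propagation` on one training sentence: for each of its 11 words it returns a vector of probabilities over the vocabulary, and `predict` returns the index of the most probable next word. Since the weights are still random at this point, so are the predictions.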

#### Implement loss function
To evaluate our model, we need to measure the errors it makes. This is what the loss function is for. Our goal is to find the parameters $U,V$ and $W$ that minimize the loss function for our training data. A common choice is the cross-entropy loss; the average per-word perplexity typically reported in papers (often just called perplexity) is the exponential of this loss. If we have $N$ training examples (words in our text) and $C$ classes (the size of our vocabulary), then the loss with respect to our predictions $o$ and the true labels $y$ is given by:

$
\begin{aligned}
L(y,o) = - \frac{1}{N} \sum_{n \in N} y_{n} \log o_{n}
\end{aligned}
$

The formula sums over our training examples and adds to the loss based on how far off our predictions are. The further apart $y$ (the correct words) and $o$ (our predictions) are, the greater the loss will be.
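
A quick sanity check of the formula: when $y_n$ is one-hot, $y_n \log o_n$ simply picks out the log-probability the model assigned to the correct word. If the model predicts every one of $C$ words with equal probability, the loss is therefore $\log C$. The sketch below confirms this in plain Python; the function name and the toy numbers are our own.

```python
import math

def cross_entropy(predictions, targets):
    """Average negative log-probability assigned to the correct words.

    predictions[n] is a list of probabilities over the vocabulary,
    targets[n] is the index of the correct word for example n.
    """
    total = 0.0
    for o_n, y_n in zip(predictions, targets):
        total += -math.log(o_n[y_n])
    return total / len(targets)

# Uniform predictions over C = 1000 words: every word gets probability 1/C.
C = 1000
uniform = [1.0 / C] * C
loss_uniform = cross_entropy([uniform] * 3, [3, 17, 256])
perplexity_uniform = math.exp(loss_uniform)   # exp(log C) = C

# A confident prediction is rewarded when right and punished when wrong.
loss_good = cross_entropy([[0.7, 0.2, 0.1]], [0])   # correct word had p = 0.7
loss_bad = cross_entropy([[0.7, 0.2, 0.1]], [2])    # correct word had p = 0.1
```

This gives a useful baseline: a freshly initialized network should start with a loss close to $\log C$, and training should push it below that value.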

In [98]:
def calculate_total_loss(self, x, y):
    L = 0
    # For each sentence...
    for i in np.arange(len(y)):
        o, s = self.forward_propagation(x[i])
        correct_word_predictions = o[np.arange(len(y[i])), y[i]]
        # Add to the loss based on how off we were
        L += -1 * np.sum(np.log(correct_word_predictions))
    return L

def calculate_loss(self, x, y):
    # Divide the total loss by the number of training examples (words)
    N = sum(len(y_i) for y_i in y)
    return self.calculate_total_loss(x,y)/N

RNNNumpy.calculate_total_loss = calculate_total_loss
RNNNumpy.calculate_loss = calculate_loss
# Calculate the loss on the first 1000 training examples.
print "Expected Loss for random predictions: %f" % np.log(vocabulary_size)
print "Actual loss: %f" % model.calculate_loss(X_train[:1000], y_train[:1000])

Expected Loss for random predictions: 6.907755
Actual loss: 6.911610


### Training the RNN with SGD and Backpropagation Through Time (BPTT)
#### Understanding Backpropagation Through Time (BPTT)
The purpose of recurrent nets is to accurately classify sequential input, and we rely on the backpropagation of error and gradient descent to achieve this.

Backpropagation in feedforward networks moves backward from the final error through the outputs, weights and inputs of each hidden layer, assigning those weights responsibility for a portion of the error by calculating their partial derivatives ∂E/∂w, that is, how much the error changes when a weight changes. Those derivatives are then used by our learning rule, gradient descent, to adjust the weights up or down, whichever direction decreases the error. Recurrent networks rely on an extension of backpropagation called backpropagation through time, or BPTT. Time, in this case, is simply expressed by a well-defined, ordered series of calculations linking one time step to the next, which is all backpropagation needs to work.

In our example, we want to find the parameters $U,V$ and $W$ that minimize the total loss on the training data. The most common way to do this is SGD, Stochastic Gradient Descent. The idea behind SGD is pretty simple. We iterate over all our training examples, and during each iteration we nudge the parameters in a direction that reduces the error. These directions are given by the gradients of the loss: $\frac{\partial L}{\partial U}, \frac{\partial L}{\partial V}, \frac{\partial L}{\partial W}$. SGD also needs a *learning rate*, which defines how big a step we want to make in each iteration. SGD is one of the most popular optimization methods not only for neural networks, but also for many other machine learning algorithms. As such there has been a lot of research on how to optimize SGD using batching, parallelism and adaptive learning rates.
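
The update rule described above can be sketched on a toy problem with a single parameter. This is an illustration of plain SGD under made-up data, not the training code for the RNN; the names `sgd` and `squared_error_gradient` and the chosen learning rate are assumptions.

```python
def sgd(gradient, w, examples, learning_rate, epochs):
    """Plain SGD: after each example, step against that example's gradient."""
    for _ in range(epochs):
        for example in examples:
            w -= learning_rate * gradient(w, example)
    return w

def squared_error_gradient(w, target):
    """Gradient of the per-example loss (w - target)**2 with respect to w."""
    return 2.0 * (w - target)

# Toy data: the total loss (w - 2)^2 + (w - 4)^2 is smallest at w = 3.
targets = [2.0, 4.0]
w = sgd(squared_error_gradient, w=0.0, examples=targets,
        learning_rate=0.05, epochs=200)
```

The result ends up close to 3 but not exactly on it, because each step only follows the gradient of one example. For the RNN the same rule would be applied to $U$, $V$ and $W$, with the gradients supplied by BPTT.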

The following implements BPTT, which calculates those gradients for one training example. The `bptt_truncate` parameter limits how many time steps back each error is propagated.

In [68]:
def bptt(self, x, y):
    T = len(y)
    # Perform forward propagation
    o, s = self.forward_propagation(x)
    # We accumulate the gradients in these variables
    dLdU = np.zeros(self.U.shape)
    dLdV = np.zeros(self.V.shape)
    dLdW = np.zeros(self.W.shape)
    delta_o = o
    delta_o[np.arange(len(y)), y] -= 1.
    # For each output backwards...
    for t in np.arange(T)[::-1]:
        dLdV += np.outer(delta_o[t], s[t].T)
        # Initial delta calculation
        delta_t = self.V.T.dot(delta_o[t]) * (1 - (s[t] ** 2))
        # Backpropagation through time (for at most self.bptt_truncate steps)
        for bptt_step in np.arange(max(0, t-self.bptt_truncate), t+1)[::-1]:
            # print "Backpropagation step t=%d bptt step=%d " % (t, bptt_step)
            dLdW += np.outer(delta_t, s[bptt_step-1])              
            dLdU[:,x[bptt_step]] += delta_t
            # Update delta for next step
            delta_t = self.W.T.dot(delta_t) * (1 - s[bptt_step-1] ** 2)
    return [dLdU, dLdV, dLdW]

RNNNumpy.bptt = bptt
print "finish"

finish


#### Gradient Checking

Whenever you implement backpropagation it is a good idea to also implement *gradient checking*, which is a way of verifying that your implementation is correct. The idea behind gradient checking is that the derivative of the loss with respect to a parameter is equal to the slope at that point, which we can approximate by changing the parameter by a small amount $h$ in both directions and dividing the change in the loss by the change in the parameter:

$
\begin{aligned}
\frac{\partial L}{\partial \theta} \approx \frac{L(\theta + h) - L(\theta - h)}{2h}
\end{aligned}
$

We then compare the gradient we calculated using backpropagation to the gradient we estimated with the method above. The approximation needs to calculate the total loss for *every* parameter, which makes gradient checking very expensive (remember, we had more than a million parameters in the example above). So it's a good idea to perform it on a model with a smaller vocabulary.
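
Before applying the check to the whole network, it helps to see it work on a function whose derivative we know. The sketch below uses a made-up one-parameter loss, $\tanh(\theta)^2$, and compares its analytic derivative with the central-difference estimate. All names here are hypothetical helpers.

```python
import math

def numerical_gradient(f, theta, h=1e-4):
    """Central-difference estimate: (f(theta + h) - f(theta - h)) / (2h)."""
    return (f(theta + h) - f(theta - h)) / (2.0 * h)

def relative_error(a, b):
    """|a - b| / (|a| + |b|), defined as 0 when both values are 0."""
    denom = abs(a) + abs(b)
    return 0.0 if denom == 0.0 else abs(a - b) / denom

def loss(theta):
    """Toy loss with a tanh nonlinearity, like the hidden layer of the RNN."""
    return math.tanh(theta) ** 2

def analytic_gradient(theta):
    """d/dtheta tanh(theta)^2 = 2 * tanh(theta) * (1 - tanh(theta)^2)."""
    t = math.tanh(theta)
    return 2.0 * t * (1.0 - t ** 2)

theta = 0.5
estimated = numerical_gradient(loss, theta)
exact = analytic_gradient(theta)
error = relative_error(estimated, exact)

# A deliberately wrong "gradient" (off by a factor of 2) fails the check.
error_wrong = relative_error(estimated, 2.0 * exact)
```

A correct derivative gives a tiny relative error, while a bug such as a missing factor produces an error far above any reasonable threshold. The relative error is used instead of the absolute difference so that the same threshold works for large and small gradients.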

In [69]:
def gradient_check(self, x, y, h=0.001, error_threshold=0.01):
    # Calculate the gradients using backpropagation. We want to check if these are correct.
    bptt_gradients = self.bptt(x, y)
    # List of all parameters we want to check.
    model_parameters = ['U', 'V', 'W']
    # Gradient check for each parameter
    for pidx, pname in enumerate(model_parameters):
        # Get the actual parameter value from the model, e.g. model.W
        parameter = operator.attrgetter(pname)(self)
        print "Performing gradient check for parameter %s with size %d." % (pname, np.prod(parameter.shape))
        # Iterate over each element of the parameter matrix, e.g. (0,0), (0,1), ...
        it = np.nditer(parameter, flags=['multi_index'], op_flags=['readwrite'])
        while not it.finished:
            ix = it.multi_index
            # Save the original value so we can reset it later
            original_value = parameter[ix]
            # Estimate the gradient using (f(x+h) - f(x-h))/(2*h)
            parameter[ix] = original_value + h
            gradplus = self.calculate_total_loss([x],[y])
            parameter[ix] = original_value - h
            gradminus = self.calculate_total_loss([x],[y])
            estimated_gradient = (gradplus - gradminus)/(2*h)
            # Reset parameter to original value
            parameter[ix] = original_value
            # The gradient for this parameter calculated using backpropagation
            backprop_gradient = bptt_gradients[pidx][ix]
            # Calculate the relative error: (|x - y|/(|x| + |y|))
            relative_error = np.abs(backprop_gradient - estimated_gradient)/(np.abs(backprop_gradient) + np.abs(estimated_gradient))
            # If the error is too large, fail the gradient check
            if relative_error > error_threshold:
                print "Gradient Check ERROR: parameter=%s ix=%s" % (pname, ix)
                print "+h Loss: %f" % gradplus
                print "-h Loss: %f" % gradminus
                print "Estimated_gradient: %f" % estimated_gradient
                print "Backpropagation gradient: %f" % backprop_gradient
                print "Relative Error: %f" % relative_error
                return 
            it.iternext()
        print "Gradient check for parameter %s passed." % (pname)

RNNNumpy.gradient_check = gradient_check

# To avoid performing millions of expensive calculations we use a smaller vocabulary size for checking.
grad_check_vocab_size = 100
np.random.seed(10)
model = RNNNumpy(grad_check_vocab_size, 10, bptt_truncate=1000)
model.gradient_check([0,1,2,3], [1,2,3,4])

Performing gradient check for parameter U with size 1000.
Gradient check for parameter U passed.
Performing gradient check for parameter V with size 1000.
Gradient check for parameter V passed.
Performing gradient check for parameter W with size 100.
Gradient check for parameter W passed.


#### Congratulations! You have finished this tutorial. We hope it gives you some ideas about RNNs.