## RNNs and text

Sequence data is everywhere, and classifying or forecasting it has applications in many different areas.  One area where RNNs have proven useful in recent years is machine understanding of natural language text, a field often called Natural Language Processing (NLP).

Text data can be viewed as a sequence at multiple resolutions - for example, as sequences of paragraphs, lines, sentences, words or individual characters.  

![](files/lucky.png)

In the same way that feed-forward and convolutional neural networks can be applied to classification or regression tasks, there are a variety of different tasks that RNNs can be applied to:
- We can use an RNN to regress a value based on sequential inputs - this is what we did in our binary addition example.
- We can use an RNN to classify its input - an NLP example of this approach is determining the positive or negative sentiment of tweets.  
- We can also use RNNs to forecast future values of sequences given their historical values - an NLP example of this approach is forecasting the next word in a partial sentence.  This can be very useful in real-time text translation between languages.

Each of these tasks can take either a single vector or a whole sequence of vectors as input, and can likewise produce either a single vector or a whole sequence of vectors as output.

# Example 2: Character level text generation

We are going to replicate an RNN training example made famous by Andrej Karpathy in his entertaining and informative blog post ["The unreasonable effectiveness of Recurrent Neural Networks"](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).  In that post Andrej showed how an RNN can be trained on a text corpus, one character at a time, to predict the next character in the sequence.  Once trained in this way the network can be provided a seed of characters to begin a sequence and then asked to recursively predict as many additional characters in the sequence as we like.  If training has been successful the generated characters will form words, sentences and even paragraphs with the appropriate vocabulary, grammar and structure learned from the training corpus.

Learning to generate not only sensible next character guesses but whole words and sentences with correct grammar requires the network to have a memory of tens or even hundreds of characters in the past.
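As a rough illustration of how next-character training pairs can be formed, the sketch below cuts a string into fixed-length inputs whose targets are the same characters shifted one position to the right. It is plain Lua rather than torch-rnn code: `make_pairs` is a hypothetical helper, and the way torch-rnn's `DataLoader` lays out its batches may differ.

```lua
-- Build (input, target) pairs for next-character prediction.
-- `make_pairs` is a hypothetical helper, not part of torch-rnn.
local function make_pairs(text, seq_length)
  local result = {}
  -- Step through the text in non-overlapping windows of seq_length characters
  for start = 1, #text - seq_length, seq_length do
    local input  = text:sub(start, start + seq_length - 1)
    local target = text:sub(start + 1, start + seq_length)  -- shifted by one
    result[#result + 1] = {input = input, target = target}
  end
  return result
end

local examples = make_pairs('To be, or not to be', 6)
for _, pair in ipairs(examples) do
  print(string.format('%q -> %q', pair.input, pair.target))
end
```

At every position the target character is simply the character that follows the input character, which is why no manual labelling of the corpus is needed.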

For this example we are going to use the [torch-rnn](https://github.com/ngimel/torch-rnn) codebase implemented by Justin Johnson (Stanford) and then accelerated with cuDNN by NVIDIA's own Natalia Gimelshein atop the scientific computing framework [Torch](http://torch.ch/).  Torch has wide support for GPU accelerated machine learning algorithms. It is easy to use and efficient, thanks to an easy and fast scripting language, LuaJIT, and an underlying C/CUDA implementation.

In [None]:
-- First we will import the required torch modules

package.path = '/home/ubuntu/notebook/torch-rnn/?.lua;/home/ubuntu/notebook/torch-rnn/?/?.lua;' .. package.path
require 'torch'
require 'nn'
require 'cunn'
require 'cudnn'
require 'util.DataLoader'
require 'LanguageModel'

utils = require 'util.utils'
unpack = unpack or table.unpack

cutorch.setDevice(1)          -- we will use the first GPU
dtype = 'torch.CudaTensor'

We will use a text file containing 40000 lines from the works of William Shakespeare as our training dataset.  The raw data is in the file [tiny-shakespeare](torch-rnn/data/tiny-shakespeare.txt).  torch-rnn expects the text characters of the training data to be converted to integers and stored in an HDF5 binary file, and the mapping from text characters to integers to be stored in an accompanying [JSON file](torch-rnn/data/shakespeare.json). The `DataLoader` module from torch-rnn can iteratively generate mini-batches of data for training from the HDF5 file.
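To make the character-to-integer conversion concrete, here is a minimal plain-Lua sketch. The helpers `build_vocab`, `encode` and `decode` are hypothetical and are not part of torch-rnn; the sketch assigns indices in order of first appearance and treats each byte as one character, which is an assumption and may not match what the torch-rnn preprocessing produces.

```lua
-- Build a character vocabulary and encode text as a list of integers.
-- build_vocab, encode and decode are hypothetical helpers for illustration.
local function build_vocab(text)
  local token_to_idx, idx_to_token = {}, {}
  for i = 1, #text do
    local ch = text:sub(i, i)
    if token_to_idx[ch] == nil then
      -- Assign the next free index to a character seen for the first time
      idx_to_token[#idx_to_token + 1] = ch
      token_to_idx[ch] = #idx_to_token
    end
  end
  return token_to_idx, idx_to_token
end

local function encode(text, token_to_idx)
  local ids = {}
  for i = 1, #text do
    ids[i] = token_to_idx[text:sub(i, i)]
  end
  return ids
end

local function decode(ids, idx_to_token)
  local chars = {}
  for i, id in ipairs(ids) do
    chars[i] = idx_to_token[id]
  end
  return table.concat(chars)
end

local text = 'hello world'
local token_to_idx, idx_to_token = build_vocab(text)
local ids = encode(text, token_to_idx)
print(table.concat(ids, ' '))
print(decode(ids, idx_to_token))
```

The two tables play the same roles as the `token_to_idx` and `idx_to_token` mappings stored in the JSON file: one is used to encode the corpus before training, the other to turn sampled integers back into text.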

In [None]:
-- Set parameters for the training files and minibatch generation
opt = {}       -- We will use the opt dictionary to store all parameters related to the RNN training
opt.input_h5 = 'torch-rnn/data/shakespeare.h5'
opt.input_json = 'torch-rnn/data/shakespeare.json'
opt.batch_size = 50     
opt.seq_length = 50

-- Instantiate a data loader to generate minibatches from the HDF5 file
loader = DataLoader(opt)
vocab = utils.read_json(opt.input_json)

-- Create a dictionary mapping integer indices back to text characters
idx_to_token = {}
for k, v in pairs(vocab.idx_to_token) do
  idx_to_token[tonumber(k)] = v
end
opt.idx_to_token = idx_to_token

-- Sample the first training batch and target
-- The batch is opt.batch_size sequences each of length opt.seq_length
-- The RNN will be trained one character at a time from each sequence
-- The target is the next character in the sequence
x, y = loader:nextBatch('train')
print(x:size())
print(y:size())

RNN layers are implemented in torch-rnn as a subclass of Torch [nn.Module](https://github.com/torch/nn/blob/master/doc/module.md#nn.Module).   An RNN layer transforms a sequence of input vectors of dimension `D` into a sequence of hidden state vectors of dimension `H`; it operates over sequences of length `T` and minibatches of size `N`, which can be different on each forward pass.

First we will show how to instantiate an RNN layer in torch-rnn and feed-forward some random data.

In [None]:
-- Initialize an RNN layer
N, T, D, H = 3, 4, 5, 6
rnn = nn.VanillaRNN(D, H)

-- Initialize some random input data
x  = torch.randn(N, T, D)

-- Initialize a random initial hidden state (the state carried in before the first time step)
h0 = torch.randn(N, H)

-- Feed-forward the random data  
h = rnn:forward{h0, x}
print(h)

An RNN layer instantiated in this way also has a `backward` method that takes as input the original input to `forward` as well as the gradients computed with respect to the output of `forward`.  This function provides the means to perform gradient backpropagation on the layer.
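For intuition about what the forward pass of such a layer computes, the sketch below unrolls the standard vanilla RNN recurrence, h_t = tanh(Wx * x_t + Wh * h_(t-1) + b), over a three-step sequence using plain Lua tables. The names `matvec`, `rnn_step`, `Wx`, `Wh` and `b` and the weight values are illustrative assumptions; this is not how torch-rnn implements the layer, which works on batched tensors.

```lua
-- A vanilla RNN recurrence on plain Lua tables (single sequence, no batching).
-- All names and weight values below are illustrative assumptions.

-- Numerically stable tanh built from math.exp
local function tanh(z)
  local e = math.exp(-2 * math.abs(z))
  local t = (1 - e) / (1 + e)
  if z < 0 then return -t end
  return t
end

-- Matrix-vector product for a matrix stored as a table of rows
local function matvec(W, v)
  local out = {}
  for i = 1, #W do
    local sum = 0
    for j = 1, #v do
      sum = sum + W[i][j] * v[j]
    end
    out[i] = sum
  end
  return out
end

-- One time step: h_t = tanh(Wx * x_t + Wh * h_prev + b)
local function rnn_step(Wx, Wh, b, x, h_prev)
  local from_input, from_state = matvec(Wx, x), matvec(Wh, h_prev)
  local h = {}
  for i = 1, #b do
    h[i] = tanh(from_input[i] + from_state[i] + b[i])
  end
  return h
end

local Wx = {{0.5, -0.3}, {0.1, 0.8}}   -- H x D input-to-hidden weights
local Wh = {{0.2, 0.0}, {-0.4, 0.6}}   -- H x H hidden-to-hidden weights
local b  = {0.0, 0.1}                  -- bias of length H
local xs = {{1, 0}, {0, 1}, {1, 1}}    -- T = 3 inputs of dimension D = 2
local h  = {0, 0}                      -- initial hidden state h0
local hs = {}                          -- hidden state at every time step

for t = 1, #xs do
  h = rnn_step(Wx, Wh, b, xs[t], h)
  hs[t] = h
  print(t, h[1], h[2])
end
```

The same weights are reused at every time step, and each hidden state depends on the previous one; this is the source of the layer's memory.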

For more complicated applications we can extend the RNN structure we used in our first example in many of the same ways we can extend a feed-forward or convolutional neural network.  For example, we can increase the number of neurons in the hidden layer and we can stack multiple hidden layers to create a deep RNN architecture.  We can define a stacked RNN network for natural language processing and randomly initialize it on the GPU in Torch in the following way:

In [None]:
rnn_size = 128    -- Number of neurons in each hidden layer
num_layers = 2    -- Number of hidden layers

vocab_size = 256  -- How many unique text characters appear in the training data corpus
wordvec_dim = 64  -- What size of dense vector should be used to represent each character
-- NOTE: a dense real-valued vector is often used to represent individual characters or words
-- in a text sequence rather than sparse "one-hot" vectors

V, D, H = vocab_size, wordvec_dim, rnn_size

net = nn.Sequential()          -- Initialize an empty network architecture
net:add(nn.LookupTable(V, D))  -- Add a lookup table layer for character vectors

-- Iteratively stack RNN layers on the network
for i = 1, num_layers do
    local prev_dim = H
    if i == 1 then prev_dim = D end
    local rnn
    rnn = nn.VanillaRNN(prev_dim, H)
    net:add(rnn)
end

-- The next three layers (a view, a linear layer and a second view) make
-- the output of the network a fully-connected linear layer applied to the
-- top hidden layer at every time-step rather than just the last time-step.
-- Don't worry about the details, just understand the purpose.
view1 = nn.View(1, 1, -1):setNumInputDims(3)
view2 = nn.View(1, -1):setNumInputDims(2)
net:add(view1)
net:add(nn.Linear(H, V))
net:add(view2)

net:type(dtype)  -- Tells Torch to use the CudaTensor type for the network variables, hence using the GPU

params, grad_params = net:getParameters()  -- This will return vectors of the randomly initialized network parameters

crit = nn.CrossEntropyCriterion():type(dtype)  -- The cross-entropy loss function will be used for optimization

print(net)  -- Print network structure

This stacked RNN is now ready to be fed training data and trained using backpropagation through time (BPTT), but we are not going to use this specific implementation in this example.  Instead we will use the torch-rnn script [LanguageModel.lua](torch-rnn/LanguageModel.lua) to handle this same initialization of a stacked RNN for NLP much more concisely:

In [None]:
-- Specify stacked RNN architecture
opt.model_type = 'rnn'
opt.wordvec_size = 64    -- size of dense real-valued vector representing each character
opt.rnn_size = 128       -- number of neurons in each RNN layer
opt.num_layers = 2       -- number of stacked RNN layers
opt.dropout = 0          -- dropout probability applied on feed-forward (0 disables dropout)
opt.batchnorm = 0        -- option to use batch normalization
opt.cudnn = 1            -- option to use cuDNN native RNN implementation

-- Initialize model
model = nn.LanguageModel(opt):type(dtype)

params, grad_params = model:getParameters()  -- This will return vectors of the randomly initialized network parameters
crit = nn.CrossEntropyCriterion():type(dtype)   -- The cross-entropy loss function will be used for optimization

The Torch [`optim`](https://github.com/torch/optim) optimization library is used to manage the stochastic gradient descent training process.  

In [None]:
require 'optim'

N, T = opt.batch_size, opt.seq_length

-- Define the loss function that we pass to an optim method
-- This function will retrieve a batch of data, feed it forward through the model,
-- compute the loss and backpropagate to obtain the gradients
-- (the parameter update itself is applied by the optim method)
function f(w)
    assert(w == params)
    grad_params:zero()
    
    -- Retrieve next data batch
    local x, y = loader:nextBatch('train')
    x, y = x:type(dtype), y:type(dtype)
    
    -- Perform feed-forward operation
    local scores = model:forward(x)
    
    -- Use the Criterion to compute loss; we need to reshape the scores to be
    -- two-dimensional before doing so. Annoying.
    local scores_view = scores:view(N * T, -1)
    local y_view = y:view(N * T)
    local loss = crit:forward(scores_view, y_view)
    
    -- Run the Criterion and model backward to compute gradients
    local grad_scores = crit:backward(scores_view, y_view):view(N, T, -1)
    model:backward(x, grad_scores)
    
    grad_params:clamp(-5, 5)  -- Clip every gradient value to [-5, 5] to guard against exploding gradients
    
    return loss, grad_params
end

To train the model we use a simple loop iteratively calling the loss function `f(w)` defined above.  Executing the cell below should take about 1 minute.

In [None]:
opt.max_epochs = 10

-- NOTE: optim.adam reads the step size from the `learningRate` field
local optim_config = {learningRate = 2e-3}

local num_train = loader.split_sizes['train']
local num_iterations = opt.max_epochs * num_train

model:training()

local timer = torch.Timer()

for i = 1, num_iterations do
    
    local epoch = math.floor(i / num_train) + 1
    
    -- After a complete epoch (pass through the entire training data)
    -- reset the model hidden states
    if i % num_train == 0 then
        model:resetStates()
        
        -- Apply learning rate decay once per epoch boundary, when the
        -- upcoming epoch number is a multiple of 5 (not on every iteration)
        if epoch % 5 == 0 then
            -- Update the field in place so ADAM keeps its moment estimates
            optim_config.learningRate = optim_config.learningRate * 0.5
        end
    end
    
    -- Perform optimization update using the ADAM algorithm
    local _, loss = optim.adam(f, params, optim_config)
    
    local float_epoch = i / num_train
    
    if float_epoch % 1 == 0 then
        local msg = 'Epoch %.2f / %d, i = %d / %d, loss = %f'
        local args = {msg, float_epoch, opt.max_epochs, i, num_iterations, loss[1]}
        print(string.format(unpack(args)))
    end
end

-- Display total training time
local msg = '\n Training took %.2f seconds'
local args = {msg, timer:time().real}
print(string.format(unpack(args)))

Now that the model is trained we can use it to generate some original random text.  We use the `model:sample()` function to do this.  Note the use of the `temperature` variable.  Decreasing the temperature from 1 to some lower number (e.g. 0.5) makes the RNN more confident, but also more conservative in its samples. Conversely, higher temperatures will give more diversity but at the cost of more mistakes (e.g. spelling mistakes). In particular, setting the temperature very near zero will give the most likely thing that Shakespeare might write.
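The effect of temperature can be illustrated with a few lines of plain Lua. The sketch assumes the common formulation in which the scores are divided by the temperature before a softmax turns them into sampling probabilities; `softmax_with_temperature` and the example scores are hypothetical and are not taken from the torch-rnn code.

```lua
-- Softmax with temperature over a small table of scores.
-- `softmax_with_temperature` is a hypothetical helper for illustration.
local function softmax_with_temperature(scores, temperature)
  local max_score = -math.huge
  for _, s in ipairs(scores) do
    max_score = math.max(max_score, s)
  end
  local probs, total = {}, 0
  for i, s in ipairs(scores) do
    -- Subtracting the maximum keeps exp() numerically stable
    probs[i] = math.exp((s - max_score) / temperature)
    total = total + probs[i]
  end
  for i = 1, #probs do
    probs[i] = probs[i] / total
  end
  return probs
end

local scores = {2.0, 1.0, 0.1}  -- made-up scores for three candidate characters
for _, temperature in ipairs({1.0, 0.5, 0.1}) do
  local p = softmax_with_temperature(scores, temperature)
  print(string.format('T = %.1f: %.3f %.3f %.3f', temperature, p[1], p[2], p[3]))
end
```

As the temperature falls, more of the probability mass moves onto the highest-scoring character, so sampling becomes more conservative and less diverse.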

In [None]:
opt.length = 2000          -- how many random characters should be generated
opt.start_text = ''        -- option to provide some text as a seed for the generation process
opt.temperature = 1        -- controls how conservative vs. diverse the sampling is

model:evaluate()
print(model:sample(opt))

Even after just 10 training epochs we see that the RNN has learned to use appropriate vocabulary as well as the typical structure of Shakespeare's writing.  It's astonishing to think that the model learned this by seeing only one character at a time - this really demonstrates the sequential memory encoded within an RNN.

**Exercise 1:** Feel free to modify the variables in the `opt` dictionary and re-run the cells above.  Things you might want to try are looking at the effect on the quality of generated text by increasing the RNN layer sizes, adding more RNN layers or increasing the sequence length used in the data loader.

**Exercise 2:**  What happens to the training time if you change `opt.cudnn` to 0?  This will deactivate the use of the RNN CUDA kernels in cuDNN and instead use the native Torch `cunn` kernels.  

## Challenges in training RNNs

Unfortunately, simple RNNs with many stacked layers can be brittle and difficult to train.  This brittleness arises because the backpropagation of gradients within a neural network is a recursive multiplication process.  This means that if the factors multiplied at each step are smaller than one in magnitude the gradients will shrink exponentially, and if they are larger than one the gradients will grow exponentially.  These problems are called "vanishing" and "exploding" gradients respectively.  RNNs are especially prone to them because backpropagation through time unrolls the network across every time step.  One way to look at this unrolling is that it creates a very deep feed-forward neural network, especially for deep RNNs applied to long time sequences.
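The exponential behaviour is easy to see with plain arithmetic. The sketch below is a deliberate simplification: it treats the gradient as a single number multiplied by the same factor at every step, whereas a real network multiplies by matrices that change from step to step. `repeated_product` is a hypothetical helper.

```lua
-- Multiply a starting gradient of 1.0 by the same factor at every step.
-- A scalar stand-in for the chain of multiplications in backpropagation.
local function repeated_product(factor, steps)
  local gradient = 1.0
  for _ = 1, steps do
    gradient = gradient * factor
  end
  return gradient
end

for _, factor in ipairs({0.9, 1.0, 1.1}) do
  local result = repeated_product(factor, 100)
  print(string.format('factor %.1f after 100 steps: %.3e', factor, result))
end
```

A factor only slightly below one drives the product towards zero over 100 steps, while a factor only slightly above one makes it grow by several orders of magnitude.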

![Backprop in RNNs](files/backpropRNN.png)

As you see in the diagram the red arrows representing the backpropagation of gradients accumulate in very long chains for an RNN on the right versus the feed-forward network on the left and so can easily vanish or explode through multiplication.  For a much more detailed explanation of these problems, see [this](https://www.youtube.com/watch?v=Pp4oKq4kCYs) video by Geoffrey Hinton.

One of the most popular solutions for preventing vanishing and exploding gradients is a modified network structure called Long Short-Term Memory (LSTM).  For a thorough understanding of LSTM networks, after completing this lab, I recommend you read [this](http://colah.github.io/posts/2015-08-Understanding-LSTMs) blog post.  LSTMs were introduced by [Hochreiter & Schmidhuber](http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf) in 1997 and have been subsequently refined and put to good use by many people.

For now the key idea to remember about LSTM layers is that they add gating functions that control how much the next network state is determined by new input and how much it is determined by what has already been seen. By allowing the network to learn how to control these gates, vanishing and exploding gradients can be greatly reduced.
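The gating idea can be sketched with a single scalar state. This is a heavy simplification and not the full LSTM: in a real LSTM the gate values are computed from the current input and previous hidden state through learned weights, whereas here `forget_logit` and `input_logit` are fixed numbers chosen by hand, and `gated_update` and `run` are hypothetical helpers.

```lua
-- A simplified scalar gated update, loosely inspired by the LSTM cell state.
-- The gate logits are fixed by hand here; a real LSTM learns to compute them.
local function sigmoid(z)
  return 1 / (1 + math.exp(-z))
end

local function gated_update(prev_state, candidate, forget_logit, input_logit)
  local keep  = sigmoid(forget_logit)  -- fraction of the old state to keep
  local admit = sigmoid(input_logit)   -- fraction of the new candidate to let in
  return keep * prev_state + admit * candidate
end

-- Carry a state of 1.0 through a number of steps with a candidate of 0.0
local function run(forget_logit, input_logit, steps)
  local state = 1.0
  for _ = 1, steps do
    state = gated_update(state, 0.0, forget_logit, input_logit)
  end
  return state
end

local kept   = run(6.0, -6.0, 50)    -- gate almost fully open: state is retained
local forgot = run(-6.0, -6.0, 50)   -- gate almost closed: state is discarded
print(kept, forgot)
```

With the keep gate close to one, most of the initial state survives 50 steps; with it close to zero, the state is wiped almost immediately. Learning when to hold the gate open is what lets information, and gradients, persist over long sequences.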

LSTM layers can be viewed as a drop-in replacement for the vanilla RNN layer we have already defined.  In `torch-rnn` we can choose to make this replacement by setting the variable `opt.model_type` to 'lstm'.  

**Exercise 3:**  Modify the code above to set `opt.model_type = 'lstm'` and re-run the training.  What happens to model accuracy and training speed?  

**Exercise 4:**  Set `opt.batchnorm = 1` to apply batch normalization and `opt.dropout = 0.2` to apply dropout regularization.  What happens to the quality of generated text?

## Answers to exercises

**Answer 1:** You will find that a longer training sequence length, more RNN layers or a larger RNN will all improve the quality of the generated text but will also increase training time for the network.

**Answer 2:** You will find that training time is about 1.5X slower.  This is because cuDNN v5 introduced a number of optimizations for RNN layers, for example fusing of multiple stacked layers into a smaller number of more efficient operations.  You can read more about these optimizations in [this blog post](https://devblogs.nvidia.com/parallelforall/optimizing-recurrent-neural-networks-cudnn-5/).

**Answer 3:** You will find that training takes a little longer due to the added complexity of an LSTM layer versus an RNN layer.  You may also find that the quality of the generated text is worse. This is due to the LSTM overfitting the training data. 

**Answer 4:** With these options you should see a much higher quality output from the LSTM model versus the original RNN model.  These regularization techniques prevent overfitting in the LSTM.