#  Recurrent Neural Networks (RNNs) for Language Modeling

**WARNING: DRAFT**

In previous tutorials, we worked with *feedforward* neural networks. They're called feedforward networks because each layer feeds into the next layer in a chain connecting the inputs to the outputs.


![](img/multilayer-perceptron.png)

At each iteration $t$, we feed in a new example $x_t$, by setting the values of the input nodes (orange). We then *feed the activation forward* by successively calculating the activations of each higher layer in the network. Finally, we read the outputs from the topmost layer. 

So when we feed the next example $x_{t+1}$, we overwrite all of the previous activations. If consecutive inputs to our network have no special relationship to each other (say, images uploaded by unrelated users), then this is perfectly acceptable behavior. But what if our inputs exhibit a sequential relationship?

Say, for example, that you want to predict the next character in a string of text. We might decide to feed each character into the neural network with the goal of predicting the succeeding character. 

![](img/recurrent-motivation.png)

In the above example, the neural network forgets the previous context every time you feed a new input. How is the neural network supposed to know that "e" is followed by a space? It's hard to see why that should be so probable if you didn't know that the "e" was the final letter in the word "Time". 

Recurrent neural networks provide a slick way to incorporate sequential structure. At each time step $t$, each hidden layer $h_t$ will (typically) receive input from both the current input $x_t$ and from *that same hidden layer* at the previous time step, $h_{t-1}$.

![](img/recurrent-lm.png)

Now, when our net is trying to predict what comes after the "e" in "Time", it has access to its previous *beliefs*, and by extension, the entire history of inputs. Zooming back in to see how the nodes in a basic RNN are connected, you'll see that each node in the hidden layer is connected to each node in the hidden layer at the next time step:

![](img/simple-rnn.png)

Even though the neural network contains loops (the hidden layer is connected to itself), this connection spans a time step, so once we unroll the network across time it is still technically a feedforward network. Thus we can still train by backpropagation just as we normally would with an MLP. Typically the loss function will be an average of the losses at each time step.

In this tutorial, we're going to roll up our sleeves and write a simple RNN in MXNet using nothing but ``mxnet.ndarray`` and ``mxnet.autograd``. In practice, unless you're trying to develop fundamentally new recurrent layers, you'll want to use the prebuilt layers that call down to extremely optimized primitives. You'll also want to rely on some pre-built batching code because batching sequences can be a pain. But we think in general, if you're going to work with this stuff, and have a modicum of self-respect, you'll want to implement from scratch and understand how it works at a reasonably low level. 

Let's go ahead and import our dependencies and specify our context. If you've been following along without a GPU until now, this might be where you'll want to get your hands on some faster hardware. GPU instances are available by the hour through Amazon Web Services. A single GPU via a [p2 instance](https://aws.amazon.com/ec2/instance-types/p2/) (NVIDIA K80s) or even an older g2 instance will be perfectly adequate for this tutorial.

In [1]:
from __future__ import print_function
import mxnet as mx
from mxnet import nd, autograd
import numpy as np
mx.random.seed(1)
ctx = mx.gpu(0)

## Dataset: "The Time Machine" 

Now let's mess with some data. I grabbed a copy of ``The Time Machine``, mostly because it's available freely thanks to the good people at [Project Gutenberg](http://www.gutenberg.org) and a lot of people are tired of seeing RNNs generate Shakespeare. In case you prefer torturing Shakespeare to torturing H.G. Wells, I've also included Andrej Karpathy's tinyshakespeare.txt in the data folder. Let's get started by reading in the data.

In [None]:
with open("data/timemachine.txt") as f:
    time_machine = f.read()

And you'll probably want to get a taste for what the text looks like.

In [None]:
print(time_machine[0:500])

## Tidying up

I went through and discovered that the last 38083 characters consist entirely of legalese from the Gutenberg gang. So let's chop that off lest our language model learn to generate such boring drivel.

In [4]:
print(time_machine[-38075:-37500])
time_machine = time_machine[:-38083]

End of Project Gutenberg's The Time Machine, by H. G. (Herbert George) Wells

*** END OF THIS PROJECT GUTENBERG EBOOK THE TIME MACHINE ***

***** This file should be named 35.txt or 35.zip *****
This and all associated files of various formats will be found in:
        http://www.gutenberg.net/3/35/



        Updated editions will replace the previous one--the old editions
        will be renamed.

        Creating the works from public domain print editions means that no
        one owns a United States copyright in these works, so the Foundation
        (and you!) c


## Numerical representations of characters

When we create numerical representations of characters, we'll use one-hot representations. A one-hot is a vector that takes value 1 at the index corresponding to a character, and 0 elsewhere. Because this vector is as long as the vocab, let's get a definitive list of characters in this dataset so that our representation is not longer than necessary.

In [5]:
character_list = list(set(time_machine))
vocab_size = len(character_list)
print(character_list)
print("Length of vocab: %s" % vocab_size)

['T', 's', '5', ':', '2', 'E', 'D', 'r', '.', 'J', 'c', ';', 'd', 'O', 'K', 'n', 'l', 'x', '!', 'a', 'y', 'g', 'm', '*', 'X', 'L', 'U', '(', '?', ' ', 'H', 'N', '_', 'I', '-', '8', 'M', '4', '"', '#', 'C', ']', ',', 'e', 'p', 'G', '3', 'q', 'W', 'b', 'F', '\n', ')', 'w', 'u', 'k', 'h', 'j', 'A', '[', 'f', 'v', 'R', 'V', '9', 't', "'", 'i', 'P', 'S', '1', 'z', '0', 'Y', 'o', 'Q', 'B']
Length of vocab: 77


We'll often want to access the index corresponding to each character quickly so let's store this as a dictionary.

In [6]:
character_dict = {}
for e, char in enumerate(character_list):
    character_dict[char] = e
print(character_dict)

{'T': 0, 's': 1, '5': 2, 'S': 69, '2': 4, 'E': 5, 'J': 9, 'r': 7, 'c': 10, ';': 11, 'd': 12, 'O': 13, 'K': 14, 'n': 15, "'": 66, 'l': 16, 'x': 17, '!': 18, ':': 3, 'a': 19, 'y': 20, 'g': 21, '8': 35, '*': 23, 'X': 24, 'R': 62, 'U': 26, 'z': 71, '(': 27, '?': 28, '4': 37, 'N': 31, 'H': 30, 'w': 53, '_': 32, 'I': 33, '-': 34, 'D': 6, 'M': 36, 'i': 67, '"': 38, '#': 39, '.': 8, ']': 41, ',': 42, 'e': 43, 'G': 45, 'L': 25, '3': 46, 'q': 47, 'C': 40, 'W': 48, 'b': 49, '\n': 51, ')': 52, 'k': 55, 'u': 54, '[': 59, 'h': 56, 'j': 57, 'A': 58, 'F': 50, 'f': 60, 'V': 63, '9': 64, 't': 65, 'm': 22, 'p': 44, 'P': 68, ' ': 29, '1': 70, 'v': 61, '0': 72, 'Y': 73, 'o': 74, 'Q': 75, 'B': 76}


In [7]:
time_numerical = [character_dict[char] for char in time_machine]

In [8]:
#########################
#  Check that the length is right
#########################
print(len(time_numerical))

#########################
#  Check that the format looks right
#########################
print(time_numerical[:20])

#########################
#  Convert back to text
#########################
print("".join([character_list[idx] for idx in time_numerical[:39]]))

179533
[68, 7, 74, 57, 43, 10, 65, 29, 45, 54, 65, 43, 15, 49, 43, 7, 21, 66, 1, 29]
Project Gutenberg's The Time Machine, b
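
One caveat: the order you get from `list(set(...))` isn't guaranteed to be the same from run to run, so your indices may not match the ones printed above. If you want a reproducible mapping, sorting the characters is one option. The sketch below uses a short stand-in string rather than the full book, and `toy_text`, `toy_chars`, and `toy_index` are names made up for this illustration.

```python
# A stand-in for the full text; any string works the same way.
toy_text = "the time machine"

# Sorting the set gives the same ordering on every run.
toy_chars = sorted(set(toy_text))
toy_index = {char: i for i, char in enumerate(toy_chars)}

# Encode to integers, then decode back to text.
encoded = [toy_index[char] for char in toy_text]
decoded = "".join(toy_chars[i] for i in encoded)

print(toy_chars)
print(encoded[:8])
print(decoded)
```

The round trip from text to indices and back only depends on the list and the dictionary agreeing with each other, not on any particular ordering.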


## One-hot representations

We can use NDArray's one_hot() to render a one-hot representation of each character. But frack it, since this is the from-scratch tutorial, let's write this ourselves.

In [None]:
def one_hots(numerical_list, vocab_size=vocab_size):
    result = []
    for idx in numerical_list:
        array = nd.zeros(shape=(1, vocab_size), ctx=ctx)
        array[0, idx] = 1.0
        result.append(array)
    return result

In [10]:
print(one_hots(time_numerical[:2]))

[
[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.
   0.  0.  0.  0.  0.]]
<NDArray 1x77 @gpu(2)>, 
[[ 0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.]]
<NDArray 1x77 @gpu(2)>]


That looks about right. Now let's write a function to convert our one-hots back to readable text.

In [11]:
def textify(vector_list):
    result = ""
    for vector in vector_list:
        vector = vector[0]
        result += character_list[int(nd.argmax(vector, axis=0).asscalar())]
    return result
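
The decoding step relies on `argmax` picking out the position of the single 1 in each vector. If you'd like to see the encode/decode pair without the Python loops, here is a minimal NumPy sketch (NumPy rather than NDArray, purely for illustration). Indexing an identity matrix by a list of indices yields one one-hot row per index. The names `toy_vocab_size`, `toy_indices`, and `toy_one_hot` are hypothetical.

```python
import numpy as np

toy_vocab_size = 5
toy_indices = [3, 0, 4, 0]

# Row i of the identity matrix is the one-hot vector for index i.
toy_one_hot = np.eye(toy_vocab_size)[toy_indices]
print(toy_one_hot.shape)  # one row per character, one column per vocab entry

# argmax along each row inverts the encoding.
recovered = toy_one_hot.argmax(axis=1).tolist()
print(recovered)
```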

In [12]:
textify(one_hots(time_numerical[0:40]))

"Project Gutenberg's The Time Machine, by"

## Preparing the data for training

Great, it's not the most efficient implementation, but we know how it works. So we're already doing better than the majority of people with job titles in machine learning. Now, let's chop up our dataset into sequences we could feed into our model.

You might think we could just feed in the entire dataset as one gigantic input and backpropagate across the entire sequence. But when you try to backpropagate across thousands of steps, a few things go wrong:
(1) The time it takes to compute a single gradient update will be unreasonably long.
(2) The gradient across thousands of recurrent steps has a tendency to either blow up, causing NaN errors from numerical overflow, or to vanish.
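
To get a feel for point (2), consider a deliberately oversimplified picture in which backpropagating through each time step multiplies the gradient by one fixed number. Real RNNs multiply by a matrix that changes at every step, so treat this as intuition only; the factors 0.9 and 1.1 are arbitrary choices for illustration.

```python
def scaled_gradient(factor, num_steps):
    """Magnitude of a unit gradient after num_steps multiplications by factor."""
    return factor ** num_steps

for num_steps in (10, 100, 1000):
    shrinking = scaled_gradient(0.9, num_steps)
    growing = scaled_gradient(1.1, num_steps)
    print("steps=%4d  factor 0.9 -> %.3e   factor 1.1 -> %.3e"
          % (num_steps, shrinking, growing))
```

In 32-bit floating point, which tops out around 3.4e38, repeated multiplication by 1.1 overflows before reaching 1000 steps.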

Thus we're going to look at feeding in our data in reasonably short sequences.

In [48]:
seq_length = 64
dataset = [one_hots(time_numerical[i*seq_length:(i+1)*seq_length]) for i in range(int(np.floor((len(time_numerical)-1)/seq_length)))]
textify(dataset[0])

"Project Gutenberg's The Time Machine, by H. G. (Herbert George) "

Now that we've chopped our dataset into sequences of length ``seq_length``, at every time step, our input is a single one-hot vector. This means that our computation of the hidden layer would consist of matrix-vector multiplications, which are not especially efficient on a GPU. To take advantage of the available computing resources, we'll want to feed through a batch of sequences at the same time. The following code may look tricky but it's just some plumbing to make the data look like this.

![](img/recurrent-batching.png)

In [None]:
batch_size = 32

In [50]:
print(len(dataset))
sequences_per_batch_row = len(dataset) // batch_size
print(sequences_per_batch_row)
data_rows = [dataset[i*sequences_per_batch_row:i*sequences_per_batch_row+sequences_per_batch_row] 
            for i in range(batch_size)]

2805
87


Let's sanity check that everything went the way we hoped. For each data_row, the second sequence should follow the first:

In [62]:
for i in range(3):
    print("***Batch %s:***\n %s \n\n" % (i, textify(data_rows[i][0]) + textify(data_rows[i][1])))

***Batch 0:***
 Project Gutenberg's The Time Machine, by H. G. (Herbert George) Wells

This eBook is for the use of anyone anywhere at no cost a 


***Batch 1:***
 vement of the barometer. Yesterday it was so high, yesterday night
it fell, then this morning it rose again, and so gently upwar 


***Batch 2:***
 andlesticks upon the mantel and several in sconces, so that
the room was brilliantly illuminated. I sat in a low arm-chair
neare 




Now let's stack these data_rows together to form our batches.

In [63]:
def stack_the_datasets(datasets):
    full_dataset = []
    # iterate over the sequences
    for s in range(len(datasets[0])):
        sequence = []
        # iterate over the elements of the sequence
        for elem in range(len(datasets[0][0])):
            sequence.append(nd.concatenate([ds[s][elem].reshape((1,-1)) for ds in datasets], axis=0))
        full_dataset.append(sequence)
    return(full_dataset)
        

In [22]:
training_data = stack_the_datasets(data_rows)

And let's check that the data stacking procedure worked as expected.

In [79]:
print(training_data[0][0].shape)
print("Seq 0, Batch 0",textify([training_data[0][i][0].reshape((1,-1)) for i in range(seq_length)]))
print("Seq 1, Batch 0", textify([training_data[1][i][0].reshape((1,-1)) for i in range(seq_length)]))

(32, 77)
Seq 0, Batch 0 Project Gutenberg's The Time Machine, by H. G. (Herbert George) 
Seq 1, Batch 0 Wells

This eBook is for the use of anyone anywhere at no cost a
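
If the plumbing still feels opaque, it can help to replay it on a toy problem where each sequence is replaced by its integer position in the text. The sketch below is an illustration only (the `toy_` names and sizes are made up), but it follows the same two steps: slice the list of sequences into contiguous rows, then read off one column at a time to form each batch.

```python
toy_num_sequences = 12
toy_batch_size = 3

# Stand-ins for the sequences: just their positions in the text.
toy_dataset = list(range(toy_num_sequences))
toy_sequences_per_row = len(toy_dataset) // toy_batch_size

# Each row is a contiguous run of the text.
toy_rows = [toy_dataset[i * toy_sequences_per_row:(i + 1) * toy_sequences_per_row]
            for i in range(toy_batch_size)]

# Batch s holds the s-th sequence from every row.
toy_batches = [[row[s] for row in toy_rows] for s in range(toy_sequences_per_row)]

print(toy_rows)
print(toy_batches)
```

Within any one batch row, consecutive batches hold consecutive stretches of text, which is what lets us carry the hidden state from one batch to the next during training.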


## Preparing our labels

Now let's repurpose the same batching code to create our label batches.

In [81]:
labels = [one_hots(time_numerical[i*seq_length+1:(i+1)*seq_length+1]) for i in range(int(np.floor((len(time_numerical)-1)/seq_length)))]
label_sets = [labels[i*sequences_per_batch_row:i*sequences_per_batch_row+sequences_per_batch_row] for i in range(batch_size)]
training_labels = stack_the_datasets(label_sets)
print(training_labels[0][0].shape)

(32, 77)


## A final sanity check

Remember that our target at every time step is to predict the next character in the sequence. So our labels should look just like our inputs but offset by one character. Let's look at corresponding inputs and outputs to make sure everything lined up as expected.

In [83]:
print(textify([training_data[0][i][2].reshape((1,-1)) for i in range(seq_length)]))
print(textify([training_labels[0][i][2].reshape((1,-1)) for i in range(seq_length)]))

andlesticks upon the mantel and several in sconces, so that
the 
ndlesticks upon the mantel and several in sconces, so that
the r
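
The same offset-by-one relationship is easy to see on a plain string. This is only a toy sketch (`toy_text` and `toy_seq_length` are made-up names), mirroring the slicing used above for `dataset` and `labels`.

```python
toy_text = "The Time Machine"
toy_seq_length = 5
toy_num_sequences = (len(toy_text) - 1) // toy_seq_length

# Inputs start at position i * seq_length; labels start one character later.
toy_inputs = [toy_text[i * toy_seq_length:(i + 1) * toy_seq_length]
              for i in range(toy_num_sequences)]
toy_labels = [toy_text[i * toy_seq_length + 1:(i + 1) * toy_seq_length + 1]
              for i in range(toy_num_sequences)]

for x, y in zip(toy_inputs, toy_labels):
    print(repr(x), "->", repr(y))
```

Subtracting 1 from the text length before dividing guarantees that the last input sequence still has a full-length label sequence.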


## Recurrent neural networks

[Explain RNN updates]

![](img/simple-rnn.png)

Recall that the update for an ordinary hidden layer in a neural network with activation function $\phi$ is given by 
$$h = \phi(X W + b)$$

To make this a recurrent neural network, we're simply going to add a weighted sum of the previous hidden state $h_{t-1}$:

$$h_t = \phi(X_t  W_{xh} + h_{t-1} W_{hh} + b )$$
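
As a quick shape check, here is one recurrent update written in NumPy with $\phi = \tanh$. This is a sketch with made-up toy sizes and `toy_`-prefixed names, not the MXNet implementation used in the rest of this tutorial.

```python
import numpy as np

rng = np.random.RandomState(0)
toy_batch_size, toy_num_inputs, toy_num_hidden = 4, 6, 3

# Parameters, using the same naming pattern as the equation above.
W_xh = rng.normal(size=(toy_num_inputs, toy_num_hidden)) * 0.01
W_hh = rng.normal(size=(toy_num_hidden, toy_num_hidden)) * 0.01
b = np.zeros(toy_num_hidden)

def rnn_step(X_t, h_prev):
    """One update: h_t = tanh(X_t W_xh + h_prev W_hh + b)."""
    return np.tanh(X_t.dot(W_xh) + h_prev.dot(W_hh) + b)

# Run three time steps, carrying the hidden state forward.
h = np.zeros((toy_batch_size, toy_num_hidden))
for t in range(3):
    X_t = rng.normal(size=(toy_batch_size, toy_num_inputs))
    h = rnn_step(X_t, h)

print(h.shape)
```

The hidden state keeps the shape (batch size, hidden units) at every step, which is what allows the same weights to be reused across time.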

## Allocate parameters

In [87]:
num_inputs = 77
num_hidden = 256
num_outputs = 77

########################
#  Weights connecting the inputs to the hidden layer
########################
Wxh = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01

########################
#  Recurrent weights connecting the hidden layer across time steps
########################
Whh = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01

########################
#  Bias vector for hidden layer
########################
bh = nd.random_normal(shape=num_hidden, ctx=ctx) * .01


########################
# Weights to the output nodes
########################
Why = nd.random_normal(shape=(num_hidden,num_outputs), ctx=ctx) * .01
by = nd.random_normal(shape=num_outputs, ctx=ctx) * .01

In [88]:
params = [Wxh, Whh, bh, Why, by]

for param in params:
    param.attach_grad()

## Softmax Activation

In [89]:
# def softmax(y_linear):
#     exp = nd.exp(y_linear-nd.max(y_linear).asscalar())
# #     print(exp.shape)
#     partition = nd.sum(exp).reshape((-1,1))[0][0]
# #     print(partition.shape)
#     return exp / partition


def softmax(y_linear, temperature=1.0):
    lin = (y_linear-nd.max(y_linear)) / temperature
    exp = nd.exp(lin)
    partition =nd.sum(exp, axis=0, exclude=True).reshape((-1,1))
    return exp / partition
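
For reference, the temperature-scaled softmax that this function computes can be written as follows for a single row of scores $z$ and temperature $T$ (symbols introduced here for illustration):

$$\mathrm{softmax}(z; T)_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

Subtracting the same constant from every $z_i$ multiplies the numerator and the denominator by the same factor, so the result is unchanged; that is why the code can subtract the maximum for numerical stability. As $T \to \infty$ each row approaches the uniform distribution, and as $T \to 0$ it approaches a one-hot vector on the largest score (assuming a unique maximum).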

In [90]:
####################
# With a temperature of 1 (always 1 during training), we get back some set of probabilities
# (for this input, roughly 0.88 and 0.12 in each row)
####################
softmax(nd.array([[1,-1],[-1,1]]), temperature=1.0)

In [91]:
####################
# If we set a high temperature, we can get more entropic (*noisier*) probabilities
####################
softmax(nd.array([[1,-1],[-1,1]]), temperature=1000.0)


[[ 0.50049996  0.49949998]
 [ 0.49949998  0.50049996]]
<NDArray 2x2 @cpu(0)>

In [92]:
####################
# Often we want to sample with low temperatures to produce sharp probabilities
####################
softmax(nd.array([[10,-10],[-10,10]]), temperature=.1)


[[ 1.  0.]
 [ 0.  1.]]
<NDArray 2x2 @cpu(0)>

## Define the model

In [93]:
def simple_rnn(inputs, state, temperature=1.0):
    outputs = []
    h = state
    for X in inputs:
        h_linear = nd.dot(X, Wxh) + nd.dot(h, Whh) + bh
        h = nd.tanh(h_linear)
        yhat_linear = nd.dot(h, Why) + by
        yhat = softmax(yhat_linear, temperature=temperature) 
        outputs.append(yhat)
    return (outputs, h)

## Cross-entropy loss function

In [94]:
# def cross_entropy(yhat, y):
#     return - nd.sum(y * nd.log(yhat))

def cross_entropy(yhat, y):
    return - nd.mean(nd.sum(y * nd.log(yhat), axis=0, exclude=True))

In [95]:
# Inputs are shaped (batch, vocab); expect -log(0.2), roughly 1.609
cross_entropy(nd.array([[.2,.5,.3]]), nd.array([[1.,0,0]]))

In [96]:
def average_ce_loss(outputs, labels):
    assert(len(outputs) == len(labels))
    total_loss = nd.array([0.], ctx=ctx)
    for (output, label) in zip(outputs,labels):
        total_loss = total_loss + cross_entropy(output, label)
#         print(total_loss.shape)
#     loss_list = [cross_entropy(outputs[i], labels[i]) for (i, _) in enumerate(outputs)]
    return total_loss / len(outputs)
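
To make the two functions above concrete: with one-hot labels, the cross-entropy for one example reduces to minus the log of the probability assigned to the correct character, and the sequence loss is the average of that quantity over time steps. The NumPy sketch below reproduces this on toy numbers; it is an illustration with hypothetical names (`toy_cross_entropy`, `toy_outputs`, `toy_labels`), not a replacement for the NDArray code.

```python
import numpy as np

def toy_cross_entropy(yhat, y):
    """Mean over the batch of -sum_k y_k * log(yhat_k); inputs are (batch, vocab)."""
    return -np.mean(np.sum(y * np.log(yhat), axis=1))

# Two time steps, batch of one, vocabulary of three characters.
toy_outputs = [np.array([[0.2, 0.5, 0.3]]), np.array([[0.1, 0.1, 0.8]])]
toy_labels = [np.array([[1.0, 0.0, 0.0]]), np.array([[0.0, 0.0, 1.0]])]

per_step = [toy_cross_entropy(yhat, y) for yhat, y in zip(toy_outputs, toy_labels)]
average = sum(per_step) / len(per_step)

print(per_step)   # equals [-log(0.2), -log(0.8)]
print(average)
```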

## Optimizer

In [97]:
def SGD(params, lr):    
    for param in params:
        param[:] = param - lr * param.grad
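
The slice assignment `param[:] = ...` matters here. Writing `param = param - lr * param.grad` would only rebind the loop variable to a new array and leave the arrays stored in `params` untouched. The NumPy sketch below illustrates that Python-level difference; the gradients are hard-coded stand-ins (an assumption for illustration), since plain NumPy arrays carry no `.grad`.

```python
import numpy as np

toy_params = [np.ones(3), np.ones(2)]
toy_grads = [np.full(3, 0.5), np.full(2, 0.5)]  # stand-in gradients
lr = 0.1

# Rebinding: the arrays inside toy_params are not modified.
for param, grad in zip(toy_params, toy_grads):
    param = param - lr * grad
print(toy_params[0])

# In-place update: the arrays inside toy_params change.
for param, grad in zip(toy_params, toy_grads):
    param[:] = param - lr * grad
print(toy_params[0])
```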

## Sampler

In [98]:
def sample(prefix, num_chars, temperature=1.0):
    print("prefix: %s" % prefix)
    string = prefix
    prefix_numerical = [character_dict[char] for char in prefix]
    # Feed the whole prefix first, then one sampled character at a time
    inputs = one_hots(prefix_numerical)
    sample_state = nd.zeros(shape=(1, num_hidden), ctx=ctx)

    for i in range(num_chars):
        outputs, sample_state = simple_rnn(inputs, sample_state, temperature=temperature)
        # Renormalize in float64 so np.random.choice accepts the probabilities
        probs = outputs[-1][0].asnumpy().astype('float64')
        choice = np.random.choice(vocab_size, p=probs / probs.sum())
        string += character_list[choice]
        inputs = one_hots([choice])
    return string

In [None]:
epochs = 2000
moving_loss = 0.

learning_rate = .5

# state = nd.zeros(shape=(batch_size, num_hidden), ctx=ctx)
for e in range(epochs):
    ############################
    # Attenuate the learning rate by a factor of 2 every 100 epochs.
    ############################
    if ((e+1) % 100 == 0):
        learning_rate = learning_rate / 2.0
    state = nd.zeros(shape=(batch_size, num_hidden), ctx=ctx)
    for i, (data, label) in enumerate(zip(training_data, training_labels)):
        data_one_hot = data
        label_one_hot = label
        with autograd.record():
            outputs, state = simple_rnn(data_one_hot, state)
            loss = average_ce_loss(outputs, label_one_hot)
            loss.backward()
        SGD(params, learning_rate)

        ##########################
        #  Keep a moving average of the losses
        ##########################
        if (i == 0) and (e == 0):
            moving_loss = np.mean(loss.asnumpy()[0])
        else:
            moving_loss = .99 * moving_loss + .01 * np.mean(loss.asnumpy()[0])
      
    print("Epoch %s. Loss: %s" % (e, moving_loss)) 
    print(sample("The Time Ma", 1024, temperature=.1))
    print(sample("The Medical Man rose, came to the lamp,", 1024, temperature=.1))
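
Two bits of bookkeeping in the loop above are easy to check in isolation: the step-wise learning-rate decay and the exponential moving average of the loss. The pure-Python sketch below mirrors them on made-up numbers (the helper names and loss values are hypothetical).

```python
def learning_rate_at(epoch, initial_lr=0.5, halve_every=100):
    """Learning rate in effect during `epoch` (0-indexed), halved whenever
    (epoch + 1) is a multiple of halve_every, as in the loop above."""
    return initial_lr / 2.0 ** ((epoch + 1) // halve_every)

def update_moving_loss(moving_loss, new_loss, is_first):
    """Seed with the first loss, then blend 99% old with 1% new."""
    if is_first:
        return new_loss
    return 0.99 * moving_loss + 0.01 * new_loss

print([learning_rate_at(e) for e in (0, 98, 99, 199, 1999)])

moving = 0.0
for step, loss in enumerate([4.0, 3.0, 3.0]):
    moving = update_moving_loss(moving, loss, is_first=(step == 0))
print(moving)
```

By the last of the 2000 epochs the rate has been halved 20 times, to about 4.8e-7, so the later epochs change the weights very little.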
            

## Conclusions

Once you start running this code, it will spit out a sample at the end of each epoch. I'll leave this output cell blank so you don't see megabytes of text, but here are some patterns that I observed when I ran this code. 

The network seems to first work out patterns with no sequential relationship and then slowly incorporate longer and longer windows of context. After just 1 epoch, my RNN generated this:

>         e       e e ee    e   eee     e     e ee   e  e      ee     e e   ee  e   e            ee    e   e   e     e  e   e     e          e   e ee e    aee    e e               ee  e     e   ee ee   e ee     e e       e e e        ete    e   e e   e e   e       ee  n eee    ee e     eeee  e e    e         e  e  e ee    e  e   e    e       e  e  eee ee      e         e            e       e    e e    ee   ee e e e   e  e  e e  e t       e  ee         e eee  e  e      e ee    e    e       e                e      eee   e  e  e   eeeee      e     eeee e e   ee ee     ee     a    e e eee           ee  e e   e e   aee           e      e     e e               eee       e           e         e     e    e e   e      e   e e   e    e    e ee e      ee                 e  e  e   e    e  e   e                    e      e   e        e     ee  e    e    ee n  e   ee   e  e         e  e         e      e    t    ee  ee  ee   eee  et     e        e     e e              ee   e  e  e     e  e  e e       e              e       e"

It's learned that spaces and "e"s (to my knowledge, there's no aesthetically pleasing way to spell the plural form of the letter "e") are the most common characters.

A little bit later on it spits out strings like:

> the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the

At this point it's learned that after the space usually comes a nonspace character, and perhaps that "t" is the most common character to immediately follow a space, "h" to follow a "t" and "e" to follow "th". However, it doesn't appear to be looking far enough back to realize that the word "the" should be very unlikely immediately after the word "the"... 

By the 175th epoch, the model appears to be putting together a fairly large vocabulary, although it puts the words together in ways that one might charitably describe as "creative".

> the little people had been as I store of the sungher had leartered along the realing of the stars of the little past and stared at the thing that I had the sun had to the stars of the sunghed a stirnt a moment the sun had come and fart as the stars of the sunghed a stirnt a moment the sun had to the was completely and of the little people had been as I stood and all amations of the staring and some of the really

In subsequent tutorials we'll explore sophisticated techniques for evaluating and improving language models. We'll also take a look at some related but more complicated problems like language translation and image captioning.

For whinges or inquiries, [open an issue on GitHub.](https://github.com/zackchase/mxnet-the-straight-dope)