# <font color='#6629b2'>Generating text with recurrent neural networks using Keras</font>
by Melissa Roemmele, 7/17/17, roemmele @ usc.edu

## <font color='#6629b2'>Overview</font>

I am going to show how to build a recurrent neural network (RNN) language model that learns the relation between words in text, using the Keras library for machine learning. I will then show how this model can be used for text generation.

## <font color='#6629b2'>Recurrent Neural Networks (RNNs)</font>

RNNs are a general framework for modeling sequence data and are particularly useful for natural language processing tasks. Here an RNN will be used as a language model, which predicts which word is likely to occur next in a text given the words before it.

## <font color='#6629b2'>Keras</font>

[Keras](https://keras.io/) is a Python deep learning framework that lets you quickly put together neural network models with a minimal amount of code. It can be run on top of [Theano](http://deeplearning.net/software/theano/) or [TensorFlow](https://www.tensorflow.org/) without you needing to know either of these underlying frameworks. It provides implementations of many of the layer architectures, objective functions, and optimization algorithms you need for building a model.

## <font color='#6629b2'>Dataset</font>

My research is on story generation, so I've selected a dataset of stories as the text to be modeled by the RNN. They come from the [ROCStories](http://cs.rochester.edu/nlp/rocstories/) dataset, which consists of thousands of five-sentence stories about everyday life events. Here the model will observe all five sentences in each story. Then we'll use the trained model to generate the final sentence in a set of stories not observed during training.

In [301]:
'''load the dataset'''
import csv

with open('roc_train_stories97027.csv', 'r') as f:
    stories = [story for story in csv.reader(f)]

In [302]:
'''designate a set of training stories'''

train_stories = stories[:500]
train_stories = [" ".join(story) for story in train_stories] #concatenate sentences in each story into one long string
print "\n\n".join(train_stories[:5])

Dan's parents were overweight. Dan was overweight as well. The doctors told his parents it was unhealthy. His parents understood and decided to make a change. They got themselves and Dan on a diet.

Carrie had just learned how to ride a bike. She didn't have a bike of her own. Carrie would sneak rides on her sister's bike. She got nervous on a hill and crashed into a wall. The bike frame bent and Carrie got a deep gash on her leg.

Morgan enjoyed long walks on the beach. She and her boyfriend decided to go for a long walk. After walking for over a mile, something happened. Morgan decided to propose to her boyfriend. Her boyfriend was upset he didn't propose to her first.

Jane was working at a diner. Suddenly, a customer barged up to the counter. He began yelling about how long his food was taking. Jane didn't know how to react. Luckily, her coworker intervened and calmed the man down.

I was talking to my crush today. She continued to complain about guys flirting with her. I decided t

## <font color='#6629b2'>Preparing the data</font>

The model we'll create is a word-based language model, which means each input unit is a single word (some language models learn subword units like characters). 

So first we need to tokenize each of the stories into (lowercased) individual words. I'll use Keras' built-in tokenizer here for convenience, but typically I like to use [spaCy](https://spacy.io/), a fast and user-friendly library that performs various language processing tasks.

A note: Keras' tokenizer does not perform the same linguistic processing as spaCy. For instance, it does not separate punctuation from words, even though punctuation marks should be their own tokens. You can see this below in the words that end in punctuation like "." or ",".
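To make the difference concrete, here is a minimal sketch of a tokenizer that does split punctuation into separate tokens, using a regular expression. The function name `simple_tokenize` and the pattern are hypothetical and only for illustration; they are not part of Keras or spaCy, and a real tokenizer handles many more cases.

```python
import re

def simple_tokenize(text):
    """Lowercase text and split it into word and punctuation tokens.

    Hypothetical helper for illustration: words (including internal
    apostrophes) stay together, and every other non-space character
    becomes its own token.
    """
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)*|[^\sa-z0-9]", text.lower())

tokens = simple_tokenize("Dan's parents were overweight. Dan was overweight as well.")
print(tokens)
```

With this kind of tokenization, "overweight" and "overweight." would map to the same lexicon entry instead of two different ones.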

We need to assemble a lexicon (aka vocabulary) of words that the model needs to know. Thus, each tokenized word in the stories is added to the lexicon. We use the fit_on_texts() function to map each word in the stories to a numerical index. When working with large datasets it's common to filter out all words occurring fewer than a certain number of times and replace them with a generic "UNKNOWN" token. Here, because this dataset is small, every word encountered in the stories is added to the lexicon.
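The frequency filtering described above is not used in this notebook, but it can be sketched in plain Python. The names `build_lexicon`, `encode`, `min_count`, and the `<UNKNOWN>` string are all assumptions made for this example, not Keras functions or settings.

```python
from collections import Counter

def build_lexicon(tokenized_texts, min_count=2, unknown_token="<UNKNOWN>"):
    """Map each word seen at least min_count times to a unique index.

    Index 0 is left free for padding and index 1 is the shared unknown token,
    so regular words start at index 2 (an assumption of this sketch).
    """
    counts = Counter(word for text in tokenized_texts for word in text)
    lexicon = {unknown_token: 1}
    for word, count in counts.most_common():
        if count >= min_count:
            lexicon[word] = len(lexicon) + 1
    return lexicon

def encode(tokens, lexicon, unknown_token="<UNKNOWN>"):
    """Replace each token with its index, falling back to the unknown index."""
    return [lexicon.get(token, lexicon[unknown_token]) for token in tokens]

texts = [["dan", "was", "overweight"], ["dan", "was", "on", "a", "diet"]]
lexicon = build_lexicon(texts, min_count=2)
print(encode(["dan", "was", "hungry"], lexicon))
```

Only "dan" and "was" occur twice, so every other word, including one never seen in training such as "hungry", is encoded with the unknown index.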

In [303]:
'''make the lexicon'''

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(lower=True, filters='')
tokenizer.fit_on_texts(train_stories) #split stories into words, assign number to each unique word
print "\n".join([word + ": " + str(word_idx) for word, word_idx 
                 in tokenizer.word_index.items()[:20]]) #print a sample of the dictionary

import pickle
with open('tokenizer.pkl', 'wb') as f: #save the tokenizer
    pickle.dump(tokenizer, f)

raining: 1139
ever.: 1140
limited: 1929
shouted.: 1930
better.: 612
jane's: 1931
foul: 1141
four: 811
obstruction: 1932
woods: 1933
sleep: 812
ocean.: 1142
hyped.: 1934
captain: 1143
party.: 613
party,: 1935
marching: 1936
up.: 164
belonged.: 1937
up,: 813


In [304]:
'''convert each story from text to numbers'''

train_idxs = tokenizer.texts_to_sequences(train_stories) #transform each word to its numerical index in lexicon
print train_stories[0], "\n"
print train_idxs[0] #show example of encoded story

Dan's parents were overweight. Dan was overweight as well. The doctors told his parents it was unhealthy. His parents understood and decided to make a change. They got themselves and Dan on a diet. 

[1986, 159, 33, 4116, 176, 4, 3784, 34, 517, 1, 3573, 57, 9, 159, 15, 4, 3633, 9, 159, 3840, 6, 27, 2, 69, 3, 503, 18, 32, 4070, 6, 176, 16, 3, 1879]


###  <font color='#6629b2'>Creating a matrix</font>

Finally, we need to put all the training stories into a single matrix, where each row is a story and each column is a word index in that story. This enables the model to process the stories in batches as opposed to one at a time, which significantly speeds up training. However, each story has a different number of words. So we create a padded matrix whose width is equal to the length of the longest story in the training set. For all stories with fewer words, we prepend zeros to the row, each representing an empty word position. We can then tell Keras to ignore these zeros during training.
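As a rough picture of what the padding step does, here is a plain-Python sketch that prepends zeros to each story so all rows have the same length. The helper `pad_left` and the toy indices are made up for illustration; the notebook itself relies on Keras' padding function.

```python
def pad_left(sequences, pad_value=0):
    """Prepend pad_value to each sequence so all rows share the longest length."""
    maxlen = max(len(seq) for seq in sequences)
    return [[pad_value] * (maxlen - len(seq)) + list(seq) for seq in sequences]

toy_idxs = [[5, 2, 9], [7], [4, 8]]  # three toy "stories" of different lengths
padded = pad_left(toy_idxs)
for row in padded:
    print(row)
```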

In [305]:
'''create a padded matrix of stories'''

from keras.preprocessing.sequence import pad_sequences

maxlen = max([len(story) for story in train_idxs]) # get length of longest story
# maxlen = int(numpy.ceil(maxlen * 1. / 10))\
#                                 * 10 + 1 #for reasons that will become clear below, round up to nearest 10 and add 1
print "matrix length:", maxlen

train_idxs = pad_sequences(train_idxs, maxlen=maxlen) #keras provides convenient padding function
train_idxs[0] #same example story as above

matrix length: 66


array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0, 1986,
        159,   33, 4116,  176,    4, 3784,   34,  517,    1, 3573,   57,
          9,  159,   15,    4, 3633,    9,  159, 3840,    6,   27,    2,
         69,    3,  503,   18,   32, 4070,    6,  176,   16,    3, 1879], dtype=int32)

### <font color='#6629b2'>Defining the input and output</font>

In an RNN language model, the data is set up so that each word in the text is mapped to the word that follows it. In a given story, for each input word x[idx], the output label y[idx] is just x[idx+1].
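On a toy padded row, the shift between input and output looks like this (the list `story` is a shortened, made-up example rather than a row from the real matrix):

```python
story = [0, 0, 1986, 159, 33, 4116]  # toy padded row of word indices

x = story[:-1]  # every word except the last is an input
y = story[1:]   # every word except the first is a label

for input_word, next_word in zip(x, y):
    print("{} -> {}".format(input_word, next_word))
```

Each printed pair is one training example: the model sees the word on the left and is asked to predict the word on the right.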

In [319]:
'''set up the model input and output'''

train_x = train_idxs[:, :-1]
print "x:", train_x[0]
    
train_y = train_idxs[:, 1:] #the extra final dimension Keras requires for y is added per batch during training
print "y:", train_y[0]

x: [   0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0 1986  159   33 4116  176    4 3784   34  517    1 3573   57    9
  159   15    4 3633    9  159 3840    6   27    2   69    3  503   18   32
 4070    6  176   16    3]
y: [   0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0 1986  159   33 4116  176    4 3784   34  517    1 3573   57    9  159
   15    4 3633    9  159 3840    6   27    2   69    3  503   18   32 4070
    6  176   16    3 1879]


##  <font color='#6629b2'>Creating the model</font>

We'll build an RNN with four layers: 
1. An input layer that converts word indices into distributed vector representations (embeddings).
2. A recurrent hidden layer, the main component of the network. As it observes each word in the story, it integrates the word embedding representation with what it's observed so far to compute a representation (hidden state) of the story at that timepoint. There are a few architectures for this layer. I use the GRU variation; Keras also provides LSTM and the simple vanilla recurrent layer.
3. A second recurrent layer that takes the output of the first as input and operates the same way, since adding more layers often improves the model.
4. A prediction (dense) layer that outputs a probability for each word in the lexicon, where each probability indicates the chance of that word being the next word in the sequence. The model gets feedback during training about what the actual word should be.

Of course this is a very simplified explanation of the model, since the focus here is on how to implement it in Keras. For a more thorough explanation of RNNs, see the resources at the bottom.

For each layer, we need to specify the number of dimensions (nodes). Since a language model computes a probability distribution over the words in the lexicon, the number of dimensions in the prediction layer is equal to the lexicon size. To account for the zeros in the input, we'll add one more dimension so that each word index corresponds to the index of its dimension.
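To illustrate why the prediction layer has lexicon size + 1 dimensions, here is a small sketch of a softmax over a toy lexicon. The scores are invented numbers, and the `softmax` function is a plain-Python stand-in for what the dense layer's activation computes.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

lexicon_size = 4                     # toy lexicon with word indices 1..4
n_outputs = lexicon_size + 1         # one extra slot so index 0 (padding) has a dimension
scores = [0.0, 2.0, 1.0, 0.5, -1.0]  # hypothetical raw scores, one per output dimension
probs = softmax(scores)

# because of the extra slot, position i in probs belongs to word index i
best_idx = max(range(n_outputs), key=lambda idx: probs[idx])
print(best_idx)
```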

When setting up the model, we specify the number of stories in each input batch (batch size) as well as the number of words observed at a time for each story in a batch (n_timesteps). For example, if n_timesteps is 10, the model will slide over each window of 10 words in the stories and update the parameters by backpropagating the gradient over these 10 words (for the details of backpropagation, see the resources at the bottom). However, we still want the model to "remember" everything in the story, not just the previous 10 words, so Keras provides the "stateful" option to do this. By setting "stateful=True", the hidden state of the model after observing 10 words will be carried over to the next word window. After all the words in a batch of stories have been processed, the reset_states() function can be called to indicate the model should now forget its hidden state and start over with the next batch of stories.
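The windowing idea can be sketched without Keras. The generator below cuts a toy batch into consecutive windows of n_timesteps columns; the name `iterate_windows` is hypothetical, and the sketch assumes the row length is a multiple of n_timesteps, since a model with a fixed input shape cannot take a shorter final window.

```python
def iterate_windows(batch, n_timesteps):
    """Yield consecutive windows of n_timesteps columns from a batch of equal-length rows."""
    n_columns = len(batch[0])
    for start in range(0, n_columns, n_timesteps):
        yield [row[start:start + n_timesteps] for row in batch]

# toy batch: 2 stories, 6 word positions each
batch = [[0, 0, 11, 12, 13, 14],
         [21, 22, 23, 24, 25, 26]]

windows = list(iterate_windows(batch, n_timesteps=3))
for window in windows:
    # with a stateful model, each window would be passed to the training call here,
    # and the hidden state would carry over from one window to the next
    print(window)
# after the last window of the batch, the hidden state would be reset
```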

For each word in a story, the prediction layer will output a probability distribution for the next word. To get this sequence of probability distributions rather than just one, we wrap the TimeDistributed() class around the Dense layer. The model is trained to maximize the probabilities of the words in the stories, which is what the sparse_categorical_crossentropy loss function does (see the resources at the bottom for a full explanation of this).
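As a hand-computed illustration of this loss, the sketch below averages the negative log probability that the model assigned to each word that actually came next. The function `sparse_cross_entropy` and the probability values are made up for the example; the Keras implementation additionally handles batching and masking.

```python
import math

def sparse_cross_entropy(prob_dists, targets):
    """Mean negative log probability assigned to the target index at each position."""
    losses = [-math.log(dist[target]) for dist, target in zip(prob_dists, targets)]
    return sum(losses) / len(losses)

# hypothetical predictions over a 4-dimension output for a 3-word sequence
prob_dists = [[0.1, 0.7, 0.1, 0.1],
              [0.25, 0.25, 0.25, 0.25],
              [0.05, 0.05, 0.1, 0.8]]
targets = [1, 2, 3]  # index of the word that actually came next at each position

loss = sparse_cross_entropy(prob_dists, targets)
print(round(loss, 3))
```

The loss shrinks toward zero as the model puts more probability on the correct next words, which is why maximizing those probabilities and minimizing this loss are the same thing.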

One huge benefit of Keras is that it has several optimization algorithms already implemented. I use Adam here; several others are available, including SGD, RMSprop, and Adagrad. You can change other parameters like the learning rate and gradient clipping as well.

In [307]:
from keras.models import Sequential
from keras.layers import Dense, TimeDistributed
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import GRU

def create_rnn(lexicon_size, n_embedding_nodes, n_hidden_nodes, batch_size, n_timesteps):

    rnn = Sequential()

    #Layer 1
    embedding_layer = Embedding(batch_input_shape=(batch_size, n_timesteps),
                                input_dim=lexicon_size + 1, #add 1 because word indices start at 1, not 0
                                output_dim=n_embedding_nodes, 
                                mask_zero=True) #mask_zero=True will ignore padding
    rnn.add(embedding_layer)

    #Layer 2
    recurrent_layer1 = GRU(output_dim=n_hidden_nodes,
                           return_sequences=True, #return hidden state for each word, not just last one
                           stateful=True) #keep track of hidden state while iterating through story
    rnn.add(recurrent_layer1)

    #Layer 3
    recurrent_layer2 = GRU(output_dim=n_hidden_nodes,
                           return_sequences=True, 
                           stateful=True)
    rnn.add(recurrent_layer2)

    #Layer 4
    prediction_layer = TimeDistributed(Dense(lexicon_size + 1,
                                       activation="softmax"))
    rnn.add(prediction_layer)

    #Specify loss function and optimization algorithm, compile model
    rnn.compile(loss="sparse_categorical_crossentropy", 
                optimizer='adam')
    
    return rnn



We'll create an RNN with 300 embedding nodes and 500 hidden nodes in each recurrent layer. Here n_timesteps is set to maxlen - 1, so the model processes each padded story in a single window, in batches of 20 stories.

In [308]:
'''initialize the RNN'''

batch_size = 20
rnn = create_rnn(lexicon_size = len(tokenizer.word_index),
                 n_embedding_nodes = 300,
                 n_hidden_nodes = 500,
                 batch_size = batch_size,
                 n_timesteps = maxlen - 1)

We'll train the RNN for 10 iterations through the training stories (epochs). The cross-entropy loss indicates how well the model is learning - it should go down with each epoch.

In [309]:
'''train the RNN'''

import numpy

n_epochs = 10
print "Training RNN on", len(train_stories), "stories for", n_epochs, "epochs..."
for epoch in range(n_epochs):
    losses = []  #track cross-entropy loss during training
    for batch_idx in range(0, len(train_stories), batch_size):
        batch_x = train_x[batch_idx:batch_idx+batch_size] #get batch for x
        batch_y = train_y[batch_idx:batch_idx+batch_size, :, None] #get batch for y, Keras requires extra final dimension
        loss = rnn.train_on_batch(batch_x, batch_y) #takes a few moments to initialize training
        losses.append(loss)
        rnn.reset_states() #reset hidden state after each batch
    print "epoch {}, mean loss: {:.3f}".format(epoch + 1, numpy.mean(losses))
    rnn.save('rnn.h5') #save model after each epoch


Training RNN on 500 stories for 50 epochs...
epoch 1, mean loss: 7.490
epoch 2, mean loss: 6.807
epoch 3, mean loss: 6.731
epoch 4, mean loss: 6.680
epoch 5, mean loss: 6.662
epoch 6, mean loss: 6.649
epoch 7, mean loss: 6.637
epoch 8, mean loss: 6.621
epoch 9, mean loss: 6.594
epoch 10, mean loss: 6.518
epoch 11, mean loss: 6.386
epoch 12, mean loss: 6.271
epoch 13, mean loss: 6.175
epoch 14, mean loss: 6.064
epoch 15, mean loss: 5.960
epoch 16, mean loss: 5.853
epoch 17, mean loss: 5.750
epoch 18, mean loss: 5.622
epoch 19, mean loss: 5.515
epoch 20, mean loss: 5.358
epoch 21, mean loss: 5.209
epoch 22, mean loss: 5.090
epoch 23, mean loss: 4.978
epoch 24, mean loss: 4.854
epoch 25, mean loss: 4.727
epoch 26, mean loss: 4.666
epoch 27, mean loss: 4.580
epoch 28, mean loss: 4.442
epoch 29, mean loss: 4.291
epoch 30, mean loss: 4.113
epoch 31, mean loss: 3.954
epoch 32, mean loss: 3.824
epoch 33, mean loss: 3.693
epoch 34, mean loss: 3.558
epoch 35, mean loss: 3.431
epoch 36, mean loss

In [314]:
'''load the trained model'''

from keras.models import load_model

with open('tokenizer_96000_stories.pkl', 'rb') as f:
    tokenizer = pickle.load(f)
    print "loaded tokenizer with", len(tokenizer.word_index), "words in lexicon"

rnn = load_model('rnn_96000_stories.h5')

## <font color='#6629b2'>Generating sentences</font>

Now that the model is trained, it can be used to generate new text. Here, I'll give the model the first four sentences of a new story and have it generate the fifth sentence. To do this, the model reads the initial story to produce a probability distribution for the first word in the fifth sentence. We can sample a word from this probability distribution and add it to the story. We repeat this process, each time generating the next word based on the story so far. We stop generating words when an end-of-sentence token is generated (e.g. a word ending in ".", "!", or "?").
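The sampling loop just described can be sketched independently of the trained model. In the example below, `toy_model` is a hypothetical stand-in that always returns the same distribution over three made-up words, and the `max_words` cap is an extra safety limit assumed for this sketch rather than something described above.

```python
import random

def sample_index(probs, rng):
    """Draw an index according to its probability using the cumulative sum."""
    threshold = rng.random()
    cumulative = 0.0
    for idx, prob in enumerate(probs):
        cumulative += prob
        if threshold < cumulative:
            return idx
    return len(probs) - 1  # guard against floating-point rounding

def generate_sentence(next_word_probs, lookup, eos_tokens, rng, max_words=20):
    """Sample words one at a time until one ends in an end-of-sentence token."""
    generated = []
    while len(generated) < max_words:
        probs = next_word_probs(generated)  # distribution given the words so far
        word = lookup[sample_index(probs, rng)]
        generated.append(word)
        if word[-1] in eos_tokens:
            break
    return generated

# hypothetical stand-in for the trained model: a fixed distribution over 3 toy words
toy_lookup = {0: "they", 1: "went", 2: "home."}

def toy_model(words_so_far):
    return [0.4, 0.4, 0.2]

sentence = generate_sentence(toy_model, toy_lookup, [".", "?", "!"], random.Random(0))
print(" ".join(sentence))
```

Because words are sampled rather than always taking the most probable one, running the loop with a different random seed gives a different sentence.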

In [315]:
'''set up stories used for generation by separating final sentence from first four'''

heldout_endings = [story[-1] for story in stories[-20:]]
heldout_stories = [" ".join(story[:-1]) for story in stories[-20:]]
heldout_idxs = tokenizer.texts_to_sequences(heldout_stories)
print "STORY:", heldout_stories[0], "\n", heldout_idxs[0], "\n"
print "GIVEN ENDING:", heldout_endings[0]

STORY: Dan's parents were overweight. Dan was overweight as well. The doctors told his parents it was unhealthy. His parents understood and decided to make a change. 
[1986, 159, 33, 4116, 176, 4, 3784, 34, 517, 1, 3573, 57, 9, 159, 15, 4, 3633, 9, 159, 3840, 6, 27, 2, 69, 3, 503] 


GIVEN ENDING: They got themselves and Dan on a diet.


When generating, the model predicts one word at a time for a given story, but the trained model expects that batch size = 20 and n_timesteps = 70. The easiest thing to do is duplicate the trained model but set the batch size = 1 and n_timesteps = 1. To do this, we just create a new model with these settings and then copy the parameters (weights) of the trained model over to the new model.

In [316]:
'''duplicate the trained RNN but set batch size = 1 and n_timesteps = 1'''

generation_rnn = create_rnn(lexicon_size = len(tokenizer.word_index),
                            n_embedding_nodes = 300,
                            n_hidden_nodes = 500,
                            batch_size = 1,
                            n_timesteps = 1)
generation_rnn.set_weights(rnn.get_weights())

The model will generate word indices, so we need to map these numbers back to their corresponding strings. We'll reverse the lexicon dictionary to create a lookup table to get each word from its index.

In [322]:
'''create lookup table to get string words from their indices'''

lexicon_lookup = {index: word for word, index in tokenizer.word_index.items()}
eos_tokens = [".", "?", "!"] #specify which characters should indicate the end of a sentence and halt generation

print "\n".join([str(word_idx) + ": " + word for word_idx, word 
                 in lexicon_lookup.items()[:20]]) #print a sample of the lookup map

1: the
2: to
3: a
4: was
5: he
6: and
7: she
8: her
9: his
10: i
11: for
12: in
13: of
14: had
15: it
16: on
17: my
18: they
19: with
20: that


In [318]:
'''use RNN to generate new endings for stories'''

for story, story_idxs, ending in zip(heldout_stories, heldout_idxs, heldout_endings):
    print "STORY:", story
    print "GIVEN ENDING:", ending
    
    generated_ending = []
    
    story_idxs = numpy.array(story_idxs)[None] #format story with shape (1, length)
    
    #read the given story one word at a time to build up the hidden state
    for step_idx in range(story_idxs.shape[-1]):
        p_next_word = generation_rnn.predict_on_batch(story_idxs[:, step_idx:step_idx + 1])[0,-1] #input shape will be (1, 1)

    #now start predicting new words
    while not generated_ending or lexicon_lookup[next_word][-1] not in eos_tokens:
        next_word = numpy.random.choice(a=p_next_word.shape[-1], p=p_next_word)
        generated_ending.append(next_word)
        p_next_word = generation_rnn.predict_on_batch(numpy.array(next_word)[None,None])[0,-1]
    
    generation_rnn.reset_states() #reset hidden state after generating ending
    
    generated_ending = " ".join([lexicon_lookup[word] 
                                 for word in generated_ending]) #decode from numbers back into words
    print "GENERATED ENDING:", generated_ending
    print "\n"

STORY: Dan's parents were overweight. Dan was overweight as well. The doctors told his parents it was unhealthy. His parents understood and decided to make a change.
GIVEN ENDING: They got themselves and Dan on a diet.
GENERATED ENDING: they concert.


STORY: Carrie had just learned how to ride a bike. She didn't have a bike of her own. Carrie would sneak rides on her sister's bike. She got nervous on a hill and crashed into a wall.
GIVEN ENDING: The bike frame bent and Carrie got a deep gash on her leg.
GENERATED ENDING: carrie agreed and positive over her side but time she would sneak but she bike into the bike for the wall.


STORY: Morgan enjoyed long walks on the beach. She and her boyfriend decided to go for a long walk. After walking for over a mile, something happened. Morgan decided to propose to her boyfriend.
GIVEN ENDING: Her boyfriend was upset he didn't propose to her first.
GENERATED ENDING: the boyfriend would laughed for her with her morning searching something that tr

## <font color='#6629b2'>Helpful resources about RNNs for text processing</font>

Among the [Theano tutorials](http://deeplearning.net/tutorial/), there are two specifically on RNNs for NLP: [semantic parsing](http://deeplearning.net/tutorial/rnnslu.html#rnnslu) and [sentiment analysis](http://deeplearning.net/tutorial/lstm.html#lstm)

[The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) (same model as shown here, with raw Python code) 

TensorFlow also has an RNN language model [tutorial](https://www.tensorflow.org/versions/r0.12/tutorials/recurrent/index.html) using the Penn Treebank dataset

This [explanation](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) covers how LSTMs work and why they are better than plain RNNs (it also applies to the GRU used here)

Another [tutorial](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) that documents well both the theory of RNNs and their implementation in Python (and if you care to implement the details of the stochastic gradient descent and backpropagation through time algorithms, this is very informative)