## Script Generation with Recurrent Neural Networks

In this project, I implement a recurrent neural network (RNN) with an LSTM architecture that generates text in the style of "The Adventures of Sherlock Holmes" by Arthur Conan Doyle, building each sentence up character by character.

In [1]:
#importing some useful packages
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

%load_ext autoreload
%autoreload 2

### Read and clean the text dataset 
Read in the text, transforming everything to lower case

In [2]:
# read the file with an explicit encoding and close it when done
with open('datasets/The-Adventures-of-Sherlock-Holmes.txt', encoding='utf-8') as f:
    text = f.read().lower()
print('the text has ' + str(len(text)) + ' characters')

the text has 581864 characters


Print the first 2,000 characters of the raw text to get a sense of what we need to throw out.

In [3]:
text[:2000]

"\ufeffproject gutenberg's the adventures of sherlock holmes, by arthur conan doyle\n\nthis ebook is for the use of anyone anywhere at no cost and with\nalmost no restrictions whatsoever.  you may copy it, give it away or\nre-use it under the terms of the project gutenberg license included\nwith this ebook or online at www.gutenberg.net\n\n\ntitle: the adventures of sherlock holmes\n\nauthor: arthur conan doyle\n\nposting date: april 18, 2011 [ebook #1661]\nfirst posted: november 29, 2002\n\nlanguage: english\n\n\n*** start of this project gutenberg ebook the adventures of sherlock holmes ***\n\n\n\n\nproduced by an anonymous project gutenberg volunteer and jose menendez\n\n\n\n\n\n\n\n\n\nthe adventures of sherlock holmes\n\nby\n\nsir arthur conan doyle\n\n\n\n   i. a scandal in bohemia\n  ii. the red-headed league\n iii. a case of identity\n  iv. the boscombe valley mystery\n   v. the five orange pips\n  vi. the man with the twisted lip\n vii. the adventure of the blue carbuncle\nvii

Cut off the leading characters that are not part of the story (front matter such as the Project Gutenberg header and the table of contents).

In [4]:
text = text[1198:]

Replace line-break characters with spaces.

In [5]:
text = text.replace('\n',' ') 
text = text.replace('\r',' ')

Let's see how the first characters of our text look now.

In [6]:
text[:2000]

" sherlock holmes she is always the woman. i have seldom heard him mention her under any other name. in his eyes she eclipses and predominates the whole of her sex. it was not that he felt any emotion akin to love for irene adler. all emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind. he was, i take it, the most perfect reasoning and observing machine that the world has seen, but as a lover he would have placed himself in a false position. he never spoke of the softer passions, save with a gibe and a sneer. they were admirable things for the observer--excellent for drawing the veil from men's motives and actions. but for the trained reasoner to admit such intrusions into his own delicate and finely adjusted temperament was to introduce a distracting factor which might throw a doubt upon all his mental results. grit in a sensitive instrument, or a crack in one of his own high-power lenses, would not be more disturbing than a strong emot

Print all the unique characters that appear in the text.

In [7]:
set(text)

{' ',
 '!',
 '"',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 ',',
 '-',
 '.',
 '/',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 ':',
 ';',
 '?',
 '@',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 'à',
 'â',
 'è',
 'é'}

Replace every unwanted character (anything other than a lowercase letter, a space, or basic punctuation) with the space character.

In [8]:
# Import regular expressions library
import re

In [9]:
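def cleaned_text(text):
    # punctuation marks to keep in addition to lowercase letters and spaces
    punctuation = ['!', ',', '.', ':', ';', '?']
    for char in set(text):
        # replace anything that is not a-z, a space, or kept punctuation
        if (re.match('[a-z ]',char) is None) and (char not in punctuation):
            text = text.replace(char,' ')

    return text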
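
The loop above calls `str.replace` once per unwanted character. As a minimal sketch of an alternative, the same filtering can be written as a single regular-expression substitution. The helper name `clean_with_regex` and the sample string are made up for illustration and are not part of the notebook.

```python
import re

def clean_with_regex(text):
    # Hypothetical helper: replace every character that is NOT a lowercase
    # ASCII letter, a space, or one of ! , . : ; ? with a single space.
    return re.sub(r'[^a-z !,.:;?]', ' ', text)

# Illustrative sample containing quotes, parentheses, dashes, an accent and digits
sample = 'the "woman" (irene adler) -- café, 1888!'
print(clean_with_regex(sample))
```

Each unwanted character maps to exactly one space, so the length of the text is unchanged, just as with the loop-based version.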

In [10]:
text = cleaned_text(text)

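# shorten extra dead space created above (a single pass: each pair of spaces
# becomes one space, so runs of three or more are shortened but not fully collapsed)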
text = text.replace('  ',' ')
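
A single `replace('  ',' ')` pass halves runs of spaces rather than collapsing them, which is why some double spaces survive in the cleaned text. The sketch below contrasts it with a regular-expression substitution on a toy string. It is only an illustration: applying the regex version to the corpus would likely change the character counts reported in this notebook.

```python
import re

s = 'a    b  c'                           # runs of four and two spaces

once = s.replace('  ', ' ')               # one pass over non-overlapping pairs
collapsed = re.sub(r' {2,}', ' ', s)      # collapse every run to one space

print(repr(once))        # 'a  b c'
print(repr(collapsed))   # 'a b c'
```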

In [11]:
set(text)

{' ',
 '!',
 ',',
 '.',
 ':',
 ';',
 '?',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z'}

Print out some statistics about the dataset

In [12]:
# count the number of unique characters in the text
chars = sorted(list(set(text)))

print ("this corpus has " +  str(len(text)) + " total number of characters")
print ("this corpus has " +  str(len(chars)) + " unique characters")

this corpus has 573785 total number of characters
this corpus has 33 unique characters


### Cut data into input/output pairs

Slide a window of length $T$ along the text corpus. Everything in the window becomes one input, and the character that follows becomes its corresponding output. The gif below illustrates this process of extracting input/output pairs on a small example text, using a window size of $T = 5$.

<img src="images/text_windowing_training.gif" width=400 height=400/>

The window does not have to advance one character at a time: it can move by a fixed step size $M$ greater than 1 (in the gif, $M = 1$). A larger step is useful for long texts like ours, which has over 500,000 characters, because sliding the window one character at a time would create more input/output pairs than we can reasonably compute with.

Sliding a window of size $T = 5$ with a step size of $M = 1$ (the parameters shown in the gif above) over a sequence of characters $s_{1}, s_{2}, \ldots, s_{P}$ produces the following list of input/output pairs:


$$\begin{array}{c|c}
\text{Input} & \text{Output}\\
\hline \color{CornflowerBlue} {\langle s_{1},s_{2},s_{3},s_{4},s_{5}\rangle} & \color{Goldenrod}{ s_{6}} \\
\color{CornflowerBlue} {\langle s_{2},s_{3},s_{4},s_{5},s_{6} \rangle } & \color{Goldenrod} {s_{7} } \\
\color{CornflowerBlue}  {\vdots} & \color{Goldenrod} {\vdots}\\
\color{CornflowerBlue} { \langle s_{P-5},s_{P-4},s_{P-3},s_{P-2},s_{P-1} \rangle } & \color{Goldenrod} {s_{P}}
\end{array}$$

Each input is a sequence (or vector) of 5 characters (in general, its length equals the window size $T$), while each corresponding output is a single character. With $M = 1$ this creates $P - T$ input/output pairs, which is roughly $P$ when $T$ is small relative to $P$. For a general step size $M$ it creates $\lceil (P - T)/M \rceil$ pairs, roughly $P/M$.

This function runs a sliding window along the input text and creates associated input/output pairs.

In [13]:
def window_transform_text(text, window_size, step_size):
    # containers for input/output pairs
    inputs = []
    outputs = []

    for n in range(0, len(text)-window_size, step_size):
        inputs.append(text[n:n+window_size])
        outputs.append(text[n+window_size])

    return inputs,outputs
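
To make the windowing concrete, here is a small sketch on a toy string. The function `sliding_pairs` repeats the logic of `window_transform_text` so the cell runs on its own, and the toy text is an arbitrary choice. It also checks that the number of pairs matches $\lceil (P - T)/M \rceil$, which follows from the `range` call in the loop.

```python
import math

def sliding_pairs(text, window_size, step_size):
    # Same logic as window_transform_text, repeated so this cell is self-contained.
    inputs, outputs = [], []
    for n in range(0, len(text) - window_size, step_size):
        inputs.append(text[n:n + window_size])
        outputs.append(text[n + window_size])
    return inputs, outputs

toy = 'sherlock holmes'    # P = 15 characters
T = 5

for M in (1, 3):
    ins, outs = sliding_pairs(toy, window_size=T, step_size=M)
    expected = math.ceil((len(toy) - T) / M)
    print('M =', M, '| pairs =', len(ins), '| formula =', expected)
    print('  first pairs:', list(zip(ins, outs))[:2])
```

With $M = 1$ the toy text yields 10 pairs, and with $M = 3$ it yields 4.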

Extract input/output pairs with the sliding window function

In [14]:
window_size = 50
step_size = 5
inputs, outputs = window_transform_text(text,window_size,step_size)

Print out a few input/output pairs

In [15]:
# print out a few of the input/output pairs to verify that we've made the right kind of stuff to learn from
print('input = ' + inputs[6])
print('output = ' + outputs[6])
print('--------------')
print('input = ' + inputs[101])
print('output = ' + outputs[101])

input =  the woman. i have seldom heard him mention her un
output = d
--------------
input = oke of the softer passions, save with a gibe and a
output =  


### One-hot encoding the characters

Transform each character in our inputs/outputs into a vector whose length equals the number of unique characters in our text. This vector is all zeros except for a single 1, whose position is unique to each character. For example, we transform 'a', 'b', and 'c' as follows

$$a\longleftarrow\left[\begin{array}{c}
1\\
0\\
0\\
\vdots\\
0\\
0
\end{array}\right]\,\,\,\,\,\,\,b\longleftarrow\left[\begin{array}{c}
0\\
1\\
0\\
\vdots\\
0\\
0
\end{array}\right]\,\,\,\,\,c\longleftarrow\left[\begin{array}{c}
0\\
0\\
1\\
\vdots\\
0\\
0 
\end{array}\right]\cdots$$

where the number of entries in each vector equals the number of unique characters in the text.

Form a dictionary mapping each unique character to a unique integer, and one dictionary to do the reverse mapping.  We can then use these dictionaries to quickly make our one-hot encodings, as well as re-translate (from integers to characters) the results of our trained RNN classification model.

In [16]:
# this dictionary maps each unique character to a unique integer
chars_to_indices = dict((c, i) for i, c in enumerate(chars))

# this dictionary maps each unique integer back to a unique character
indices_to_chars = dict((i, c) for i, c in enumerate(chars))
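
As a quick illustration of how the two dictionaries and the one-hot vectors fit together, the sketch below uses a four-character toy alphabet. The names `toy_chars`, `to_index`, `to_char`, and `one_hot` are invented for this example. In the notebook the position of each character depends on the sorted order of the full character set.

```python
import numpy as np

toy_chars = sorted(set('abca '))    # [' ', 'a', 'b', 'c']
to_index = {c: i for i, c in enumerate(toy_chars)}
to_char = {i: c for i, c in enumerate(toy_chars)}

def one_hot(char):
    # all zeros except a single True at the character's index
    vec = np.zeros(len(toy_chars), dtype=bool)
    vec[to_index[char]] = True
    return vec

vec = one_hot('b')
print(vec.astype(int))                  # [0 0 1 0]
print(to_char[int(np.argmax(vec))])     # b
```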

This function takes in the raw character input/outputs and returns their numerical versions

In [17]:
def encode_io_pairs(text,window_size,step_size):
    # number of unique chars
    chars = sorted(list(set(text)))
    num_chars = len(chars)
    
    # cut up text into character input/output pairs
    inputs, outputs = window_transform_text(text,window_size,step_size)
    
    # create empty matrix for one-hot encoded input/output
    # (np.bool was removed from recent NumPy releases; the builtin bool is equivalent)
    X = np.zeros((len(inputs), window_size, num_chars), dtype=bool)
    y = np.zeros((len(inputs), num_chars), dtype=bool)
    
    # loop over inputs/outputs and transform and store in X/y
    for i, sentence in enumerate(inputs):
        for t, char in enumerate(sentence):
            X[i, t, chars_to_indices[char]] = 1
        y[i, chars_to_indices[outputs[i]]] = 1
        
    return X,y

One-hot encode the input/output pairs

In [18]:
window_size = 50
step_size = 5
X,y = encode_io_pairs(text,window_size,step_size)
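
One way to sanity-check an encoding like this is to decode a window back to text and compare it with the original. The sketch below does so on a toy window with its own small alphabet. `encode_window` and `decode_window` are hypothetical helpers, not functions from the notebook.

```python
import numpy as np

window = 'the woman.'
alphabet = sorted(set(window))
idx = {c: i for i, c in enumerate(alphabet)}

def encode_window(w):
    # one row per character position, one column per alphabet entry
    X = np.zeros((len(w), len(alphabet)), dtype=bool)
    for t, ch in enumerate(w):
        X[t, idx[ch]] = True
    return X

def decode_window(X):
    # argmax along each row recovers the index of the single True entry
    return ''.join(alphabet[i] for i in X.argmax(axis=1))

X_toy = encode_window(window)
print(X_toy.shape)                # (window length, alphabet size)
print(decode_window(X_toy))       # the woman.
```

Every row should contain exactly one `True`, so `X_toy.sum(axis=1)` is a vector of ones.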

### Build and train the Recurrent Neural Network

Build the model

In [19]:
### necessary functions from the keras library
from keras.models import Sequential
from keras.layers import Dense, Activation, LSTM
from keras.optimizers import RMSprop
import keras

Using TensorFlow backend.


In [26]:
model = None
model = Sequential()
model.add(LSTM(1024, input_shape=(window_size,len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

# initialize optimizer
# (newer Keras releases spell the `lr` argument `learning_rate`)
optimizer = keras.optimizers.RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)

# compile model
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
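
For a sense of the model's size, the parameter count can be worked out by hand. The sketch below assumes the standard LSTM formulation with four gates, each having input weights, recurrent weights, and one bias vector, and uses the 33 unique characters reported earlier. Calling `model.summary()` on the model above is the way to confirm the actual figures.

```python
def lstm_param_count(units, input_dim):
    # four gates, each with input weights, recurrent weights, and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(n_inputs, n_outputs):
    # weight matrix plus one bias per output
    return n_inputs * n_outputs + n_outputs

num_chars = 33     # unique characters in the cleaned corpus
units = 1024       # LSTM size used in the model above

lstm = lstm_param_count(units, num_chars)
dense = dense_param_count(units, num_chars)
print(lstm, dense, lstm + dense)    # 4333568 33825 4367393
```

Under these assumptions, almost all of the roughly 4.4 million parameters sit in the LSTM layer.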

Train the model

In [None]:
model.fit(X[:10000], y[:10000], batch_size=500, epochs=200, verbose=1)

# save weights
model.save_weights('model_weights/best_RNN_textdata_weights.hdf5')

Load the saved model weights (only needed if the training cell above was not run in this session).

In [28]:
model.load_weights('model_weights/best_RNN_textdata_weights.hdf5')

This function uses the trained model to predict a desired number of future characters

In [29]:
def predict_next_chars(model,input_chars,num_to_predict):     
    # create output
    predicted_chars = ''
    for i in range(num_to_predict):
        # one-hot encode the current input window
        x_test = np.zeros((1, window_size, len(chars)))
        for t, char in enumerate(input_chars):
            x_test[0, t, chars_to_indices[char]] = 1.

        # make this round's prediction
        test_predict = model.predict(x_test,verbose = 0)[0]

        # translate numerical prediction back to characters
        r = np.argmax(test_predict) # greedy choice: index of the most probable character
        d = indices_to_chars[r] 

        # update predicted_chars and input
        predicted_chars+=d
        input_chars+=d
        input_chars = input_chars[1:]
    return predicted_chars
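
The function above always picks the most probable character (`np.argmax`). A common alternative, which this notebook does not use, is to sample the next character from the predicted distribution after rescaling it with a temperature. The sketch below is a hedged illustration of that idea. `sample_index` and the example probabilities are made up, and it has not been tested with this model.

```python
import numpy as np

def sample_index(probs, temperature=1.0, rng=None):
    # Hypothetical helper: draw an index from a temperature-scaled distribution.
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype=np.float64)
    # temperature < 1 sharpens the distribution, temperature > 1 flattens it
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()              # subtract the max for numerical stability
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

probs = [0.6, 0.3, 0.1]                 # made-up softmax output
rng = np.random.default_rng(0)
draws = [sample_index(probs, temperature=0.5, rng=rng) for _ in range(1000)]
print(np.bincount(draws, minlength=3) / 1000)
```

In `predict_next_chars`, such a helper could stand in for the `np.argmax` call, at the cost of making the generated text non-deterministic.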

Generate text using the trained model

In [33]:
start_inds = [0, 10200, 350000]

for s in start_inds:
    start_index = s
    input_chars = text[start_index: start_index + window_size]

    # use the prediction function
    predict_input = predict_next_chars(model,input_chars,num_to_predict = 100)

    # print out input characters
    line = '-------------------' + '\n'
    print(line)

    input_line = 'input chars = ' + '\n' +  input_chars + '"' + '\n'
    print(input_line)

    # print out predicted characters
    predict_line = 'predicted chars = ' + '\n' +  predict_input + '"' + '\n'
    print(predict_line)

-------------------

input chars = 
 sherlock holmes she is always the woman. i have s"

predicted chars = 
ee  not gill. i lat at uicer. tidd we d ar an om the riaming rowhit the king ind lefriog at as se pa"

-------------------

input chars = 
.  come in! said holmes. a man entered who could h"

predicted chars = 
al her i undineer to y more anterteds ntrestqut to yourreas norem, and i way neruphed aid outito hom"

-------------------

input chars = 
hat will you do?  we shall spend the night in your"

predicted chars = 
 bof ir anced to te was  ntershar. i  a d tor tifnt kith.  the ntumh terr the rhorofraph i sacganit "



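As we can see, the generated text has word-like spacing and punctuation but contains few valid English words. Character-by-character generation with this setup (greedy decoding from a model trained on only the first 10,000 input/output pairs) isn't ideal for generating valid English words.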