# Recurrent Neural Networks and LSTMs

This notebook attempts to learn the structure of a corpus of text on a character-by-character basis and to output pseudo-text in the same style. The canonical example is the complete works of Shakespeare, but the approach will work with any text that has a reasonably consistent style.

The notebook [prep_text](.prep_text.ipynb) is used to download and preprocess a suitable text file for use with this notebook.

In [2]:
import keras
import os
import numpy as np

from sklearn.model_selection import train_test_split

Using TensorFlow backend.


Parameters. The number of unrollings (`num_enrollings` in the code) is how many characters the model looks back at a time. A value of 10 is enough to learn the structure of most words, but not sentence and line structure. Around 100 is necessary to begin to learn the actual structure of the text, such as the layout of the verses.

In [3]:
num_enrollings = 84
hidden_units = 128

## Data Prep

Load the data into memory. This is simply a single string of text. We then need to iterate over batches of data. Each training sample is (for example) 10 one-hot encoded characters, and the label is the 11th character (also one-hot encoded). The model therefore tries to predict the 11th character based on the preceding 10. It uses an LSTM to do this.
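To make the windowing concrete, here is a minimal pure-Python sketch on a toy string, using a window of 3 rather than 10. The names `toy_text`, `window` and `one_hot` are illustrative assumptions and not part of the notebook, which relies on `keras.utils.to_categorical` for the encoding.

```python
# Toy illustration of the sample/label construction (assumed names, not notebook code).
toy_text = "hello"
window = 3  # plays the role of num_enrollings

# Sorted so that the character-to-integer mapping is reproducible.
chars = sorted(set(toy_text))  # ['e', 'h', 'l', 'o']
char_to_int = {c: i for i, c in enumerate(chars)}


def one_hot(index, size):
    """Return a list of length `size` with a 1 at `index` and 0 elsewhere."""
    return [1 if i == index else 0 for i in range(size)]


encoded = [char_to_int[c] for c in toy_text]

# Each sample is `window` consecutive characters; the label is the character that follows.
samples = [encoded[i:i + window] for i in range(len(encoded) - window)]
labels = [encoded[i + window] for i in range(len(encoded) - window)]

samples_one_hot = [[one_hot(k, len(chars)) for k in s] for s in samples]
labels_one_hot = [one_hot(k, len(chars)) for k in labels]

for s, l in zip(samples, labels):
    print("".join(chars[k] for k in s), "->", chars[l])
```

This prints `hel -> l` and `ell -> o`: a five-character text with a window of 3 yields two training samples.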

In [4]:
BASE_DIR = "../data/text"
file = "cleanshake.txt"
file_name = os.path.join(BASE_DIR, file)

In [5]:
with open(file_name, 'rt') as f:
    text = f.read()

The first step is to create a mapping from characters to numbers and, for convenience, one the other way round. Then we replace our data with a list of integers.

In [6]:
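# Sort the characters so the mapping is the same on every run;
# iteration order of a plain set of strings can change between sessions.
all_chars = sorted(set(text))
total_chars = len(all_chars)
char2num = {c: i for i, c in enumerate(all_chars)}
num2chars = {i: c for c, i in char2num.items()}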

Now convert the data set into a list of integers, build the samples and labels, and split the training set into a small number of batches. There are some edge effects here, but they are unimportant because the whole data set of more than 6 million characters is split into only a small number of batches (`batches = 16` below).

In [7]:
all_int = [char2num[t] for t in text]
int_data = [all_int[i:i + num_enrollings] for i in range(len(text) - num_enrollings)]
int_labels = [all_int[i + num_enrollings] for i in range(len(text) - num_enrollings)]
train_data, test_data, train_labels, test_labels = train_test_split(int_data, int_labels, train_size=0.9)
# The bigger the number of unrollings, the more batches you will need.
# With 10 unrollings you don't need to batch at all; with around 64,
# 8 batches will do fine.
batches = 16
batch_size = int(np.ceil(len(train_data)/batches))
batched_data = [train_data[i*batch_size:(i+1)*batch_size] for i in range(batches)]
batched_labels = [train_labels[i*batch_size:(i+1)*batch_size] for i in range(batches)]
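The reason batching becomes necessary as the window grows is that the one-hot encoded data scales with samples × window × alphabet size. The sketch below is a rough estimate only: `one_hot_bytes` is a hypothetical helper, the sample count and alphabet size are made-up illustrative figures, and it assumes 4 bytes per value (32-bit floats, which we believe is the default dtype of `keras.utils.to_categorical`).

```python
def one_hot_bytes(num_samples, window, num_classes, bytes_per_value=4):
    """Rough size in bytes of one-hot encoded windows plus their one-hot labels."""
    data = num_samples * window * num_classes * bytes_per_value
    labels = num_samples * num_classes * bytes_per_value
    return data + labels


# Hypothetical figures for illustration only, not measured from the notebook.
num_samples = 1_000_000
num_classes = 60
num_batches = 16

for window in (10, 64):
    total = one_hot_bytes(num_samples, window, num_classes)
    per_batch = total / num_batches
    print("window={}: {:.2f} GB in total, {:.2f} GB per batch".format(
        window, total / 1e9, per_batch / 1e9))
```

Because only one batch is one-hot encoded at a time, peak memory is governed by the per-batch figure rather than the total.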



In [8]:
# test_data = keras.utils.to_categorical(test_data)
# test_labels = keras.utils.to_categorical(test_labels)

Now write a function that takes a batch number and returns the one-hot encoded training data and labels for that batch. The batches can be quite big, as we are close to being able to do the whole thing in one go.

In [9]:
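def get_batch(n):
    data = batched_data[n]
    labels = batched_labels[n]
    # Fix the number of classes so every batch has the same width,
    # even if the highest-numbered character is missing from it.
    data = keras.utils.to_categorical(data, num_classes=total_chars)
    labels = keras.utils.to_categorical(labels, num_classes=total_chars)
    return data, labels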

## Building the model

Use Keras to construct the model: an LSTM followed by a dense output layer with a softmax activation.

In [10]:
model = keras.models.Sequential()
model.add(keras.layers.LSTM(input_shape=(num_enrollings, total_chars), units=hidden_units, use_bias=False))
model.add(keras.layers.Dense(total_chars, activation='softmax', use_bias=False))
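As a sanity check on the size of this model, the number of trainable weights can be worked out by hand. The sketch below uses the standard LSTM layout of four weight blocks (input, forget and output gates plus the cell candidate), each with an input kernel and a recurrent kernel. The helper names and the value `vocab_size = 70` are assumptions for illustration; in the notebook the real value is `total_chars`.

```python
def lstm_param_count(input_dim, units, use_bias=True):
    """Weights in a standard LSTM layer: four blocks, each with input and recurrent kernels."""
    per_block = input_dim * units + units * units + (units if use_bias else 0)
    return 4 * per_block


def dense_param_count(input_dim, units, use_bias=True):
    """Weights in a fully connected layer."""
    return input_dim * units + (units if use_bias else 0)


vocab_size = 70  # hypothetical alphabet size; the notebook uses total_chars
hidden = 128     # matches hidden_units above

lstm_weights = lstm_param_count(vocab_size, hidden, use_bias=False)
dense_weights = dense_param_count(hidden, vocab_size, use_bias=False)
print("LSTM:", lstm_weights, "Dense:", dense_weights, "Total:", lstm_weights + dense_weights)
```

The figures can be compared against the output of `model.summary()` once `total_chars` is known.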

In [None]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

## Train the model

On a three-year-old mid-range Mac this takes about 3 hours to go through all of the data once. Consider optimising the code or renting a GPU if you have more ambitious plans, as this is about the limit of what such a machine can handle.

In [None]:
# For testing only
sample_size = 10000
for b in range(batches):
    print("Preparing batch...")
    d, l = get_batch(b)
    print("Fitting Model")
    model.fit(d, l)
    # Test periodically. This slows it down! Just use the final output
    # as evaluation for now!
#     _, ac = model.evaluate(test_data, test_labels)
#     print("Accuracy: {}%".format(ac*100))

Preparing batch...
Fitting Model
Epoch 1/1

## Generate some text

Use the trained model to generate some pseudo-text in the style of Shakespeare.

Two helper functions are useful here: one that turns a predicted probability vector over characters into a single character, and one that one-hot encodes a single character.

In [None]:
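def onehot2string(onehot, r=0):
    """
    Turn a probability vector over characters into a character.
    r is a measure of randomness: it is the probability of
    sampling the character from the predicted distribution
    instead of taking the most likely one. r=0 means we
    (almost) always take the most likely character; r=1 means
    we always sample.
    """
    ran = np.random.random()
    if ran > r:
        return num2chars[onehot.argmax()]
    else:
        # Cast to float64 and renormalise: float32 softmax output may
        # not sum to 1 within the tolerance np.random.choice requires.
        p = np.asarray(onehot, dtype=np.float64)
        return num2chars[np.random.choice(total_chars, p=p / p.sum())]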
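The mixing of greedy and random choices can be tried out without a trained model. The following is a minimal standard-library sketch of the same idea; `pick_index` and its `rng` parameter are hypothetical names introduced here so that the behaviour can be made reproducible with a seeded generator.

```python
import random


def pick_index(probs, r=0.0, rng=random):
    """With probability r, sample an index from probs; otherwise return the argmax."""
    if rng.random() >= r:
        # Greedy choice: index of the largest probability.
        return max(range(len(probs)), key=lambda i: probs[i])
    # random.choices accepts unnormalised weights, so no renormalisation is needed.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = [0.1, 0.7, 0.2]
rng = random.Random(0)  # seeded so that repeated runs give the same draws

greedy = [pick_index(probs, r=0.0, rng=rng) for _ in range(5)]
sampled = [pick_index(probs, r=1.0, rng=rng) for _ in range(2000)]
print("greedy:", greedy)
print("sampled counts:", [sampled.count(i) for i in range(len(probs))])
```

With `r=0.0` every draw is the most likely index; with `r=1.0` the counts roughly follow the probabilities. Intermediate values such as the `r = 0.1` used below mostly follow the model but occasionally take a less likely character, which helps generated text avoid getting stuck in a loop.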

In [None]:
def char2onehot(s):
    """
    Returns the one-hot representation of the character
    """
    x = np.zeros(total_chars)
    x[char2num[s]] = 1
    return x

In [None]:
def next_char(s, r=0):
    """
    Gets the next character by calling the model on the input.
    """
    
    onehots = np.array([[char2onehot(c) for c in s]])
    out = model.predict(onehots)[0]
    return onehot2string(out, r)
    

In [None]:
start = np.random.randint(len(text) - num_enrollings)
all_text = list(text[start: start+num_enrollings])
print("".join(all_text))
output_length = 2000
r = 0.1
for i in range(output_length):
    nc = next_char(all_text[-num_enrollings:], r)
    all_text.append(nc)

In [None]:
print("".join(all_text))

Possible extension: try using a beam search. At each step, generate the n (= 5) most likely next characters for every candidate sequence, which gives `5**s` candidates after `s` steps. Do this for a small number of steps (say 10), keep the most likely sequence of characters found, and repeat.
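A minimal sketch of this idea follows, with one difference stated plainly: rather than keeping all `5**s` candidates, it prunes back to the `beam_width` best partial sequences after every step, which is the usual form of beam search. The callable `next_probs` is a hypothetical stand-in for the model; in the notebook it would need to be an adapter around `model.predict` that returns a probability for each character. The toy model is invented purely to show a case where the greedy choice is not the best one.

```python
import math


def beam_search(seed, next_probs, steps, beam_width=5):
    """Return the most likely continuation of `seed` found by beam search.

    next_probs(text) is a hypothetical callable returning {char: probability}
    for the character that follows `text`.
    """
    beams = [(0.0, "")]  # (log-probability, generated text)
    for _ in range(steps):
        candidates = []
        for log_p, generated in beams:
            probs = next_probs(seed + generated)
            # Expand each beam with its beam_width most likely characters.
            top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:beam_width]
            for ch, p in top:
                if p > 0:
                    candidates.append((log_p + math.log(p), generated + ch))
        # Prune back to the beam_width best partial sequences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams[0][1]


def toy_next_probs(text):
    """Toy model: 'a' is the likeliest first letter but leads to uncertain follow-ups."""
    if text.endswith("a"):
        return {"a": 0.34, "b": 0.33, "c": 0.33}
    if text.endswith("b"):
        return {"c": 0.9, "a": 0.05, "b": 0.05}
    return {"a": 0.5, "b": 0.4, "c": 0.1}


print(beam_search("x", toy_next_probs, steps=2, beam_width=2))
print(beam_search("x", toy_next_probs, steps=2, beam_width=1))
```

With a beam width of 2 the search returns `bc` (probability 0.4 × 0.9 = 0.36), whereas a beam width of 1, which is equivalent to always taking the most likely character, returns `aa` (0.5 × 0.34 = 0.17).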