# Recurrent Neural Networks and LSTMs

This notebook attempts to learn the structure of a corpus of text character by character and then output pseudo-text in the same style. The canonical example is the complete works of Shakespeare, but the approach should work with any corpus that has a reasonably consistent style.

The notebook [prep_text](.prep_text.ipynb) is used to download and preprocess a suitable text file for use with this notebook.

In [2]:
import keras
import os
import numpy as np

from sklearn.model_selection import train_test_split

Using TensorFlow backend.


Parameters. The number of unrollings (`num_enrollings` in the code) is how many characters the model looks back at a time. A value of 10 is enough to learn the structure of most words, but not sentence or line structure. Around 100 is needed to begin learning the actual layout of the text, such as the structure of the verses.

In [3]:
num_enrollings = 84
hidden_units = 128

## Data Prep

Load the data into memory. It is simply a single string of text. We then need to iterate over batches of data. Each training sample is (for example) 10 one-hot-encoded characters, and the label is the 11th character (also one-hot encoded). The model therefore tries to predict the 11th character from the preceding 10, using an LSTM.
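
To make the sample and label layout concrete, here is a minimal sketch on a toy string. The names `toy_text`, `toy_window` and the other `toy_*` variables are hypothetical stand-ins and are not part of this notebook; the one-hot encoding uses plain NumPy rather than Keras.

```python
import numpy as np

# Toy stand-ins for the notebook's `text` and `num_enrollings` (hypothetical values).
toy_text = "to be or not to be"
toy_window = 10

# Reproducible character-to-integer mapping for the toy text.
toy_vocab = sorted(set(toy_text))
toy_char2num = {c: i for i, c in enumerate(toy_vocab)}
toy_ints = [toy_char2num[c] for c in toy_text]

# Each sample is `toy_window` consecutive characters; its label is the next one.
n_samples = len(toy_ints) - toy_window
toy_data = [toy_ints[i:i + toy_window] for i in range(n_samples)]
toy_labels = [toy_ints[i + toy_window] for i in range(n_samples)]

# One-hot encode by indexing into an identity matrix.
eye = np.eye(len(toy_vocab), dtype=np.float32)
x = eye[np.array(toy_data)]      # shape: (samples, window, vocabulary size)
y = eye[np.array(toy_labels)]    # shape: (samples, vocabulary size)

print(repr(toy_text[:toy_window]), "->", repr(toy_text[toy_window]))
print(x.shape, y.shape)
```

The first sample is the first ten characters and its label is the eleventh; every later sample is the same window shifted along by one character.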

In [4]:
BASE_DIR = "../data/text"
file = "cleanshake.txt"
file_name = os.path.join(BASE_DIR, file)

In [5]:
with open(file_name, 'rt') as f:
    text = f.read()

The first step is to create a mapping from characters to numbers and, for convenience, one the other way round. Then we replace our data with a list of integers.

In [6]:
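# Sort the characters so the mapping is identical on every run; plain set
# ordering can change between Python sessions, which would break reloaded weights.
all_chars = sorted(set(text))
total_chars = len(all_chars)
char2num = {c: i for i, c in enumerate(all_chars)}
num2chars = {i: c for c, i in char2num.items()}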
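
As a quick sanity check of the idea, the sketch below encodes a short string to integers and decodes it again. The names `sample`, `c2n` and `n2c` are hypothetical, and the vocabulary is sorted here as an assumption so that the integer codes are the same on every run.

```python
sample = "hello world"

# Build the two mappings from the distinct characters of the sample.
vocab = sorted(set(sample))
c2n = {c: i for i, c in enumerate(vocab)}
n2c = {i: c for c, i in c2n.items()}

# Encode to integers and decode back to text.
encoded = [c2n[c] for c in sample]
decoded = "".join(n2c[i] for i in encoded)

print(encoded)
print(decoded)
```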

Now convert the data set into a list of numbers and split this list into a small number of batches. There are some edge effects at the batch boundaries, but these are unimportant because the whole data set of over 6 million characters is split into only a small number of batches (set by `batches` in the code).

In [7]:
all_int = [char2num[t] for t in text]
int_data = [all_int[i:i + num_enrollings] for i in range(len(text) - num_enrollings)]
int_labels = [all_int[i + num_enrollings] for i in range(len(text) - num_enrollings)]
train_data, test_data, train_labels, test_labels = train_test_split(int_data, int_labels, train_size=0.9)
# The larger num_enrollings is, the more batches you will need.
# With 10 you don't need to batch at all; with around 64,
# 8 batches will do fine.
batches = 16
batch_size = int(np.ceil(len(train_data)/batches))
batched_data = [train_data[i*batch_size:(i+1)*batch_size] for i in range(batches)]
batched_labels = [train_labels[i*batch_size:(i+1)*batch_size] for i in range(batches)]
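
A rough back-of-the-envelope estimate shows why the one-hot data has to be built in batches. This is a sketch under stated assumptions: the vocabulary size of 80 is hypothetical, the corpus is taken as 6 million characters, and each value is assumed to occupy 4 bytes (float32). The real figures depend on the corpus and the Keras version.

```python
def one_hot_bytes(num_samples, window, vocab_size, bytes_per_value=4):
    """Approximate size in bytes of a dense one-hot array of training samples."""
    return num_samples * window * vocab_size * bytes_per_value

assumed_chars = 6_000_000   # "6 mil +" characters, rounded down
assumed_vocab = 80          # hypothetical vocabulary size
window = 84                 # num_enrollings in this notebook
n_batches = 16              # batches in this notebook

train_samples = assumed_chars * 9 // 10   # 90% training split
total = one_hot_bytes(train_samples, window, assumed_vocab)
per_batch = total / n_batches

print("all at once: {:.1f} GB".format(total / 1e9))
print("per batch:   {:.1f} GB".format(per_batch / 1e9))
```

The size grows linearly with the window length, which is why a larger number of unrollings needs more batches.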



In [8]:
# test_data = keras.utils.to_categorical(test_data)
# test_labels = keras.utils.to_categorical(test_labels)

Now write a function that takes a batch index and returns the data and labels for that batch, one-hot encoded. Encoding one batch at a time keeps memory use manageable. The batches can be quite big, as we are close to being able to do the whole thing in one go.

In [9]:
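def get_batch(n):
    """Return the one-hot encoded data and labels for batch n."""
    data = batched_data[n]
    labels = batched_labels[n]
    # Fix the number of classes so every batch has the same width,
    # even if a rare character is missing from it.
    data = keras.utils.to_categorical(data, num_classes=total_chars)
    labels = keras.utils.to_categorical(labels, num_classes=total_chars)
    return data, labels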
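
One thing to watch when encoding batch by batch: if the number of classes is inferred from the largest index present, a batch that happens to lack the highest-numbered character comes out narrower than the others. The sketch below illustrates this with a NumPy stand-in, `one_hot`, which is a hypothetical helper and not the Keras function itself.

```python
import numpy as np

def one_hot(indices, num_classes=None):
    """NumPy stand-in for one-hot encoding.

    When num_classes is None the width is inferred from the largest index.
    """
    indices = np.asarray(indices, dtype=int)
    if num_classes is None:
        num_classes = int(indices.max()) + 1
    return np.eye(num_classes, dtype=np.float32)[indices]

vocab_size = 5
batch_a = [[0, 1, 4], [2, 3, 4]]   # contains the highest index, 4
batch_b = [[0, 1, 2], [2, 3, 1]]   # index 4 never appears

shape_a = one_hot(batch_a).shape
shape_b = one_hot(batch_b).shape
shape_b_fixed = one_hot(batch_b, num_classes=vocab_size).shape

print(shape_a, shape_b, shape_b_fixed)
```

Passing the vocabulary size explicitly keeps every batch compatible with the model's input shape.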

## Building the model

Use Keras to construct the model: an LSTM followed by a dense output layer with softmax.

In [10]:
model = keras.models.Sequential()
model.add(keras.layers.LSTM(input_shape=(num_enrollings, total_chars), units=hidden_units, use_bias=False))
model.add(keras.layers.Dense(total_chars, activation='softmax', use_bias=False))
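
To get a feel for the size of this model, the weight count can be worked out by hand. An LSTM has four gates, each with an input weight matrix and a recurrent weight matrix; with `use_bias=False` there are no bias vectors. The sketch below assumes a hypothetical vocabulary size of 80, so the totals are illustrative only; `model.summary()` reports the actual numbers.

```python
def lstm_params(input_dim, units, use_bias=False):
    """Weights in one LSTM layer: 4 gates, each with input and recurrent matrices."""
    per_gate = input_dim * units + units * units + (units if use_bias else 0)
    return 4 * per_gate

def dense_params(input_dim, units, use_bias=False):
    """Weights in one dense layer."""
    return input_dim * units + (units if use_bias else 0)

assumed_total_chars = 80   # hypothetical vocabulary size
hidden = 128               # hidden_units in this notebook

n_lstm = lstm_params(assumed_total_chars, hidden)
n_dense = dense_params(hidden, assumed_total_chars)

print(n_lstm, n_dense, n_lstm + n_dense)
```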

In [11]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

## Train the model

On a three-year-old mid-range Mac this takes about 3 hours to go through all of the data once. Consider optimising the code or renting a GPU if you have more ambitious plans, as this is about the limit of what such a machine can handle.

In [None]:
# For testing only
sample_size = 10000
for b in range(batches):
    print("Preparing batch...")
    d, l = get_batch(b)
    print("Fitting Model")
    model.fit(d, l)
    # Test periodically. This slows it down! Just use the final output
    # as evaluation for now!
#     _, ac = model.evaluate(test_data, test_labels)
#     print("Accuracy: {}%".format(ac*100))

Save the model. Building the model takes no time at all, so we only need to save the weights in order to pick up from here later.

In [36]:
model.save_weights("lstm_{}_{}.h5".format(num_enrollings, hidden_units))

## Generate some text

Use the trained model to generate some pseudo-text in the style of Shakespeare.

Two helper functions are useful here: one that takes a one-hot (or probability) vector and returns the corresponding character, and one that one-hot encodes a character.

In [14]:
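def onehot2string(onehot, r=0):
    """
    Turn a one-hot (or probability) vector into a character.
    r is a measure of randomness: with probability r the
    character is sampled using the values as probabilities,
    otherwise the most likely character is taken. So r=0 is
    always greedy and r=1 always samples.
    """

    ran = np.random.random()
    if ran > r:
        return num2chars[onehot.argmax()]
    else:
        # Renormalise in float64: float32 softmax output can fail
        # np.random.choice's check that probabilities sum to 1.
        p = np.asarray(onehot, dtype=np.float64)
        return num2chars[np.random.choice(total_chars, p=p / p.sum())]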
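
Mixing greedy selection and sampling in this way is equivalent to drawing from a single blended distribution: the most likely character gets an extra weight of `1 - r`, and every character gets `r` times its predicted probability. The helper `effective_distribution` below is a hypothetical illustration of that arithmetic and is not used elsewhere in the notebook.

```python
import numpy as np

def effective_distribution(p, r):
    """Distribution actually sampled when argmax is used with probability 1 - r."""
    p = np.asarray(p, dtype=np.float64)
    p = p / p.sum()
    greedy = np.zeros_like(p)
    greedy[p.argmax()] = 1.0
    return (1.0 - r) * greedy + r * p

probs = [0.6, 0.3, 0.1]   # toy model output over three characters
for r in (0.0, 0.1, 1.0):
    print(r, effective_distribution(probs, r))
```

Even a small `r` leaves the top choice heavily favoured, which fits the way low-randomness output still falls into repetitive phrases.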

In [15]:
def char2onehot(s):
    """
    Returns the one hot representation of the character
    """
    x = np.zeros(total_chars)
    x[char2num[s]] = 1
    return x

In [16]:
def next_char(s, r=0):
    """
    Gets the next character by calling the model on the input.
    """
    
    onehots = np.array([[char2onehot(c) for c in s]])
    out = model.predict(onehots)[0]
    return onehot2string(out, r)
    

In [28]:
def print_text(length, r=0.1):
    start = np.random.randint(len(text) - num_enrollings)
    all_text = list(text[start: start+num_enrollings])
    print("".join(all_text))
    for i in range(length):
        nc = next_char(all_text[-num_enrollings:], r)
        all_text.append(nc)
    print("".join(all_text))

Remember that the beginning of the text is actual Shakespeare, which is used to seed the process. Without any randomness (i.e. always selecting the most likely letter), the model gets stuck in an infinite loop. Nonetheless, it does seem to learn what words, sentences and even lines are, which is something, as none of these concepts have been put into the model by hand. For example, it knows to put only one space between words, to follow a full stop with a capital letter and to break the sentences onto different lines.

In [29]:
print_text(25000, 0)

     Alexandria. CLEOPATRA's palace

      Enter ANTONY, CLEOPATRA, ENOBARBUS, CHARM
     Alexandria. CLEOPATRA's palace

      Enter ANTONY, CLEOPATRA, ENOBARBUS, CHARMIS and SIR TOBY

  MESSENGER. What is the state of the street?
    The street of the streets and the streets and something to the street.
    The street of the streets and the streets and something to the street.
    The street of the streets and the streets and something to the street.
    The street of the streets and the streets and something to the street.
    The street of the streets and the streets and something to the street.
    The street of the streets and the streets and something to the street.
    The street of the streets and the streets and something to the street.
    The street of the streets and the streets and something to the street.
    The street of the streets and the streets and something to the street.
    The street of the streets and the streets and something to the street.
    The street of 

With a little randomness the model still gets stuck in loops, but it is now able to break out of them. It occasionally does something more interesting, like ending a big chunk of text with the word Exit followed by lots of white space, which is exactly what happens in the source text. It also finds one or two more interesting words, such as beauty and promise. It seems to love the letter 's', which is a common letter for a word to start with, so in the absence of any other strong signal it may simply start a word with 's'; this could be why it often falls back on certain words like street and something. It also uses 'the', 'of', 'to' and 'and' very often, which makes sense.

In [30]:
print_text(25000, 0.01)

nd felt them knowingly- the art o' th' court,
    As hard to leave as keep, whose to
nd felt them knowingly- the art o' th' court,
    As hard to leave as keep, whose to the streets for the street.
    The street of the streets and the streets and something to the street.
    The street of the streets and the streets and something to the street.
    The street of the streets and the streets and something to the sweet.
    The street of the streets and the streets and something to the street.
    The street of the serves and strength and streets and stands
    That the promise of the streets and something to the street.
    The street of the streets and the streets and something to the street.
    The street of the streets and the streets and something to the street.
    The street of the streets and the streets and something to the street.
    The street of the streets and the streets and something to the street.
    The street of the streets and the streets and something to the street

With a little more randomness still, it starts doing more interesting things, such as starting blocks of text with a character's name in capitals; it even uses actual characters from the plays. It uses simple stage directions as well, such as Enter and Exit. It also announces scenes, for example "SCENE II" followed by some pseudo-description of the scene.

In [31]:
print_text(25000, 0.1)

w me after. O'er my spirit
    Thy full supremacy thou knew'st, and that
    Thy bec
w me after. O'er my spirit
    Thy full supremacy thou knew'st, and that
    Thy become to the streets as the strong to be some to the see's
    The street of the streets as the street of the street,
    And with the mirdle to what he shall be some to the street.
    The street of the streets and the streets and something to the street.
    The street of Troam of the streets to the streets and stands
    That the promise of the streets and something to the street.
    I do laughter the promise of the street,
    Make the promise of the great strenciness of the street.
    The sun in the part old anquother stand to the street.
    The street upon the strenger of the string the promise
    That the promise of to my hearts and stand to me.
    The street of the streath of the streets are not I shall
    the streetures of the streets and something the promise of the street.
    The street of the streets an

In [32]:
print_text(25000, 0.25)

avenly moisture, air of grace,         64
  Wishing her cheeks were gardens full of 
avenly moisture, air of grace,         64
  Wishing her cheeks were gardens full of that
    beness to the streets and something to the substance of them and sometimes
    That she was the promits of the stwell.
  MENENIUS OF SYRA. I shall be speak to the better court.
    The street of thy restcurs.
                                             Re-enter SECOND GENTLEMaS

  LEONTES. What shall I do not to the stauld!
    The should be grace therefores the promise.
    The street on the part of the intertion of this fool.
    Then the bear and pition not here me.
    The man that the behalf and something the death
    That must be stay to into the streets and stay.
    And so thou shalt not be affection to me.
    The interts and soul to this beaden to this resmies.
    Therefore tears the please the peaces of tongues man
    That the beauty of his something to the doors
    That Prince the promise of my

In [33]:
print_text(25000, 0.5)

 for a falconer’s voice
To lure this tassel-gentle back again.
Bondage is hoarse and
 for a falconer’s voice
To lure this tassel-gentle back again.
Bondage is hoarse and so fellow to stroke,
To Cassion, and so flood, to battly, after buy,
This buse take the slack to the unto the steel,
Thou hast not be so ftam to the streen of the streathes. This is therefore,
Thy strence the promised Sir John, in his son,
Then do not thou shalt they forsworn cannot
  That fortunes with somets with whiles to meet the place
    And soundly to the story beside!
  QUEEN. Here is comfort, and smell and by the enemies and stand forgeblf to promise
    The story of his strenting to his enight,
    And so more garden; he will to the strest for their joys
    That she since, and son so for comfor
-    Flours his slates of the stap to the end of the strutter.
    The father targe ralvery to my weaks
    With Antony of Brothein.
  BY By trumpet son, and what a lie the street;
    At he is donz.
  ROSALIND. The s

In [35]:
print_text(25000, 1.)

orld enrag'd,
    Nor met with fortune other than at feasts,
    Full of warm blood,
orld enrag'd,
    Nor met with fortune other than at feasts,
    Full of warm blood, this needied, after
    a vas thee eqump tongue.' Hadge?.                           Exeunt DUFLH
  3XANSIUS,
               Forth
    Nonge.
  CLOTENCH. Haw well a discan do ward her comforment God?
    You have a destrain, and I ever love,
    True!
    Yea. O, noulity lords.
  CORIN. And concea, man, she senven? set, wonly, my lord.
  FIRST PEMBOT. Well, and King, maughtian, here,
    Enter Butt.
  MESSIA. I do it yeak the obedencition;
    They severe duny from 'that coll to those towhing.
    'Them lim' him and with teek, or everge I
    To make now wherefore you are doors. FOOL.
      By thom; my most queen of a strength. For Queen you succect
    To be kneel's there in our soleming;
    Three mer revole lord, and depince,
    By hand on her that bithis, and mock to
    hear and bloody Majesty some affection, swen

Try using a beam search. At each step generate the n (= 5) most likely next letters and keep expanding, which gives `5**s` candidate sequences after s steps. Do this for a small number of steps, then work out the most likely next 10 letters, say, and repeat.
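
A minimal sketch of this idea follows, written as a standard beam search. Note one difference from the description above: rather than keeping all `5**s` candidates, it prunes back to the best `beam_width` sequences after every step, which keeps the cost linear in the number of steps. The function `next_probs` is a hypothetical callable that maps a context string to a dictionary of next-character probabilities. With this notebook's model it would wrap `model.predict`; here a toy lookup table stands in for it.

```python
import math

def beam_search(seed, next_probs, steps, beam_width=5):
    """Return the most likely `steps`-character continuation of `seed`.

    Each beam is a (log probability, generated suffix) pair.
    """
    beams = [(0.0, "")]
    for _ in range(steps):
        candidates = []
        for logp, suffix in beams:
            probs = next_probs(seed + suffix)
            # Expand each beam with its `beam_width` most likely next characters.
            top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:beam_width]
            for ch, p in top:
                if p > 0:
                    candidates.append((logp + math.log(p), suffix + ch))
        # Keep only the best `beam_width` candidates overall.
        beams = sorted(candidates, reverse=True)[:beam_width]
    return beams[0][1]

def toy_next_probs(context):
    """Hypothetical stand-in for the model: depends only on the last character."""
    table = {
        "a": {"b": 0.6, "c": 0.4},
        "b": {"a": 0.5, "c": 0.5},
        "c": {"c": 0.9, "a": 0.1},
    }
    return table[context[-1]]

greedy = beam_search("a", toy_next_probs, steps=3, beam_width=1)
wider = beam_search("a", toy_next_probs, steps=3, beam_width=2)
print(greedy, wider)
```

In the toy table the greedy path commits to 'b' first and ends up with a less likely sequence overall, while a beam of width 2 finds a continuation with higher total probability.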