https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/

# Word-Level Text Generator in Keras

In this project, I will develop a word-level text generator in Keras using an LSTM. I will train it on a Harry Potter dataset.

In [1]:
import string
from numpy import array
from random import randint
import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.callbacks import LambdaCallback

from pickle import dump

in_filename = '../datasets/harry-potter-1-2.txt'
out_filename = 'harry-potter.txt'
input_size = 50
output_size = 1

Using TensorFlow backend.


## Load the Document

The first step in creating the model is to load the corpus into memory.

In [2]:
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

With the loading function defined, we point it at the corpus we are using (stored in `in_filename` above), which in this case is:

`../datasets/harry-potter-1-2.txt`

In [3]:
# load document
doc = load_doc(in_filename)
print(doc[:200])

Harry Potter and the Sorcerer's Stone CHAPTER ONE THE BOY WHO LIVED Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They wer


## Clean the Document

Now that we've loaded the document into memory, we want to clean it. For example, before splitting the document into words, we replace every "-" with a space so that hyphenated words split more nicely. We also strip the punctuation from each word and drop any token that is not purely alphabetic.

In [4]:
# turn a doc into clean tokens
def clean_doc(doc):
    # make lower case
    doc = doc.lower()
    # replace '--' and '-' with a space ' '
    doc = doc.replace('--', ' ')
    doc = doc.replace('-', ' ')
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    return tokens
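
To see what the cleaning does on a small input, here is a condensed version of the same function applied to a made-up sentence. The sample string is my own and is not taken from the corpus.

```python
import string

def clean_doc(doc):
    # lower-case, then turn '--' and '-' into spaces
    doc = doc.lower().replace('--', ' ').replace('-', ' ')
    # strip punctuation from each whitespace-separated token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in doc.split()]
    # keep only purely alphabetic tokens (this drops numbers such as '10')
    return [w for w in tokens if w.isalpha()]

# hypothetical sample sentence, for illustration only
sample = "Mr. Dursley's well-known rule--no owls!--held for 10 years."
cleaned = clean_doc(sample)
print(cleaned)
```

Note that the apostrophe is removed rather than replaced, so "Dursley's" becomes "dursleys". The same effect is visible in the real output below, where "Sorcerer's" turns into "sorcerers".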

We can then run the cleaning function on the document that we've stored in memory.

In [5]:
# clean document
tokens = clean_doc(doc)
print(tokens[:20])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))

['harry', 'potter', 'and', 'the', 'sorcerers', 'stone', 'chapter', 'one', 'the', 'boy', 'who', 'lived', 'mr', 'and', 'mrs', 'dursley', 'of', 'number', 'four', 'privet']
Total Tokens: 273026
Unique Tokens: 12845


## Dataset Preparation

Now that we've tokenized our data (that is, separated the document into a list of words), we can organize the dataset into input and output words. In this case, the input is 50 words and the output is the single word that follows them. In other words, we build sequences of length 51, or `input_size + output_size`.

The resulting list, `sequences`, is a list of strings with exactly 51 words each.

In [6]:
# organize into sequences of tokens
length = input_size + output_size
sequences = list()
for i in range(length, len(tokens)):
    # select sequence of tokens
    seq = tokens[i-length:i]
    # convert into a line
    line = ' '.join(seq)
    # store
    sequences.append(line)
print('Total Sequences: %d' % len(sequences))

Total Sequences: 272975
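
The sliding window is easier to follow with toy numbers. The sketch below runs the same loop with 3 input words and 1 output word over a made-up token list. The tokens and the small sizes are assumptions for illustration.

```python
# toy token list; the notebook uses the cleaned Harry Potter tokens
tokens = ['the', 'boy', 'who', 'lived', 'came', 'to', 'dinner']
input_size, output_size = 3, 1      # the notebook uses 50 and 1
length = input_size + output_size

sequences = []
for i in range(length, len(tokens)):
    # each window holds `length` consecutive tokens ending just before index i
    sequences.append(' '.join(tokens[i - length:i]))

for line in sequences:
    print(line)
print('Total Sequences: %d' % len(sequences))
```

The loop produces `len(tokens) - length` windows, which matches the count above (273026 - 51 = 272975). As written it stops one window short of the end of the text. Using `range(length, len(tokens) + 1)` would include that final window as well.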


We also create a function that writes the above list of strings to a separate file.

In [7]:
# save sequences to file, one sequence per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

Now we call the function:

In [8]:
# save sequences to file
save_doc(sequences, out_filename)

# Training The Model

Now that we've done the data preparation, we can load the dataset and get it ready for the model by integer encoding the inputs and one-hot encoding the outputs.

## Load the Sequences

First, we load the file!

In [9]:
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load
doc = load_doc("harry-potter.txt")
lines = doc.split('\n')
lines[0]

'harry potter and the sorcerers stone chapter one the boy who lived mr and mrs dursley of number four privet drive were proud to say that they were perfectly normal thank you very much they were the last people youd expect to be involved in anything strange or mysterious because they'

## Encode the Sequences

Before we can train the model on the data, we need to convert the words to integers. We create a Keras `Tokenizer`, fit it on the lines so that every unique word gets its own integer index, and then use it to turn each line into a list of integers. We store these encoded lines in `sequences`.

In [10]:
# integer encode sequences of words
tokenizer = Tokenizer(filters=string.punctuation)
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)
# save the tokenizer
dump(tokenizer, open('tokenizer.pkl', 'wb'))
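
To give a feel for what the integer encoding produces, here is a pure-Python approximation. It is not the `Tokenizer` implementation, only a sketch of the behaviour I rely on: words are ranked by frequency and numbered from 1. The two toy lines are made up.

```python
from collections import Counter

# toy stand-in for the lines loaded from harry-potter.txt
lines = ['the boy who lived', 'the boy who waited']

# count every word, then rank by frequency (most frequent first)
counts = Counter(word for line in lines for word in line.split())
ranked = sorted(counts, key=lambda w: -counts[w])   # stable: ties keep first-seen order

# indices start at 1; index 0 is left unused
word_index = {word: i + 1 for i, word in enumerate(ranked)}

# replace every word by its index
encoded = [[word_index[w] for w in line.split()] for line in lines]
print(word_index)
print(encoded)
```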

Now let's take a quick look at the vocabulary size of the model. This basically shows how many words the model has available to pick from when predicting the next word.

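We'll also store this vocabulary size in a variable `vocab_size` to use in defining our model later. The `+ 1` is there because word indices start at 1, so the largest index equals the number of words and the output layer needs one extra slot for the unused index 0.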

In [11]:
# vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)

12846


Now that we've encoded the data, we need to separate the dataset into input `X` and output `y` elements.

In [12]:
# separate into input and output
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
seq_length = X.shape[1]
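
Here is a small pure-Python sketch of the same split and one-hot step, using toy sequences of length 4 instead of 51. The `one_hot` helper is a hypothetical stand-in for `to_categorical`, written only for illustration.

```python
# toy encoded sequences; the real ones have 51 integers each
sequences = [[1, 2, 3, 4], [2, 3, 4, 5]]
vocab_size = 6      # five words plus the unused index 0

X = [row[:-1] for row in sequences]         # every word except the last
y_index = [row[-1] for row in sequences]    # the last word is the target

def one_hot(index, num_classes):
    # hypothetical helper mirroring what to_categorical does for one label
    vec = [0] * num_classes
    vec[index] = 1
    return vec

y = [one_hot(i, vocab_size) for i in y_index]
seq_length = len(X[0])
print(X, y, seq_length)
```

The target index 5 only fits because the vectors have 6 slots, which is why `vocab_size` includes the extra one. On the real data, `y` has 272975 rows of 12846 entries each, roughly 3.5 billion values, so this step needs a lot of memory.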

## Fit the Model

Now we get to the meat of things: defining the model that we are going to use.

In [13]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
from keras.layers import Dropout

# define model
model = Sequential()
model.add(Embedding(vocab_size, input_size, input_length=seq_length))
model.add(LSTM(100, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(100))
model.add(Dropout(0.2))
model.add(Dense(100, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile model
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 50, 50)            642300    
_________________________________________________________________
lstm_1 (LSTM)                (None, 50, 100)           60400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 50, 100)           0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dropout_2 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 100)               10100     
_________________________________________________________________
dropout_3 (Dropout)          (None, 100)               0         
__________
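
The parameter counts in the rows shown above can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers. The variable names are my own.

```python
vocab_size, embed_dim, units = 12846, 50, 100

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (input_dim + units + 1) * units

embedding = vocab_size * embed_dim          # one 50-dim vector per word index
lstm_1 = lstm_params(embed_dim, units)      # fed by the embedding output
lstm_2 = lstm_params(units, units)          # fed by the first LSTM
dense_1 = units * 100 + 100                 # weights plus biases

print(embedding, lstm_1, lstm_2, dense_1)
```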

Finally, the model is ready to be fit on the data for a number of epochs. Training takes a few hours on modern hardware without a GPU. You can speed it up by increasing the `batch_size` or decreasing the number of `epochs`.
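
To make the `batch_size` trade-off concrete, here is a rough count of weight updates per epoch for the 272975 sequences above. The helper name is my own, and whether a larger batch is actually faster in wall-clock time depends on your hardware.

```python
import math

n_sequences = 272975    # from the "Total Sequences" output above

def steps_per_epoch(n_samples, batch_size):
    # the last batch may be smaller, so round up
    return math.ceil(n_samples / batch_size)

steps_256 = steps_per_epoch(n_sequences, 256)
steps_512 = steps_per_epoch(n_sequences, 512)
print(steps_256, steps_512, steps_256 * 10)
```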

The tutorial I've been following simply trains the model and then saves it, but I like to save my models periodically. So before training, I define a callback that checkpoints the model every five epochs in case I get impatient. It uses the Keras `LambdaCallback` class.

In [14]:
# ===================================================================
# Define function to generate text from the model

def generate_text(words_to_generate=80):
    
    result = list()
    # select a seed text (randint includes both endpoints, hence the - 1)
    seed_text = lines[randint(0, len(lines) - 1)]
    
    for i in range(words_to_generate):
        # encode the seed text
        encoded = tokenizer.texts_to_sequences([seed_text])[0]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict the index of the most likely next word
        yhat = model.predict_classes(encoded, verbose=0)

        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break

        # append to input
        seed_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)



def on_epoch_end(epoch, _):
    if (epoch + 1) % 5 == 0:
        print("Checkpointing the model...")
        model.save("harry-potter-testing.h5")
        # save the tokenizer
        dump(tokenizer, open('tokenizer.pkl', 'wb'))
    #print("Generating Text...")
    #print(generate_text())
        
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)
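
Keras passes a zero-based epoch number to `on_epoch_end`, which is why the check uses `epoch + 1`. This small sketch lists the epochs, as displayed in the training log, at which the condition fires. The function name is hypothetical.

```python
def checkpoint_epochs(total_epochs, every=5):
    # epoch is zero-based inside the callback; epoch + 1 is what the log prints
    return [epoch + 1 for epoch in range(total_epochs) if (epoch + 1) % every == 0]

fired = checkpoint_epochs(10)
print(fired)
```

Because every checkpoint is written to the same file name, each save overwrites the previous one.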

In [15]:
# fit model
model.fit(X, y, batch_size=256, epochs=10, callbacks=[print_callback])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Checkpointing the model...
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


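Here, I save the final model and also write the tokenizer's vocabulary to a file as an array. The array is ordered by word index, and because the indices output by the model start at 1, the word with index `i` sits at array position `i - 1`.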

In [None]:
# save the model to file
model.save('model.h5')

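# write the vocabulary as an array of quoted words, ordered by word index
js = "[" + ",".join('"' + word + '"' for word in tokenizer.word_index) + "]"
# use a context manager so the file is flushed and closed
with open("tokenizer-array.txt", "w") as f:
    f.write(js)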
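
As a sketch of how such an array could be read back and used, the example below builds one from a toy vocabulary, serializes it with `json`, and looks up a word from a 1-based index. The toy `word_index` and the predicted index are assumptions for illustration.

```python
import json

# toy stand-in for tokenizer.word_index
word_index = {'the': 1, 'boy': 2, 'who': 3}

# order the words by index so that position i - 1 holds the word with index i
words = [w for w, _ in sorted(word_index.items(), key=lambda kv: kv[1])]
serialized = json.dumps(words)

# later, possibly in another program: parse the array and look up a prediction
restored = json.loads(serialized)
predicted_index = 2                      # hypothetical model output (1-based)
word = restored[predicted_index - 1]
print(serialized, word)
```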

# Using the Model!

## Load the Models and the Tokenizer

In [55]:
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load cleaned text sequences
in_filename = 'harry-potter.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

seq_length = len(lines[0].split()) - 1

In [56]:
from keras.models import load_model
from pickle import load

# load the model
model = load_model('model.h5')
# load the tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb'))

## Generate Text

Before generating text, we need a model and a tokenizer loaded, which we just did above. The generation algorithm is then as follows.

1. Randomly choose a line of text to be the seed text
2. Encode the words into numbers using the Tokenizer
3. Ensure that the encoded seed text is the right length
4. Have the model predict the class (word index) of the next word
5. Convert the number back into word format
6. Append the new word to the seed text
7. Repeat with the newly generated word in the seed text
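
The steps above can be sketched without Keras by swapping in stand-ins for the tokenizer and the model. Everything here is an assumption for illustration: the toy vocabulary, `encode`, `fit_length` (which imitates `pad_sequences` with pre-truncation and pre-padding), and `fake_predict`, which just returns the next index in a cycle instead of a real prediction.

```python
import random

word_index = {'the': 1, 'boy': 2, 'who': 3, 'lived': 4}   # toy vocabulary
index_word = {i: w for w, i in word_index.items()}         # reverse lookup, built once
seq_length = 3

def encode(text):
    # step 2: words -> integers (unknown words are skipped)
    return [word_index[w] for w in text.split() if w in word_index]

def fit_length(encoded, maxlen):
    # step 3: keep the last maxlen items, left-pad with zeros if too short
    encoded = encoded[-maxlen:]
    return [0] * (maxlen - len(encoded)) + encoded

def fake_predict(window):
    # step 4 stand-in: a trained model would return the most likely class index
    return window[-1] % len(word_index) + 1

def generate(seed_text, n_words):
    text, result = seed_text, []
    for _ in range(n_words):                       # step 7: repeat
        window = fit_length(encode(text), seq_length)
        word = index_word[fake_predict(window)]    # step 5: index -> word
        text += ' ' + word                         # step 6: extend the seed
        result.append(word)
    return ' '.join(result)

lines = ['the boy who lived', 'boy who lived the']
seed = lines[random.randint(0, len(lines) - 1)]    # step 1: random seed line
print(seed)
print(generate('the boy', 5))
```

Building the reverse dictionary once avoids scanning `word_index` for every generated word. Also, as far as I know, newer Keras releases no longer provide `predict_classes`, and taking the argmax of `model.predict` is the usual substitute.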


In [57]:
from random import randint
from keras.preprocessing.sequence import pad_sequences


def generate_text(words_to_generate=80):
    
    result = list()
    # select a seed text (randint includes both endpoints, hence the - 1)
    seed_text = lines[randint(0, len(lines) - 1)]
    
    for i in range(words_to_generate):
        # encode the seed text
        encoded = tokenizer.texts_to_sequences([seed_text])[0]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict the index of the most likely next word
        yhat = model.predict_classes(encoded, verbose=0)

        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break

        # append to input
        seed_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)

arms around the trolls neck from behind the troll couldnt feel harry hanging there but even a troll will notice if you stick a long bit of wood up its nose and harrys wand had still been in his hand when hed jumped it had gone straight up one of the



## Complete Code For Loading and Using

In [None]:
from random import randint
from pickle import load
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# generate a sequence from a language model
def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
    result = list()
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict the index of the most likely next word
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)

# load cleaned text sequences
in_filename = 'harry-potter.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')
seq_length = len(lines[0].split()) - 1

# load the model
model = load_model('model.h5')

# load the tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb'))

# select a seed text (randint includes both endpoints, hence the - 1)
seed_text = lines[randint(0, len(lines) - 1)]
print(seed_text + '\n')

# generate new text
generated = generate_seq(model, tokenizer, seq_length, seed_text, 50)
print(generated)