https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/

# Word-Level Text Generator in Keras

In this project, I develop a word-level text generator in Keras using LSTM layers and train it on a Harry Potter dataset.

In [None]:
import string
from numpy import array

import keras
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.callbacks import LambdaCallback

from pickle import dump

## Load the Document

The first step in creating the model is to load the corpus into memory.

In [1]:
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

With the loading function defined, we point it at the corpus we are using, which in this case is:

`../datasets/harry-potter-1.txt`

In [2]:
# load document
in_filename = '../datasets/harry-potter-1.txt'
doc = load_doc(in_filename)
print(doc[:200])

Harry Potter and the Sorcerer's Stone  CHAPTER ONE  THE BOY WHO LIVED  Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They 


## Clean the Document

Now that we've loaded the document into memory, we want to clean it. We lower-case the text, replace every "--" and "-" with a space so that hyphenated words split cleanly, strip the punctuation from each token, and drop any token that is not purely alphabetic (numbers, for example).

In [3]:
# turn a doc into clean tokens
def clean_doc(doc):
    # make lower case
    doc = doc.lower()
    # replace '--' and '-' with a space ' '
    doc = doc.replace('--', ' ')
    doc = doc.replace('-', ' ')
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    return tokens

We can then run the cleaning function on the document that we've stored in memory.

In [4]:
# clean document
tokens = clean_doc(doc)
print(tokens[:20])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))

['harry', 'potter', 'and', 'the', 'sorcerers', 'stone', 'chapter', 'one', 'the', 'boy', 'who', 'lived', 'mr', 'and', 'mrs', 'dursley', 'of', 'number', 'four', 'privet']
Total Tokens: 77888
Unique Tokens: 5904
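To see what each cleaning step keeps and discards, here is a small self-contained sketch that applies the same steps to a made-up sentence. The sample string and the name `clean_text` are mine and are not part of the notebook. One thing it illustrates: `string.punctuation` only covers ASCII punctuation, so a token containing a curly apostrophe is not stripped and is then dropped entirely by `isalpha()`.

```python
import string

def clean_text(text):
    # same steps as clean_doc: lower-case, split on hyphens, strip ASCII punctuation
    text = text.lower().replace('--', ' ').replace('-', ' ')
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in text.split()]
    # keep purely alphabetic tokens only
    return [w for w in tokens if w.isalpha()]

# illustrative input: contains a number and a curly apostrophe (U+2019)
sample = "Mr. Dursley's well-known No. 4 -- Harry\u2019s home!"
cleaned = clean_text(sample)
print(cleaned)
```

The number `4` and the curly-apostrophe token are both missing from the result, while `Dursley's` survives as `dursleys`.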


## Dataset Preparation

Now that we've tokenized our data (that is, separated the document into a list of words), we can organize the dataset into input and output words. In this case, each input is 50 words and the output is the word that follows them. In other words, each sequence has length 51, or `input_size + output_size`.

The resulting list, `sequences`, is a list of strings with exactly 51 words each.

In [5]:
input_size = 50
output_size = 1
# organize into sequences of tokens
length = input_size + output_size
sequences = list()
for i in range(length, len(tokens)):
    # select sequence of tokens
    seq = tokens[i-length:i]
    # convert into a line
    line = ' '.join(seq)
    # store
    sequences.append(line)
print('Total Sequences: %d' % len(sequences))

Total Sequences: 77837
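The sliding window is easier to follow on a toy list. The sketch below uses six made-up tokens and a window of 3 (2 input words plus 1 output word) with the same loop bounds as the cell above. With those bounds the number of windows is `len(tokens) - length`, which matches 77888 - 51 = 77837 here. It also shows that the very last window is only produced if the upper bound is extended by one.

```python
tokens = ['a', 'b', 'c', 'd', 'e', 'f']  # illustrative tokens
length = 3  # 2 input words + 1 output word

# loop bounds as used in the notebook
windows = [tokens[i - length:i] for i in range(length, len(tokens))]

# extending the upper bound by one also yields the final window
windows_all = [tokens[i - length:i] for i in range(length, len(tokens) + 1)]

print(len(windows), windows[-1])
print(len(windows_all), windows_all[-1])
```

Dropping one window out of tens of thousands makes no practical difference to training, so this is a detail rather than a problem.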


We also create a function that writes the above list of strings to a separate file, one sequence per line.

In [6]:
# save sequences to file, one sequence per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

Now we call the function to write the sequences to `harry-potter.txt`.

In [7]:
# save sequences to file
out_filename = 'harry-potter.txt'
save_doc(sequences, out_filename)

# Training the Model

Now that we've done the data preparation, we can load the dataset and prepare it for the model by integer-encoding the input words and one-hot encoding the output word.

## Load the Sequences

First, we load the file!

In [8]:
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load
in_filename = 'harry-potter.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

## Encode the Sequences

Before we can train the model on the data, we need to convert the words to integers. We create a `Tokenizer`, fit it on the lines to build the word-to-index mapping, and use it to turn each line into a list of word indices.

In [9]:
# integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)

Using TensorFlow backend.


Now let's take a quick look at the vocabulary size. We store it in a variable, `vocab_size`, adding 1 because the Tokenizer assigns word indices starting at 1 and index 0 is reserved.

In [10]:
# vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)

5905
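The `+ 1` makes more sense with a small example. The sketch below is a simplified stand-in for what the Tokenizer does, not its actual implementation: it ranks words by frequency, assigns indices starting at 1, and leaves index 0 unused. The two sample lines are made up.

```python
from collections import Counter

lines = ['the boy who lived', 'the boy lived']  # illustrative input

counts = Counter(word for line in lines for word in line.split())
# rank by frequency, most frequent first (ties keep first-seen order)
ranked = sorted(counts, key=lambda w: -counts[w])
# indices start at 1; index 0 is never assigned to a word
word_index = {word: i for i, word in enumerate(ranked, start=1)}

vocab_size = len(word_index) + 1  # +1 for the reserved index 0
encoded = [[word_index[w] for w in line.split()] for line in lines]

print(word_index)
print(vocab_size)
print(encoded)
```

The largest index equals the number of distinct words, so an output layer that can address every index needs one more slot than there are words. That is why 5904 unique tokens give a `vocab_size` of 5905.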


Now that we've encoded the data, we separate each sequence into input `X` (the first 50 word indices) and output `y` (the final word index), then one-hot encode `y` over the vocabulary.

In [11]:
# separate into input and output
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
seq_length = X.shape[1]
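Here is the same split on two tiny made-up sequences, written in plain Python so each step is visible. The `one_hot` helper is my own illustration of what `to_categorical` produces for a single index, not the Keras implementation.

```python
sequences = [[1, 2, 4, 3], [2, 4, 3, 1]]  # illustrative encoded sequences
vocab_size = 5

# everything but the last index is input; the last index is the target
X = [seq[:-1] for seq in sequences]
y_index = [seq[-1] for seq in sequences]

def one_hot(index, num_classes):
    # vector of zeros with a single 1 at position `index`
    vec = [0] * num_classes
    vec[index] = 1
    return vec

y = [one_hot(i, vocab_size) for i in y_index]

print(X)
print(y)

# size of the one-hot target for the real dataset
n_entries = 77837 * 5905
print(n_entries)
```

For the real dataset the one-hot target holds about 460 million entries, so it takes up far more memory than the integer inputs.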

## Fit the Model

Now we get to the meat of things: defining the model that we are going to use.

In [12]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding

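# define model
model = Sequential()
# the embedding dimension is set to 50, the same value as input_size
model.add(Embedding(vocab_size, input_size, input_length=seq_length))
# 100 units per layer, as in the printed summary
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))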
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 50, 50)            295250    
_________________________________________________________________
lstm_1 (LSTM)                (None, 50, 100)           60400     
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_1 (Dense)              (None, 100)               10100     
_________________________________________________________________
dense_2 (Dense)              (None, 5905)              596405    
Total params: 1,042,555
Trainable params: 1,042,555
Non-trainable params: 0
_________________________________________________________________
None
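The parameter counts in the summary can be reproduced by hand. The sketch below uses the sizes shown in the printed summary (an embedding dimension of 50 and 100 units per layer) together with the standard parameter formulas for embedding, LSTM, and dense layers. The helper names are mine.

```python
vocab_size, embed_dim, units = 5905, 50, 100  # sizes from the printed summary

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (input_dim + units + 1) * units

def dense_params(input_dim, units):
    # weight matrix plus one bias per unit
    return input_dim * units + units

counts = {
    'embedding_1': vocab_size * embed_dim,
    'lstm_1': lstm_params(embed_dim, units),
    'lstm_2': lstm_params(units, units),
    'dense_1': dense_params(units, units),
    'dense_2': dense_params(units, vocab_size),
}
total = sum(counts.values())

for name, n in counts.items():
    print(name, n)
print('total', total)
```

More than half of the parameters sit in the final softmax layer, because it needs one set of weights per vocabulary word.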


Finally, the model is ready to be fit on the data for a number of epochs. This takes a few hours on modern hardware without a GPU. You can speed up the training by increasing the `batch_size` or decreasing the number of `epochs`.

In [13]:
# compile model
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
# fit model
model.fit(X, y, batch_size=256, epochs=60)

Epoch 1/60
...
Epoch 60/60


<keras.callbacks.History at 0x12b897c18>
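The effect of `batch_size` on the amount of work per epoch can be estimated with a little arithmetic. The sketch below (the helper name `steps_per_epoch` is mine) computes how many weight updates one epoch takes for the 77,837 sequences at a few batch sizes. Fewer, larger batches usually mean less overhead per epoch, though the actual speed-up depends on the hardware.

```python
import math

n_sequences = 77837  # 'Total Sequences' printed earlier

def steps_per_epoch(n_samples, batch_size):
    # the last batch may be smaller, hence the ceiling
    return math.ceil(n_samples / batch_size)

steps = {b: steps_per_epoch(n_sequences, b) for b in (128, 256, 512)}
for batch_size, n_steps in steps.items():
    print(batch_size, n_steps)
```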

We then save the model and the tokenizer. This is where the tutorial linked above ends, but personally I prefer to save my progress during training rather than only at the very end, so I also add a checkpoint callback and continue training.

In [68]:
# save the model to file
model.save('model.h5')
# save the tokenizer (the with-block closes the file afterwards)
with open('tokenizer.pkl', 'wb') as f:
    dump(tokenizer, f)

In [76]:
def on_epoch_end(epoch, _):
    # epoch is zero-based, so this saves after every 5th epoch
    if (epoch + 1) % 5 == 0:
        print("Checkpointing the model...")
        model.save("harry-potter.h5")

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

In [77]:
# fit model
model.fit(X, y, batch_size=256, epochs=20, verbose=2, callbacks=[print_callback])

Epoch 1/20
 - 154s - loss: 3.2023 - acc: 0.3784
Epoch 2/20
 - 151s - loss: 3.1868 - acc: 0.3819
Epoch 3/20
 - 151s - loss: 3.1741 - acc: 0.3848
Epoch 4/20
 - 153s - loss: 3.1625 - acc: 0.3867
Epoch 5/20
 - 154s - loss: 3.1501 - acc: 0.3881
Checkpointing the model...
Epoch 6/20
 - 152s - loss: 3.1364 - acc: 0.3897
Epoch 7/20
 - 167s - loss: 3.1270 - acc: 0.3929
Epoch 8/20
 - 161s - loss: 3.1150 - acc: 0.3939
Epoch 9/20
 - 165s - loss: 3.1006 - acc: 0.3955
Epoch 10/20
 - 156s - loss: 3.0897 - acc: 0.3978
Checkpointing the model...
Epoch 11/20
 - 154s - loss: 3.0773 - acc: 0.3990
Epoch 12/20
 - 462s - loss: 3.0671 - acc: 0.4028
Epoch 13/20
 - 179s - loss: 3.0573 - acc: 0.4029
Epoch 14/20
 - 153s - loss: 3.0460 - acc: 0.4042
Epoch 15/20
 - 154s - loss: 3.0355 - acc: 0.4078
Checkpointing the model...
Epoch 16/20
 - 153s - loss: 3.0246 - acc: 0.4083
Epoch 17/20
 - 158s - loss: 3.0108 - acc: 0.4108
Epoch 18/20
 - 154s - loss: 3.0044 - acc: 0.4117
Epoch 19/20
 - 162s - loss: 2.9962 - acc: 0.41

<keras.callbacks.History at 0x12b948518>
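The checkpoint condition is worth a quick check, since Keras passes a zero-based epoch index to `on_epoch_end` while the log counts epochs from 1. The sketch below (the function name `checkpoint_epochs` is mine) lists the 1-based epochs at which the callback above would save during a 20-epoch run.

```python
def checkpoint_epochs(n_epochs, every=5):
    # `epoch` is zero-based, as in the on_epoch_end callback
    return [epoch + 1 for epoch in range(n_epochs) if (epoch + 1) % every == 0]

saved_at = checkpoint_epochs(20)
print(saved_at)
```

This agrees with the "Checkpointing the model..." lines after epochs 5, 10 and 15 in the log. Each checkpoint overwrites the same file, `harry-potter.h5`, so only the most recent one is kept.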

# Using the Model!

## Load the Models and the Tokenizer

In [55]:
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load cleaned text sequences
in_filename = 'harry-potter.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

seq_length = len(lines[0].split()) - 1

In [56]:
from keras.models import load_model
from pickle import load

# load the model
model = load_model('model.h5')
# load the tokenizer (the with-block closes the file afterwards)
with open('tokenizer.pkl', 'rb') as f:
    tokenizer = load(f)

## Generate Text

In [57]:
from random import randint

# select a seed text
# randint includes both endpoints, so the upper bound is len(lines) - 1
seed_text = lines[randint(0, len(lines) - 1)]
print(seed_text + '\n')

arms around the trolls neck from behind the troll couldnt feel harry hanging there but even a troll will notice if you stick a long bit of wood up its nose and harrys wand had still been in his hand when hed jumped it had gone straight up one of the



In [58]:
encoded = tokenizer.texts_to_sequences([seed_text])[0]
print(encoded)

[456, 80, 1, 1218, 574, 41, 150, 1, 371, 109, 530, 7, 800, 35, 22, 108, 4, 371, 136, 1132, 40, 12, 995, 4, 172, 181, 6, 228, 27, 44, 365, 2, 98, 195, 14, 130, 52, 10, 11, 178, 66, 83, 744, 9, 14, 235, 424, 27, 38, 6, 1]


In [62]:
from keras.preprocessing.sequence import pad_sequences

# truncate sequences to a fixed length
encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
# predict the index of the most likely next word
yhat = model.predict_classes(encoded, verbose=0)
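The seed line has 51 words but the model expects 50, which is why the sequence is truncated from the front (`truncating='pre'`): the most recent words are the ones that matter for predicting the next one. The plain-Python sketch below is a stand-in for that behaviour on a single sequence, assuming `pad_sequences`'s default of padding on the left with zeros. The function name is mine.

```python
def pad_or_truncate_pre(seq, maxlen, value=0):
    # keep the last `maxlen` items, then left-pad with `value` if too short
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

too_long = pad_or_truncate_pre([7, 8, 9, 10, 11], 3)
too_short = pad_or_truncate_pre([7, 8], 3)
print(too_long)
print(too_short)
```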

In [63]:
# map predicted word index to word
out_word = ''
for word, index in tokenizer.word_index.items():
    if index == yhat:
        out_word = word
        break
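The loop above scans the whole vocabulary for every predicted word. An alternative, sketched here on a made-up `word_index`, is to invert the dictionary once and then look up each prediction directly. The variable names are illustrative.

```python
word_index = {'the': 1, 'boy': 2, 'lived': 3, 'who': 4}  # illustrative mapping

# invert once: index -> word
index_word = {index: word for word, index in word_index.items()}

predicted = 3
out_word = index_word.get(predicted, '')
# index 0 is reserved and maps to no word
missing = index_word.get(0, '')

print(out_word)
```

With a few thousand words the difference is small, but the lookup version keeps the generation loop shorter.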

In [65]:

# collect the generated words in a list
result = list()
# append to input
seed_text += ' ' + out_word
result.append(out_word)

In [None]:
from random import randint
from pickle import load
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# generate a sequence from a language model
def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
    result = list()
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integers
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict the index of the most likely next word
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)

# load cleaned text sequences
in_filename = 'harry-potter.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')
seq_length = len(lines[0].split()) - 1

# load the model
model = load_model('model.h5')

# load the tokenizer (the with-block closes the file afterwards)
with open('tokenizer.pkl', 'rb') as f:
    tokenizer = load(f)

# select a seed text
# randint includes both endpoints, so the upper bound is len(lines) - 1
seed_text = lines[randint(0, len(lines) - 1)]
print(seed_text + '\n')

# generate new text
generated = generate_seq(model, tokenizer, seq_length, seed_text, 50)
print(generated)
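The structure of `generate_seq` can be tried without a trained model. In the sketch below, a hypothetical lookup table stands in for the network: it predicts the next word from the last word of the context only. Everything here (the table, `toy_next_word`, `generate_greedy`) is made up for illustration. The point is the loop itself: take the most recent `seq_length` words, predict one word, append it, and repeat.

```python
def generate_greedy(next_word, seq_length, seed_text, n_words):
    result = []
    in_text = seed_text
    for _ in range(n_words):
        # keep only the most recent seq_length words as context
        context = in_text.split()[-seq_length:]
        out_word = next_word(context)
        # the predicted word becomes part of the next context
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)

# hypothetical stand-in for the trained model: looks only at the last word
table = {'the': 'boy', 'boy': 'who', 'who': 'lived', 'lived': 'the'}

def toy_next_word(context):
    return table[context[-1]]

generated = generate_greedy(toy_next_word, 2, 'the boy', 5)
print(generated)
```

Because the most likely word is always chosen, the same seed always produces the same continuation, and the output can fall into a repeating cycle as it does here.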