# Develop a Word-Level Neural Language Model and Use It to Generate Text

A language model predicts the probability of the next word in a sequence, based on the words already observed in that sequence.

You will discover how to develop a statistical language model using deep learning in Python.

In this article, you will learn:
How to prepare text for developing a word-based language model.
How to design and fit a neural language model with a learned embedding and an LSTM hidden layer.
How to use the learned language model to generate new text with similar statistical properties as the source text.

In [2]:
# Load text into memory
def load_doc(filename):
    #open the file as read only
    file = open(filename, 'r')
    #read all text
    text = file.read()
    #close the file
    file.close()
    return text

In [8]:
# load document
in_filename = 'C:\\Users\\G560042\\OneDrive - General Mills\\Desktop\\Rohan\\#Extra Projects\\Deep Learning\\projects\\Word Generation - NLP\\republic_book_input.txt'
doc = load_doc(in_filename)
print(doc[:500])

INTRODUCTION AND ANALYSIS.
The Republic of Plato is the longest of his works with the exception of the Laws, and is certainly the greatest of them. There are nearer approaches to modern metaphysics in the Philebus and in the Sophist; the Politicus or Statesman is more ideal; the form and institutions of the State are more clearly drawn out in the Laws; as works of art, the Symposium and the Protagoras are of higher excellence. But no other Dialogue of Plato has the same largeness of view and the


In [9]:
#clean text

# Replace '--' with a white space so we can split words better.
# Split words based on white space.
# Remove all punctuation from words to reduce the vocabulary size (e.g. ‘What?’ becomes ‘What’).
# Remove all words that are not alphabetic to remove standalone punctuation tokens.
# Normalize all words to lowercase to reduce the vocabulary size.

import string

# turn a doc into clean tokens
def clean_doc(doc):
    doc = doc.replace('--', ' ')
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # make lower case
    tokens = [word.lower() for word in tokens]
    return tokens
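
To see what each cleaning step does, the sketch below applies the same operations one at a time to a short sample string. The sample sentence is made up for illustration and is not taken from the book.

```python
import string

# Made-up sample text (an assumption, not from the source document)
sample = "What? The State--as Plato saw it--has 3 classes."

# 1) replace '--' with a space so joined words split apart
text = sample.replace('--', ' ')
# 2) split on white space
tokens = text.split()
# 3) strip punctuation from every token
table = str.maketrans('', '', string.punctuation)
tokens = [w.translate(table) for w in tokens]
# 4) drop tokens that are not purely alphabetic (this removes '3')
tokens = [w for w in tokens if w.isalpha()]
# 5) lowercase everything
tokens = [w.lower() for w in tokens]

print(tokens)
# ['what', 'the', 'state', 'as', 'plato', 'saw', 'it', 'has', 'classes']
```

Note that the number is dropped entirely rather than kept as a token, which is a side effect of the `isalpha()` filter.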

In [10]:
# clean document
tokens = clean_doc(doc)
print(tokens[:200])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d ' % len(set(tokens)))

['introduction', 'and', 'analysis', 'the', 'republic', 'of', 'plato', 'is', 'the', 'longest', 'of', 'his', 'works', 'with', 'the', 'exception', 'of', 'the', 'laws', 'and', 'is', 'certainly', 'the', 'greatest', 'of', 'them', 'there', 'are', 'nearer', 'approaches', 'to', 'modern', 'metaphysics', 'in', 'the', 'philebus', 'and', 'in', 'the', 'sophist', 'the', 'politicus', 'or', 'statesman', 'is', 'more', 'ideal', 'the', 'form', 'and', 'institutions', 'of', 'the', 'state', 'are', 'more', 'clearly', 'drawn', 'out', 'in', 'the', 'laws', 'as', 'works', 'of', 'art', 'the', 'symposium', 'and', 'the', 'protagoras', 'are', 'of', 'higher', 'excellence', 'but', 'no', 'other', 'dialogue', 'of', 'plato', 'has', 'the', 'same', 'largeness', 'of', 'view', 'and', 'the', 'same', 'perfection', 'of', 'style', 'no', 'other', 'shows', 'an', 'equal', 'knowledge', 'of', 'the', 'world', 'or', 'contains', 'more', 'of', 'those', 'thoughts', 'which', 'are', 'new', 'as', 'well', 'as', 'old', 'and', 'not', 'of', 'one'

In [11]:
# save clean text 

# We can organize the long list of tokens into sequences of 50 input words and 1 output word.
# That is, sequences of 51 words.
# We can do this by iterating over the list of tokens from token 51 onwards and 
# taking the prior 50 tokens as a sequence, then repeating this process to the end of the list of tokens.
# We will transform the tokens into space-separated strings for later storage in a file.
# The code to split the list of clean tokens into sequences with a length of 51 tokens is listed below.

# organize into sequence of tokens
length = 50 + 1
sequences = list()
for i in range(length, len(tokens)):
    # select sequence of tokens
    seq = tokens[i-length:i]
    # convert into line
    line = ' '.join(seq)
    # store
    sequences.append(line)
print('Total Sequence: %d' % len(sequences))

Total Sequence: 214919
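
The sliding window is easier to follow on a toy example. This sketch uses the same loop with 3 input words and 1 output word instead of 50 and 1; the names `toy_tokens`, `toy_length` and `toy_sequences` are hypothetical and exist only for this illustration.

```python
# Toy token list (made up for illustration); the article uses 50 + 1 instead of 3 + 1.
toy_tokens = ['the', 'republic', 'of', 'plato', 'is', 'the', 'longest']
toy_length = 3 + 1

toy_sequences = []
for i in range(toy_length, len(toy_tokens)):
    # take the toy_length tokens that end just before position i
    window = toy_tokens[i - toy_length:i]
    toy_sequences.append(' '.join(window))

for line in toy_sequences:
    print(line)
# the republic of plato
# republic of plato is
# of plato is the
```

Each line shares all but one word with its neighbour, and the loop as written yields `len(tokens) - length` sequences (here 7 - 4 = 3).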


In [12]:
# Next, we can save the sequences to a new file for later loading.
# We can define a new function for saving lines of text to a file. 
# This new function is called save_doc() and is listed below. It takes as input a list of lines and a filename

# save tokens to file
def save_doc(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()
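
A quick way to check that saving and loading are inverse operations is a round trip through a temporary file. The sketch below is a minimal variant that uses `with` blocks, which close the file automatically; the file name `demo_sequences.txt` and the two sample lines are assumptions for illustration.

```python
import os
import tempfile

demo_lines = ['the republic of plato', 'republic of plato is']

# Write to a throwaway temporary file, not the article's republic_sequences.txt.
path = os.path.join(tempfile.mkdtemp(), 'demo_sequences.txt')
with open(path, 'w') as f:
    f.write('\n'.join(demo_lines))

# Read the file back and split on new lines, as the loading step does later.
with open(path, 'r') as f:
    loaded = f.read().split('\n')

print(loaded == demo_lines)
# True
```

Because the lines are joined without a trailing newline, splitting on `'\n'` returns exactly the original list with no empty string at the end.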

In [15]:
# save sequences to file
out_filename = 'C:\\Users\\G560042\\OneDrive - General Mills\\Desktop\\Rohan\\#Extra Projects\\Deep Learning\\projects\\Word Generation - NLP\\republic_sequences.txt'
save_doc(sequences, out_filename)

# Train the model

The model we will train is a neural language model. It has a few unique characteristics:

It uses a distributed representation for words so that different words with similar meanings will have a similar representation.
It learns the representation at the same time as learning the model.
It learns to predict the probability for the next word using the context of the last 50 words.
Specifically, we will use an Embedding Layer to learn the representation of words, and a Long Short-Term Memory (LSTM) recurrent neural network to learn to predict words based on their context.

In [17]:
# Load sequences

# We can load our training data using the load_doc() function we developed in the previous section.
# Once loaded, we can split the data into separate training sequences by splitting based on new lines.

# load doc(republic_sequences.txt) into memory
in_filename = 'C:\\Users\\G560042\\OneDrive - General Mills\\Desktop\\Rohan\\#Extra Projects\\Deep Learning\\projects\\Word Generation - NLP\\republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

In [20]:
print(lines[:5])

['introduction and analysis the republic of plato is the longest of his works with the exception of the laws and is certainly the greatest of them there are nearer approaches to modern metaphysics in the philebus and in the sophist the politicus or statesman is more ideal the form and institutions', 'and analysis the republic of plato is the longest of his works with the exception of the laws and is certainly the greatest of them there are nearer approaches to modern metaphysics in the philebus and in the sophist the politicus or statesman is more ideal the form and institutions of', 'analysis the republic of plato is the longest of his works with the exception of the laws and is certainly the greatest of them there are nearer approaches to modern metaphysics in the philebus and in the sophist the politicus or statesman is more ideal the form and institutions of the', 'the republic of plato is the longest of his works with the exception of the laws and is certainly the greatest of them

# Encode Sequences

In [22]:
# The word embedding layer expects input sequences to be comprised of integers.
# We can map each word in our vocabulary to a unique integer and encode our input sequences. 
# Later, when we make predictions, we can convert the prediction to numbers and look up their 
# associated words in the same mapping.
# To do this encoding, we will use the Tokenizer class in the Keras API.
# First, the Tokenizer must be trained on the entire training dataset, which means it finds all of the 
# unique words in the data and assigns each a unique integer.
# We can then use the fit Tokenizer to encode all of the training sequences, 
# converting each sequence from a list of words to a list of integers.

In [29]:
from keras.preprocessing.text import Tokenizer

# integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)
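
Conceptually, fitting the tokenizer builds a word-to-integer dictionary and encoding applies it. The following is a simplified pure-Python stand-in, not the Keras implementation: it ignores Tokenizer options such as filters, lowercasing and out-of-vocabulary handling, and the two sample lines are made up. Like the Keras Tokenizer, it starts at index 1 and gives lower indices to more frequent words.

```python
from collections import Counter

# Made-up training lines (an assumption for illustration)
demo_lines = ['the republic of plato', 'the laws of plato']

# "fit": count every word, then number words from 1 in order of frequency
counts = Counter(word for line in demo_lines for word in line.split())
word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

# "encode": replace every word by its integer
encoded = [[word_index[w] for w in line.split()] for line in demo_lines]

print(word_index)
print(encoded)
# [[1, 4, 2, 3], [1, 5, 2, 3]]
```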

In [30]:
# We can access the mapping of words to integers as a dictionary attribute called word_index on the Tokenizer object.
# We need to know the size of the vocabulary for defining the embedding layer later. 
# We can determine the vocabulary by calculating the size of the mapping dictionary.
# Words are assigned values from 1 to the total number of words (e.g. 10,409). 
# The Embedding layer needs to allocate a vector representation for each word in this vocabulary 
# from index 1 to the largest index. Because array indexing is zero-offset, the index of the 
# word at the end of the vocabulary will be 10,409; that means the array must be 10,409 + 1 in length.
# Therefore, when specifying the vocabulary size to the Embedding layer, we specify it as 1 larger than
# the actual vocabulary.

# vocabulary size
vocab_size = len(tokenizer.word_index) + 1
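
The off-by-one can be checked with a plain list standing in for the embedding table. This is only a sketch: the three-word mapping and the tiny `embedding_dim` are hypothetical, and a real Embedding layer stores learned weights rather than zeros.

```python
# Hypothetical three-word mapping, shaped like tokenizer.word_index.
word_index = {'the': 1, 'of': 2, 'plato': 3}
embedding_dim = 2  # tiny, for illustration only

# One row per index 0..3; row 0 is never assigned to a word.
vocab_size = len(word_index) + 1
table = [[0.0] * embedding_dim for _ in range(vocab_size)]

largest_index = max(word_index.values())
row = table[largest_index]  # valid, because the table has rows 0..3

# A table with only len(word_index) rows cannot hold the largest index.
too_small = [[0.0] * embedding_dim for _ in range(len(word_index))]
try:
    too_small[largest_index]
    fits = True
except IndexError:
    fits = False

print(len(table), fits)
# 4 False
```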

# Sequence Inputs and Output

In [36]:
# Now that we have encoded the input sequences, we need to separate them into input (X) and output (y) elements.
# We can do this with array slicing.
# After separating, we need to one hot encode the output word. 
# This means converting it from an integer to a vector of 0 values, one for each 
# word in the vocabulary, with a 1 at the index of the word's integer value.
# This is so that the model learns to predict the probability distribution for the next word; the 
# ground truth from which to learn is 0 for all words except the actual word that comes next.
# Keras provides the to_categorical() function that can be used to one hot encode the output words for 
# each input-output sequence pair.

# Finally, we need to specify to the Embedding layer how long input sequences are. 
# We know that there are 50 words because we designed the model, but a good generic way to specify that is to use the second dimension 
# (number of columns) of the input data’s shape. 
# That way, if you change the length of sequences when preparing data, 
# you do not need to change this data loading code; it is generic.


from numpy import array
from pickle import dump
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding

# separate into input and output
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
seq_length = X.shape[1]
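
The split and the one hot encoding can be traced by hand on a few short sequences. The sketch below uses plain Python lists in place of the NumPy slicing, and a small helper `one_hot` as a stand-in for `to_categorical()` (which returns floating-point arrays rather than lists of integers). The sample sequences and the vocabulary size of 6 are made up.

```python
# Made-up encoded sequences: 3 input words + 1 output word each
demo_sequences = [[1, 4, 2, 3], [1, 5, 2, 3], [4, 2, 3, 1]]
vocab_size = 6  # five hypothetical words + 1 for the unused index 0

# Everything except the last column is input, the last column is output.
X = [seq[:-1] for seq in demo_sequences]
y_int = [seq[-1] for seq in demo_sequences]

def one_hot(index, num_classes):
    """Return a list of zeros with a single 1 at position `index`."""
    vec = [0] * num_classes
    vec[index] = 1
    return vec

y = [one_hot(i, vocab_size) for i in y_int]
seq_length = len(X[0])

print(X)           # [[1, 4, 2], [1, 5, 2], [4, 2, 3]]
print(y[0])        # [0, 0, 0, 1, 0, 0]
print(seq_length)  # 3
```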

# Fit Model

In [37]:
# define model
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=seq_length))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 50, 50)            520500    
_________________________________________________________________
lstm_1 (LSTM)                (None, 50, 100)           60400     
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_1 (Dense)              (None, 100)               10100     
_________________________________________________________________
dense_2 (Dense)              (None, 10410)             1051410   
Total params: 1,722,810
Trainable params: 1,722,810
Non-trainable params: 0
_________________________________________________________________
None
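
The parameter counts in the summary can be reproduced by hand, which is a useful check on how the layer sizes interact. The sketch below uses the standard counting formulas for embedding, LSTM and dense layers; the helper names `lstm_params` and `dense_params` are hypothetical.

```python
vocab_size = 10410   # from the dense_2 output shape in the summary
embed_dim = 50
lstm_units = 100
dense_units = 100

def lstm_params(input_dim, units):
    # four weight sets (three gates plus the candidate cell state),
    # each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per unit
    return input_dim * units + units

embedding = vocab_size * embed_dim
lstm_1 = lstm_params(embed_dim, lstm_units)
lstm_2 = lstm_params(lstm_units, lstm_units)
dense_1 = dense_params(lstm_units, dense_units)
dense_2 = dense_params(dense_units, vocab_size)
total = embedding + lstm_1 + lstm_2 + dense_1 + dense_2

print(embedding, lstm_1, lstm_2, dense_1, dense_2)
# 520500 60400 80400 10100 1051410
print(total)
# 1722810
```

Most of the parameters sit in the embedding and in the final softmax layer, both of which scale with the vocabulary size.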


In [40]:
# Due to time and resource constraints on my laptop, the number of epochs was kept at only 2, so accuracy is low.
# There are many techniques you can try to boost the accuracy of this model.

# Try tuning the following parameters:

# 1) learning rate - alpha
# 2) momentum - beta
# 3) optimizer - beta1, beta2, epsilon
# 4) increase # of layers
# 5) increase # of hidden nodes
# 6) learning rate decay
# 7) mini-batch size
# 8) increase number of epochs

# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
model.fit(X, y, batch_size=128, epochs=2)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x1f1073b1898>
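
To get a feel for how the mini-batch size and the number of epochs affect the amount of training, the rough arithmetic below counts weight updates. It assumes all 214,919 sequences are used for training, as in the `model.fit(X, y, ...)` call above; the alternative batch sizes 32 and 512 are arbitrary examples.

```python
import math

n_sequences = 214919   # from the "Total Sequence" output
epochs = 2

for batch_size in (32, 128, 512):
    # one weight update per mini-batch; the last batch may be smaller
    steps = math.ceil(n_sequences / batch_size)
    print(batch_size, steps, steps * epochs)
# 32 6717 13434
# 128 1680 3360
# 512 420 840
```

With `batch_size=128` and 2 epochs the model receives 3,360 updates in total, so adding epochs or reducing the batch size are both ways to give it more updates, at the cost of longer training.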

# Save model

In [43]:
# save the model to file
model.save('C:\\Users\\G560042\\OneDrive - General Mills\\Desktop\\Rohan\\#Extra Projects\\Deep Learning\\projects\\Word Generation - NLP\\model.h5')
# save the tokenizer
dump(tokenizer, open('C:\\Users\\G560042\\OneDrive - General Mills\\Desktop\\Rohan\\#Extra Projects\\Deep Learning\\projects\\Word Generation - NLP\\tokenizer.pkl', 'wb'))

# Generate Text

In [55]:
from random import randint
from pickle import load
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# generate a sequence from a language model
def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
    result = list()
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict the index of the most probable next word
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)

# load cleaned text sequences
in_filename = 'C:\\Users\\G560042\\OneDrive - General Mills\\Desktop\\Rohan\\#Extra Projects\\Deep Learning\\projects\\Word Generation - NLP\\republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')
seq_length = len(lines[0].split()) - 1

# load the model
model = load_model('C:\\Users\\G560042\\OneDrive - General Mills\\Desktop\\Rohan\\#Extra Projects\\Deep Learning\\projects\\Word Generation - NLP\\model.h5')

# load the tokenizer
tokenizer = load(open('C:\\Users\\G560042\\OneDrive - General Mills\\Desktop\\Rohan\\#Extra Projects\\Deep Learning\\projects\\Word Generation - NLP\\tokenizer.pkl', 'rb'))

# select a seed text
seed_text = lines[randint(0, len(lines) - 1)]
print(seed_text + '\n')

# generate new text
generated = generate_seq(model, tokenizer, seq_length, seed_text, 5)
print(generated)

be to compel the best minds to attain that knowledge which we have already shown to be the greatest of must continue to ascend until they arrive at the good but when they have ascended and seen enough we must not allow them to do as they do now what do

that the same and the
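
One generation step can be traced without a trained model. The sketch below uses pure-Python stand-ins: `encode` mimics `texts_to_sequences`, `fit_length` mimics `pad_sequences` with `truncating='pre'`, and a made-up probability vector stands in for one row of `model.predict(...)`. All names and values here are hypothetical. It also builds a reverse dictionary `index_word` once, as an alternative to scanning `word_index` for every generated word, and picks the word with an argmax, which is a common replacement where `predict_classes` is not available in newer Keras releases.

```python
# Hypothetical mapping and sizes, for illustration only.
word_index = {'the': 1, 'of': 2, 'plato': 3, 'republic': 4}
index_word = {index: word for word, index in word_index.items()}
seq_length = 3

def encode(text):
    # stand-in for tokenizer.texts_to_sequences([text])[0]; unknown words are skipped
    return [word_index[w] for w in text.split() if w in word_index]

def fit_length(encoded, maxlen):
    # stand-in for pad_sequences(..., truncating='pre'):
    # keep the last maxlen ids and left-pad with 0 if the input is shorter
    kept = encoded[-maxlen:]
    return [0] * (maxlen - len(kept)) + kept

def argmax(probs):
    # index of the largest probability
    return max(range(len(probs)), key=probs.__getitem__)

window = fit_length(encode('the republic of plato'), seq_length)
short = fit_length(encode('of plato'), seq_length)

# Made-up probabilities over indices 0..4 (not produced by the trained model)
probs = [0.05, 0.60, 0.20, 0.10, 0.05]
next_word = index_word.get(argmax(probs), '')

print(window)     # [4, 2, 3]
print(short)      # [0, 2, 3]
print(next_word)  # the
```

Always taking the argmax is one reason generated text can fall into repeating frequent words such as "the" and "and", especially after only two epochs of training.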
