# Develop a Neural Language Model for Text Generation

A language model predicts the probability of the next word in a sequence from the words already observed in that sequence. Given such a model, we can build a system that generates text one word at a time.

Neural network models are preferred when developing statistical language models for the following reasons:
1. They can use a distributed representation in which different words with similar meanings have similar representations.
2. They can use a large context of recently observed words when making predictions.
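
To make the first point concrete, here is a minimal sketch that compares word vectors with cosine similarity. The three vectors are made-up values chosen purely for illustration; they are not learned embeddings, and `cosine_similarity` is a helper name introduced here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings (illustrative values only)
vectors = {
    'justice': [0.9, 0.8, 0.1],
    'virtue': [0.8, 0.9, 0.2],
    'horse': [0.1, 0.2, 0.9],
}

sim_close = cosine_similarity(vectors['justice'], vectors['virtue'])
sim_far = cosine_similarity(vectors['justice'], vectors['horse'])
print(sim_close > sim_far)  # related words sit closer together
```

In a trained model the embedding layer learns such vectors from data, so words used in similar contexts tend to end up near each other.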

#### The Data

The Republic is the classical Greek philosopher Plato's most famous work.

It is structured as a dialog, i.e., a conversation, on the topic of order and justice within a city-state.

In [1]:
# Download the book as ASCII text
!wget -O Data/republic.txt http://www.gutenberg.org/cache/epub/1497/pg1497.txt

--2020-04-14 08:49:02--  http://www.gutenberg.org/cache/epub/1497/pg1497.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1239081 (1.2M) [text/plain]
Saving to: ‘Data/republic.txt’


2020-04-14 08:49:12 (148 KB/s) - ‘Data/republic.txt’ saved [1239081/1239081]



In [5]:
# load the book
filename = 'Data/republic.txt'
file = open(filename, 'rt', encoding='utf-8')
text = file.read()
file.close()

In [4]:
print(text)

﻿The Project Gutenberg EBook of The Republic, by Plato

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: The Republic

Author: Plato

Translator: B. Jowett

Posting Date: August 27, 2008 [EBook #1497]
Release Date: October, 1998
Last Updated: June 22, 2016

Language: English


*** START OF THIS PROJECT GUTENBERG EBOOK THE REPUBLIC ***




Produced by Sue Asscher





THE REPUBLIC

By Plato


Translated by Benjamin Jowett


Note: The Republic by Plato, Jowett, etext #150




INTRODUCTION AND ANALYSIS.

The Republic of Plato is the longest of his works with the exception
of the Laws, and is certainly the greatest of them. There are nearer
approaches to modern metaphysics in the Philebus and in the Sophist; the
Politicus or Statesman is more ideal; the form and institutions of
the St

That is quite a lot of text to read. After going through it, however, a few things stand out:

1. There is multi-line header and footer information.
2. The text is split into chapters with chapter numbers.
3. Some dialog is in quotes.
4. There is a lot of punctuation.

There are more, but I'll list just a few. Explore the text and look for others yourself. These observations should inform how we prepare the text. The specific way we clean the data depends on how we intend to model it and, in turn, how we intend to use it.

With this in mind, let's define the language model design.

#### Language Model Design

The language model will be statistical and will predict the probability of each word given an input sequence of text.

The predicted word is then fed back in as input to generate the next word.

This being the case, a key design decision will be how long the input sequences should be. They need to be long enough to allow the model to learn the context for the words to predict.

This input length will also define the length of seed text used to generate new sequences when we use the model.

We can try out different input lengths. There is no right or wrong input length; the best choice depends on how well the model learns from the given sequences.

For this script, we'll pick a length of 20 words for the input sequences. 

We could process the data so that the model only ever deals with self-contained sequences and pad or truncate the text to meet this requirement for each input sequence.

To keep this example brief, we will allow the text to flow together and train the model to predict the next word across sentences, paragraphs, and even books or chapters in the text.
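
The predict-then-feed-back loop described in this section can be sketched with a toy model. This is not the neural model built below: it is a simple count-based stand-in that looks at only one previous word instead of 20, and the function names `train_bigram` and `generate` are introduced here for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, seed, num_words):
    """Greedy generation: repeatedly append the most frequent follower."""
    words = list(seed)
    for _ in range(num_words):
        followers = counts.get(words[-1])
        if not followers:
            break  # no known continuation for this word
        # the predicted word becomes part of the input for the next step
        words.append(followers.most_common(1)[0][0])
    return words

tokens = 'the son of ariston and the son of cephalus and the son of ariston'.split()
counts = train_bigram(tokens)
print(' '.join(generate(counts, ['the'], 4)))
```

The neural model follows the same loop; the difference is that its prediction is conditioned on a window of 20 words rather than a single word.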

In [10]:
import re, string

In [11]:
# Create a function to load our document
def load_document(filename=None):
    """
    params:
    filename: The path to a text document.
    
    return:
    doc: A loaded text document
    """
    if filename is None:
        return 'Please provide a valid text document...'
    print('Loading text...')
    # the context manager closes the file even if reading fails
    with open(filename, 'rt', encoding='utf-8') as text:
        doc = text.read()
    print('Text loaded... Use your assigned variable to view text.')
    return doc

We will then create a function to clean our data using the following steps:

1. Replace the double hyphen '--' with whitespace.
2. Split into tokens by whitespace.
3. Remove punctuation from each token.
4. Remove all tokens that are not alphabetic.
5. Convert all tokens to lower case.

You can add more steps based on your findings.

In [12]:
# Cleaning function
def clean_document(document):
    """
    params: 
    document: A loaded text document to clean.
    
    return:
    document: A cleaned document.
    """
    print('Cleaning text document...')
    document = document.replace('--', ' ') # replace '--' with whitespace
    words = document.split() # split into tokens
    punctuations = re.compile('[%s]'%re.escape(string.punctuation)) # compile all punctuations using regex
    words = [punctuations.sub('', word) for word in words] # Substitute punctuations from text.
    words = [word for word in words if word.isalpha()] # keep only alphabetic tokens.
    document = [word.lower() for word in words] # Normalize words to same case
    print('Done cleaning text.')
    return document
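
To see what these steps do to a single line of text, here is a compact, self-contained sketch of the same cleaning logic. The name `clean_tokens` and the sample strings are introduced here for illustration.

```python
import re
import string

def clean_tokens(text):
    """Compact version of the cleaning steps: returns lower-case alphabetic tokens."""
    text = text.replace('--', ' ')
    strip_punct = re.compile('[%s]' % re.escape(string.punctuation))
    words = [strip_punct.sub('', word) for word in text.split()]
    return [word.lower() for word in words if word.isalpha()]

sample = 'He said--"Justice, then, is 2 things!"'
print(clean_tokens(sample))

# Punctuation is removed inside tokens too, so some tokens get merged
print(clean_tokens('re-use www.gutenberg.org'))
```

The second call shows a side effect worth knowing about: hyphenated words and URLs collapse into single tokens such as 'reuse' and 'wwwgutenbergorg'.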

In [114]:
# initialize
filename = 'Data/republic.txt'
document = load_document(filename)
document = clean_document(document)

Loading text...
Text loaded... Use your assigned variable to view text.
Cleaning text document...
Done cleaning text.


In [14]:
# print documents
print(document[:200])

['project', 'gutenberg', 'ebook', 'of', 'the', 'republic', 'by', 'plato', 'this', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', 'you', 'may', 'copy', 'it', 'give', 'it', 'away', 'or', 'reuse', 'it', 'under', 'the', 'terms', 'of', 'the', 'project', 'gutenberg', 'license', 'included', 'with', 'this', 'ebook', 'or', 'online', 'at', 'wwwgutenbergorg', 'title', 'the', 'republic', 'author', 'plato', 'translator', 'b', 'jowett', 'posting', 'date', 'august', 'ebook', 'release', 'date', 'october', 'last', 'updated', 'june', 'language', 'english', 'start', 'of', 'this', 'project', 'gutenberg', 'ebook', 'the', 'republic', 'produced', 'by', 'sue', 'asscher', 'the', 'republic', 'by', 'plato', 'translated', 'by', 'benjamin', 'jowett', 'note', 'the', 'republic', 'by', 'plato', 'jowett', 'etext', 'introduction', 'and', 'analysis', 'the', 'republic', 'of', 'plato', 'is', 'the', 'longest', 'of', 'his', 'wo

In [16]:
# print info
print('Total tokens: %d' % len(document))
print('Unique tokens: %d'% len(set(document)))

Total tokens: 219633
Unique tokens: 10649


This token count includes the header and footer information we don't want, so we'll drop all header and footer information before cleaning.

We can do this by slicing the document on import, or by manually removing the irrelevant text from the file. I'll do it manually because it is faster; however, here is a little starter code if you prefer to slice (simple string slicing by character offsets):

In [None]:
# slice document
doc = load_document(filename)
doc = doc[688:1195638]
print(doc)
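
Hard-coded offsets break if the file changes, so an alternative is to cut on the marker lines. The sketch below is one possible approach, not what this notebook does: it assumes the file contains a line starting with `*** START OF` (visible in the printed text above) and a matching line starting with `*** END OF` (an assumption, since the end of the file is not shown here). The function name is hypothetical.

```python
import re

def strip_gutenberg_boilerplate(text):
    """Return the text between the '*** START OF' and '*** END OF' marker lines.

    Assumption: the file contains marker lines that begin with these prefixes.
    If either marker is missing, the text is returned unchanged.
    """
    start = re.search(r'^\*\*\* ?START OF.*$', text, flags=re.MULTILINE)
    end = re.search(r'^\*\*\* ?END OF.*$', text, flags=re.MULTILINE)
    if start is None or end is None or end.start() < start.end():
        return text
    return text[start.end():end.start()].strip()

sample = (
    'Title: The Republic\n'
    '*** START OF THIS PROJECT GUTENBERG EBOOK THE REPUBLIC ***\n'
    'BOOK I.\n'
    'I went down yesterday to the Piraeus\n'
    '*** END OF THIS PROJECT GUTENBERG EBOOK THE REPUBLIC ***\n'
    'License text follows\n'
)
print(strip_gutenberg_boilerplate(sample))
```

This only removes the license header and footer. Anything between the markers, such as the translator's introduction, would still need to be removed separately.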

Load the reduced text after manual cleaning.

In [115]:
document = load_document(filename)
cleaned = clean_document(document)
print('Total tokens: %d'%len(cleaned))
print('Unique tokens: %d'%len(set(cleaned)))

Loading text...
Text loaded... Use your assigned variable to view text.
Cleaning text document...
Done cleaning text.
Total tokens: 118684
Unique tokens: 7409


In [86]:
print(cleaned[:200])

['book', 'i', 'i', 'went', 'down', 'yesterday', 'to', 'the', 'piraeus', 'with', 'glaucon', 'the', 'son', 'of', 'ariston', 'that', 'i', 'might', 'offer', 'up', 'my', 'prayers', 'to', 'the', 'goddess', 'bendis', 'the', 'thracian', 'artemis', 'and', 'also', 'because', 'i', 'wanted', 'to', 'see', 'in', 'what', 'manner', 'they', 'would', 'celebrate', 'the', 'festival', 'which', 'was', 'a', 'new', 'thing', 'i', 'was', 'delighted', 'with', 'the', 'procession', 'of', 'the', 'inhabitants', 'but', 'that', 'of', 'the', 'thracians', 'was', 'equally', 'if', 'not', 'more', 'beautiful', 'when', 'we', 'had', 'finished', 'our', 'prayers', 'and', 'viewed', 'the', 'spectacle', 'we', 'turned', 'in', 'the', 'direction', 'of', 'the', 'city', 'and', 'at', 'that', 'instant', 'polemarchus', 'the', 'son', 'of', 'cephalus', 'chanced', 'to', 'catch', 'sight', 'of', 'us', 'from', 'a', 'distance', 'as', 'we', 'were', 'starting', 'on', 'our', 'way', 'home', 'and', 'told', 'his', 'servant', 'to', 'run', 'and', 'bid',

Looks about right after cleaning. Let's save a copy of the cleaned text. But before we do, we will organize the tokens into sequences of 20 input words and 1 output word, for a total sequence length of 21.

To do this, we'll iterate over the list of tokens from position 21 onwards and take the prior 21 tokens (20 inputs plus 1 output) as a sequence, repeating to the end of the list. We then store each sequence as a space-separated string for later.

In [116]:
# push into sequences of tokens
seq_length = 20 + 1
word_sequences = list() # create a sequence list
for i in range(seq_length, len(cleaned)):
    sequence = cleaned[i-seq_length:i] # slice the tokens
    doc_line = ' '.join(sequence) # convert into a line
    word_sequences.append(doc_line) # append each document line to the sequence list
print('Total sequences: %d'% len(word_sequences))

Total sequences: 118663
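
The sliding window is easier to follow on a short list. The sketch below uses a window of 4 tokens instead of 21; `make_sequences` is a helper name introduced here.

```python
def make_sequences(tokens, seq_length):
    """Return every run of seq_length consecutive tokens as a space-separated string."""
    return [
        ' '.join(tokens[i - seq_length:i])
        for i in range(seq_length, len(tokens) + 1)
    ]

tokens = 'i went down yesterday to the piraeus'.split()
for line in make_sequences(tokens, seq_length=4):
    print(line)
```

Note that this sketch runs the range up to `len(tokens) + 1`, so the window ending on the final token is included. The loop in the cell above stops at `len(cleaned)`, which leaves that last window out; that appears to be why it reports 118663 sequences rather than 118684 - 21 + 1 = 118664. One missing sequence makes no practical difference to training.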


In [94]:
# Save the sequences to file, one sequence per line.
def save_list(sequence_list, filename):
    """
    params:
    sequence_list: a cleaned sequence list.
    filename: filename to save to
    
    """
    documents = '\n'.join(sequence_list)
    with open(filename, 'w', encoding='utf-8') as file:
        file.write(documents)
    print('Done saving to %s'%filename)

In [95]:
# Save file
filename = 'Data/republic_cleaned_seq.txt'
save_list(word_sequences, filename)

Done saving to Data/republic_cleaned_seq.txt


In [117]:
# View the first five sequences
word_sequences[:5]

['book i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my',
 'i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers',
 'i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to',
 'went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the',
 'down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess']

Now that we have our text as sequences and have saved a copy to disk, we can either load the file to work with or use the copy already in memory.

If you want to load the saved file, use the load_document() function and, once it is loaded, split it into lines on '\n'.

Next, we will encode the sequences.

Since our embedding layer expects integer input, we map each word in our vocabulary to a unique integer and encode our input sequences.

After training, we can convert a predicted integer back into its word by looking it up in the same mapping.

In [98]:
# import module for encoding
from keras.preprocessing.text import Tokenizer

In [118]:
# integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts(word_sequences)
encoded_seq = tokenizer.texts_to_sequences(word_sequences)
encoded_seq[:5]

[[1046, 11, 11, 1045, 329, 7409, 4, 1, 2873, 35, 213, 1, 261, 3, 2251, 9, 11, 179, 817, 123, 92],
 [11, 11, 1045, 329, 7409, 4, 1, 2873, 35, 213, 1, 261, 3, 2251, 9, 11, 179, 817, 123, 92, 2252],
 [11, 1045, 329, 7409, 4, 1, 2873, 35, 213, 1, 261, 3, 2251, 9, 11, 179, 817, 123, 92, 2252, 4],
 [1045, 329, 7409, 4, 1, 2873, 35, 213, 1, 261, 3, 2251, 9, 11, 179, 817, 123, 92, 2252, 4, 1],
 [329, 7409, 4, 1, 2873, 35, 213, 1, 261, 3, 2251, 9, 11, 179, 817, 123, 92, 2252, 4, 1, 1863]]
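
The idea behind the integer encoding can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: the helper names are introduced here, and ordering words by frequency is an assumption about the Tokenizer's behavior that is consistent with 'the' mapping to 1 in the output above.

```python
from collections import Counter

def build_word_index(lines):
    """Map each word to an integer starting at 1, most frequent word first."""
    counts = Counter(word for line in lines for word in line.split())
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: i for i, word in enumerate(ranked, start=1)}

def encode(line, word_index):
    """Replace each word with its integer."""
    return [word_index[word] for word in line.split()]

def decode(indices, word_index):
    """Reverse lookup: integers back to words using the same mapping."""
    index_word = {i: w for w, i in word_index.items()}
    return ' '.join(index_word[i] for i in indices)

lines = ['the son of ariston', 'the son of cephalus', 'the goddess']
word_index = build_word_index(lines)
encoded = encode('the son of ariston', word_index)
print(encoded)
print(decode(encoded, word_index))
```

Index 0 is never assigned to a word, which matters when sizing the embedding layer.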

Now that we have integer-encoded sequences, we need to know the size of the vocabulary to define the embedding layer later. We can get it from the size of the tokenizer's mapping dictionary.

Words are assigned values from 1 to the total number of words.

The embedding layer needs to allocate a vector representation for each word in this vocabulary, from index 1 to the largest index. Because array indexing starts at zero, the word at the end of the vocabulary has index 7409, which means the array must be 7409 + 1 in length.

In [119]:
# Get the vocabulary size
vocab_size = len(tokenizer.word_index)+1
vocab_size

7410
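
A small sketch of why the extra row is needed, using a toy three-word mapping and a plain list of lists in place of a real embedding matrix (all names and sizes here are illustrative):

```python
word_index = {'the': 1, 'son': 2, 'of': 3}   # toy mapping; indices start at 1
vocab_size = len(word_index) + 1              # +1 because row 0 is never assigned to a word
embedding_dim = 2                             # illustrative size only

# One row per possible index, 0..largest index
embedding_table = [[0.0] * embedding_dim for _ in range(vocab_size)]

largest_index = max(word_index.values())
row = embedding_table[largest_index]          # valid only because of the +1
print(vocab_size, len(embedding_table), row)
```

Without the `+ 1`, the table would have rows 0 to 2 and looking up the word with index 3 would fail.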

#### Split Data into Inputs and Outputs.

Now that we have our sequences, we can split them into X, the first 20 encoded tokens, and y, the last token.

Then we one-hot encode our target or y so that the model learns to predict the probability distribution for the next word.

In [108]:
from keras.utils import to_categorical
from numpy import array

In [120]:
# split into X and y
encoded_seq = array(encoded_seq)
X = encoded_seq[:,:-1]
y = encoded_seq[:,-1]
y = to_categorical(y, num_classes = vocab_size)
max_length = X.shape[1]

In [124]:
max_length, vocab_size

(20, 7410)

In [122]:
X.shape, y.shape

((118663, 20), (118663, 7410))
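
The same split and one-hot encoding can be written out by hand on toy data to show what the NumPy slicing and `to_categorical` produce. The helper names and the toy sequences are introduced here for illustration.

```python
def split_inputs_outputs(sequences):
    """All tokens but the last are the input; the last token is the target."""
    X = [seq[:-1] for seq in sequences]
    y = [seq[-1] for seq in sequences]
    return X, y

def one_hot(index, num_classes):
    """Vector of zeros with a single 1 at position index."""
    vector = [0] * num_classes
    vector[index] = 1
    return vector

sequences = [[3, 1, 2, 4], [1, 2, 4, 3]]   # toy encoded sequences
vocab_size = 5                              # indices 0..4, with 0 unused
X, y = split_inputs_outputs(sequences)
y_one_hot = [one_hot(index, vocab_size) for index in y]
print(X)
print(y_one_hot)
```

Each target becomes a vector as long as the vocabulary, which is why y above has 7410 columns.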

#### Fit a Model

Let's define our model.

For the embedding layer, we need to know the size of the vocabulary, the length of the input sequence, and the number of dimensions we would like to use to represent each word.

We will use two LSTM layers with 100 memory cells each. We can experiment later with more memory cells.

A dense fully connected layer with 100 neurons connects to the LSTM hidden layers to interpret the features extracted from the sequence. The output layer predicts the next word as a single vector the size of the vocabulary, with a probability for each word in the vocabulary.

We use the softmax activation function so that the outputs behave like normalized probabilities: non-negative values that sum to 1.
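
For reference, here is a minimal plain-Python sketch of softmax. Keras applies this inside the output layer; the function below is only an illustration, and the input scores are made up.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    # subtracting the max does not change the result but avoids overflow in exp
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(sum(probs))
```

The largest score gets the largest probability, and the ordering of the scores is preserved.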

In [123]:
# Modules for models 
from keras.layers import Dense, Embedding, LSTM
from keras.models import Sequential

In [125]:
# define model
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=max_length))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 20, 50)            370500    
_________________________________________________________________
lstm_1 (LSTM)                (None, 20, 100)           60400     
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_1 (Dense)              (None, 100)               10100     
_________________________________________________________________
dense_2 (Dense)              (None, 7410)              748410    
Total params: 1,269,810
Trainable params: 1,269,810
Non-trainable params: 0
_________________________________________________________________
None
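
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers (the helper names are introduced here) and reproduces the numbers printed above.

```python
def embedding_params(vocab_size, dim):
    # one vector of length dim per vocabulary index
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units + 1) * units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per unit
    return input_dim * units + units

vocab_size, embed_dim, units = 7410, 50, 100
counts = [
    embedding_params(vocab_size, embed_dim),   # embedding_1
    lstm_params(embed_dim, units),             # lstm_1
    lstm_params(units, units),                 # lstm_2
    dense_params(units, 100),                  # dense_1
    dense_params(100, vocab_size),             # dense_2
]
print(counts, sum(counts))
```

Most of the parameters sit in the embedding and the output layer, both of which scale with the vocabulary size, so a smaller vocabulary shrinks the model considerably.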


In [127]:
# Compile the model 
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
model.fit(X, y, batch_size=172, epochs=100, verbose=2)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 1/100
 - 63s - loss: 6.2074 - accuracy: 0.0689
Epoch 2/100
 - 60s - loss: 5.7508 - accuracy: 0.1009
Epoch 3/100
 - 59s - loss: 5.5201 - accuracy: 0.1230
Epoch 4/100
 - 58s - loss: 5.3564 - accuracy: 0.1376
Epoch 5/100
 - 61s - loss: 5.2430 - accuracy: 0.1480
Epoch 6/100
 - 55s - loss: 5.1513 - accuracy: 0.1551
Epoch 7/100
 - 56s - loss: 5.0684 - accuracy: 0.1606
Epoch 8/100
 - 55s - loss: 4.9901 - accuracy: 0.1649
Epoch 9/100
 - 56s - loss: 4.9170 - accuracy: 0.1685
Epoch 10/100
 - 56s - loss: 4.8493 - accuracy: 0.1723
Epoch 11/100
 - 56s - loss: 4.7834 - accuracy: 0.1759
Epoch 12/100
 - 56s - loss: 4.7224 - accuracy: 0.1791
Epoch 13/100
 - 56s - loss: 4.6619 - accuracy: 0.1830
Epoch 14/100
 - 56s - loss: 4.6050 - accuracy: 0.1848
Epoch 15/100
 - 59s - loss: 4.5514 - accuracy: 0.1882
Epoch 16/100
 - 56s - loss: 4.4998 - accuracy: 0.1903
Epoch 17/100
 - 55s - loss: 4.4504 - accuracy: 0.1924
Epoch 18/100
 - 56s - loss: 4.4028 - accuracy: 0.1950
Epoch 19/100
 - 56s - loss: 4.3566 - 

<keras.callbacks.callbacks.History at 0x7f082c3cf630>

Our model has been trained; however, we can see that the accuracy is not that high.

There are several ways we can improve learning here and we can explore these ways. But we can try to generate text to see if it works and how well it works before we move on to improving the network.
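
Accuracy is a harsh measure for next-word prediction, since many different words can be a reasonable continuation. One additional way to read the training log is to convert the loss into perplexity. The sketch below assumes the reported cross-entropy loss is measured with the natural logarithm; the loss values are taken from the log above.

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity for a loss measured in nats (natural-log cross-entropy)."""
    return math.exp(cross_entropy_loss)

vocab_size = 7410
uniform_loss = math.log(vocab_size)   # loss of guessing every word with equal probability

print('uniform guess: loss %.2f, perplexity %.0f' % (uniform_loss, perplexity(uniform_loss)))
print('epoch 1:       loss %.2f, perplexity %.0f' % (6.2074, perplexity(6.2074)))
print('epoch 19:      loss %.2f, perplexity %.0f' % (4.3566, perplexity(4.3566)))
```

Perplexity can be read loosely as the number of words the model is choosing between at each step. Keep in mind that these are training-set numbers, so they say nothing about how the model handles unseen text.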

In [128]:
# Save our model
from pickle import dump

In [129]:
# Save the model to file
model.save('Models/Text_gen.h5')
# Save the tokenizer (the with-block closes the file once it is written)
with open('Weights/Text_gen.pkl', 'wb') as f:
    dump(tokenizer, f)

Now we need to evaluate our model on various target words.

We will create a function that allows us to take a seed text and predict the words that follow it.

In [160]:
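# generate a sequence from a language model
# (pad_sequences is imported from keras.preprocessing.sequence in a later cell)
def sequence_generator(tokenizer, model, sequence_length, seed_text, num_words):
    result = list()
    input_text = seed_text
    for _ in range(num_words):
        # encode the text so far and keep only the last sequence_length tokens
        encoded = tokenizer.texts_to_sequences([input_text])[0]
        encoded = pad_sequences([encoded], maxlen=sequence_length, truncating='pre')
        # index of the most probable next word; taking the argmax of model.predict
        # also works on Keras versions that no longer provide predict_classes
        predicted_y = int(model.predict(encoded, verbose=0).argmax(axis=-1)[0])
        # map the predicted integer back to its word
        output_text = ''
        for word, index in tokenizer.word_index.items():
            if index == predicted_y:
                output_text = word
                break
        # feed the predicted word back in as part of the next input
        input_text += ' ' + output_text
        result.append(output_text)
    return ' '.join(result)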
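
Two steps inside the generator are worth seeing in isolation: trimming the growing input to the last 20 tokens, and turning a predicted integer back into a word. The sketch below is plain Python with hypothetical helper names. `prepare_window` mimics what pre-truncation with pre-padding does for a single sequence, and `build_index_word` builds the reverse mapping once instead of scanning the dictionary on every prediction.

```python
def prepare_window(encoded, maxlen):
    """Keep the last maxlen integers and left-pad with zeros if there are fewer."""
    window = encoded[-maxlen:]
    return [0] * (maxlen - len(window)) + window

def build_index_word(word_index):
    """Invert a word -> index mapping so a predicted index can be looked up directly."""
    return {index: word for word, index in word_index.items()}

word_index = {'the': 1, 'son': 2, 'of': 3, 'ariston': 4}   # toy mapping
index_word = build_index_word(word_index)

print(prepare_window([1, 2, 3, 4, 1, 2], maxlen=4))   # longer input: oldest tokens dropped
print(prepare_window([1, 2], maxlen=4))               # shorter input: zeros added on the left
print(index_word.get(4, ''))                          # reverse lookup of a predicted index
```

Because the oldest tokens are dropped first, the model always sees the most recent words, including the ones it has just generated.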

In [157]:
# Load cleaned text sequences
in_filename = 'Data/republic_cleaned_seq.txt'
document = load_document(in_filename)
lines = document.split('\n')
sequence_length = len(lines[0].split()) - 1
sequence_length

Loading text...
Text loaded... Use your assigned variable to view text.


20

In [133]:
from keras.models import load_model
from random import randint
from keras.preprocessing.sequence import pad_sequences
from pickle import load

In [134]:
# Load the model
model = load_model('Models/Text_gen.h5')
# load the tokenizer (the with-block closes the file after reading)
with open('Weights/Text_gen.pkl', 'rb') as f:
    tokenizer = load(f)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


In [158]:
# select a seed text from our text
# randint includes both endpoints, so the upper bound must be len(lines) - 1
seed_text = lines[randint(0, len(lines) - 1)]
print(seed_text + '\n')

and the soul with shrilling cry passed like smoke beneath the earth and as bats in hollow of mystic cavern whenever



In [161]:
# generate new text
generated = sequence_generator(tokenizer, model, sequence_length, seed_text, 20)
print(generated)

any one of the philosophic nature they are holy angels upon the earth authors of pain or princes who are


We have now generated text using our model and a seed text. Explore the options around data cleaning, a more simplified vocabulary, and tuning the model to be deeper or wider. See what works.

Cheers