# A generative model


# Lesson 1 - Load data

The first thing we did was to download Hans Christian Andersen fairytales from Gutenberg.

What do we need to look out for when cleaning the data:

- Punctuation 
- The file consist of multiple stories --> need to divide it into different parts (might be possible to split after headers written in capital letters) --> we will do this in lesson 5
- Capital letters as headers, might want to delete these
- Change everything to lower caser 
- Illustrations, delete them, written as ([Illustration: _His limbs were numbed, his beautiful eyes were closing.-])


We have to decide the input sequences, which is the number of words that the model will train on. This is also the number of words which will be used as the seed text, when using the model to actually generate new sequences. 

In the tutorial 50 words are suggested. We will start with that and then in the fifth lesson we might try to only use sequences inside sentences, though this will minimize the amount of training data. 

Documentation on the following two coding blocks:
https://www.pythonforbeginners.com/files/reading-and-writing-files-in-python
- How to open text

In [1]:
#import packages
import re
import string
from numpy import array
from pickle import dump
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
from keras.models import load_model
from pickle import load
from random import randint
from keras.preprocessing.sequence import pad_sequences



Using TensorFlow backend.


In [2]:
#define a function to load the text into memory
def load_text(filename): #we decide that the function is called load_text
    file = open(filename, 'r') #open() is used to open/loading a filename, the mode indicates how it should be opened, r = read, w = writing, a = appending 
    text = file.read() # read the file and assign it to the variable text
    file.close() #close the file again
    return text #ends function and specify what output the want, we want text

In [3]:
#load the documents
#text_1 = 'Data/gutenberg_hc_1.txt'
#text_2 = 'Data/gutenberg_hc_1.txt'

#doc_1 = load_text(text_1)
#doc_2 = load_text(text_2)

#but we actually want to concatenate the two files, so we will use another function

#concatenate text files
filenames = ['Data/500-0.txt', 'Data/503-0.txt', 'Data/677-0.txt', 'Data/902-0.txt', 'Data/2591-0.txt', 'Data/2892-0.txt', 'Data/7439-0.txt', 'Data/22693-0.txt', 'Data/52521.txt', 'Data/pg128.txt', 'Data/pg708.txt', 'Data/pg4018.txt', 'Data/pg4357.txt', 'Data/pg11027.txt', 'Data/pg17860.txt', 'Data/pg19068.txt', 'Data/pg19860.txt', 'Data/pg29021.txt'] #specify the filenames that we want to concatenate

with open('Data/fairytales.txt', 'w') as outfile: #specify where the file should be saved and what it should be called (if a file with that name already exists, it will be overwritten, 'w' is used to edit and write new information to a file 
    for text in filenames: #for all the text in the filenames, do the following...
        with open(text) as infile:
            outfile.write(infile.read())
            
                  

In [4]:

# load document
in_filename = 'Data/fairytales.txt'
doc = load_text(in_filename)
print(doc[:200])


CHAPTER 1

How it happened that Mastro Cherry, carpenter, found a piece of wood
that wept and laughed like a child.


Centuries ago there lived--

“A king!” my little readers will say immediately.

N


# Lesson 2 - clean data

In [5]:
#removing illustration descriptions
doc = re.sub(r'\[[^)]*\]', '', doc) #using re.sub function and regular expressions 
#[ - an opening bracket 
#[^()]* - zero or more characters other than those defined, that is, any characters other than [ and ]
#\] - a closing bracket



In [6]:
#remove headers
doc = re.sub(r'[A-Z]{2,}', '', doc) #remove capital letters everytime there is more than two 

print(doc[:10000]) #lets look at it 



 1

How it happened that Mastro Cherry, carpenter, found a piece of wood
that wept and laughed like a child.


Centuries ago there lived--

“A king!” my little readers will say immediately.

No, children, you are mistaken. Once upon a time there was a piece of
wood. It was not an expensive piece of wood. Far from it. Just a common
block of firewood, one of those thick, solid logs that are put on the
fire in winter to make cold rooms cozy and warm.

I do not know how this really happened, yet the fact remains that
one fine day this piece of wood found itself in the shop of an old
carpenter. His real name was Mastro Antonio, but everyone called him
Mastro Cherry, for the tip of his nose was so round and red and shiny
that it looked like a ripe cherry.

As soon as he saw that piece of wood, Mastro Cherry was filled with joy.
Rubbing his hands together happily, he mumbled half to himself:

“This has come in the nick of time. I shall use it to make the leg of a
table.”

He grasped the hatc

Documentation for the following code: 
https://www.geeksforgeeks.org/python-maketrans-translate-functions/
- How to use translate together with the helper function maketrans

In [7]:
# turn the document into clean tokens
def clean_doc(doc): #make a function called clean_doc
	# split into tokens by white space
	tokens = doc.split() #get a list with all words (tokens)
	# remove punctuation from each token
	table = str.maketrans('', '', string.punctuation) #third argument specifies the character we want to delete
	tokens = [w.translate(table) for w in tokens] #use translation mapping from table and loop through all words in token
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()] #isalpha() checks whether the string consists of alphabetic characters only. Replace word with word if it is alphabetical. 
	# make lower case
	tokens = [word.lower() for word in tokens] #replace word in lower case with word in token 
	return tokens

clean_doc(doc)



['how',
 'it',
 'happened',
 'that',
 'mastro',
 'cherry',
 'carpenter',
 'found',
 'a',
 'piece',
 'of',
 'wood',
 'that',
 'wept',
 'and',
 'laughed',
 'like',
 'a',
 'child',
 'centuries',
 'ago',
 'there',
 'lived',
 'my',
 'little',
 'readers',
 'will',
 'say',
 'immediately',
 'no',
 'children',
 'you',
 'are',
 'mistaken',
 'once',
 'upon',
 'a',
 'time',
 'there',
 'was',
 'a',
 'piece',
 'of',
 'wood',
 'it',
 'was',
 'not',
 'an',
 'expensive',
 'piece',
 'of',
 'wood',
 'far',
 'from',
 'it',
 'just',
 'a',
 'common',
 'block',
 'of',
 'firewood',
 'one',
 'of',
 'those',
 'thick',
 'solid',
 'logs',
 'that',
 'are',
 'put',
 'on',
 'the',
 'fire',
 'in',
 'winter',
 'to',
 'make',
 'cold',
 'rooms',
 'cozy',
 'and',
 'warm',
 'i',
 'do',
 'not',
 'know',
 'how',
 'this',
 'really',
 'happened',
 'yet',
 'the',
 'fact',
 'remains',
 'that',
 'one',
 'fine',
 'day',
 'this',
 'piece',
 'of',
 'wood',
 'found',
 'itself',
 'in',
 'the',
 'shop',
 'of',
 'an',
 'old',
 'carpent

In [8]:
tokens = clean_doc(doc)
print(tokens[:200])
print('Total Tokens: %d' % len(tokens)) #print total number of tokens
print('Unique Tokens: %d' % len(set(tokens))) #group similar tokens and print the number

['how', 'it', 'happened', 'that', 'mastro', 'cherry', 'carpenter', 'found', 'a', 'piece', 'of', 'wood', 'that', 'wept', 'and', 'laughed', 'like', 'a', 'child', 'centuries', 'ago', 'there', 'lived', 'my', 'little', 'readers', 'will', 'say', 'immediately', 'no', 'children', 'you', 'are', 'mistaken', 'once', 'upon', 'a', 'time', 'there', 'was', 'a', 'piece', 'of', 'wood', 'it', 'was', 'not', 'an', 'expensive', 'piece', 'of', 'wood', 'far', 'from', 'it', 'just', 'a', 'common', 'block', 'of', 'firewood', 'one', 'of', 'those', 'thick', 'solid', 'logs', 'that', 'are', 'put', 'on', 'the', 'fire', 'in', 'winter', 'to', 'make', 'cold', 'rooms', 'cozy', 'and', 'warm', 'i', 'do', 'not', 'know', 'how', 'this', 'really', 'happened', 'yet', 'the', 'fact', 'remains', 'that', 'one', 'fine', 'day', 'this', 'piece', 'of', 'wood', 'found', 'itself', 'in', 'the', 'shop', 'of', 'an', 'old', 'carpenter', 'his', 'real', 'name', 'was', 'mastro', 'antonio', 'but', 'everyone', 'called', 'him', 'mastro', 'cherry'

In [9]:
# organize the text into sequences of tokens
length = 50 + 1 #define the length of the sequences
sequences = list() #create an empty list, where we will store all the sequences

for i in range(length, len(tokens)): #range([start], stop[, step])
	# select sequence of tokens
	seq = tokens[i-length:i] #make sequences with length of 50 + 1
	# convert into a line
	line = ' '.join(seq) #join the tokens into a line in the sequence
	# store
	sequences.append(line) #append all the lines to the empty list created before the loop
print('Total Sequences: %d' % len(sequences)) #print the total amount (length of sequence) of sequences 

Total Sequences: 845588


In [10]:
# save tokens to file, one dialog per line
def save_doc(lines, filename):
	data = '\n'.join(lines)
	file = open(filename, 'w')
	file.write(data)
	file.close()

In [12]:
# save sequences to file
out_filename = 'Data/clean_text.txt'
save_doc(sequences, out_filename)



clean_text.txt


# Lesson 3


In [12]:
#load the cleaned data
doc = load_text('Data/clean_text.txt')

#split the text according to line shift (every string is 51 words)
lines = doc.split('\n')

#check the type
type(lines)#its a list

list

#Documentation
https://keras.io/getting-started/functional-api-guide/

https://www.tensorflow.org/install/source

In [None]:
tokenizer = Tokenizer() #define function Tokenizer() as the variable tokenizer
tokenizer.fit_on_texts(lines) #fit_on_text() tells it what data to train on. We train it on the entire training data, finds all of the unique words in the data and assigns each a unique integer, from 1 to the total number of words
sequences = tokenizer.texts_to_sequences(lines) #convert each sequence from a list of words to a list of integers

print(sequences) #all sequences are represented by vectors, consisting of an integer corresponding to each word in the vector

tokenizer.word_index #check out the mapping of integers to words


In [None]:
#check vocabulary size
vocab_size = len(tokenizer.word_index) + 1 #+1 because indexing of arrays starts at zero, and the first word is assigned the integer of 1

vocab_size

In [None]:

# separate into input and output
sequences = array(sequences) #create an array with the sequences
X, y = sequences[:,:-1], sequences[:,-1] #separate into input and putput sequences
print(X) #input
print(y) #output
y = to_categorical(y, num_classes=vocab_size) #convert to one-hot encoding
print(y)
seq_length = X.shape[1] #set sequence length to the number of columns in X

In [None]:
model = Sequential() #define model, use sequential model function
model.add(Embedding(vocab_size, 50, input_length=seq_length)) #add embedding layer, 50 is the dimensions which will be used to represent each word vector
model.add(LSTM(100, return_sequences=True)) #add hidden layer, long short term memory layer
model.add(LSTM(100))
model.add(Dense(100, activation='relu')) #dense specifies the structure of the neural network, 100 refer to that there are 100 neurons in the first hidden layer, as defined above. Relu is an argument of rectified linear unit.  
model.add(Dense(vocab_size, activation='softmax')) #softmax transform to probabilities 
print(model.summary())

In [None]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']) #it's a multilevel predicter, because it's categorical, we want a measure which is accuracy, and a cost function which is optimized by adam
# fit model
model.fit(X, y, batch_size=128, epochs=100)

In [None]:
# save the model to file
model.save('model.h5')
# save the tokenizer
dump(tokenizer, open('tokenizer.pkl', 'wb')) #w stands for write

# Lesson 4

In [None]:
#need the text, so we can use a sequence as a source input for generating the text
doc = load_text('Data/clean_text.txt') #load cleaned sequences 

lines = doc.split('\n') #split by new line



In [None]:
#specify the sequence length
seq_length = len(lines[0].split()) - 1 #look at the first line, split sequences by words, count and minus 1

In [None]:
#load the model
model = load_model('model.h5')


# load the tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb')) #r stands for read

In [None]:
#select a seed text
seed_text = lines[randint(0,len(lines))] #randint returns a random integer from 0 to the length of lines, index it
print(seed_text + '\n') #print the selected seed text and make a line shift

In [None]:
#use tokenizer to assign an integer to every word in the seed text
encoded = tokenizer.texts_to_sequences([seed_text])[0]

#exclude the output word, so take the last 50 words
encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')

len(encoded)


In [None]:
#the model.preict_classes will return the index of the word with the highest probability 
yhat = model.predict_classes(encoded, verbose=0) #verbose specifies how  you want to 'see' the training progress for each epoch

In [None]:
#look up in the tokenizer what word the indexed word corresponds to

out_word = '' #empty variable to store the word output word in 
for word, index in tokenizer.word_index.items(): #loop through the tokenizers
	if index == yhat: #if the index is equally to yhat, assign the word to the variable out_word
		out_word = word
		break
        


In [None]:
#the input sequences will get too long, when adding the output word, need to make sure that it's always 50 words long
encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre') #pre indicates that it should take the values from the begining of the sequence

In [None]:
#create a function for generating new sequences 
def generate_seq(model, tokenizer, seq_length, seed_text, n_words): #n_words is the number of words to generate 
	result = list() #create and empty list
	in_text = seed_text #assign the see_text to a variable 
	# generate a fixed number of words
	for _ in range(n_words): #range(), first argument is the number of integers to generate, starting from zero
		# encode the text as integer
		encoded = tokenizer.texts_to_sequences([in_text])[0] #take the first sequence
		# truncate sequences to a fixed length
		encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre') #make sure that the sequence is 50 words
		# predict probabilities for each word
		yhat = model.predict_classes(encoded, verbose=0)
		# map predicted word index to word
		out_word = ''
		for word, index in tokenizer.word_index.items():
			if index == yhat:
				out_word = word
				break
		# append to input
		in_text += ' ' + out_word #add the output word to the sequence we are currently working with 
		result.append(out_word) #append the word to the empty list
	return ' '.join(result) #''.join creates a space between each word in the result list and makes a string

In [None]:

# generate new text
generated = generate_seq(model, tokenizer, seq_length, seed_text, 50)
print(generated)


#save the generated text

out_filename = 'Data/first_output1.txt'

save_doc(generated, out_filename)

# Things to look at in lecture 5 


Input sequences


We could process the data so that the model only ever deals with self-contained sentences and pad or truncate the text to meet this requirement for each input sequence. You could explore this as an extension to this tutorial.
- Instead, to keep the example brief, we will let all of the text flow together and train the model to predict the next word across sentences, paragraphs, and even books or chapters in the text.

- The file consist of multiple stories --> need to divide it into different parts (might be possible to split after headers written in capital letters) --> we will do this in lesson 5

- FInd more fairytales, data is not big enough

We will use a two LSTM hidden layers with 100 memory cells each. More memory cells and a deeper network may achieve better results.