# A generative model


# Lesson 1 - Load data

The first thing we did was to download Hans Christian Andersen fairytales from Gutenberg.

What do we need to look out for when cleaning the data?

- Punctuation
- The file consists of multiple stories, so we need to divide it into different parts (it might be possible to split after the headers written in capital letters). We will do this in lesson 5.
- Headers are written in capital letters; we might want to delete these
- Change everything to lower case
- Illustrations, which should be deleted. They are written as ([Illustration: _His limbs were numbed, his beautiful eyes were closing.-])


We have to decide on the length of the input sequences, which is the number of words that the model will see in each training example. This is also the number of words that will be used as the seed text when using the model to generate new sequences.

The tutorial suggests 50 words. We will start with that, and in the fifth lesson we might try to use only sequences that stay inside a sentence, though this will reduce the amount of training data.
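
As a rough illustration of what fixed-length input sequences look like, the sketch below cuts a token list into overlapping windows of `length` input words plus one output word. The function name `make_sequences` and the toy sentence are made up for this example; the 50-word setting from the tutorial would simply be `length=50`.

```python
def make_sequences(tokens, length):
    """Return overlapping windows of `length` input words plus 1 output word."""
    window = length + 1
    sequences = []
    for i in range(window, len(tokens) + 1):
        sequences.append(tokens[i - window:i])
    return sequences

# toy example with 3 input words per sequence (the tutorial suggests 50)
tokens = 'once upon a time there was a piece of wood'.split()
sequences = make_sequences(tokens, length=3)
print(len(sequences))
print(sequences[0])
```

Each window shares all but one word with its neighbour, which is why restricting the windows to single sentences would leave fewer training examples.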

Documentation for the following two code blocks (how to open text files):
https://www.pythonforbeginners.com/files/reading-and-writing-files-in-python

In [3]:
#import packages
import os
import re
import string
from os import listdir
from os.path import isfile, join
from pickle import dump, load
from random import randint

import nltk.data
from nltk.tokenize import sent_tokenize, word_tokenize
from numpy import array
from keras.layers import LSTM, Dense, Embedding, Masking
from keras.models import Sequential, load_model
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical


Using TensorFlow backend.


In [4]:
#define a function to load the text into memory
def load_text(filename): #we decide that the function is called load_text
    #open() opens the file; the mode says how: 'r' = read, 'w' = write, 'a' = append
    #the with block closes the file again automatically
    with open(filename, 'r', encoding='utf-8') as file:
        text = file.read() #read the file and assign it to the variable text
    return text #end the function and return the text as its output

In [5]:
#set path and working directory
path = "C:\\Users\\au581290\\Desktop\\Data"
os.chdir(path)

#list the text files in the folder, skipping the output file in case the cell is run again
filenames = [f for f in listdir(path) if isfile(join(path, f)) and f != 'adventures.txt']

filenames

#concatenate the text files into one file
with open('adventures.txt', 'w', encoding="utf8") as outfile:
    for fname in filenames:
        with open(fname, encoding="utf8") as infile:
            outfile.write(infile.read())

                  

In [6]:

# load document
in_filename = 'Data/fairytales.txt'
doc = load_text(in_filename)
print(doc[:200])


CHAPTER 1

How it happened that Mastro Cherry, carpenter, found a piece of wood
that wept and laughed like a child.


Centuries ago there lived--

“A king!” my little readers will say immediately.

N


# Lesson 2 - clean data

In [7]:
#removing illustration descriptions
doc = re.sub(r'\[[^\]]*\]', '', doc) #using the re.sub function and a regular expression
#\[ - an opening bracket
#[^\]]* - zero or more characters other than a closing bracket
#\] - a closing bracket
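
The character class between the two brackets decides where a match stops. A small sketch on a made-up sample string (not taken from the corpus), comparing a pattern that can run past a closing bracket with one that cannot:

```python
import re

# made-up sample with two illustration captions
sample = 'Start [Illustration: one] middle text [Illustration: two] end'

greedy = re.sub(r'\[.*\]', '', sample)       # .* runs on to the LAST closing bracket
bounded = re.sub(r'\[[^\]]*\]', '', sample)  # [^\]]* stops at the FIRST closing bracket

print(repr(greedy))
print(repr(bounded))
```

With the greedy pattern the ordinary text between the two captions is deleted as well, so it is worth spot-checking a few passages of the real document after the substitution.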



In [8]:
#remove headers
doc = re.sub(r'[A-Z]{2,}', '', doc) #remove every run of two or more consecutive capital letters

print(doc[:1000]) #let's look at it



 1

How it happened that Mastro Cherry, carpenter, found a piece of wood
that wept and laughed like a child.


Centuries ago there lived--

“A king!” my little readers will say immediately.

No, children, you are mistaken. Once upon a time there was a piece of
wood. It was not an expensive piece of wood. Far from it. Just a common
block of firewood, one of those thick, solid logs that are put on the
fire in winter to make cold rooms cozy and warm.

I do not know how this really happened, yet the fact remains that
one fine day this piece of wood found itself in the shop of an old
carpenter. His real name was Mastro Antonio, but everyone called him
Mastro Cherry, for the tip of his nose was so round and red and shiny
that it looked like a ripe cherry.

As soon as he saw that piece of wood, Mastro Cherry was filled with joy.
Rubbing his hands together happily, he mumbled half to himself:

“This has come in the nick of time. I shall use it to make the leg of a
table.”

He grasped the hatc
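
The output above shows that `CHAPTER` is gone while the chapter number ` 1` is left behind, because the pattern only matches capital letters. The same pattern also strips capitalised words in the middle of a sentence. One possible alternative, sketched here on a made-up sample, is to remove whole lines that contain no lower-case letters. The multiline pattern below is an assumption for illustration, not what the notebook uses.

```python
import re

sample = 'CHAPTER 1\n\nHe shouted STOP at the fox.\n'

# pattern used in the notebook: strips every run of 2+ capitals, anywhere
by_run = re.sub(r'[A-Z]{2,}', '', sample)

# hypothetical alternative: remove a line only if it starts with a capital
# and contains nothing but capitals, digits, spaces and basic punctuation
by_line = re.sub(r"^[A-Z][A-Z0-9 .,'-]*$", '', sample, flags=re.MULTILINE)

print(repr(by_run))
print(repr(by_line))
```

The line-based version would still need checking against the real headers, for instance headers that span several lines.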

Documentation for the following code, which uses translate() together with the helper function maketrans(): https://www.geeksforgeeks.org/python-maketrans-translate-functions/

In [9]:
# turn the document into clean tokens


def clean_doc(doc): #define a function called clean_doc
	# split into tokens by white space
	tokens = doc.split() #get a list of all words (tokens)
	# remove punctuation from each token
	table = str.maketrans('', '', string.punctuation) #the third argument lists the characters we want to delete
	tokens = [w.translate(table) for w in tokens] #apply the translation table to every token
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()] #isalpha() is True only if the string consists of alphabetic characters, so all other tokens are dropped
	# make lower case
	tokens = [word.lower() for word in tokens] #convert every token to lower case
	return tokens
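
One thing to keep in mind is that `string.punctuation` only contains ASCII punctuation, while the text uses curly quotes (for example “A king!”). A token that still carries a curly quote fails `isalpha()` and is dropped entirely. The sketch below shows the effect; `clean_tokens` and its `extra_punct` parameter are made up for this illustration and follow the same steps as `clean_doc`.

```python
import string

# string.punctuation only covers ASCII punctuation
print('“' in string.punctuation)

def clean_tokens(text, extra_punct='“”‘’'):
    """Same steps as clean_doc, but also deletes the characters in extra_punct."""
    table = str.maketrans('', '', string.punctuation + extra_punct)
    tokens = [w.translate(table) for w in text.split()]
    return [w.lower() for w in tokens if w.isalpha()]

sample = '“A king!” my little readers will say immediately.'
ascii_only = clean_tokens(sample, extra_punct='')
with_quotes = clean_tokens(sample)
print(ascii_only)
print(with_quotes)
```

Whether losing these words matters depends on how much dialogue the stories contain, so this is a choice rather than a required fix.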

# The following coding blocks are part of lesson 5

In [10]:
#download the Punkt sentence tokenizer models
nltk.download('punkt')

#load the pretrained English sentence tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

#assign doc to data
data = doc

#divide into sentences
sentence_list = sent_tokenize(data)

#print the number of sentences
print(len(sentence_list))

#check the type, it's a list
type(sentence_list)

#index to see an example
sentence_list[5] #we still need to remove the punctuation and convert to lower case




[nltk_data] Downloading package punkt to
[nltk_data]     /Users/signeklovekjaer/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
35188


'Far from it.'

In [11]:
#check if it works on one sentence in the sentence list
print(clean_doc(sentence_list[3])) #it does

#make an empty list
sentence_clean_list = []

#make a loop which runs through the list of sentences and uses the clean_doc() function
for sentence in sentence_list:
    y = clean_doc(sentence) #save in y and append to empty list
    sentence_clean_list.append(y)
    

#print(sentence_clean_list[:10])

#check to see if the punctuation is removed and if it's in lower case
print(sentence_clean_list[5]) #it is



        

['once', 'upon', 'a', 'time', 'there', 'was', 'a', 'piece', 'of', 'wood']
['far', 'from', 'it']


We need to check the length of the sentences, because we need it when we either pad or truncate the sentences into sequences of equal length.

In [12]:
#check the length of one sentence (index 3)
print(len(sentence_clean_list[3]))

length = []
for sentence in sentence_clean_list:
    l = len(sentence)
    length.append(l)

    
print(max(length))

print(min(length))


10
348
0
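
The maximum and minimum alone say little about where a sensible cut-off lies. A minimal sketch, using a made-up toy list in place of `sentence_clean_list`, of how the whole length distribution could be inspected before picking the limits (`share_within` is a hypothetical helper):

```python
from collections import Counter

# toy stand-in for sentence_clean_list (lists of tokens)
toy_sentences = [[], ['far', 'from', 'it'], ['no'],
                 ['once', 'upon', 'a', 'time'], ['a', 'b', 'c']]

lengths = [len(s) for s in toy_sentences]
counts = Counter(lengths)              # how many sentences have each length

def share_within(lengths, low, high):
    """Fraction of sentences whose length lies in [low, high]."""
    kept = [n for n in lengths if low <= n <= high]
    return len(kept) / len(lengths)

print(sorted(counts.items()))
print(share_within(lengths, 1, 3))
```

Run on the real list, the share for a candidate range shows how much data a given cut-off would throw away.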


In [13]:
#remove empty sentences (0 words) and sentences longer than 130 words
sentence_new_list1 = [s for s in sentence_clean_list if len(s) >= 1] 
sentence_new_list = [s for s in sentence_new_list1 if len(s) <= 130]


length_new = []
for sentence in sentence_new_list:
    l = len(sentence)
    length_new.append(l)

    
print(max(length_new))

print(min(length_new))


#check how many sentences we have removed
print('Sentences removed after deleting sentences shorter than 1:', len(sentence_clean_list) - len(sentence_new_list1) )
print('Sentences removed after deleting sentences longer than 130:', len(sentence_new_list1) - len(sentence_new_list)) 



130
1
Sentences removed after deleting sentences shorter than 1: 151
Sentences removed after deleting sentences longer than 130: 67


In [14]:
#join the tokens back into one string per sentence; test the method on sentence 5 in the list
line = ' '.join(sentence_new_list[5]) 

print(line) #it works

#create empty list
sequences_sent = []

#create a loop to do this for all the sentences
for sentence in sentence_new_list:
    line = ' '.join(sentence)
    sequences_sent.append(line)

print(sequences_sent[0])

far from it
how it happened that mastro cherry carpenter found a piece of wood that wept and laughed like a child


In [15]:
# save the sentences to file, one sentence per line
def save_doc(lines, filename):
	data = '\n'.join(lines) #lines must be a list of strings
	with open(filename, 'w', encoding='utf-8') as file: #the with block closes the file automatically
		file.write(data)

In [16]:
# save sentences to file
out_filename = 'Data/clean_text_sentences.txt'
save_doc(sequences_sent, out_filename)

In [18]:
#load the cleaned data with sentences
doc = load_text('Data/clean_text_sentences.txt')

#split the text at line breaks
lines = doc.split('\n')

#check the type
type(lines) #it's a list


print(lines[0])

len(lines)

how it happened that mastro cherry carpenter found a piece of wood that wept and laughed like a child


34970

In [18]:
tokenizer = Tokenizer() #create a Tokenizer object and assign it to the variable tokenizer
tokenizer.fit_on_texts(lines) #fit_on_texts() builds the vocabulary from the data: it finds all the unique words and assigns each one a unique integer, from 1 to the total number of words
sequences_sent = tokenizer.texts_to_sequences(lines) #convert each sentence from a string of words to a list of integers

#print(sequences_sent[:100]) #every sentence is now represented as a list of integers, one integer per word

tokenizer.word_index #check out the mapping of words to integers

print(sequences_sent[0])

[82, 13, 259, 8, 795, 1622, 1623, 117, 4, 356, 6, 224, 8, 938, 2, 730, 88, 4, 770]


In the model, all input sequences need to be the same length, so we have to pad or truncate the sentences. Padding adds zeros to each sequence until they all have the same length (the length of the longest sentence). In that case we also need a masking input layer that ignores the padded values, so that the padding has no impact on learning. We will start by padding all the sentences.
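
To make the padding step concrete, here is a plain-Python sketch of what pre-padding does. It is meant to mirror the default behaviour of Keras `pad_sequences` (zeros added in front, overly long sequences cut from the front), but `pad_pre` is a made-up helper for illustration, not the library code.

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad (and, if needed, left-truncate) integer sequences to one length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]                      # keep the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + list(seq))
    return padded

toy = [[82, 13, 259], [7], [4, 356, 6, 224, 8]]
print(pad_pre(toy))
print(pad_pre(toy, maxlen=2))
```

Because the zeros go in front, the last column always holds the final word of each sentence.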

In [49]:
#pads with zeros in front; 'pre' is the default option
padded = pad_sequences(sequences_sent)
print(padded)


len(padded[3]) #all sequences are now 130 long



[[   0    0    0 ...   88    4  770]
 [   0    0    0 ...   51  209  308]
 [   0    0    0 ...   11   66 3699]
 ...
 [   0    0    0 ... 3498  179   92]
 [   0    0    0 ...  782   22   80]
 [   0    0    0 ...  471   55  290]]


130

In [20]:
#check vocabulary size
vocab_size = len(tokenizer.word_index) + 1 #+1 because the word integers start at 1, so index 0 (used for the padding) also needs a slot in the arrays

vocab_size

7905

In [35]:
# separate into input and output
sequences_sent = array(padded) #create an array with the sequences

X, y = sequences_sent[:,:-1], sequences_sent[:,-1] #separate into input and output sequences
print(X) #input
print(y) #output
y = to_categorical(y, num_classes=vocab_size) #convert to one-hot encoding
print(y)
seq_length_sent = X.shape[1] #set sequence length to the number of columns in X

len(X[1])
len(y)

print(y[[3]])

print(max(y[3])) #check if there is one-hot encoding for the output word

print(min(y[3]))

print(len(X[1]))

[[   0    0    0 ...  730   88    4]
 [   0    0    0 ... 3026   51  209]
 [   0    0    0 ...  217   11   66]
 ...
 [   0    0    0 ...    1 3498  179]
 [   0    0    0 ... 4804  782   22]
 [   0    0    0 ...    1  471   55]]
[ 770  308 3699 ...   92   80  290]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]]
1.0
0.0
129
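
The same split can be followed by hand on a toy example. The sketch below uses plain lists instead of NumPy arrays and a made-up `one_hot` helper in place of `to_categorical`; the numbers are invented.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position index."""
    row = [0.0] * num_classes
    row[index] = 1.0
    return row

# toy padded sequences; 0 is the padding value, words are numbered from 1
toy_padded = [[0, 0, 3, 1, 2],
              [0, 4, 1, 3, 5]]
toy_vocab_size = 5 + 1                  # 5 words plus the reserved index 0

X_toy = [row[:-1] for row in toy_padded]   # all words except the last
y_toy = [row[-1] for row in toy_padded]    # the last word is the target
y_onehot = [one_hot(i, toy_vocab_size) for i in y_toy]

print(X_toy)
print(y_toy)
print(y_onehot[0])
```

The word with integer 5 needs position 5 in the one-hot vector, which only exists if the vector has 6 entries. That is the reason for the `+ 1` in the vocabulary size.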


In [56]:
#create the model
model = Sequential() #define the model as a linear stack of layers
model.add(Masking(mask_value=0, input_shape=(seq_length_sent,))) #intended to make the model ignore the padded zeros; note that Embedding only passes a mask on when mask_zero=True, so it is worth checking that the padding is really ignored
model.add(Embedding(vocab_size, 50, input_length=seq_length_sent)) #add embedding layer, 50 is the number of dimensions used to represent each word
model.add(LSTM(100, return_sequences=True)) #first hidden layer, a long short-term memory (LSTM) layer with 100 memory cells; return_sequences=True passes the whole sequence on to the next LSTM layer
model.add(LSTM(100)) #second LSTM layer with 100 memory cells
model.add(Dense(100, activation='relu')) #fully connected layer with 100 neurons; relu is the rectified linear unit activation function
model.add(Dense(vocab_size, activation='softmax')) #output layer with one neuron per word in the vocabulary; softmax transforms the outputs into probabilities
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
masking_17 (Masking)         (None, 129)               0         
_________________________________________________________________
embedding_16 (Embedding)     (None, 129, 50)           395250    
_________________________________________________________________
lstm_9 (LSTM)                (None, 129, 100)          60400     
_________________________________________________________________
lstm_10 (LSTM)               (None, 100)               80400     
_________________________________________________________________
dense_9 (Dense)              (None, 100)               10100     
_________________________________________________________________
dense_10 (Dense)             (None, 7905)              798405    
Total params: 1,344,555
Trainable params: 1,344,555
Non-trainable params: 0
_________________________________________________________________
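
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the helper names are made up, and the sizes are read off the summary above.

```python
def embedding_params(vocab, dim):
    return vocab * dim                            # one vector per word index

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    return input_dim * units + units              # weights plus biases

vocab = 7905
counts = [
    embedding_params(vocab, 50),   # embedding
    lstm_params(50, 100),          # first LSTM
    lstm_params(100, 100),         # second LSTM
    dense_params(100, 100),        # hidden Dense
    dense_params(100, vocab),      # softmax output
]
print(counts)
print(sum(counts))
```

More than half of the parameters sit in the output layer, because it needs one neuron for every word in the vocabulary.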


In [57]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']) #predicting the next word is a multi-class classification problem, so we use categorical cross-entropy as the loss function, adam as the optimizer that minimizes it, and accuracy as the metric
# fit model
model.fit(X, y, batch_size=128, epochs=100)

Epoch 1/100
...
Epoch 78

<keras.callbacks.History at 0x1a2667b240>

In [58]:
# save the model to file
model.save('model.h5_sentence')
# save the tokenizer
dump(tokenizer, open('tokenizer.pkl_sentence', 'wb')) #wb stands for write in binary mode

In [32]:
#need the text, so we can use a sequence as a source input for generating the text
doc = load_text('Data/clean_text_sentences.txt') #load cleaned sequences 

lines = doc.split('\n') #split by new line


lines[5]

'far from it'

In [37]:
#specify the sequence length
seq_length_sent = X.shape[1]

print(seq_length_sent)

129


In [38]:
#load the model
model = load_model('model.h5_sentence')


# load the tokenizer
tokenizer = load(open('tokenizer.pkl_sentence', 'rb')) #rb stands for read in binary mode

In [44]:
#select a seed text
seed_text_sentence = lines[randint(0, len(lines) - 1)] #randint returns a random integer between the two limits, both included, so the upper limit must be the last valid index
print(seed_text_sentence + '\n') #print the selected seed text followed by a blank line

you can imagine how fast he traveled when i tell you that they reached the palace in just half the time it had taken the wooden horse to get there
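
`randint(a, b)` includes both end points, so care is needed when it is used to index a list. Two standard-library alternatives that avoid the off-by-one question altogether, shown on a made-up toy list:

```python
from random import choice, randrange, seed

toy_lines = ['far from it', 'once upon a time', 'no children you are mistaken']

seed(1)                                         # fixed seed so the sketch is repeatable
picked = toy_lines[randrange(len(toy_lines))]   # randrange excludes the upper limit
also_picked = choice(toy_lines)                 # choice picks an element directly

print(picked)
print(also_picked)
```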



In [62]:
#use tokenizer to assign an integer to every word in the seed text
encoded_sentence = tokenizer.texts_to_sequences([seed_text_sentence])[0]

#pad the seed text
encoded_sentence = pad_sequences([encoded_sentence], maxlen=seq_length_sent, truncating='pre')

print(encoded_sentence)



[[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0   11  114  523   83  387    5 1124   33   12  144   11    8   22
   256    1  154    9  129  284    1   68   13   17  324    1  587  208
     3  149   53]]


In [65]:
#model.predict_classes will return the index of the word with the highest probability
yhat = model.predict_classes(encoded_sentence, verbose=0) #verbose=0 turns off the progress output during prediction

yhat



array([69])

In [66]:
#look up in the tokenizer which word the predicted index corresponds to

out_word = '' #empty variable to store the output word in
for word, index in tokenizer.word_index.items(): #loop through the word-to-integer mapping of the tokenizer
	if index == yhat: #if the index is equal to yhat, assign the word to the variable out_word
		out_word = word
		break
        
out_word

'more'
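
Looping through the whole vocabulary for every predicted word works, but the lookup can also be done with a reversed dictionary that is built once. A small sketch with a made-up stand-in for `tokenizer.word_index`; recent Keras versions also seem to expose such a mapping as `tokenizer.index_word`, which would be worth verifying for the installed version.

```python
# toy stand-in for tokenizer.word_index (word -> integer)
toy_word_index = {'the': 1, 'and': 2, 'more': 69}

# build the reverse mapping once (integer -> word)
toy_index_word = {index: word for word, index in toy_word_index.items()}

predicted = 69
print(toy_index_word.get(predicted, ''))   # '' if the index is unknown, e.g. 0 for padding
```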

In [None]:
#the input sequence will get too long when the output word is added, so we need to make sure that it is always 129 words long
#encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre') #pre indicates that values are removed from the beginning of the sequence, so the most recent words are kept

In [68]:
#create a function for generating new sequences 
def generate_seq(model, tokenizer, seq_length_sent, seed_text_sentence, n_words): #n_words is the number of words to generate 
	result = list() #create an empty list
	in_text = seed_text_sentence #assign the seed text to a variable
	# generate a fixed number of words
	for _ in range(n_words): #range(n_words) yields n_words integers, starting from zero
		# encode the text as integers
		encoded = tokenizer.texts_to_sequences([in_text])[0] #take the first (and only) sequence
		# pad or truncate the sequence to a fixed length
		encoded = pad_sequences([encoded], maxlen=seq_length_sent, truncating='pre') #make sure that the sequence is seq_length_sent words long
		# predict the index of the most probable next word
		yhat = model.predict_classes(encoded, verbose=0)
		# map predicted word index to word
		out_word = ''
		for word, index in tokenizer.word_index.items():
			if index == yhat:
				out_word = word
				break
		# append to input
		in_text += ' ' + out_word #add the output word to the sequence we are currently working with 
		result.append(out_word) #append the word to the result list
	return ' '.join(result) #' '.join puts a space between the words in the result list and returns them as one string

In [72]:

# generate new text
generated = generate_seq(model, tokenizer, seq_length_sent, seed_text_sentence, 20)
print(generated)


#save the generated text

out_filename = 'Data/first_output_sentence.txt'

save_doc([generated], out_filename) #save_doc expects a list of lines; passing the bare string would write one character per line

more man man themselves gone free gone do go man ask place fastened heard me afterward before you themselves about


# Things to look at in lecture 5 


Input sequences


We could process the data so that the model only ever deals with self-contained sentences and pad or truncate the text to meet this requirement for each input sequence. You could explore this as an extension to this tutorial.
- Instead, to keep the example brief, we will let all of the text flow together and train the model to predict the next word across sentences, paragraphs, and even books or chapters in the text.

- The file consists of multiple stories, so we need to divide it into different parts (it might be possible to split after the headers written in capital letters). We will do this in lesson 5.

- Find more fairytales; the data set is not big enough.

We will use two LSTM hidden layers with 100 memory cells each. More memory cells and a deeper network may achieve better results.