# A generative model


# Lesson 1 - Load data and get acquainted with the corpus


Step by step:

- Choose and download considerably amount of data from Gutenberg.org
- Look into the data, review some of the text
- Observe things (e.g. strange names), which must be considered when preparing the data


The first thing we did was to download Hans Christian Andersen fairytales from Gutenberg and look into some of the texts.

What do we need to look out for when cleaning the data:

- Punctuation 
- The file consist of multiple stories --> need to divide it into different parts (might be possible to split after headers written in capital letters) --> we will do this in lesson 5
- Capital letters as headers, might want to delete these
- Change everything to lower caser 
- Illustrations, delete them, written as ([Illustration: _His limbs were numbed, his beautiful eyes were closing.-])


In [None]:
#import packages
import re
import string
from numpy import array
from pickle import dump
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
from keras.models import load_model
from pickle import load
from random import randint
from keras.preprocessing.sequence import pad_sequences
import nltk.data
from nltk.tokenize import sent_tokenize, word_tokenize

from keras.preprocessing.sequence import pad_sequences

from keras.layers import Masking
import os


Documentation on the following coding block:
https://www.pythonforbeginners.com/files/reading-and-writing-files-in-python
- How to open text

In [61]:
#define a function to load the text into memory
def load_text(filename): #we decide that the function is called load_text
    file = open(filename, 'r') #open() is used to open/loading a filename 
    #the mode indicates how it should be opened, r = read, w = writing, a = appending 
    text = file.read() # read the file and assign it to the variable text
    file.close() #close the file again
    return text #ends function and specify what output the want, we want text

We have saved all the text files in a local folder on the computer. We need to first load these and then combine them into a single file. 

In [62]:
#define a path, the folder where the text files are placed
path = "/Users/signeklovekjaer/Documents/CognitiveScience/3.semester/Learning_with_digital_media/Exam_project/Jupiter_notebook/Data/"
#make a list that contains all the names of the files in the directory and assign it to a variable
filenames = os.listdir( path )

# print all the files in the directory 
for file in filenames:
   print(file)

.DS_Store
22693-0.txt
2435-0.txt
2591-0.txt
2892-0.txt
32571-0.txt
3454-0.txt
43212-0.txt
500-0.txt
503-0.txt
52521.txt
5615-0.txt
56914-0.txt
640-0.txt
641-0.txt
677-0.txt
7277-0.txt
7439-0.txt
902-0.txt
fairytales.txt
pg11027.txt
pg128.txt
pg14241.txt
pg17034.txt
pg17860.txt
pg19068.txt
pg19207.txt
pg19860.txt
pg28314.txt
pg29021.txt
pg3027.txt
pg3152.txt
pg31763.txt
pg33511.txt
pg33571.txt
pg34705.txt
pg37193.txt
pg4018.txt
pg4357.txt
pg51762.txt
pg708.txt
pg7871.txt
pg7885.txt


In [68]:
#use .remove() to remove filenames 
filenames.remove('.DS_Store');

filenames.remove('fairytales.txt');

filenames

['22693-0.txt',
 '2435-0.txt',
 '2591-0.txt',
 '2892-0.txt',
 '32571-0.txt',
 '3454-0.txt',
 '43212-0.txt',
 '500-0.txt',
 '503-0.txt',
 '52521.txt',
 '5615-0.txt',
 '56914-0.txt',
 '640-0.txt',
 '641-0.txt',
 '677-0.txt',
 '7277-0.txt',
 '7439-0.txt',
 '902-0.txt',
 'pg11027.txt',
 'pg128.txt',
 'pg14241.txt',
 'pg17034.txt',
 'pg17860.txt',
 'pg19068.txt',
 'pg19207.txt',
 'pg19860.txt',
 'pg28314.txt',
 'pg29021.txt',
 'pg3027.txt',
 'pg3152.txt',
 'pg31763.txt',
 'pg33511.txt',
 'pg33571.txt',
 'pg34705.txt',
 'pg37193.txt',
 'pg4018.txt',
 'pg4357.txt',
 'pg51762.txt',
 'pg708.txt',
 'pg7871.txt',
 'pg7885.txt']

In [69]:
#combine all the text files into a single file called "fairytales.text"
with open('fairytales.txt', 'w') as outfile: #specify where the file should be saved and what it should be called (if a file with that name already exists, it will be overwritten, 'w' is used to edit and write new information to a file 
    for text in filenames: #for all the text in the filenames, do the following...
        with open(text) as infile:
            outfile.write(infile.read())

In [70]:

# load document
in_filename = 'fairytales.txt'
doc = load_text(in_filename)
print(doc[:1000])

﻿Just as a little child holds out its hands to catch the sunbeams, to
feel and to grasp what, so its eyes tell it, is actually there, so,
down through the ages, men have stretched out their hands in eager
endeavour to know their God. And because only through the human was
the divine knowable, the old peoples of the earth made gods of their
heroes and not unfrequently endowed these gods with as many of the
vices as of the virtues of their worshippers. As we read the myths of
the East and the West we find ever the same story. That portion of the
ancient Aryan race which poured from the central plain of Asia,
through the rocky defiles of what we now call "The Frontier," to
populate the fertile lowlands of India, had gods who must once have
been wholly heroic, but who came in time to be more degraded than the
most vicious of lustful criminals. And the Greeks, Latins, Teutons,
Celts, and Slavonians, who came of the same mighty Aryan stock, did
even as those with whom they owned a common anc

# Lesson 2 - Prepare the training data

Step by step:
- Clean the text e.g. normalize words to lowercase and delete punctuation from words, to reduce vocabulary size
- Organize the words into sequences and save these to a file


In [71]:
#removing illustration descriptions
doc = re.sub(r'\[[^)]*\]', '', doc) #using re.sub function and regular expressions 
#[ - an opening bracket 
#[^()]* - zero or more characters other than those defined, that is, any characters other than [ and ]
#\] - a closing bracket



In [72]:
#remove headers
doc = re.sub(r'[A-Z]{2,}', '', doc) #remove capital letters everytime there is more than two 

print(doc[:1000]) #lets look at it 


﻿Just as a little child holds out its hands to catch the sunbeams, to
feel and to grasp what, so its eyes tell it, is actually there, so,
down through the ages, men have stretched out their hands in eager
endeavour to know their God. And because only through the human was
the divine knowable, the old peoples of the earth made gods of their
heroes and not unfrequently endowed these gods with as many of the
vices as of the virtues of their worshippers. As we read the myths of
the East and the West we find ever the same story. That portion of the
ancient Aryan race which poured from the central plain of Asia,
through the rocky defiles of what we now call "The Frontier," to
populate the fertile lowlands of India, had gods who must once have
been wholly heroic, but who came in time to be more degraded than the
most vicious of lustful criminals. And the Greeks, Latins, Teutons,
Celts, and Slavonians, who came of the same mighty Aryan stock, did
even as those with whom they owned a common anc

Documentation for the following code: https://www.geeksforgeeks.org/python-maketrans-translate-functions/

- How to use translate together with the helper function maketrans

In [73]:
#create function that can clean the text
def clean_doc(doc): #make a function called clean_doc
	# split into tokens by white space
	tokens = doc.split() #get a list with all words (tokens)
	# remove punctuation from each token
	table = str.maketrans('', '', string.punctuation) #third argument specifies the character we want to delete
	tokens = [w.translate(table) for w in tokens] #use translation mapping from table and loop through all words in token
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()] #isalpha() checks whether the string consists of alphabetic characters only. Replace word with word if it is alphabetical. 
	# make lower case
	tokens = [word.lower() for word in tokens] #replace word in lower case with word in token 
	return tokens

In [None]:
#use the cleaning function on the whole document
tokens = clean_doc(doc)
print(tokens[:20])
print('Total Tokens: %d' % len(tokens)) #print total number of tokens
print('Unique Tokens: %d' % len(set(tokens))) #group similar tokens and print the number

tokens[1]

Now the text is cleaned and we need to divide into sequences.

We have to decide the length of the input sequences, which is the number of words that the model will train on. This is also the number of words which will be used as the seed text, when using the model to actually generate new sequences. 

In the tutorial 50 words are suggested. We will start with that and then in the fifth lesson we might try to only use sequences inside sentences, though this will minimize the amount of training data. 

In [None]:
# organize the text into sequences of tokens
length = 50 + 1 #define the length of the sequences, 1 is the output word
sequences = list() #create an empty list, where we will store all the sequences

for i in range(length, len(tokens)): #range([start], stop[, step])
	# select sequence of tokens
	seq = tokens[i-length:i] #make sequences with length of 50 + 1
	# convert into a line
	line = ' '.join(seq) #join the tokens into a line in the sequence
	# store
	sequences.append(line) #append all the lines to the empty list created before the loop
print('Total Sequences: %d' % len(sequences)) #print the total amount (length of sequence) of sequences 


print(sequences[2])

In [None]:
# save tokens to file, one dialog per line
def save_doc(lines, filename): #define function called save_doc() which takes two inputs
	data = '\n'.join(lines) #join lines, separate by line shift 
	file = open(filename, 'w')
	file.write(data)
	file.close()

In [None]:
# save sequences to file
out_filename = 'Data/clean_text.txt'
save_doc(sequences, out_filename)#use the function on the sequences

# Lesson 3 - Train the generative model

Step by step:
 
- Load the cleaned data and divide it into training sequences
- Map each word to a unique integer i.e. word to vector representation
- Separate the input sequences into input and output elements



In [None]:
#load the cleaned data
doc = load_text('Data/clean_text.txt')

#split the text according to line shift (every string is 51 words)
lines = doc.split('\n') #use split() function, \n is a regular expression for line shift

#check the type
type(lines)#its a list

Documentation for getting keras and tensorflow packages to work:

https://keras.io/getting-started/functional-api-guide/

https://www.tensorflow.org/install/source

In [None]:
tokenizer = Tokenizer() #define function Tokenizer() as the variable tokenizer
tokenizer.fit_on_texts(lines) #fit_on_text() tells it what data to train on. We train it on the entire training data, finds all of the unique words in the data and assigns each a unique integer, from 1 to the total number of words
sequences = tokenizer.texts_to_sequences(lines) #convert each sequence from a list of words to a list of integers

print(sequences) #all sequences are represented by vectors, consisting of an integer corresponding to each word in the vector

tokenizer.word_index #check out the mapping of integers to words


In [None]:
#check vocabulary size
vocab_size = len(tokenizer.word_index) + 1 #+1 because indexing of arrays starts at zero, and the first word is assigned the integer of 1

vocab_size

Now we have divided the cleaned text into training sequences and assigned each to work to an integer, so that all the sequences are represented by vectors of intergers. 

Next step is to divide it into input and output sequences. The outputs are the +1, which we have added to each sequences in lesson 2. 

In [None]:

# separate into input and output
sequences = array(sequences) #create an array with the sequences
#separate into input and output sequences
X, y = sequences[:,:-1], sequences[:,-1] #X is the whole sequences without the last integer, y is the last integer in the sequence
print(X) #input
print(y) #output
y = to_categorical(y, num_classes=vocab_size) #convert to one-hot encoding, represent by 1 and 0
print(y)
seq_length = X.shape[1] #set sequence length to the number of columns in X, so to 50 in this case

# Lesson 4.a - Fit and use the generative model


Step by step:
 
- Fit and save the model
- Load the data and the model
- Generate text


First we will create the model, by adding layers to it

In [None]:
model = Sequential() #define model, use sequential model function
model.add(Embedding(vocab_size, 50, input_length=seq_length)) #add embedding layer, 50 is the dimensions which will be used to represent each word vector
model.add(LSTM(100, return_sequences=True)) #add hidden layer, long short term memory layer
model.add(LSTM(100)) #100 is the number of units in our hidden layer
model.add(Dense(100, activation='relu')) #dense specifies the structure of the neural network, 100 refer to that there are 100 neurons in the first hidden layer, as defined above. Relu is an argument of rectified linear unit.  
model.add(Dense(vocab_size, activation='softmax')) #softmax transform to probabilities 
print(model.summary())

In [None]:
#it's a multilevel predicter, because it's categorical, we want a measure which is accuracy, and a cost function which is optimized by adam
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']) 

Now we are ready to train the model.

In [None]:
# fit model
model.fit(X, y, batch_size=128, epochs=100)

Training the model took a long time, between 20 and 40 hours. 

In [None]:
# save the model to file
model.save('model.h5')
# save the tokenizer
dump(tokenizer, open('tokenizer.pkl', 'wb')) #w stands for write

The model is ready for use. First we will try just to generate one word at a time. After this we will create a function which generates a specified number of words. 

In [None]:
#need the text, so we can use a sequence as a source input for generating the text
doc = load_text('Data/clean_text.txt') #load cleaned sequences 

lines = doc.split('\n') #split by new line



In [None]:
#specify the sequence length
seq_length = len(lines[0].split()) - 1 #look at the first line, split sequences by words, count and minus 1

In [None]:
#load the model
model = load_model('model.h5')


# load the tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb')) #r stands for read

In [None]:
#select a seed text
seed_text = lines[randint(0,len(lines))] #randint returns a random integer from 0 to the length of lines, index it
print(seed_text + '\n') #print the selected seed text and make a line shift

In [None]:
#use tokenizer to assign an integer to every word in the seed text
encoded = tokenizer.texts_to_sequences([seed_text])[0]

#exclude the output word, so take the last 50 words
encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre') #pre indicates that it should take the values from the begining of the sequence

len(encoded)


In [None]:
#the model.preict_classes will return the index of the word with the highest probability 
yhat = model.predict_classes(encoded, verbose=0) #verbose specifies how  you want to 'see' the training progress for each epoch

In [None]:
#look up in the tokenizer what word the indexed word corresponds to
out_word = '' #empty variable to store the word output word in 
for word, index in tokenizer.word_index.items(): #loop through the tokenizers
	if index == yhat: #if the index is equally to yhat, assign the word to the variable out_word
		out_word = word
		break
        


Lets create a function

In [None]:
#the input sequences will get too long, when adding the output word, need to make sure that it's always 50 words long
encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre') #pre indicates that it should take the values from the begining of the sequence

In [None]:
#create a function for generating new sequences 
#the function takes 5 inputs
def generate_seq(model, tokenizer, seq_length, seed_text, n_words): #n_words is the number of words to generate 
	result = list() #create and empty list
	in_text = seed_text #assign the see_text to a variable 
	# generate a fixed number of words
	for _ in range(n_words): #range(), first argument is the number of integers to generate, starting from zero
		# encode the text as integer
		encoded = tokenizer.texts_to_sequences([in_text])[0] #take the first sequence
		# truncate sequences to a fixed length
		encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre') #make sure that the sequence is 50 words
		# predict probabilities for each word
		yhat = model.predict_classes(encoded, verbose=0)
		# map predicted word index to word
		out_word = ''
		for word, index in tokenizer.word_index.items():
			if index == yhat:
				out_word = word
				break
		# append to input
		in_text += ' ' + out_word #add the output word to the sequence we are currently working with 
		result.append(out_word) #append the word to the empty list
	return ' '.join(result) #''.join creates a space between each word in the result list and makes a string

In [None]:

# generate new text, use the function on the model
generated = generate_seq(model, tokenizer, seq_length, seed_text, 50)
print(generated)


#save the generated text
out_filename = 'Data/first_output1.txt'

save_doc(generated, out_filename)

# Lesson 5 - Improve Language Model


I have decided to create sequences which correspond to single sentences, instead of having sequences extending over different sentences. But for the model to work, all the sequences need to be the same length (e.g. some sentences have a length of 3 and others a length of 130). This means that we need to create padded sequences, where we add zeros in the beginning of each sequence, so all of them ends up being the same length.  


In [74]:
#load nltk library
nltk.download('punkt') #the library can use punctuation to divide the text into sentences

#define tokenizer, use pretrained from the corpus
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

#assign doc to data
data = doc

#use function sent_tokenize from the nltk package to divide the text into sentences
sentence_list = sent_tokenize(data)

#print the number of sentences
print(len(sentence_list))

#check the type, it's a list
type(sentence_list)

#index to see an example
sentence_list[5] #need to remove the punctuation and have in lower case




[nltk_data] Downloading package punkt to
[nltk_data]     /Users/signeklovekjaer/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
80842


'Originally they\ngave to their gods of their best.'

Now we need to clean the text and prepare the sequences as we did in lesson 2 and 3. We cannot use the exact same code, as the lines are now in unequal lengths. 

In [75]:
#check if cleaning function works on one sentence in the sentence list
print(clean_doc(sentence_list[3])) #it does

#make an empty list
sentence_clean_list = []

#make a loop which runs through the list of sentences and uses the clean_doc() function
for sentence in sentence_list:
    y = clean_doc(sentence) #save in y and append to empty list
    sentence_clean_list.append(y)
    

#print(sentence_clean_list[:10])

#check to see if the punctuation is removed and if it's in lower case
print(sentence_clean_list[5]) #it is

        

['that', 'portion', 'of', 'the', 'ancient', 'aryan', 'race', 'which', 'poured', 'from', 'the', 'central', 'plain', 'of', 'asia', 'through', 'the', 'rocky', 'defiles', 'of', 'what', 'we', 'now', 'call', 'the', 'frontier', 'to', 'populate', 'the', 'fertile', 'lowlands', 'of', 'india', 'had', 'gods', 'who', 'must', 'once', 'have', 'been', 'wholly', 'heroic', 'but', 'who', 'came', 'in', 'time', 'to', 'be', 'more', 'degraded', 'than', 'the', 'most', 'vicious', 'of', 'lustful', 'criminals']
['originally', 'they', 'gave', 'to', 'their', 'gods', 'of', 'their', 'best']


We need to check the length of the sentences, as we need to use this then padding the sentences into sequences. 

In [76]:
#check the length of a random sentence
print(len(sentence_clean_list[3]))

length = []
for sentence in sentence_clean_list:
    l = len(sentence)
    length.append(l)

    
print(max(length))#print the maximum length of a sentence

print(min(length))#print the minimum length of a sentence


58
348
0


In [77]:
#remove sentences less than 1 word long and longer than 130
sentence_new_list1 = [s for s in sentence_clean_list if len(s) >= 1] 
sentence_new_list = [s for s in sentence_new_list1 if len(s) <= 130]


length_new = []
for sentence in sentence_new_list:
    l = len(sentence)
    length_new.append(l)

    
print(max(length_new))

print(min(length_new))


#check how many sentences we have removed
print('Sentences removed after deleting sentences shorter than 1:', len(sentence_clean_list) - len(sentence_new_list1) )
print('Sentences removed after deleting sentences longer than :', len(sentence_new_list1) - len(sentence_new_list)) 



130
1
Sentences removed after deleting sentences shorter than 1: 393
Sentences removed after deleting sentences longer than : 97


In [78]:
# turn into (not separated) sentence, check method on number 5 in the list
line = ' '.join(sentence_new_list[5]) 

print(line) #it works

#create empty list
sequences_sent = []

#create a loop to do it at all the sentences
for sentence in sentence_new_list:
    line = ' '.join(sentence)
    sequences_sent.append(line)

print(sequences_sent[0])

originally they gave to their gods of their best
as a little child holds out its hands to catch the sunbeams to feel and to grasp what so its eyes tell it is actually there so down through the ages men have stretched out their hands in eager endeavour to know their god


In [79]:
# save the sentences to file, one dialog per line
def save_doc(lines, filename):
	data = '\n'.join(lines)
	file = open(filename, 'w')
	file.write(data)
	file.close()

In [81]:
# save sentences to file
out_filename = 'clean_text_sentences.txt'
save_doc(sequences_sent, out_filename)

In [82]:
#load the cleaned data with sentences
doc = load_text('clean_text_sentences.txt')

#split the text according to line shift
lines = doc.split('\n')

#check the type
type(lines)#its a list


print(lines[0])



as a little child holds out its hands to catch the sunbeams to feel and to grasp what so its eyes tell it is actually there so down through the ages men have stretched out their hands in eager endeavour to know their god


In [83]:
tokenizer = Tokenizer() #define function Tokenizer() as the variable tokenizer
tokenizer.fit_on_texts(lines) #fit_on_text() tells it what data to train on. We train it on the entire training data, finds all of the unique words in the data and assigns each a unique integer, from 1 to the total number of words
sequences_sent = tokenizer.texts_to_sequences(lines) #convert each sequence from a list of words to a list of integers

#print(sequences_sent[:100]) #all sequences are represented by vectors, consisting of an integer corresponding to each word in the vector

tokenizer.word_index #check out the mapping of integers to words

print(sequences_sent[0])

[18, 5, 49, 265, 4280, 36, 135, 262, 3, 860, 1, 4529, 3, 616, 2, 3, 4401, 51, 27, 135, 168, 156, 11, 31, 2426, 45, 27, 63, 145, 1, 4530, 210, 34, 832, 36, 50, 262, 8, 2387, 12099, 3, 125, 50, 639]


In the model all the input sequences need to be the same length. We therefore need to pad or truncate the different sentences. When you pad the sequences, you add zeros, so that they all become the same length (lenght of the longest sentence). In that case we will have to create an masking input layer which will ignore padded values. This means that padded inputs have no impact on learning. But we will start by padding all the sentences. 

In [84]:
#will padd zeros in front, pre is the default option
padded = pad_sequences(sequences_sent)
print(padded)


len(padded[3]) #all sequences are now 130 long



[[    0     0     0 ...   125    50   639]
 [    0     0     0 ...     4    50 14117]
 [    0     0     0 ...     1   214   347]
 ...
 [    0     0     0 ...     2  1308    82]
 [    0     0     0 ...     1   633   863]
 [    0     0     0 ...     4    13   242]]


130

In [85]:
#check vocabulary size
vocab_size = len(tokenizer.word_index) + 1 #+1 because indexing of arrays starts at zero, and the first word is assigned the integer of 1

vocab_size

28730

In [86]:
# separate into input and output
sequences_sent = array(padded) #create an array with the sequences

X, y = sequences_sent[:,:-1], sequences_sent[:,-1] #separate into input and output sequences
print(X) #input
print(y) #output
y = to_categorical(y, num_classes=vocab_size) #convert to one-hot encoding
print(y)
seq_length_sent = X.shape[1] #set sequence length to the number of columns in X

len(X[1])
len(y)

print(y[[3]])

print(max(y[3])) #check if there is one-hot encoding for the output word

print(min(y[3]))

print(len(X[1]))

[[   0    0    0 ...    3  125   50]
 [   0    0    0 ... 4402    4   50]
 [   0    0    0 ...  188    1  214]
 ...
 [   0    0    0 ... 1260    2 1308]
 [   0    0    0 ...   19    1  633]
 [   0    0    0 ...  255    4   13]]
[  639 14117   347 ...    82   863   242]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]]
1.0
0.0
129


In [87]:
#create the model
model = Sequential() #define model, use sequential model function
model.add(Masking(mask_value=0, input_shape=(seq_length_sent,))) #the masking layer, which is needed because of the padded sequences
model.add(Embedding(vocab_size, 50, input_length=seq_length_sent)) #add embedding layer, 50 is the dimensions which will be used to represent each word vector
model.add(LSTM(100, return_sequences=True)) #add hidden layer, long short term memory layer
model.add(LSTM(100))
model.add(Dense(100, activation='relu')) #dense specifies the structure of the neural network, 100 refer to that there are 100 neurons in the first hidden layer, as defined above. Relu is an argument of rectified linear unit.  
model.add(Dense(vocab_size, activation='softmax')) #softmax transform to probabilities 
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
masking_1 (Masking)          (None, 129)               0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 129, 50)           1436500   
_________________________________________________________________
lstm_1 (LSTM)                (None, 129, 100)          60400     
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_1 (Dense)              (None, 100)               10100     
_________________________________________________________________
dense_2 (Dense)              (None, 28730)             2901730   
Total params: 4,489,130
Trainable params: 4,489,130
Non-trainable params: 0
_________________________________________________________________


In [None]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']) #it's a multilevel predicter, because it's categorical, we want a measure which is accuracy, and a cost function which is optimized by adam
# fit model
model.fit(X, y, batch_size=128, epochs=100)

Epoch 1/100
 1664/80352 [..............................] - ETA: 24:05 - loss: 10.2595 - acc: 0.0186

In [None]:
# save the model to file
model.save('model.h5_sentence')
# save the tokenizer
dump(tokenizer, open('tokenizer.pkl_sentence', 'wb')) #w stands for write

In [None]:
#need the text, so we can use a sequence as a source input for generating the text
doc = load_text('Data/clean_text_sentences.txt') #load cleaned sequences 

lines = doc.split('\n') #split by new line


lines[5]

In [None]:
#specify the sequence length
seq_length_sent = X.shape[1]

print(seq_length_sent)

In [None]:
#load the model
model = load_model('model.h5_sentence')


# load the tokenizer
tokenizer = load(open('tokenizer.pkl_sentence', 'rb')) #r stands for read

In [None]:
#select a seed text
seed_text_sentence = lines[randint(0,len(lines))] #randint returns a random integer from 0 to the length of lines, index it
print(seed_text_sentence + '\n') #print the selected seed text and make a line shift

In [None]:
#use tokenizer to assign an integer to every word in the seed text
encoded_sentence = tokenizer.texts_to_sequences([seed_text_sentence])[0]

#pad the seed text
encoded_sentence = pad_sequences([encoded_sentence], maxlen=seq_length_sent, truncating='pre')

print(encoded_sentence)



In [None]:
#the model.preict_classes will return the index of the word with the highest probability 
yhat = model.predict_classes(encoded_sentence, verbose=0) #verbose specifies how  you want to 'see' the training progress for each epoch

yhat



In [None]:
#look up in the tokenizer what word the indexed word corresponds to

out_word = '' #empty variable to store the word output word in 
for word, index in tokenizer.word_index.items(): #loop through the tokenizers
	if index == yhat: #if the index is equally to yhat, assign the word to the variable out_word
		out_word = word
		break
        
out_word

In [None]:
#the input sequences will get too long, when adding the output word, need to make sure that it's always 129 words long
#encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre') #pre indicates that it should take the values from the begining of the sequence

In [None]:
#create a function for generating new sequences 
def generate_seq(model, tokenizer, seq_length_sent, seed_text_sentence, n_words): #n_words is the number of words to generate 
	result = list() #create and empty list
	in_text = seed_text_sentence #assign the see_text to a variable 
	# generate a fixed number of words
	for _ in range(n_words): #range(), first argument is the number of integers to generate, starting from zero
		# encode the text as integer
		encoded = tokenizer.texts_to_sequences([in_text])[0] #take the first sequence
		# truncate sequences to a fixed length
		encoded = pad_sequences([encoded], maxlen=seq_length_sent, truncating='pre') #make sure that the sequence is 50 words
		# predict probabilities for each word
		yhat = model.predict_classes(encoded, verbose=0)
		# map predicted word index to word
		out_word = ''
		for word, index in tokenizer.word_index.items():
			if index == yhat:
				out_word = word
				break
		# append to input
		in_text += ' ' + out_word #add the output word to the sequence we are currently working with 
		result.append(out_word) #append the word to the empty list
	return ' '.join(result) #''.join creates a space between each word in the result list and makes a string

In [None]:

# generate new text
generated = generate_seq(model, tokenizer, seq_length_sent, seed_text_sentence, 20)
print(generated)


#save the generated text

out_filename = 'Data/first_output_sentence.txt'

save_doc(generated, out_filename)

# The following coding blocks are NOT part of lesson 5

In [None]:
#use the function on the whole document
tokens = clean_doc(doc)
print(tokens[:20])
print('Total Tokens: %d' % len(tokens)) #print total number of tokens
print('Unique Tokens: %d' % len(set(tokens))) #group similar tokens and print the number

tokens[1]

In [None]:
# organize the text into sequences of tokens
length = 50 + 1 #define the length of the sequences
sequences = list() #create an empty list, where we will store all the sequences

for i in range(length, len(tokens)): #range([start], stop[, step])
	# select sequence of tokens
	seq = tokens[i-length:i] #make sequences with length of 50 + 1
	# convert into a line
	line = ' '.join(seq) #join the tokens into a line in the sequence
	# store
	sequences.append(line) #append all the lines to the empty list created before the loop
print('Total Sequences: %d' % len(sequences)) #print the total amount (length of sequence) of sequences 


print(sequences[2])

In [None]:
# save tokens to file, one dialog per line
def save_doc(lines, filename):
	data = '\n'.join(lines)
	file = open(filename, 'w')
	file.write(data)
	file.close()

In [None]:
# save sequences to file
out_filename = 'Data/clean_text.txt'
save_doc(sequences, out_filename)



# Lesson 3


In [None]:
#load the cleaned data
doc = load_text('Data/clean_text.txt')

#split the text according to line shift (every string is 51 words)
lines = doc.split('\n')

#check the type
type(lines)#its a list

#Documentation
https://keras.io/getting-started/functional-api-guide/

https://www.tensorflow.org/install/source

In [None]:
tokenizer = Tokenizer() #define function Tokenizer() as the variable tokenizer
tokenizer.fit_on_texts(lines) #fit_on_text() tells it what data to train on. We train it on the entire training data, finds all of the unique words in the data and assigns each a unique integer, from 1 to the total number of words
sequences = tokenizer.texts_to_sequences(lines) #convert each sequence from a list of words to a list of integers

print(sequences) #all sequences are represented by vectors, consisting of an integer corresponding to each word in the vector

tokenizer.word_index #check out the mapping of integers to words


In [None]:
#check vocabulary size
vocab_size = len(tokenizer.word_index) + 1 #+1 because indexing of arrays starts at zero, and the first word is assigned the integer of 1

vocab_size

In [None]:

# separate into input and output
sequences = array(sequences) #create an array with the sequences
X, y = sequences[:,:-1], sequences[:,-1] #separate into input and putput sequences
print(X) #input
print(y) #output
y = to_categorical(y, num_classes=vocab_size) #convert to one-hot encoding
print(y)
seq_length = X.shape[1] #set sequence length to the number of columns in X

In [None]:
model = Sequential() #define model, use sequential model function
model.add(Embedding(vocab_size, 50, input_length=seq_length)) #add embedding layer, 50 is the dimensions which will be used to represent each word vector
model.add(LSTM(100, return_sequences=True)) #add hidden layer, long short term memory layer
model.add(LSTM(100))
model.add(Dense(100, activation='relu')) #dense specifies the structure of the neural network, 100 refer to that there are 100 neurons in the first hidden layer, as defined above. Relu is an argument of rectified linear unit.  
model.add(Dense(vocab_size, activation='softmax')) #softmax transform to probabilities 
print(model.summary())

In [None]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']) #it's a multilevel predicter, because it's categorical, we want a measure which is accuracy, and a cost function which is optimized by adam
# fit model
model.fit(X, y, batch_size=128, epochs=100)

In [None]:
# save the model to file
model.save('model.h5')
# save the tokenizer
dump(tokenizer, open('tokenizer.pkl', 'wb')) #w stands for write

# Lesson 4

In [None]:
#need the text, so we can use a sequence as a source input for generating the text
doc = load_text('Data/clean_text.txt') #load cleaned sequences 

lines = doc.split('\n') #split by new line



In [None]:
#specify the sequence length
seq_length = len(lines[0].split()) - 1 #look at the first line, split sequences by words, count and minus 1

In [None]:
#load the model
model = load_model('model.h5')


# load the tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb')) #r stands for read

In [None]:
#select a seed text
seed_text = lines[randint(0,len(lines))] #randint returns a random integer from 0 to the length of lines, index it
print(seed_text + '\n') #print the selected seed text and make a line shift

In [None]:
#use tokenizer to assign an integer to every word in the seed text
encoded = tokenizer.texts_to_sequences([seed_text])[0]

#exclude the output word, so take the last 50 words
encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')

len(encoded)


In [None]:
#the model.preict_classes will return the index of the word with the highest probability 
yhat = model.predict_classes(encoded, verbose=0) #verbose specifies how  you want to 'see' the training progress for each epoch

In [None]:
#look up in the tokenizer what word the indexed word corresponds to

out_word = '' #empty variable to store the word output word in 
for word, index in tokenizer.word_index.items(): #loop through the tokenizers
	if index == yhat: #if the index is equally to yhat, assign the word to the variable out_word
		out_word = word
		break
        


In [None]:
#the input sequences will get too long, when adding the output word, need to make sure that it's always 50 words long
encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre') #pre indicates that it should take the values from the begining of the sequence

In [None]:
#create a function for generating new sequences 
def generate_seq(model, tokenizer, seq_length, seed_text, n_words): #n_words is the number of words to generate 
	result = list() #create and empty list
	in_text = seed_text #assign the see_text to a variable 
	# generate a fixed number of words
	for _ in range(n_words): #range(), first argument is the number of integers to generate, starting from zero
		# encode the text as integer
		encoded = tokenizer.texts_to_sequences([in_text])[0] #take the first sequence
		# truncate sequences to a fixed length
		encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre') #make sure that the sequence is 50 words
		# predict probabilities for each word
		yhat = model.predict_classes(encoded, verbose=0)
		# map predicted word index to word
		out_word = ''
		for word, index in tokenizer.word_index.items():
			if index == yhat:
				out_word = word
				break
		# append to input
		in_text += ' ' + out_word #add the output word to the sequence we are currently working with 
		result.append(out_word) #append the word to the empty list
	return ' '.join(result) #''.join creates a space between each word in the result list and makes a string

In [None]:

# generate new text
generated = generate_seq(model, tokenizer, seq_length, seed_text, 50)
print(generated)


#save the generated text

out_filename = 'Data/first_output1.txt'

save_doc(generated, out_filename)

# Things to look at in lecture 5 


Input sequences


We could process the data so that the model only ever deals with self-contained sentences and pad or truncate the text to meet this requirement for each input sequence. You could explore this as an extension to this tutorial.
- Instead, to keep the example brief, we will let all of the text flow together and train the model to predict the next word across sentences, paragraphs, and even books or chapters in the text.

- The file consist of multiple stories --> need to divide it into different parts (might be possible to split after headers written in capital letters) --> we will do this in lesson 5

- FInd more fairytales, data is not big enough

We will use a two LSTM hidden layers with 100 memory cells each. More memory cells and a deeper network may achieve better results.