# Generative Models for Text

### (a) In this problem, we are trying to build a generative model to mimic the writing style of the prominent British mathematician, philosopher, prolific writer, and political activist Bertrand Russell.


In [None]:
from __future__ import print_function
# Only the imports that the cells below actually use are kept
from keras.callbacks import ModelCheckpoint
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import LSTM
import keras
import numpy as np
import re

### (b) Download the following books from Project Gutenberg (http://www.gutenberg.org/ebooks/author/355) in text format:

i. The Problems of Philosophy 

ii. The Analysis of Mind 

iii. Mysticism and Logic and Other Essays 

iv. Our Knowledge of the External World as a Field for Scientific Method in Philosophy

Project Gutenberg adds a standard header and footer to each book, and these are not part of the original text. Open the file in a text editor and delete the header and footer. The header is obvious and ends with the text: ** START OF THIS PROJECT GUTENBERG EBOOK AN INQUIRY INTO MEANING AND TRUTH **

The footer is all of the text after the line that says: THE END

To have a better model, it is strongly recommended that you download the following books from the Internet Archive (https://archive.org) and convert them to text files:

i. The History of Western Philosophy https://archive.org/details/westernphilosophy4

ii. The Analysis of Matter https://archive.org/details/in.ernet.dli.2015.221533 

iii. An Inquiry into Meaning and Truth https://archive.org/details/BertrandRussell-AnInquaryIntoMeaningAndTruth

Try to use only the text of the books and discard unwanted text before and after it, although in a large corpus such text is just noise and should not cause big problems.
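The manual header and footer removal described above can also be scripted. The following is a minimal sketch, not the method used in this notebook: the function name `strip_gutenberg` is hypothetical, and it assumes that the header ends with a line starting with asterisks followed by `START OF`, and that the body ends at the last line reading `THE END`. Real files vary, so the result should still be checked by eye.

```python
import re

def strip_gutenberg(raw):
    """Return the book body between the Gutenberg header and footer.

    Assumption: the header ends with a '** START OF ...' line and the
    body ends at the last standalone 'THE END' line. If a marker is
    missing, that side of the text is left untouched.
    """
    # Header: everything up to and including the '** START OF ... **' line.
    start = re.search(r'^\*+ ?START OF.*$', raw, flags=re.MULTILINE)
    body = raw[start.end():] if start else raw
    # Footer: everything from the last standalone 'THE END' line onward.
    ends = list(re.finditer(r'^[ \t]*THE END[ \t]*$', body, flags=re.MULTILINE))
    if ends:
        body = body[:ends[-1].start()]
    return body.strip()

sample = (
    "Project Gutenberg license text\n"
    "** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE **\n"
    "Chapter I\nSome text.\n"
    "THE END\n"
    "End of the Project Gutenberg EBook\n"
)
print(strip_gutenberg(sample))
```

Note that this also drops the `THE END` line itself, which is harmless for a character-level corpus.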


In [0]:
# Read each book and lowercase it; `with` closes the file automatically
with open('TAM.txt', 'rt') as f:
    text0 = f.read().lower()
with open('TPP.txt', 'rt') as f:
    text1 = f.read().lower()
with open('OKEWFSMP.txt', 'rt') as f:
    text2 = f.read().lower()
with open('MLOE.txt', 'rt') as f:
    text4 = f.read().lower()
# THWP is read with a Latin-1 encoding
with open('THWP.txt', 'rt', encoding='ISO-8859-1') as f:
    text3 = f.read().lower()



In [0]:
text=text0+text1+text2+text4

In [7]:
text4[:100]

'\ufeffmysticism and logic and other essays\n\n\n\n\ni\n\nmysticism and logic\n\n\nmetaphysics, or the attempt to co'

In [0]:
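# Keep only letters, periods, and commas; everything else becomes a space
text = re.sub(r'[^a-zA-Z.,]', ' ', text)

In [0]:
# Collapse runs of whitespace into a single space
text = re.sub(r'\s+', ' ', text).strip()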

In [0]:
alpharray=list(text)

In [0]:
asciiarray=[ord(c) for c in alpharray]
scaledarray=[((c-32)/90) for c in asciiarray]

In [0]:
# round() before int() guards against floating-point error such as 96.9999 -> 96
tochar = {c: chr(int(round(c * 90 + 32))) for c in set(scaledarray)}
toascii = {chr(int(round(c * 90 + 32))): c for c in set(scaledarray)}
del alpharray

In [13]:
toascii

{' ': 0.0,
 ',': 0.13333333333333333,
 '.': 0.15555555555555556,
 'a': 0.7222222222222222,
 'b': 0.7333333333333333,
 'c': 0.7444444444444445,
 'd': 0.7555555555555555,
 'e': 0.7666666666666667,
 'f': 0.7777777777777778,
 'g': 0.7888888888888889,
 'h': 0.8,
 'i': 0.8111111111111111,
 'j': 0.8222222222222222,
 'k': 0.8333333333333334,
 'l': 0.8444444444444444,
 'm': 0.8555555555555555,
 'n': 0.8666666666666667,
 'o': 0.8777777777777778,
 'p': 0.8888888888888888,
 'q': 0.9,
 'r': 0.9111111111111111,
 's': 0.9222222222222223,
 't': 0.9333333333333333,
 'u': 0.9444444444444444,
 'v': 0.9555555555555556,
 'w': 0.9666666666666667,
 'x': 0.9777777777777777,
 'y': 0.9888888888888889,
 'z': 1.0}

In [0]:
toenc={}
inc=0
for c in sorted(toascii.values()):
    toenc.update({c:inc})
    inc+=1

In [0]:
decode={}
inc=0
for c in sorted(toascii.values()):
    decode.update({inc:c})
    inc+=1

In [0]:
chars=set(scaledarray)


iii. Choose a window size, e.g., W = 100.

iv. Inputs to the network will be the first W−1 = 99 characters of each sequence, and the output of the network will be the Wth character of the sequence. Basically, we are training the network to predict each character using the 99 characters that precede it. Slide the window in strides of S = 1 on the text. For example, if W = 5 and S = 1 and we want to train the network with the sequence ABRACADABRA, the first input to the network will be ABRA and the corresponding output will be C. The second input will be BRAC and the second output will be A, etc.

v. Note that the output has to be encoded using a one-hot encoding scheme with N = 256 (or fewer) elements. This means that the network reads integers, but outputs a vector of N = 256 (or fewer) elements.

vi. Use a single hidden layer for the LSTM with N = 256 (or fewer) memory units.
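The windowing and one-hot steps above can be illustrated on the ABRACADABRA example. This is a small sketch rather than the notebook's own code; the helper names `make_windows` and `one_hot` are hypothetical, and the alphabet is simply the sorted set of characters in the toy string.

```python
def make_windows(text, window, stride=1):
    """Split text into (input, target) pairs using a sliding window.

    Each input holds the first window-1 characters of a window and the
    target is the window's final character.
    """
    inputs, targets = [], []
    for start in range(0, len(text) - window + 1, stride):
        chunk = text[start:start + window]
        inputs.append(chunk[:-1])
        targets.append(chunk[-1])
    return inputs, targets

def one_hot(char, alphabet):
    """Encode a character as a one-hot list over an ordered alphabet."""
    return [1 if c == char else 0 for c in alphabet]

inputs, targets = make_windows("ABRACADABRA", window=5, stride=1)
alphabet = sorted(set("ABRACADABRA"))  # ['A', 'B', 'C', 'D', 'R']
for x, y in list(zip(inputs, targets))[:3]:
    print(x, '->', y, one_hot(y, alphabet))
```

With W = 5 the first two pairs are ABRA -> C and BRAC -> A, matching the description in item iv.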

In [17]:
N = 99

def nSentences(fullarray,maxlen):
    step = 1
    sentences = []
    next_chars = []
    for i in range(0, len(fullarray) - maxlen, step):
        sentences.append(fullarray[i: i + maxlen])
        next_chars.append(fullarray[i + maxlen])
    print('nb sequences:', len(sentences))
    return sentences,next_chars
sentences,next_char=nSentences(scaledarray,N)
del scaledarray

nb sequences: 1558506


In [18]:
print('Produce X and Y to train')
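# np.bool was removed from recent NumPy releases; the builtin bool is equivalent here
x = np.zeros((len(sentences), N, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    # One-hot encode each of the N input characters
    for k, char in enumerate(sentence):
        x[i, k, toenc[char]] = 1
    # One-hot encode the target character once per sequence
    y[i, toenc[next_char[i]]] = 1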
x.shape

Produce X and Y to train


(1558506, 99, 29)

In [0]:
del sentences

In [20]:
model = Sequential()
model.add(LSTM(256, input_shape=(N, len(chars))))
model.add(Dropout(0.4))
model.add(Dense(len(chars),activation='softmax'))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 256)               292864    
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 29)                7453      
Total params: 300,317
Trainable params: 300,317
Non-trainable params: 0
_________________________________________________________________
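The parameter counts in the summary can be checked by hand. An LSTM layer has four gates, each with input weights, recurrent weights, and a bias, which gives 4 × ((input_dim + units) × units + units) parameters; a dense layer has inputs × outputs + outputs. The helper names below are hypothetical and only reproduce this arithmetic.

```python
def lstm_params(input_dim, units):
    # Four gates (input, forget, cell, output), each with input weights,
    # recurrent weights, and a bias vector.
    return 4 * ((input_dim + units) * units + units)

def dense_params(inputs, outputs):
    # One weight per input-output pair plus one bias per output.
    return inputs * outputs + outputs

n_chars, units = 29, 256
print(lstm_params(n_chars, units))    # 292864
print(dense_params(units, n_chars))   # 7453
print(lstm_params(n_chars, units) + dense_params(units, n_chars))  # 300317
```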




x. Use model checkpointing to save the network weights each time an improvement in loss is observed at the end of an epoch. Find the best set of weights in terms of loss.


In [0]:
# define the checkpoint
filepath="weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
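Because the loss is embedded in each checkpoint filename, the best set of weights can be found by parsing the names. The sketch below assumes the `weights-improvement-{epoch:02d}-{loss:.4f}.hdf5` pattern defined above; the function name `best_checkpoint` is hypothetical.

```python
import re

def best_checkpoint(filenames):
    """Return the checkpoint filename with the lowest loss in its name.

    Assumes names follow 'weights-improvement-{epoch:02d}-{loss:.4f}.hdf5'.
    Names that do not match the pattern are ignored.
    """
    pattern = re.compile(r'weights-improvement-(\d+)-(\d+\.\d+)\.hdf5$')
    scored = []
    for name in filenames:
        match = pattern.search(name)
        if match:
            scored.append((float(match.group(2)), name))
    if not scored:
        return None
    return min(scored)[1]

names = [
    "weights-improvement-01-2.5040.hdf5",
    "weights-improvement-02-1.9688.hdf5",
    "weights-improvement-03-1.7560.hdf5",
]
print(best_checkpoint(names))  # weights-improvement-03-1.7560.hdf5
```

In practice the list of names could come from `os.listdir('.')`. With `save_best_only=True` the most recently written file is also the best one, so this is mainly a convenience.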

vii. Use a Softmax output layer to yield a probability prediction for each of the characters between 0 and 1. This is actually a character classification problem with N classes. Choose log loss (cross entropy) as the objective function for the network (research what it means).

In [0]:
#optimizer = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)
model.compile(loss='categorical_crossentropy', optimizer='adam')

**Cross Entropy**

Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label. So predicting a probability of .012 when the actual observation label is 1 would be bad and result in a high loss value. A perfect model would have a log loss of 0.

Consider the range of possible loss values for a true observation (label 1). As the predicted probability approaches 1, log loss slowly decreases. As the predicted probability decreases, however, the log loss increases rapidly. Log loss penalizes both types of errors, but especially those predictions that are confident and wrong!

Cross-entropy and log loss are slightly different depending on context, but in machine learning when calculating error rates between 0 and 1 they resolve to the same thing.


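In binary classification, where the number of classes M equals 2, cross-entropy for a true label y and predicted probability p can be calculated as:

$$-\big(y \log(p) + (1 - y)\log(1 - p)\big)$$

If M>2 (i.e. multiclass classification), we calculate a separate loss for each class label per observation and sum the result:

$$-\sum_{c=1}^{M} y_{c} \log(p_{c})$$

where $y_{c}$ is 1 if class c is the correct label for the observation and 0 otherwise, and $p_{c}$ is the predicted probability of class c.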
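The two formulas can be tried on a few numbers. This is a minimal sketch in plain Python, assuming every probability passed in is strictly between 0 and 1; the function names are hypothetical and this is not how Keras computes `categorical_crossentropy` internally.

```python
import math

def binary_cross_entropy(y, p):
    """Log loss for one observation with true label y (0 or 1) and predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def categorical_cross_entropy(y_onehot, probs):
    """Log loss for one observation over M classes with a one-hot label."""
    # Only the true class contributes, since every other y is 0.
    return -sum(y * math.log(p) for y, p in zip(y_onehot, probs) if y)

# A confident wrong prediction (p = 0.012 for a true label of 1) is penalized heavily.
print(binary_cross_entropy(1, 0.012))
# A confident correct prediction gives a loss close to 0.
print(binary_cross_entropy(1, 0.9))
# Multiclass case: the true class is the second of three.
print(categorical_cross_entropy([0, 1, 0], [0.2, 0.7, 0.1]))
```

The first value is roughly 4.42 and the second roughly 0.11, which shows how sharply the loss grows as the probability assigned to the true class shrinks.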



### (c) LSTM: Train an LSTM to mimic Russell’s style and thoughts:

i. Concatenate text files to create a corpus of Russell’s writings.

ii. Use a character-level representation for this model by using extended ASCII that has N = 256 characters. Each character will be encoded into an integer using its ASCII code. Rescale the integers to the range [0,1], because LSTM uses a sigmoid activation function. LSTM will receive the rescaled integers as its input.
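The rescaling in step ii can be sketched as follows. This version divides by 255 to cover the full extended-ASCII range, which is an assumption made for illustration; the cells in this notebook instead map codes 32 to 122 onto [0, 1] with `(c - 32) / 90`. The function names `rescale_chars` and `restore_chars` are hypothetical.

```python
def rescale_chars(text, n_codes=256):
    """Map each character to its code point rescaled to [0, 1]."""
    return [ord(c) / (n_codes - 1) for c in text]

def restore_chars(values, n_codes=256):
    """Invert rescale_chars(); round() avoids truncating values like 96.9999."""
    return ''.join(chr(int(round(v * (n_codes - 1)))) for v in values)

scaled = rescale_chars("mind.")
print(scaled)
print(restore_chars(scaled))
```

Rounding before converting back to an integer matters because a float round trip is not guaranteed to reproduce the exact code point.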

In [23]:
model.fit(x,y, epochs=50, batch_size=5000,callbacks=callbacks_list)

Epoch 1/50

Epoch 00001: loss improved from inf to 2.50396, saving model to weights-improvement-01-2.5040.hdf5
Epoch 2/50

Epoch 00002: loss improved from 2.50396 to 1.96880, saving model to weights-improvement-02-1.9688.hdf5
Epoch 3/50

Epoch 00003: loss improved from 1.96880 to 1.75601, saving model to weights-improvement-03-1.7560.hdf5
Epoch 4/50

Epoch 00004: loss improved from 1.75601 to 1.61138, saving model to weights-improvement-04-1.6114.hdf5
Epoch 5/50

Epoch 00005: loss improved from 1.61138 to 1.51292, saving model to weights-improvement-05-1.5129.hdf5
Epoch 6/50

Epoch 00006: loss improved from 1.51292 to 1.44123, saving model to weights-improvement-06-1.4412.hdf5
Epoch 7/50

Epoch 00007: loss improved from 1.44123 to 1.38760, saving model to weights-improvement-07-1.3876.hdf5
Epoch 8/50

Epoch 00008: loss improved from 1.38760 to 1.34485, saving model to weights-improvement-08-1.3449.hdf5
Epoch 9/50

Epoch 00009: loss improved from 1.34485 to 1.31080, saving model to weig


Epoch 00035: loss improved from 1.06346 to 1.05985, saving model to weights-improvement-35-1.0598.hdf5
Epoch 36/50

Epoch 00036: loss improved from 1.05985 to 1.05727, saving model to weights-improvement-36-1.0573.hdf5
Epoch 37/50

Epoch 00037: loss improved from 1.05727 to 1.05432, saving model to weights-improvement-37-1.0543.hdf5
Epoch 38/50

Epoch 00038: loss improved from 1.05432 to 1.05045, saving model to weights-improvement-38-1.0505.hdf5
Epoch 39/50

Epoch 00039: loss improved from 1.05045 to 1.04720, saving model to weights-improvement-39-1.0472.hdf5
Epoch 40/50

Epoch 00040: loss improved from 1.04720 to 1.04506, saving model to weights-improvement-40-1.0451.hdf5
Epoch 41/50

Epoch 00041: loss improved from 1.04506 to 1.04230, saving model to weights-improvement-41-1.0423.hdf5
Epoch 42/50

Epoch 00042: loss improved from 1.04230 to 1.03977, saving model to weights-improvement-42-1.0398.hdf5
Epoch 43/50

Epoch 00043: loss improved from 1.03977 to 1.03723, saving model to wei

<keras.callbacks.History at 0x7fddaa1e9390>

In [78]:
ques="There are those who take mental phenomena naively, just as they would physical phenomena. This school of psychologists tends not to emphasize the object."
ques=ques.lower()
ques=ques[-N:]
len(ques)

99

In [0]:
testasciiarray=[ord(c) for c in ques]
testscaledarray=[((c-32)/90) for c in testasciiarray]

In [0]:
testsentences=[testscaledarray]
x_t = np.zeros((len(testsentences), N, len(chars)), dtype=bool)
for i, tsentence in enumerate(testsentences):
    # One-hot encode each character of the seed text
    for k, char in enumerate(tsentence):
        x_t[i, k, toenc[char]] = 1

In [0]:
x_t.shape
prediction = model.predict(x_t, verbose=0)

viii. We do not use a test dataset. We are using the whole training dataset to learn the probability of each character in a sequence. We are not seeking a very accurate model. Instead, we are interested in a generalization of the dataset that can mimic the gist of the text.

ix. Choose a reasonable number of epochs for training (e.g., 30, although the network will need more epochs to yield a better model). 

## Result after a few initial epochs

In [99]:
print(ques)
print("Generated text :")
for i in range(1000):
    prediction = model.predict(x_t, verbose=0)
    print(tochar[decode[np.argmax(prediction)]],end="")
    x_t[0][:-1]=x_t[0][1:]
    k=np.zeros(len(chars), dtype=bool)
    k[np.argmax(prediction)]=1
    x_t[0][-1]=k

t as they would physical phenomena. this school of psychologists tends not to emphasize the object.
Generated text :
ing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thing in the same thi

xi. Use the network with the best weights to generate 1000 characters, using the following text as initialization of the network: There are those who take mental phenomena naively, just as they would physical phenomena. This school of psychologists tends not to emphasize the object. 
  

## Improvement after several epochs:

In [88]:
print(ques)
print("Generated text :")
for i in range(1000):
    prediction = model.predict(x_t, verbose=0)
    print(tochar[decode[np.argmax(prediction)]],end="")
    x_t[0][:-1]=x_t[0][1:]
    k=np.zeros(len(chars), dtype=bool)
    k[np.argmax(prediction)]=1
    x_t[0][-1]=k

t as they would physical phenomena. this school of psychologists tends not to emphasize the object.
Generated text :
s and the sense data which are the same thing to the problem of the proposition that the sense data is a proposition as the problem of the proposition is that the same thing is the same thing as the true interests of the sense data and the same thing as the subject of the sense data and the same thing as the subjective content of the proposition in the same particular theory of propositions and the sense data which are the same thing to the problem of the proposition that the sense data is a proposition as the problem of the proposition is that the same thing is the same thing as the true interests of the sense data and the same thing as the subject of the sense data and the same thing as the subjective content of the proposition in the same particular theory of propositions and the sense data which are the same thing to the problem of the proposition that the sense data

In [47]:
print(ques)
print("Generated text :")
for i in range(400):
    prediction = model.predict(x_t, verbose=0)
    print(tochar[decode[np.argmax(prediction)]],end="")
    x_t[0][:-1]=x_t[0][1:]
    k=np.zeros(len(chars), dtype=bool)
    k[np.argmax(prediction)]=1
    x_t[0][-1]=k

t as they would physical phenomena. this school of psychologists tends not to emphasize the object.
Generated text :
m of the proposition is that the same thing is the same thing as the true interests of the sense data and the same thing as the subject of the sense data and the same thing as the subjective content of the proposition in the same particular theory of propositions and the sense data which are the same thing to the problem of the proposition that the sense data is a proposition as the problem of the

In [61]:
print(ques)
print("Generated text :")
for i in range(1000):
    prediction = model.predict(x_t, verbose=0)
    print(tochar[decode[np.argmax(prediction)]],end="")
    x_t[0][:-1]=x_t[0][1:]
    k=np.zeros(len(chars), dtype=bool)
    k[np.argmax(prediction)]=1
    x_t[0][-1]=k

t as they would physical phenomena. this school of psychologists tends not to emphasize the object.
Generated text :
ame thing is the same thing to be a property of the proposition that there is a constituent of the proposition in the same particular thing and the sense data of sense data, and therefore there is a constituent of the proposition that there is a constituent of the proposition that there is a constituent of the proposition that there is a constituent of the proposition that there is a constituent of the proposition that there is a constituent of the proposition that there is a constituent of the proposition that there is a constituent of the proposition that there is a constituent of the proposition that there is a constituent of the proposition that there is a constituent of the proposition that there is a constituent of the proposition that there is a constituent of the proposition that there is a constituent of the proposition that there is a constituent of the proposi

In [77]:
print(ques)
print("Generated text :")
for i in range(1000):
    prediction = model.predict(x_t, verbose=0)
    print(tochar[decode[np.argmax(prediction)]],end="")
    x_t[0][:-1]=x_t[0][1:]
    k=np.zeros(len(chars), dtype=bool)
    k[np.argmax(prediction)]=1
    x_t[0][-1]=k

d a specially intimate relation. there was a military aristocracy, and also a priestly aristocracy.
Generated text :
ing is the same thing as the true interests of the sense data and the same thing as the subject of the sense data and the same thing as the subjective content of the proposition in the same particular theory of propositions and the sense data which are the same thing to the problem of the proposition that the sense data is a proposition as the problem of the proposition is that the same thing is the same thing as the true interests of the sense data and the same thing as the subject of the sense data and the same thing as the subjective content of the proposition in the same particular theory of propositions and the sense data which are the same thing to the problem of the proposition that the sense data is a proposition as the problem of the proposition is that the same thing is the same thing as the true interests of the sense data and the same thing as the subject of 

In [0]:
#Load backed up model
filename = "weights-improvement-03-1.7560.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')

## Conclusion
#### The LSTM repeats the most frequently occurring elements, such as spaces and articles, when trained for only a few epochs.
#### As the number of neurons and the number of epochs increase, the model learns more words and their usage. Although phrases are still repeated later in the output, I was finally able to generate slightly meaningful text.
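The repeated phrases are typical of greedy decoding, where the generation loop always picks the argmax character and can therefore fall into a cycle. A commonly used alternative, which was not tried in this notebook, is to sample the next character from the predicted distribution, optionally reshaped by a temperature. The sketch below is plain Python under that assumption; `sample_index` is a hypothetical helper, and `probs` stands in for one row of `model.predict` output.

```python
import math
import random

def sample_index(probs, temperature=1.0, rng=random):
    """Sample an index from probs after temperature scaling.

    temperature < 1 sharpens the distribution toward the argmax;
    temperature > 1 flattens it toward uniform.
    """
    # Rescale log-probabilities; the small epsilon avoids log(0).
    logits = [math.log(p + 1e-12) / temperature for p in probs]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]  # subtract max for stability
    total = sum(weights)
    scaled = [w / total for w in weights]
    # Draw one index according to the rescaled probabilities.
    return rng.choices(range(len(scaled)), weights=scaled, k=1)[0]

rng = random.Random(0)
probs = [0.1, 0.7, 0.2]
draws = [sample_index(probs, temperature=0.5, rng=rng) for _ in range(1000)]
print([draws.count(i) for i in range(3)])
```

In the generation cells this would replace `np.argmax(prediction)` with something like `sample_index(prediction[0])`. Whether it improves the text here is untested.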

Future Work:

Use one-hot encoding for the input sequence. Use a large number of epochs, e.g., 150. Add dropout to the network, and use a deeper LSTM (e.g. with 3 or more layers). Generate 3000 characters using the above initialization and check if we get more meaningful text. 
        
Train a Hidden Markov Model with V hidden states and V possible outputs using the Baum-Welch algorithm (or any other modern algorithm that is available) on the Russell corpus, where V is the number of distinct words in the corpus.