## 1. Generative Models for Text
## (a) In this problem, we are trying to build a generative model to mimic the writing style of the prominent British mathematician, philosopher, prolific writer, and political activist Bertrand Russell.

In [1]:
import sys
import numpy as np
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

#config = tf.ConfigProto()
#config.gpu_options.allow_growth = True
#sess = tf.Session(config=config)
#import math
#from sklearn.preprocessing import MinMaxScaler

Using TensorFlow backend.


In [2]:
EPOCH = 50

## (b) Download the books. Open each file in a text editor and delete the header and footer. Try to keep only the text of the books and discard unwanted text before and after it, although in a large corpus such leftovers act as noise and should not cause major problems.

In [3]:
from urllib.request import urlopen

filename1 = 'https://raw.githubusercontent.com/seongohr/ML/master/mysticism_utf.txt'
filename2 = 'https://raw.githubusercontent.com/seongohr/ML/master/OurKnowledge_asc.txt'
filename3 = 'https://raw.githubusercontent.com/seongohr/ML/master/TheAnalysis_asc.txt'
filename4 = 'https://raw.githubusercontent.com/seongohr/ML/master/TheProblem_asc.txt'

raw_text = urlopen(filename1).read()
    

In [4]:
# decode the UTF-8 download, then re-encode as ASCII bytes, silently dropping non-ASCII characters
raw_text = raw_text.decode("utf-8")
raw_text = raw_text.encode("ascii", "ignore")

In [5]:
#raw_text

## (c) LSTM: Train an LSTM to mimic Russell's style and thoughts:
### i. Concatenate your text files to create a corpus of Russell's writings

In [6]:
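# the remaining three books are already ASCII, so their bytes can be appended directly
files = [filename2, filename3, filename4]

for f in files:
    temp = urlopen(f).read()
    raw_text = raw_text + temp
# lowercase everything to shrink the vocabulary
raw_text = raw_text.lower()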

In [7]:
raw_text



### ii. Use a character-level representation for this model by using extended ASCII that has N=256 characters. Each character will be encoded into an integer using its ASCII code. Rescale the integers to the range [0, 1], because LSTM uses a sigmoid activation function. LSTM will receive the rescaled integers as its input.

In [8]:
# create mapping of unique chars to integers, and a reverse mapping
# raw_text is a bytes object, so iterating over it yields integer ASCII codes rather than 1-character strings
chars = sorted(set(raw_text))
ascii_to_int = {c: i for i, c in enumerate(chars)}
int_to_ascii = {i: c for i, c in enumerate(chars)}

In [9]:
int_to_ascii

{0: 10,
 1: 13,
 2: 32,
 3: 33,
 4: 34,
 5: 38,
 6: 39,
 7: 40,
 8: 41,
 9: 42,
 10: 43,
 11: 44,
 12: 45,
 13: 46,
 14: 47,
 15: 48,
 16: 49,
 17: 50,
 18: 51,
 19: 52,
 20: 53,
 21: 54,
 22: 55,
 23: 56,
 24: 57,
 25: 58,
 26: 59,
 27: 61,
 28: 62,
 29: 63,
 30: 91,
 31: 93,
 32: 95,
 33: 97,
 34: 98,
 35: 99,
 36: 100,
 37: 101,
 38: 102,
 39: 103,
 40: 104,
 41: 105,
 42: 106,
 43: 107,
 44: 108,
 45: 109,
 46: 110,
 47: 111,
 48: 112,
 49: 113,
 50: 114,
 51: 115,
 52: 116,
 53: 117,
 54: 118,
 55: 119,
 56: 120,
 57: 121,
 58: 122,
 59: 123,
 60: 124,
 61: 125,
 62: 126}
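The mapping above assigns each distinct byte an index into the sorted vocabulary rather than using the raw ASCII code directly. As a minimal sketch of that idea on a toy corpus (`toy_text` and the `toy_*` names are hypothetical and not part of the notebook), the encoding and the later rescaling step look roughly like this:

```python
import numpy as np

# Toy corpus standing in for raw_text (hypothetical sample, not the real books).
toy_text = b"the cat sat"

# Iterating over bytes yields integer ASCII codes, so the vocabulary is a list of ints.
toy_chars = sorted(set(toy_text))
toy_ascii_to_int = {c: i for i, c in enumerate(toy_chars)}

# Encode every character as its vocabulary index, then divide by the vocabulary size.
# The largest index is len(toy_chars) - 1, so every rescaled value lies in [0, 1).
encoded = np.array([toy_ascii_to_int[c] for c in toy_text])
rescaled = encoded / float(len(toy_chars))

print(toy_chars)
print(encoded)
print(rescaled.min(), rescaled.max())
```

Dividing by the vocabulary size instead of 256 spreads the values over more of the unit interval, since only a small part of the ASCII range occurs in the corpus.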

In [10]:
# summarize the loaded data
n_chars = len(raw_text)
n_vocab = len(chars)
print("total characters: ", n_chars)
print('total vocab: ', n_vocab)

total characters:  1606095
total vocab:  63


In [11]:
# prepare the dataset of input to output pairs encoded as integers
window_size = 100
s = 1
dataX = []
dataY = []
for i in range(0, n_chars - (window_size - 1), s):
    seq_in = raw_text[i: i + (window_size - 1)]
    seq_out = raw_text[i + (window_size - 1)]
    dataX.append([ascii_to_int[char] for char in seq_in])
    dataY.append(ascii_to_int[seq_out])
n_patterns = len(dataX)
print('Total Patterns: ', n_patterns)

Total Patterns:  1605996


In [12]:
len(dataX[0])

99

In [13]:
# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, window_size - 1, 1))

# normalize
X = X / float(n_vocab)

# one hot encode the output variable
y = np_utils.to_categorical(dataY)

In [14]:
print(X)

[[[0.71428571]
  [0.9047619 ]
  [0.80952381]
  ...
  [0.52380952]
  [0.82539683]
  [0.82539683]]

 [[0.9047619 ]
  [0.80952381]
  [0.82539683]
  ...
  [0.82539683]
  [0.82539683]
  [0.58730159]]

 [[0.80952381]
  [0.82539683]
  [0.65079365]
  ...
  [0.82539683]
  [0.58730159]
  [0.71428571]]

 ...

 [[0.03174603]
  [0.03174603]
  [0.03174603]
  ...
  [0.9047619 ]
  [0.80952381]
  [0.65079365]]

 [[0.03174603]
  [0.03174603]
  [0.63492063]
  ...
  [0.80952381]
  [0.65079365]
  [0.55555556]]

 [[0.03174603]
  [0.63492063]
  [0.84126984]
  ...
  [0.65079365]
  [0.55555556]
  [0.50793651]]]


### iii. Choose a window size, e.g., W=100

In [15]:
window_size

100

### iv. Inputs to the network will be the first W-1=99 characters of each sequence, and the output of the network will be the Wth character of the sequence. Basically, we are training the network to predict each character using the 99 characters that precede it. Slide the window in strides of S=1 on the text.

In [16]:
s

1
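To make the windowing concrete, here is a small sketch of the same slicing logic applied to a short string. The helper name `make_windows` and the toy input are assumptions for illustration; the notebook itself uses the inline loop shown earlier with `window_size = 100`.

```python
def make_windows(text, window_size, stride=1):
    """Split text into (input, target) pairs: window_size - 1 input characters, 1 target."""
    inputs, targets = [], []
    for i in range(0, len(text) - (window_size - 1), stride):
        inputs.append(text[i: i + window_size - 1])
        targets.append(text[i + window_size - 1])
    return inputs, targets


# Toy example with W=4: three input characters predict the fourth.
toy_inputs, toy_targets = make_windows("russell", window_size=4)
for seq_in, seq_out in zip(toy_inputs, toy_targets):
    print(repr(seq_in), "->", repr(seq_out))
```

With stride 1, a text of length n yields n - (W - 1) pairs, which matches the 1605996 patterns reported above for n = 1606095 and W = 100.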

### v. Note that the output has to be encoded using a one-hot encoding scheme with N=256 (or less) elements. This means that the network reads integers, but outputs a vector of N=256 (or less) elements.

In [17]:
print('N in one-hot encoding : ', len(y[0]))

N in one-hot encoding :  63
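For reference, one-hot encoding can be sketched in plain NumPy as below. This is an illustrative stand-in, not the implementation behind `np_utils.to_categorical`; the function name `one_hot` and the toy labels are assumptions.

```python
import numpy as np


def one_hot(indices, num_classes):
    """Return a (len(indices), num_classes) array with a single 1.0 per row."""
    out = np.zeros((len(indices), num_classes), dtype=np.float32)
    # Row k gets a 1.0 in the column given by indices[k].
    out[np.arange(len(indices)), indices] = 1.0
    return out


# Three toy labels drawn from a 4-class vocabulary.
toy_y = one_hot([2, 0, 3], num_classes=4)
print(toy_y)
```

In the notebook the number of columns is 63 because that is the number of distinct characters left in the corpus after lowercasing and dropping non-ASCII bytes.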


### vi. Use a single hidden layer for the LSTM with N=256 (or less) memory units.

In [18]:
# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
#model.add(Dropout(0.2))


### vii. Use a Softmax output layer to yield a probability prediction for each of the characters between 0 and 1. This is actually a character classification problem with N classes. Choose log loss (Cross entropy) as the objective function for the network (research what it means)

### Answer : Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability distribution over the classes. For a single sample it equals the negative log of the probability the model assigns to the true class, so it increases as that probability moves away from 1. A perfect model would have a log loss of 0.
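A small numeric sketch may help here. The function below is a hand-written illustration of categorical cross-entropy for one sample, not Keras's implementation, and the probability vectors are made-up examples. It also computes the loss of a uniform guess over 63 classes, which gives a rough baseline for reading the training loss.

```python
import numpy as np


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy for one sample: -sum(y_true * log(y_pred))."""
    # Clip so that log(0) never occurs.
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(-np.sum(y_true * np.log(y_pred)))


y_true = np.array([0.0, 1.0, 0.0])         # the true class is index 1
confident = np.array([0.05, 0.90, 0.05])   # most probability on the true class
unsure = np.array([0.40, 0.20, 0.40])      # little probability on the true class
perfect = np.array([0.0, 1.0, 0.0])

loss_confident = categorical_cross_entropy(y_true, confident)
loss_unsure = categorical_cross_entropy(y_true, unsure)
loss_perfect = categorical_cross_entropy(y_true, perfect)

# Baseline: a uniform guess over the 63-character vocabulary has loss ln(63).
uniform_loss = float(np.log(63))

print(loss_confident, loss_unsure, loss_perfect, uniform_loss)
```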

In [19]:
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])


### viii. We do not use a test dataset. We are using the whole training dataset to learn the probability of each character in a sequence. We are not seeking a very accurate model. Instead, we are interested in a generalization of the dataset that can mimic the gist of the text.

### Answer : I'm using the whole dataset for training, with no held-out test set.

### ix. Choose a reasonable number of epochs for training, considering your computational power (e.g., 30, although the network will need more epochs to yield a better model)

### Answer : epoch = 20. Running more epochs is difficult on my laptop because training takes a long time.

### x. Use model checkpointing to save the network weights each time an improvement in loss is observed at the end of an epoch. Find the best set of weights in terms of loss.

In [20]:
# define the checkpoint
filepath = "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
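Because the checkpoint filename pattern embeds the epoch and the loss, the best weights can be found by parsing the saved filenames. The helper below is a sketch under that assumption; `best_checkpoint` is a hypothetical name, and the example filenames are invented to show the format rather than taken from a real run.

```python
import re

# Matches names produced by "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5".
CHECKPOINT_PATTERN = re.compile(r"weights-improvement-(\d+)-(\d+\.\d+)\.hdf5$")


def best_checkpoint(filenames):
    """Return the filename with the lowest loss encoded in its name, or None."""
    best_name, best_loss = None, float("inf")
    for name in filenames:
        match = CHECKPOINT_PATTERN.search(name)
        if match and float(match.group(2)) < best_loss:
            best_name, best_loss = name, float(match.group(2))
    return best_name


# Invented filenames, for illustration only.
example_files = [
    "weights-improvement-01-2.9012.hdf5",
    "weights-improvement-02-2.5530.hdf5",
    "weights-improvement-03-2.3127.hdf5",
    "notes.txt",
]
print(best_checkpoint(example_files))
```

In practice the list could come from `os.listdir('.')` in the directory where the checkpoints were written. Since `save_best_only=True` only writes a file when the loss improves, the most recently written checkpoint should also be the best one.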

In [21]:
# fit the model
model.fit(X, y, epochs=EPOCH, callbacks=callbacks_list, batch_size=500)

Epoch 1/50
  13000/1605996 [..............................] - ETA: 35:23 - loss: 3.3386 - acc: 0.1334

KeyboardInterrupt: 

In [22]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 256)               264192    
_________________________________________________________________
dense_1 (Dense)              (None, 63)                16191     
Total params: 280,383
Trainable params: 280,383
Non-trainable params: 0
_________________________________________________________________
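The parameter counts in the summary can be checked by hand. The sketch below uses the standard LSTM parameter formula (four gates, each with input weights, recurrent weights, and a bias); the helper names are assumptions introduced for this check.

```python
def lstm_param_count(units, input_dim):
    """Four gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * (units * (input_dim + units) + units)


def dense_param_count(inputs, outputs):
    """One weight per input-output pair plus one bias per output."""
    return inputs * outputs + outputs


# 256 LSTM units fed one feature per time step, then a 63-way softmax layer.
lstm_params = lstm_param_count(units=256, input_dim=1)
dense_params = dense_param_count(inputs=256, outputs=63)
print(lstm_params, dense_params, lstm_params + dense_params)
```

These values agree with the 264192, 16191, and 280,383 reported by `model.summary()`.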


### xi. Use the network with the best weights to generate 1000 characters, using the following text as initialization of the network: 

In [56]:
# load the best weights
filename = 'weights-improvement-20-1.7041.hdf5'  # update to the lowest-loss checkpoint after training
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')

### There are those who take mental phenomena naively, just as they would physical phenomena. This school of psychologists tends not to emphasize the object.

In [41]:
# seed text, lowercased to match the training corpus
seed = b'There are those who take mental phenomena naively, just as they would physical phenomena. This school of psychologists tends not to emphasize the object.'
seed = seed.lower()
print('Seed:')

# keep only the first window_size - 1 = 99 characters and map each byte to its vocabulary index
seq_in = seed[:window_size - 1]
seed = [ascii_to_int[char] for char in seq_in]
print(''.join([chr(int_to_ascii[value]) for value in seed]))

Seed:
there are those who take mental phenomena naively, just as they would physical phenomena. this scho


In [42]:
# generate characters one at a time, feeding each prediction back into the window
for i in range(1000):
    x = np.reshape(seed, (1, len(seed), 1))
    x = x / float(n_vocab)
    prediction = model.predict(x, verbose=0)
    # greedy decoding: always pick the most probable next character
    index = np.argmax(prediction)
    result = chr(int_to_ascii[index])
    sys.stdout.write(result)
    # slide the window forward by one character
    seed.append(index)
    seed = seed[1:]
print('\nDone')


the same tiing is a certain person which iase the somc of the somc of
the somc of the somc of the somc of the somc of the somc of the somc
dottterg of the pomntinn which ias been shen and the somc of the some
oeject of the somc of the somc of the somc of the somc of the somc of
the somc of the somc of the somc of the somc of the somc of the somc
dottterg of the pomntinn which ias been shen and the somc of the some
oeject of the somc of the somc of the somc of the somc of the somc of
the somc of the somc of the somc of the somc of the somc of the somc
dottterg of the pomntinn which ias been shen and the somc of the some
oeject of the somc of the somc of the somc of the somc of the somc of
the somc of the somc of the somc of the somc of the somc of the somc
dottterg of the pomntinn which ias been shen and the somc of the some
oeject of the somc of the somc of the somc of the somc of the somc of
the somc of the somc of the somc of the somc of the somc of the somc
dottterg 
Done
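The generated text falls into a repeating loop. One likely contributor is that the generation cell always takes the `argmax`, so the same 99-character window always produces the same next character. A common alternative, which the notebook does not use, is to sample from the predicted distribution with a temperature. The sketch below is one way to do that; `sample_with_temperature` and the temperature values are assumptions, not tuned settings.

```python
import numpy as np


def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Draw a class index from probs after sharpening or flattening it by temperature."""
    if rng is None:
        rng = np.random.default_rng()
    # Work in float64 so the renormalized probabilities sum to 1 accurately.
    probs = np.asarray(probs, dtype=np.float64)
    logits = np.log(np.clip(probs, 1e-12, 1.0)) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))


# Made-up distribution over three characters.
toy_probs = [0.1, 0.7, 0.2]
rng = np.random.default_rng(0)
nearly_greedy = [sample_with_temperature(toy_probs, 0.01, rng) for _ in range(20)]
varied = [sample_with_temperature(toy_probs, 1.0, rng) for _ in range(20)]
print(nearly_greedy)
print(varied)
```

In the generation loop, this would replace `index = np.argmax(prediction)` with something like `index = sample_with_temperature(prediction[0], temperature=0.5)`. A temperature near 0 approaches greedy decoding, while values near 1 sample from the model's distribution unchanged.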
