# Generative Models for Text using LSTMs
### Ques (a)
In this problem, we are trying to build a generative model to mimic the writing style of the prominent British mathematician, philosopher, prolific writer, and political activist Bertrand Russell.

In [1]:
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
from keras.optimizers import Adam
import time
from keras.callbacks import ModelCheckpoint
#from google.colab import files

Using TensorFlow backend.


In [2]:
window_size = 100

### Ques (b)
Download the following books from Project Gutenberg http://www.gutenberg.org/ebooks/author/355 in text format:
<ol>
<li> The Problems of Philosophy
<li> The Analysis of Mind
<li> Mysticism and Logic and Other Essays
<li> Our Knowledge of the External World as a Field for Scientific Method in Philosophy
</ol>

### Ques (c) i.
Concatenate your text files to create a corpus of Russell’s writings.
<br>The books are stripped of their headers and then concatenated into a single file, corpus.txt, which is read below.

In [3]:
inp = ""
with open("corpus.txt", encoding="utf-8") as fileobj:
    for line in fileobj:
        # Strip only the trailing newline so a final line without one is not truncated
        inp += " " + line.rstrip("\n")
inp = inp[:400000]
"number of chars = " + str(len(inp))

'number of chars = 400000'

I use only the first 400,000 characters because my kernel crashed with an out-of-memory error whenever I used more.
### Ques (c) ii.
Use a character-level representation for this model by using extended ASCII that has N = 256 characters. Each character will be encoded into an integer using its ASCII code. Rescale the integers to the range [0, 1], because LSTM uses a sigmoid activation function. LSTM will receive the rescaled integers as its input.<br>
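
As a small illustration of the encoding described in the question, the sketch below maps each character to its code point with `ord` and divides by 255 so that every value falls in [0, 1]. It assumes all characters have code points below 256, and `encode_scaled` is a hypothetical helper name that does not appear in the notebook. The notebook itself takes a slightly different route and indexes only the characters that actually occur in the corpus.

```python
import numpy as np

def encode_scaled(text, n_symbols=256):
    """Map characters to code points and rescale them to [0, 1].

    Hypothetical helper: assumes every character has a code point below n_symbols.
    """
    codes = np.array([ord(ch) for ch in text], dtype=np.float64)
    if codes.size and codes.max() >= n_symbols:
        raise ValueError("character outside the assumed alphabet")
    # Dividing by the largest possible code keeps the result inside [0, 1]
    return codes / (n_symbols - 1)

scaled = encode_scaled("Ab z")
print(scaled)
```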

In [4]:
unique_chars = list(set(inp))
unique_idx = []
for char in unique_chars:
    unique_idx.append(ord(char))

unique_idx = sorted(unique_idx)

unique_chars = []
for idx in unique_idx:
    unique_chars.append(chr(idx))

print("Number of unique chars =", len(unique_chars))
print(unique_chars)
char_to_ix = {}
ix_to_char = {}
print()

for index, char in enumerate(unique_chars):
    char_to_ix[char] = index
    ix_to_char[index] = char
    
print(char_to_ix)
print()
print(ix_to_char)

Number of unique chars = 87
[' ', '!', '"', '&', "'", '(', ')', '+', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '=', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'æ', 'é', 'î', 'ï', 'ö', 'ü', 'œ']

{' ': 0, '!': 1, '"': 2, '&': 3, "'": 4, '(': 5, ')': 6, '+': 7, ',': 8, '-': 9, '.': 10, '0': 11, '1': 12, '2': 13, '3': 14, '4': 15, '5': 16, '6': 17, '7': 18, '8': 19, '9': 20, ':': 21, ';': 22, '=': 23, '?': 24, 'A': 25, 'B': 26, 'C': 27, 'D': 28, 'E': 29, 'F': 30, 'G': 31, 'H': 32, 'I': 33, 'J': 34, 'K': 35, 'L': 36, 'M': 37, 'N': 38, 'O': 39, 'P': 40, 'Q': 41, 'R': 42, 'S': 43, 'T': 44, 'U': 45, 'V': 46, 'W': 47, 'X': 48, 'Y': 49, 'Z': 50, '[': 51, ']': 52, '_': 53, 'a': 54, 'b': 55, 'c': 56, 'd': 57, 'e': 58, 'f': 59, 'g':

We have 87 unique characters. They are sorted by character code, each one is mapped to its index (0 to 86) in that ordering, and the indices are divided by 87 to rescale them into the range [0, 1). The rescaled input is shown in the next cell.

In [5]:
lstm_input = []
for char in inp:
    lstm_input.append(char_to_ix[char])
lstm_input = np.array(lstm_input)
lstm_input_scaled = lstm_input / len(unique_chars)
print(lstm_input_scaled)

[0.         0.45977011 0.81609195 ... 0.68965517 0.09195402 0.        ]


The unscaled sequence of integer indices for all the characters is shown below.

In [6]:
print(lstm_input)

[ 0 40 71 ... 60  8  0]


### Ques (c) iii.
Choose a window size, e.g., W = 100.<br>
W = 100 was already set as `window_size` at the top of the notebook.


### Ques (c) iv.
Inputs to the network will be the first W − 1 = 99 characters of each sequence, and the output of the network will be the Wth character of the sequence. Basically, we are training the network to predict each character using the 99 characters that precede it. Slide the window in strides of S = 1 on the text. For example, if W = 5 and S = 1 and we want to train the network with the sequence ABRACADABRA, the first input to the network will be ABRA and the corresponding output will be C. The second input will be BRAC and the second output will be A, etc.
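
The ABRACADABRA example can be reproduced with a few lines of plain Python. This is only a sketch of the windowing idea on raw characters; `sliding_windows` is a hypothetical name, and the notebook's own function works on the rescaled integers instead.

```python
def sliding_windows(text, window_size, stride=1):
    """Yield (input, target) pairs: window_size - 1 input chars and the char that follows."""
    for start in range(0, len(text) - window_size + 1, stride):
        window = text[start:start + window_size]
        yield window[:-1], window[-1]

pairs = list(sliding_windows("ABRACADABRA", window_size=5))
for inputs, target in pairs[:3]:
    print(inputs, "->", target)
```

With W = 5 and S = 1, the 11-character string yields 11 − 5 + 1 = 7 input/target pairs.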

In [7]:
def get_x_y(inp, lstm_input_scaled, window_size, char_to_ix, ix_to_char, unique_chars):
    lstm_inp_stacked = []
    lstm_inp_y = []
    lstm_input_arr = []
    zeros = np.zeros(len(unique_chars))
    for i in lstm_input_scaled:
        lstm_input_arr.append([i])
    for i in range(len(lstm_input_scaled) - window_size):
        lstm_inp_stacked.append(lstm_input_arr[i:i+window_size-1])
        
        y = inp[i+window_size-1]
        y_ix = char_to_ix[y]
        zeros[y_ix] = 1
        lstm_inp_y.append(zeros.copy())
        zeros[y_ix] = 0
    return np.array(lstm_inp_stacked), np.array(lstm_inp_y)

### Ques (c) v.
Note that the output has to be encoded using a one-hot encoding scheme with N = 256 (or less) elements. This means that the network reads integers, but outputs a vector of N = 256 (or less) elements.

The inputs X and outputs Y are built below.
X is stacked into windows of width 99 (window_size - 1), and Y is the one-hot encoded representation of the target characters.
Their shapes are printed below.
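
For reference, one-hot encoding can also be done in a single vectorized step. The sketch below is not the notebook's implementation; `one_hot` is a hypothetical helper, and the three indices are just sample values in the range 0 to 86.

```python
import numpy as np

def one_hot(indices, num_classes):
    """Return a (len(indices), num_classes) array with a single 1 per row."""
    indices = np.asarray(indices)
    encoded = np.zeros((indices.size, num_classes), dtype=np.float32)
    # Row i gets a 1 in the column given by indices[i]
    encoded[np.arange(indices.size), indices] = 1.0
    return encoded

targets = one_hot([0, 40, 71], num_classes=87)
print(targets.shape)
print(targets.argmax(axis=1))
```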

In [8]:
lstm_inp_stacked, lstm_inp_y = get_x_y(inp, lstm_input_scaled, window_size, char_to_ix, ix_to_char, unique_chars)
lstm_inp_stacked.shape, lstm_inp_y.shape

((399900, 99, 1), (399900, 87))
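
These shapes give a rough idea of why memory becomes a problem on larger inputs. The arithmetic below assumes both arrays are stored as 64-bit floats (NumPy's default float type); it counts only the final arrays, not the intermediate Python lists built along the way, which likely add considerably more.

```python
import numpy as np

x_shape = (399900, 99, 1)
y_shape = (399900, 87)
bytes_per_value = np.dtype(np.float64).itemsize  # assumed dtype: 8 bytes per value

x_bytes = int(np.prod(x_shape)) * bytes_per_value
y_bytes = int(np.prod(y_shape)) * bytes_per_value
print(f"X: {x_bytes / 1e6:.1f} MB, Y: {y_bytes / 1e6:.1f} MB")
```

Under that assumption the two arrays together take roughly 0.6 GB, and the footprint grows linearly with the number of characters used.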

As a sanity check on the one-hot encoding, I print the position of the 1 in the first target vector. It is at index 0, which corresponds to a space.

In [9]:
print(np.where(lstm_inp_y[0] == max(lstm_inp_y[0])))

(array([0]),)


### Ques (c) vi.
Use a single hidden layer for the LSTM with N = 256 (or less) memory units.
### Ques (c) vii.
Use a Softmax output layer to yield a probability prediction for each of the characters between 0 and 1. This is actually a character classification problem with N classes. Choose log loss (cross entropy) as the objective function for the network (research what it means).
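
To make "log loss" concrete, here is a minimal NumPy sketch of softmax followed by categorical cross entropy, the mean over samples of −log(probability assigned to the true class). The function names and the toy logits are illustrative assumptions; the notebook relies on Keras' built-in `categorical_crossentropy` rather than this code.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 along the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean over samples of -log(probability assigned to the true class)."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=-1)))

logits = np.array([[2.0, 0.5, 0.1], [0.0, 0.0, 0.0]])  # toy scores for 3 classes
y_true = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # one-hot targets
probs = softmax(logits)
loss = categorical_cross_entropy(y_true, probs)
print(probs.sum(axis=1), loss)
```

A useful reference point: a model that guesses uniformly over 87 classes scores ln 87 ≈ 4.47 under this loss, so lower values indicate that the model has learned something about the character distribution.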

In [10]:
def create_model(input_shape, unique_chars):
    regressor = Sequential()
    regressor.add(LSTM(units = len(unique_chars), return_sequences = False, input_shape = (input_shape, 1)))
    
    regressor.add(Dropout(0.2))
    
    regressor.add(Dense(units = len(unique_chars), activation='softmax'))
    
    optimizer = Adam(lr=0.1, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0, amsgrad=False)
    
    regressor.compile(optimizer = optimizer, loss = 'categorical_crossentropy')
    return regressor

### Ques (c) viii.
We do not use a test dataset. We are using the whole training dataset to learn the probability of each character in a sequence. We are not seeking a very accurate model. Instead, we are interested in a generalization of the dataset that can mimic the gist of the text.
### Ques (c) ix.
Choose a reasonable number of epochs for training (e.g., 30, although the network will need more epochs to yield a better model).
<br>I chose 200 epochs. Fewer epochs produced garbled results consisting mostly of whitespace.
### Ques (c) x.
Use model checkpointing to save the network weights each time an improvement in loss is observed at the end of an epoch. Find the best set of weights in terms of loss.
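
The "save only on improvement" rule is simple enough to sketch without Keras. The snippet below is an illustration of the comparison logic only: `best_epoch_tracker` is a hypothetical name and the loss values are made up. In the notebook, this role is played by `ModelCheckpoint` with `save_best_only=True`.

```python
def best_epoch_tracker(epoch_losses):
    """Return (best_epoch, best_loss, epochs at which a checkpoint would be written)."""
    best_loss = float("inf")
    best_epoch = None
    saved_at = []
    for epoch, loss in enumerate(epoch_losses, start=1):
        if loss < best_loss:
            # Improvement: this is where the weights would be written to disk
            best_loss, best_epoch = loss, epoch
            saved_at.append(epoch)
    return best_epoch, best_loss, saved_at

# Made-up loss values, purely for illustration
losses = [2.91, 2.80, 2.85, 2.60, 2.61]
best_epoch, best_loss, saved_at = best_epoch_tracker(losses)
print(best_epoch, best_loss, saved_at)
```

Because each save overwrites the same file, the file on disk always holds the weights from the best epoch seen so far.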

In [11]:
filepath="weights.best.4lac.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

In [12]:
regressor = create_model(input_shape = window_size-1, unique_chars = unique_chars)
regressor

<keras.models.Sequential at 0x7ff73875f1d0>

In [13]:
regressor.fit(lstm_inp_stacked, lstm_inp_y, validation_split=0.33, epochs=200, shuffle=True, batch_size=1000, callbacks=callbacks_list, verbose=1)

Train on 267933 samples, validate on 131967 samples
Epoch 1/1

Epoch 00001: val_loss improved from inf to 2.91340, saving model to weights.best.4lac.hdf5


<keras.callbacks.History at 0x7ff738288a58>

#### NOTE:
Training was done on Google Colab, which provides a GPU. The final prediction and the formatting of the notebook were done on my local machine, so the training epochs are not printed below the cell above.<br>
The best model was saved to the weights.best.4lac.hdf5 file, which was then loaded into the regressor on my local machine.<br>
To demonstrate this, I print the final evaluation of the model after 200 epochs.

In [18]:
regressor.load_weights("weights.best.4lac.hdf5")
print("Final loss after 200 epochs :", regressor.evaluate(lstm_inp_stacked, lstm_inp_y))

Final loss after 200 epochs : 2.3951048636418575


To evaluate the model, I predict 300 characters from the training data itself.

In [17]:
pred = regressor.predict(np.array(lstm_inp_stacked[0:300]))
print("PREDICTION")
for vec in pred:
    # argmax picks the index of the most probable character
    print(ix_to_char[int(np.argmax(vec))], end="")
print("\nACTUAL")
for vec in lstm_inp_y[:300]:
    print(ix_to_char[int(np.argmax(vec))], end="")


 ahe paneee  n tf erto save an  e e  tf  edgtn the pane th the e tf   n   tf thesosophystn thesei th thi h w she     tn th   n e oh theete e   n  tf  e    tn  tne e         ah se tf e   tf er    tn     t  oh te  tf  to the rn  e  the  thete   the    of tetw edge of eot   tncene ceth t  oh   to e    
(300, 87)


### Ques (c) xi.
Use the network with the best weights to generate 1000 characters, using the following text as initialization of the network:
<br><br>
<i>There are those who take mental phenomena naively, just as they would physical phenomena. This school of psychologists tends not to emphasize the object.</i>

In [15]:
def get_next_char(sequence, regressor, char_to_ix, ix_to_char, unique_chars, window_size):
    lstm_input_ = []
    for char in sequence:
        lstm_input_.append(char_to_ix[char])
    lstm_input_ = np.array(lstm_input_)
    lstm_input_scaled_ = lstm_input_ / len(unique_chars)
    # Keep only the last window_size - 1 characters, shaped (1, window_size - 1, 1)
    lstm_input_scaled_ = [np.array([lstm_input_scaled_[-(window_size - 1):]]).T]
    pred = regressor.predict(np.array(lstm_input_scaled_))
    return_seq = ""
    for vec in pred:
        return_seq += ix_to_char[int(np.argmax(vec))]
    return return_seq

In [19]:
test_sequence = "There are those who take mental phenomena naively, just as they would physical phenomena. This school of psychologists tends not to emphas"
num_test_pred = 1000
result_str = ""
for i in range(num_test_pred):
    result_str += get_next_char(test_sequence + result_str, regressor, char_to_ix, ix_to_char, unique_chars, window_size)
print(test_sequence + result_str)

There are those who take mental phenomena naively, just as they would physical phenomena. This school of psychologists tends not to emphas the seese as the sore the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soene an the soe

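So the generated text quickly collapses into a repeating phrase ("the soene an"). Because the most probable character is always chosen, the output is fully determined by the current window, and once the window contents repeat, the generation loops.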
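
One common way to avoid such loops is to sample the next character from the predicted distribution instead of always taking the most probable one. This is not what the notebook does; the sketch below is an assumption-laden alternative, with `sample_index`, the `temperature` parameter, and the toy probabilities all introduced here for illustration.

```python
import numpy as np

def sample_index(probs, temperature=1.0, rng=None):
    """Draw a class index from probs after rescaling it with a temperature."""
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype=np.float64)
    # Lower temperature sharpens the distribution, higher temperature flattens it
    logits = np.log(np.clip(probs, 1e-12, 1.0)) / temperature
    logits -= logits.max()
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

rng = np.random.default_rng(0)
probs = [0.5, 0.3, 0.2]  # toy softmax output for three characters
draws = [sample_index(probs, temperature=1.0, rng=rng) for _ in range(1000)]
print(np.bincount(draws, minlength=3) / 1000)
```

In the generation loop, the sampled index would take the place of the argmax index when looking up `ix_to_char`. Whether this improves the text from this particular model would have to be checked by running it.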