# CS 533 Final Project

Sakib Jalal, Tanya Balaraju, Aditya Geria

Professor Matthew Stone - Natural Language, Spring 2018

## Deep Learning Text Generation

Through this project, we aimed to generate text based on a given corpus (in this case, J.K. Rowling's *Harry Potter* series) using an RNN (Recurrent Neural Network) consisting of multiple LSTM layers. The RNN was built using Python/Keras, and training was done on an [AWS g2.8xlarge EC2 instance](https://aws.amazon.com/blogs/aws/new-g2-instance-type-with-4x-more-gpu-power/). 

Originally, the decision to generate *Harry Potter*-like text was based on the ease of performing qualitative analysis on the outcome--the training data is from a single, popular author with a distinctive style, making it easy to compare the RNN's output to the original text. After this initial analysis, we planned to train the RNN on news articles to generate news and analyze the "tendencies" of news articles. However, due to financial limitations (we ran out of AWS credit), this is not possible for the time being.

We used [LSTM](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) (Long Short-Term Memory) cells in our model because of the [promising past results](http://arxiv.org/pdf/1503.04069.pdf) reported for them. LSTMs improve on vanilla RNNs, which carry information forward only through a hidden state that is overwritten at every step, by maintaining a separate cell state with gates that control what is written, kept, and forgotten. We also read about GRU (Gated Recurrent Unit) cells, which merge the 'forgetting' and 'remembering' gates into a single 'update' gate, but decided to experiment only with LSTMs in this project.
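
To make the gating idea concrete, here is a minimal NumPy sketch of a single LSTM time step. It is an illustration only, not the Keras implementation: the function name `lstm_step`, the weight names `W`, `U`, `b`, and the order in which the four gates are stacked are all assumptions made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step (illustrative sketch).

    x: (input_dim,), h_prev and c_prev: (hidden_dim,)
    W: (4*hidden_dim, input_dim), U: (4*hidden_dim, hidden_dim), b: (4*hidden_dim,)
    Assumed gate order in the stacked weights: input, forget, candidate, output.
    """
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0*n:1*n])   # input gate: how much new information to write
    f = sigmoid(z[1*n:2*n])   # forget gate: how much of the old cell state to keep
    g = np.tanh(z[2*n:3*n])   # candidate values to write
    o = sigmoid(z[3*n:4*n])   # output gate: how much of the cell state to expose
    c = f * c_prev + i * g    # new cell state
    h = o * np.tanh(c)        # new hidden state
    return h, c

rng = np.random.default_rng(0)
input_dim, hidden_dim = 5, 3
W = rng.normal(size=(4 * hidden_dim, input_dim))
U = rng.normal(size=(4 * hidden_dim, hidden_dim))
b = np.zeros(4 * hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x in np.eye(input_dim):   # feed five one-hot inputs in sequence
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)
```

A GRU, by contrast, would replace the separate input and forget gates with one update gate and drop the separate cell state.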

The code for this project is hosted [here](https://github.com/sakib/NLP).


## Experimental Setup

First, we import the necessary libraries from [Keras](https://keras.io).

In [1]:
from __future__ import print_function
import numpy as np
from keras.models import Sequential
# all of these layers are exported from keras.layers directly,
# which avoids depending on submodule paths that vary between Keras versions
from keras.layers import Dense, Activation, Dropout, LSTM, TimeDistributed

Using TensorFlow backend.


Now we define some constants to help us control the architecture of our RNN.

The following constants should *not* be modified:
- `SEQ_LENGTH` refers to the length of the character window considered when predicting the next character in the generated sequence.
- `HIDDEN_DIM` refers to the size of the LSTM layers in the RNN.
- `LAYER_NUM` refers to the total number of LSTM layers in the RNN, including the initial input LSTM layer.
- `BATCH_SIZE` is a level of training granularity: weights inside the RNN are updated after each batch of `BATCH_SIZE` training samples are processed.
- `DROPOUT_RATE` is the fraction of units randomly dropped during training (in the code below, from the output of the first LSTM layer), which helps avoid overfitting to the text.

The following constants can be modified:
- `GENERATE_LENGTH` refers to the length of the output sequence that we want to see when we run the generative model and output text.
- `WEIGHTS` is the file location that we want to initially load the network weights from. It can be left as empty: `''`
- `TRAIN` is a flag that tells the program whether we want to be training the RNN on this run.

In [0]:
# constants - DO NOT CHANGE
SEQ_LENGTH = 100
HIDDEN_DIM = 700
LAYER_NUM = 2
BATCH_SIZE = 50
DROPOUT_RATE = 0.3

# constants - modifiable
GENERATE_LENGTH = 500
#WEIGHTS = 'weights/hp/checkpoint_layer_{}_hidden_{}_epoch_{}.hdf5'.format(LAYER_NUM, HIDDEN_DIM, 300)
WEIGHTS = ''
TRAIN = False

## Training Scheme

Then, let's define a routine for preparing the training data. The goal of the generative model is as follows: given a sequence of up to `SEQ_LENGTH` characters, predict the next character in the sequence. Here is the scheme:

- We load all of the text into `data` and create a small set of characters `chars` of length `VOCAB_SIZE`.
- We create mappings from characters to unique numbers and vice versa.
- We use these unique numbers to map individual characters to one-hot vectors of length `VOCAB_SIZE`.
- The model outputs a vector of length `VOCAB_SIZE` for each position. The index of the largest softmax value (the argmax) is mapped back to a character to continue the output stream.
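
The scheme above can be sketched on a toy string. This is a minimal illustration with made-up data; the helper `one_hot` and the sample text are assumptions for the example and are not part of the project code.

```python
import numpy as np

text = "harry potter"
chars = sorted(set(text))                 # sorted for a reproducible mapping
vocab_size = len(chars)
char_to_ix = {c: i for i, c in enumerate(chars)}
ix_to_char = {i: c for i, c in enumerate(chars)}

def one_hot(s):
    """Encode a string as a (len(s), vocab_size) matrix of one-hot rows."""
    out = np.zeros((len(s), vocab_size))
    out[np.arange(len(s)), [char_to_ix[c] for c in s]] = 1.0
    return out

X = one_hot(text[:-1])   # inputs:  every character except the last
y = one_hot(text[1:])    # targets: the same sequence shifted by one character

# decoding takes the argmax of each row and maps it back to a character
decoded = ''.join(ix_to_char[i] for i in np.argmax(y, axis=1))
print(X.shape, repr(decoded))
```

Each target row is simply the one-hot encoding of the character that follows the corresponding input row, which is what the `[0:100]` and `[1:101]` slices in the loading routine express.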

In [0]:
def load_data(data_files, seq_length):
    data = '\n'.join([f.read() for f in data_files])
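    # sort so the character-to-index mapping is identical on every run;
    # saved weights are only meaningful with the mapping they were trained on
    chars = sorted(set(data))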
    VOCAB_SIZE = len(chars)

    print('Data length: {} characters'.format(len(data)))
    print('Vocabulary size: {} characters'.format(VOCAB_SIZE))

    ix_to_char = {ix:char for ix, char in enumerate(chars)}
    char_to_ix = {char:ix for ix, char in enumerate(chars)}

    # each target is its input shifted by one character, so the last
    # window must leave one character of lookahead
    num_sequences = (len(data) - 1) // seq_length

    # one hot vectors input [0:100], one hot vectors output [1:101]
    X = np.zeros((num_sequences, seq_length, VOCAB_SIZE))
    y = np.zeros((num_sequences, seq_length, VOCAB_SIZE))
    for i in range(num_sequences):
        X_sequence = data[i*seq_length:(i+1)*seq_length]
        for j, char in enumerate(X_sequence):
            X[i, j, char_to_ix[char]] = 1.

        y_sequence = data[i*seq_length+1:(i+1)*seq_length+1]
        for j, char in enumerate(y_sequence):
            y[i, j, char_to_ix[char]] = 1.
    return X, y, VOCAB_SIZE, ix_to_char

Now we load the data into variables. Note that the third book in the *Harry Potter* series is not included here because no uncorrupted version was found. It was important to make sure that the text data was at least a few `MB` in size, which it was, even without the third book.

In [4]:
# parse the data
print('\nloading data...')
files = [open('data/hp/hp{}.txt'.format(i), 'r') for i in range(1, 8, 1) if i != 3]
X, y, VOCAB_SIZE, ix_to_char = load_data(files, SEQ_LENGTH)


loading data...



## Model Architecture

- `LAYER_NUM` LSTM layers, each of size `HIDDEN_DIM`, with dropout at rate `DROPOUT_RATE` applied after the first LSTM layer, followed by a fully-connected Dense layer of size `VOCAB_SIZE` applied independently at every time step of the sequence (length `SEQ_LENGTH` during training).
- Example: `2` LSTM layers, each of size `700`, with a dropout rate of `0.3`, followed by a fully-connected Dense layer of size `67` applied at each of the `100` time steps.

With this architecture, the `softmax` activation turns the output at each time step into a probability distribution over the `VOCAB_SIZE` characters. The index of the largest probability indicates the most likely character to follow the input sequence.
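
As a rough sense of scale, the number of trainable parameters for the example configuration can be worked out by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights, and a bias, no peephole connections); the helper names `lstm_params` and `dense_params` are made up for this illustration.

```python
def lstm_params(input_dim, hidden_dim):
    # four gates, each with input weights, recurrent weights, and a bias vector
    return 4 * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)

def dense_params(input_dim, output_dim):
    # weight matrix plus bias vector
    return input_dim * output_dim + output_dim

vocab_size, hidden_dim, layer_num = 67, 700, 2

total = lstm_params(vocab_size, hidden_dim)                      # first LSTM layer
total += (layer_num - 1) * lstm_params(hidden_dim, hidden_dim)   # remaining LSTM layers
total += dense_params(hidden_dim, vocab_size)                    # per-time-step Dense layer
print(total)  # 6120167
```

Under these assumptions the model has roughly 6.1 million parameters, most of them in the second LSTM layer, whose input is the 700-dimensional output of the first.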

In [0]:
# build the model, lstm, but can replace with gru or simplernn
print('\nbuilding model...')
model = Sequential()
model.add(LSTM(HIDDEN_DIM, input_shape=(None, VOCAB_SIZE), return_sequences=True))
model.add(Dropout(DROPOUT_RATE))
for i in range(LAYER_NUM - 1):
    model.add(LSTM(HIDDEN_DIM, return_sequences=True))
model.add(TimeDistributed(Dense(VOCAB_SIZE)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

## Text Generation

Now we just need a routine to generate an output sequence. We seed the model with a random character from the vocabulary, convert it to a one-hot encoding, predict the most likely next character given all the characters generated so far, append it to the sequence, and repeat.

In [0]:
# method for generating text
def generate_text(model, length, vocab_size, ix_to_char):
    # starting with random character
    ix = [np.random.randint(vocab_size)]
    y_char = [ix_to_char[ix[-1]]]
    X = np.zeros((1, length, vocab_size))
    for i in range(length):
        # appending the last predicted character to sequence
        X[0, i, :][ix[-1]] = 1
        print(ix_to_char[ix[-1]], end="")
        ix = np.argmax(model.predict(X[:, :i+1, :])[0], 1)
        y_char.append(ix_to_char[ix[-1]])
    print()
    return ('').join(y_char)
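
The mechanics of this loop can be tried without a trained network. The sketch below swaps in a hypothetical `ToyModel` whose `predict` always favors "the next character in the vocabulary"; both `ToyModel` and `greedy_generate` are stand-ins invented for this illustration and only mirror the shapes used above.

```python
import numpy as np

class ToyModel:
    """Stand-in for the trained network: favors the next vocabulary index."""
    def predict(self, X):
        # X has shape (1, timesteps, vocab_size); return one distribution per time step
        timesteps, vocab_size = X.shape[1], X.shape[2]
        current = np.argmax(X[0], axis=1)
        probs = np.full((1, timesteps, vocab_size), 0.01)
        probs[0, np.arange(timesteps), (current + 1) % vocab_size] = 1.0
        return probs / probs.sum(axis=2, keepdims=True)

def greedy_generate(model, length, vocab_size, ix_to_char, start_ix):
    ix = start_ix
    X = np.zeros((1, length, vocab_size))
    out = []
    for i in range(length):
        X[0, i, ix] = 1                           # append the current character to the input
        out.append(ix_to_char[ix])
        probs = model.predict(X[:, :i+1, :])[0]   # shape (i+1, vocab_size)
        ix = int(np.argmax(probs[-1]))            # most likely next character
    return ''.join(out)

ix_to_char = dict(enumerate("abcd"))
print(greedy_generate(ToyModel(), 6, 4, ix_to_char, start_ix=2))  # cdabcd
```

Two properties of this approach are visible here: the model is re-run on the whole prefix at every step, so the cost grows quadratically with the output length, and taking the argmax makes the output fully determined by the seed character.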

Here, we illustrate pre-training results through a call to the `generate_text` routine in order to output 100 garbage characters. We also track the current epoch we are on and load weights from `WEIGHTS` if it was specified above.

In [0]:
# generate stuff before training to see it being bad
print('\npre-training results...')
generate_text(model, 100, VOCAB_SIZE, ix_to_char)

if WEIGHTS == '':
    epochs = 0
else:
    # use rfind for the extension so a dot elsewhere in the path (e.g. './weights') is ignored
    epochs = int(WEIGHTS[WEIGHTS.rfind('_') + 1:WEIGHTS.rfind('.')])
    print('\nloading weights from epoch {}...'.format(epochs))
    model.load_weights(WEIGHTS)


## Training/Generation

If `TRAIN` is specified or `WEIGHTS` is unspecified, we train the model with the `fit` routine and `generate_text` along the way, saving weights every `10` epochs. Each epoch ran for about 27 minutes on the AWS GPU hardware.

In [0]:
if TRAIN or WEIGHTS == '':
    print('training...')
    while True:
        print('\n\nepoch: {}\n'.format(epochs))
        model.fit(X, y, batch_size=BATCH_SIZE, verbose=1, epochs=1)
        epochs += 1
        print('generating text...')
        generate_text(model, GENERATE_LENGTH, VOCAB_SIZE, ix_to_char)
        if epochs % 10 == 0:
            print('saving weights to file...')
            model.save_weights('weights/hp/checkpoint_layer_{}_hidden_{}_epoch_{}.hdf5'.format(LAYER_NUM, HIDDEN_DIM, epochs))

If `WEIGHTS` is specified, we generate text only.

In [0]:
# Else, loading the trained weights and performing generation only
if WEIGHTS != '':
    generate_text(model, GENERATE_LENGTH, VOCAB_SIZE, ix_to_char)

## Results

Because the `.hdf5` weight files became corrupted for unknown reasons, the network cannot load the saved weights. Our progress from five days of training was lost, save for screenshots taken at regular intervals (linked below in this section). We are currently re-running the training, and the uncorrupted weight files will be available on GitHub within a few days after 05/06. Until then, the code is available in the repository.

If you would like to recreate this training, run the following long-running command in a terminal. (We strongly recommend **not** doing this on a personal machine.)

```
nohup python rnn-harry.py >./hp.log 2>&1 < /dev/null &
```

You can then `tail -f hp.log` to view output as it is logged.

[Here are the screenshots](https://imgur.com/2jPSiIJ) of some of our results during training.

From these, it is clear that the generated text approaches J.K. Rowling's writing style over successive epochs, with diminishing returns after ~100 epochs. Characters reply to each other, punctuation is used correctly, and sentences are reasonably well formed. This is striking given that the model generates its output character by character, choosing each one from predicted probabilities.

We also considered using GRUs (Gated Recurrent Units) instead of LSTMs, but we ran out of AWS credit, especially after reserving some for re-training the network and regenerating the weights.

## Conclusion 
### Exact Matches in Generated Text

Here, we perform some analysis using `difflib` to determine whether there are exact matches between the outputted text and the original text.

In [1]:
#compare string versions of data
from difflib import SequenceMatcher
import re

all_books = []
for i in range(1, 8):
    if i == 3:
        continue  # book 3 is not part of the corpus (no uncorrupted copy was found)
    with open('data/hp/hp{}.txt'.format(i), 'r') as book:
        all_books = all_books + book.readlines()

#gen = open('gen.txt', 'r')
#gen_s = ''.join(gen.readlines())

# adjacent string literals are concatenated, so source indentation does not leak into gen_s
gen_s = ('"I want to know what he\'s about to do." '
         'Harry felt a seat at the back of Kreacher\'s neck she had chosen to speak. '
         '"Do you know what I think?" said Harry, staring at the pair of them. '
         '"We won\'t be sure that you were aware of everything they have seen in the Pensieve." '
         'The crowd below was sweaty trunks, some looking pearly-white and staring. '
         'Harry stared at him in an instant he was pointing at the pair of them, '
         'the people in the crowd cheered. They had been excited by the afterpeering theor in the...')

#analyze similarities sentence-by-sentence
#(comparing the generated text to the full corpus yielded incomplete results)

all_books_s = ''.join(all_books)
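# split on any single punctuation character; the character class is required,
# since a bare '.' would match any character and '()' an empty group
all_books_sents = re.split(r'[.,!?"()]', all_books_s)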
with open('matches.txt', 'w') as f:
    for sent in all_books_sents:
        sm = SequenceMatcher(None, gen_s, sent)
        for block in sm.get_matching_blocks():
            # keep only shared runs longer than 10 characters
            if block.size > 10:
                f.write('Match: "{}" in original\n with "{}" in generated\n'.format(
                    sent[block.b:block.b + block.size],
                    gen_s[block.a:block.a + block.size]))



As the results [(file link)](https://github.com/sakib/NLP/blob/master/matches.txt) of this analysis show, the generated text and the original text share a few exact fragments (runs longer than 10 characters), but no entire sentences. This suggests that the model did not simply memorize the training data, although qualitative analysis shows that the generated text is still similar to the original. (The qualitative similarities can be seen in phrases such as "said Harry, staring at the pair of them," where the "pair" is reminiscent of Ron and Hermione from the original text. J.K. Rowling's frequent use of "said" as a "speaking verb" is also clearly mimicked by the model.)
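
To show what counts as a match under this criterion, here is a small self-contained sketch of the same comparison on two short strings. The `original` string is invented for the example (it is not a quotation from the books), and the variable names are assumptions.

```python
from difflib import SequenceMatcher
import re

# made-up "original" text and a generated-style sentence
original = 'Harry looked up. "Well," said Harry, staring at the pair of them.'
generated = '"Do you know what I think?" said Harry, staring at the pair of them.'

# split the original into fragments at punctuation, dropping empty pieces
sentences = [s for s in re.split(r'[.,!?"()]', original) if s.strip()]

matches = []
for sent in sentences:
    sm = SequenceMatcher(None, generated, sent)
    for a, b, size in sm.get_matching_blocks():
        if size > 10:                      # same length threshold as the analysis
            matches.append(sent[b:b + size])
print(matches)
```

Here the fragments ` said Harry` and ` staring at the pair of them` are reported, while shorter overlaps such as `Harry` alone fall below the threshold. Because the text is split at punctuation, a match can never span a comma or a period, so this criterion measures shared phrases rather than shared full sentences.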

## Significance of this experiment 

This experiment provides insight into the nature of natural language and the patterns of language in some of its most popular written forms. As this experiment shows, a specific writing style can be mimicked by an algorithm that, at its core, only predicts the most likely character to follow a sequence. Even the writing of J.K. Rowling, one of the most popular creative minds of all time, is defined by key elements that can be reproduced by a neural network. 

However, this is not to say that human creativity is predictable (at least not for the time being). The generated text is largely grammatical, but it does not hold together as a coherent narrative. To address this issue with text-generative models, researchers are looking into not only keeping states, but also [attention](http://arxiv.org/pdf/1502.03044v2.pdf), which lets a model selectively draw on relevant parts of a larger collection of information at each step. For example, an image-captioning model might focus on a different section of the image for each element of the output sequence. Similar ideas may apply to simulating authors' writing styles.
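
The core of attention can be sketched in a few lines. The example below uses scaled dot-product scoring for brevity, which is a simplification and not necessarily the scoring function used in the linked paper; the function `attend` and all array names are assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

def attend(query, keys, values):
    """Soft attention: weight each value by how well its key matches the query."""
    scores = keys @ query / np.sqrt(query.shape[0])   # one score per item
    weights = softmax(scores)                          # non-negative, sums to 1
    return weights @ values, weights                   # weighted average of the values

rng = np.random.default_rng(0)
keys = rng.normal(size=(6, 4))                 # e.g. six image regions or past states
keys /= np.linalg.norm(keys, axis=1, keepdims=True)
values = rng.normal(size=(6, 8))
query = 5 * keys[2]                            # a query pointing in the direction of item 2

context, weights = attend(query, keys, values)
print(weights.round(3), context.shape)
```

Since the keys are normalized to unit length, item 2 receives the largest weight, and the returned context vector is dominated by its value. In a generative model, a vector like this would be computed at every output step and fed to the decoder alongside its recurrent state.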