# Final Report

This is the final report for my project in CS344 at Calvin University.

Student: Tyler Poel

Date: May 20, 2020

## Vision

In this project I will be exploring text generation using
neural networks, specifically an LSTM. I will feed the
network a large text file, and I'm interested in how the
way I train it affects the text it generates.

Specifically I'll be feeding the neural network the script for
the Star Wars Original Trilogy (A New Hope (1977), The Empire Strikes Back (1980), and
Return of the Jedi (1983)). I'm interested in the type of script
the neural network will give me as output. The first question
is simply: will any of the generated text make logical sense? Secondly,
will the generated text resemble a movie script?

For comparison, I will also feed the LSTM network the book "Little Women",
by Louisa May Alcott. Since the model
is the same for both, different degrees of success between the outputs
could lead to insights into how the network "learns" text. Since the
texts are different in nature, one being a movie script and the other a novel,
it will be interesting to see the differences between them, if any.

## Background

For this project I'm using an LSTM (long short-term memory) network.
This is a type of RNN (recurrent neural network). An RNN is well suited to
text generation because it feeds the output of previous steps
back in to inform the next one: one needs to know the
words that came before to give a good next word.

One area where RNNs fall short is their inability to "remember"
things from a while back. When the gap between the relevant information
and the point where it is needed becomes too large, the RNN becomes
less successful.

LSTMs take a standard RNN and make some improvements. The basic structure is still
the same, but an LSTM is able to learn long-term dependencies.
The core idea is that in every module, or node, in the LSTM
linked structure there is a cell state that runs through the module. It
interacts very little with the rest of the internal structure of the module,
and therefore receives only minor changes. There are several "gates"
that regulate the cell state before it leaves an individual module. These gates
use a sigmoid neural net layer to determine how much information
gets let through. The cell state then gets passed
on to the next module, and it's through this cell state that the
long-term dependencies are learned.
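
To make the gate and cell-state description concrete, here is a minimal NumPy sketch of a single LSTM step in the standard textbook formulation. The function name `lstm_step`, the weight layout, and the toy sizes are my own assumptions for illustration; this is not how Keras implements its LSTM layer internally.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step (illustrative names and weight layout)."""
    hidden = h_prev.shape[0]
    # Combine the previous hidden state with the current input
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0 * hidden:1 * hidden])  # forget gate: how much old cell state to keep
    i = sigmoid(z[1 * hidden:2 * hidden])  # input gate: how much new information to add
    o = sigmoid(z[2 * hidden:3 * hidden])  # output gate: how much of the cell state to expose
    g = np.tanh(z[3 * hidden:4 * hidden])  # candidate values for the cell state
    # The cell state is only scaled and added to, never fully rewritten
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t

# Toy sizes and random weights, chosen only for this example
rng = np.random.default_rng(0)
input_size, hidden_size = 5, 4
W = rng.normal(scale=0.1, size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x_t in rng.normal(size=(3, input_size)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)
```

Because each sigmoid gate outputs values between 0 and 1, the line computing `c_t` is where information is kept or dropped as the cell state moves from one step to the next.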

A great breakdown of RNNs and LSTMs can be found here:
https://colah.github.io/posts/2015-08-Understanding-LSTMs

## Implementation

Most of the code I've used for this project is from Chapter 8, Section 1 of
"Deep Learning with Python", by François Chollet. The code
can be found here:
https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/8.1-text-generation-with-lstm.ipynb

I'll be going through the implementation using the Star Wars network;
the network for Little Women is exactly the same.

First, the text file is read in and converted to lower case to reduce
the number of unique characters.

In [1]:
import keras
import numpy as np

with open("/Users/tylerpoel/CS/344/cs344/Project/OriginalTrilogy_script.txt") as f:
    text = f.read().lower()
print('Corpus length:', len(text))

Using TensorFlow backend.


Corpus length: 491488


Next, a list of character sequences is created. The sequences
can be as long as one wants; here each is 60 characters long. A new sequence is started every
3 characters. Because the sequence length is larger than the step, consecutive sequences overlap heavily,
which is totally ok.

A list of targets also needs to be created. A target in this scenario is simply the very next
character that came after a given sequence.

In [2]:
# Length of extracted character sequences
maxlen = 60

# We sample a new sequence every `step` characters
step = 3

# This holds our extracted sequences
sentences = []

# This holds the targets (the follow-up characters)
next_chars = []

for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('Number of sequences:', len(sentences))

Number of sequences: 163810


There's a lot one can do with the maxlen and step parameters.
I found a maxlen of 60 and a step of 3 to work best. A very small maxlen
ended up generating text that barely contained full, real words.
A very large maxlen seemed like overkill, and also didn't allow
for as much unique behavior. 60 seemed like a great in-between.
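
To see how maxlen and step interact, the same extraction loop can be run on a short string. The sentence and the `toy_` names below are made up for illustration; only the loop structure comes from the cell above.

```python
toy_text = "may the force be with you"
toy_maxlen = 10
toy_step = 3

toy_sentences = []
toy_next_chars = []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    toy_sentences.append(toy_text[i: i + toy_maxlen])
    toy_next_chars.append(toy_text[i + toy_maxlen])

# Each sequence is paired with the single character that follows it
for sentence, target in zip(toy_sentences, toy_next_chars):
    print(repr(sentence), '->', repr(target))

# The same arithmetic on the real corpus gives the count reported above
corpus_sequences = len(range(0, 491488 - 60, 3))
print(corpus_sequences)  # 163810
```

Each sequence shares all but 3 of its 10 characters with the one before it, which is the overlap mentioned earlier. A smaller step gives more (and more redundant) training examples, and a larger one gives fewer.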

A list of unique characters present in the text is also created.

In [3]:
# List of unique characters in the corpus
chars = sorted(list(set(text)))
print('Unique characters:', len(chars))
# Dictionary mapping unique characters to their index in `chars`
char_indices = {char: index for index, char in enumerate(chars)}

Unique characters: 57


Finally, the sequences and targets are one-hot encoded into arrays.
Using multi-dimensional arrays, 1s and 0s mark
which unique character shows up at each position. For each character in each sequence,
there is an array whose length is the number of unique characters. The entry for the unique character that
matches the current character in the sequence is set to 1; every other entry stays 0.
This one-hot encoding is done for both the sequences
and the targets.

In [6]:
# One-hot encode the characters into binary arrays.
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

print('One-hot encoding complete')

One-hot encoding complete
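
As a small illustration of the encoding, here is the same idea applied to a three-character alphabet. The alphabet, the sequence, and the `toy_` names are invented for this sketch.

```python
import numpy as np

toy_chars = sorted(set("abca"))  # ['a', 'b', 'c']
toy_indices = {ch: i for i, ch in enumerate(toy_chars)}

toy_sequence = "cab"
# One row per position in the sequence, one column per unique character
one_hot = np.zeros((len(toy_sequence), len(toy_chars)), dtype=bool)
for t, ch in enumerate(toy_sequence):
    one_hot[t, toy_indices[ch]] = True

print(one_hot.astype(int))

# Size of the full input array in this notebook, at one byte per bool:
# 163810 sequences x 60 positions x 57 characters
x_bytes = 163810 * 60 * 57
print(x_bytes)  # 560230200, roughly 560 MB
```

Every row contains exactly one 1, in the column of the character at that position. The last two lines show why a compact dtype matters here: the full input array has over half a billion entries.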


Next the network is built, using an LSTM layer as described above, followed by a dense output layer.
A softmax activation function is used on the output layer, because it gives probabilities as outputs.
In this case, it gives a probability for each unique character, indicating
the chance of that character coming next.

The loss used is categorical cross-entropy. This works
well here because we can think of each character as its
own class, and our model is trying to predict a next character
that matches the actual next character.

In [8]:
from keras import layers

model = keras.models.Sequential()
model.add(layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(layers.Dense(len(chars), activation='softmax'))

optimizer = keras.optimizers.RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

print("Model built")

Model built
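
To show what the softmax output and the categorical cross-entropy loss compute, here is a small NumPy sketch on a made-up three-class example. The helper functions are my own minimal versions, not the Keras implementations, and the scores are arbitrary.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged
    exp = np.exp(logits - np.max(logits))
    return exp / np.sum(exp)

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip to avoid log(0); y_true is a one-hot vector
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred))

logits = np.array([2.0, 1.0, 0.1])       # arbitrary raw scores
probs = softmax(logits)                  # non-negative, sums to 1
target = np.array([1.0, 0.0, 0.0])       # the "actual next character"
loss = categorical_crossentropy(target, probs)
print(probs, loss)

# Baseline: guessing uniformly over the 57 characters in this corpus
uniform = np.full(57, 1.0 / 57)
uniform_target = np.zeros(57)
uniform_target[0] = 1.0
uniform_loss = categorical_crossentropy(uniform_target, uniform)
print(uniform_loss)  # ln(57), about 4.04
```

With a one-hot target, the loss reduces to the negative log of the probability assigned to the correct character. A training loss well below the uniform baseline of about 4.04 therefore suggests the model has learned something about which characters tend to come next.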


The sample function is used when text generation
actually occurs. It takes the probability distribution given
by the model and returns the index of the next character,
drawn at random according to that distribution.

The sample function also reweights the distribution given a certain "temperature."
Higher temperatures create more surprising and unique text that
has a much higher tendency to build incoherent sentences,
or even words that aren't real. Lower temperatures stick much closer
to the given text. Choosing the temperature is up to the user, given
their own goals.

In [9]:
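def sample(preds, temperature=1.0):
    # Work in float64 to limit rounding error when renormalizing
    preds = np.asarray(preds).astype('float64')
    # Rescale the log-probabilities by the temperature
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    # Renormalize so the reweighted values sum to 1
    preds = exp_preds / np.sum(exp_preds)
    # Draw one character index from the reweighted distribution
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)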
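
To see the effect of temperature without any randomness, the reweighting step can be pulled out on its own and applied to a made-up distribution. The `reweight` helper and the three probabilities are assumptions for illustration; the arithmetic mirrors the first lines of `sample`.

```python
import numpy as np

def reweight(preds, temperature):
    # Same reweighting as in sample(), without the random draw
    preds = np.asarray(preds).astype('float64')
    scaled = np.log(preds) / temperature
    exp_preds = np.exp(scaled)
    return exp_preds / np.sum(exp_preds)

toy_preds = [0.6, 0.3, 0.1]  # hypothetical model output over three characters
for temperature in [0.2, 0.5, 1.0, 2.0]:
    print(temperature, np.round(reweight(toy_preds, temperature), 3))
```

At a temperature of 1.0 the distribution is unchanged. Below 1.0, the most likely character takes an even larger share, so the generated text is more predictable. Above 1.0, the distribution flattens, so unlikely characters are picked more often.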

Finally the model is trained, and output is generated.
The loop repeatedly trains the model and generates new text. In this final
report notebook I'll only train it for 4 epochs, as training for
many epochs takes a lot of time and space, and more epochs would be overkill for showing the implementation
details. However, both the Star
Wars model and the Little Women model have been trained for
60 epochs, and the results of that training can be seen in their
respective notebooks.

In [None]:
import random
import sys

for epoch in range(1, 5):
    print('epoch', epoch)
    # Fit the model for 1 epoch on the available training data
    model.fit(x, y,
              batch_size=128,
              epochs=1)

    if epoch % 2 == 0:
        # Select a text seed at random
        start_index = random.randint(0, len(text) - maxlen - 1)
        generated_text = text[start_index: start_index + maxlen]
        print('--- Generating with seed: "' + generated_text + '"')

        for temperature in [0.2, 0.5]:
            print('------ temperature:', temperature)
            sys.stdout.write(generated_text)

            # We generate 600 characters
            for i in range(600):
                sampled = np.zeros((1, maxlen, len(chars)))
                for t, char in enumerate(generated_text):
                    sampled[0, t, char_indices[char]] = 1.

                preds = model.predict(sampled, verbose=0)[0]
                next_index = sample(preds, temperature)
                next_char = chars[next_index]

                generated_text += next_char
                generated_text = generated_text[1:]

                sys.stdout.write(next_char)
                sys.stdout.flush()
            print()

epoch 1
Epoch 1/1
epoch 2
Epoch 1/1
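
The generation part of the loop can also be tried in isolation, without a trained network. The sketch below swaps the model for a hypothetical stand-in, `uniform_predict`, that always returns equal probabilities. The toy alphabet, window length, and function names are assumptions made for this example only.

```python
import numpy as np

toy_chars = sorted(set("abc "))
toy_indices = {ch: i for i, ch in enumerate(toy_chars)}
toy_maxlen = 5

def uniform_predict(window_one_hot):
    # Hypothetical stand-in for model.predict: every character equally likely
    return np.full(len(toy_chars), 1.0 / len(toy_chars))

def generate(seed, n_chars, predict, rng):
    window = seed
    output = []
    for _ in range(n_chars):
        # One-hot encode the current window, as in the training data
        encoded = np.zeros((1, toy_maxlen, len(toy_chars)))
        for t, ch in enumerate(window):
            encoded[0, t, toy_indices[ch]] = 1.0
        probs = predict(encoded)
        next_char = toy_chars[rng.choice(len(toy_chars), p=probs)]
        output.append(next_char)
        # Slide the window: drop the oldest character, append the new one
        window = window[1:] + next_char
    return ''.join(output), window

rng = np.random.default_rng(0)
generated, final_window = generate("ab ca", 20, uniform_predict, rng)
print(repr(generated), repr(final_window))
```

The window always stays `toy_maxlen` characters long, so the network only ever sees the most recent characters, including ones it generated itself.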

## Results


