<img src="https://ucfai.org//course/sp19/recurrent-nets/banner.jpg">

<div class="col-12">
    <h1> Who Needs Show Writers Nowadays? </h1>
    <hr>
</div>

<div style="line-height: 2em;">
    <p>by: 
        <strong> Brandon Silva</strong>
        (<a href="https://github.com/brandons209">@brandons209</a>)
        <strong> John Muchovej</strong>
        (<a href="https://github.com/ionlights">@ionlights</a>)
     on 2019-02-27</p>
</div>

# Generate new Simpsons scripts with an LSTM RNN
## Link to slides [here](https://drive.google.com/open?id=1-BB-krsxzvpAgLZ19ul1r16ZGfe3BHSG9LFK5VqZFs8)
In this project, we will use an LSTM together with an Embedding layer to train our network on an episode of The Simpsons, specifically the episode "Moe's Tavern". The script is taken from [this](https://www.kaggle.com/wcukierski/the-simpsons-by-the-data) dataset on Kaggle. The same model can be applied to any text: more episodes of The Simpsons, a book, articles, Wikipedia, etc. It will learn the semantic word associations and be able to generate text that resembles what it was trained on. Try it with the Harry Potter book included in this project.

First, let's import all of the libraries we need. We use Keras' Tokenizer class to tokenize our inputs and pad_sequences to pad short inputs to a fixed length. Our embedding layer has a fixed input size, so instead of passing our entire script at once we supply a sequence of words with a length we can choose. Documentation can be found for [tokenizer](https://keras.io/preprocessing/text/) and [pad_sequences](https://keras.io/preprocessing/sequence/).

I have also written a helper module that handles loading and saving the dictionaries we will make, and tokenizing punctuation.

In [None]:
#general imports
import helper #local helper.py, downloaded in the next section (run that cell first)
import numpy as np
import time
import os

#pre-processing
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

#model
from keras.models import Sequential
from keras.layers import Dropout, Dense
from keras.layers import Embedding, LSTM
from keras.models import load_model

#training
from keras.callbacks import ModelCheckpoint, TensorBoard
from keras import optimizers as opt

## Grab important files and create folders
Here we download the dataset and save it to the folder "data" that we create. There is also a Python script file with helper functions I have written to help with processing the data.

To do this, we first download gdown.pl, a script that downloads files from Google Drive from the command line.

In [None]:
!wget https://raw.githubusercontent.com/circulosmeos/gdown.pl/master/gdown.pl
!chmod +x gdown.pl
!mkdir data
!mkdir data/dictionaries
!mkdir saved_model_data
!mkdir tensorboard_logs

!./gdown.pl https://drive.google.com/file/d/1dgOnAVDDTDAg59SYcuu-aBzva7gk5YFk/view helper.py
!./gdown.pl https://drive.google.com/file/d/1F8Jd_0fhlT5kCwd0m5kUDS2KyiVkOSXy/view data/moes_tavern_lines.txt

## Dataset statistics
Before starting our project, we should take a look at the data we are dealing with.

In [None]:
script_text = helper.load_script('data/moes_tavern_lines.txt')

print('----------Dataset Stats-----------')
print('Approximate number of unique words: {}'.format(len(set(script_text.split()))))
scenes = script_text.split('\n\n')
print('Number of scenes: {}'.format(len(scenes)))
sentence_count_scene = [scene.count('\n') for scene in scenes]
print('Average number of sentences in each scene: {:.0f}'.format(np.average(sentence_count_scene)))

sentences = [sentence for scene in scenes for sentence in scene.split('\n')]
print('Number of lines: {}'.format(len(sentences)))
word_count_sentence = [len(sentence.split()) for sentence in sentences]
print('Average number of words in each line: {:.0f}'.format(np.average(word_count_sentence)))

## Tokenize Text
In order to prepare our data for our network, we need to tokenize the words. That is, we will convert every unique word and punctuation mark into an integer. Before we do that, we need to make the punctuation easier to convert to a number. For example, we will take every newline and convert it to the word "||return||". This makes the text easier to tokenize and pass into our model. The functions that do this are in the helper.py file.

A note on Keras' Tokenizer: 0 is a reserved integer that is not used to represent any word, so the integers for our words start at 1. This matters because when we use the model to generate new text, it needs a starting point, known as a seed. If this seed is shorter than our sequence length, pad_sequences pads it with 0s to represent "nothing". This helps reduce noise in the network.
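
To make the reserved 0 concrete, here is a minimal pure-Python sketch of both ideas: word indices that start at 1, and left-padding a short seed with 0s. The names `build_word_index` and `left_pad` are hypothetical stand-ins for illustration only; in this notebook Keras' Tokenizer and pad_sequences do the real work, and their internals may differ.

```python
from collections import Counter

def build_word_index(words):
    """Map each word to an integer, most frequent first, starting at 1 (0 stays reserved)."""
    counts = Counter(words)
    # sorted() is stable, so words with equal counts keep their first-seen order
    ordered = sorted(counts, key=lambda w: counts[w], reverse=True)
    return {word: i for i, word in enumerate(ordered, start=1)}

def left_pad(sequence, length, pad_value=0):
    """Keep the last `length` items and pad on the left with `pad_value` if too short."""
    sequence = list(sequence)[-length:]
    return [pad_value] * (length - len(sequence)) + sequence

demo_words = ["moe", "||comma||", "moe", "||period||"]
word_index = build_word_index(demo_words)
seed = [word_index[w] for w in ["moe", "||comma||"]]

print(word_index)         # no word is ever mapped to 0
print(left_pad(seed, 5))  # the short seed is padded on the left with 0s
```

Padding on the left keeps the real words at the end of the sequence, right next to the position where the next word is predicted.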

This is the list of punctuation and special characters that are converted. Notice that spaces are put before and after each token to make splitting the text easier:
- '!' : ' ||exclaimation_mark|| '
- ',' : ' ||comma|| '
- '"' : ' ||quotation_mark|| '
- ';' : ' ||semicolon|| '
- '.' : ' ||period|| '
- '?' : ' ||question_mark|| '
- '(' : ' ||left_parentheses|| '
- ')' : ' ||right_parentheses|| '
- '--' : ' ||dash|| '
- '\n' : ' ||return|| '
- ':' : ' ||colon|| '

We also convert all of the text to lowercase, which shrinks the vocabulary and speeds up training.
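
The replacement table above can be sketched in a few lines. This is a hypothetical stand-in (`tokenize_punctuation_sketch`), not the actual code in helper.py, which is not shown here and may differ in details such as replacement order.

```python
# Hypothetical stand-in for helper.tokenize_punctuation; the real helper may differ.
PUNCTUATION_TOKENS = {
    '!': ' ||exclaimation_mark|| ',  # spelling kept exactly as in the list above
    ',': ' ||comma|| ',
    '"': ' ||quotation_mark|| ',
    ';': ' ||semicolon|| ',
    '.': ' ||period|| ',
    '?': ' ||question_mark|| ',
    '(': ' ||left_parentheses|| ',
    ')': ' ||right_parentheses|| ',
    '--': ' ||dash|| ',
    '\n': ' ||return|| ',
    ':': ' ||colon|| ',
}

def tokenize_punctuation_sketch(text):
    """Replace each punctuation mark with its space-padded word token."""
    for symbol, token in PUNCTUATION_TOKENS.items():
        text = text.replace(symbol, token)
    return text

line = "Moe_Szyslak: Hey, Homer!\n"
words = tokenize_punctuation_sketch(line).lower().split()
print(words)
```

Because every token is surrounded by spaces, a plain `split()` is enough to separate words from punctuation, and names such as `moe_szyslak` survive as single words.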

In [None]:
# Keras tokenizer; char_level=True would treat every character as a token instead of every word
tokens = Tokenizer(filters='', lower=True, char_level=False)
script_text = helper.tokenize_punctuation(script_text)  # helper function to convert non-word characters
script_text = script_text.lower()

script_text = script_text.split()  # split the text on whitespace into a list of words

tokens.fit_on_texts(script_text)  # build the word -> int vocabulary from the text

## Creating Conversion Dictionaries and Input Data
Now that the tokens have been generated, we will create some dictionaries to convert our tokenized integers back to words, and words to integers. We will also generate our inputs and targets to pass into our model. 

To do this, we need to specify the sequence length, which is the number of words we pass into the model at one time. I chose 12, but feel free to change this. A sequence length of 1 gives the model just one word of context, so the quality of the output depends on the sequence length we pick. The helper function gen_sequences builds the sequences for us. We then save the dictionaries and the sequence length for testing.

The targets are simply the next word in our text. So if we have the sentence "Hi, how are you?" and we input "Hi, how are you", the target for this input is "?".
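
The input/target pairing can be sketched as a sliding window over the token list. The function `gen_sequences_sketch` below is a hypothetical illustration; the real helper.gen_sequences is not shown in this notebook and may differ (for example, in stride or return type). Word tokens are used instead of integers to keep the output readable.

```python
def gen_sequences_sketch(tokens, seq_length):
    """Slide a window over the tokens: each input is seq_length tokens,
    and its target is the single token that follows the window."""
    inputs, targets = [], []
    for start in range(len(tokens) - seq_length):
        inputs.append(tokens[start:start + seq_length])
        targets.append(tokens[start + seq_length])
    return inputs, targets

demo_tokens = ["hi", "||comma||", "how", "are", "you", "||question_mark||"]

# With a window of 5 there is exactly one pair, matching the example above.
inputs, targets = gen_sequences_sketch(demo_tokens, 5)
print(inputs, targets)

# A shorter window yields more (overlapping) training pairs from the same text.
short_inputs, short_targets = gen_sequences_sketch(demo_tokens, 3)
print(len(short_inputs))
```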

In [None]:
sequence_length = 12

word_to_int = tokens.word_index  # word -> int dict
int_to_word = {index: word for word, index in word_to_int.items()}  # flip word_to_int to get int -> word
int_script_text = np.array([word_to_int[word] for word in script_text], dtype=np.int32)  # convert text to integers
# transform int_script_text into sequences of sequence_length and generate the targets
int_script_text, targets = helper.gen_sequences(int_script_text, sequence_length)
# the embedding needs one more entry than the dictionary has words, because index 0 is reserved
vocab_length = len(word_to_int) + 1

print("Vocabulary size: {}".format(vocab_length))

In [None]:
#save dictionaries for use with testing model, also need to save sequence length since it needs to be the same when testing
helper.save_dict(word_to_int, 'word_to_int.pkl')
helper.save_dict(int_to_word, 'int_to_word.pkl')
helper.save_dict(sequence_length, 'sequence_length.pkl')

## Building the Model
Here is the fun part, building our model. We will use LSTM cells and an Embedding layer, with a fully connected Dense layer at the end for the prediction. Documentation for LSTM cells can be found [here](https://keras.io/layers/recurrent/) and for embedding [here](https://keras.io/layers/embeddings/).

An LSTM layer can be defined simply as:    
```
LSTM(num_cells, dropout=drop, recurrent_dropout=drop, return_sequences=True)
```
From the docs:
- dropout: Float between 0 and 1. Fraction of the units to drop for the linear transformation of the inputs.
- recurrent_dropout: Float between 0 and 1. Fraction of the units to drop for the linear transformation of the recurrent state.
- return_sequences: Boolean. Whether to return the last output in the output sequence, or the full sequence.

For every LSTM layer except the last, return_sequences is set to True so that the layer outputs its full sequence (one output per time step) instead of only the last output. The next LSTM layer needs that full sequence as its input. The last LSTM layer leaves return_sequences unset, so it returns only the last output in the sequence, which feeds the final Dense layer with softmax activation.

An embedding layer can be defined as:
```
Embedding(input_dim, output_dim, input_length=sequence_length)
```
Our input dimension is vocab_length (the number of words in our vocabulary plus one for the reserved 0), output_dim is the size of each word vector and can be whatever you want (in my case I used 300), and input_length is the sequence_length we defined earlier.

Our model will predict the next word based on the input sequence. We could also predict the next two words, or predict entire sentences. For now, though, we will just stick with one word.
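
To get a feel for the size of such a model, here is a rough parameter count for an embedding of size 300 followed by the two 400-unit LSTM layers used in this notebook and a softmax Dense layer. The vocabulary size of 6000 is a made-up placeholder (the real value is printed by the notebook), and the helper functions are hypothetical; they only encode the standard weight-count formulas for these layer types with biases enabled.

```python
def embedding_params(vocab_size, embed_dim):
    # one embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights, and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return input_dim * units + units

vocab_length_demo = 6000  # hypothetical placeholder, not the real vocabulary size

layer_counts = {
    "embedding": embedding_params(vocab_length_demo, 300),
    "lstm_1": lstm_params(300, 400),
    "lstm_2": lstm_params(400, 400),
    "dense": dense_params(400, vocab_length_demo),
}
total_params = sum(layer_counts.values())

for name, count in layer_counts.items():
    print("{:<10} {:>10,}".format(name, count))
print("{:<10} {:>10,}".format("total", total_params))
```

The embedding and Dense layers both scale with the vocabulary size, which is one reason lowercasing the text and keeping the vocabulary small pays off.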

In [None]:
model = Sequential()
### BEGIN SOLUTION
model.add(Embedding(vocab_length, 300, input_length=sequence_length))
model.add(LSTM(400, dropout=0.3, recurrent_dropout=0.3, return_sequences=True))
model.add(LSTM(400, dropout=0.3, recurrent_dropout=0.3))
model.add(Dense(vocab_length, activation='softmax'))
### END SOLUTION

## Hyperparameters and Compiling the Model
The Adam optimizer is very effective and adapts the step size of each parameter as training progresses, so let's use that. We will also set the learning rate, epochs, and batch size.

You may assume our loss function will be categorical_crossentropy. In our case, this will not work, as that loss function requires our labels/targets to be one-hot encoded. Keras provides another loss function, called sparse_categorical_crossentropy. This applies categorical_crossentropy, but uses labels that are not one-hot encoded. Since our labels are just integers from 1 to vocab_length - 1, this works well for us.
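
A small pure-Python sketch shows why the two losses agree for a single prediction. The function names mirror the Keras loss names but are hypothetical re-implementations for illustration, not the Keras code itself; the probabilities are made up.

```python
import math

def categorical_crossentropy(one_hot_target, probs):
    """Cross-entropy against a one-hot target vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs))

def sparse_categorical_crossentropy(target_index, probs):
    """Same loss, but the target is just the integer index of the correct class."""
    return -math.log(probs[target_index])

probs = [0.1, 0.7, 0.2]      # made-up softmax output over a 3-word vocabulary
target_index = 1             # integer label
one_hot = [0.0, 1.0, 0.0]    # the same label, one-hot encoded

dense_loss = categorical_crossentropy(one_hot, probs)
sparse_loss = sparse_categorical_crossentropy(target_index, probs)
print(dense_loss, sparse_loss)
```

With integer labels we avoid building a vocab_length-sized one-hot vector for every training example, which saves a lot of memory when the vocabulary is large.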

In [None]:
learn_rate = 0.001
optimizer = opt.Adam(lr=learn_rate)
epochs = 5
batch_size = 32

model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()

## Training
Now it is time to train the model. We will use the ModelCheckpoint and TensorBoard callbacks to save the best weights and to view graphs of the model's loss and accuracy as it trains. Since we are not using validation data, the ModelCheckpoint callback monitors the training loss. We skip validation data because we want the model to follow the way our text is constructed as closely as possible.

The model is then saved after training.

In [None]:
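#load weights if continuing training (skipped when no saved weights exist yet)
weights_path = "saved_model_data/model.best.weights.hdf5"
if os.path.exists(weights_path):
    model.load_weights(weights_path)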

In [None]:
start_time = time.strftime("%a_%b_%d_%Y_%H:%M", time.localtime())
#view tensorboard with command: tensorboard --logdir=tensorboard_logs
os.makedirs("tensorboard_logs", exist_ok=True)
ten_board = TensorBoard(log_dir='tensorboard_logs/{}'.format(start_time), write_images=True)
weight_save_path = 'saved_model_data/{}.best.weights.hdf5'.format(start_time)
checkpointer = ModelCheckpoint(weight_save_path, monitor='loss', verbose=1, save_best_only=True, save_weights_only=True)
print("Tensorboard logs for this run are in: {}, weights will be saved in {}\n".format('tensorboard_logs/{}'.format(start_time), weight_save_path))

model.fit(int_script_text, targets, epochs=epochs, batch_size=batch_size, callbacks=[checkpointer, ten_board])
model.load_weights(weight_save_path)
model.save('saved_model_data/model.{}.h5'.format(start_time))

## Testing the Model
Testing the model simply requires that we convert each output integer back into a word and build up our generated text, starting from a seed we define. However, instead of always taking the argmax (the single most probable next word), we might get better results by sampling the next word from the predicted probabilities.

This is controlled by a "temperature" that reshapes the predicted distribution before sampling. A lower temperature concentrates the probability on the most likely words, so the word picked will be closer to the one with the highest probability; a higher temperature flattens the distribution and gives more varied output. The word is then chosen at random from the reshaped distribution. Try both approaches: setting a temperature of 0 just uses argmax on the entire prediction.
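
To see what the temperature does to a distribution, here is a small pure-Python sketch. The function `apply_temperature` is a hypothetical illustration of the usual rescaling (divide the log-probabilities by the temperature, then renormalize), and the probabilities are made up.

```python
import math

def apply_temperature(probs, temp):
    """Rescale a probability distribution: p_i ** (1 / temp), then renormalize."""
    scaled = [math.exp(math.log(p) / temp) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

probs = [0.5, 0.3, 0.2]  # made-up prediction over a 3-word vocabulary

sharp = apply_temperature(probs, 0.2)      # low temperature: close to argmax
unchanged = apply_temperature(probs, 1.0)  # temperature 1 leaves the distribution as-is
flat = apply_temperature(probs, 2.0)       # high temperature: closer to uniform

for name, dist in (("temp=0.2", sharp), ("temp=1.0", unchanged), ("temp=2.0", flat)):
    print(name, [round(p, 3) for p in dist])
```

At a low temperature almost all of the probability lands on the most likely word, so sampling behaves much like argmax; at a high temperature unlikely words get picked more often.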

In [None]:
#download weights, model, and dictionaries if using my model:
!./gdown.pl https://drive.google.com/file/d/1UPUkyo5D2Q-WK1d1NHUEA4lU3AC1fFAM/view data/dictionaries/int_to_word.pkl
!./gdown.pl https://drive.google.com/file/d/1UPUkyo5D2Q-WK1d1NHUEA4lU3AC1fFAM/view data/dictionaries/sequence_length.pkl
!./gdown.pl https://drive.google.com/file/d/1fCwa1KnaAMJTriM4hmIpUAbwKF3rkKU1/view data/dictionaries/word_to_int.pkl

!./gdown.pl https://drive.google.com/file/d/1v5XzYZ3X3xKlJUl-EyRolq-FbwFwvw3n/view saved_model_data/model.best.weights.hdf5
!./gdown.pl https://drive.google.com/file/d/1IJnQA4vKPAesZlF6Sc0hDC_DXa7WKdcY/view saved_model_data/model.Tue_Jul_24_2018_01\:22.h5

In [None]:
#load model if returning to this notebook for testing, model that I trained:
model = load_model('saved_model_data/model.Tue_Jul_24_2018_01:22.h5')

In [None]:
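def sample(prediction, temp=0):
    # prediction has shape (1, vocab_length); take the single row of probabilities
    prediction = np.asarray(prediction[0]).astype('float64')
    if temp <= 0:
        return np.argmax(prediction)  # greedy: always pick the most likely word
    # rescale the log-probabilities by the temperature, then renormalize
    prediction = np.log(prediction) / temp
    expo_prediction = np.exp(prediction)
    prediction = expo_prediction / np.sum(expo_prediction)
    # draw one word at random from the rescaled distribution
    probabilities = np.random.multinomial(1, prediction, 1)
    return np.argmax(probabilities)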

#generate new script
def generate_text(seed_text, num_words, temp=0):
    input_text= seed_text
    for _  in range(num_words):
        #tokenize text to ints
        int_text = helper.tokenize_punctuation(input_text)
        int_text = int_text.lower()
        int_text = int_text.split()
        int_text = np.array([word_to_int[word] for word in int_text], dtype=np.int32)
        #pad text if it is too short; pads with zeros at the beginning of the text, so it shouldn't add much noise
        int_text = pad_sequences([int_text], maxlen=sequence_length)
        #predict next word:
        prediction = model.predict(int_text, verbose=0)
        output_word = int_to_word[sample(prediction, temp=temp)]
        #append to the result
        input_text += ' ' + output_word
    #convert tokenized punctuation and other characters back
    result = helper.untokenize_punctuation(input_text)
    return result

In [None]:
#set the seed text and the number of words to generate; good seeds are 'Homer_Simpson:', 'Bart_Simpson:', 'Moe_Szyslak:', or other characters' names
seed = 'Homer_Simpson:'
num_words = 200
temp = 0.2

#generate and print the number of words specified
print("Starting seed is: {}\n\n".format(seed))
print(generate_text(seed, num_words, temp=temp))

## Closing Thoughts
Remember that this model can be applied to any type of text, even code! So go and try different texts, like the (not) included Harry Potter book. (To save time, I would not use the whole book, as training would take a long time.)

Try different hyperparameters and model sizes, as they may give you better results.