# The Character-Based Text Generator

This notebook walks through how the <a href="https://github.com/keras-team/keras/blob/master/examples/lstm_text_generation.py">source project</a> was written, along with the minor changes we made to better suit our needs.

One change was the source material used as the dataset. These models were intended for use in an extension for MIT App Inventor, whose target audience is middle and high school students. We therefore felt that children's books were more appropriate training material than the original dataset, which consisted of Nietzsche's writings.

Another change was the ability to resume training a saved model. This was added mostly for convenience, so that training can be paused and resumed at will.

One change to consider is adding **dropout** to the models, and whether that is worth doing if we intend to host a word-based text generator instead of these character-based ones. During training, dropout helps to reduce overfitting. This is useful because some of the books in the dataset folder, such as Dr. Seuss's texts, are not very long, so measures should be taken to mitigate the risk of overfitting the training data.


## Imports

These models are trained using the Keras library. On the machine that originally trained these models, the backend was a GPU-enabled version of TensorFlow, though Keras accepts a handful of other deep-learning backends for training.

In [None]:
from __future__ import print_function
from keras.callbacks import LambdaCallback
from keras.models import Sequential
from keras.models import load_model
from keras.layers import Dense
# from keras.layers import Dropout
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
import numpy as np
import random
import sys
import io, getopt, ast
from pathlib import Path

## Parameters

Here we will define some parameters for our model.

The first parameter is `dataset_path`, which is simply the path to the text file that you want the model to train on. The notebook converts this text file into training data.

The next group of parameters is concerned with loading and saving the model. If you intend to load an existing model, set `load_file` to `True` and provide the location of the exported Keras model in `load_path`. Regardless of whether you load a model, you must provide both `save_path` and `file_name`. The save path is self-explanatory. `file_name` is the name of the file that you want to export the model to. **Notice**: `file_name` does not have a file extension. This is because the code that checkpoints the model appends the number of epochs elapsed and the `.h5` extension to the end of the string.

The final group of parameters are the **hyperparameters** for the model. They determine the number of epochs to train for, the batch size, the `look_back`, and the `step_size`. The `checkpoints` list gives the epoch counts after which the model is saved.

When preparing the dataset, we split the text into a series of input and output pairs. Each input is a sequence of `look_back` characters, and the output is the character that follows it. For example, the look back here is 40 characters. The first input sequence consists of the first 40 characters, at indexes 0 to 39 inclusive, and its output is the next character, at index 40.

This process then repeats for the characters in the range 0+`step_size` to 39+`step_size` inclusive, and so on through the text. You may want to increase the step size if you have a particularly large dataset and not enough memory to hold every overlapping sequence.
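As a minimal sketch of the windowing described above, the hypothetical helper `make_windows` (not part of the notebook) splits a short string using a small `look_back` and `step_size`, so the resulting pairs are easy to check by hand.

```python
def make_windows(text, look_back, step_size):
    """Return (input, next_char) pairs using a sliding window.

    Hypothetical helper for illustration; it mirrors the loop used
    later in the Dataset Preparation cell.
    """
    pairs = []
    for i in range(0, len(text) - look_back, step_size):
        pairs.append((text[i: i + look_back], text[i + look_back]))
    return pairs


sample_text = "abcdefgh"
pairs_step1 = make_windows(sample_text, look_back=3, step_size=1)
pairs_step2 = make_windows(sample_text, look_back=3, step_size=2)

print(pairs_step1)  # every window, shifted one character at a time
print(pairs_step2)  # fewer windows, because every other start is skipped
```

A larger `step_size` trades training examples for memory: the windows overlap less, so fewer of them are produced from the same text.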

In [None]:
dataset_path = "./datasets/narnia-1.txt"

load_file = False
load_path = "./checkpoint.h5"
save_path = "./"
file_name = "narnia-1"

num_epochs = 5
checkpoints = list(range(num_epochs + 1))  # 0 = untrained model; num_epochs = model after the final epoch
batch_size = 256
look_back = 40
step_size = 1

## Corpus Processing

This next block reads and prepares the text. It prints the length of the body of text (the corpus) and the number of distinct characters in it.

Additionally, because this code is intended to be exported to TensorFlow.js and hosted on a web server, I export the charset as a JavaScript array, which can then be used to decode the output of the model.

Notice that this is done regardless of whether we are loading an existing model or building a new one. In the future, it may be desirable to also save the charset with a given model so as to improve replicability and save time when training an existing model.
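One possible way to save the charset with a model is sketched below, assuming it is stored as a JSON file next to the model. The helper names `save_charset` and `load_charset` and the file name `demo-charset.json` are hypothetical and do not appear in the notebook.

```python
import json
import os
import tempfile


def save_charset(chars, path):
    """Write the sorted character list to a JSON file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(chars, f)


def load_charset(path):
    """Read the character list back and rebuild both lookup tables."""
    with open(path, encoding="utf-8") as f:
        chars = json.load(f)
    char_indices = {c: i for i, c in enumerate(chars)}
    indices_char = {i: c for i, c in enumerate(chars)}
    return chars, char_indices, indices_char


# Round-trip a tiny charset that includes characters needing escapes.
demo_chars = sorted(set('hi "there"\n'))
demo_path = os.path.join(tempfile.mkdtemp(), "demo-charset.json")
save_charset(demo_chars, demo_path)
restored, char_to_idx, idx_to_char = load_charset(demo_path)
print(restored == demo_chars)
```

Loading the stored charset when resuming training would keep the character-to-index mapping identical to the one the model was first trained with.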

In [None]:
with io.open(dataset_path, encoding='utf-8') as f:
    text = f.read().lower()
print('corpus length:', len(text))

chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# Export the charset as a JavaScript-readable array literal.
# json.dumps escapes newlines, double quotes, backslashes, and
# non-ASCII characters, so the result is also valid JavaScript.
import json  # standard library; imported here to keep this step self-contained
charset_final = json.dumps(chars)

# Write the charset file
with open(file_name + "-charset.txt", "w") as f:
    f.write(charset_final)

print(charset_final)

## Dataset Preparation

Here, we cut the text into semi-redundant sequences of `look_back` characters, as described above. The cell prints the number of sequences.

Finally, we one-hot encode these characters according to their index in `chars`, using the `char_indices` lookup.
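To make this conversion concrete, here is a small sketch of the one-hot encoding on a toy string. The names `toy_text`, `toy_chars`, `toy_index`, and `look_back_demo` are made up for this example; the array shapes follow the same `(sequences, look_back, charset size)` layout used in the cell below.

```python
import numpy as np

toy_text = "abca"
toy_chars = sorted(set(toy_text))              # ['a', 'b', 'c']
toy_index = {c: i for i, c in enumerate(toy_chars)}

look_back_demo = 3
inputs = [toy_text[0:look_back_demo]]          # ['abc']
targets = [toy_text[look_back_demo]]           # ['a']

# One row per sequence, one column per time step, one slot per character.
x_demo = np.zeros((len(inputs), look_back_demo, len(toy_chars)), dtype=bool)
y_demo = np.zeros((len(inputs), len(toy_chars)), dtype=bool)
for i, seq in enumerate(inputs):
    for t, ch in enumerate(seq):
        x_demo[i, t, toy_index[ch]] = True
    y_demo[i, toy_index[targets[i]]] = True

print(x_demo.astype(int)[0])
print(y_demo.astype(int)[0])
```

Each time step has exactly one slot set, at the position of that character in the sorted charset.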

In [None]:
maxlen = look_back
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step_size):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

print('Vectorization...')
# Use the built-in bool: the np.bool alias was removed in recent NumPy releases
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

## Build or Load a Model

Now we build the model. When loading a model, we assume that it was built to accept the specific input length and has the proper output length for training. Otherwise, we build a simple LSTM model, whose summary is printed below.

For the purposes of the project, we also save the model before it has done any training. This is for educational purposes, illustrating that the initialized model, not yet exposed to any training data, produces essentially random output.

In [None]:
my_file = Path(load_path)
if load_file and my_file.is_file():
    print("Found Checkpoint. Loading saved model...")
    model = load_model(load_path)
else:
    if load_file:
        print("Checkpoint not found. Building a new model instead.")
    print('Building model...')
    model = Sequential()
    model.add(LSTM(128, input_shape=(maxlen, len(chars))))
    model.add(Dense(len(chars), activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=0.01))
    model.save("{}-{}.h5".format(file_name,0))

model.summary()  # prints the summary itself and returns None

## Helper Function and Callback Functions

Here, we define a helper function and a callback function.

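The helper function `sample` takes the model's predicted probability distribution and a temperature, and returns the index of a sampled character. The temperature determines how likely we are to select the most confident character: lower values favor it more strongly, while higher values make the choice more random.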
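The effect of the temperature can be sketched without a model. The helper `rescale` below is hypothetical and reproduces only the reweighting step of `sample` (it does not draw a random index), and the probability values are made up.

```python
import numpy as np


def rescale(preds, temperature):
    """Reweight a probability array the way `sample` does before drawing."""
    logits = np.log(np.asarray(preds, dtype="float64")) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)


probs = [0.6, 0.3, 0.1]           # made-up model output
cold = rescale(probs, 0.5)        # sharper: favors the top character
same = rescale(probs, 1.0)        # leaves the distribution unchanged
hot = rescale(probs, 2.0)         # flatter: closer to uniform

print(cold.round(3), same.round(3), hot.round(3))
```

The notebook's callback uses a diversity of 1, which corresponds to sampling from the model's predictions as they are.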

The function `on_epoch_end` is the callback function that runs every time the model has completed a single epoch. In this function, I added the ability to checkpoint the model at various stages of training. These checkpoints help illustrate to middle and high school students that the models get better as they train for longer.
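The checkpoint rule can be traced without training anything. In this sketch, the hypothetical helper `saved_checkpoints` applies the same `epoch + 1` comparison the callback uses (Keras passes zero-based epoch numbers to `on_epoch_end`) and returns the file names that would be written. The base name `demo` is made up.

```python
def saved_checkpoints(file_name, checkpoints, num_epochs):
    """List the files the callback would save over a full training run."""
    saved = []
    for epoch in range(num_epochs):        # Keras reports epochs as 0, 1, 2, ...
        if epoch + 1 in checkpoints:       # same test as the callback's loop
            saved.append("%s-%d.h5" % (file_name, epoch + 1))
    return saved


every_epoch = saved_checkpoints("demo", [1, 2, 3], num_epochs=3)
sparse = saved_checkpoints("demo", [0, 2], num_epochs=3)

print(every_epoch)
print(sparse)
```

An entry of `0` never matches inside the callback; the untrained model is saved separately before training starts. To keep the model from the final epoch, `num_epochs` itself has to appear in `checkpoints`.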

In [None]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


def on_epoch_end(epoch, _):

    # Checkpointing the model
    for i in checkpoints:
        if epoch + 1 == i:
            print("Checkpointing the model...")
            model.save("%s-%d.h5" % (file_name,i))
    print()
    
    
    diversity = 1
    start_index = random.randint(0, len(text) - maxlen - 1)
    
    print('----- diversity:', diversity)

    generated = ''
    sentence = text[start_index: start_index + maxlen]
    generated += sentence
    
    print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)

    for i in range(400):
        x_pred = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_indices[char]] = 1.

        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds, diversity)
        next_char = indices_char[next_index]

        generated += next_char
        sentence = sentence[1:] + next_char

        sys.stdout.write(next_char)
        sys.stdout.flush()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

## Training

Finally, we begin training, using the hyperparameters and training data defined above. After each epoch, the callback prints a sample of generated text.

In [None]:
# checks if the checkpoints array says to checkpoint initial model
if 0 in checkpoints:
    print("Checkpointing the model...")
    model.save("%s-%d.h5" % (file_name,0))

# begin the training process    
model.fit(x, y,
          batch_size=batch_size,
          epochs=num_epochs,
          callbacks=[print_callback])

print("Source: \"%s\" \nEpochs: %d \nBatch Size: %d \nStep Size: %d \nLook Back: %d" % (dataset_path, num_epochs, batch_size, step_size, maxlen))


## Future Work

It may be worth adding a dropout layer to the models.

Among other features...
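
For intuition about what a dropout layer does during training, here is a minimal NumPy sketch of inverted dropout, the scheme that Keras's `Dropout` layer applies. The function name `dropout_forward`, the rate of 0.5, and the fixed seed are assumptions for illustration, not settings from this project.

```python
import numpy as np


def dropout_forward(activations, rate, rng):
    """Zero a random fraction `rate` of units and rescale the survivors."""
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
hidden = np.ones((4, 128))                 # stand-in for LSTM outputs
dropped = dropout_forward(hidden, rate=0.5, rng=rng)

print(np.unique(dropped))                  # only zeros and rescaled values
print(dropped.mean())                      # close to the original mean of 1.0
```

Dropout is only active during training, so text generation from a saved model would be unaffected by it.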