# Text Generation using Recurrent Neural Networks


## Introduction

Recurrent neural networks (RNNs) have emerged as powerful predictive and generative models for a range of applications. For example, take a look at the excellent blog post by Andrej Karpathy on [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/). In this exercise, your task is to build a generative, character-by-character RNN that predicts the next character of a given sequence. The classic work *The Odyssey* by Homer will serve as the corpus for your model. An example of the type of text that you can generate in this lab is shown below. The `seed` text is an initial sequence of characters randomly sampled from the corpus. Following the seed, you can see that the model has predicted a fairly realistic text sequence in the style of *The Odyssey*, including realistic line breaks and punctuation.

<b>Seed text</b>
```
h ulysses for having
blinded an eye of p
```

<b>Prediction of next 500 characters</b>
```
olypels end she darte yod mentered saw he would polden ewall by sur; for her got him eather, and he would send
them flying out of the hould not save his
men, for they perished through their own sheer folly in eating the
cattle of the sun-god hyperion; so the god prevented them from ever
reaching home. tell me, too, about all these things, oh daughter of
jove, from whatsoever source you may know them.

so now all who escaped death in battle or by shipwreck had got safely
home except ulysses, and 
```

### RNN Models in Keras

As in the CNN exercise, we will use the TensorFlow Keras module to implement a simple RNN model. Specifically, we will make use of the [LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM) and [GRU](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GRU) layers.
Before proceeding, it may be useful to familiarize yourself with the [Keras API for RNNs](https://www.tensorflow.org/guide/keras/rnn). In particular, understanding what input shape each RNN layer expects will be crucial.
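
As a rough illustration of the shapes involved, here is a NumPy-only sketch (not Keras code; all sizes and variable names are made up for the example). Keras RNN layers consume a 3D tensor laid out as `[batch, timesteps, feature]`, and a batch of integer-encoded sequences only reaches that layout once each index has been mapped to a feature vector.

```python
import numpy as np

# Hypothetical sizes chosen only for illustration.
batch, timesteps, vocab_size, feature_dim = 2, 5, 4, 3

rng = np.random.default_rng(0)

# A batch of integer-encoded sequences: shape (batch, timesteps).
index_batch = rng.integers(0, vocab_size, size=(batch, timesteps))

# A stand-in lookup table: one row of feature_dim values per vocabulary entry.
lookup_table = rng.normal(size=(vocab_size, feature_dim))

# Mapping every index to its row yields the (batch, timesteps, feature)
# layout that a recurrent layer processes one timestep at a time.
rnn_input = lookup_table[index_batch]

print(index_batch.shape)  # (2, 5)
print(rnn_input.shape)    # (2, 5, 3)
```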

### Embeddings

In essence, an embedding encodes an integer index as a vector of some size. You may think of this as a generalization of a one-hot encoding. For example, consider the characters in the word "hello". If this word contained all the characters in our vocabulary (namely "h", "e", "l", and "o"), we could generate a one-hot encoding where each character is represented by a vector of size 4, with a 1 in the element corresponding to the letter and zeros everywhere else. For our 4-character vocabulary, this could look like the following:

| char / index | encoding   |
|--------------|------------|
|h / 0         | 1 0 0 0    |
|e / 1         | 0 1 0 0    |
|l / 2         | 0 0 1 0    |
|o / 3         | 0 0 0 1    |

While one-hot encodings are useful for many applications, they have a number of limitations. For example,

- The encoding matrix is extremely sparse
- The size of the encoding depends on the size of the vocabulary or number of categories being encoded
- There is no notion of similarity between the entities being encoded

An embedding solves these problems by using a learnable (size of vocabulary) × (size of encoding) matrix in place of the fixed and sparse one-hot encoding matrix. Continuing with our previous example, an embedding for the characters in "hello" might take the form

| char / index | encoding    |
|--------------|-------------|
|h / 0         | 1.2 0.3 4.3 |
|e / 1         | 0.1 1.5 7.8 |
|l / 2         | 0.5 3.2 1.9 |
|o / 3         | 3.6 7.2 5.8 |

Note that here we have chosen an encoding represented by a vector of size 3, which is less than the size of the vocabulary. This is called "embedding" our vocabulary in a 3-dimensional space, and it is extremely useful when building encodings for very large vocabularies. In addition, the coefficients in the encoding matrix are learned during the training process, allowing "similar" characters (in this case) to be grouped locally in the encoding space. Further, we can visualize the embedding space through a number of techniques to help us understand how our vocabulary is encoded. We will not do that here, but you can find a number of examples online if you are interested.
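
To make the "generalization of a one-hot encoding" idea concrete, the following NumPy sketch uses the illustrative values from the table above and checks that looking up a row of the embedding matrix gives the same result as multiplying the one-hot vector by that matrix. The variable names are chosen for this example only.

```python
import numpy as np

chars = ["h", "e", "l", "o"]

# Embedding matrix copied from the table above: one row per character.
embedding = np.array([
    [1.2, 0.3, 4.3],  # h
    [0.1, 1.5, 7.8],  # e
    [0.5, 3.2, 1.9],  # l
    [3.6, 7.2, 5.8],  # o
])

# Integer-encode the word "hello".
indices = np.array([chars.index(c) for c in "hello"])

# One-hot encoding: rows of the identity matrix.
one_hot = np.eye(len(chars))[indices]

# An embedding lookup is a row selection ...
looked_up = embedding[indices]
# ... which equals multiplying the one-hot rows by the matrix.
via_matmul = one_hot @ embedding

print(indices)                             # [0 1 2 2 3]
print(looked_up.shape)                     # (5, 3)
print(np.allclose(looked_up, via_matmul))  # True
```

In a trained model the matrix entries would be learned rather than fixed, but the lookup itself works the same way.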

## Getting started

Import packages and initialize various global variables. You may want to change these later.

In [1]:
import random
import re

import numpy as np
import requests
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential

In [2]:
# Global constants
window_size = 40 # length of character sequences
batch_size = 32 # batch size for learning
rnn_units = 128 # number of hidden units in the LSTM or GRU cell
epochs = 100 # number of training epochs

corpus_file = 'odyssey.txt'


def load_corpus(file_name, local=True, github_repo='/'):
    """Loads the corpus text from a given file.
    
    Parameters
    ----------
    file_name : str
        If local is True, the full path to the local file, otherwise the file name for the GitHub repo.
    local : bool
        True (default) if file is local, False if file is in a GitHub repo.
    github_repo : str
        GitHub repo storing the file (only used if local is False).
    """
    # Open local file (the context manager closes it after reading)
    if local:
        with open(file_name, 'r') as f:
            return f.read()
    
    # Get file from repository
    page = requests.get(f'https://raw.githubusercontent.com/{github_repo}/{file_name}')
    
    return page.text

## Task 1: Preprocess text data

Fill in the function below to read in the text file given by `file_name` and perform any preprocessing that you feel is necessary. For example, convert the text to lower case in order to reduce the size of the vocabulary. Other examples of processing include replacing accented characters with unaccented characters, removing "unnecessary" punctuation, etc.
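
One of the suggestions above, replacing accented characters with unaccented ones, could be sketched with the standard library as follows. `strip_accents` is a hypothetical helper name, and the sample string is made up for the example.

```python
import unicodedata


def strip_accents(text):
    """Replace accented characters with their unaccented base characters."""
    # NFKD splits e.g. 'é' into 'e' plus a combining accent mark;
    # the combining marks are then dropped.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Pénélope à Ithaque"))  # Penelope a Ithaque
```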

In [3]:
def preprocess_file(file_name):
    """Read a text file, perform preprocessing, and return text as a string.
    
    Parameters
    ----------
    file_name : str
        Name of text file to load.
    
    Returns
    -------
    text : str
        Preprocessed text from the file.
    """
    # Read in file (using load_corpus()) and convert to lower case
    text = load_corpus(file_name).lower()

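    # Perform additional processing
    text = re.sub(r'\d+', '', text)  # remove digits
    text = text.translate({ord(c): None for c in '!@#$"{-}:\''})  # remove selected punctuation
    # Note: this removes ' the ' together with both surrounding spaces, which
    # fuses the neighbouring words (e.g. 'sackedfamous' in the output below)
    text = text.replace(' the ', '')
    text = text.strip()
    text = re.sub(' {2,}', ' ', text)  # collapse runs of spaces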

    return text


# Load and prepare data
text = preprocess_file(corpus_file)

# Shorten text for testing
text = text[:10000]
print(text[:500])
print(len(text))

the odyssey


book i

the gods in councilminervas visit to ithacathe challenge from
telemachus tosuitors.

tell me, o muse, of that ingenious hero who travelled far and wide after
he had sackedfamous town of troy. many cities did he visit, and
many werenations with whose manners and customs he was acquainted;
moreover he suffered much by sea while trying to save his own life and
bring his men safely home; but do what he might he could not save his
men, for they perished through their own sheer f
10000


## Task 2: Generate a dataset for training

Fill in the function below, which takes the document text and a "window" size and returns the training data as well as a list of unique characters representing the vocabulary of the document. The training data consists of two lists. The first is a list of lists of integers (indexing the vocabulary list), each corresponding to a sequence of `window_size` characters found in the document. The second is a list of integers (indexing the vocabulary list) corresponding to the next character in the sequence, for each sequence in the first list.
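
The windowing described above can be illustrated on a tiny string. This is a minimal sketch rather than the reference solution: `sliding_windows` and the `demo_*` names are hypothetical, and a dictionary is used for the character lookup instead of repeated list searches.

```python
def sliding_windows(text, window_size):
    """Return (X, y, vocab) for next-character prediction."""
    vocab = sorted(set(text))
    # A dict lookup avoids repeated linear scans with list.index().
    char_to_ind = {ch: i for i, ch in enumerate(vocab)}
    encoded = [char_to_ind[ch] for ch in text]

    n_windows = len(encoded) - window_size
    X = [encoded[i:i + window_size] for i in range(n_windows)]
    y = [encoded[i + window_size] for i in range(n_windows)]
    return X, y, vocab


demo_X, demo_y, demo_vocab = sliding_windows("hello", window_size=3)
print(demo_vocab)  # ['e', 'h', 'l', 'o']
print(demo_X)      # [[1, 0, 2], [0, 2, 2]]
print(demo_y)      # [2, 3]
```

The two windows are "hel" and "ell", and their targets are the characters that follow them, "l" and "o".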

In [4]:
def make_dataset(text, window_size=40):
    """Create the dataset used to train the RNN.
    
    Parameters
    ----------
    text : str
        String representing text to learn on.
    window_size : int
        Length of character sequence used to predict next character.
    
    Returns
    -------
    X_data : list(list(int))
        List of sequences of size window_size, containing indices into vocab.
        Each sequence represents a sequence of window_size characters found in
        the text. The number of sequences generated will be len(text) - window_size.
    y_data : list(int)
        List of indices corresponding to the characters that follow the
        sequences in X_data.
    vocab : list(char)
        List of characters making up the vocabulary of the text.
    """
    # Determine list of unique characters
    vocab = sorted(list(set(text)))

    # Generate training data
    X_data = []
    y_data = []

    for i in range(len(text)-window_size):
        X_tmp = []

        for char in text[i:i+window_size]:
            X_tmp.append(vocab.index(char))

        X_data.append(X_tmp)
        y_data.append(vocab.index(text[i+window_size]))
    
    return X_data, y_data, vocab


# Retrieve training data
X_data, y_data, vocab = make_dataset(text, window_size=window_size)

# Check if everything is working
print(f"Vocabulary: {vocab}")
print(f"Vocabulary length: {len(vocab)}.\n")
print(f"First element of X_data: {X_data[0]}")
print(f"Text length: {len(text)}.")
print(f"X_data length: {len(X_data)}.\n")
print(f"First element of y_data: {y_data[0]}.")

Vocabulary: ['\n', ' ', ',', '.', ';', '?', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
Vocabulary length: 32.

First element of X_data: [25, 13, 10, 1, 20, 9, 30, 24, 24, 10, 30, 0, 0, 0, 7, 20, 20, 16, 1, 14, 0, 0, 25, 13, 10, 1, 12, 20, 9, 24, 1, 14, 19, 1, 8, 20, 26, 19, 8, 14]
Text length: 10000.
X_data length: 9960.

First element of y_data: 17.


## Task 3: Create the RNN model

Fill in the function below which builds and returns the RNN model, for the given size parameters and RNN layer. The model should take as input a tensor representing batches of character index sequences and output a tensor representing the probabilities of each character in the vocabulary coming next in the sequence, for each sequence in the batch. Use the following architecture.  

- Sequential Keras model
  - [Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) with input and output dimensions equal to the vocab size (you can try using smaller encodings later)
  - [LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM) with num_units
  - [Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) output, with softmax
- `sparse_categorical_crossentropy` loss
- Adam optimizer
- Metrics: accuracy

In [5]:
def rnn_model(num_units, window_size, vocab_size, rnn_layer=layers.LSTM):
    """Creates the RNN model.
    
    Parameters
    ----------
    num_units : int
        Number of hidden units in the LSTM layer.
    window_size : int
        Number of characters in an input sequence.
    vocab_size : int
        Number of unique characters in the vocabulary.
    rnn_layer : Keras RNN layer (RNN, LSTM, GRU)
        Recurrent layer class used to build the model (default is LSTM).
    
    Returns
    -------
    model : Keras model
        RNN model.
    """
    
    # Initialize model object
    model = Sequential()

    # Add layers
    model.add(layers.Embedding(input_dim=vocab_size, output_dim=vocab_size, 
        trainable=True, input_length=window_size))
    model.add(rnn_layer(num_units, return_sequences=False))
    model.add(layers.Dense(vocab_size, activation='softmax'))

    # Compile model
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

    return model

In [6]:
# Instantiate model
model = rnn_model(rnn_units, window_size, len(vocab), layers.GRU)

# Print summary of compiled model
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 40, 32)            1024      
                                                                 
 gru (GRU)                   (None, 128)               62208     
                                                                 
 dense (Dense)               (None, 32)                4128      
                                                                 
Total params: 67,360
Trainable params: 67,360
Non-trainable params: 0
_________________________________________________________________
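
The parameter counts in the summary above can be reproduced by hand. The sketch below assumes the GRU layer keeps Keras' default `reset_after=True`, which gives each of the three gate/candidate blocks two bias vectors; the helper function names are hypothetical.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized row per vocabulary entry.
    return vocab_size * embed_dim


def gru_params(input_dim, units):
    # Update gate, reset gate and candidate state each have an input kernel,
    # a recurrent kernel and two bias vectors (assumes reset_after=True).
    return 3 * (input_dim * units + units * units + 2 * units)


def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output.
    return input_dim * output_dim + output_dim


vocab_size, units = 32, 128
counts = [
    embedding_params(vocab_size, vocab_size),
    gru_params(vocab_size, units),
    dense_params(units, vocab_size),
]
print(counts)       # [1024, 62208, 4128]
print(sum(counts))  # 67360
```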


## Task 4: Train and evaluate

Fill in the code below to train the RNN model. After every 3 epochs, generate 500 characters of text from a random seed sequence to gauge how well the model is doing. This can be done by using the seed to predict the next character in the sequence (take the maximum likelihood character). Append this new character onto the sequence (dropping the first character to maintain the window size) and repeat. Print each new character as you go to generate the text.
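
The generation procedure described above can be sketched independently of Keras. In this illustration `predict_proba` is a hypothetical stand-in for the model's prediction call, and `toy_predict` is a made-up "model" used only to show the window mechanics. Copying the seed keeps the original training sequence untouched.

```python
import numpy as np


def generate_greedy(predict_proba, seed, n_chars):
    """Greedy generation with a fixed-size sliding window.

    predict_proba is a hypothetical callable mapping a list of indices to a
    probability vector over the vocabulary (a stand-in for model.predict).
    """
    window = list(seed)  # copy, so the caller's seed is not modified
    generated = []
    for _ in range(n_chars):
        next_ind = int(np.argmax(predict_proba(window)))
        generated.append(next_ind)
        window = window[1:] + [next_ind]  # drop oldest, append newest
    return generated


def toy_predict(window, vocab_size=4):
    # Toy stand-in "model": always favours (last index + 1) modulo vocab_size.
    probs = np.full(vocab_size, 0.1)
    probs[(window[-1] + 1) % vocab_size] = 0.7
    return probs


seed = [0, 1, 2]
print(generate_greedy(toy_predict, seed, 5))  # [3, 0, 1, 2, 3]
print(seed)                                   # [0, 1, 2]
```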

In [6]:
# Train model
epochs_to_train = 3

for i in range(epochs_to_train, epochs+1, epochs_to_train):
    # Fit model for another epochs_to_train epochs
    model.fit(X_data, y_data, batch_size=batch_size, epochs=epochs_to_train)

    # Set random seed sequence and generated text string
    # (copy the sequence so that X_data is not modified in place below)
    seed_sequence = list(random.choice(X_data))
    generated_text = ''.join([str(vocab[ind]) for ind in seed_sequence])

    for _ in range(500):
        # Predict most probable index and its corresponding character
        predicted_ind = int(np.argmax(model.predict(np.array([seed_sequence]))))
        predicted_char = vocab[predicted_ind]

        # Update generated text and seed sequence
        generated_text = ''.join((generated_text, predicted_char))
        del seed_sequence[0]
        seed_sequence.append(predicted_ind)

    # Print generated text every iteration
    print(f"Generated text after {i} epochs of training:\n\n", 
        generated_text)

Epoch 1/3
Epoch 2/3
Epoch 3/3
Generated text after 3 epochs of training:

 for them.

thensuitors came in and took he and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and a
Epoch 1/3
Epoch 2/3
Epoch 3/3
Generated text after 6 epochs of training:

 longer heed them; we shall never see him he theme them men the mand the mand the mand the mand the mand the mand the mand the mand the mand the mand the mand the mand the mand the mand the mand the mand the mand the mand the mand the mand the mand the mand the mand the mand the mand the mand the mand the man

In [7]:
# Print final generated text
print(generated_text)

anchialus, and i am king oftaphians. i have
come here with my ship and crew, on a voyage to men of a foreign tongue
being bound for temesa with a cargo of iron, and i shall bring back
copper. as for my ship, it lies over yonder offopen country away
fromtown, inharbour rheithron underwooded mountain
neritum. our fathers were friends before us, as old laertes will
tell you, if you will go and ask him. they say, however, that he never
comes to town now, and lives by himself incountry, faring hardly,
with an old woman to look after him an


## Task 5: Repeat your experiment using a GRU layer

Repeat tasks 3 and 4 with the GRU layer in place of the LSTM. Do you notice any differences in the performance, training, or text generation?

<font color='#4863A0'>On this dataset, the network with a GRU layer generally trained faster than the one with an LSTM layer, while both showed fairly good predictive ability. In the first training iterations the model got stuck repeating certain phrases, but with further training it generated increasingly coherent text.</font>

## Model with sequences of words

In [12]:
# Load and prepare data
text = preprocess_file(corpus_file)
text = text[:500000]

# Transform text to sequence of sentences
text_to_sentences = [sen.split() for sen in text.split('.')]

# Build dictionary of indices
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text_to_sentences)

# Change texts into sequence of indexes
text_numeric = tokenizer.texts_to_sequences(text_to_sentences)

# Find vocabulary size
vocab_size = len(tokenizer.word_index) + 1

# Pad sequences
pad_length = 7
text_pad = pad_sequences(text_numeric, pad_length)

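# Create list of next words
# Note: the target for each sequence is the first token of the following padded
# sequence (0 for the last one), not the word that follows the current window
next_words = np.append(np.asarray([elem[0] for elem in text_pad[1:]]), 0).reshape(-1, 1)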

In [13]:
# Initialize model object
model_words = Sequential()

# Add layers
model_words.add(layers.Embedding(input_dim=vocab_size, output_dim=3000, 
        trainable=True, input_length=pad_length))
model_words.add(layers.GRU(rnn_units, return_sequences=False))
model_words.add(layers.Dense(vocab_size, activation='softmax'))

# Compile model
model_words.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Print summary of compiled model
model_words.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 7, 3000)           32964000  
                                                                 
 gru_2 (GRU)                 (None, 128)               1201920   
                                                                 
 dense_2 (Dense)             (None, 10988)             1417452   
                                                                 
Total params: 35,583,372
Trainable params: 35,583,372
Non-trainable params: 0
_________________________________________________________________


In [14]:
# Train model
model_words.fit(text_pad, next_words, batch_size=4, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1e4285aa610>

In [15]:
# Set random seed sequence and generated text string
seed_sequence = random.choice(text_pad).reshape(1, -1)
generated_text = ' '.join([tokenizer.index_word.get(ind, str(0)) for ind in seed_sequence[0]])

for _ in range(100):
    # Predict most probable index and its corresponding word
    predicted_ind = np.argmax(model_words.predict(seed_sequence))
    predicted_word = tokenizer.index_word.get(predicted_ind, 0)
    
    # Update generated text and seed sequence
    generated_text = ' '.join((generated_text, str(predicted_word)))
    seed_sequence = np.append(np.delete(seed_sequence, 0), predicted_ind).reshape(1, -1)

# Print generated text
print("Generated text:\n\n",
    generated_text)

Generated text:

 laid it down ondeck of the ship theoclymenus she to to to to to to by cover no she who on on on went somebody may heaven warm help come wretches andsheep wind away away away away away away give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give give


<font color='#4863A0'>As can be seen, the model failed to produce coherent text, despite good accuracy on the training set. This suggests that it is worth using a more complex and refined model with several layers and dropout, as well as preprocessing the entire text more carefully.</font>
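
One possible direction for the more careful preprocessing mentioned above is to build the training pairs with a sliding window over the word sequence, so that each target really is the word following its input window. This is a sketch of an alternative, not the approach used in the cells above; `word_windows` and the `demo_*` names are hypothetical.

```python
def word_windows(words, window_size):
    """Return (X, y, word_index) where y[i] is the word following X[i]."""
    # Index 0 is left unused, mirroring the convention of reserving it for padding.
    word_index = {w: i + 1 for i, w in enumerate(sorted(set(words)))}
    encoded = [word_index[w] for w in words]

    n_windows = len(encoded) - window_size
    X = [encoded[i:i + window_size] for i in range(n_windows)]
    y = [encoded[i + window_size] for i in range(n_windows)]
    return X, y, word_index


demo_words = "tell me o muse of that ingenious hero".split()
demo_X, demo_y, demo_index = word_windows(demo_words, window_size=3)

# Map the integer targets back to words to check the pairing.
index_to_word = {i: w for w, i in demo_index.items()}
print(len(demo_X))                         # 5
print([index_to_word[t] for t in demo_y])  # ['muse', 'of', 'that', 'ingenious', 'hero']
```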

## Extra

This exercise was a small taste of the power of RNN models. Here are some other things you can try if you want to go further with the time you have left.

- Sample the output distribution from the model to generate the next character in the sequence (instead of taking the most probable). This will add some more randomness to your text generation.
- Build a vocabulary of words, rather than characters. This will highlight the importance of the embedding layer (you will need to use a smaller output dimension for the embedding than the vocabulary size).
- Try other media types (e.g., sound, video, guitar tabs...)
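
The first suggestion in the list above, sampling from the output distribution instead of taking the most probable character, might look like the sketch below. The `temperature` parameter is an extra knob not mentioned in the text, `sample_index` is a hypothetical helper name, and the probability vector is made up rather than taken from a trained model.

```python
import numpy as np


def sample_index(probs, temperature=1.0, rng=None):
    """Draw an index from a probability vector instead of taking the argmax.

    Temperatures below 1 sharpen the distribution (closer to argmax),
    temperatures above 1 flatten it (more random).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Rescale in log space; the small constant guards against log(0).
    logits = np.log(np.asarray(probs, dtype=np.float64) + 1e-12) / temperature
    scaled = np.exp(logits - logits.max())
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))


probs = [0.7, 0.2, 0.1]  # made-up model output
rng = np.random.default_rng(0)
draws = [sample_index(probs, temperature=1.0, rng=rng) for _ in range(1000)]

# Empirical frequencies of the sampled indices, to compare against probs.
print(np.bincount(draws, minlength=3) / 1000)
```

In the generation loop, a call like this would take the place of the `np.argmax` step, with the model's predicted probability vector as `probs`.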