<a href="https://colab.research.google.com/github/ulusalfn/Tutorial-1/blob/main/GENAI_Generating_Text_with_LSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generating Text with LSTM Networks

In this notebook, you will create character-based models for text generation with LSTM recurrent neural networks.

**Note:** LSTM models are slow to train. We will start with a small LSTM network and train it for a short time to get a feel for how they work before creating a slightly more complex model and training it for longer to obtain better results.


## Project Gutenberg

Many classic texts are no longer protected by copyright. This means you can download the full text of these books for free and experiment with it. [Project Gutenberg](https://www.gutenberg.org/) provides an extensive collection of books that are no longer under copyright.

For this notebook, we will download the text for [Alice's Adventures in Wonderland](https://www.gutenberg.org/ebooks/11) by Lewis Carroll.

Before we start building our model, we will need to prepare the text of this book so that we can easily work with it. The simplest way to do this is to [download the complete plain text (UTF-8) version of the book](https://www.gutenberg.org/cache/epub/11/pg11.txt) to your local machine. Save this version of the text with the filename `wonderland_raw.txt`.

Project Gutenberg adds header and footer information to each book, which is not part of the original text. Because we only want to train our model on the original text, open the file you have just saved in a text editor and delete the header and footer information.

The header text ends with:

> `*** START OF THIS PROJECT GUTENBERG EBOOK ALICE'S ADVENTURES IN WONDERLAND ***`

The footer is all of the text after the line containing:

> `THE END`

You should be left with a text file that has a bit over 3,300 lines of text. Save this new version of the file as `wonderland_text.txt`. You will upload this file to this notebook to be the dataset to train your model.

To upload the file, you should be able to open the *Files* panel on the left of this notebook and drag the `wonderland_text.txt` into the panel.
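
If you prefer to trim the file programmatically rather than in a text editor, the following is a minimal sketch. It assumes the header ends on a line beginning with `*** START OF` and that the book text ends on a line reading `THE END`, as described above. The function name `strip_gutenberg_boilerplate` is hypothetical, and the exact marker wording may differ in the copy you download.

```python
def strip_gutenberg_boilerplate(text, start_prefix="*** START OF", end_marker="THE END"):
    """Keep the lines after the header marker, up to and including the end marker."""
    lines = text.splitlines()
    # Index of the first line after the header marker (0 if no marker is found).
    start = next(
        (i + 1 for i, line in enumerate(lines) if line.startswith(start_prefix)), 0
    )
    # One past the last line that reads exactly "THE END" (all lines if not found).
    end = next(
        (i + 1 for i in range(len(lines) - 1, -1, -1) if lines[i].strip() == end_marker),
        len(lines),
    )
    return "\n".join(lines[start:end]).strip() + "\n"


# A tiny stand-in for the downloaded file.
sample = (
    "Project Gutenberg header\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK ALICE'S ADVENTURES IN WONDERLAND ***\n"
    "\n"
    "CHAPTER I. Down the Rabbit-Hole\n"
    "THE END\n"
    "Project Gutenberg footer\n"
)
cleaned = strip_gutenberg_boilerplate(sample)
print(cleaned)
```

To apply this to the real file, you would read `wonderland_raw.txt`, pass its contents through the function, and write the result to `wonderland_text.txt`.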

## A Small LSTM Network

In this section, we will build a small LSTM network and train it to predict character sequences from *Alice in Wonderland*.

We start by importing the libraries that we'll use to construct and train our model.

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import LSTM
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import to_categorical

Next, we load the text file that we've uploaded into memory and convert all of the text to lowercase to reduce the vocabulary (of characters) that the network must learn.

In [None]:
# load the UTF-8 text and convert it to lowercase
filename = "wonderland_text.txt"
with open(filename, 'r', encoding='utf-8') as f:
  raw_text = f.read()
raw_text = raw_text.lower()

The next step is to prepare the data by converting it into an appropriate form to input into a network. We don't model the characters directly; instead, we convert the characters into integer tokens.

We do this by first creating a sorted list of all of the different characters in the book, then creating a map (a `dict` in Python) from each character to a unique integer by enumerating the list and assigning each character its index. The particular integer assigned to each character is not important for the model.

In [None]:
# create mapping of unique chars to integers
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))

We can examine the vocabulary of characters that have been extracted from the book by printing the `chars` list, which contains the unique lowercase letters, numbers, punctuation and special characters that appear in the text.

In [None]:
print(chars)

['\n', ' ', '!', '"', "'", '(', ')', '*', ',', '-', '.', '0', '3', ':', ';', '?', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '\ufeff']


The vocabulary still includes some characters that we might want to remove from the text, e.g., `*` or the byte-order mark `\ufeff` at the start of the file. Removing them may improve the model by eliminating text we do not want to generate, but the current vocabulary is sufficient for our purpose of generating some text with an LSTM.

We can summarise the dataset in terms of the size of the whole text and the size of the vocabulary.

In [None]:
n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: {}".format(n_chars))
print("Total Vocab: {}".format(n_vocab))

Total Characters: 144431
Total Vocab: 46


We can see that the text has a little under 145,000 characters in total and the vocabulary has 46 characters. (Your values should be close to these, but you may have a slightly bigger or smaller vocabulary depending on how much you removed from the original raw file.)

We now need to define the training data for the network. There is a lot of flexibility in how you choose to break up the text and present it to the network.

In this tutorial, we will split the text into sequences with a fixed length of 100 characters. There is nothing special about this length; it is arbitrary, but it should be sufficient to demonstrate the use of an LSTM to generate text character by character.

Another common approach to breaking up a dataset like this is to find the longest sentence in the text and use this as the sequence length. The text can then be broken into sentences, padding the shorter sequences such that the input to the network is of a uniform length.

Each training pattern for the network comprises 100 time steps of one character (X) followed by one character output (y). When creating these sequences, we slide this window along the whole book one character at a time, allowing each character a chance to be learned from the 100 characters that preceded it, with the exception of the first 100 characters.

For example, if the sequence length were just 5, then the first two training patterns would be as follows:

> X: `CHAPT` -> y: `E`  
> X: `HAPTE` -> y: `R`
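
The sliding window can be sketched on a toy string. The helper name `sliding_windows` is hypothetical and the string `"CHAPTER I"` is only an illustration. The notebook builds the same kind of pairs over the whole book and converts them to integers at the same time.

```python
def sliding_windows(text, seq_length):
    """Return (input, target) pairs: seq_length characters and the character that follows."""
    return [
        (text[i:i + seq_length], text[i + seq_length])
        for i in range(len(text) - seq_length)
    ]


pairs = sliding_windows("CHAPTER I", seq_length=5)
for seq_in, seq_out in pairs[:2]:
    print("X:", seq_in, "-> y:", seq_out)
print("patterns:", len(pairs))
```

The number of patterns is always the length of the text minus the sequence length, because the first `seq_length` characters have no full window preceding them.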

As we split the book into sequences, we also convert the characters to integers using the map we prepared earlier.

In [None]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
  seq_in = raw_text[i:i + seq_length]
  seq_out = raw_text[i + seq_length]
  dataX.append([char_to_int[char] for char in seq_in])
  dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: {}".format(n_patterns))

Total Patterns: 144331


Running the code to this point shows that splitting the dataset into training data gives us just under 145,000 training patterns.

This makes sense because, excluding the first 100 characters, we have one training pattern to predict each of the remaining characters in the text.

Now that we have prepared the training data, we need to transform it again for use with Keras.

First, we transform the list of input sequences into the form \[samples, time steps, features\], which is expected by an LSTM network.

Next, we rescale the integers to the range 0 to 1, which makes the patterns easier to learn for the LSTM network, whose gates use the sigmoid activation function by default.

Finally, we convert the output patterns (single characters converted to integers) into a one-hot encoding. This is so that we can configure the network to predict the probability of each of the different characters in the vocabulary. Each `y` value is converted into a sparse vector with a length of the vocabulary, full of zeros, except with a single 1 in the column for the letter (integer) that the pattern represents.

For example, when “n” (integer value 32) is one-hot encoded, it will look something like:

> `[0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0. ]`

In [None]:
# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, seq_length, 1))
# normalise the inputs to be in the range 0..1
X = X / float(n_vocab)
# one hot encode the outputs
y = to_categorical(dataY)
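
To make the last two transformations concrete, here is a plain-Python sketch of what happens to a single pattern. It is illustrative only: the helper names `scale_inputs` and `one_hot` are hypothetical, and the notebook itself relies on NumPy division and Keras's `to_categorical`.

```python
def scale_inputs(pattern, vocab_size):
    """Map integer tokens to floats in [0, 1) by dividing by the vocabulary size."""
    return [value / float(vocab_size) for value in pattern]


def one_hot(index, vocab_size):
    """Return a list of vocab_size zeros with a single 1.0 at position `index`."""
    vector = [0.0] * vocab_size
    vector[index] = 1.0
    return vector


vocab_size = 46  # vocabulary size reported earlier in this notebook
scaled = scale_inputs([0, 23, 45], vocab_size)
target = one_hot(32, vocab_size)  # 32 is the integer assigned to "n" in the vocabulary above
print(scaled)
print(target.index(1.0), len(target), sum(target))
```

Because the largest token is `vocab_size - 1`, every scaled input is strictly less than 1.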

We can now define the LSTM model. Here, we define a single hidden LSTM layer with 256 memory units. The network uses dropout with a probability of 20%. The output layer is a Dense layer using the softmax activation function to output a probability prediction (0..1) for each of the different characters.

The problem is really a single-character classification problem with the number of classes equal to the size of the vocabulary. As such, the model is compiled to minimise the categorical cross-entropy loss using the Adam optimisation algorithm.

In [None]:
# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

There is no test dataset in this example. We are modelling the entire training dataset to learn the probability of each character in a sequence. It can be useful to define a validation set, as we would normally for such a categorisation task, in particular to avoid overfitting on the given dataset. But for the sake of simplicity, this has been omitted from this example.
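
If you do want a validation set, one simple option is to hold out the final portion of the patterns. The sketch below is an illustration rather than part of the original notebook: `holdout_split` is a hypothetical helper, and the 10% fraction is arbitrary. Holding out a contiguous tail, rather than a random sample, limits the overlap between training and validation windows, since neighbouring patterns share 99 of their 100 characters.

```python
def holdout_split(inputs, targets, val_fraction=0.1):
    """Split parallel sequences into a training head and a validation tail."""
    if len(inputs) != len(targets):
        raise ValueError("inputs and targets must have the same length")
    n_val = int(len(inputs) * val_fraction)
    split = len(inputs) - n_val
    return (inputs[:split], targets[:split]), (inputs[split:], targets[split:])


# Toy stand-ins for dataX and dataY.
toy_inputs = [[i, i + 1, i + 2] for i in range(20)]
toy_targets = [i + 3 for i in range(20)]
(train_X, train_y), (val_X, val_y) = holdout_split(toy_inputs, toy_targets)
print(len(train_X), len(val_X))
```

Keras's `fit` method also accepts a `validation_split` argument, which likewise takes the validation samples from the end of the arrays.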

Even a small LSTM network can be slow to train. Because of this, we use model checkpointing to record all the network weights to a file each time an improvement in the loss is observed at the end of an epoch. This means that we can use the best set of weights (lowest loss) to instantiate our generative model in the next section.

In [None]:
# define the checkpoint
filepath="weights-improvement-{epoch:02d}-{loss:.4f}.keras"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

We can now fit our model to the data. Here, we use a modest 20 epochs and a large batch size of 128 patterns to ensure that we get a result within a relatively short period of time.

In [None]:
model.fit(X, y, epochs=20, batch_size=128, callbacks=callbacks_list)

Epoch 1/20
Epoch 1: loss improved from 1.96180 to 1.93548, saving model to weights-improvement-01-1.9355.hdf5
Epoch 2/20
Epoch 2: loss improved from 1.93548 to 1.91284, saving model to weights-improvement-02-1.9128.hdf5
Epoch 3/20
Epoch 3: loss improved from 1.91284 to 1.89150, saving model to weights-improvement-03-1.8915.hdf5
Epoch 4/20
Epoch 4: loss improved from 1.89150 to 1.87459, saving model to weights-improvement-04-1.8746.hdf5
Epoch 5/20
Epoch 5: loss improved from 1.87459 to 1.85369, saving model to weights-improvement-05-1.8537.hdf5
Epoch 6/20
Epoch 6: loss improved from 1.85369 to 1.83492, saving model to weights-improvement-06-1.8349.hdf5
Epoch 7/20
Epoch 7: loss improved from 1.83492 to 1.82006, saving model to weights-improvement-07-1.8201.hdf5
Epoch 8/20
Epoch 8: loss improved from 1.82006 to 1.80218, saving model to weights-improvement-08-1.8022.hdf5
Epoch 9/20
Epoch 9: loss improved from 1.80218 to 1.78437, saving model to weights-improvement-09-1.7844.hdf5
Epoch 10/2

KeyboardInterrupt: ignored

After running the example, a number of weight checkpoint files should be visible in the local directory.

All except the one with the smallest loss value can be deleted. The value of the loss is encoded in the filename of the checkpoint to make identifying the best checkpoint simple.
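
Because the loss is part of each filename, the best checkpoint can also be found with a few lines of code. This is a sketch that assumes the filenames follow the `weights-improvement-{epoch}-{loss}` pattern defined above. The helper `best_checkpoint` and the example filenames are hypothetical.

```python
import re

# Captures the epoch number and the loss value encoded in the filename.
CHECKPOINT_PATTERN = re.compile(r"weights-improvement-(\d+)-(\d+\.\d+)")


def best_checkpoint(filenames):
    """Return the filename whose encoded loss is lowest, or None if none match."""
    scored = []
    for name in filenames:
        match = CHECKPOINT_PATTERN.search(name)
        if match:
            scored.append((float(match.group(2)), name))
    return min(scored)[1] if scored else None


# Hypothetical filenames following the pattern used by the checkpoint callback.
names = [
    "weights-improvement-01-1.9355.keras",
    "weights-improvement-02-1.9128.keras",
    "weights-improvement-03-1.8915.keras",
    "wonderland_text.txt",
]
best = best_checkpoint(names)
print(best)
```

In the notebook, you could pass `os.listdir(".")` instead of the hard-coded list.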

If the network loss decreased every epoch, the model would likely benefit from additional training. But before doing that, let's have a look at what this network can generate.

## Generating Text

Generating text using the trained LSTM network is relatively straightforward.

First, we load the data and define the network in exactly the same way, except the network weights are loaded from a checkpoint file, and the network does not need to be re-trained. We do this because the last epoch of training may not have had the lowest loss. By reloading the weights from the checkpoint with the lowest loss, we ensure that the model is the best it can be.

**Note**: You will need to change the name of the checkpoint file in the code below to match the name of the file that has the lowest loss in your run.

In [None]:
# load the network weights
filename = "weights-improvement-19-1.7580.keras"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')

To convert back from the integer tokens to the unique characters in the text, we create a reverse mapping.

In [None]:
int_to_char = dict((i, c) for i, c in enumerate(chars))
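
As a quick sanity check, the forward and reverse mappings should round-trip any text built from the vocabulary. The sketch below uses a toy string and `toy_`-prefixed names (hypothetical, chosen so that they do not overwrite the notebook's own `chars`, `char_to_int` and `int_to_char`).

```python
toy_text = "down the rabbit-hole"
toy_chars = sorted(set(toy_text))
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}
toy_int_to_char = {i: c for i, c in enumerate(toy_chars)}

# Encode to integer tokens, then decode back to characters.
encoded = [toy_char_to_int[c] for c in toy_text]
decoded = "".join(toy_int_to_char[i] for i in encoded)
print(encoded)
print(decoded)
```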

The simplest way to use an LSTM model to make predictions is to start with a seed sequence as input, generate the next character, then update the input by adding the generated character to the end and trimming off the first character. This process is repeated for as long as desired to predict new characters (e.g., a sequence of 1,000 characters in length).

The following defines a function that does exactly that and outputs the generated characters as it goes. It picks a random sequence of characters from the dataset as its initial input pattern, or seed. (We will reuse this function with a larger model below, so the function takes the model to use to generate predictions as a parameter.)

In [None]:
import sys

def generate_text(model, length=1000):
  # pick a random seed pattern; copy it so that dataX is not modified
  start = np.random.randint(0, len(dataX))
  pattern = list(dataX[start])
  print("Seed:")
  print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
  print("\n\nGenerated:")
  # generate characters
  for i in range(length):
    # scale the current window exactly as the training inputs were scaled
    x = np.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_vocab)
    prediction = model.predict(x, verbose=0)
    # choose the most probable next character
    index = int(np.argmax(prediction))
    result = int_to_char[index]
    sys.stdout.write(result)
    sys.stdout.flush()
    # slide the window: append the new character and drop the oldest
    pattern.append(index)
    pattern = pattern[1:]
  print("\n\nDone")

We call this function with the model we want it to use to generate the text.

In [None]:
generate_text(model)

Running this function first outputs the selected random seed, then each character as it is generated.

**Note**: Your results will likely vary, because both the training process and the choice of seed are random. Consider running the text generation function a few times and comparing the outputs.

Some observations about the generated text:

- Characters are separated into word-like groups, and some are actual English words, e.g., "it", "to", "tea", "she", "more", but many are not, e.g., "hxiniin", "lirtle", "maae", "boeme"
- Occasionally, some of the words in a sequence make sense, e.g., "and the white rabbit" but many don't

That a small LSTM network learning a character-based model can produce this output is impressive, although far from perfect. In the next section, we will look at improving the quality of the results by developing a larger LSTM network.

## A Bigger LSTM Network

We got results, but not great results in the previous section. Now, we will try to improve the quality of the generated text by creating a larger network.

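The definition of the bigger model is not particularly complex: we simply stack two more LSTM layers on top of the first, each followed by dropout set to 20%. The first two LSTM layers use `return_sequences=True` so that each passes its full output sequence to the next LSTM layer. We keep the size of the LSTM layers the same at 256 units.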

In [None]:
bigger_model = Sequential()
bigger_model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
bigger_model.add(Dropout(0.2))
bigger_model.add(LSTM(256, return_sequences=True))
bigger_model.add(Dropout(0.2))
bigger_model.add(LSTM(256))
bigger_model.add(Dropout(0.2))
bigger_model.add(Dense(y.shape[1], activation='softmax'))
bigger_model.compile(loss='categorical_crossentropy', optimizer='adam')

bigger_model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_7 (LSTM)               (None, 100, 256)          264192    
                                                                 
 dropout_5 (Dropout)         (None, 100, 256)          0         
                                                                 
 lstm_8 (LSTM)               (None, 100, 256)          525312    
                                                                 
 dropout_6 (Dropout)         (None, 100, 256)          0         
                                                                 
 lstm_9 (LSTM)               (None, 256)               525312    
                                                                 
 dropout_7 (Dropout)         (None, 256)               0         
                                                                 
 dense_1 (Dense)             (None, 45)               
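
The LSTM parameter counts in the summary can be checked by hand. A standard LSTM layer has four sets of weights (three gates plus the candidate cell state), each with an input kernel, a recurrent kernel and a bias, which gives `4 * ((input_dim + units) * units + units)` parameters. The helper name `lstm_param_count` is hypothetical and used only for this check.

```python
def lstm_param_count(input_dim, units):
    """Trainable parameters in a standard LSTM layer with biases."""
    return 4 * ((input_dim + units) * units + units)


first_layer = lstm_param_count(1, 256)     # one feature per time step
later_layers = lstm_param_count(256, 256)  # fed by the 256 outputs of the previous LSTM
print(first_layer)   # 264192
print(later_layers)  # 525312
```

These values match the `Param #` column for the first LSTM layer and for the two layers stacked on top of it.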

We'll define a different filepath pattern for saving the checkpoints for this bigger model, so that we can tell them apart from the previous checkpoints. We'll also define a new checkpoint callback for this model. The previous callback remembers the lowest loss it has seen, so reusing it would mean that no checkpoints are saved until the bigger model beats the small model's best loss.

In [None]:
bigger_filepath="weights-improvement-{epoch:02d}-{loss:.4f}-bigger.keras"
bigger_checkpoint = ModelCheckpoint(bigger_filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
bigger_callbacks_list = [bigger_checkpoint]

Finally, we increase the number of training epochs from 20 to 50 and decrease the batch size from 128 to 64 to give the network more of an opportunity to be updated and learn. These changes will make the training slower but should result in a much more capable model.

In [None]:
# fit the model
bigger_model.fit(X, y, epochs=50, batch_size=64, callbacks=bigger_callbacks_list)

Epoch 1/50
Epoch 1: loss improved from inf to 1.79416, saving model to weights-improvement-01-1.7942-bigger.hdf5
Epoch 2/50
Epoch 2: loss improved from 1.79416 to 1.74468, saving model to weights-improvement-02-1.7447-bigger.hdf5
Epoch 3/50
Epoch 3: loss improved from 1.74468 to 1.70146, saving model to weights-improvement-03-1.7015-bigger.hdf5
Epoch 4/50
Epoch 4: loss improved from 1.70146 to 1.66295, saving model to weights-improvement-04-1.6630-bigger.hdf5
Epoch 5/50
Epoch 5: loss improved from 1.66295 to 1.62978, saving model to weights-improvement-05-1.6298-bigger.hdf5
Epoch 6/50
Epoch 6: loss improved from 1.62978 to 1.59905, saving model to weights-improvement-06-1.5990-bigger.hdf5
Epoch 7/50
Epoch 7: loss improved from 1.59905 to 1.56894, saving model to weights-improvement-07-1.5689-bigger.hdf5
Epoch 8/50
Epoch 8: loss improved from 1.56894 to 1.54251, saving model to weights-improvement-08-1.5425-bigger.hdf5
Epoch 9/50
Epoch 9: loss improved from 1.54251 to 1.52035, saving mo

<keras.callbacks.History at 0x7fc00e544650>

After running this bigger model, you can expect to achieve a loss of about 1.2. This should be significantly less than the loss achieved for the original model, which would often plateau around 1.6.
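
To put these loss values in context: Keras reports categorical cross-entropy using the natural logarithm, so exponentiating the loss gives the perplexity, roughly the number of characters the model is still choosing between at each step. The sketch below compares the two losses quoted above with a uniform guess over a 46-character vocabulary. It is an interpretation aid, not a result from the notebook.

```python
import math

vocab_size = 46  # vocabulary size reported earlier in this notebook
# Loss obtained by guessing every character with equal probability.
uniform_loss = math.log(vocab_size)

rows = [("uniform guess", uniform_loss), ("small model", 1.6), ("bigger model", 1.2)]
for label, loss in rows:
    print("{}: loss {:.2f}, perplexity {:.1f}".format(label, loss, math.exp(loss)))
```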

Before proceeding, make sure that you load the checkpoint for the lowest loss, in case this wasn't the final epoch.

In [None]:
# load the network weights
bigger_filename = "weights-improvement-20-1.1860-bigger.keras"
bigger_model.load_weights(bigger_filename)
bigger_model.compile(loss='categorical_crossentropy', optimizer='adam')

We can now run the function that we defined earlier with the bigger model, to generate some text.

In [None]:
generate_text(bigger_model)

Seed:
"  alice
for some time with great curiosity, and this was his first speech.

'you should learn not to  "


Generated:
the birtant ' said the queen, 
'i don't know what you were it in a parrer as it was,' the mock turtle seplied in an offended tone. and the queen surnid another vith one of the gall.
and the queen she heard the queen was she was a little shriek sile she was a git drrter, and the queen she heard the queen was sie was a little dourte, and the queen surnid nut the rabbit hole of the dorrs, and the pueen said to the gryphon. 
'what do you think you miked to see it tr and sereamed to be a very dire, i should think you were the door way i manegeng ane more of the cance.

 will you, won't you, will you, will you, will you, will you, will you, will you, will you, will you, won't you, won't you, won't you, won't you, won't you, won't you, won't you, won't you, won't you, won't you, won't you, won't you, won't you, won't you, won't you, won't you, won't you, won't you, won't

We can see that there are generally fewer spelling mistakes, and the text looks more realistic but is still quite nonsensical.

For example, the same phrases get repeated again and again, e.g., "will you" and "won't you". Quotes are opened but not closed.
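
One likely contributor to these loops is that `generate_text` always picks the single most probable character (`np.argmax`), so once the model enters a cycle it cannot leave it. A common alternative, not used in this notebook, is to sample from the predicted distribution, optionally sharpened or flattened by a temperature. The following is a sketch under that assumption. `sample_index` is a hypothetical helper, written with the standard library so that it can be read in isolation.

```python
import math
import random


def sample_index(probabilities, temperature=1.0, rng=random):
    """Sample an index from a probability list after temperature rescaling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Rescale log-probabilities; a small epsilon avoids log(0).
    logits = [math.log(p + 1e-12) / temperature for p in probabilities]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    weights = [math.exp(value - peak) for value in logits]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the demonstration is repeatable
probs = [0.7, 0.2, 0.1]
draws = [sample_index(probs, temperature=1.0, rng=rng) for _ in range(1000)]
print([draws.count(i) for i in range(len(probs))])
```

Inside `generate_text`, this would replace the `np.argmax(prediction)` line with a call such as `sample_index(prediction[0].tolist(), temperature=0.5)`. Lower temperatures behave more like argmax, while higher ones produce more varied (and more error-prone) text.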

The results of this model are significantly better than the previous model, but there is still a lot of room for improvement.

## Improving the Model

Here are some ideas for improving the model that you might want to try:

- Predict fewer than 1,000 characters as output for a given seed (the LSTM performs better the closer it is to the seed, so shorter sequences should be more coherent)
- Clean the source text more thoroughly, e.g., remove all punctuation from the source text and, therefore, from the model's vocabulary
- Try a different source text; Alice in Wonderland is a brilliant book, but Lewis Carroll was a master of nonsensical rhymes and it may be easier to judge the success of the model on a different source
- Increase the number of training epochs; you can easily do this by loading a checkpoint and continuing the training from there
- Try changing the dropout percentage to see if this has a noticeable impact on the generated output
- Try changing the batch size, e.g., start with a batch size of 1 and slowly increase the batch size to see if you can find one that performs better
- Add more memory units and/or layers to the model
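
As a starting point for the text-cleaning suggestion above, punctuation can be stripped with the standard library before the vocabulary is built. This is a minimal sketch. `remove_punctuation` is a hypothetical helper, and `string.punctuation` covers ASCII punctuation only, so characters such as curly quotes would need to be handled separately.

```python
import string


def remove_punctuation(text):
    """Delete every ASCII punctuation character from `text`."""
    return text.translate(str.maketrans("", "", string.punctuation))


sample = "'oh dear! oh dear! i shall be late!' said the rabbit."
cleaned = remove_punctuation(sample)
print(cleaned)
print(len(set(sample)), "->", len(set(cleaned)), "unique characters")
```

In the notebook, this would be applied to `raw_text` before `chars` is computed, which shrinks the vocabulary and therefore the size of the output layer.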

## Tutorial Assignment

Attempt at least 2 of the above suggestions (or some changes of your own devising, e.g., you might want to replace the LSTM layers with GRU layers and see how this impacts speed of training and performance on your dataset). Report on your experiments by submitting your notebook with comments on the things that you tried, what worked better than expected, what worked worse.