# Section 3 MN5002 (LSTMs, Language Models, seq2seq architectures) - Assignment
Firstly, modify and run the code below using the [Alice In Wonderland text](https://www.gutenberg.org/cache/epub/11/pg11.txt) and consider the output in the context of the review remarks in the section **Main Exercise** below.

Then choose a text of your own from [Project Gutenberg's texts](https://www.gutenberg.org/ebooks) and download it in the **Plain Text UTF-8** format. When you are considering a text, make sure the corpus is large enough, for example by choosing an author with a lot of text available. Project Gutenberg is well worth exploring; there is a great deal there.

Using your chosen text(s), create an LSTM RNN based Language Model with Keras.
Improve the language model using what you learned from the example notebook. Show examples of the generated text, critique them, and implement modifications to address the issues you see.

Also, post comments in the Section 3 Discussion section.
Make sure the cell outputs are saved in your notebook, so that when the code is submitted the results are visible without anyone having to run it.

### More tips

It is strongly recommended that this is run with **a Colab GPU runtime** using **Runtime -> Change runtime type** and selecting GPU. You will need to store your model files and data on your Google Drive if you do not want to lose your work between runtimes and you will need to mount your Google Drive each time you run this code so that you can access your data. The base model took me about 7 minutes to train using a T4 Colab GPU runtime.

## Automatic text generation with LSTM RNNs in Python with Keras
Generate text that mimics an author's style. The approach below is based on [code by Jason Brownlee](https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/).

In [1]:
import numpy as np
import tensorflow as tf
import keras
import os
import sys
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import LSTM
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import to_categorical


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Load a plain-text file and convert the characters to lowercase. This approach is called **case folding**, and it reduces the number of distinct tokens under consideration.

In [3]:
# load the text (UTF-8) and convert to lowercase
os.chdir('/content/drive/MyDrive/MSc NLP Files')
filename = "AliceInWonderland.txt"
with open(filename, 'r', encoding='utf-8') as f:
	raw_text = f.read()
raw_text = raw_text.lower()
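
To see what case folding does to the vocabulary, here is a minimal sketch that compares the set of distinct characters before and after lowercasing. The `sample` string is a made-up stand-in for `raw_text`, not the real corpus.

```python
# Hypothetical sample; in the notebook you would use the text read from file.
sample = "Alice was beginning to get very tired. ALICE! said the Rabbit."

vocab_before = sorted(set(sample))
vocab_after = sorted(set(sample.lower()))

print("Vocabulary before case folding:", len(vocab_before))
print("Vocabulary after case folding: ", len(vocab_after))

# Characters that disappear from the vocabulary once case is folded
removed = sorted(set(vocab_before) - set(vocab_after))
print("Removed:", removed)
```

The trade-off is that the model can no longer learn where capital letters belong, for example at the start of sentences or in names.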

Prepare the data for modelling by a neural network. This is a character-based representation, so each character is associated with a unique integer.

In [4]:
# create mapping of unique chars to integers
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))

Now, get a quick summary of what we have so far.

In [5]:
n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)

Total Characters:  163917
Total Vocab:  64


If the vocabulary has more than 26 characters, the text contains several other types of character besides letters, and these may or may not add to the meaning. Removing the extra characters might make it easier for the model to extract meaning, or it might throw away useful structure, so this is something to experiment with. (Watch out for this in your generated texts later.)
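
Before deciding what to remove, it helps to look at which characters are actually present. The sketch below counts the characters that fall outside an allowed set and then filters them out. The `sample` string and the contents of `ALLOWED` are assumptions made for illustration; choose the allowed set to suit your own corpus.

```python
from collections import Counter
import string

# Hypothetical snippet standing in for raw_text (already lowercased).
sample = "\ufeffalice's adventures\r\n[illustration] *** \u201coh dear!\u201d ***"

# Assumption: keep letters, space, newline and basic punctuation only.
ALLOWED = set(string.ascii_lowercase + " \n.,;:!?'\"-")

counts = Counter(sample)
extras = {ch: n for ch, n in counts.items() if ch not in ALLOWED}
print("Characters outside the allowed set:", extras)

# One possible cleaning strategy: drop everything outside the allowed set.
cleaned = "".join(ch for ch in sample if ch in ALLOWED)
print(repr(cleaned))
```

Dropping characters is only one option. Curly quotes, for instance, could instead be mapped to straight quotes so that the dialogue structure survives.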

### Setting up the generation of text
Text generation is, under the hood, very similar to the normal classification tasks that you will have seen before. Ordinary training for classification uses a set of training data in which each training instance has a label.

In text generation, the training instance is **a sequence of seq_length characters** and the label is **the (seq_length + 1)th character**, that is, the character that immediately follows the sequence. This way, the network picks up the way the text is "steered" when an author writes.

In [6]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
	seq_in = raw_text[i:i + seq_length]
	seq_out = raw_text[i + seq_length]
	dataX.append([char_to_int[char] for char in seq_in])
	dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)

Total Patterns:  163817
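
The same sliding window is easier to follow on a toy example. This sketch uses a short made-up string and a window of 4 characters (both chosen only for readability) and prints every input/label pair.

```python
# Toy text and a short window so every pair can be read by eye.
text = "hello world"
seq_length = 4

pairs = []
for i in range(len(text) - seq_length):
    seq_in = text[i:i + seq_length]   # training instance
    seq_out = text[i + seq_length]    # label: the character that follows
    pairs.append((seq_in, seq_out))

for seq_in, seq_out in pairs:
    print(repr(seq_in), "->", repr(seq_out))

# The number of patterns is always len(text) - seq_length
print("Total Patterns:", len(pairs))
```

The same rule accounts for the figure reported above: 163917 characters minus a window of 100 gives 163817 patterns.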


Now that the training data are in place, you must do the following:
* Transform the list of input sequences into the form *[samples, time steps, features]* expected by an LSTM network.
* Rescale the integers to the range 0 to 1, which makes the patterns easier to learn for an LSTM network, whose gates use the sigmoid function by default.
* Convert the output pattern (single character "classifications" as integers) into a one-hot encoding. The network then outputs one probability per character in the vocabulary, so that you can determine which character is being indicated and get back from the encoding to text.

In [7]:
# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, seq_length, 1))
# normalize
X = X / float(n_vocab)
# one hot encode the output variable
y = to_categorical(dataY)
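
To check the shapes involved, here is a small sketch with made-up data: three patterns of length 4 over a 5-character vocabulary. It uses `np.eye` to build the one-hot rows as a NumPy-only stand-in for `to_categorical`; that substitution is an assumption made to keep the example self-contained.

```python
import numpy as np

# Hypothetical toy values: 3 patterns of length 4 over a 5-character vocabulary.
n_vocab = 5
dataX = [[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 0]]
dataY = [4, 0, 1]

# [samples, time steps, features], then rescale integers into the range 0 to 1
X = np.reshape(dataX, (len(dataX), 4, 1)) / float(n_vocab)

# Row i of the identity matrix is the one-hot vector for class i.
y = np.eye(n_vocab)[dataY]

print(X.shape)  # (samples, time steps, features)
print(y.shape)  # (samples, vocabulary size)
print(y)
```

One thing to watch: `to_categorical` infers the number of classes from the largest label unless you pass `num_classes`, so if the highest-numbered character never occurs as a label, `y` will be narrower than `n_vocab`.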

Now, define the LSTM model

In [8]:
# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')



Note that there is **no test set**: the whole corpus is used for training. (The model as defined above does include a single `Dropout(0.2)` layer after the LSTM.) We are trying to maximize the expression of the language structure for judgement by a human being, and the outputs will rarely reproduce the training data exactly, so the usual metrics of accuracy will not be helpful.

Also, as training is slow, use model checkpointing to save the network weights each time the loss improves, and afterwards select the checkpoint with the **lowest loss**. The loss value is built into the checkpoint filename, which makes this easy.

In [10]:
# define the checkpoint
filepath="weights-improvement-{epoch:02d}-{loss:.4f}.keras"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

Now fit the model

In [11]:
model.fit(X, y, epochs=20, batch_size=128, callbacks=callbacks_list)

Epoch 1/20
Epoch 1: loss improved from inf to 3.03075, saving model to weights-improvement-01-3.0307.keras
1280/1280 ━━━━━━━━━━━━━━━━━━━━ 21s 14ms/step - loss: 3.1154
Epoch 2/20
Epoch 2: loss improved from 3.03075 to 2.85104, saving model to weights-improvement-02-2.8510.keras
1280/1280 ━━━━━━━━━━━━━━━━━━━━ 17s 14ms/step - loss: 2.8736
Epoch 3/20
Epoch 3: loss improved from 2.85104 to 2.76491, saving model to weights-improvement-03-2.7649.keras
1280/1280 ━━━━━━━━━━━━━━━━━━━━ 21s 14ms/step - loss: 2.7826
Epoch 4/20
1278/1280 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - loss: 2.7183
Epoch 

<keras.src.callbacks.history.History at 0x7ad69385a650>
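
A raw loss value is hard to interpret on its own. Keras computes categorical cross-entropy with natural logarithms, so one way to read it is to convert it to a per-character perplexity with `exp(loss)` and compare it with a model that guesses uniformly over the vocabulary. The sketch below does this for loss values that appear in this notebook; the `perplexity` helper is a name introduced here for illustration.

```python
import math

n_vocab = 64                       # "Total Vocab" reported earlier
uniform_loss = math.log(n_vocab)   # loss of a model that guesses uniformly

def perplexity(loss):
    """Convert a cross-entropy loss in nats into per-character perplexity."""
    return math.exp(loss)

for label, loss in [("uniform guess", uniform_loss),
                    ("epoch 1", 3.03075),
                    ("epoch 3", 2.76491),
                    ("epoch 20", 2.0905)]:
    print(f"{label:>13}: loss={loss:.4f}  perplexity={perplexity(loss):.1f}")
```

Perplexity can be read loosely as the number of characters the model is still choosing between at each step, so lower is better, but a low training loss does not guarantee readable generated text.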

## Generating Text with an LSTM network
First, load the trained weights into the network from the checkpoint file.

In [13]:
# load the network weights
filename = "weights-improvement-20-2.0905.keras" # Use the checkpoint file with the lowest loss here
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')
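
Rather than copying the filename by hand, you could pick the lowest-loss checkpoint programmatically. This is a minimal sketch that assumes the files follow the `weights-improvement-{epoch:02d}-{loss:.4f}.keras` template used above; `best_checkpoint` is a hypothetical helper, not part of Keras.

```python
import re

# Matches names produced by the filepath template:
# weights-improvement-{epoch:02d}-{loss:.4f}.keras
PATTERN = re.compile(r"weights-improvement-(\d+)-(\d+\.\d+)\.keras$")

def best_checkpoint(filenames):
    """Return the filename with the lowest loss, or None if none match."""
    scored = []
    for name in filenames:
        match = PATTERN.search(name)
        if match:
            scored.append((float(match.group(2)), name))
    return min(scored)[1] if scored else None

# Names taken from the training log; in Colab you might instead pass
# glob.glob("weights-improvement-*.keras").
names = [
    "weights-improvement-01-3.0307.keras",
    "weights-improvement-02-2.8510.keras",
    "weights-improvement-03-2.7649.keras",
]
print(best_checkpoint(names))
```

Because the checkpoint callback uses `save_best_only=True`, the most recently written file should also be the best one, but parsing the loss avoids relying on file timestamps.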

You need to make a reverse mapping that converts integers back to characters so that you can understand the predictions.

Recall -
* **Encoders** turn language tokens into numerical representations whereas
* **Decoders** turn numerical representations back into language tokens
* **Discriminative models** use operations on encoded data to perform classification tasks (e.g., sentiment analysis, named entities, etc.,) but
* **Generative models** are used to create outputs that have the same form as the inputs (e.g., word tokens as input to create word tokens as output)

In [14]:
int_to_char = dict((i, c) for i, c in enumerate(chars))
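
As a quick sanity check, the two mappings should be exact inverses of each other. The sketch below builds both from a toy string (an assumption for illustration; the notebook builds them from `raw_text`) and confirms that encoding followed by decoding returns the original text.

```python
# Toy text; the notebook builds the same two dictionaries from raw_text.
text = "the cat sat"
chars = sorted(set(text))
char_to_int = {c: i for i, c in enumerate(chars)}
int_to_char = {i: c for i, c in enumerate(chars)}

encoded = [char_to_int[c] for c in text]
decoded = "".join(int_to_char[i] for i in encoded)

print(chars)
print(encoded)
print(decoded)
```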

Now, make predictions. Start with a seed sequence as input and generate the next character. Then update the seed sequence by adding the generated character to the end and trimming off the first character, so that the window "shuffles" along the generated sequence one character at a time. Repeat this for as many characters as you want to generate (1000 in the example below).

In [15]:
# pick a random seed
start = np.random.randint(0, len(dataX)-1)
pattern = list(dataX[start])  # copy, so appending below does not alter dataX
print("Seed:")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
# generate characters
for i in range(1000):
	# shape and scale the current window exactly as the training data was
	x = np.reshape(pattern, (1, len(pattern), 1))
	x = x / float(n_vocab)
	prediction = model.predict(x, verbose=0)
	# greedy decoding: always take the most probable character
	index = int(np.argmax(prediction))
	result = int_to_char[index]
	sys.stdout.write(result)
	# slide the window: append the new character and drop the oldest
	pattern.append(index)
	pattern = pattern[1:]
print("\nDone.")

Seed:
" ant state of change. if you are outside the united states,
check the laws of your country in additio "
n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
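
The seed shown above comes from the Project Gutenberg licence text rather than from the story, which means the model is also being trained on that boilerplate. A possible fix is to keep only the text between the start and end marker lines before building the dataset. The sketch below assumes the file contains lines beginning `*** START OF` and `*** END OF`; check your own download, as the exact wording can vary. `strip_gutenberg_boilerplate` is a hypothetical helper.

```python
import re

def strip_gutenberg_boilerplate(text):
    """Keep only the text between the Gutenberg START and END marker lines.

    Assumption: the file contains lines beginning '*** START OF' and
    '*** END OF'. If either marker is missing, the text is returned unchanged.
    """
    flags = re.MULTILINE | re.IGNORECASE
    start = re.search(r"^\*\*\* ?START OF.*$", text, flags=flags)
    end = re.search(r"^\*\*\* ?END OF.*$", text, flags=flags)
    if not start or not end or end.start() < start.end():
        return text
    return text[start.end():end.start()].strip()

# Hypothetical miniature file illustrating the layout.
sample = (
    "The Project Gutenberg eBook of Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Chapter 1\nThe story itself.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "If you are outside the United States, check the laws of your country.\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

In the notebook this would be applied to `raw_text` straight after reading the file; the case-insensitive match means it works either before or after lowercasing.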

# Main exercise
Consider the output that you have generated, and then apply those considerations to a corpus that you pick yourself from [Project Gutenberg's texts](https://www.gutenberg.org/ebooks), downloaded in the **Plain Text UTF-8** format. When you are considering an author, make sure the corpus is large enough, so choose an author with a lot of text.

* How sensible does the output look?
* If it doesn't look sensible, what is messing it up?
* Have you enough input data, i.e., is your input text file too small?
* Are there characters in the sequence that should be systematically removed?
* Is your **case folding strategy** appropriate?
* Could you boost the information in your corpus by adding documents from the same author or from a similar author?
* Predict fewer than 1,000 characters as output for a given seed
* Remove some/all punctuation from the source text and, therefore, from the model's vocabulary
* Train the model on padded sentences rather than random sequences of characters
* Increase the number of training epochs to 100 or many hundreds
* Add dropout to the visible input layer and consider tuning the dropout percentage
* Tune the batch size; try a batch size of 1 as a (very slow) baseline and larger sizes from there
* Add more memory units to the layers and/or more layers
* Experiment with scale factors (**temperature**) when interpreting the prediction probabilities
* Change the LSTM layers to be “**stateful**” to maintain state across batches
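
Of the suggestions above, temperature is the quickest to try because it needs no retraining. Instead of always taking the most probable character with `np.argmax`, you sample from the predicted distribution after rescaling it. The following is a minimal sketch of one common formulation; `sample_with_temperature` is a hypothetical helper and the probabilities are made up.

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Sample an index from `probs` after rescaling by `temperature`.

    temperature < 1 sharpens the distribution (closer to argmax);
    temperature > 1 flattens it (more random).
    """
    if rng is None:
        rng = np.random.default_rng()
    probs = np.asarray(probs, dtype=np.float64)
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()          # subtract the max for numerical stability
    scaled = np.exp(logits)
    scaled /= scaled.sum()          # renormalise to a probability distribution
    return int(rng.choice(len(scaled), p=scaled))

# Hypothetical softmax output over a 4-character vocabulary.
probs = [0.70, 0.15, 0.10, 0.05]
rng = np.random.default_rng(0)
for t in (0.2, 1.0, 2.0):
    draws = [sample_with_temperature(probs, t, rng) for _ in range(1000)]
    print(t, np.bincount(draws, minlength=4) / 1000)
```

In the generation loop this would replace the `np.argmax(prediction)` line with something like `sample_with_temperature(prediction[0], temperature=0.5)`. Sampling is one way to break out of the repeated-character loops that greedy decoding tends to fall into.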

There are [several pieces of advice and experience from others in the Responses section](https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/) for the original code; make use of them. The same page also provides a further example using a larger network with additional Dropout. I do not personally think that is a good strategy in this case, but again, make use of what is available.