<center><h1><b>Practical #3</b></h1>
<h2><b>RNNs: Recurrent Neural Networks</b><br/>
<b>for Text Generation</b></h2></center>

RNNs are neural networks that use the **output** they produce as **input** for the next element generated. This structure is very useful for generating sequential data (such as text!). Check week 3's lectures for more details on how they work.
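To make the feedback idea concrete, here is a minimal sketch of a single-number recurrent step. The weights `w_x`, `w_h`, `b` and the input sequence are made-up values for illustration, not anything used later in the practical.

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One step of a scalar RNN: the new state depends on the input and the previous state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

inputs = [1.0, 0.0, 0.0, 0.0]  # made-up input sequence
h = 0.0
states = []
for x in inputs:
    h = rnn_step(x, h)  # the state produced at one step is fed into the next step
    states.append(h)

print([round(s, 3) for s in states])
```

Even though only the first input is non-zero, the later states stay above zero: the network "remembers" earlier elements through the value it feeds back to itself.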

<center><img width=600 src="https://drive.google.com/uc?id=1wotuCYP_0wRtOfGXc6xFhiEsPLXXqKpH"></center>

<br/>

During today's practical you will:
1. Use the given code to build a character-based RNN, load a dataset, and train the network to observe results.
2. Scrape the web for more data to use in training the text generation model.
3. Follow up with exercises exploring ways to customise the RNN, and try ready-made interactive applications that use RNNs.


# Getting started

Create your own Jupyter notebook in Google Colab (or download a copy of this one so that you can edit it). Make sure that you enable the GPU for the session (`Edit -> Notebook Settings -> Hardware accelerator -> GPU`). If you created your own notebook, copy the code structure given here. Otherwise, just fill in the missing code.

Next, import `tensorflow` into your project and check that everything is set up appropriately. The following block of code should print the name of the GPU. If nothing is printed after "Found GPU at:", you don't have GPU access; please notify a demonstrator.

We'll also perform the other required imports in this step.

In [None]:
#@title Imports

import tensorflow as tf
import numpy as np
import random
import os
import pandas as pd

device_name = tf.test.gpu_device_name()
print('Found GPU at: {}'.format(device_name))

print(tf.__version__)

# Data

The first thing we'll do in today's session is to retrieve a data set of text. The RNN that we will train next on this data set will output similar text. You can experiment with different options (and even upload your own text file!), but for now we'll go with a book from [Project Gutenberg](https://www.gutenberg.org/ebooks/).

Some other options:
*   **NLTK Text Corpora**: several datasets of various texts, which can be accessed through the `nltk` library. ([More info](https://www.nltk.org/book/ch02.html))
*   **Wikipedia**: a collection of cleaned Wikipedia articles in all languages. ([More info](https://www.tensorflow.org/datasets/catalog/wikipedia))
*   **Shakespeare**: a collection of William Shakespeare's writing ([link](https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt) - not always accessible)

---

<br/>

## <h1><img width=30 src="https://upload.wikimedia.org/wikipedia/commons/thumb/6/6d/Bright_green_checkbox-checked.svg/1024px-Bright_green_checkbox-checked.svg.png"> <b>TODO-1:</b> Load a dataset</h1> 

<b>Choose</b> a book from Project Gutenberg to load into a `text` variable. Each book's page on the website links to a plain-text version; copy that link and use it as the input for our dataset.

<i>Optional</i>: check the code to understand how the dataset is loaded.
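The Project Gutenberg loading code in this notebook decodes the downloaded bytes with the `utf_8_sig` codec. A minimal sketch of what that codec does, assuming the file starts with a UTF-8 byte order mark (BOM); the byte string below is made up for illustration:

```python
# Hypothetical file contents: a UTF-8 byte order mark followed by some text
raw_bytes = b"\xef\xbb\xbfIt is a truth universally acknowledged"

plain = raw_bytes.decode("utf-8")        # keeps the BOM as the character '\ufeff'
cleaned = raw_bytes.decode("utf_8_sig")  # strips the BOM if one is present

print(repr(plain[:5]))
print(repr(cleaned[:5]))
print(len(plain) - len(cleaned))
```

Without this step the BOM would become an extra "character" in the vocabulary built in TODO-2.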

In [None]:
# Option A (as described above): download a plain-text book from Project Gutenberg
# book_choice = "https://www.gutenberg.org/files/1342/1342-0.txt"  # @param {type: "string"}
# path_to_file = tf.keras.utils.get_file("Book", book_choice)
# text = open(path_to_file, 'rb').read().decode(encoding='utf_8_sig')

# Option B (used here): load song lyrics from CSV files stored in Google Drive.
# Each CSV file is expected to contain a 'lyrics' column.
from google.colab import drive
drive.mount('/content/gdrive')
artists = ['Led Zeppelin', 'The Doors', 'Chuck Berry']
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported replacement
df = pd.concat(
    [pd.read_csv(f'/content/gdrive/My Drive/Colab Notebooks/{artist}.csv') for artist in artists]
)
text = '\n'.join(df['lyrics'])

# The length of text is the number of characters in it
print(len(text))

## <h1><img width=30 src="https://upload.wikimedia.org/wikipedia/commons/thumb/6/6d/Bright_green_checkbox-checked.svg/1024px-Bright_green_checkbox-checked.svg.png"> <b>TODO-2:</b> Vectorize the text</h1> 

We need to vectorize the text (mapping all the characters to numbers), so that the neural network can process it (it works with numbers, not text!). In this example, we're going to be working at character level (i.e. generate sequences of characters).

<b>Implement</b> the code to find the list of all unique characters in the text (*hint*: a sorted set is useful). Store this in a variable called `vocabulary` and create two more data structures:
  * `char2idx`: a dictionary mapping from each unique character to its index in the list
  * `idx2char`: a numpy array of all unique characters in the text
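As a quick illustration of the round trip between characters and indices, here is a small sketch on a toy string. The names prefixed with `sample_` are hypothetical, and a plain list stands in for the numpy array so that the snippet needs no imports.

```python
sample_text = "hello world"  # hypothetical toy text, not the practical's dataset

# Sorted list of the unique characters
sample_vocab = sorted(set(sample_text))

# Character -> index, and index -> character
sample_char2idx = {char: idx for idx, char in enumerate(sample_vocab)}
sample_idx2char = list(sample_vocab)

# Encode the text as numbers, then decode it back to check nothing is lost
encoded = [sample_char2idx[char] for char in sample_text]
decoded = "".join(sample_idx2char[idx] for idx in encoded)

print(sample_vocab)
print(encoded)
print(decoded)
```

The decoded string should match the original text exactly, which is a useful check to run on your own `char2idx` and `idx2char`.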


In [None]:
# Compute the sorted list of all unique characters in the text
vocabulary = sorted(set(text))
# char2idx maps each character to its index; idx2char maps an index back to its character
char2idx = {char: idx for idx, char in enumerate(vocabulary)}
idx2char = np.array(vocabulary)

## Data pre-processing

Let's keep track of some parameters related to our dataset. We recommend keeping `BATCH_SIZE` at 64, but you can experiment with other values.

In [None]:
#@title Dataset parameters

# batch size, default: 64
BATCH_SIZE = 64  # @param {type: "integer"}
# buffer size to shuffle our dataset, default 10000
BUFFER_SIZE = 10000  # @param {type: "integer"}
# number of RNN units, default 1024
N_RNN_UNITS = 1024  # @param {type: "integer"}
# length of text chunks for training, default 100 (set to 40 here)
MAX_LENGTH = 40  # @param {type: "integer"}
# size of the embedding layer, default 256
EMBEDDING_DIM = 256    # @param {type: "integer"}

VOCAB_SIZE = len(vocabulary)  # length of the vocabulary in chars
print("Batch size: {} \nBuffer size: {} \n# RNN Units: {}\
       \nMax input length: {} \nVocabulary size: {} \nEmbedding dimension: {}".format(
            BATCH_SIZE, BUFFER_SIZE, N_RNN_UNITS, MAX_LENGTH, VOCAB_SIZE, EMBEDDING_DIM
        )
)

Next we need to create the training data for the network. We want to be able to predict the next character in a sequence, and we will set this up as follows:

1. Split the dataset text into chunks of size `MAX_LENGTH` set earlier, starting from the first character. This will be the input data.
1. Split the dataset text into chunks of the same size `MAX_LENGTH`, starting from the second character (so each target chunk is the matching input chunk shifted by one character). This will be the target data.
1. Transform the chunks of text into number vectors, using the `char2idx` dictionary defined before.

For example, take the dataset "Tensorflow is great". If `MAX_LENGTH` is set to 9, then the input string "Tensorflo" maps to the target "ensorflow", and the input "w is grea" maps to the target " is great". This teaches the network how sequences of characters work: at every position it learns the probability of the next character given the characters seen so far (up to `MAX_LENGTH` of them).
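The "Tensorflow is great" example can be reproduced with a few lines of plain Python. This is only a sketch of the chunking idea on raw characters; `example_text` and `example_max_length` are illustrative names, and the conversion to indices is left out.

```python
example_text = "Tensorflow is great"
example_max_length = 9  # stands in for MAX_LENGTH

pairs = []
for start in range(0, len(example_text) - example_max_length, example_max_length):
    # Input chunk starts at `start`; target chunk is the same window shifted by one
    input_chunk = example_text[start : start + example_max_length]
    target_chunk = example_text[start + 1 : start + 1 + example_max_length]
    pairs.append((input_chunk, target_chunk))

for input_chunk, target_chunk in pairs:
    print(repr(input_chunk), "->", repr(target_chunk))
```

Each target chunk has the same length as its input chunk; character `i` of the target is the character the network should predict after seeing characters `0..i` of the input.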


In [None]:
#@title Obtaining input and target data

input_text = []
target_text = []

for c in range(0, len(text)-MAX_LENGTH, MAX_LENGTH):
    inps = text[c : c + MAX_LENGTH]
    tars = text[c + 1 : c + 1 + MAX_LENGTH]

    input_text.append([char2idx[i] for i in inps])
    target_text.append([char2idx[t] for t in tars])
    
print(np.array(input_text).shape)
print(np.array(target_text).shape)
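The shapes printed by this cell can be predicted from the text length alone. A small sketch, assuming the same loop bounds as the chunking code; `expected_shape` is a hypothetical helper and the text lengths are made-up values:

```python
def expected_shape(text_length, max_length):
    """Shape of the input (and target) array produced by the chunking loop."""
    n_sequences = len(range(0, text_length - max_length, max_length))
    return (n_sequences, max_length)

print(expected_shape(text_length=100_000, max_length=40))
print(expected_shape(text_length=19, max_length=9))
```

Any characters at the end of the text that do not fill a complete chunk are simply left out of the training data.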

Next, we create batches from the data.

In [None]:
#@title Batch datasets
dataset = tf.data.Dataset.from_tensor_slices((input_text, target_text)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

dataset


And that's it! Our data is all set up and ready to be fed into the RNN.
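For intuition about what `shuffle` followed by `batch(BATCH_SIZE, drop_remainder=True)` does, here is a simplified pure-Python sketch. It is not how `tf.data` is implemented: it shuffles everything at once rather than through a buffer, and `make_batches` is a hypothetical helper.

```python
import random

def make_batches(sequences, batch_size, seed=0):
    """Shuffle the sequences, then group them into full batches only."""
    shuffled = list(sequences)
    random.Random(seed).shuffle(shuffled)
    n_full = len(shuffled) // batch_size  # an incomplete final batch is dropped
    return [shuffled[i * batch_size : (i + 1) * batch_size] for i in range(n_full)]

toy_sequences = list(range(10))  # 10 hypothetical training sequences
toy_batches = make_batches(toy_sequences, batch_size=4)

print(len(toy_batches))
print([len(batch) for batch in toy_batches])
```

Dropping the remainder keeps every batch the same size, which matters here because the model is built with a fixed batch size.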

# Build the network structure

Let's start setting up the network. The input for the generator is a vector of size `MAX_LENGTH`, with contents in the range [0, VOCAB_SIZE). For each position, the output is a vector of `VOCAB_SIZE` scores (logits) that defines a probability distribution over the vocabulary, indicating how likely each character is to appear next in the sequence.
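The final `Dense` layer in the model below has no activation, so strictly speaking it outputs unnormalised scores (logits); a softmax turns these into probabilities. A minimal sketch of that conversion, with made-up scores for a three-character vocabulary:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

toy_logits = [2.0, 1.0, 0.1]  # hypothetical scores for a 3-character vocabulary
toy_probs = softmax(toy_logits)
print([round(p, 3) for p in toy_probs])
```

The loss function and the sampling code in this notebook both work on the logits directly, so the model never needs an explicit softmax layer.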

---
We'll need the following layers for the network structure, and we use an [Adam optimizer](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam):
* **[Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding)**: Turns positive integers into dense vectors of a fixed size.
* **[GRU](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GRU)**: Gated Recurrent Unit ([Cho et al. 2014](https://arxiv.org/pdf/1406.1078.pdf)), the recurrent layer of the network. The GRU architecture is very similar to LSTM (see lecture notes).
* **[Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense)**: A densely connected layer.

When you run the following code, it will output a summary of the network structure. We can also test this configuration with an example, and observe initial random output.

In [None]:
#@title Set up generator network structure

# Define the loss function
def loss_function(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

# Define input and output around the RNN (GRU)
def build_model(vocab_size=VOCAB_SIZE, embedding_dim=EMBEDDING_DIM, n_rnn_units=N_RNN_UNITS, batch_size=BATCH_SIZE):
    model = tf.keras.Sequential([
            tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                      batch_input_shape=[batch_size, None]),
            tf.keras.layers.GRU(n_rnn_units,
                                return_sequences=True,
                                stateful=True,
                                recurrent_activation='sigmoid',
                                recurrent_initializer='glorot_uniform'),
            tf.keras.layers.Dense(vocab_size)
        ])
    model.summary()
    return model

model = build_model()

# Define the optimiser
# default: 0.001
opt_learning_rate = 0.001  #@param{type:"raw"}
# default: 0.5
opt_beta = 0.5 #@param{type:"raw"}
optimizer = tf.keras.optimizers.Adam(opt_learning_rate, beta_1=opt_beta)

# Compile the model
model.compile(optimizer, loss_function)
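The parameter counts reported by `model.summary()` can be reproduced by hand, which is a good way to check your understanding of each layer. The sketch below assumes the GRU layer uses `reset_after=True` (the default in TensorFlow 2, as far as we are aware), which gives each weight set two bias vectors; the vocabulary size of 65 is a made-up example, and `count_parameters` is a hypothetical helper.

```python
def count_parameters(vocab_size, embedding_dim, n_rnn_units):
    """Parameter counts per layer, assuming a Keras GRU with reset_after=True."""
    embedding = vocab_size * embedding_dim
    # 3 weight sets (update gate, reset gate, candidate state), each with
    # input weights, recurrent weights and two bias vectors
    gru = 3 * (embedding_dim * n_rnn_units + n_rnn_units * n_rnn_units + 2 * n_rnn_units)
    dense = n_rnn_units * vocab_size + vocab_size
    return {"embedding": embedding, "gru": gru, "dense": dense}

# Hypothetical vocabulary of 65 characters with the notebook's default layer sizes
counts = count_parameters(vocab_size=65, embedding_dim=256, n_rnn_units=1024)
print(counts, sum(counts.values()))
```

Almost all of the parameters sit in the GRU layer, so `N_RNN_UNITS` has by far the largest effect on model size and training time.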

In [None]:
#@title Test configuration with one example

for input_example_batch, target_example_batch in dataset.take(1):
    # Run the batch through the model
    example_batch_predictions = model(input_example_batch)

    # Print output shape
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

    # To get the predictions, sample over the output distribution
    sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
    sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy() 
    
    # Decode the indices to see the text predicted by the (untrained) model
    print("Input: \n", repr("".join(idx2char[input_example_batch[0]])), "\n")
    print("Next Char Predictions: \n", repr("".join(idx2char[sampled_indices])))
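A rough sanity check for the untrained model: if the network has no preference between characters, the cross-entropy computed by `loss_function` should be close to the natural logarithm of the vocabulary size. The sketch below reimplements the per-position loss in plain Python for illustration; the vocabulary size of 65 and the helper name are assumptions.

```python
import math

def sparse_cross_entropy_from_logits(label, logits):
    """Cross-entropy for one position: -log(softmax(logits)[label])."""
    highest = max(logits)
    log_sum_exp = highest + math.log(sum(math.exp(v - highest) for v in logits))
    return log_sum_exp - logits[label]

toy_vocab_size = 65  # hypothetical vocabulary size
uniform_logits = [0.0] * toy_vocab_size  # a model with no preference between characters
loss = sparse_cross_entropy_from_logits(label=3, logits=uniform_logits)

print(round(loss, 4), round(math.log(toy_vocab_size), 4))
```

If the loss at the start of training is much higher than this value, the initialisation or the data pipeline is worth a second look.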

# Train and run the network

Once the model is trained (or even before!), we can run it! In this case, we want the model to generate some text (so, several characters, instead of just one), given some input by the user. Let's set up a function that does just that.

In [None]:
#@title Set up text generation function

def generate_text(model, input_text, n_characters_output=1000):
    # First, vectorize the input text as before
    input_eval = [char2idx[s] for s in input_text]
    input_eval = tf.expand_dims(input_eval, 0)

    # We'll store results in this variable
    text_generated = []

    # Generate the number of characters desired
    model.reset_states()
    for i in range(n_characters_output):
        # Run input through model
        predictions = model(input_eval)

        # Remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # Using a categorical distribution to predict the character returned by the model
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

        # Pass the predicted character as the next input to the model
        input_eval = tf.expand_dims([predicted_id], 0)

        # Add the predicted character to the output
        text_generated.append(idx2char[predicted_id])

    # Return output
    return (input_text + ''.join(text_generated))
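The core of `generate_text` is a loop that samples a character and feeds it back in as the next input. The sketch below shows the same loop with a stand-in for the model: `toy_next_char_weights` is a hypothetical hand-written table over a three-character vocabulary, and unlike the GRU it only looks at the last character rather than carrying a hidden state.

```python
import random

def toy_next_char_weights(previous_char):
    """Stand-in for the trained model: sampling weights for the next character."""
    if previous_char == "a":
        return {"a": 1, "b": 8, "c": 1}
    if previous_char == "b":
        return {"a": 1, "b": 1, "c": 8}
    return {"a": 8, "b": 1, "c": 1}

def toy_generate(seed_text, n_characters, rng):
    generated = seed_text
    for _ in range(n_characters):
        weights = toy_next_char_weights(generated[-1])
        chars = list(weights)
        # Sample from the distribution rather than always taking the most likely character
        next_char = rng.choices(chars, weights=[weights[c] for c in chars], k=1)[0]
        generated += next_char  # the output becomes the next input
    return generated

print(toy_generate("a", 20, random.Random(0)))
```

Sampling, as opposed to always picking the highest-scoring character, is what stops the generated text from getting stuck in a short repeating loop.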

In this practical, we'll make use of the built-in functions for training the model, as this network is much simpler than the GAN we explored before. First, we'll set up checkpoints at which the current state of the model is saved, pointing to a directory in Google Drive. This allows you to restore the model at any point during training and use it to generate text or to fine-tune from there.

In [None]:
#@title Save checkpoints during training

from google.colab import drive
drive.mount('/content/gdrive')

# Directory where the checkpoints will be saved
path = 'My Drive/Work/Colab/TextGen/' #@param{type: 'string'}
full_path = "/content/gdrive/" + path + "ckpt_{epoch}" 

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
                      filepath=full_path,
                      save_weights_only=True)
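The `{epoch}` part of the checkpoint path is a format placeholder that is filled in each time a checkpoint is written. A small sketch of the resulting names, assuming epochs are numbered from 1 and using the default directory from the cell above:

```python
checkpoint_template = "/content/gdrive/My Drive/Work/Colab/TextGen/ckpt_{epoch}"

# The loop only mimics the naming scheme; it does not write any files
for epoch in range(1, 4):
    print(checkpoint_template.format(epoch=epoch))
```

Because every epoch gets its own name, earlier checkpoints are kept rather than overwritten, so keep an eye on your Drive storage during long training runs.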

In [None]:
#@title Train the model

# default: 50 (set to 100 here)
n_epochs = 100  # @param {type: "integer"}
history = model.fit(dataset, epochs=n_epochs, callbacks=[checkpoint_callback])

In [None]:
#@title Restore latest checkpoint and build model

batch_size = 1

model = build_model(batch_size=batch_size)
model.load_weights(tf.train.latest_checkpoint("/content/gdrive/" + path))
model.build(tf.TensorShape([batch_size, None]))

In [None]:
#@title Generate text!

#input_text = "In the morning, "  # @param {type: "string"}

input_text = "I love you"  # @param {type: "string"}
n_characters_output = 1000 #@param 
print(generate_text(model, input_text=input_text, n_characters_output=n_characters_output))

# Exercises

## <h1><img width=30 src="https://upload.wikimedia.org/wikipedia/commons/thumb/6/6d/Bright_green_checkbox-checked.svg/1024px-Bright_green_checkbox-checked.svg.png"> <b>TODO-3:</b> Acquiring more data (web scraping)</h1> 

When working with text generation models like this one, you might find you need more data, either because data specific to your application is not readily available, or because you simply need *more*. Let's look at how to gather data that can then be used to train the same network we've set up here.

The code in the next cell defines a way of scraping a [lyrics website](https://www.lyrics.com/) to find song lyrics from a particular artist. It takes the link to the artist's page as input, and then uses the HTML of that page to find links to the lyrics of that artist's songs (with a maximum set so that the program doesn't take too long to run).

The data obtained is then put into a `text` variable, from which point the usual code in the notebook for vectorizing the text, data-preprocessing and training the network can be run. Try it out! Is this network good at generating new songs for the chosen artist? 

In [None]:
import requests
import lxml
from lxml import etree
from IPython import display
import time

# Obtain the desired artist's page HTML
artist_url = 'https://www.lyrics.com/artist/Michael-Bubl%C3%A9/554516' #@param {type: "string"}
site_html = requests.get(artist_url)

# Process this into a tree using lxml
html_tree = etree.HTML(site_html.content)
# html_text = str(etree.tostring(html_tree, pretty_print=True), "utf-8")
# print(html_text)

# Extract data
# Albums and songs in "<div class='tdata-ext'>"
# Links are https://www.lyrics.com + x where "<a href='x'>"
# /album, /artist, or /lyric

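content_by_artist = html_tree.xpath('//div[contains(@class, "tdata-ext")]')[0]
# print(str(etree.tostring(content_by_artist, pretty_print=True), "utf-8"))
# Note: '//a' is an absolute XPath, so this collects the links from the whole page,
# not only from the div selected above (use './/a/@href' to restrict it to that div)
content_list = content_by_artist.xpath('//a/@href')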
filter_list = [x for x in content_list if x.startswith('/lyric')]
# print(filter_list)
print(len(filter_list))

max_songs = 20 #@param{type: 'number'}
data = []
s = 0
for lyric_url in filter_list:
    url = 'https://www.lyrics.com' + lyric_url
    html = requests.get(url)
    tree = etree.HTML(html.content)
    lyric_html = tree.xpath('//pre[contains(@class, "lyric-body")]')
    if len(lyric_html) == 1:
        lyric_html = lyric_html[0]
    else:
        continue
    # print(str(etree.tostring(lyric_html, pretty_print=True), "utf-8"))

    lyrics = ''.join(lyric_html.xpath('.//text()'))
    data.append(lyrics)

    s += 1
    display.clear_output()
    print("Processed:",s)
    if s >= max_songs:
        break
    if s % 10 == 0:
        time.sleep(0.5)

# Save all lyrics for the artist in a single string called 'text'
text = '\n'.join(data)

# Print out some of the text obtained
display.clear_output()
print(text[:500])
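The link-filtering step can be tried offline, without sending any requests to the website. The sketch below uses the standard library's `html.parser` instead of `lxml`; the HTML fragment and the `LinkCollector` class are made up for illustration and only imitate the structure described in the comments of the scraping code.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag whose link starts with a given prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith(self.prefix):
            self.links.append(href)

# Made-up HTML fragment imitating an artist page
sample_html = """
<div class="tdata-ext">
  <a href="/album/1/example-album">Example Album</a>
  <a href="/lyric/101/example-song">Example Song</a>
  <a href="/lyric/102/another-song">Another Song</a>
</div>
<a href="/artist/2/someone-else">Someone Else</a>
"""

collector = LinkCollector(prefix="/lyric")
collector.feed(sample_html)
print(collector.links)
```

Testing the parsing logic on a saved or hand-written fragment like this is a gentle way to adapt the scraper to a different page layout before running it against the live site.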


1. <b>Repeat</b> the whole process for a different artist of your choice (*note*: you may need to adapt the html processing code, depending on the web page you're using!).

## <img width=30 src="https://upload.wikimedia.org/wikipedia/commons/thumb/6/6d/Bright_green_checkbox-checked.svg/1024px-Bright_green_checkbox-checked.svg.png"> <b>TODO-4:</b> Customize the network

1. You will find several **parameters** in this notebook, highlighted on the right side of the code blocks (e.g. learning rate for the optimiser). If you modify some of these values, how do you think it will impact the quality of the text generated? 
  <ol type="a">
  <li>Is it better, or worse? </li>
  <li>Does it need more or less training time to start generating good text?  </li>
  <li>Try some different values and check if your intuition is correct.</li>
  </ol>

2. The **structure** of the network can also be modified.  

  <ol type="a">
  <li>Look into the different activation functions available and check if others work better or worse in this context.</li>
  <li>What other optimisers could be used for better performance?</li>
  <li>What about the layers? What happens if another RNN layer is added to the network?</li>
  </ol>

## <img width=30 src="https://upload.wikimedia.org/wikipedia/commons/thumb/6/6d/Bright_green_checkbox-checked.svg/1024px-Bright_green_checkbox-checked.svg.png"> <b>TODO-5:</b> Test online RNN applications

1. [Oleksii Trekhleb's Machine Learning Experiments](https://trekhleb.github.io/machine-learning-experiments/#/) offers several browser-based demos that let you run RNNs to generate Shakespeare, Wikipedia-like text or recipes, or to sum numbers. Scroll to the bottom of the page to find the RNN applications.
1. [AI Dungeon](https://play.aidungeon.io/) uses GPT-2 to create text-based adventures. Explore this game and see what happens if you fight a dragon with a candlestick.
1. [Write With Transformer](https://transformer.huggingface.co/doc/gpt2-large) is a web application that uses GPT-2 to predict text while you write a document.
1. [Talk to Transformer](https://app.inferkit.com/demo) is a text generation demo that works similarly to today's practical, using GPT-2.


## Related tutorials

Some of the materials in this practical were based on the following tutorials:

* Tensorflow, [Text Generation](https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/text/text_generation.ipynb)
* Oleksii Trekhleb, [Wikipedia Text Generation](https://colab.research.google.com/github/trekhleb/machine-learning-experiments/blob/master/experiments/text_generation_wikipedia_rnn/text_generation_wikipedia_rnn.ipynb)
* Max Woolf, [textgenrnn](https://colab.research.google.com/drive/1mMKGnVxirJnqDViH7BDJxFqWrsXlPSoK)

## GPT-2 fine-tuning tutorial

The powerful GPT-2 model is available for download and public use. The Generative Pre-trained Transformer (GPT) 2 is a large model developed by OpenAI and trained over a large collection of data for a long time (thus using more resources than regularly available). If you'd like to read more about this, please refer to [this paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) and the [OpenAI blog](https://openai.com/blog/better-language-models/).

[This tutorial](https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce) provides a nice introduction to downloading an easy-to-use version of the GPT-2 model (in different sizes as well), although it is only compatible with TensorFlow 1.x at the time of writing (in Google Colab, you can use the command `%tensorflow_version 1.x` **before** you import tensorflow to change versions).