### MLDA@EEE Deep Learning Week Special:
# **Text Generator using RNN**

This notebook is part of MLDA@EEE's series of workshops during the Deep Learning week.

Designed to run in Google Colab.



In this workshop, we assume that you have attended the pre-Deep Learning Week workshops and have basic knowledge of **Python** programming, **deep learning**, and **neural network** basics.
If not, don't worry: you will be guided step by step through this practical session to apply what you learnt during the tutorial session. If you encounter any technical issues or need assistance, ask in the Zoom chat and a helper will come to you as soon as possible.

The structure of this practical session is listed below:
1. Text Processing Basics
2. RNN Building Basics
3. Lyrics Generator
4. Shakespeare Generator


### **Connect to GPU instance (Recommended)**
To connect to a GPU instance on Google Colab, follow the menu path below:

Edit > Notebook settings > Hardware accelerator > GPU

## **1. Text Processing Basics**

Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.
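As a rough illustration of the definition above, here is a minimal sketch of word-level tokenization using only the Python standard library. The helper name `simple_tokenize` and the sample sentence are made up for this example; real tokenizers (including the Keras one used later) apply their own rules.

```python
import string

def simple_tokenize(text):
    """Lowercase the text, replace punctuation with spaces, then split on whitespace."""
    # Translation table mapping every ASCII punctuation character to a space
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return text.lower().translate(table).split()

tokens = simple_tokenize("Hello, world! Tokenization is fun.")
print(tokens)  # ['hello', 'world', 'tokenization', 'is', 'fun']
```

Note that this sketch throws punctuation away entirely, which is one possible choice among several.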

To give you a more intuitive perspective, we will start with a short file, 'eee-overview.txt', and practise some text processing basics first.

In [None]:
# download 'eee-overview.txt' file
!wget https://ycrao573.github.io/rnn-workshop/eee-overview.txt

In [None]:
overview = open('eee-overview.txt', 'r').read()
# length of text is the number of characters in it
print('Length of text: {} characters'.format(len(overview)))
print('First 100 characters: \n', overview[:100])

In [None]:
# The unique characters in the file
vocab = sorted(set(overview))
print ('{} unique characters'.format(len(vocab)))

***TASK 1: Text Pre-Processing - Special Characters Cleaning***

Please only change the code in **\# INSERT YOUR CODE HERE** or **None**

In [None]:
stopChars = None # list of stop characters

# iterate over stopChars and replace them with space
# use string's replace(' ', ' ') method and store in 'corpus'
corpus = None

print(corpus[:100])

In [None]:
corpus_words = [i for i in corpus.split() if i]
corpus_words[:5]

In [None]:
corpus_words = list(map(str.strip, corpus_words))  # map() is lazy, so keep the stripped result
vocab = sorted(set(corpus_words))
print('Corpus length (in words):', len(corpus_words))
print('Unique words in corpus: {}'.format(len(vocab)))

***TASK 2: Use Index Number to Represent the Word***

In this task, you will assign an index to each word in the string `s`, building a word-to-index mapping such as `{'eee': 1}`.

You are encouraged to use dict and list comprehension to perform assignment and substitution.


In [None]:
s = 'the school of electrical and electronic engineering ntu eee began as one of the three'
# using dictionary comprehension to iterate vocab, whose index should start with zero
word2idx = None
print(word2idx)
# using list comprehension to iterate word2idx and replace word with number (index)
# print them in the below
# INSERT YOUR CODE HERE

Now, we will introduce the TensorFlow library so that we can process our text more reliably and efficiently.

Let's start with tokenization:

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

In [None]:
sentences = [
    'I love coffee',
    'I do not like tea.',
    'We all love MLDA!'
]
tokenizer = Tokenizer(num_words = 32)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print('word_index: ', word_index)
test_sen = [
    'I like coffee.',
    'You really love tea?',
    'We love MLDA ah!'
]
test_seq = tokenizer.texts_to_sequences(test_sen)
print(test_seq)

We may also consider adding `oov_token`. Keras lets us define an out-of-vocabulary (OOV) token, which replaces any unknown word with a token of our choosing. This is better than silently dropping unknown words, since it tells our model that there was information at that position.
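To make the idea concrete, the lookup can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: the vocabulary below is hypothetical, and the helper `to_sequence` only lowercases and splits on spaces.

```python
# Hypothetical vocabulary; index 1 is reserved here for unknown words
word_index = {"<OOV>": 1, "i": 2, "love": 3, "coffee": 4}

def to_sequence(sentence, word_index, oov_token="<OOV>"):
    """Map each word to its index, falling back to the OOV index for unseen words."""
    oov_index = word_index[oov_token]
    return [word_index.get(word, oov_index) for word in sentence.lower().split()]

print(to_sequence("I really love tea", word_index))  # [2, 1, 3, 1]
```

The unseen words "really" and "tea" both map to index 1, so the sequence keeps its original length.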

In [None]:
tokenizer = Tokenizer(num_words = 32, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print('word_index: ', word_index)
sequences = tokenizer.texts_to_sequences(sentences)
test_seq = tokenizer.texts_to_sequences(test_sen)
print(test_seq)

Neural networks require inputs of the same shape and size. However, when we pre-process text for a model such as an LSTM, sentences naturally differ in length. Padding fills the shorter sequences (with zeros by default in `pad_sequences`) so that every input has the same size.
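The following pure-Python sketch mimics the default behaviour of `pad_sequences` (pad at the front, truncate from the front) on hypothetical token sequences. The helper `pad_pre` is invented for illustration and assumes `maxlen` is at least 1.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value`; keep only the last `maxlen` items if too long."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]  # truncate from the front when the sequence is too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

print(pad_pre([[2, 3, 4], [2, 5, 6, 7, 8], [9]], maxlen=4))
# [[0, 2, 3, 4], [5, 6, 7, 8], [0, 0, 0, 9]]
```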

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words = 32, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, maxlen=8)
print(word_index)
print(sequences)
print(padded)

## **2. RNN Building Basics**

**Recurrent neural networks (RNNs)** are a class of neural networks that are powerful for modeling sequence data such as time series or natural language. Schematically, an RNN layer uses a for loop to iterate over the timesteps of a sequence, while maintaining an internal state that encodes information about the timesteps it has seen so far.
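The "for loop with an internal state" can be sketched in a few lines of plain Python. This is a toy forward pass only, assuming a scalar input per timestep, a tanh activation, and arbitrary hand-picked weights; it is not how Keras implements its layers.

```python
import math

def simple_rnn(inputs, w_x, w_h, b):
    """Toy RNN: h_t = tanh(w_x * x_t + W_h @ h_{t-1} + b), returning the final state."""
    size = len(b)
    h = [0.0] * size  # the internal state starts at zero
    for x in inputs:  # iterate over the timesteps of the sequence
        h = [
            math.tanh(w_x[i] * x + sum(w_h[i][j] * h[j] for j in range(size)) + b[i])
            for i in range(size)
        ]
    return h

# Arbitrary illustrative weights for a hidden state of size 2
w_x = [0.5, -0.3]
w_h = [[0.1, 0.2], [0.0, 0.4]]
b = [0.0, 0.1]

print(simple_rnn([1.0, 0.5, -1.0], w_x, w_h, b))
```

Because each new state depends on the previous one, feeding the same values in a different order generally gives a different final state.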

The **Keras RNN API** is designed with a focus on:


*   **Ease of use**: The built-in keras.layers.RNN, keras.layers.LSTM, keras.layers.GRU layers enable you to quickly build recurrent models without having to make difficult configuration choices.

*   **Ease of customization**: You can also define your own RNN cell layer (the inner part of the for loop) with custom behavior, and use it with the generic keras.layers.RNN layer (the for loop itself). This allows you to quickly prototype different research ideas in a flexible way with minimal code.

For more information about building RNNs in Keras, please visit the official TensorFlow documentation [here](https://www.tensorflow.org/guide/keras/rnn).

In [None]:
import numpy as np  # needed below to convert the padded sequences to an array
import tensorflow as tf

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

In [None]:
# initialize a tokenizer first
tokenizer = Tokenizer()

overview = open('eee-overview.txt', 'r').read()
corpus = overview.lower().split("\n")
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1

print(tokenizer.word_index)
print(total_words)

In [None]:
input_sequences = []
for line in corpus:
	token_list = tokenizer.texts_to_sequences([line])[0]
	for i in range(1, len(token_list)):
		n_gram_sequence = token_list[:i+1]
		input_sequences.append(n_gram_sequence)

# pad sequences 
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

# create predictors and label
xs, labels = input_sequences[:,:-1],input_sequences[:,-1]
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)
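To see what the cell above produces, here is a small pure-Python walk-through on a single hypothetical line of four token ids. It follows the same steps (prefix n-grams, pre-padding, splitting off the last token as the label) without TensorFlow.

```python
token_list = [4, 2, 7, 1]  # hypothetical token ids for one line of text

# Every prefix of length >= 2 becomes one training sequence
n_grams = [token_list[: i + 1] for i in range(1, len(token_list))]
print(n_grams)  # [[4, 2], [4, 2, 7], [4, 2, 7, 1]]

# Pre-pad with zeros so that all sequences share the same length
max_len = max(len(seq) for seq in n_grams)
padded = [[0] * (max_len - len(seq)) + seq for seq in n_grams]
print(padded)  # [[0, 0, 4, 2], [0, 4, 2, 7], [4, 2, 7, 1]]

# The last token is the label; everything before it is the predictor
xs = [seq[:-1] for seq in padded]
labels = [seq[-1] for seq in padded]
print(xs)      # [[0, 0, 4], [0, 4, 2], [4, 2, 7]]
print(labels)  # [2, 7, 1]
```

In the notebook, the labels are then one-hot encoded with `to_categorical` so they match the softmax output of the model.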

In [None]:
# model building
model = Sequential()
model.add(Embedding(total_words, 32, input_length=max_sequence_len-1))
model.add(SimpleRNN(32))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
simple_history = model.fit(xs, ys, epochs=100, verbose=1)

In [None]:
seed_text = "eee is"
next_words = 20
  
for _ in range(next_words):
	token_list = tokenizer.texts_to_sequences([seed_text])[0]
	token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
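	# predict_classes() was removed in newer TensorFlow releases; take the argmax of the softmax output instead
	predicted = int(tf.argmax(model.predict(token_list, verbose=0), axis=-1)[0])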
	output_word = ""
	for word, index in tokenizer.word_index.items():
		if index == predicted:
			output_word = word
			break
	seed_text += " " + output_word
print(seed_text)

## **3. Lyrics Generator (Word Tokenization)**

This tutorial demonstrates how to generate text from a given corpus using word tokenization and an RNN. The corpus contains the song titles and lyrics of many famous Beatles songs (credit: [petrosDemetrakopoulos](https://github.com/petrosDemetrakopoulos/)). Given a sequence of words from Beatles lyrics, the model predicts the next words.





<img src = "https://miro.medium.com/max/2560/0*SUipu9efyQeKHdlk." width = 70%>

***Task 3: Build the Model and Train It!***

In this task, you will experience the whole process of building a lyrics generator using TensorFlow Keras.

Follow the steps in **section two** (very important, make sure you understand them all) to create a lyrics generator for the given text file.

In [None]:
import tensorflow as tf
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, GRU, SimpleRNN, Dense, Dropout, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import regularizers

In [None]:
!wget

In [None]:
# open the 'lyrics.txt' file
text = None
# print length of text and first 250 words of the text
# INSERT YOUR CODE HERE


In [None]:
tokenizer = Tokenizer()

corpus = None

# Tokenization Process
# INSERT YOUR CODE HERE
total_words = None # vocabulary size (+1 because index 0 is reserved for padding)

print(tokenizer.word_index)
print(total_words)

In [None]:
input_sequences = []
for line in corpus:
	token_list = tokenizer.texts_to_sequences([line])[0]
	for i in range(1, len(token_list)):
		# Generate n-gram sequences
		# INSERT YOUR CODE HERE

# max of input sequences
max_sequence_len = None
# pad sequences
input_sequences = None

# create predictors and label
predictors, labels = None
label = None

In [None]:
model = None

# Build your own RNN model with LSTM or GRU layers
# INSERT YOUR CODE HERE

model.summary()

In [None]:
# train your model and store it in history
history = None

***Task 4: Our Accuracy Graph over Epochs***

In [None]:
import matplotlib.pyplot as plt

# Use pyplot to plot a graph with x as epoch number, y as accuracy
def plot_graphs(history, string):
  #INSERT YOUR CODE HERE

plot_graphs(history, 'accuracy')

***Task 5: Predict Words Using Your Starting Text***

In [None]:
seed_text = "One day"
next_words = 10
  
for _ in range(next_words):
	# INSERT YOUR CODE HERE

print(seed_text)

## **4. Shakespeare Generator (Character Tokenization)**

This tutorial demonstrates how to generate text using a character-based RNN. We will work with a dataset of Shakespeare's writing from Andrej Karpathy's The Unreasonable Effectiveness of Recurrent Neural Networks. Given a sequence of characters from this data ("Shakespear"), train a model to predict the next character in the sequence ("e"). Longer sequences of text can be generated by calling the model repeatedly.

In [None]:
import tensorflow as tf
import numpy as np
import time
#Download the dataset
path_to_file = tf.keras.utils.get_file('shakespeare.txt',
                                       'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
#Explore the data
text = open(path_to_file, 'r').read()
print(text[:100])

***Task 6: Text Processing on Character Level***

In [None]:
# Create the vocab of the given text
vocab = None
print ('{} unique characters'.format(len(vocab)))

# Creating a mapping from unique characters to indices
char2idx = None
idx2char = np.array(vocab)

# Map the character to corresponding integer (char2idx)
text_as_int = np.array(None)
for char,_ in zip(char2idx, range(5)):
    print(repr(char), ':', char2idx[char])

In [None]:
# The maximum length sentence we want for a single input in characters
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(5):
  print(idx2char[i.numpy()])

In [None]:
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)
for item in sequences.take(5):
  print(repr(''.join(idx2char[item.numpy()])))

In [None]:
def split_input_target(chunk):
  input_text = chunk[:-1]
  target_text = chunk[1:]
  return input_text, target_text
dataset = sequences.map(split_input_target)
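The same shift-by-one idea can be checked on a plain string. In this sketch the helper is renamed `split_chunk` (a hypothetical name, chosen to avoid clashing with the notebook function), and the chunk is the word "Shakespeare" rather than a tensor of character ids.

```python
def split_chunk(chunk):
    """Return (input, target): the target is the input shifted one position to the left."""
    return chunk[:-1], chunk[1:]

input_text, target_text = split_chunk("Shakespeare")
print(repr(input_text))   # 'Shakespear'
print(repr(target_text))  # 'hakespeare'

# At each timestep the model sees one input character and should predict the next one
for step, (inp, tgt) in enumerate(zip(input_text[:3], target_text[:3])):
    print("step {}: input {!r} -> expected {!r}".format(step, inp, tgt))
```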

In [None]:
# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset

In [None]:
# Length of the vocabulary in chars
vocab_size = len(vocab)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

In [None]:
# Helper function for building new model
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):

  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
    tf.keras.layers.LSTM(rnn_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.Dense(vocab_size)
  ])

  return model

# Loss function: sparse categorical cross-entropy on the raw logits (the Dense layer has no softmax)
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

***Task 7: Build Shakespeare Generator Model***

In [None]:
# Complete the model information below
model = build_model(
    vocab_size=len(None),
    embedding_dim=None,
    rnn_units=None,
    batch_size=None)

model.summary()

model.compile(optimizer=None, loss=loss, metrics=None)

In [None]:
import os
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

In [None]:
# Train your model here
history = model.fit(dataset, epochs=20, callbacks=[checkpoint_callback], verbose=1)

In [None]:
# Simplify the output model with batch_size = 1 for ease of prediction
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))

In [None]:
def generate_text(model, start_string):
  # Evaluation step (generating text using the learned model)

  # Number of characters to generate
  num_generate = 1000

  # Converting our start string to numbers (vectorizing)
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty list to store the generated characters
  text_generated = []

  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting.
  temperature = 1.0

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
    predictions = model(input_eval)
    # remove the batch dimension
    predictions = tf.squeeze(predictions, 0)

    # using a categorical distribution to predict the character returned by the model
    predictions = predictions / temperature
    predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

    # We pass the predicted character as the next input to the model
    # along with the previous hidden state
    input_eval = tf.expand_dims([predicted_id], 0)

    text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))
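To get a feel for the `temperature` setting used above, the sketch below applies a softmax to some hypothetical logits at different temperatures. It uses only the standard library; the function name and the logit values are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then normalize them into probabilities."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three characters
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

A lower temperature concentrates probability on the highest-scoring character, while a higher temperature flattens the distribution, which is why sampled text becomes more surprising.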

***Task 8: Check Your Prediction Result***

In [None]:
# Based on the helper function for generating text using the learned model above
# print the generated text with your preferred starting string

# INSERT YOUR CODE HERE


## **(Optional) Text Generation using GPT-2 Transformer**

The OpenAI GPT-2 model was proposed in *Language Models are Unsupervised Multitask Learners* by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. It’s a causal (unidirectional) transformer pretrained using language modeling on a very large corpus of ~40 GB of text data.

GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.

Here, we will use a lighter version, [GPT-2 Medium](https://huggingface.co/gpt2-medium), for convenience and demonstration purposes.

In [None]:
!pip install transformers

In [None]:
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer, GPT2Config

In [None]:
model_name = "gpt2-medium"
config = GPT2Config.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
gptmodel = TFGPT2LMHeadModel.from_pretrained(model_name, config=config)

In [None]:
from transformers import set_seed
set_seed(23)

In [None]:
input_ids = tokenizer.encode('I love machine learning and data analytics,', return_tensors='tf')
input_ids

In [None]:
output = gptmodel.generate(input_ids, max_length=15)
print('Output:\n')
print(tokenizer.decode(output[0], skip_special_tokens=True))

In [None]:
sample_outputs = gptmodel.generate(
    input_ids,
    do_sample=True,
    max_length=30,
    top_k=50,
    top_p=0.95,
    num_return_sequences=8
)

for i, sample_output in enumerate(sample_outputs):
  
  print("{}: {}".format(i, tokenizer.decode(sample_output)))