<a href="https://colab.research.google.com/github/rahiakela/hands-on-machine-learning-with-scikit-learn-keras-and-tensorflow/blob/16-natural-nanguage-processing-with-RNNs-and-Attention/1_generating_shakespearean_text_using_character_rnn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generating Shakespearean Text Using a Character RNN

A common approach for natural language tasks is to use recurrent neural networks. We will therefore continue to explore RNNs, starting with a character RNN, trained to predict the next character in a sentence. This will allow us to generate some original text, and in the process we will see how to build a TensorFlow Dataset on a very long sequence.

We will first use a stateless RNN (which learns on random portions of text at each iteration, without any information on the rest of the text), then we will build a stateful RNN (which preserves the hidden state between training iterations and continues reading where it left off, allowing it to learn longer patterns).

Let’s start with a simple and fun model that can write like Shakespeare (well, sort of).

## Setup

In [1]:
import sys
assert sys.version_info >= (3, 5)  # Python ≥3.5 is required

import sklearn 
assert sklearn.__version__ >= "0.20"  # Scikit-Learn ≥0.20 is required

# %tensorflow_version only exists in Colab.
try:
  %tensorflow_version 2.x
  IS_COLAB = True
except Exception:
  IS_COLAB = False
  pass

# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= '2.0'

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

if not tf.config.list_physical_devices('GPU'):
    print("No GPU was detected. LSTMs and CNNs can be very slow without a GPU.")
    if IS_COLAB:
        print("Go to Runtime > Change runtime and select a GPU hardware accelerator.")

TensorFlow 2.x selected.


## Generating Shakespearean Text Using a Character RNN

In a famous 2015 [blog post](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) titled “The Unreasonable Effectiveness of Recurrent Neural Networks,” Andrej Karpathy showed how to train an RNN to predict the next character in a sentence. This Char-RNN can then be used to generate novel text, one character at a time. Here is a small sample of the text generated by a Char-RNN model after it was trained on all of Shakespeare’s work:



---

PANDARUS:

Alas, I think he shall be come approached and the day

When little srain would be attain’d into being never fed,

And who is but a chain and subjects of his death,

I should not sleep.

---

Not exactly a masterpiece, but it is still impressive that the model was able to learn words, grammar, proper punctuation, and more, just by learning to predict the next character in a sentence. 

Let’s look at how to build a Char-RNN, step by step, starting
with the creation of the dataset.


## Splitting a sequence into batches of shuffled windows

For example, let's split the sequence 0 to 14 into windows of length 5, each shifted by 2 (e.g., `[0, 1, 2, 3, 4]`, `[2, 3, 4, 5, 6]`, etc.), and shuffle them. We then split each window into inputs (the first 4 steps) and targets (the last 4 steps), so `[2, 3, 4, 5, 6]` becomes `[[2, 3, 4, 5], [3, 4, 5, 6]]`, and finally create batches of 3 such input/target pairs:

In [2]:
np.random.seed(42)
tf.random.set_seed(42)

n_steps = 5
dataset = tf.data.Dataset.from_tensor_slices(tf.range(15))
dataset = dataset.window(n_steps, shift=2, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(n_steps))
dataset = dataset.shuffle(10).map(lambda window: (window[:-1], window[1:]))
dataset = dataset.batch(3).prefetch(1)
for index, (X_batch, Y_batch) in enumerate(dataset):
    print("_" * 20, "Batch", index, "\nX_batch")
    print(X_batch.numpy())
    print("=" * 5, "\nY_batch")
    print(Y_batch.numpy())

____________________ Batch 0 
X_batch
[[6 7 8 9]
 [2 3 4 5]
 [4 5 6 7]]
===== 
Y_batch
[[ 7  8  9 10]
 [ 3  4  5  6]
 [ 5  6  7  8]]
____________________ Batch 1 
X_batch
[[ 0  1  2  3]
 [ 8  9 10 11]
 [10 11 12 13]]
===== 
Y_batch
[[ 1  2  3  4]
 [ 9 10 11 12]
 [11 12 13 14]]


## Creating the Training Dataset

First, let’s download all of Shakespeare’s work, using Keras’s handy get_file() function and downloading the data from Andrej Karpathy’s Char-RNN project:

In [3]:
shakespeare_url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
filepath = keras.utils.get_file('shakespeare.txt', shakespeare_url)

with open(filepath) as f:
  shakespeare_text = f.read()

print(shakespeare_text[:148])

Downloading data from https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?



In [4]:
"".join(sorted(set(shakespeare_text.lower())))

"\n !$&',-.3:;?abcdefghijklmnopqrstuvwxyz"

We must encode every character as an integer. One option is to create a custom preprocessing layer. 

But in this case, it will be simpler to use Keras’s Tokenizer class. First we need to fit a tokenizer to the text: it will find all the characters used in the text and map each of them to a different character ID, from 1 to the number of distinct characters.

In [0]:
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(shakespeare_text)

We set char_level=True to get character-level encoding rather than the default word-level encoding. Note that this tokenizer converts the text to lowercase by default (but you can set lower=False if you do not want that). 
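To make the character-to-ID mapping concrete, here is a minimal plain-Python sketch of a character-level vocabulary. It is not the Keras implementation: the helper names (`fit_char_vocab`, `encode`, `decode`) are hypothetical, and ranking the IDs from most to least frequent character is an assumption made for illustration.

```python
from collections import Counter

def fit_char_vocab(text, lower=True):
    """Map each distinct character to an ID from 1 to the number of distinct characters."""
    if lower:
        text = text.lower()
    counts = Counter(text)
    # Assumption: the most frequent character gets ID 1, the next one ID 2, and so on.
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}

def encode(text, vocab, lower=True):
    """Convert a string to a list of character IDs."""
    if lower:
        text = text.lower()
    return [vocab[ch] for ch in text]

def decode(ids, vocab):
    """Convert a list of character IDs back to a string."""
    id_to_char = {i: ch for ch, i in vocab.items()}
    return "".join(id_to_char[i] for i in ids)

sample = "First Citizen: hear me speak."
vocab = fit_char_vocab(sample)
ids = encode("First", vocab)
print(ids, decode(ids, vocab))  # the decoded text is lowercase because lower=True
```

Because the text is lowercased before counting, encoding and then decoding does not restore the original capitalization.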

Now the tokenizer can encode a sentence (or a list of sentences) to a list of character IDs and back, and it tells us how many distinct characters there are and the total number of characters in the text:

In [6]:
tokenizer.texts_to_sequences(['First'])

[[20, 6, 9, 8, 3]]

In [7]:
tokenizer.sequences_to_texts([[20, 6, 8, 3]])

['f i s t']

In [8]:
max_id = len(tokenizer.word_index) # number of distinct characters
dataset_size = tokenizer.document_count  # total number of characters
print(max_id)
print(dataset_size)

39
1115394


Let’s encode the full text so each character is represented by its ID (we subtract 1 to get IDs from 0 to 38, rather than from 1 to 39):

In [0]:
[encoded] = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1

### How to Split a Sequential Dataset

It is very important to avoid any overlap between the training set, the validation set, and the test set.

When dealing with time series, you would in general split across time: for example, you might take the years 2000 to 2012 for the training set, the years 2013 to 2015 for the validation set, and the years 2016 to 2018 for the test set. However, in some cases you may be able to split along other dimensions, which will give you a longer time period to train on. The risk is that series that end up in different sets may be strongly correlated, in which case the test set gives an optimistically biased estimate of the generalization error.

So, it is often safer to split across time—but this implicitly assumes that the patterns the RNN can learn in the past (in the training set) will still exist in the future. In other words, we assume that the time series is *stationary* (at least in a wide sense).

In short, splitting a time series into a training set, a validation set, and a test set is not a trivial task, and how it’s done will depend strongly on the task at hand.
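As a rough illustration of splitting across time, the sketch below cuts a sequence into three contiguous, non-overlapping parts. The function name `split_sequence` and the 90/5/5 proportions are assumptions chosen for the example, not something prescribed by the text.

```python
def split_sequence(seq, train_pct=90, valid_pct=5):
    """Split a sequence into contiguous train/validation/test parts, preserving order."""
    n = len(seq)
    train_end = n * train_pct // 100
    valid_end = train_end + n * valid_pct // 100
    return seq[:train_end], seq[train_end:valid_end], seq[valid_end:]

sequence = list(range(1000))  # stand-in for the encoded text
train, valid, test = split_sequence(sequence)
print(len(train), len(valid), len(test))  # 900 50 50
```

Since the parts are plain slices, every item lands in exactly one set and the training data always precedes the validation and test data.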

Let’s take the first 90% of the text for the training set
(keeping the rest for the validation set and the test set), and create a tf.data.Dataset that will return each character one by one from this set.

In [0]:
train_size = dataset_size * 90 // 100
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])

### Chopping the Sequential Dataset into Multiple Windows

The training set now consists of a single sequence of over a million characters, so we can’t just train the neural network directly on it: the RNN would be equivalent to a deep net with over a million layers, and we would have a single (very long) instance to train it. Instead, we will use the dataset’s window() method to convert this long sequence of characters into many smaller windows of text. 

Every instance in the dataset will be a fairly short substring of the whole text, and the RNN will be unrolled only over the length of these substrings. This is called **truncated backpropagation through time**.

In [0]:
n_steps = 100
window_length = n_steps + 1  # target = input shifted 1 character ahead
dataset = dataset.repeat().window(window_length, shift=1, drop_remainder=True) 

By default, the window() method creates nonoverlapping windows, but to get the largest possible training set we use shift=1 so that the first window contains characters 0 to 100, the second contains characters 1 to 101, and so on. To ensure that all windows are exactly 101 characters long, we set drop_remainder=True.
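The effect of the window length, the shift, and dropping incomplete windows can be sketched without TensorFlow. The helper `sliding_windows` below is hypothetical and only mimics the behavior described above on an in-memory list.

```python
def sliding_windows(seq, length, shift=1):
    """Return every full window of `length` items, each starting `shift` items after the previous one."""
    last_start = len(seq) - length  # later starts would give short windows, which are dropped
    return [seq[start:start + length] for start in range(0, last_start + 1, shift)]

chars = list("to be or not")
for window in sliding_windows(chars, length=5, shift=1)[:3]:
    print(repr("".join(window)))

# With shift=1 there are len(seq) - length + 1 windows (8 here).
print(len(sliding_windows(chars, length=5, shift=1)))
```

With `shift=1`, consecutive windows overlap in all but one character, which is why this setting yields the largest possible number of training instances.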

The window() method creates a dataset that contains windows, each of which is also represented as a dataset. It’s a nested dataset, analogous to a list of lists. This is useful when you want to transform each window by calling its dataset methods (e.g., to shuffle them or batch them).

However, we cannot use a nested dataset directly for training, as our model will expect tensors as input, not datasets. So, we must call the flat_map() method: it converts a nested dataset into a flat dataset (one that does not contain datasets).

In [0]:
dataset = dataset.flat_map(lambda window: window.batch(window_length))

Notice that we call batch(window_length) on each window: since all windows have exactly that length, we will get a single tensor for each of them. 

Now the dataset contains consecutive windows of 101 characters each. Since Gradient Descent works best when the instances in the training set are independent and identically distributed, we need to shuffle these windows. Then we can batch the windows and separate the inputs (the first 100 characters) from the targets (the last 100 characters, i.e., the inputs shifted one character ahead):

In [0]:
np.random.seed(42)
tf.random.set_seed(42)

batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))

<img src='https://github.com/rahiakela/img-repo/blob/master/hands-on-machine-learning-keras-tensorflow/shuffled-windows.png?raw=1' width='800'/>

The figure above summarizes the dataset preparation steps discussed so far (showing windows of length 11 rather than 101, and a batch size of 3 instead of 32).
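The input/target split is easiest to see on a short string. In this small sketch (the helper name `split_window` is hypothetical), the target at each position is simply the character that follows the corresponding input character.

```python
def split_window(window):
    """Inputs are all items but the last; targets are the same items shifted one step ahead."""
    return window[:-1], window[1:]

text = "first citizen"
inputs, targets = split_window(text)
for x, y in zip(inputs[:4], targets[:4]):
    print(repr(x), "->", repr(y))
```

Inputs and targets have the same length, so the model is trained to predict the next character at every time step, not just at the end of the window.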

Categorical input features should generally be encoded, usually as one-hot vectors or as embeddings. Here, we will encode each character using a one-hot vector because there are fairly few distinct characters (only 39).
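For readers unfamiliar with one-hot encoding, here is a minimal plain-Python sketch. The function `one_hot` below is a hypothetical stand-in written for illustration; it is not TensorFlow's implementation, and it uses a depth of 4 rather than 39 to keep the output readable.

```python
def one_hot(ids, depth):
    """Turn each integer ID into a vector of length `depth` with a single 1.0 at index ID."""
    return [[1.0 if j == i else 0.0 for j in range(depth)] for i in ids]

vectors = one_hot([2, 0, 1], depth=4)
for row in vectors:
    print(row)
```

Each character ID becomes a vector whose length is the number of distinct characters, which is where the last dimension of 39 in the dataset's input shape comes from.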

In [14]:
dataset = dataset.map(lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset.take(1)

<TakeDataset shapes: ((None, None, 39), (None, None)), types: (tf.float32, tf.int64)>

Finally, we just need to add prefetching.

In [15]:
dataset = dataset.prefetch(1)
dataset.take(1)

<TakeDataset shapes: ((None, None, 39), (None, None)), types: (tf.float32, tf.int64)>

In [16]:
for X_batch, Y_batch in dataset.take(1):
  print(X_batch.shape, Y_batch.shape)

(32, 100, 39) (32, 100)


That’s it! Preparing the dataset was the hardest part. Now let’s create the model.

## Building and Training the Char-RNN Model

To predict the next character based on the previous 100 characters, we can use an RNN with 2 GRU layers of 128 units each. Optionally, 20% dropout can be applied to both the inputs (dropout) and the hidden states (recurrent_dropout); the stateless model below leaves both at their defaults (no dropout). We can tweak these hyperparameters later, if needed. The output layer is a time-distributed Dense layer.

This layer must have 39 units (max_id) because there are 39 distinct characters in the text, and we want to output a probability for each possible character (at each time step). The output probabilities should sum up to 1 at each time step, so we apply the softmax activation function to the outputs of the Dense layer.

We can then compile this model, using the "sparse_categorical_crossentropy" loss and an Adam optimizer.
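To make the output layer and the loss more tangible, the following plain-Python sketch computes a softmax and the sparse categorical cross-entropy for a single time step. The function names are chosen for illustration, and this is a simplified view (one time step, no batching), not the Keras implementation.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)  # subtracting the max improves numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(target_id, probs):
    """Loss for one time step: negative log of the probability given to the true class ID."""
    return -math.log(probs[target_id])

probs = softmax([2.0, 1.0, 0.1])
loss = sparse_categorical_crossentropy(0, probs)
print([round(p, 3) for p in probs], round(loss, 3))

# A model that assigns equal probability to all 39 characters has a loss of ln(39).
uniform_loss = sparse_categorical_crossentropy(5, softmax([0.0] * 39))
print(round(uniform_loss, 2))
```

The uniform case, ln(39) ≈ 3.66, is a handy baseline: a training loss well below it means the model has learned something about which characters follow which.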

In [17]:
model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id]),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation='softmax'))                            
])

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
gru (GRU)                    (None, None, 128)         64896     
_________________________________________________________________
gru_1 (GRU)                  (None, None, 128)         99072     
_________________________________________________________________
time_distributed (TimeDistri (None, None, 39)          5031      
Total params: 168,999
Trainable params: 168,999
Non-trainable params: 0
_________________________________________________________________


In [18]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

history = model.fit(dataset, steps_per_epoch=train_size // batch_size, epochs=5)

Train for 31370 steps
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


## Using the Char-RNN Model

Now we have a model that can predict the next character in text written by Shakespeare. To feed it some text, we first need to preprocess it like we did earlier, so let’s create a little function for this:

In [0]:
def preprocess(texts):
  X = np.array(tokenizer.texts_to_sequences(texts)) - 1
  return tf.one_hot(X, max_id)

Now let’s use the model to predict the next letter in some text:

In [20]:
X_new = preprocess(['How are yo'])
Y_pred = np.argmax(model.predict(X_new), axis=-1)  # predict_classes() was removed from later TF 2.x releases
tokenizer.sequences_to_texts(Y_pred + 1)[0][-1]  # 1st sentence, last char

'u'

In [21]:
tokenizer.sequences_to_texts(Y_pred + 1)[0]  # 1st sentence, all chars

'e u   s   e   t o u'

Success! The model guessed right. Now let’s use this model to generate new text.

## Generating Fake Shakespearean Text

To generate new text using the Char-RNN model, we could feed it some text, make the model predict the most likely next letter, add it at the end of the text, then give the extended text to the model to guess the next letter, and so on. But in practice this often leads to the same words being repeated over and over again. 

Instead, we can pick the next character randomly, with a probability equal to the estimated probability, using TensorFlow’s tf.random.categorical() function. This will generate more diverse and interesting text. The categorical() function samples random class indices, given the class log probabilities (logits). 

In [22]:
tf.random.set_seed(42)

tf.random.categorical([[np.log(0.5), np.log(0.4), np.log(0.1)]], num_samples=40).numpy()

array([[0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
        2, 0, 0, 1, 1, 1, 0, 0, 1, 2, 0, 0, 1, 1, 0, 0, 0, 0]])
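The difference between always picking the most likely class and sampling according to the estimated probabilities can be sketched with the standard library alone. The helper `sample_class` is hypothetical and uses Python's `random` module rather than TensorFlow.

```python
import random
from collections import Counter

def sample_class(probs, rng):
    """Draw one class index with probability proportional to `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(42)
probs = [0.5, 0.4, 0.1]

# Greedy decoding always returns the same class for the same probabilities.
greedy = max(range(len(probs)), key=probs.__getitem__)

# Sampling returns each class roughly in proportion to its probability.
counts = Counter(sample_class(probs, rng) for _ in range(10000))
print("greedy:", greedy)
print("sampled counts:", sorted(counts.items()))
```

Greedy decoding never selects classes 1 or 2 here, while sampling picks them about 40% and 10% of the time, which is what makes sampled text more varied.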

To have more control over the diversity of the generated text, we can divide the logits by a number called the temperature, which we can tweak as we wish: a temperature close to 0 will favor the high-probability characters, while a very high temperature will give all characters an equal probability.
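The effect of the temperature can be checked numerically. The sketch below is plain Python with a hypothetical helper, `apply_temperature`: it divides the log probabilities by the temperature and renormalizes. It assumes all probabilities are strictly positive.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution: log, divide by temperature, then softmax."""
    scaled = [math.log(p) / temperature for p in probs]  # requires p > 0
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.5, 0.4, 0.1]
for t in (0.2, 1.0, 5.0):
    print(t, [round(p, 3) for p in apply_temperature(probs, t)])
```

A temperature of 1 leaves the distribution unchanged, values below 1 sharpen it toward the most likely character, and values above 1 flatten it toward a uniform distribution.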

In [0]:
def next_char(text, temperature=1):
  X_new = preprocess([text])
  y_proba = model.predict(X_new)[0, -1:, :]
  rescaled_logits = tf.math.log(y_proba) / temperature
  char_id = tf.random.categorical(rescaled_logits, num_samples=1) + 1

  return tokenizer.sequences_to_texts(char_id.numpy())[0]

In [24]:
tf.random.set_seed(42)

next_char('How are yo', temperature=1)

'u'

In [25]:
next_char('How is lif', temperature=1)

't'

We can write a small function that will repeatedly call next_char() to get the next character and append it to the given text.

In [0]:
def complete_text(text, n_chars=100, temperature=1):
  for _ in range(n_chars):
    text += next_char(text, temperature)
  return text

We are now ready to generate some text! Let’s try with different temperatures:

In [0]:
tf.random.set_seed(42)

print(complete_text('t', temperature=0.2))

In [28]:
print(complete_text('t', temperature=1))

tperves me from this faults revenged,
i bray it! supking, my enemy to the commoner to please yourselv


In [29]:
print(complete_text('t', temperature=2))

t as't so;,' rare' reigs as sir; broighews cipixict hid
curmom's helves with
yoke, are remialumiens; 


In [30]:
print(complete_text('p', temperature=0.2))

prayes.
they her care have the rest was deliver the heart and the belly and with her to the belly and


Apparently our Shakespeare model works best at a temperature close to 1. To generate more convincing text, you could try using more GRU layers and more neurons per layer, train for longer, and add some regularization.

Moreover, the model is currently incapable of learning patterns longer than n_steps, which is just 100 characters. You could try
making this window larger, but it will also make training harder, and even LSTM and GRU cells cannot handle very long sequences. 

Alternatively, you could use a stateful RNN.

## Stateful RNN

Until now, we have used only stateless RNNs: at each training iteration the model starts with a hidden state full of zeros, then it updates this state at each time step, and after the last time step, it throws it away, as it is not needed anymore. 

What if we told the RNN to preserve this final state after processing one training batch and use it as the initial state for the next training batch? 

This way the model can learn long-term patterns despite only backpropagating through short sequences. This is called a stateful RNN.

First, note that a stateful RNN only makes sense if each input sequence in a batch starts exactly where the corresponding sequence in the previous batch left off. So the first thing we need to do to build a stateful RNN is to use sequential and non-overlapping input sequences (rather than the shuffled and overlapping sequences we used to train stateless RNNs).

When creating the Dataset, we must therefore use shift=n_steps (instead of shift=1) when calling the window() method. Moreover, we must not call the shuffle() method.
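A small plain-Python sketch shows why a shift of n_steps gives consecutive, non-overlapping input sequences. The helper `consecutive_windows` is hypothetical and uses tiny numbers (n_steps of 5 instead of 100) for readability.

```python
def consecutive_windows(seq, n_steps):
    """Windows of n_steps + 1 items, each starting n_steps items after the previous one."""
    length = n_steps + 1  # one extra item so the targets can be shifted by one
    return [seq[s:s + length] for s in range(0, len(seq) - length + 1, n_steps)]

seq = list(range(21))
windows = consecutive_windows(seq, n_steps=5)
inputs = [w[:-1] for w in windows]
targets = [w[1:] for w in windows]
for x, y in zip(inputs, targets):
    print(x, "->", y)
```

Each window overlaps the next by exactly one item (the last target of one window is the first input of the next), so the input sequences themselves tile the text with no gap and no overlap.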

Unfortunately, batching is much harder when preparing a dataset for a stateful RNN than it is for a stateless RNN.

Indeed, if we were to call batch(32), then 32 consecutive windows would be put in the same batch, and the following batch would not continue each of these windows where it left off.

The first batch would contain windows 1 to 32 and the second batch would contain windows 33 to 64, so if you consider, say, the first window of each batch (i.e., windows 1 and 33), you can see that they are not consecutive. The simplest solution to this problem is to just use “batches” containing a single window:

In [0]:
tf.random.set_seed(42)

In [0]:
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])
dataset = dataset.window(window_length, shift=n_steps, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(window_length))
dataset = dataset.repeat().batch(1)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
dataset = dataset.map(lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset = dataset.prefetch(1)

<img src='https://github.com/rahiakela/img-repo/blob/master/hands-on-machine-learning-keras-tensorflow/sequence-fragments-for-stateful-rnn.png?raw=1' width='800'/>

Batching is harder, but it is not impossible. For example, we could chop Shakespeare’s text into 32 texts of equal length, create one dataset of consecutive input sequences for each of them, and finally use tf.data.Dataset.zip(tuple(datasets)).map(lambda *windows: tf.stack(windows)) to create proper consecutive batches, where the nth input sequence in a batch starts off exactly where the nth input sequence ended in the previous batch.

In [0]:
batch_size = 32
encoded_parts = np.array_split(encoded[:train_size], batch_size)
datasets = []

for encoded_part in encoded_parts:
  dataset = tf.data.Dataset.from_tensor_slices(encoded_part)
  dataset = dataset.window(window_length, shift=n_steps, drop_remainder=True)
  dataset = dataset.flat_map(lambda window: window.batch(window_length))
  datasets.append(dataset)

dataset = tf.data.Dataset.zip(tuple(datasets)).map(lambda *windows: tf.stack(windows))
dataset = dataset.repeat().map(lambda windows: (windows[:, :-1], windows[:, 1:]))
dataset = dataset.map(lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset = dataset.prefetch(1)
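The same batching idea can be traced by hand on a tiny sequence. This plain-Python sketch uses hypothetical helpers (`split_evenly`, `consecutive_windows`, `stateful_batches`) and small numbers (2 parts, n_steps of 5); it illustrates the principle and is not a replacement for the tf.data pipeline.

```python
def split_evenly(seq, n_parts):
    """Split seq into n_parts contiguous parts whose lengths differ by at most one."""
    q, r = divmod(len(seq), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        size = q + (1 if i < r else 0)
        parts.append(seq[start:start + size])
        start += size
    return parts

def consecutive_windows(seq, n_steps):
    """Windows of n_steps + 1 items, each starting n_steps items after the previous one."""
    length = n_steps + 1
    return [seq[s:s + length] for s in range(0, len(seq) - length + 1, n_steps)]

def stateful_batches(seq, batch_size, n_steps):
    """Batch k holds the kth window of every part, so row n always continues row n."""
    per_part = [consecutive_windows(p, n_steps) for p in split_evenly(seq, batch_size)]
    return [list(rows) for rows in zip(*per_part)]  # zip stops at the shortest part

batches = stateful_batches(list(range(42)), batch_size=2, n_steps=5)
for k, batch in enumerate(batches):
    print("batch", k, batch)
```

Row 0 of every batch walks through the first half of the sequence and row 1 through the second half, so the state kept for each row always matches the text that row sees next.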

Now let’s create the stateful RNN. 

First, we need to set stateful=True when creating every recurrent layer. 

Second, the stateful RNN needs to know the batch size (since it
will preserve a state for each input sequence in the batch), so we must set the batch_input_shape argument in the first layer.

Note that we can leave the second dimension unspecified, since the inputs could have any length.

In [0]:
model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, stateful=True, dropout=0.2, recurrent_dropout=0.2, 
                     batch_input_shape=[batch_size, None, max_id]),
    keras.layers.GRU(128, return_sequences=True, stateful=True, dropout=0.2, recurrent_dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation='softmax'))                             
])

At the end of each epoch, we need to reset the states before we go back to the beginning of the text. For this, we can use a small callback:

In [0]:
class ResetStatesCallback(keras.callbacks.Callback):
  def on_epoch_begin(self, epoch, logs=None):
    self.model.reset_states()

And now we can compile and fit the model.

In [39]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

steps_per_epoch = train_size // batch_size // n_steps

model.fit(dataset, epochs=50, steps_per_epoch=steps_per_epoch, callbacks=[ResetStatesCallback()])

Train for 313 steps
Epoch 1/50
...
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x7f951e5fafd0>
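The step counts reported in the two training logs can be verified with a little integer arithmetic. This is only a sanity check using the numbers printed earlier in the notebook; the variable names `stateless_steps`, `stateful_steps`, `part_length`, and `windows_per_part` are introduced here for illustration.

```python
dataset_size = 1115394                 # total number of characters, as printed earlier
train_size = dataset_size * 90 // 100  # first 90% of the text
batch_size, n_steps = 32, 100
window_length = n_steps + 1

# Stateless model: one step per batch of 32 shuffled windows.
stateless_steps = train_size // batch_size

# Stateful model: each of the 32 parts advances by n_steps characters per step.
stateful_steps = train_size // batch_size // n_steps

# Cross-check: full windows available in the shortest of the 32 parts.
part_length = train_size // batch_size
windows_per_part = (part_length - window_length) // n_steps + 1

print(train_size, stateless_steps, stateful_steps, windows_per_part)
# 1003854 31370 313 313
```

The last two numbers agree, so one epoch of 313 steps consumes every full window of each part exactly once before the dataset repeats.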

After this model is trained, it will only be possible to use it to make predictions for batches of the same size as were used during training.

To avoid this restriction, create an identical stateless model,
and copy the stateful model’s weights to this model.

In [0]:
stateless_model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id]),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation='softmax'))                                       
])

To set the weights, we first need to build the model (so the weights get created):

In [0]:
stateless_model.build(tf.TensorShape([None, None, max_id]))

stateless_model.set_weights(model.get_weights())
model = stateless_model

In [0]:
tf.random.set_seed(42)

In [48]:
print(complete_text('t'))

tant honour friends.
'tis mave strented by
do him of discoriolanus and voice as bloody the letter's g


In [44]:
print(complete_text('t', temperature=0.2))

tish the death,
and the stand the change of the seal the present
to the poor son, the strange to the 


In [45]:
print(complete_text('t', temperature=1))

t as sost; for we his
warwick,--and i death in point be cause
the kind welcome that require: no, no w


In [46]:
print(complete_text('t', temperature=2))

t:
aqhneed's me: bie o. caws-dafe?
at,,--by him comb!ocjnar'd court'rt?'
the voicoftiels faebe-clace.


In [47]:
print(complete_text('p', temperature=0.2))

part of the seeming the command
to the country to the stand the stands of the words,
and i will be so


Now that we have built a character-level model, it’s time to look at word-level models
and tackle a common natural language processing task: sentiment analysis.