
## Generating Text using Character RNN

A common approach for natural language tasks is to use recurrent neural networks. We will therefore continue to explore RNNs, starting with a character RNN, trained to predict the next character in a sentence. This will allow us to generate some original text, and in the process we will see how to build a TensorFlow Dataset on a very long sequence.

We will first use a stateless RNN (which learns on random portions of text at each iteration, without any information on the rest of the text), then we will build a stateful RNN (which preserves the hidden state between training iterations and continues reading where it left off, allowing it to learn longer patterns).

Let’s start with a simple and fun model that can write like Shakespeare (well, sort of).

## Setup

In [1]:
import sys
assert sys.version_info >= (3, 7)  # Python ≥3.7 is required

import sklearn 
assert sklearn.__version__ >= "0.20"  # Scikit-Learn ≥0.20 is required

# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= '2.0'

# Common imports
import numpy as np
import os
from pathlib import Path

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib.pyplot as plt

plt.rc('font', size=14)
plt.rc('axes', labelsize=14, titlesize=14)
plt.rc('legend', fontsize=14)
plt.rc('xtick', labelsize=10)
plt.rc('ytick', labelsize=10)

## Dataset

Let's download the Shakespeare data from Andrej Karpathy's char-rnn project.

In [2]:
shakespeare_url = "https://homl.info/shakespeare"
filepath = tf.keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
  shakespeare_text = f.read()

Downloading data from https://homl.info/shakespeare


In [3]:
# extra code – shows a short text sample
print(shakespeare_text[:80])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.


In [4]:
# extra code - shows all 39 distinct characters (after converting to lower case)
"".join(sorted(set(shakespeare_text.lower())))

"\n !$&',-.3:;?abcdefghijklmnopqrstuvwxyz"

Let's encode the text.

In [5]:
text_vec_layer = tf.keras.layers.TextVectorization(split="character", standardize="lower")
text_vec_layer.adapt([shakespeare_text])
encoded = text_vec_layer([shakespeare_text])[0]

Let’s subtract 2 from the character IDs, since token 0 (padding) and token 1 (unknown characters) are not needed here, and compute the number of distinct characters and the total number of characters.

In [6]:
encoded -= 2                                     # drop tokens 0 (pad) and 1 (unknown), which we will not use
n_tokens = text_vec_layer.vocabulary_size() - 2  # number of distinct chars = 39
dataset_size = len(encoded)                      # total number of chars = 1,115,394

print(n_tokens)
print(dataset_size)

39
1115394
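
To make the ID arithmetic concrete, here is a minimal pure-Python sketch of a character-level encoder. It is not the TextVectorization implementation: the helper names build_vocab and encode are hypothetical, and ordering characters by frequency with alphabetical tie-breaking is an assumption made for illustration. It reserves IDs 0 and 1 for padding and unknown tokens, then shifts the IDs down by 2.

```python
from collections import Counter

PAD, UNK = "", "[UNK]"  # placeholder entries for the two reserved IDs (0 and 1)

def build_vocab(text):
    """Return a list whose index is the character ID (hypothetical helper)."""
    counts = Counter(text.lower())
    # Assumption: most frequent characters first, ties broken alphabetically.
    chars = sorted(counts, key=lambda c: (-counts[c], c))
    return [PAD, UNK] + chars

def encode(text, vocab):
    """Map each character to its ID, using 1 for unknown characters."""
    char_to_id = {c: i for i, c in enumerate(vocab)}
    return [char_to_id.get(c, 1) for c in text.lower()]

sample = "To be or not to be"
vocab = build_vocab(sample)
ids = encode(sample, vocab)
shifted = [i - 2 for i in ids]   # same shift as `encoded -= 2`
n_chars = len(vocab) - 2         # distinct characters, reserved IDs excluded

print(vocab[2:])
print(shifted)
```

After the shift, the IDs run from 0 to n_chars - 1, which is the range the Embedding layer and the softmax output expect.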


Let’s convert a long sequence of character IDs into a dataset of input/target window pairs.

In [7]:
def to_dataset(sequence, length, shuffle=False, seed=None, batch_size=32):
  ds = tf.data.Dataset.from_tensor_slices(sequence)
  ds = ds.window(length + 1, shift=1, drop_remainder=True)
  ds = ds.flat_map(lambda window_ds: window_ds.batch(length + 1))
  if shuffle:
    ds = ds.shuffle(100_000, seed=seed)
  ds = ds.batch(batch_size)
  return ds.map(lambda window: (window[:, :-1], window[:, 1:])).prefetch(1)

In [8]:
# extra code – a simple example using to_dataset()
# There's just one sample in this dataset: the input represents "to b" and the output represents "o be"
list(to_dataset(text_vec_layer(["To be"])[0], length=4))

[(<tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[ 4,  5,  2, 23]])>,
  <tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[ 5,  2, 23,  3]])>)]

In [9]:
list(to_dataset(text_vec_layer(["To be or not to be"])[0], length=8))

[(<tf.Tensor: shape=(10, 8), dtype=int64, numpy=
  array([[ 4,  5,  2, 23,  3,  2,  5, 10],
         [ 5,  2, 23,  3,  2,  5, 10,  2],
         [ 2, 23,  3,  2,  5, 10,  2, 11],
         [23,  3,  2,  5, 10,  2, 11,  5],
         [ 3,  2,  5, 10,  2, 11,  5,  4],
         [ 2,  5, 10,  2, 11,  5,  4,  2],
         [ 5, 10,  2, 11,  5,  4,  2,  4],
         [10,  2, 11,  5,  4,  2,  4,  5],
         [ 2, 11,  5,  4,  2,  4,  5,  2],
         [11,  5,  4,  2,  4,  5,  2, 23]])>,
  <tf.Tensor: shape=(10, 8), dtype=int64, numpy=
  array([[ 5,  2, 23,  3,  2,  5, 10,  2],
         [ 2, 23,  3,  2,  5, 10,  2, 11],
         [23,  3,  2,  5, 10,  2, 11,  5],
         [ 3,  2,  5, 10,  2, 11,  5,  4],
         [ 2,  5, 10,  2, 11,  5,  4,  2],
         [ 5, 10,  2, 11,  5,  4,  2,  4],
         [10,  2, 11,  5,  4,  2,  4,  5],
         [ 2, 11,  5,  4,  2,  4,  5,  2],
         [11,  5,  4,  2,  4,  5,  2, 23],
         [ 5,  4,  2,  4,  5,  2, 23,  3]])>)]
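
The windowing logic in to_dataset() can also be traced without TensorFlow. The sketch below is a plain-Python illustration, not the tf.data pipeline itself; make_windows is a hypothetical helper that slides a window of length + 1 items one step at a time, drops incomplete windows, and splits each window into an input and a target shifted by one position.

```python
def make_windows(sequence, length, shift=1):
    """Return (input, target) pairs from sliding windows of length + 1 items."""
    pairs = []
    # Stop early so that every window is complete (like drop_remainder=True).
    for start in range(0, len(sequence) - length, shift):
        window = sequence[start:start + length + 1]
        pairs.append((window[:-1], window[1:]))
    return pairs

# One window only: the input is "to b" and the target is "o be".
print(make_windows("to be", length=4))

# An 18-character text with length=8 yields 10 overlapping windows.
pairs = make_windows("to be or not to be", length=8)
print(len(pairs))
print(pairs[-1])
```

Working on characters rather than IDs makes it easy to see that each target is its input shifted one character to the left.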

Now we’re ready to create the training set, the validation set, and the test set.

In [10]:
tf.random.set_seed(42)

length = 100
train_set = to_dataset(encoded[:1_000_000], length=length, shuffle=True, seed=42)
valid_set = to_dataset(encoded[1_000_000: 1_060_000], length=length)
test_set = to_dataset(encoded[1_060_000:], length=length)

## Building and Training the Model

Let’s build and train a model with one GRU layer composed of 128 units.

In [11]:
tf.random.set_seed(42)

model = keras.models.Sequential([
    keras.layers.Embedding(input_dim=n_tokens, output_dim=16),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.Dense(n_tokens, activation='softmax')                            
])

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 16)          624       
                                                                 
 gru (GRU)                   (None, None, 128)         56064     
                                                                 
 dense (Dense)               (None, None, 39)          5031      
                                                                 
Total params: 61,719
Trainable params: 61,719
Non-trainable params: 0
_________________________________________________________________
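
The parameter counts in this summary can be verified by hand. The sketch below assumes the default Keras GRU configuration (reset_after=True), which uses two bias vectors for each of the three weight blocks; the variable names are illustrative.

```python
n_tokens, embed_dim, units = 39, 16, 128

# Embedding: one vector of size embed_dim per token.
embedding_params = n_tokens * embed_dim

# GRU: 3 weight blocks (update gate, reset gate, candidate state), each with
# input weights, recurrent weights, and two bias vectors (assumed default).
gru_params = 3 * (embed_dim * units + units * units + 2 * units)

# Dense: one weight per (unit, token) pair plus one bias per token.
dense_params = units * n_tokens + n_tokens

total = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total)  # 624 56064 5031 61719
```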


In [None]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='nadam', metrics=["accuracy"])

model_ckpt = keras.callbacks.ModelCheckpoint("my_shakespeare_model", monitor="val_accuracy", save_best_only=True)
history = model.fit(train_set, validation_data=valid_set, epochs=10, callbacks=[model_ckpt])

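This model does not handle text preprocessing, so let's wrap it in a final model that includes the text vectorization layer and accepts raw strings.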

In [13]:
shakespeare_model = keras.Sequential([
    text_vec_layer,
    keras.layers.Lambda(lambda x: x - 2),  # no <PAD> or <UNK> tokens
    model
])

And now let’s use it to predict the next character in a sentence.

In [14]:
y_proba = shakespeare_model.predict(["To be or not to b"])[0, -1]
# choose the most probable character ID
y_pred = tf.argmax(y_proba)
text_vec_layer.get_vocabulary()[y_pred + 2]



'e'

In [15]:
y_proba = shakespeare_model.predict(["I love yo"])[0, -1]
# choose the most probable character ID
y_pred = tf.argmax(y_proba)
text_vec_layer.get_vocabulary()[y_pred + 2]



'u'

In [16]:
y_proba = shakespeare_model.predict(["Where are you goin"])[0, -1]
# choose the most probable character ID
y_pred = tf.argmax(y_proba)
text_vec_layer.get_vocabulary()[y_pred + 2]



'g'

## Generating Fake Shakespearean Text

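Always picking the most probable character tends to produce the same words over and over. To generate more diverse and interesting text, we can instead sample the next character randomly, using tf.random.categorical(), which samples random class indices given the class log probabilities (logits).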

In [20]:
tf.random.set_seed(42)

# probas = 50%, 40%, and 10%
log_probas = tf.math.log([[0.5, 0.4, 0.1]])
tf.random.categorical(log_probas, num_samples=8) # draw 8 samples

<tf.Tensor: shape=(1, 8), dtype=int64, numpy=array([[0, 0, 1, 1, 1, 0, 0, 0]])>

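Let's gain more control over the diversity of the generated text using a temperature: we divide the logits by the temperature before sampling. A temperature close to 0 favors high-probability characters, while a high temperature gives all characters a nearly equal probability.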

In [23]:
def next_char(text, temperature=1):
  y_proba = shakespeare_model.predict([text])[0, -1:]
  rescaled_logits = tf.math.log(y_proba) / temperature
  char_id = tf.random.categorical(rescaled_logits, num_samples=1)[0, 0]

  return text_vec_layer.get_vocabulary()[char_id + 2]

def extend_text(text, n_chars=50, temperature=1):
  for _ in range(n_chars):
    text += next_char(text, temperature)
  return text
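
To see what the temperature does to a distribution, the following sketch rescales the example probabilities (50%, 40%, 10%) in plain Python. apply_temperature is a hypothetical helper, not part of the notebook's model code; it divides the log-probabilities by the temperature and renormalizes them with a softmax, which is the distribution tf.random.categorical() samples from when given those rescaled logits.

```python
import math

def apply_temperature(probas, temperature):
    """Rescale a probability distribution by a temperature (illustrative helper).

    Assumes every probability is strictly positive.
    """
    logits = [math.log(p) / temperature for p in probas]
    # Subtract the max logit before exponentiating, for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

probas = [0.5, 0.4, 0.1]
for temperature in (0.01, 1, 100):
    rescaled = apply_temperature(probas, temperature)
    print(temperature, [round(p, 3) for p in rescaled])
```

At a temperature of 0.01 almost all the probability mass goes to the most likely class, at 1 the distribution is unchanged, and at 100 it is nearly uniform.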

We are now ready to generate some text! Let’s try with different temperature values.

In [None]:
tf.random.set_seed(42)

t_text = extend_text('To be or not to be', temperature=0.01)

In [27]:
print(t_text)

To be or not to be a man i shall be a man i shall be a man i shall b


In [None]:
tf.random.set_seed(42)

t_text = extend_text('To be or not to be', temperature=1)

In [29]:
print(t_text)

To be or not to be hence
sinnertys, as it was noth.

page:
thou shal


In [None]:
tf.random.set_seed(42)

t_text = extend_text('To be or not to be', temperature=100)

In [31]:
print(t_text)

To be or not to be!:q?

ddidn:;&yoe-3
j.&lvj,s-pxh. b:kx:o? woystj3



## Stateful RNN

Until now, we have used only stateless RNNs: at each training iteration the model starts with a hidden state full of zeros, then it updates this state at each time step, and after the last time step, it throws it away, as it is not needed anymore. 

What if we told the RNN to preserve this final state after processing one training batch and use it as the initial state for the next training batch? 

This way the model can learn long-term patterns despite only backpropagating through short sequences. This is called a stateful RNN.

First, note that a stateful RNN only makes sense if each input sequence in a batch starts exactly where the corresponding sequence in the previous batch left off. So the first thing we need to do to build a stateful RNN is to use sequential and non-overlapping input sequences (rather than the shuffled and overlapping sequences we used to train stateless RNNs).

When creating the Dataset, we must therefore use shift=n_steps (instead of shift=1) when calling the window() method. Moreover, we must not call the shuffle() method.

Unfortunately, batching is much harder when preparing a dataset for a stateful RNN than it is for a stateless RNN.

Indeed, if we were to call batch(32), then 32 consecutive windows would be put in the same batch, and the following batch would not continue each of these windows where it left off.

The first batch would contain windows 1 to 32 and the second batch would contain windows 33 to 64, so if you consider, say, the first window of each batch (i.e., windows 1 and 33), you can see that they are not consecutive. The simplest solution to this problem is to just use “batches” containing a single window:

In [None]:
tf.random.set_seed(42)

In [None]:
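# Assumed values (not defined earlier in this notebook), matching the stateless setup
train_size = 1_000_000       # same training split as above
n_steps = 100                # number of input time steps per window
window_length = n_steps + 1  # inputs plus one extra character for the targets
max_id = n_tokens            # number of distinct characters (39)

dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])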
dataset = dataset.window(window_length, shift=n_steps, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(window_length))
dataset = dataset.repeat().batch(1)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
dataset = dataset.map(lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset = dataset.prefetch(1)

<img src='https://github.com/rahiakela/img-repo/blob/master/hands-on-machine-learning-keras-tensorflow/sequence-fragments-for-stateful-rnn.png?raw=1' width='800'/>

Batching is harder, but it is not impossible. For example, we could chop Shakespeare’s text into 32 texts of equal length, create one dataset of consecutive input sequences for each of them, and finally use tf.data.Dataset.zip(datasets).map(lambda *windows: tf.stack(windows)) to create proper consecutive batches, where the nth input sequence in a batch starts off exactly where the nth input sequence ended in the previous batch.

In [None]:
batch_size = 32
encoded_parts = np.array_split(encoded[:train_size], batch_size)
datasets = []

for encoded_part in encoded_parts:
  dataset = tf.data.Dataset.from_tensor_slices(encoded_part)
  dataset = dataset.window(window_length, shift=n_steps, drop_remainder=True)
  dataset = dataset.flat_map(lambda window: window.batch(window_length))
  datasets.append(dataset)

dataset = tf.data.Dataset.zip(tuple(datasets)).map(lambda *windows: tf.stack(windows))
dataset = dataset.repeat().map(lambda windows: (windows[:, :-1], windows[:, 1:]))
dataset = dataset.map(lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset = dataset.prefetch(1)
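
The effect of this splitting can be checked on a toy sequence. The sketch below is plain Python rather than tf.data, and the helper names windows_of and stateful_batches are hypothetical; it uses 2 parts instead of 32 and n_steps=3 so the batches are small enough to read.

```python
def windows_of(part, n_steps):
    """Windows of n_steps + 1 items, shifted by n_steps (inputs do not overlap)."""
    window_length = n_steps + 1
    return [part[start:start + window_length]
            for start in range(0, len(part) - window_length + 1, n_steps)]

def stateful_batches(sequence, batch_size, n_steps):
    """Split the sequence into batch_size parts, then zip their windows."""
    part_len = len(sequence) // batch_size  # assumption: drop any leftover items
    parts = [sequence[i * part_len:(i + 1) * part_len] for i in range(batch_size)]
    per_part = [windows_of(part, n_steps) for part in parts]
    # Batch k holds the k-th window of every part, split into inputs and targets.
    return [([w[:-1] for w in ws], [w[1:] for w in ws]) for ws in zip(*per_part)]

batches = stateful_batches(list(range(20)), batch_size=2, n_steps=3)
for X_batch, Y_batch in batches:
    print(X_batch, Y_batch)
```

Row n of each input batch starts right after the last input of row n in the previous batch, which is the property a stateful RNN relies on.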

Now let’s create the stateful RNN. 

First, we need to set stateful=True when creating every recurrent layer. 

Second, the stateful RNN needs to know the batch size (since it will preserve a state for each input sequence in the batch), so we must set the batch_input_shape argument in the first layer.

Note that we can leave the second dimension unspecified, since the inputs could have any length.

In [None]:
model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, stateful=True, dropout=0.2, recurrent_dropout=0.2, 
                     batch_input_shape=[batch_size, None, max_id]),
    keras.layers.GRU(128, return_sequences=True, stateful=True, dropout=0.2, recurrent_dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation='softmax'))                             
])

At the end of each epoch, we need to reset the states before we go back to the beginning of the text. For this, we can use a small callback:

In [None]:
class ResetStatesCallback(keras.callbacks.Callback):
  def on_epoch_begin(self, epoch, logs):
    self.model.reset_states()

And now we can compile and fit the model.

In [None]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

steps_per_epoch = train_size // batch_size // n_steps

model.fit(dataset, epochs=50, steps_per_epoch=steps_per_epoch, callbacks=[ResetStatesCallback()])
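
Because the dataset is repeated indefinitely with repeat(), Keras cannot infer where an epoch ends, which is why steps_per_epoch is passed to fit(). The arithmetic can be cross-checked as follows; this is a sketch that assumes train_size = 1_000_000 (the training split used for the stateless model) and n_steps = 100, values this section does not state explicitly.

```python
train_size = 1_000_000   # assumption: same split as the stateless training set
batch_size = 32
n_steps = 100            # assumption: same sequence length as before
window_length = n_steps + 1

steps_per_epoch = train_size // batch_size // n_steps

# Cross-check: count the complete windows that fit in one of the 32 parts.
part_length = train_size // batch_size
windows_per_part = (part_length - window_length) // n_steps + 1

print(steps_per_epoch, windows_per_part)  # 312 312
```

With these values each of the 32 parts holds 31,250 characters, so one epoch corresponds to one full pass over every part.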

After this model is trained, it will only be possible to use it to make predictions for batches of the same size as were used during training.

To avoid this restriction, create an identical stateless model, and copy the stateful model’s weights to this model.

In [None]:
stateless_model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id]),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation='softmax'))                                       
])

To set the weights, we first need to build the model (so the weights get created):

In [None]:
stateless_model.build(tf.TensorShape([None, None, max_id]))

stateless_model.set_weights(model.get_weights())
model = stateless_model
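
The cells below call complete_text(), which is not defined in this section. One plausible reading is that it mirrors extend_text() but queries the stateless model. The sketch below is hypothetical and framework-free: make_complete_text, predict_proba, and vocabulary are assumed names, and predict_proba(text) stands in for whatever call returns the model's next-character probabilities.

```python
import math
import random

def make_complete_text(predict_proba, vocabulary, seed=None):
    """Build a complete_text() function around a next-character predictor.

    predict_proba(text) is assumed to return one probability per entry
    of `vocabulary` for the character that follows `text`.
    """
    rng = random.Random(seed)

    def next_char(text, temperature):
        probas = predict_proba(text)
        # Rescale log-probabilities by the temperature; zero stays impossible.
        logits = [math.log(p) / temperature if p > 0 else -math.inf for p in probas]
        max_logit = max(logits)
        weights = [math.exp(l - max_logit) for l in logits]
        return rng.choices(vocabulary, weights=weights, k=1)[0]

    def complete_text(text, n_chars=50, temperature=1):
        for _ in range(n_chars):
            text += next_char(text, temperature)
        return text

    return complete_text

# Toy predictor (not a trained model): strongly prefers alternating "a" and "b".
def toy_predict_proba(text):
    return [0.1, 0.9] if text.endswith("a") else [0.9, 0.1]

complete_text = make_complete_text(toy_predict_proba, vocabulary=["a", "b"], seed=42)
print(complete_text("a", n_chars=10, temperature=0.2))
```

With the stateless model above, predict_proba would presumably one-hot encode the character IDs of the text (with depth=max_id) and return the model's output at the last time step; that wiring is left out because it depends on how the text was encoded.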

In [None]:
tf.random.set_seed(42)

In [None]:
print(complete_text('t'))

In [None]:
print(complete_text('t', temperature=0.2))

In [None]:
print(complete_text('t', temperature=1))

In [None]:
print(complete_text('t', temperature=2))

In [None]:
print(complete_text('p', temperature=0.2))

Now that we have built a character-level model, it’s time to look at word-level models and tackle a common natural language processing task: sentiment analysis.