# Generating Shakespearean Text Using a Character RNN (GRU)

This notebook is based on Geron's [notebook](https://github.com/ageron/handson-ml3/blob/main/16_nlp_with_rnns_and_attention.ipynb).

Presented here in accordance with the Apache 2.0 license.

In [1]:
import tensorflow as tf

## Creating the Training Dataset

Let's download the Shakespeare data from Andrej Karpathy's [char-rnn project](https://github.com/karpathy/char-rnn/).

In [2]:
shakespeare_url = "https://homl.info/shakespeare"  # shortcut URL
filepath = tf.keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

## Explore the Dataset

In [3]:
print(type(shakespeare_text))

<class 'str'>


In [4]:
print(f'Number of characters in the text: {len(shakespeare_text):,}')

Number of characters in the text: 1,115,394


In [5]:
# shows the first 80 characters in the text:
print(shakespeare_text[:80])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.


In [6]:
# shows all distinct characters (after converting to lower case)
distinct_chars = "".join(sorted(set(shakespeare_text.lower())))
print(f'Number of distinct characters: {len(distinct_chars)}')

Number of distinct characters: 39


In [7]:
distinct_chars

"\n !$&',-.3:;?abcdefghijklmnopqrstuvwxyz"

## Preprocess the Data

For the students: 

What do the commands below do, specifically `TextVectorization` and its `adapt` method? 

The [reference](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) for `TextVectorization` and Geron p.471 will be useful.

In [8]:
text_vec_layer = tf.keras.layers.TextVectorization(split="character",
                                                   standardize="lower")

In [9]:
text_vec_layer

<keras.layers.preprocessing.text_vectorization.TextVectorization at 0x2ad57b8d450>

In [10]:
text_vec_layer.adapt([shakespeare_text])

In [11]:
text_vec_layer([shakespeare_text])

<tf.Tensor: shape=(1, 1115394), dtype=int64, numpy=array([[21,  7, 10, ..., 22, 28, 12]], dtype=int64)>

In [12]:
encoded = text_vec_layer([shakespeare_text])[0]

* We set `split="character"` to get character-level encoding rather than the default word-level encoding.
* We use `standardize="lower"` to convert the text to lowercase (which will simplify the task).

Next:
* Each character is now mapped to an integer, starting at 2. The `TextVectorization` layer reserves the value 0 for padding tokens and 1 for unknown characters. We won’t need either of these tokens for now, so let’s subtract 2 from the character IDs.

In [13]:
encoded

<tf.Tensor: shape=(1115394,), dtype=int64, numpy=array([21,  7, 10, ..., 22, 28, 12], dtype=int64)>

In [14]:
encoded -= 2  # drop tokens 0 (pad) and 1 (unknown), which we will not use
n_tokens = text_vec_layer.vocabulary_size() - 2  # number of distinct chars = 39
dataset_size = len(encoded)  # total number of chars = 1,115,394

In [15]:
encoded

<tf.Tensor: shape=(1115394,), dtype=int64, numpy=array([19,  5,  8, ..., 20, 26, 10], dtype=int64)>

In [16]:
print(f'Number of distinct tokens: {n_tokens}')

Number of distinct tokens: 39


In [17]:
print(f'Number of tokens: {dataset_size:,}')

Number of tokens: 1,115,394
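To make the ID scheme concrete, the sketch below mimics it in plain Python on a toy sentence. It is not the `TextVectorization` implementation: `build_vocab` and `encode` are hypothetical helpers, and the alphabetical tie-break between equally frequent characters is an assumption made here only to get a deterministic order.

```python
from collections import Counter

def build_vocab(text):
    """Reserve ID 0 for padding and ID 1 for unknown, then list characters by frequency."""
    counts = Counter(text.lower())
    # Most frequent first; ties broken alphabetically (an assumption of this sketch).
    chars = sorted(counts, key=lambda c: (-counts[c], c))
    return ["", "[UNK]"] + chars

def encode(text, vocab):
    """Map each lowercased character to its ID, using 1 for unseen characters."""
    index = {c: i for i, c in enumerate(vocab)}
    return [index.get(c, 1) for c in text.lower()]

vocab = build_vocab("To be, or not to be")
ids = encode("To be", vocab)
shifted = [i - 2 for i in ids]   # same shift as `encoded -= 2`
n_tokens_toy = len(vocab) - 2    # distinct characters, without pad/unknown

print(ids)           # [4, 3, 2, 5, 6]
print(shifted)       # [2, 1, 0, 3, 4]
print(n_tokens_toy)  # 8
```

After the shift, the IDs run from 0 to `n_tokens - 1`, which is the range the model's output layer needs.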


We can now turn this very long sequence into a dataset of windows that we can use to train a sequence-to-sequence RNN. The targets will be similar to the inputs, but shifted by one time step into the “future”. For example, one sample in the dataset may be a sequence of character IDs representing the text “to be or not to b” (without the final “e”), and the corresponding target is a sequence of character IDs representing the text “o be or not to be” (with the final “e”, but without the leading “t”). Let’s write a small utility function to convert a long sequence of character IDs into a dataset of input/target window pairs:

In [18]:
def to_dataset(sequence, length, shuffle=False, seed=None, batch_size=32):
    ds = tf.data.Dataset.from_tensor_slices(sequence)
    ds = ds.window(length + 1, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda window_ds: window_ds.batch(length + 1))
    if shuffle:
        ds = ds.shuffle(100_000, seed=seed)
    ds = ds.batch(batch_size)
    return ds.map(lambda window: (window[:, :-1], window[:, 1:])).prefetch(1)

In [19]:
# a simple example using to_dataset()
# There's just one sample in this dataset: the input represents "to b" and the
# output represents "o be"
list(to_dataset(text_vec_layer(["To be"])[0], length=4))

[(<tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[ 4,  5,  2, 23]], dtype=int64)>,
  <tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[ 5,  2, 23,  3]], dtype=int64)>)]
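The windowing logic can also be followed without `tf.data`. The sketch below is a plain-Python analogue of the window, split and shift steps (no shuffling or batching); `make_windows` is a hypothetical helper that works on strings so the shift is easy to see.

```python
def make_windows(sequence, length):
    """Return (input, target) pairs; each target is its input shifted by one step."""
    pairs = []
    # A window needs length + 1 items, so there are len(sequence) - length of them.
    for start in range(len(sequence) - length):
        window = sequence[start:start + length + 1]
        pairs.append((window[:-1], window[1:]))
    return pairs

print(make_windows("to be", length=4))
# [('to b', 'o be')]

pairs = make_windows("to be or not to be", length=4)
print(len(pairs))  # 14
print(pairs[1])    # ('o be', ' be ')
```

Consecutive windows overlap almost entirely, because each one starts a single character after the previous one.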

In [20]:
length = 100
tf.random.set_seed(42)
train_set = to_dataset(encoded[:1_000_000], length=length, shuffle=True,
                       seed=42)
valid_set = to_dataset(encoded[1_000_000:1_060_000], length=length)
test_set = to_dataset(encoded[1_060_000:], length=length)

In [21]:
train_set

<_PrefetchDataset element_spec=(TensorSpec(shape=(None, None), dtype=tf.int64, name=None), TensorSpec(shape=(None, None), dtype=tf.int64, name=None))>
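As a rough size check, the number of windows and batches in each split can be worked out by hand. The sketch below assumes the settings used above (window shift of 1, `length=100`, batch size 32, final partial batch kept); `n_windows` and `n_batches` are hypothetical helper names.

```python
import math

def n_windows(n_chars, length):
    """Windows of length + 1 with shift 1 over n_chars items."""
    return max(n_chars - length, 0)

def n_batches(n_chars, length, batch_size=32):
    """Batches per pass, counting the last (possibly smaller) batch."""
    return math.ceil(n_windows(n_chars, length) / batch_size)

splits = {
    "train": 1_000_000,
    "valid": 1_060_000 - 1_000_000,
    "test": 1_115_394 - 1_060_000,
}
for name, n_chars in splits.items():
    print(name, n_windows(n_chars, 100), n_batches(n_chars, 100))
# train 999900 31247
# valid 59900 1872
# test 55294 1728
```

Under these assumptions, one pass over the training set is about 31 thousand steps.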

## Building and Training the Char-RNN Model

**Warning**: the following code may take one or two hours to run, depending on your GPU. Without a GPU, it may take over 24 hours. If you don't want to wait, just skip the next two code cells and run the code below to download a pretrained model.

**Note**: the `GRU` class will only use cuDNN acceleration (assuming you have a GPU) when using the default values for the following arguments: `activation`, `recurrent_activation`, `recurrent_dropout`, `unroll`, `use_bias` and `reset_after`.

We use an Embedding layer as the first layer, to encode the character IDs. The Embedding layer’s number of input dimensions is the number of distinct character IDs, and the number of output dimensions is a hyperparameter you can tune—we’ll set it to 16 for now. Whereas the inputs of the Embedding layer will be 2D tensors of shape [batch size, window length], the output of the Embedding layer will be a 3D tensor of shape [batch size, window length, embedding size].

See Geron p.466-469 for a great review of the Embedding layer.
See Geron p.571-572 (or [Wikipedia](https://en.wikipedia.org/wiki/Gated_recurrent_unit)) for a review of the Gated Recurrent Unit (GRU).
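The shape change performed by the Embedding layer can be illustrated with a plain lookup table. This is only a sketch: `embedding_table` is a hypothetical stand-in for the layer's trainable weights, filled with arbitrary random values, and nested lists stand in for tensors.

```python
import random

n_tokens = 39        # number of distinct character IDs (from the notebook)
embedding_size = 16  # the output_dim chosen in the text

rng = random.Random(0)
# One row of `embedding_size` numbers per character ID.
embedding_table = [[rng.uniform(-0.05, 0.05) for _ in range(embedding_size)]
                   for _ in range(n_tokens)]

def embed(batch_of_ids):
    """Replace every ID in a [batch, window] list by its row of the table."""
    return [[embedding_table[i] for i in window] for window in batch_of_ids]

batch = [[19, 5, 8, 0], [5, 8, 0, 7]]  # 2 windows of 4 character IDs
out = embed(batch)
shape = (len(out), len(out[0]), len(out[0][0]))
print(shape)  # (2, 4, 16)
```

The same character ID always maps to the same vector, so `out[0][1]` and `out[1][0]` (both ID 5) are identical here.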

In [22]:
tf.random.set_seed(42)  # ensures reproducibility on CPU
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16),
    tf.keras.layers.GRU(128, return_sequences=True),
    tf.keras.layers.Dense(n_tokens, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
model_ckpt = tf.keras.callbacks.ModelCheckpoint(
    "my_shakespeare_model", monitor="val_accuracy", save_best_only=True)
history = model.fit(train_set, validation_data=valid_set, epochs=10,
                    callbacks=[model_ckpt])

Epoch 1/10
   5940/Unknown - 993s 160ms/step - loss: 1.6087 - accuracy: 0.5186

KeyboardInterrupt: 

In [17]:
shakespeare_model = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Lambda(lambda X: X - 2),  # no <PAD> or <UNK> tokens
    model
])

If you don't want to wait for training to complete, Geron pretrained a model for us. The following code downloads it. The last line loads it in place of the model trained above, so comment that line out if you want to keep your own model.

In [23]:
from pathlib import Path

# extra code – downloads a pretrained model
url = "https://github.com/ageron/data/raw/main/shakespeare_model.tgz"
path = tf.keras.utils.get_file("shakespeare_model.tgz", url, extract=True)
model_path = Path(path).with_name("shakespeare_model")
shakespeare_model = tf.keras.models.load_model(model_path)

In [24]:
y_proba = shakespeare_model.predict(["To be or not to b"])[0, -1]
y_pred = tf.argmax(y_proba)  # choose the most probable character ID
text_vec_layer.get_vocabulary()[y_pred + 2]



'e'

## Generating Fake Shakespearean Text

To generate new text using the char-RNN model, we could feed it some text, make the model predict the most likely next letter, add it to the end of the text, then give the extended text to the model to guess the next letter, and so on. This is called greedy decoding. But in practice this often leads to the same words being repeated over and over again. Instead, we can sample the next character randomly, with a probability equal to the estimated probability, using TensorFlow’s `tf.random.categorical()` function. This will generate more diverse and interesting text. The `categorical()` function samples random class indices, given the class log probabilities (logits).

Example:

In [25]:
log_probas = tf.math.log([[0.5, 0.4, 0.1]])  # probas = 50%, 40%, and 10%
tf.random.set_seed(42)
tf.random.categorical(log_probas, num_samples=8)  # draw 8 samples

<tf.Tensor: shape=(1, 8), dtype=int64, numpy=array([[0, 1, 0, 2, 1, 0, 0, 1]], dtype=int64)>
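The contrast between greedy decoding and sampling can also be seen without TensorFlow. The sketch below uses the standard library's `random.choices` as a stand-in for `tf.random.categorical()`; note that it takes plain probabilities as weights rather than log probabilities. The seed and sample count are arbitrary choices for the illustration.

```python
import random

probas = [0.5, 0.4, 0.1]  # same class probabilities as in the cell above

# Greedy decoding: always pick the most probable class.
greedy = max(range(len(probas)), key=probas.__getitem__)
print(greedy)  # 0

# Sampling: pick each class with a probability equal to its estimate.
rng = random.Random(42)
samples = rng.choices(range(len(probas)), weights=probas, k=10_000)
freqs = [samples.count(i) / len(samples) for i in range(len(probas))]
print(freqs)  # close to [0.5, 0.4, 0.1]
```

Greedy decoding would return class 0 every time, whereas sampling still produces classes 1 and 2 at roughly their estimated rates.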

To have more control over the diversity of the generated text, we can divide the logits by a number called the temperature, which we can tweak as we wish. A temperature close to zero favors high-probability characters, while a very high temperature gives all characters nearly equal probability. Lower temperatures are typically preferred when generating fairly rigid and precise text, such as mathematical equations, while higher temperatures are preferred when generating more diverse and creative text.
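To see what the temperature does to a distribution, the sketch below rescales the log probabilities and renormalizes them with a softmax, which gives the probabilities that sampling would then follow. `apply_temperature` is a hypothetical helper, and the three probabilities are the ones from the earlier example.

```python
import math

def apply_temperature(probas, temperature):
    """Divide the log probabilities by the temperature, then renormalize."""
    logits = [math.log(p) / temperature for p in probas]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]

for t in (0.01, 1, 100):
    print(t, [round(p, 3) for p in apply_temperature([0.5, 0.4, 0.1], t)])
# 0.01 [1.0, 0.0, 0.0]
# 1 [0.5, 0.4, 0.1]
# 100 [0.335, 0.335, 0.33]
```

A temperature of 1 leaves the distribution unchanged, 0.01 is almost the same as greedy decoding, and 100 is close to uniform.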

In [26]:
def next_char(text, temperature=1):
    y_proba = shakespeare_model.predict([text])[0, -1:]
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1)[0, 0]
    return text_vec_layer.get_vocabulary()[char_id + 2]

In [27]:
def extend_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text

In [28]:
tf.random.set_seed(42)  # ensures reproducibility on CPU

In [29]:
print(extend_text("To be or not to be", temperature=0.01))

To be or not to be the duke
as it is a proper strange death,
and the


In [30]:
print(extend_text("To be or not to be", temperature=1))

To be or not to behold?

second push:
gremio, lord all, a sistermen,


In [31]:
print(extend_text("To be or not to be", temperature=100))

To be or not to bef ,mt'&o3fpadm!$
wh!nse?bws3est--vgerdjw?c-y-ewznq
