# <b><div style='padding:15px;background-color:#850E35;color:white;border-radius:2px;font-size:110%;text-align: center'> Text Generation with TensorFlow</div></b>

In this notebook, we'll walk through how to generate text with a character-level RNN model. We'll cover the following steps:
- Importing the required libraries
- Loading the Shakespeare dataset
- Preprocessing the text data
- Defining a model architecture
- Compiling the model
- Training the model
- Generating text with the trained model

# <b><div style='padding:15px;background-color:#850E35;color:white;border-radius:2px;font-size:110%;text-align: center'>1. Data Loading</div></b>

In this section, we import the TensorFlow library and load a dataset of Shakespearean text from a local file. The text is stored in a variable called text, and we display its first 100 characters for initial exploration. We also list the distinct characters that remain after lowercasing.

In [1]:
# Let's import the TensorFlow library:
import tensorflow as tf

In [2]:
filepath = '../Data/tinyshakespeare.txt'

with open(filepath, 'r', encoding='utf-8') as f:
    text = f.read()

In [3]:
# Let's display the first 100 characters of the text:
text[:100]

'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'

In [4]:
# Let's examine characters:
"".join(sorted(set(text.lower())))

"\n !$&',-.3:;?abcdefghijklmnopqrstuvwxyz"

In [5]:
# Let's count the distinct characters (after lowercasing):
len("".join(sorted(set(text.lower()))))

39
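As a small self-contained illustration of the check above, the sketch below computes the character inventory of a string with and without lowercasing. The helper name char_inventory and the sample string are hypothetical and not part of the notebook.

```python
def char_inventory(text, lowercase=True):
    """Return the distinct characters of `text` as a sorted string."""
    if lowercase:
        text = text.lower()
    return "".join(sorted(set(text)))


# A short excerpt used only for illustration.
sample = "First Citizen:\nBefore we proceed any further, hear me speak."
cased = char_inventory(sample, lowercase=False)
uncased = char_inventory(sample)

# Lowercasing merges 'F' with 'f', 'C' with 'c', and so on,
# so the lowercased inventory can never be larger.
print(len(cased), len(uncased))
```

Lowercasing keeps the vocabulary small, at the cost of the model never producing capital letters.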

# <b><div style='padding:15px;background-color:#850E35;color:white;border-radius:2px;font-size:110%;text-align: center'>2. Text Preprocessing</div></b>

This section focuses on preprocessing the raw text data. We create a TextVectorization layer that tokenizes the text at the character level and converts all characters to lowercase for consistency. The layer is adapted to the text data, allowing us to efficiently encode the text into numerical sequences. We also check the shape of the encoded text to understand its dimensions.

In [6]:
# Let's create a TextVectorization layer for character-level tokenization.
# split="character" makes every character a token, and standardize="lower" converts the text to lowercase first:
text_vec_layer = tf.keras.layers.TextVectorization(
    split="character", standardize="lower")

In [7]:
# Let's adapt the TextVectorization layer to the text data:
text_vec_layer.adapt([text])

In [8]:
# Let's check the shape of the encoded text:
text_vec_layer([text]).shape

TensorShape([1, 1115394])

In [9]:
# Let's preprocess the text:
encoded = text_vec_layer([text])[0]
encoded # 0 padding, 1 unknown character

<tf.Tensor: shape=(1115394,), dtype=int64, numpy=array([21,  7, 10, ..., 22, 28, 12], dtype=int64)>

The TextVectorization layer assigns 0 for padding tokens and 1 for unknown characters. Since we currently don't need these tokens, we subtract 2 from the character IDs and calculate both the count of distinct characters and the total character count.

In [10]:
# Let's subtract 2 from the character IDs and compute the number of distinct characters:
encoded -= 2 # drop IDs 0 (padding) and 1 (unknown character) so that real characters start at 0
n_tokens = text_vec_layer.vocabulary_size() - 2 # one token is one character
n_tokens # number of distinct characters in the text data

39
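To make the ID shift concrete, here is a minimal pure-Python sketch of a character vocabulary with two reserved leading slots, followed by the same offset of 2. It only mimics the idea; the function names are hypothetical, and the tie-breaking rule for equally frequent characters is a choice made for this sketch, not necessarily what TextVectorization does.

```python
from collections import Counter


def build_vocab(text):
    """Build a character vocabulary with two reserved leading slots."""
    counts = Counter(text.lower())
    # Most frequent characters first; ties broken alphabetically
    # (an assumption of this sketch).
    chars = sorted(counts, key=lambda c: (-counts[c], c))
    return ["", "[UNK]"] + chars  # index 0 = padding, index 1 = unknown


def encode(text, vocab):
    """Map characters to shifted IDs so that real characters start at 0."""
    index = {ch: i for i, ch in enumerate(vocab)}
    return [index.get(ch, 1) - 2 for ch in text.lower()]


def decode(ids, vocab):
    """Invert `encode` by adding the offset of 2 back before the lookup."""
    return "".join(vocab[i + 2] for i in ids)


vocab = build_vocab("to be or not to be")
ids = encode("to be", vocab)
print(len(vocab) - 2, ids, decode(ids, vocab))
```

The same offset has to be added back whenever a predicted ID is turned into a character, which is why the generation code later indexes the vocabulary with the ID plus 2.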

In [11]:
# Let's take a look at the length of the dataset:
dataset_size = len(encoded) # total number of characters in the text data
dataset_size

1115394

# <b><div style='padding:15px;background-color:#850E35;color:white;border-radius:2px;font-size:110%;text-align: center'>3. Dataset Preparation </div></b>

Here, we define a function called to_dataset that converts the encoded text sequences into a dataset suitable for training. This function segments the text into overlapping sequences of a specified length and organizes them into batches. Optionally, it shuffles the dataset to enhance randomness during training. An example usage of the to_dataset function is provided to illustrate its functionality.

In [12]:
# Let's create a function to convert text sequences into a dataset
def to_dataset(sequence, length, shuffle=False, seed=None, batch_size=32):
    ds = tf.data.Dataset.from_tensor_slices(sequence)
    # Each window holds length + 1 tokens: `length` inputs plus one extra token, so the targets can be the inputs shifted by one step.
    ds = ds.window(length + 1, shift=1, drop_remainder=True)
    # window() yields nested datasets; flat_map() with batch() turns each window into a single tensor of length + 1 tokens.
    ds = ds.flat_map(lambda window_ds: window_ds.batch(length + 1))
    if shuffle:
        ds = ds.shuffle(100_000, seed=seed)
    ds = ds.batch(batch_size)
    return ds.map(lambda window: (window[:, :-1], window[:, 1:])).prefetch(1)

# The function above does the following:
# 1. It creates a dataset from the input sequence.
# 2. It slides a window of length + 1 tokens over the dataset, one step at a time (shift=1).
# 3. It converts each window into a single tensor.
# 4. It shuffles the windows if requested.
# 5. It groups the windows into batches.
# 6. It splits each window into an input sequence (all but the last token) and a target sequence (all but the first token).
# 7. It prefetches one batch for better performance.
# For example, if the sequence is [1, 2, 3, 4, 5] and the length is 3, the windows are [1, 2, 3, 4] and [2, 3, 4, 5], which give the input/target pairs [1, 2, 3] -> [2, 3, 4] and [2, 3, 4] -> [3, 4, 5].

In [13]:
# Let's get an example and pass it to the function:
list(to_dataset(text_vec_layer(["I like"])[0], length=5))

[(<tf.Tensor: shape=(1, 5), dtype=int64, numpy=array([[ 7,  2, 13,  7, 26]], dtype=int64)>,
  <tf.Tensor: shape=(1, 5), dtype=int64, numpy=array([[ 2, 13,  7, 26,  3]], dtype=int64)>)]
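The windowing logic can also be sketched without tf.data. The following pure-Python version is an illustration only (the name make_pairs is hypothetical, and batching, shuffling and prefetching are left out); it builds the same input/target pairs from windows of length + 1.

```python
def make_pairs(sequence, length):
    """Return (input, target) pairs from sliding windows of length + 1."""
    pairs = []
    for start in range(len(sequence) - length):
        window = sequence[start:start + length + 1]
        # The target is the input shifted by one step.
        pairs.append((window[:-1], window[1:]))
    return pairs


for inputs, targets in make_pairs([1, 2, 3, 4, 5], length=3):
    print(inputs, "->", targets)
```

Applied to the six IDs of "I like" with length=5, this yields a single pair, matching the single batch shown in the output above.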

Let's create the training, validation and test datasets.

In [14]:
length = 100
tf.random.set_seed(42)
# The training dataset:
train_set = to_dataset(encoded[:1_000_000], length=length, shuffle=True, seed=42) # 1 million characters
# The validation dataset: 
valid_set = to_dataset(encoded[1_000_000:1_060_000], length=length) # 60,000 characters
# The test dataset:
test_set = to_dataset(encoded[1_060_000:], length=length) # the remaining 55,394 characters
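As a sanity check on these splits, the sketch below estimates how many windows and batches each one produces, assuming the to_dataset settings used here (length=100, shift of 1, batch_size=32, no dropped final batch). The helper count_batches is hypothetical.

```python
import math


def count_batches(n_chars, length, batch_size=32):
    """Return (windows, batches) for a split of n_chars characters."""
    # Windows of length + 1 tokens with a shift of 1 and drop_remainder=True.
    n_windows = max(n_chars - length, 0)
    # The last batch may be smaller than batch_size.
    n_batches = math.ceil(n_windows / batch_size)
    return n_windows, n_batches


dataset_size = 1_115_394  # total number of characters reported earlier
splits = {
    "train": 1_000_000,
    "valid": 60_000,
    "test": dataset_size - 1_060_000,
}
for name, n_chars in splits.items():
    print(name, n_chars, count_batches(n_chars, length=100))
```

For the training split this gives 31,247 batches, which matches the number of steps per epoch in the training log.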

# <b><div style='padding:15px;background-color:#850E35;color:white;border-radius:2px;font-size:110%;text-align: center'>4. Model Definition and Training </div></b>

In this part of the code, we define the architecture of a neural network model for text generation. The model consists of an Embedding layer for representing tokens, a GRU (Gated Recurrent Unit) layer for sequence modeling, and a Dense layer with a softmax activation for predicting the next character. We compile the model using the sparse categorical cross-entropy loss and the Nadam optimizer. We also use a ModelCheckpoint callback to save the model with the best validation accuracy during training. The model is then trained on the prepared datasets using the fit method.
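To show what the sparse categorical cross-entropy loss measures, here is a minimal pure-Python sketch. It is an illustration of the formula, not the Keras implementation, and the probabilities are made up.

```python
import math


def sparse_categorical_crossentropy(targets, probas):
    """Mean of -log(p[target]) over all positions."""
    losses = [-math.log(p[t]) for t, p in zip(targets, probas)]
    return sum(losses) / len(losses)


# Two time steps over a toy 3-character vocabulary (made-up probabilities).
probas = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
targets = [0, 1]  # integer class IDs, no one-hot encoding needed
print(sparse_categorical_crossentropy(targets, probas))
```

A model that assigns equal probability to all 39 characters would score log(39) under this formula, which gives a rough baseline for reading the loss values during training.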

In [16]:
# Let's define the model architecture:
model = tf.keras.Sequential([
    # The Embedding layer maps each character ID to a 16-dimensional vector (output_dim).
    tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16),
    # The GRU layer models the sequence and returns an output at every time step.
    tf.keras.layers.GRU(128, return_sequences=True),
    # The Dense layer has one output unit per character (39); softmax makes the output probabilities sum to 1.
    tf.keras.layers.Dense(n_tokens, activation="softmax")
])

# Let's compile the model. The sparse_categorical_crossentropy loss is used because the targets are integer character IDs:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="nadam", metrics=["accuracy"])

# Let's save the model with the best validation accuracy during training:
model_ckpt = tf.keras.callbacks.ModelCheckpoint(
    "my_shakespeare_model.keras", monitor="val_accuracy", save_best_only=True)

# Let's train the model:
history = model.fit(train_set, validation_data=valid_set, epochs=3, callbacks=[model_ckpt])

Epoch 1/3
31247/31247 ━━━━━━━━━━━━━━━━━━━━ 1193s 38ms/step - accuracy: 0.5478 - loss: 1.4954 - val_accuracy: 0.5339 - val_loss: 1.6035
Epoch 2/3
31247/31247 ━━━━━━━━━━━━━━━━━━━━ 1220s 39ms/step - accuracy: 0.5974 - loss: 1.2921 - val_accuracy: 0.5427 - val_loss: 1.5740
Epoch 3/3
31247/31247 ━━━━━━━━━━━━━━━━━━━━ 996s 32ms/step - accuracy: 0.6024 - loss: 1.2704 - val_accuracy: 0.5442 - val_loss: 1.5688


In [19]:
# Let's add the text preprocessing layer:
shakespeare_model = tf.keras.Sequential([ # Sequential model is created to combine the TextVectorization layer, character-level adjustment, and the previously trained text generation model.
    text_vec_layer,
    tf.keras.layers.Lambda(lambda X: X - 2), # subtract 2 from the character IDs to remove padding and unknown tokens
    model
])

# <b><div style='padding:15px;background-color:#850E35;color:white;border-radius:2px;font-size:110%;text-align: center'>5. Text Generation </div></b>

This section uses the higher-level model defined at the end of the previous section, which combines the TextVectorization layer, the character ID adjustment, and the previously trained text generation model. Given an initial piece of text, this model predicts the next character.

In [31]:
# Let's predict the next character with the trained model.
# Passing a plain Python list of strings to predict() raised "ValueError: ... the input rank must be 1 or the last shape dimension must be 1. Received: inputs.shape=(1, 17) with rank=2" in TextVectorization.call().
# Wrapping the input in tf.constant() is a commonly suggested workaround and should avoid that error:
y_proba = shakespeare_model.predict(tf.constant(["To be or not to b"]))[0, -1] # [0, -1] selects the probabilities predicted at the last time step
y_pred = tf.argmax(y_proba) # get the character ID with the highest probability
text_vec_layer.get_vocabulary()[y_pred + 2] # get the character corresponding to the character ID

# <b><div style='padding:15px;background-color:#850E35;color:white;border-radius:2px;font-size:110%;text-align: center'>6. Text Generation Functions</div></b>

Here, we define two important functions for text generation. The next_char function predicts the next character in a sequence given a context and a temperature parameter that controls the randomness of predictions. The extend_text function extends a given text with additional characters by iteratively predicting the next character based on the context. Example usages of these functions are provided to demonstrate how to generate text with different temperatures.

In [32]:
# How to use the tf.random.categorical() method:
# It expects logits (unnormalized log-probabilities), so we pass the log of the class probabilities: 0.6 for class 0, 0.3 for class 1, and 0.1 for class 2.
log_probas = tf.math.log([[0.6, 0.3, 0.1]])
tf.random.set_seed(42)
tf.random.categorical(log_probas, num_samples=10) # 10 class indices are sampled at random according to these probabilities, so class 0 appears most often

<tf.Tensor: shape=(1, 10), dtype=int64, numpy=array([[0, 1, 0, 2, 1, 0, 0, 0, 0, 0]], dtype=int64)>
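Before defining the sampling function, it may help to see what dividing log-probabilities by a temperature does to a distribution. The sketch below is a pure-Python illustration (the name apply_temperature is hypothetical, and it assumes all probabilities are strictly positive); it rescales the log-probabilities and renormalizes them with a softmax, which is the distribution that sampling from the rescaled logits effectively draws from.

```python
import math


def apply_temperature(probas, temperature):
    """Rescale log-probabilities by 1 / temperature and renormalize."""
    logits = [math.log(p) / temperature for p in probas]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probas = [0.6, 0.3, 0.1]
for temperature in (0.01, 1, 10):
    rescaled = apply_temperature(probas, temperature)
    print(temperature, [round(p, 3) for p in rescaled])
```

A temperature close to 0 puts almost all the mass on the most likely class, a temperature of 1 leaves the distribution unchanged, and a high temperature pushes it toward uniform.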

In [33]:
def next_char(text, temperature=1): # the next_char() function samples the next character given a context and a temperature parameter.
    # The input is wrapped in tf.constant() because passing a plain Python list raised "ValueError: ... the input rank must be 1 or the last shape dimension must be 1" in TextVectorization.call(); this is a commonly suggested workaround.
    y_proba = shakespeare_model.predict(tf.constant([text]))[0, -1:] # probabilities of all characters at the last time step
    rescaled_logits = tf.math.log(y_proba) / temperature # the log-probabilities are rescaled by the temperature. A higher temperature increases the randomness of the predictions.
    char_id = tf.random.categorical(rescaled_logits, num_samples=1)[0, 0] # a character ID is sampled at random from the rescaled distribution.
    return text_vec_layer.get_vocabulary()[char_id + 2] # the character corresponding to the character ID is returned.

In [24]:
def extend_text(text, n_chars=50, temperature=1): # the extend_text() function generates text by predicting the next character in the sequence iteratively.
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text

In [35]:
# Let's generate text with a low temperature:
tf.random.set_seed(42)
print(extend_text("I like", temperature=0.01))

In [None]:
# Let's generate text with a higher temperature:
print(extend_text("I like", temperature=1))

# <b><div style='padding:15px;background-color:#850E35;color:white;border-radius:2px;font-size:110%;text-align: center'>Conclusion</div></b>

In this notebook, we covered how to build an RNN-based model with TensorFlow for text generation.

Thanks for reading. If you enjoyed this notebook, don't forget to upvote.