# Text generation using an RNN ✍️

<a target="_blank" href="https://colab.research.google.com/github/toelt-llc/HSLU-NLP-Bootcamp/blob/main/Day_1/Sequence_Models_Hands-On/Text_Generation/text_generation.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

https://github.com/toelt-llc/HSLU-NLP-Bootcamp/blob/main/Day_1/Sequence_Models_Hands-On/Text_Generation/text_generation.ipynb

🎯 Our goal is to use a dataset of Shakespeare's writing from http://karpathy.github.io/2015/05/21/rnn-effectiveness/ to generate Shakespeare-like text from our own prompts! **Our model will take in 100 characters and predict the 101st character.** To generate an entire paragraph, we call the model over and over again, feeding its own output back in (i.e. characters 2-100 plus our generated character 101 are used to predict character 102).
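
The sliding-window idea can be sketched in plain Python before any model exists. This is only an illustration: `predict_next` is a hypothetical stand-in for the trained model (here it simply echoes the first character of the window), and `window_size` is set to 5 instead of 100 to keep the output short.

```python
def predict_next(window):
    """Hypothetical stand-in for the model: return the window's first character."""
    return window[0]


def generate(prompt, num_chars, window_size=5):
    """Repeatedly predict one character and slide the window forward by one."""
    text = prompt
    for _ in range(num_chars):
        window = text[-window_size:]   # the most recent `window_size` characters
        text += predict_next(window)   # append the prediction, then repeat
    return text


generated = generate("hello", num_chars=5)
print(generated)
```

Each pass drops the oldest character from the window and adds the newest prediction, which is exactly the loop described above with a real model in place of `predict_next`.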

## 1️⃣ Setup

### 1.1) Imports

In [1]:
import tensorflow as tf

import numpy as np
import os
import time



### 1.2) Get the data 📕

Run the helper function below 👇. It downloads the data to a file named **shakespeare.txt** and returns the path to that file!

In [2]:
path_to_data = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

### 1.3) Have a look at the data 🔎

Here you can open the file and read it as a string (we have to decode it to make a string rather than a byte string): 

In [3]:
text = open(path_to_data, 'rb').read().decode(encoding='utf-8')

In [4]:
# Take a look at the first 250 characters in text
print(text[:250])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



## 2️⃣ Preprocessing

### 2.1) Vectorize the text

Before training, you need to convert the strings to a numerical representation. 

The [tf.keras.layers.StringLookup](https://www.tensorflow.org/api_docs/python/tf/keras/layers/StringLookup) layer can convert each character into a numeric ID. This layer just needs the text to be split into tokens first. You can use the helper function [tf.strings.unicode_split](https://www.tensorflow.org/api_docs/python/tf/strings/unicode_split) to achieve that like the example below 👇.

In [5]:
example_texts = ['abcdefg', 'xyz']

chars = tf.strings.unicode_split(example_texts, input_encoding='UTF-8')
chars

<tf.RaggedTensor [[b'a', b'b', b'c', b'd', b'e', b'f', b'g'], [b'x', b'y', b'z']]>

### 2.2) Generate the vocab 📖

❓ Generate a list of **unique characters** in our text and save it in the variable **`vocab`**.

In [6]:
vocab = sorted(set(text))

❓ Now create the `tf.keras.layers.StringLookup` layer and save it as `ids_from_chars`:

In [7]:
ids_from_chars = tf.keras.layers.StringLookup(
    vocabulary=vocab, mask_token=None)

<details>
<summary markdown='span'>💡 If you get stuck</summary>

```python
tf.keras.layers.StringLookup(vocabulary=vocab, mask_token=None)
```

</details>


It converts from tokens to character IDs based on the vocab we passed to it. 

❓ Run the layer below 👇, then go back and edit the `chars` variable above to see what happens when you add characters outside the vocab.

In [8]:
ids = ids_from_chars(chars)
ids

<tf.RaggedTensor [[40, 41, 42, 43, 44, 45, 46], [63, 64, 65]]>

To generate text, it will also be important to **invert this representation** and recover human-readable strings from it. For this you can use `tf.keras.layers.StringLookup(..., invert=True)`.  

❗️ Here instead of passing the original vocabulary generated with `sorted(set(text))`, use the `get_vocabulary()` method of the `tf.keras.layers.StringLookup` to get the vocabulary assigned to the previous `ids_from_chars` layer. 

This way, we also have a `[UNK]` string for unknown characters outside our original representation.

In [9]:
chars_from_ids = tf.keras.layers.StringLookup(
    vocabulary=ids_from_chars.get_vocabulary(), invert=True, mask_token=None)

This layer recovers the characters from the vectors of IDs, and returns them as a `tf.RaggedTensor` of characters:

In [10]:
chars = chars_from_ids(ids)
chars

<tf.RaggedTensor [[b'a', b'b', b'c', b'd', b'e', b'f', b'g'], [b'x', b'y', b'z']]>

✍️ We use `tf.strings.reduce_join` to join the characters back into strings. 

In [11]:
tf.strings.reduce_join(chars, axis=-1).numpy()

array([b'abcdefg', b'xyz'], dtype=object)
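
To make the round trip from characters to IDs and back concrete, here is a plain-Python sketch of the same idea. It is an illustration rather than how `StringLookup` is implemented: the names `toy_vocab`, `char_to_id`, `id_to_char`, `encode` and `decode` are made up for this example, and it assumes the layer's default of a single out-of-vocabulary slot at index 0, which is why real characters start at ID 1.

```python
# Toy vocabulary; in the notebook this would be sorted(set(text)).
toy_vocab = sorted(set("abcxyz"))

# Index 0 is reserved for unknown characters, so real characters start at 1.
char_to_id = {ch: i + 1 for i, ch in enumerate(toy_vocab)}
id_to_char = ["[UNK]"] + toy_vocab


def encode(s):
    """Map each character to its ID, falling back to 0 for unknown characters."""
    return [char_to_id.get(ch, 0) for ch in s]


def decode(ids):
    """Map IDs back to characters and join them into one string."""
    return "".join(id_to_char[i] for i in ids)


print(encode("abc"))           # known characters
print(encode("a?c"))           # '?' is not in toy_vocab
print(decode(encode("a?c")))   # the unknown character comes back as [UNK]
```

Note that the round trip is lossy for unknown characters: once a character has been mapped to the `[UNK]` slot, the original cannot be recovered.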

❓ Define a function `text_from_ids` that takes a tensor of ids and returns the corresponding text.

In [12]:
def text_from_ids(ids):
    return tf.strings.reduce_join(chars_from_ids(ids), axis=-1)

### 2.3) The dataset 🚚

❓ First split the whole text into characters using `unicode_split` and convert them with `ids_from_chars`, so that the entire text becomes a single continuous array of IDs saved as `all_ids`.

In [13]:
all_ids = ids_from_chars(tf.strings.unicode_split(text, 'UTF-8'))
all_ids

<tf.Tensor: shape=(1115394,), dtype=int64, numpy=array([19, 48, 57, ..., 46,  9,  1])>

We can then make a tensorflow dataset object with that array. This is an object which allows us to write pipelines to transform our data into the format needed for our model to read it!

In [14]:
ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)

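The `batch` method allows us to set how many characters we should take at a time! In our case we want **101**: 100 input characters plus one extra, so that the target can be the same sequence shifted by one character.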

In [15]:
sequences = ids_dataset.batch(101, drop_remainder=True)

for seq in sequences.take(1):
    print(chars_from_ids(seq))

tf.Tensor(
[b'F' b'i' b'r' b's' b't' b' ' b'C' b'i' b't' b'i' b'z' b'e' b'n' b':'
 b'\n' b'B' b'e' b'f' b'o' b'r' b'e' b' ' b'w' b'e' b' ' b'p' b'r' b'o'
 b'c' b'e' b'e' b'd' b' ' b'a' b'n' b'y' b' ' b'f' b'u' b'r' b't' b'h'
 b'e' b'r' b',' b' ' b'h' b'e' b'a' b'r' b' ' b'm' b'e' b' ' b's' b'p'
 b'e' b'a' b'k' b'.' b'\n' b'\n' b'A' b'l' b'l' b':' b'\n' b'S' b'p' b'e'
 b'a' b'k' b',' b' ' b's' b'p' b'e' b'a' b'k' b'.' b'\n' b'\n' b'F' b'i'
 b'r' b's' b't' b' ' b'C' b'i' b't' b'i' b'z' b'e' b'n' b':' b'\n' b'Y'
 b'o' b'u' b' '], shape=(101,), dtype=string)




It's easier to see if we join the tokens back into strings 👇:

In [16]:
for seq in sequences.take(5):
    print(text_from_ids(seq).numpy())

b'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
b'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'
b"now Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us ki"
b"ll him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be d"
b'one: away, away!\n\nSecond Citizen:\nOne word, good citizens.\n\nFirst Citizen:\nWe are accounted poor citi'




For training you'll need a dataset of `(input, label)` pairs, where `input` and `label` are sequences.

Even though we are predicting one character at a time, each training example consists of two sequences of equal length:

1. `input`: the first `n` characters of the sequence, leaving out the final character.
2. `label`: the same sequence shifted by one character, so the label at each position is the character that follows the corresponding input character.

For example, if the text is `"Hello"`, the input sequence would be `"Hell"` and the target sequence `"ello"`.

<br>


<details>
    <summary markdown='span'>🤔 Why do we have a target of <strong>ello</strong> if our goal is only to predict <strong>o</strong>? Click here for an explanation.</summary>

It is much more stable to train a model this way. If **H** were only updated by backpropagation from the prediction of **o**, it would receive a very weak update. This problem would be even worse with 100 characters in between!

</details>


❓ Write a function `split_input_target` which converts a sequence to a `(input, label)` pair.

In [17]:
def split_input_target(sequence):
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return input_text, target_text
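
As a quick sanity check, the same slicing can be applied to the `"Hello"` example from above. Plain Python lists stand in for tensors here; the slicing works the same way on a 1-D tensor.

```python
def split_input_target(sequence):
    # Input drops the last element; target drops the first.
    return sequence[:-1], sequence[1:]


input_chars, target_chars = split_input_target(list("Hello"))
print("".join(input_chars))   # Hell
print("".join(target_chars))  # ello

# At every position, the target is the character that follows the input character.
for inp, tgt in zip(input_chars, target_chars):
    print(f"{inp!r} -> {tgt!r}")
```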

Then we map the function over our dataset, which applies it to every element. This is part of the reason TensorFlow datasets are so powerful for preprocessing data! 🙌

In [18]:
dataset = sequences.map(split_input_target)
dataset

<_MapDataset element_spec=(TensorSpec(shape=(100,), dtype=tf.int64, name=None), TensorSpec(shape=(100,), dtype=tf.int64, name=None))>

Check out what our **`dataset`** looks like now 👇

In [19]:
for input_example, target_example in dataset.take(1):
    print("Input :", text_from_ids(input_example).numpy())
    print("Target:", text_from_ids(target_example).numpy())

Input : b'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'
Target: b'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '




### 2.4) Optimizing the dataset 🛠️

With tensorflow [datasets](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) we define the batch size before we fit the model. We also:

- shuffle the dataset
- prefetch (this gets the next N elements ready) - super important when we are loading data from disk to have it ready for the next batch without wasting GPU time! 🚀

In [20]:
BATCH_SIZE = 64
BUFFER_SIZE = 10000

dataset = (
    dataset
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.AUTOTUNE))

dataset

<_PrefetchDataset element_spec=(TensorSpec(shape=(64, 100), dtype=tf.int64, name=None), TensorSpec(shape=(64, 100), dtype=tf.int64, name=None))>
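
The shapes above also determine how many batches one epoch contains. The arithmetic below is a worked check using numbers already shown in this notebook (1,115,394 character IDs, sequences of 101, batches of 64). Both divisions discard their remainder because `drop_remainder=True` is set in both `batch` calls.

```python
total_chars = 1_115_394   # length of all_ids shown earlier
seq_length = 101          # 100 input characters + 1 extra for the shifted target
batch_size = 64

num_sequences = total_chars // seq_length       # leftover characters are dropped
steps_per_epoch = num_sequences // batch_size   # leftover sequences are dropped

print(num_sequences)    # 11043
print(steps_per_epoch)  # 172
```

The result, 172, is the number of steps Keras reports per epoch during training.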

## 3️⃣ Building the Model

### 3.1) Define the model 🔮

This section defines the model as a [`keras.Model`](https://keras.io/api/models/model/) subclass, a pattern you won't have seen before (for details see [Making new Layers and Models via subclassing](https://www.tensorflow.org/guide/keras/custom_layers_and_models)).

This model has three layers:

* `tf.keras.layers.Embedding`: The input layer. A trainable lookup table that maps each character ID to a vector with `embedding_dim` dimensions (256 in the code below).
* `tf.keras.layers.GRU`: A type of RNN with size `units=rnn_units` (1024 in the code below). You can also use an LSTM layer here.
* `tf.keras.layers.Dense`: The output layer, with `vocab_size` outputs. It outputs one logit for each character in the vocabulary. These logits are the unnormalized log-probabilities of each character according to the model.

❓ The **model** is defined quite differently from the models we have built so far. Take a few minutes to try and understand the code.

- The first section is the `__init__` method, where we define layers using `self.layer_name = layer`.
- The second section is the `call` method, where we define how to use the layers when we are given an input. You can see we call the layers similarly to the [Keras functional API](https://keras.io/guides/functional_api/), but we have the flexibility to include `if` statements and other code.

In [None]:
class MyModel(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim=256, rnn_units=1024):
        super().__init__()
        # Trainable lookup table: character ID -> embedding vector
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        # return_sequences: output at every time step
        # return_state: also return the final hidden state
        self.gru = tf.keras.layers.GRU(rnn_units,
                                       return_sequences=True,
                                       return_state=True)
        # One logit per character in the vocabulary
        self.dense = tf.keras.layers.Dense(vocab_size)

    def call(self, inputs, states=None, return_state=False, training=False):
        x = inputs
        x = self.embedding(x, training=training)

        # Use the GRU's default (zero) initial state unless one is passed in
        if states is None:
            x, states = self.gru(x, training=training)
        else:
            x, states = self.gru(x, initial_state=states, training=training)

        x = self.dense(x, training=training)

        if return_state:
            return x, states
        else:
            return x

In [32]:
# Length of the vocabulary in StringLookup Layer
vocab_size = len(ids_from_chars.get_vocabulary())

model = MyModel(vocab_size=vocab_size)
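
For a sense of scale, the trainable parameter count can be worked out by hand. This sketch assumes the Keras default GRU variant (`reset_after=True`, which keeps two bias vectors per weight block) and the sizes used in this notebook (vocabulary of 66, embedding size 256, 1024 GRU units). Calling `model.summary()` once the model has been built on a batch is the authoritative check.

```python
vocab_size, embedding_dim, rnn_units = 66, 256, 1024

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# GRU: 3 weight blocks (update gate, reset gate, candidate state), each with
# input weights, recurrent weights and two biases
# (assumption: reset_after=True, the Keras default).
gru_params = 3 * (rnn_units * (embedding_dim + rnn_units) + 2 * rnn_units)

# Dense: weight matrix plus one bias per output logit.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
```

Almost all of the roughly four million parameters sit in the GRU, so changing `rnn_units` has by far the largest effect on model size and training time.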

❗️ For each character, the model looks up the embedding, runs the GRU for one timestep with the embedding as input, and applies the dense layer to generate logits that score every possible next character.

### 3.2) Check the model 🔬

Let's call the untrained model on our first batch of data 👇

In [33]:
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(64, 100, 66) # (batch_size, sequence_length, vocab_size)




To get actual predictions from the model you need to sample from the output distributions. 

- This distribution is defined by the logits over the character vocabulary.
- ❗ It is important to _sample_ from this distribution as taking the _argmax_ of the distribution can easily get the model stuck in a loop.

Try it for the first example in the batch:

In [34]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()

This gives us, at each timestep, a prediction of the next character index:

In [35]:
sampled_indices

array([24,  7, 32, 20, 26, 47, 22, 51, 57, 54, 25, 40, 56, 51, 48, 13,  0,
       27, 54, 43,  6, 23, 49, 27, 43, 16, 56, 63, 37, 58,  4, 59, 65, 36,
       59, 54, 52, 58, 40, 35, 11, 30, 25, 31,  4,  5, 36, 28, 30, 17, 40,
       10, 55, 64, 20, 17, 22, 11, 24,  2, 19,  0,  6, 58, 31, 61, 22, 36,
       40, 59, 12, 49, 22, 27, 60, 14, 17, 19, 13, 22, 22, 10, 57, 19, 49,
       32, 54, 41, 11, 54, 11, 15, 29, 20, 11, 26, 56, 63,  0, 27])
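
The difference between sampling and taking the argmax can be seen on a tiny example. This is a plain-Python sketch, not the TensorFlow implementation: the `softmax` helper and the three logits are made up for illustration, and `random.choices` plays the role of `tf.random.categorical`.

```python
import math
import random


def softmax(logits):
    """Turn logits into probabilities (subtracting the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, 1.0]   # made-up scores for a 3-character vocabulary
probs = softmax(logits)

# Argmax is deterministic: the same logits always give the same index.
argmax_id = max(range(len(logits)), key=lambda i: logits[i])

# Sampling draws an index in proportion to its probability, so the output varies.
random.seed(0)
sampled_ids = random.choices(range(len(logits)), weights=probs, k=200)

print(argmax_id)
print(sorted(set(sampled_ids)))
```

With argmax, a model that returns to a similar state keeps emitting the same character, which is how generation gets stuck in a loop. Sampling lets less likely characters through often enough to break out of it.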

### 3.3) Train the model 🏋️‍♂️

At this point the problem can be treated as a standard classification problem: given the previous RNN state and the input at this time step, predict the class of the next character.

❓ Compile the model with the **correct loss** and an optimizer

<br>

<details>
    <summary markdown='span'>💡 Click here for the loss if you're stuck</summary>

You should use the <code>tf.keras.losses.SparseCategoricalCrossentropy</code> loss.

</details>

In [36]:
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer='adam', loss=loss)
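
For intuition, the loss for a single prediction is the negative log of the softmax probability assigned to the correct character. The sketch below is plain Python with a hypothetical helper, `sparse_ce_from_logits`, written for this example. It also shows that a model with no preference among the 66 characters scores about `ln(66) ≈ 4.19`, a useful baseline when reading the first training losses.

```python
import math


def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


vocab_size = 66
uniform_logits = [0.0] * vocab_size   # no preference for any character
baseline = sparse_ce_from_logits(uniform_logits, label=3)
print(round(baseline, 2))             # 4.19

confident_logits = [0.0] * vocab_size
confident_logits[3] = 10.0            # strongly favours the correct character
confident_loss = sparse_ce_from_logits(confident_logits, label=3)
print(confident_loss < baseline)      # True
```

During training, Keras averages this per-character value over every position in the batch, so a reported loss well below 4.19 means the model is doing better than guessing uniformly.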

To keep training within a reasonable time, we will use **just 5 epochs** (you can increase this later if you like) to train the model.

This will still take about 10 minutes, so grab a coffee ☕️ while you wait.

In [None]:
%%time
history = model.fit(dataset, epochs=5)

172/172 ━━━━━━━━━━━━━━━━━━━━ 291s 2s/step - loss: 2.7625
CPU times: user 30min 37s, sys: 7min 2s, total: 37min 39s
Wall time: 4min 51s


## 4️⃣ Generate text 🧠

### 4.1) Generation model 🤖

We need to wrap our model for generation. The code below may look excessive, so let's break it down:

- We inherit from the Keras base `Model` class and pass our previously defined model to the `__init__` method.
- We also create a mask which adds a value of **negative infinity** to the logit of the unknown character **`[UNK]`** (used to denote characters outside of our vocab), as we never want our model to generate this character.
- We **sample** and **squeeze** to get the predicted ids.
- We pass the state back to allow us to feed it back into the model!
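
The masking step can be illustrated without TensorFlow. In this plain-Python sketch (the four logits and the `softmax` helper are made up for the example), adding negative infinity to the logit at index 0, which plays the role of the `[UNK]` slot, drives its probability to exactly zero while the remaining probabilities still sum to one.

```python
import math


def softmax(logits):
    """Convert logits to probabilities; exp(-inf) evaluates to 0.0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [3.0, 1.0, 0.5, 2.0]           # index 0 plays the role of [UNK]
mask = [-float("inf"), 0.0, 0.0, 0.0]   # -inf only at the blocked index

masked_logits = [l + m for l, m in zip(logits, mask)]
probs = softmax(masked_logits)

print(probs[0])               # 0.0
print(round(sum(probs), 6))   # 1.0
```

Because the mask is added to the logits before sampling, the blocked index can never be drawn, even though it had the highest raw score in this example.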

In [39]:
class OneStep(tf.keras.Model):
  def __init__(self, model, chars_from_ids, ids_from_chars):
    super().__init__()
    self.model = model
    self.chars_from_ids = chars_from_ids
    self.ids_from_chars = ids_from_chars

    # Create a mask to prevent "[UNK]" from being generated.
    skip_ids = self.ids_from_chars(['[UNK]'])[:, None]
    sparse_mask = tf.SparseTensor(
        # Put a -inf at each bad index.
        values=[-float('inf')]*len(skip_ids),
        indices=skip_ids,
        # Match the shape to the vocabulary
        dense_shape=[len(ids_from_chars.get_vocabulary())])
    self.prediction_mask = tf.sparse.to_dense(sparse_mask)

  @tf.function
  def generate_one_step(self, inputs, states=None):
    # Convert strings to token IDs.
    input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
    input_ids = self.ids_from_chars(input_chars).to_tensor()

    # Run the model.
    # predicted_logits.shape is [batch, char, next_char_logits]
    predicted_logits, states = self.model(inputs=input_ids, states=states,
                                          return_state=True)
    # Only use the last prediction.
    predicted_logits = predicted_logits[:, -1, :]
    # Apply the prediction mask: prevent "[UNK]" from being generated.
    predicted_logits = predicted_logits + self.prediction_mask

    # Sample the output logits to generate token IDs.
    predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
    predicted_ids = tf.squeeze(predicted_ids, axis=-1)

    # Convert from token ids to characters
    predicted_chars = self.chars_from_ids(predicted_ids)

    # Return the characters and model state.
    return predicted_chars, states

In [40]:
one_step_model = OneStep(model, chars_from_ids, ids_from_chars)

### 4.2) Using the model 📝

Now we can run it in a loop to generate some text. Looking at the generated text, you'll see the model knows when to capitalize, how to break text into paragraphs, and how to imitate a Shakespeare-like vocabulary. With the small number of training epochs, it has not yet learned to form coherent sentences, but the result is pretty impressive for the training time! 🙌

❓ Play around with the input text and the number of predicted characters and see what your model creates.

In [41]:
%%time
states = None
next_char = tf.constant(['Juliet: Where art thou, Romeo?'])
result = [next_char]

for n in range(1000):
    next_char, states = one_step_model.generate_one_step(next_char, states=states)
    result.append(next_char)

result = tf.strings.join(result)
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)


Juliet: Where art thou, Romeo?

ERENBAB:
No
whece whilt it ho lith?
Shat, shey bud nok:
Thy thom he wlathend Comain: by sumhel's, hand but of thie lode.

PARUTIS:
Wone foim, the quceagice to kere'd at to quintion.

hem:
I woud'd toopty?

ASRLO:
Buttrish tot siy whow hes in to thin ywu''s you.
Nore mu frove do hepead
Uplaw, te veis
Hesty and, bay thy herpene in myrtithir.

CRFARE:
By'd thit plowart:
At yout why letion lering.

MUTENUS::
Me cvowe's on ereimene me alfopefoo
Whou of the hearsing:
I, a surthy and you,
I fad it theace mart mert in thoul tlyervey sees indu;
And dather ameare tad houlf thos -ursispy,
Un I have you it wigh alongshe, it eloon Poting.

PESFUS:
Yo here, forlege and bat it new them-on mleavend sic-ze Caco bie.

has! In it that arem pobe'd bread grougert'd shoul
Tht Inridig! I mand withon dyrove moo-h'd seat, if,
co her wordend nobgain say ha unting dorre.

CLORTAR:

will wathty what stay will on be pape whill briot,
Were stere waithtlow;
Hood gimlers? aim moms, sen

🏁 Even though the results could be improved significantly, it is quite incredible what the model learnt in **only five** epochs! Next, you can increase the number of epochs or try the model on some text of your own.

##### Copyright 2019 The TensorFlow Authors.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.