# Generating text using an RNN

## Why use an RNN instead of another technique?

Natural language processing and text generation these days are mostly done with transformers (see GPT-2 and GPT-3, for example), but I still wanted to explore recurrent neural networks, so I decided to try some text generation with them.

## Licensing information

Copyright 2023 Simon Wu

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Parts of this notebook were created by the TensorFlow Authors under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

The blog post corpus used was [The Blog Authorship Corpus](https://u.cs.biu.ac.il/~koppel/BlogCorpus.htm). No particular license for use is noted, but in the words of the creators of the corpus:

> The corpus may be freely used for non-commercial research purposes.

The Shakespeare corpus used was scraped by [Andrej Karpathy](https://github.com/karpathy/char-rnn). The works of Shakespeare are in the public domain, and Karpathy's repository is under the MIT license.

Recipe data was scraped from the [based.cooking](https://github.com/LukeSmithxyz/based.cooking/) GitHub repository. The contents of the site and repository were placed into the public domain by its authors.

## Before you start

If you want to experiment with the code in this notebook, make sure to enable a GPU on Colab or whichever runtime you use. Training on a CPU is much slower.

## Basic architecture

These models were trained as character-level recurrent neural networks. They take in tokenized text (one integer ID per character) and output logits: unnormalized scores that define a probability distribution over the possible next characters.

The next character is sampled from that distribution rather than always taking the single most likely one. The sampled character is then fed back into the model, together with the model's recurrent state, to generate the character after that.
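A minimal sketch of this generate-and-feed-back loop in plain Python. `toy_logits` is a hypothetical stand-in for the trained network, not the model in this notebook, and the three-character vocabulary is made up for illustration. The sketch samples from the distribution, as the notebook's `tf.random.categorical` call does later on; the real model also carries a GRU state between steps, which this toy version leaves out.

```python
import math
import random

VOCAB = ['a', 'b', 'c']  # made-up vocabulary for illustration

def toy_logits(prev_char):
    # Hypothetical stand-in for the network: favor the character that
    # follows prev_char in VOCAB (wrapping around at the end).
    favored = (VOCAB.index(prev_char) + 1) % len(VOCAB)
    return [2.0 if i == favored else 0.0 for i in range(len(VOCAB))]

def sample(logits, rng):
    # Softmax weights (shifted by the max for stability), then one random draw.
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def generate(prompt, n_chars, seed=0):
    rng = random.Random(seed)
    text = prompt
    for _ in range(n_chars):
        # Predict from the latest character, then append the sampled one.
        next_id = sample(toy_logits(text[-1]), rng)
        text += VOCAB[next_id]
    return text

generated = generate('a', 20)
print(generated)
```

Because the next character is drawn at random, two runs with different seeds will usually produce different text from the same prompt.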

Three layers are used:
- Embedding layer, turning each character ID into a dense, trainable vector
- GRU RNN layer, which is trained to predict the next characters
- Output layer that gives logits (unnormalized log probabilities), one per character in the vocabulary

# Code

## Initialize

### Import needed libraries and data

In [54]:
# Importing needed libraries

import tensorflow as tf

import numpy as np
import os

import shutil
import zipfile
import requests

In [30]:
# Getting datasets

shakes_data = tf.keras.utils.get_file('shakespeare.txt', 'https://raw.githubusercontent.com/shangmingwu/rnn-text-generator/main/data/shakespeare-corpus.txt')
recipe_data = tf.keras.utils.get_file('recipes.txt', 'https://raw.githubusercontent.com/shangmingwu/rnn-text-generator/main/data/recipes-corpus.txt')
blog_data = tf.keras.utils.get_file('blogs.txt', 'https://raw.githubusercontent.com/shangmingwu/rnn-text-generator/main/data/blog-corpus.txt')

In [31]:
# Get the text from the datasets

shakes_text = open(shakes_data, 'rb').read().decode(encoding = 'utf-8')
recipe_text = open(recipe_data, 'rb').read().decode(encoding = 'utf-8')
blog_text = open(blog_data, 'rb').read().decode(encoding = 'utf-8')

In [32]:
# Create a vocabulary: an entry for each character in the text

vocab = sorted(set(shakes_text + recipe_text + blog_text))

### Process text for training

In [33]:
# Layers for converting characters to token IDs and back

chars_to_tensor = tf.keras.layers.StringLookup(vocabulary = list(vocab), mask_token = None)
tensor_to_chars = tf.keras.layers.StringLookup(vocabulary = list(vocab), invert = True, mask_token = None)

def tensor_to_string(tokens):
  return tf.strings.reduce_join(tensor_to_chars(tokens), axis = -1)
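To make the character-to-ID mapping concrete, here is a rough plain-Python equivalent on a toy string. It mirrors one detail of `StringLookup` with `mask_token = None`: index 0 is reserved for the `[UNK]` (unknown) token, so the layer's vocabulary is one entry longer than `vocab`. The names `toy_text`, `encode`, and `decode` are made up for this sketch.

```python
toy_text = "hello world"
toy_vocab = sorted(set(toy_text))  # one entry per distinct character

# Index 0 is reserved for unknown characters.
id_to_char = ['[UNK]'] + toy_vocab
char_to_id = {ch: i for i, ch in enumerate(id_to_char)}

def encode(s):
    # Characters outside the vocabulary map to 0, the [UNK] ID.
    return [char_to_id.get(ch, 0) for ch in s]

def decode(ids):
    return ''.join(id_to_char[i] for i in ids)

ids = encode("hello")
print(ids)            # [4, 3, 5, 5, 6]
print(decode(ids))    # hello
print(encode("z"))    # [0], because 'z' never appears in toy_text
```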

### Select desired dataset

In [34]:
# Run this code block to use Shakespeare data.

text = shakes_text
prompt = 'ROMEO: '

In [None]:
# Run this code block to use recipe data.

text = recipe_text
prompt = '---'

In [None]:
# Run this code block to use blog data.

text = blog_text
prompt = '<Blog>'

### Create training datasets

In [35]:
# Convert the chosen text into a stream of token IDs, one per character

all_tensors = chars_to_tensor(tf.strings.unicode_split(text, 'UTF-8'))
tensor_dataset = tf.data.Dataset.from_tensor_slices(all_tensors)

In [36]:
# Split into sequences of tensors

sequence_length = 100
sequences = tensor_dataset.batch(sequence_length + 1, drop_remainder = True)

def split_input_target(sequence):
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)
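The chunking and input/target split can be illustrated on a short string in plain Python. The `chunk` helper is made up for this sketch and imitates `batch(sequence_length + 1, drop_remainder = True)`. Each target sequence is the input sequence shifted one character to the left, so at every position the model learns to predict the character that comes next.

```python
def chunk(items, size):
    # Keep only full chunks, like batch(size, drop_remainder=True).
    return [items[i:i + size] for i in range(0, len(items) - size + 1, size)]

def split_input_target(sequence):
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return input_text, target_text

toy_sequence_length = 3
chunks = chunk("abcdefghij", toy_sequence_length + 1)
pairs = [split_input_target(c) for c in chunks]

print(chunks)  # ['abcd', 'efgh'], the leftover 'ij' is dropped
print(pairs)   # [('abc', 'bcd'), ('efg', 'fgh')]
```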

In [37]:
# Shuffle data around into batches

BATCH_SIZE = 64

# TensorFlow shuffles within a buffer of this many elements rather than the whole dataset

BUFFER_SIZE = 10000

dataset = (
    dataset
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder = True)
    .prefetch(tf.data.AUTOTUNE))
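As a rough check on how much work one epoch is, the number of training steps follows from the corpus length, the sequence length, and the batch size. The corpus size of 1,000,000 characters below is a made-up figure for illustration, not the size of any of the three datasets.

```python
def steps_per_epoch(num_chars, sequence_length=100, batch_size=64):
    # Each sequence consumes sequence_length + 1 characters; the remainder is dropped.
    num_sequences = num_chars // (sequence_length + 1)
    # Incomplete batches are dropped too (drop_remainder=True).
    return num_sequences, num_sequences // batch_size

num_sequences, steps = steps_per_epoch(1_000_000)
print(num_sequences, steps)  # 9900 154
```

In this hypothetical case the 9,900 sequences fit inside the 10,000-element shuffle buffer, so the shuffle would cover the whole dataset. For a larger corpus, shuffling is only approximate.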

## Building the RNN

### Define the model

In [38]:
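# Number of token IDs, including the "[UNK]" token added by StringLookup
vocab_size = len(chars_to_tensor.get_vocabulary())

# Size of each character's embedding vector
embedding_dim = 256

# Number of units in the GRU layer
rnn_dim = 1024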

In [39]:
class TextModel(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, rnn_dim):
    super().__init__()
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(rnn_dim,
                                   return_sequences = True,
                                   return_state = True)
    self.dense = tf.keras.layers.Dense(vocab_size)

  def call(self, inputs, states = None, return_state = False, training = False):
    x = inputs
    x = self.embedding(x, training=training)
    if states is None:
      states = self.gru.get_initial_state(x)
    x, states = self.gru(x, initial_state=states, training=training)
    x = self.dense(x, training=training)

    if return_state:
      return x, states
    else:
      return x
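To get a feel for the size of this model, the trainable parameter count of each layer can be worked out by hand. This sketch assumes the Keras GRU defaults (in particular `reset_after = True`, which gives each of the three weight blocks two bias vectors) and uses a made-up `vocab_size` of 66; the real value depends on the combined corpus.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding_dim-sized vector per token ID.
    return vocab_size * embedding_dim

def gru_params(input_dim, units):
    # Three weight blocks (update gate, reset gate, candidate state), each with
    # an input kernel, a recurrent kernel, and two bias vectors.
    return 3 * (input_dim * units + units * units + 2 * units)

def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output.
    return input_dim * output_dim + output_dim

vocab_size, embedding_dim, rnn_dim = 66, 256, 1024  # vocab_size is hypothetical
counts = {
    'embedding': embedding_params(vocab_size, embedding_dim),
    'gru': gru_params(embedding_dim, rnn_dim),
    'dense': dense_params(rnn_dim, vocab_size),
}
total = sum(counts.values())
print(counts, total)
```

Nearly all of the parameters sit in the GRU layer, so `rnn_dim` is the setting that most affects model size and training time.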

In [40]:
class TextTraining(TextModel):
  @tf.function
  def train_step(self, inputs):
      inputs, labels = inputs
      with tf.GradientTape() as tape:
          predictions = self(inputs, training = True)
          loss = self.loss(labels, predictions)
      grads = tape.gradient(loss, self.trainable_variables)
      self.optimizer.apply_gradients(zip(grads, self.trainable_variables))

      return {'loss': loss}

In [41]:
model = TextTraining(
    vocab_size = len(chars_to_tensor.get_vocabulary()),
    embedding_dim = embedding_dim,
    rnn_dim = rnn_dim)

In [42]:
model.compile(optimizer = tf.keras.optimizers.Adam(),
              loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits = True))
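For intuition about the loss values printed during training, here is a plain-Python sketch of sparse categorical cross-entropy computed from logits for a single prediction. The Keras loss, with its default settings, averages this quantity over every position in the batch. The vocabulary size of 66 is a made-up figure.

```python
import math

def sparse_ce_from_logits(logits, label):
    # Negative log-probability of the correct class:
    # log(sum(exp(logits))) - logits[label], computed stably.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[label]

# An untrained model that rates every character equally scores ln(vocab_size).
uniform_loss = sparse_ce_from_logits([0.0] * 66, label=3)

# A confident, correct prediction scores close to zero.
confident_loss = sparse_ce_from_logits([10.0, 0.0, 0.0], label=0)

print(uniform_loss, confident_loss)
```

If the first epoch's loss is far above the natural log of the vocabulary size, that suggests something is off in the setup, for example a loss that was not told its inputs are logits.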

### Train the model

In [43]:
# Number of training epochs

## Per-epoch training time can vary from 15 to 30 seconds depending on dataset

## If you don't like the output, increasing EPOCHS might help

EPOCHS = 20

In [None]:
history = model.fit(dataset, epochs=EPOCHS)

### Generate one character at a time

In [48]:
class OneStep(tf.keras.Model):
  def __init__(self, model, tensor_to_chars, chars_to_tensor, temperature = 1.0):
    super().__init__()
    self.temperature = temperature
    self.model = model
    self.tensor_to_chars = tensor_to_chars
    self.chars_to_tensor = chars_to_tensor

    # Create a mask to prevent "[UNK]" from being generated.
    skip_ids = self.chars_to_tensor(['[UNK]'])[:, None]
    sparse_mask = tf.SparseTensor(
        # Put a -inf at each bad index.
        values=[-float('inf')]*len(skip_ids),
        indices=skip_ids,
        # Match the shape to the vocabulary
        dense_shape=[len(chars_to_tensor.get_vocabulary())])
    self.prediction_mask = tf.sparse.to_dense(sparse_mask)

  @tf.function
  def generate_one_step(self, inputs, states=None):
    # Convert strings to token IDs.
    input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
    input_ids = self.chars_to_tensor(input_chars).to_tensor()

    # Run the model.
    # predicted_logits.shape is [batch, char, next_char_logits]
    predicted_logits, states = self.model(inputs=input_ids, states=states,
                                          return_state=True)
    # Only use the last prediction.
    predicted_logits = predicted_logits[:, -1, :]
    predicted_logits = predicted_logits/self.temperature
    # Apply the prediction mask: prevent "[UNK]" from being generated.
    predicted_logits = predicted_logits + self.prediction_mask

    # Sample the output logits to generate token IDs.
    predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
    predicted_ids = tf.squeeze(predicted_ids, axis=-1)

    # Convert from token ids to characters
    predicted_chars = self.tensor_to_chars(predicted_ids)

    # Return the characters and model state.
    return predicted_chars, states
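The effect of `temperature` and of the `-inf` mask can be seen on a handful of made-up logits. This is a plain-Python sketch of the two steps before sampling, not the TensorFlow code itself; `next_char_probs` is a name invented here, and index 0 plays the role of `[UNK]`.

```python
import math

def softmax(logits):
    finite_max = max(l for l in logits if l != -math.inf)
    exps = [math.exp(l - finite_max) for l in logits]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

def next_char_probs(logits, temperature=1.0, masked_ids=(0,)):
    # Dividing by the temperature sharpens (< 1) or flattens (> 1) the distribution.
    scaled = [l / temperature for l in logits]
    for i in masked_ids:
        # Adding -inf gives the token zero probability after softmax.
        scaled[i] += -math.inf
    return softmax(scaled)

toy_logits = [1.0, 2.0, 3.0, 0.5]  # made-up values; index 0 stands for [UNK]
cold = next_char_probs(toy_logits, temperature=0.5)
normal = next_char_probs(toy_logits, temperature=1.0)
hot = next_char_probs(toy_logits, temperature=2.0)

for name, probs in [('cold', cold), ('normal', normal), ('hot', hot)]:
    print(name, [round(p, 3) for p in probs])
```

Lower temperatures make the output more predictable and repetitive, while higher temperatures make it more varied and more error-prone.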

In [49]:
single_step = OneStep(model, tensor_to_chars, chars_to_tensor)

## Generate text!

In [None]:
# Run this code block to override the prompt set in the beginning

prompt = 'My new prompt: '

In [None]:
# Generate 1000 characters with the model

states = None
next_char = tf.constant([prompt])
result = [next_char]

for n in range(1000):
  next_char, states = single_step.generate_one_step(next_char, states = states)
  result.append(next_char)

result = tf.strings.join(result)
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)

## Save/load the model

### Load pre-trained Shakespeare model

In [55]:
url = 'https://github.com/shangmingwu/rnn-text-generator/blob/main/data/shakespeare_model_20epochs.zip?raw=true'
r = requests.get(url, allow_redirects=True)
r.raise_for_status()  # Stop here if the download failed
with open('./shakespeare_model.zip', 'wb') as f:
    f.write(r.content)
with zipfile.ZipFile("./shakespeare_model.zip","r") as zip_ref:
    zip_ref.extractall("./")

In [None]:
# Load and run the model

single_step_loaded = tf.saved_model.load('./single_step')

states = None
next_char = tf.constant([prompt])
result = [next_char]

for n in range(1000):
  next_char, states = single_step_loaded.generate_one_step(next_char, states = states)
  result.append(next_char)

print(tf.strings.join(result)[0].numpy().decode("utf-8"))

### Save/load your own model

In [None]:
# Save the model to the current directory, and compress it for download

tf.saved_model.save(single_step, './single_step')
shutil.make_archive("./single_step_saved", 'zip', "./single_step")
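One detail worth knowing about `shutil.make_archive` when it is called with a root directory like this: the archive holds the *contents* of that directory, without the top-level folder itself. The sketch below shows this with a placeholder file in a temporary directory; the file is a made-up stand-in, not a real saved model.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    # A placeholder "model" directory containing a single dummy file.
    model_dir = os.path.join(tmp, 'single_step')
    os.makedirs(model_dir)
    with open(os.path.join(model_dir, 'saved_model.pb'), 'wb') as f:
        f.write(b'placeholder')

    # Same call shape as in the notebook: base name, format, root directory.
    archive_path = shutil.make_archive(
        os.path.join(tmp, 'single_step_saved'), 'zip', model_dir)

    with zipfile.ZipFile(archive_path) as zf:
        names = zf.namelist()

print(names)  # the file appears at the archive root, not under single_step/
```

So when restoring an archive made this way, extract it into a `single_step` directory (for example `zip_ref.extractall('./single_step')`) before calling `tf.saved_model.load('./single_step')`.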

In [None]:
# Load and run the model

single_step_loaded = tf.saved_model.load('./single_step')

states = None
next_char = tf.constant([prompt])
result = [next_char]

for n in range(1000):
  next_char, states = single_step_loaded.generate_one_step(next_char, states = states)
  result.append(next_char)

print(tf.strings.join(result)[0].numpy().decode("utf-8"))