# Text generation with a miniature GPT

**Description:** Implement a miniature version of GPT and train it to generate text.

## Introduction

This example demonstrates how to implement an autoregressive language model
using a miniature version of the GPT model.
The model consists of a single Transformer block with causal masking
in its attention layer.
We use the text from the IMDB sentiment classification dataset for training
and generate new movie reviews for a given prompt.
When using this script with your own dataset, make sure it has at least
1 million words.

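The code below imports `TextVectorization` directly from `tensorflow.keras.layers`
and uses `layers.MultiHeadAttention` and `tf.data.AUTOTUNE`, so it needs a more
recent release than the TensorFlow 2.3 originally cited for this example.
TensorFlow 2.6 or higher is a safe choice.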

**References:**

- [GPT](https://www.semanticscholar.org/paper/Improving-Language-Understanding-by-Generative-Radford/cd18800a0fe0b668a1cc19f2ec95b5003d0a5035)
- [GPT-2](https://www.semanticscholar.org/paper/Language-Models-are-Unsupervised-Multitask-Learners-Radford-Wu/9405cc0d6169988371b2755e573cc28650d14dfe)
- [GPT-3](https://arxiv.org/abs/2005.14165)

## Setup

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization
import numpy as np
import os
import re
import string
import random

## Implement a Transformer block as a layer

In [None]:
def causal_attention_mask(batch_size, n_dest, n_src, dtype):
    """
    Mask the upper half of the dot product matrix in self attention.
    This prevents flow of information from future tokens to current token.
    1's in the lower triangle, counting from the lower right corner.
    """
    i = tf.range(n_dest)[:, None]
    j = tf.range(n_src)
    m = i >= j - n_src + n_dest
    mask = tf.cast(m, dtype)
    mask = tf.reshape(mask, [1, n_dest, n_src])
    mult = tf.concat(
        [tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)], 0
    )
    return tf.tile(mask, mult)


class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = layers.MultiHeadAttention(num_heads, embed_dim)
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size = input_shape[0]
        seq_len = input_shape[1]
        causal_mask = causal_attention_mask(batch_size, seq_len, seq_len, tf.bool)
        attention_output = self.att(inputs, inputs, attention_mask=causal_mask)
        attention_output = self.dropout1(attention_output)
        out1 = self.layernorm1(inputs + attention_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output)
        return self.layernorm2(out1 + ffn_output)
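
To make the masking concrete, here is a minimal NumPy sketch of the same comparison used in `causal_attention_mask`, applied to made-up attention scores. The helper names `causal_mask_np` and `masked_attention_weights` are illustrative and not part of the example; this is not how `layers.MultiHeadAttention` is implemented internally, only a picture of what the mask does to the attention weights.

```python
import numpy as np


def causal_mask_np(n_dest, n_src):
    """NumPy version of the mask: True where query i may attend to key j."""
    i = np.arange(n_dest)[:, None]
    j = np.arange(n_src)
    return i >= j - n_src + n_dest


def masked_attention_weights(scores, mask):
    """Row-wise softmax over scores; masked-out positions get weight 0."""
    masked = np.where(mask, scores, -1e9)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)


mask = causal_mask_np(4, 4)
print(mask.astype(int))  # lower-triangular matrix of ones

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))  # made-up attention scores
weights = masked_attention_weights(scores, mask)
print(np.round(weights, 2))  # entries above the diagonal are 0
```

Each row of `weights` still sums to 1, but a token at position `i` places no weight on positions after `i`.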


## Implement an embedding layer

Create two separate embedding layers: one for the tokens and one for the token
indices (positions). Their outputs are added element-wise.

In [None]:
class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super(TokenAndPositionEmbedding, self).__init__()
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions
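
The lookup-and-add step can be sketched in plain NumPy. The tables below are random stand-ins for the learned embedding weights and the sizes are toy values chosen for illustration; only the indexing and broadcasting mirror what `TokenAndPositionEmbedding.call` does.

```python
import numpy as np

vocab_size, maxlen, embed_dim = 10, 5, 4  # toy sizes, not the example's
rng = np.random.default_rng(0)
token_table = rng.normal(size=(vocab_size, embed_dim))  # stand-in for token_emb
pos_table = rng.normal(size=(maxlen, embed_dim))        # stand-in for pos_emb


def embed(token_ids):
    """Look up token vectors and add the vector for each position."""
    token_ids = np.asarray(token_ids)
    positions = np.arange(token_ids.shape[-1])
    # (batch, seq, dim) + (seq, dim): the position vectors broadcast over the batch
    return token_table[token_ids] + pos_table[positions]


batch = [[1, 2, 3, 0, 0], [4, 4, 4, 4, 4]]
out = embed(batch)
print(out.shape)  # (2, 5, 4)
```

In the second row every token is the same, so any difference between two positions comes only from the position vectors.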


## Implement the miniature GPT model

In [None]:
vocab_size = 20000  # Only consider the top 20k words
maxlen = 80  # Max sequence size
embed_dim = 256  # Embedding size for each token
num_heads = 2  # Number of attention heads
feed_forward_dim = 256  # Hidden layer size in feed forward network inside transformer


def create_model():
    inputs = layers.Input(shape=(maxlen,), dtype=tf.int32)
    embedding_layer = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)
    x = embedding_layer(inputs)
    transformer_block = TransformerBlock(embed_dim, num_heads, feed_forward_dim)
    x = transformer_block(x)
    outputs = layers.Dense(vocab_size)(x)
    model = keras.Model(inputs=inputs, outputs=[outputs, x])
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    model.compile(
        "adam", loss=[loss_fn, None],
    )  # No loss and optimization based on word embeddings from transformer block
    return model
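
The loss compiled above is sparse categorical cross-entropy computed from logits. As a rough sketch of what that means for a single position, the function below (a hypothetical helper written in plain Python, not the Keras implementation) computes the negative log of the softmax probability assigned to the target token.

```python
import math


def sparse_ce_from_logits(logits, target):
    """Cross-entropy for one position: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


vocab_size = 20000

# A model that knows nothing assigns equal logits to every word
uniform_loss = sparse_ce_from_logits([0.0] * vocab_size, target=7)

# A confident prediction is cheap when right and expensive when wrong
confident_loss = sparse_ce_from_logits([8.0, 0.0, 0.0], target=0)
wrong_loss = sparse_ce_from_logits([8.0, 0.0, 0.0], target=1)

print(uniform_loss, confident_loss, wrong_loss)
```

With equal logits the loss is `ln(vocab_size)`, which gives a baseline against which the training losses reported later can be read.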


## Prepare the data for word-level language modelling

Download the IMDB dataset and combine the training and test sets for a text
generation task.

In [None]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  20.8M      0  0:00:03  0:00:03 --:--:-- 20.8M


In [None]:
batch_size = 128

# The dataset contains each review in a separate text file.
# The text files are spread across four folders.
# Create a list of all files.
filenames = []
directories = [
    "aclImdb/train/pos",
    "aclImdb/train/neg",
    "aclImdb/test/pos",
    "aclImdb/test/neg",
]
for dir in directories:
    for f in os.listdir(dir):
        filenames.append(os.path.join(dir, f))

print(f"{len(filenames)} files")

# Create a dataset from text files
random.shuffle(filenames)
text_ds = tf.data.TextLineDataset(filenames)
text_ds = text_ds.shuffle(buffer_size=256)
text_ds = text_ds.batch(batch_size)


def custom_standardization(input_string):
    """ Remove html line-break tags and handle punctuation """
    lowercased = tf.strings.lower(input_string)
    stripped_html = tf.strings.regex_replace(lowercased, "<br />", " ")
    return tf.strings.regex_replace(stripped_html, f"([{string.punctuation}])", r" \1")


# Create a vectorization layer and adapt it to the text
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size - 1,
    output_mode="int",
    output_sequence_length=maxlen + 1,
)
vectorize_layer.adapt(text_ds)
vocab = vectorize_layer.get_vocabulary()  # To get words back from token indices


def prepare_lm_inputs_labels(text):
    """
    Shift word sequences by 1 position so that the target for position (i) is
    word at position (i+1). The model will use all words up till position (i)
    to predict the next word.
    """
    text = tf.expand_dims(text, -1)
    tokenized_sentences = vectorize_layer(text)
    x = tokenized_sentences[:, :-1]
    y = tokenized_sentences[:, 1:]
    return x, y


text_ds = text_ds.map(prepare_lm_inputs_labels)
text_ds = text_ds.prefetch(tf.data.AUTOTUNE)

50000 files
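
The standardize, tokenize and shift steps can be followed on a single sentence in plain Python. This is a sketch under assumptions: `standardize` and `vectorize` are hypothetical stand-ins for `custom_standardization` and the `TextVectorization` layer, and the vocabulary is made up. It keeps the layer's convention that index 0 is padding and index 1 is `[UNK]`.

```python
import re
import string


def standardize(text):
    """Lowercase, drop HTML line breaks, put a space before punctuation."""
    lowered = text.lower()
    no_breaks = lowered.replace("<br />", " ")
    return re.sub(f"([{re.escape(string.punctuation)}])", r" \1", no_breaks)


def vectorize(text, vocab, seq_len):
    """Map words to indices, then truncate or zero-pad to seq_len."""
    word_to_index = {word: index for index, word in enumerate(vocab)}
    ids = [word_to_index.get(word, 1) for word in standardize(text).split()]
    ids = ids[:seq_len]
    return ids + [0] * (seq_len - len(ids))


# Toy vocabulary: 0 is padding, 1 is the out-of-vocabulary token
vocab = ["", "[UNK]", "great", "movie", "!", "loved", "it", "."]
toy_maxlen = 7

tokens = vectorize("Great movie!<br />Loved it.", vocab, toy_maxlen + 1)
x = tokens[:-1]  # model input
y = tokens[1:]   # target: the same sequence shifted left by one

print(standardize("Great movie!<br />Loved it."))
print(x)
print(y)
```

Position `i` of `y` holds the token that follows position `i` of `x`, which is the pairing `prepare_lm_inputs_labels` produces for every review.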


In [None]:
type(filenames[0:1])

list

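Let's create a small test set from a single, randomly chosen review file. Note that this file comes from the same pool of files used for training, so it is not a held-out set.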

In [None]:
# Create the test dataset from a single text file
random.shuffle(filenames)
filenames_test = filenames[0:1]
text_ds_test = tf.data.TextLineDataset(filenames_test)
text_ds_test = text_ds_test.shuffle(buffer_size=256)
text_ds_test = text_ds_test.batch(batch_size)


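# Create a vectorization layer and adapt it to the test text.
# Note: `prepare_lm_inputs_labels` below still tokenizes with `vectorize_layer`,
# so this layer and `vocab_test` are not used by the test pipeline.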
vectorize_layer_test = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size - 1,
    output_mode="int",
    output_sequence_length=maxlen + 1,
)
vectorize_layer_test.adapt(text_ds_test)
vocab_test = vectorize_layer_test.get_vocabulary()  # To get words back from token indices


text_ds_test = text_ds_test.map(prepare_lm_inputs_labels)
text_ds_test = text_ds_test.prefetch(tf.data.AUTOTUNE)

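What is `text_ds` exactly? Because the last step of the pipeline above is `.prefetch(...)`, it is a `PrefetchDataset`: a `tf.data.Dataset` that prepares upcoming elements in the background while the current one is being consumed. It is backed by the `tf.raw_ops.PrefetchDataset` operation.

See the [`tf.raw_ops.PrefetchDataset` documentation](https://www.tensorflow.org/api_docs/python/tf/raw_ops/PrefetchDataset) for more detail.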

In [None]:
type(text_ds)

tensorflow.python.data.ops.dataset_ops.PrefetchDataset
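
The idea behind prefetching can be sketched with the standard library alone. The generator below is a hypothetical illustration, not how `tf.data` implements it: a background thread fills a small bounded buffer while the consumer works on the current item.

```python
import queue
import threading


def prefetch(iterable, buffer_size=2):
    """Yield items from iterable while a background thread reads ahead."""
    buffer = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the stream

    def producer():
        for item in iterable:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item


batches = list(prefetch(range(5)))
print(batches)  # [0, 1, 2, 3, 4]
```

The order of the elements is unchanged; only the time at which they are prepared moves earlier.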

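The dataset is iterable, so we can walk through it with a `for` loop. Each element is an `(inputs, targets)` pair of integer tensors, where the targets are the inputs shifted by one token. For example, the following code prints the first 3 entries (batches) of the `text_ds` object.

In [None]:
# Print the first 3 batches of (inputs, targets)
for l, (x_batch, y_batch) in enumerate(text_ds.take(3)):
    print(f'No. of entry: {l}')
    print(x_batch)
    print()
    print(y_batch)
    print()
    print()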

No. of entry: 0
tf.Tensor(
[[9236 5072   16 ... 8092   37    2]
 [   2   69  293 ...    6    1    1]
 [  13   18 2043 ...   71   16  698]
 ...
 [  13    9    5 ... 1881  314    7]
 [ 928 3270  108 ... 3517    4   10]
 [  12   32  121 ...  284   39 3491]], shape=(128, 80), dtype=int64)

tf.Tensor(
[[5072   16   34 ...   37    2   67]
 [  69  293   12 ...    1    1    3]
 [  18 2043    3 ...   16  698    1]
 ...
 [   9    5  496 ...  314    7  794]
 [3270  108   77 ...    4   10    9]
 [  32  121  118 ...   39 3491 1834]], shape=(128, 80), dtype=int64)


No. of entry: 1
tf.Tensor(
[[  55    5  982 ...    0    0    0]
 [ 948  505   12 ... 3644    4   21]
 [1877    4   12 ...   36  309    3]
 ...
 [  53   24  201 ...  253   14  218]
 [1017   99 3663 ...   15   43   90]
 [4676 2397    1 ...  159    3    0]], shape=(128, 80), dtype=int64)

tf.Tensor(
[[   5  982   22 ...    0    0    0]
 [ 505   12  199 ...    4   21   11]
 [   4   12   99 ...  309    3    5]
 ...
 [  24  201   14 ...   14  

In [None]:
# Both tensors hold token indices: inputs first, then targets shifted by one
for l, (x_batch, y_batch) in enumerate(text_ds_test):
    print(f'No. of entry: {l}')
    print(x_batch)
    print()
    print(y_batch)
    print()
    print()


No. of entry: 0
tf.Tensor(
[[   13    18    52     5   555    71    11    66   657     1    57   570
    146    21    12    16     5   131   237   676    39     2   140     7
      2    18   715   112    14    10    16     5   106   137    53     1
      1   673  4724    93    50    53    62   673     8   362    56  2661
      1   113    52   438    59   230   107     6    62    16     5   555
  10401     8 10558   715   112    60     2   172   772     2   485     1
     80   212 10558     1    30   467     8  1389]], shape=(1, 80), dtype=int64)

tf.Tensor(
[[   18    52     5   555    71    11    66   657     1    57   570   146
     21    12    16     5   131   237   676    39     2   140     7     2
     18   715   112    14    10    16     5   106   137    53     1     1
    673  4724    93    50    53    62   673     8   362    56  2661     1
    113    52   438    59   230   107     6    62    16     5   555 10401
      8 10558   715   112    60     2   172   772     2   485     

## Implement a Keras callback for generating text

In [None]:
class TextGenerator(keras.callbacks.Callback):
    """A callback to generate text from a trained model.
    1. Feed some starting prompt to the model
    2. Predict probabilities for the next token
    3. Sample the next token and add it to the next input

    Arguments:
        max_tokens: Integer, the number of tokens to be generated after prompt.
        start_tokens: List of integers, the token indices for the starting prompt.
        index_to_word: List of strings, obtained from the TextVectorization layer.
        top_k: Integer, sample from the `top_k` token predictions.
        print_every: Integer, print after this many epochs.
    """

    def __init__(
        self, max_tokens, start_tokens, index_to_word, top_k=10, print_every=1
    ):
        self.max_tokens = max_tokens
        self.start_tokens = start_tokens
        self.index_to_word = index_to_word
        self.print_every = print_every
        self.k = top_k

    def sample_from(self, logits):
        logits, indices = tf.math.top_k(logits, k=self.k, sorted=True)
        indices = np.asarray(indices).astype("int32")
        preds = keras.activations.softmax(tf.expand_dims(logits, 0))[0]
        preds = np.asarray(preds).astype("float32")
        return np.random.choice(indices, p=preds)

    def detokenize(self, number):
        return self.index_to_word[number]

    def on_epoch_end(self, epoch, logs=None):
        start_tokens = [_ for _ in self.start_tokens]
        if (epoch + 1) % self.print_every != 0:
            return
        num_tokens_generated = 0
        tokens_generated = []
        while num_tokens_generated <= self.max_tokens:
            pad_len = maxlen - len(start_tokens)
            sample_index = len(start_tokens) - 1
            if pad_len < 0:
                x = start_tokens[:maxlen]
                sample_index = maxlen - 1
            elif pad_len > 0:
                x = start_tokens + [0] * pad_len
            else:
                x = start_tokens
            x = np.array([x])
            y, _ = self.model.predict(x)
            sample_token = self.sample_from(y[0][sample_index])
            tokens_generated.append(sample_token)
            start_tokens.append(sample_token)
            num_tokens_generated = len(tokens_generated)
        txt = " ".join(
            [self.detokenize(_) for _ in self.start_tokens + tokens_generated]
        )
        print(f"generated text:\n{txt}\n")


# Tokenize starting prompt
word_to_index = {}
for index, word in enumerate(vocab):
    word_to_index[word] = index

start_prompt = "this movie is"
start_tokens = [word_to_index.get(_, 1) for _ in start_prompt.split()]
num_tokens_generated = 40
text_gen_callback = TextGenerator(num_tokens_generated, start_tokens, vocab)
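
The generation loop in the callback combines three steps: pad the prompt to the model's fixed length, read the logits at the last real token, and sample from the top `k` candidates. The sketch below reproduces those steps with NumPy. `toy_model` is a hypothetical stand-in for `model.predict` that simply favors "previous token plus one", and `MAXLEN` and `VOCAB` are toy sizes, so the output says nothing about the real model.

```python
import numpy as np

MAXLEN = 8  # toy sequence length
VOCAB = 6   # toy vocabulary size


def toy_model(x):
    """Hypothetical stand-in for model.predict: favors token (x[t] + 1) % VOCAB."""
    x = np.asarray(x)
    logits = np.zeros((x.shape[0], x.shape[1], VOCAB))
    for b in range(x.shape[0]):
        for t in range(x.shape[1]):
            logits[b, t, (x[b, t] + 1) % VOCAB] = 5.0
    return logits


def sample_top_k(logits, k, rng):
    """Keep the k largest logits, softmax them, and sample one token index."""
    logits = np.asarray(logits, dtype=np.float64)
    top = np.argsort(logits)[::-1][:k]
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))


def generate(start_tokens, num_new, k, rng):
    tokens = list(start_tokens)
    for _ in range(num_new):
        window = tokens[:MAXLEN]  # like the callback, keep the first MAXLEN tokens
        sample_index = len(window) - 1  # position of the last real token
        padded = window + [0] * (MAXLEN - len(window))
        logits = toy_model(np.array([padded]))
        tokens.append(sample_top_k(logits[0][sample_index], k, rng))
    return tokens


rng = np.random.default_rng(0)
greedy = generate([1, 2], num_new=3, k=1, rng=rng)
print(greedy)  # [1, 2, 3, 4, 5]
```

With `k=1` sampling is deterministic and always picks the most likely token; a larger `k` trades some of that predictability for variety.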


## Train the model

Note: This code should preferably be run on GPU.

In [None]:
# model = create_model()
# model.fit(text_ds, verbose=2, epochs=25, callbacks=[text_gen_callback])

In [None]:
# check gpu
%tensorflow_version 2.x
import tensorflow as tf
device_name = tf.test.gpu_device_name()

Colab only includes TensorFlow 2.x; %tensorflow_version has no effect.


In [None]:
model = create_model()
with tf.device('/device:GPU:0'):
    model.fit(text_ds, verbose=2, epochs=25, callbacks=[text_gen_callback])

Epoch 1/25
generated text:
this movie is a terrible piece of [UNK] of [UNK] , a lot in the original version of the movie , and a [UNK] , i was very good and i 've ever seen it . the worst movies , but it 's not

391/391 - 12s - loss: 5.5901 - dense_5_loss: 5.5901 - 12s/epoch - 32ms/step
Epoch 2/25
generated text:
this movie is about an interesting movie and a little bit . i would have to say that it was the first movie i have ever seen on dvd and i have a fan of mine in the movie , which is just a

391/391 - 11s - loss: 4.7167 - dense_5_loss: 4.7167 - 11s/epoch - 29ms/step
Epoch 3/25
generated text:
this movie is very simple but very interesting . this story was very good . if not a good movie with a couple of times , then you 've never seen any of those films in a long time and i had to see

391/391 - 12s - loss: 4.4704 - dense_5_loss: 4.4704 - 12s/epoch - 30ms/step
Epoch 4/25
generated text:
this movie is a very good movie . it 's really good . the storyline and the characters are a 
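
One way to read the logged losses is to convert them to perplexity, the exponential of the cross-entropy. This is an interpretation aid added here, not part of the training code; the loss values are copied from the log above.

```python
import math

# Losses copied from the training log above (epoch -> loss)
logged_losses = {1: 5.5901, 2: 4.7167, 3: 4.4704}

perplexities = {epoch: math.exp(loss) for epoch, loss in logged_losses.items()}

for epoch, ppl in perplexities.items():
    print(f"epoch {epoch}: loss {logged_losses[epoch]:.4f} -> perplexity {ppl:.1f}")
```

Perplexity can be read loosely as the number of equally likely words the model is choosing between at each step, so a falling value means the model is narrowing down its guesses.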

## Predict the next word

In [None]:
curr_output = model.predict(text_ds_test)



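Check the type of the output `curr_output`. It is a list because the model was built with two outputs: the next-token logits and the Transformer block's output.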

In [None]:
type(curr_output)

list

Check the type of the input data `text_ds_test`.

In [None]:
type(text_ds_test)

tensorflow.python.data.ops.dataset_ops.PrefetchDataset

We extract the contents of this dataset with a `for` loop and keep the input token indices in `sample_input_text`.

In [None]:
sample_input_text = []
# Each element is an (inputs, targets) pair of token-index tensors
for l, (x_batch, y_batch) in enumerate(text_ds_test):
    print(f'No. of entry: {l}')
    print(x_batch)
    sample_input_text.append(x_batch)
    print()
    print(y_batch)
    print()
    print()

No. of entry: 0
tf.Tensor(
[[   13    18    52     5   555    71    11    66   657     1    57   570
    146    21    12    16     5   131   237   676    39     2   140     7
      2    18   715   112    14    10    16     5   106   137    53     1
      1   673  4724    93    50    53    62   673     8   362    56  2661
      1   113    52   438    59   230   107     6    62    16     5   555
  10401     8 10558   715   112    60     2   172   772     2   485     1
     80   212 10558     1    30   467     8  1389]], shape=(1, 80), dtype=int64)

tf.Tensor(
[[   18    52     5   555    71    11    66   657     1    57   570   146
     21    12    16     5   131   237   676    39     2   140     7     2
     18   715   112    14    10    16     5   106   137    53     1     1
    673  4724    93    50    53    62   673     8   362    56  2661     1
    113    52   438    59   230   107     6    62    16     5   555 10401
      8 10558   715   112    60     2   172   772     2   485     

## Read the content

Recall that `text_gen_callback.detokenize` maps a token index back to a word; for example, `text_gen_callback.detokenize(3)` returns the word stored at index 3 of the vocabulary.

Then we can read out the input text, which is a movie review.

In [None]:
a_sample_input = [text_gen_callback.detokenize(ii) for ii in np.array(sample_input_text[0][0])]
" ".join(a_sample_input)

"this movie has a decent story in my opinion [UNK] good fight scenes but i was a little bit disappointed by the end of the movie .i think that it was a way better if [UNK] [UNK] knew karate also or if she knew to use some weapons [UNK] character has become more interesting too and she was a decent opponent to cynthia .i think when the director filmed the final [UNK] ' between cynthia [UNK] he wanted to finish"

We can check the shape of the first output: one sequence in the batch, 80 positions, and one logit for each of the 20,000 vocabulary entries.

In [None]:
curr_output[0].shape

(1, 80, 20000)

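We can use `text_gen_callback.detokenize` to read the output. Taking the `argmax` over the vocabulary axis gives, for each position, the word the model considers most likely to come next given the review up to that position. The result is therefore a position-by-position guess at the next word of the real review, not a free-running continuation.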

In [None]:
' '.join([text_gen_callback.detokenize(ii) for ii in np.argmax(curr_output[0], 2)[0]])

'is is a lot cast about a opinion . is . scenes . it think very little disappointed disappointed . the end of the movie was think it it was a very to than it was was it was was any it was it be the of in [UNK] . a a believable and . then is a little actor to cynthia rothrock think the they villain was this [UNK] fight of is the [UNK] [UNK] was to be the'
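
The decoding step above can be checked on a toy case. In the sketch below the vocabulary, inputs and logits are all made up; it shows how `argmax` over the last axis turns a `(batch, positions, vocab)` array into one predicted word per position, and how those predictions line up with the shifted targets.

```python
import numpy as np

index_to_word = ["", "[UNK]", "this", "movie", "is", "good"]  # made-up vocabulary

inputs = np.array([[2, 3, 4]])   # "this movie is"
targets = np.array([[3, 4, 5]])  # "movie is good" (inputs shifted by one)

# Made-up logits with shape (batch, positions, vocab)
logits = np.zeros((1, 3, 6))
logits[0, 0, 3] = 2.0  # after "this" the toy logits favor "movie"
logits[0, 1, 4] = 2.0  # after "this movie" they favor "is"
logits[0, 2, 1] = 2.0  # after "this movie is" they favor "[UNK]" (a miss)

pred_ids = np.argmax(logits, axis=2)
pred_words = [index_to_word[i] for i in pred_ids[0]]
accuracy = float((pred_ids == targets).mean())

print(pred_words)  # ['movie', 'is', '[UNK]']
print(accuracy)
```

Comparing `pred_ids` with `targets` in this way gives a simple next-word accuracy for the sequence.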