## Introduction

In this example, we will use KerasHub to build a scaled down Generative
Pre-Trained (GPT) model. GPT is a Transformer-based model that allows you to generate
sophisticated text from a prompt.

We will train the model on the [simplebooks-92](https://arxiv.org/abs/1911.12391) corpus,
which is a dataset made from several novels. It is a good dataset for this example since
it has a small vocabulary and high word frequency, which is beneficial when training a
model with few parameters.

This example combines concepts from
[Text generation with a miniature GPT](https://keras.io/examples/generative/text_generation_with_miniature_gpt/)
with KerasHub abstractions. We will demonstrate how KerasHub tokenization, layers and
metrics simplify the training
process, and then show how to generate output text using the KerasHub sampling utilities.

Note: If you are running this example on a Colab,
make sure to enable GPU runtime for faster training.

This example requires KerasHub. You can install it via the following command:
`pip install keras-hub`

## Setup

In [1]:
!pip install numpy==1.26.0



In [2]:
!pip install tensorflow[and-cuda]==2.18

Defaulting to user installation because normal site-packages is not writeable
Collecting tensorflow[and-cuda]==2.18
  Obtaining dependency information for tensorflow[and-cuda]==2.18 from https://files.pythonhosted.org/packages/84/76/c55967ac9968ddaede25a4dce37aba37e9030656f02c12676151ce1b6f22/tensorflow-2.18.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Using cached tensorflow-2.18.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.1 kB)
Collecting tensorboard<2.19,>=2.18 (from tensorflow[and-cuda]==2.18)
  Obtaining dependency information for tensorboard<2.19,>=2.18 from https://files.pythonhosted.org/packages/b1/de/021c1d407befb505791764ad2cbd56ceaaa53a746baed01d2e2143f05f18/tensorboard-2.18.0-py3-none-any.whl.metadata
  Using cached tensorboard-2.18.0-py3-none-any.whl.metadata (1.6 kB)
Collecting keras>=3.5.0 (from tensorflow[and-cuda]==2.18)
  Obtaining dependency information for keras>=3.5.0 from https://files.pythonhosted.org/packag

In [3]:
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
print(f"TensorFlow Version: {tf.__version__}")

2024-11-09 20:28:36.499503: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-11-09 20:28:36.612030: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1731202116.645054  491318 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1731202116.656536  491318 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-09 20:28:36.722996: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instr

Num GPUs Available:  1
TensorFlow Version: 2.18.0


In [4]:
!pip install --upgrade keras-nlp keras==3.6.0 keras-hub==0.17.0

Defaulting to user installation because normal site-packages is not writeable


In [5]:
import keras_nlp
import keras

print(f"Keras version: {keras.__version__}")
print(f"KerasNLP version: {keras_nlp.__version__}")

Keras version: 3.6.0
KerasNLP version: 0.17.0


In [7]:
import os
import keras_hub
import time
import zipfile


import tensorflow.data as tf_data
import tensorflow.strings as tf_strings

import tensorflow_text as tf_text
import collections

## Settings & hyperparameters

In [8]:
# Data
BATCH_SIZE = 64
MIN_STRING_LEN = 512  # Strings shorter than this will be discarded
SEQ_LEN = 128  # Length of training sequences, in tokens

# Model
EMBED_DIM = 256
FEED_FORWARD_DIM = 128
NUM_HEADS = 3
NUM_LAYERS = 2
VOCAB_SIZE = 5000  # Limits parameters in model.

# Training
EPOCHS = 5

# Inference
NUM_TOKENS_TO_GENERATE = 80

## Load the data

Now, let's download the dataset! The SimpleBooks dataset consists of 1,573 Gutenberg books, and has
one of the smallest vocabulary size to word-level tokens ratio. It has a vocabulary size of ~98k,
a third of WikiText-103's, with around the same number of tokens (~100M). This makes it easy to fit a small model.

In [9]:
"""keras.utils.get_file(
    origin="https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip",
    extract=True,
)
#dir = os.path.expanduser("~/.keras/datasets/simplebooks/") this is the original line of code
dir = os.path.join(os.path.expanduser("~/.keras/datasets"), "simplebooks", "simplebooks-92-raw")

# Load simplebooks-92 train set and filter out short lines.
raw_train_ds = (
    tf_data.TextLineDataset(dir + "train.txt")
    .filter(lambda x: tf_strings.length(x) > MIN_STRING_LEN)
    .batch(BATCH_SIZE)
    .shuffle(buffer_size=256)
)

# Load simplebooks-92 validation set and filter out short lines.
raw_val_ds = (
    tf_data.TextLineDataset(dir + "valid.txt")
    .filter(lambda x: tf_strings.length(x) > MIN_STRING_LEN)
    .batch(BATCH_SIZE)
)"""

'keras.utils.get_file(\n    origin="https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip",\n    extract=True,\n)\n#dir = os.path.expanduser("~/.keras/datasets/simplebooks/") this is the original line of code\ndir = os.path.join(os.path.expanduser("~/.keras/datasets"), "simplebooks", "simplebooks-92-raw")\n\n# Load simplebooks-92 train set and filter out short lines.\nraw_train_ds = (\n    tf_data.TextLineDataset(dir + "train.txt")\n    .filter(lambda x: tf_strings.length(x) > MIN_STRING_LEN)\n    .batch(BATCH_SIZE)\n    .shuffle(buffer_size=256)\n)\n\n# Load simplebooks-92 validation set and filter out short lines.\nraw_val_ds = (\n    tf_data.TextLineDataset(dir + "valid.txt")\n    .filter(lambda x: tf_strings.length(x) > MIN_STRING_LEN)\n    .batch(BATCH_SIZE)\n)'

In [9]:
#Get current working directory
cwd = os.getcwd()
#Download the dataset to the current working directory
file_path = keras.utils.get_file(
   fname="simplebooks.zip",
   origin="https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip"
,
   extract=False,  # Do not extract immediately
   cache_dir=cwd  # Save it in the current working directory
)
#Extract the zip file manually to the current working directory
with zipfile.ZipFile(file_path, 'r') as zip_ref:
   zip_ref.extractall(cwd)
#Now set the dataset directory based on your current working directory
dir = os.path.join(cwd, "simplebooks/")
#Load simplebooks-92 train set and filter out short lines.
raw_train_ds = (
   tf_data.TextLineDataset(dir + "simplebooks-92-raw/train.txt")
   .filter(lambda x: tf_strings.length(x) > MIN_STRING_LEN)
   .batch(BATCH_SIZE)
   .shuffle(buffer_size=256)
)
#Load simplebooks-92 validation set and filter out short lines.
raw_val_ds = (
   tf_data.TextLineDataset(dir + "simplebooks-92-raw/valid.txt")
   .filter(lambda x: tf_strings.length(x) > MIN_STRING_LEN)
   .batch(BATCH_SIZE)
)
print(f"Dataset extracted to: {dir}")

Dataset extracted to: /sfs/gpfs/tardis/home/aec4hr/Codeathon3/simplebooks/


I0000 00:00:1731202144.465121  491318 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 43483 MB memory:  -> device: 0, name: NVIDIA A40, pci bus id: 0000:63:00.0, compute capability: 8.6


## Train the tokenizer

We train the tokenizer from the training dataset for a vocabulary size of `VOCAB_SIZE`,
which is a tuned hyperparameter. We want to limit the vocabulary as much as possible, as
we will see later on
that it has a large effect on the number of model parameters. We also don't want to include
*too few* vocabulary terms, or there would be too many out-of-vocabulary (OOV) sub-words. In
addition, three tokens are reserved in the vocabulary:

- `"[PAD]"` for padding sequences to `SEQ_LEN`. This token has index 0 in both
`reserved_tokens` and `vocab`, since `WordPieceTokenizer` (and other layers) consider
`0`/`vocab[0]` as the default padding.
- `"[UNK]"` for OOV sub-words, which should match the default `oov_token="[UNK]"` in
`WordPieceTokenizer`.
- `"[BOS]"` stands for beginning of sentence, but here technically it is a token
representing the beginning of each line of training data.

In [10]:
# Train tokenizer vocabulary
vocab = keras_hub.tokenizers.compute_word_piece_vocabulary(
    raw_train_ds,
    vocabulary_size=VOCAB_SIZE,
    lowercase=True,
    reserved_tokens=["[PAD]", "[UNK]", "[BOS]"],
)

2024-11-09 20:29:54.374525: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


## Load tokenizer

We use the vocabulary data to initialize
`keras_hub.tokenizers.WordPieceTokenizer`. WordPieceTokenizer is an efficient
implementation of the WordPiece algorithm used by BERT and other models. It will strip,
lower-case and do other irreversible preprocessing operations.

In [11]:
tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    sequence_length=SEQ_LEN,
    lowercase=True,
)

## Tokenize data

We preprocess the dataset by tokenizing and splitting it into `features` and `labels`.

In [12]:
# packer adds a start token
start_packer = keras_hub.layers.StartEndPacker(
    sequence_length=SEQ_LEN,
    start_value=tokenizer.token_to_id("[BOS]"),
)


def preprocess(inputs):
    outputs = tokenizer(inputs)
    features = start_packer(outputs)
    labels = outputs
    return features, labels


# Tokenize and split into train and label sequences.
train_ds = raw_train_ds.map(preprocess, num_parallel_calls=tf_data.AUTOTUNE).prefetch(
    tf_data.AUTOTUNE
)
val_ds = raw_val_ds.map(preprocess, num_parallel_calls=tf_data.AUTOTUNE).prefetch(
    tf_data.AUTOTUNE
)

## Build the model

We create our scaled down GPT model with the following layers:

- One `keras_hub.layers.TokenAndPositionEmbedding` layer, which combines the embedding
for the token and its position.
- Multiple `keras_hub.layers.TransformerDecoder` layers, with the default causal masking.
The layer has no cross-attention when run with decoder sequence only.
- One final dense linear layer

In [13]:
inputs = keras.layers.Input(shape=(None,), dtype="int32")
# Embedding.
embedding_layer = keras_hub.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length=SEQ_LEN,
    embedding_dim=EMBED_DIM,
    mask_zero=True,
)
x = embedding_layer(inputs)
# Transformer decoders.
for _ in range(NUM_LAYERS):
    decoder_layer = keras_hub.layers.TransformerDecoder(
        num_heads=NUM_HEADS,
        intermediate_dim=FEED_FORWARD_DIM,
    )
    x = decoder_layer(x)  # Giving one argument only skips cross-attention.
# Output.
outputs = keras.layers.Dense(VOCAB_SIZE)(x)
model = keras.Model(inputs=inputs, outputs=outputs)
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
perplexity = keras_hub.metrics.Perplexity(from_logits=True, mask_token_id=0)
model.compile(optimizer="adam", loss=loss_fn, metrics=[perplexity])

Let's take a look at our model summary - a large majority of the
parameters are in the `token_and_position_embedding` and the output `dense` layer!
This means that the vocabulary size (`VOCAB_SIZE`) has a large effect on the size of the model,
while the number of Transformer decoder layers (`NUM_LAYERS`) doesn't affect it as much.

In [14]:
model.summary()

## Training

Now that we have our model, let's train it with the `fit()` method.

In [16]:
model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS)

Epoch 1/5


2024-11-09 09:35:51.799070: E tensorflow/core/util/util.cc:131] oneDNN supports DT_INT64 only on platforms with AVX-512. Falling back to the default Eigen-based implementation if present.
I0000 00:00:1731162959.363130  285871 service.cc:148] XLA service 0x7f9c88008cb0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1731162959.363198  285871 service.cc:156]   StreamExecutor device (0): NVIDIA A40, Compute Capability 8.6
2024-11-09 09:35:59.448800: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
W0000 00:00:1731162959.593244  285871 assert_op.cc:38] Ignoring Assert operator compile_loss/sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/assert_equal_1/Assert/Assert
W0000 00:00:1731162959.752901  285871 assert_op.cc:38] Ignoring Assert operator sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/as

     16/Unknown [1m14s[0m 10ms/step - loss: 7.7993 - perplexity: 2610.7571

I0000 00:00:1731162965.550960  285871 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


   2221/Unknown [1m54s[0m 18ms/step - loss: 5.0716 - perplexity: 195.1689

W0000 00:00:1731163005.508260  285871 assert_op.cc:38] Ignoring Assert operator compile_loss/sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/assert_equal_1/Assert/Assert
W0000 00:00:1731163005.619780  285871 assert_op.cc:38] Ignoring Assert operator sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/assert_equal_1/Assert/Assert




   2442/Unknown [1m60s[0m 19ms/step - loss: 5.0322 - perplexity: 186.8458

2024-11-09 09:36:51.924627: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
	 [[{{node IteratorGetNext}}]]
  self.gen.throw(typ, value, traceback)
W0000 00:00:1731163012.993253  285867 assert_op.cc:38] Ignoring Assert operator compile_loss/sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/assert_equal_1/Assert/Assert
W0000 00:00:1731163012.995957  285867 assert_op.cc:38] Ignoring Assert operator sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/assert_equal_1/Assert/Assert




[1m2445/2445[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m63s[0m 20ms/step - loss: 5.0316 - perplexity: 186.7414 - val_loss: 4.2358 - val_perplexity: 69.2680
Epoch 2/5


2024-11-09 09:36:54.606595: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
	 [[{{node IteratorGetNext}}]]
2024-11-09 09:36:54.606661: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 4327584292250278623
2024-11-09 09:36:54.606683: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 6965722924101409755


[1m2445/2445[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m46s[0m 17ms/step - loss: 4.1923 - perplexity: 66.2508 - val_loss: 4.1133 - val_perplexity: 61.2603
Epoch 3/5


2024-11-09 09:37:40.960132: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
	 [[{{node IteratorGetNext}}]]
2024-11-09 09:37:40.960224: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 4327584292250278623
2024-11-09 09:37:40.960241: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 6965722924101409755


[1m2445/2445[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m49s[0m 18ms/step - loss: 4.0402 - perplexity: 56.8730 - val_loss: 4.0260 - val_perplexity: 56.1025
Epoch 4/5


2024-11-09 09:38:30.166531: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 4327584292250278623
2024-11-09 09:38:30.166611: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 6965722924101409755


[1m2441/2445[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 19ms/step - loss: 3.9668 - perplexity: 52.8465

2024-11-09 09:39:22.737333: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 4327584292250278623


[1m2445/2445[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m53s[0m 19ms/step - loss: 3.9668 - perplexity: 52.8449 - val_loss: 3.9926 - val_perplexity: 54.2658
Epoch 5/5


2024-11-09 09:39:23.052006: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
	 [[{{node IteratorGetNext}}]]
2024-11-09 09:39:23.052084: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 4327584292250278623
2024-11-09 09:39:23.052100: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 6965722924101409755


[1m2445/2445[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m54s[0m 20ms/step - loss: 3.9180 - perplexity: 50.3251 - val_loss: 3.9583 - val_perplexity: 52.4085


2024-11-09 09:40:17.326623: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 4327584292250278623
2024-11-09 09:40:17.326704: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 6965722924101409755


<keras.src.callbacks.history.History at 0x7f9d14382710>

## Inference

With our trained model, we can test it out to gauge its performance. To do this
we can seed our model with an input sequence starting with the `"[BOS]"` token,
and progressively sample the model by making predictions for each subsequent
token in a loop.

To start lets build a prompt with the same shape as our model inputs, containing
only the `"[BOS]"` token.

In [17]:
# The "packer" layers adds the [BOS] token for us.
prompt_tokens = start_packer(tokenizer([""]))
prompt_tokens

<tf.Tensor: shape=(1, 128), dtype=int32, numpy=
array([[2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=int32)>

We will use the `keras_hub.samplers` module for inference, which requires a
callback function wrapping the model we just trained. This wrapper calls
the model and returns the logit predictions for the current token we are
generating.

Note: There are two pieces of more advanced functionality available when
defining your callback. The first is the ability to take in a `cache` of states
computed in previous generation steps, which can be used to speed up generation.
The second is the ability to output the final dense "hidden state" of each
generated token. This is used by `keras_hub.samplers.ContrastiveSampler`, which
avoids repetition by penalizing repeated hidden states. Both are optional, and
we will ignore them for now.

In [18]:

def next(prompt, cache, index):
    logits = model(prompt)[:, index - 1, :]
    # Ignore hidden states for now; only needed for contrastive search.
    hidden_states = None
    return logits, hidden_states, cache


Creating the wrapper function is the most complex part of using these functions. Now that
it's done, let's test out the different utilities, starting with greedy search.

### Greedy search

We greedily pick the most probable token at each timestep. In other words, we get the
argmax of the model output.

In [19]:
sampler = keras_hub.samplers.GreedySampler()
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,  # Start sampling immediately after the [BOS] token.
)
txt = tokenizer.detokenize(output_tokens)
print(f"Greedy search generated text: \n{txt}\n")

Greedy search generated text: 
['[BOS] " i have been thinking of the matter , " the captain said , " but i have been thinking of the matter over the matter of the saxons , and i have been able to tell you that the captain of the captain of the captain , who had been a shipwright , and had been in command of the ship , and had been a shipwright crew of the ship , and had been sent ashore , and had been sent ashore to the ship , and had been sent ashore to the ship , and had been sent ashore to the ship , and had been drowned in the ship , and had been drowned in the ship']



As you can see, greedy search starts out making some sense, but quickly starts repeating
itself. This is a common problem with text generation that can be fixed by some of the
probabilistic text generation utilities shown later on!

### Beam search

At a high-level, beam search keeps track of the `num_beams` most probable sequences at
each timestep, and predicts the best next token from all sequences. It is an improvement
over greedy search since it stores more possibilities. However, it is less efficient than
greedy search since it has to compute and store multiple potential sequences.

**Note:** beam search with `num_beams=1` is identical to greedy search.

In [20]:
sampler = keras_hub.samplers.BeamSampler(num_beams=10)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Beam search generated text: \n{txt}\n")

Beam search generated text: 
['[BOS] " it is true , " he said . " it is true that it is true , but it is true , and it is true that it is true . it is true that it is true , but it is true , and it is true that it is true . it seems to me that it is true , but it is true that it is true , and that it is not to say that it is true , but it is true that it is true , and that it is not to say that it is true , but it is true that it is true that it is true , and that it is true that it is not']



Similar to greedy search, beam search quickly starts repeating itself, since it is still
a deterministic method.

### Random search

Random search is our first probabilistic method. At each time step, it samples the next
token using the softmax probabilities provided by the model.

In [21]:
sampler = keras_hub.samplers.RandomSampler()
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Random search generated text: \n{txt}\n")

Random search generated text: 
["[BOS] the conversation , although they did not call . the household was soon settled on the quantities out of sorts and vain possessed themselves . the front , elsie had breathlessly distorted of time with her not what she wore , but one day large aside , and her hair had been thick . swift and mole children rode amid quite bright still . after that she had finished its feeds with her head and fell and made over what was the last look of things at the star ' bins placed over the handkerchief and waited stoinfully forward she sometimes in order to keep sitting across it . a cry went by the"]



Voilà, no repetitions! However, with random search, we may see some nonsensical words
appearing since any word in the vocabulary has a chance of appearing with this sampling
method. This is fixed by our next search utility, top-k search.

### Top-K search

Similar to random search, we sample the next token from the probability distribution
provided by the model. The only difference is that here, we select out the top `k` most
probable tokens, and distribute the probability mass over them before sampling. This way,
we won't be sampling from low probability tokens, and hence we would have less
nonsensical words!

In [22]:
sampler = keras_hub.samplers.TopKSampler(k=10)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Top-K search generated text: \n{txt}\n")

Top-K search generated text: 
['[BOS] " that \' s true , " he said , " i am sorry i will tell you that the two - - and you are not in a hurry to see me . you know what the matter would not be done . i have been going to be a prisoner . i have been thinking over here , i have been trying to find out what we should have done . i am sure i will tell you what to say , my brother , and i shall not do the work of the same thing . he does not know how you came back from my father and mother , as the two brothers of the gods and my youngest , you see the']



### Top-P search

Even with the top-k search, there is something to improve upon. With top-k search, the
number `k` is fixed, which means it selects the same number of tokens for any probability
distribution. Consider two scenarios, one where the probability mass is concentrated over
2 words and another where the probability mass is evenly concentrated across 10. Should
we choose `k=2` or `k=10`? There is no one size that fits all `k` here.

This is where top-p search comes in! Instead of choosing a `k`, we choose a probability
`p` that we want the probabilities of the top tokens to sum up to. This way, we can
dynamically adjust the `k` based on the probability distribution. By setting `p=0.9`, if
90% of the probability mass is concentrated on the top 2 tokens, we can filter out the
top 2 tokens to sample from. If instead the 90% is distributed over 10 tokens, it will
similarly filter out the top 10 tokens to sample from.

In [23]:
sampler = keras_hub.samplers.TopPSampler(p=0.5)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Top-P search generated text: \n{txt}\n")

Top-P search generated text: 
['[BOS] " that is the true name of the name of the keel , and that it is not that i know you are my own , and it is very much to you , that , i believe it is very difficult to find it difficult to get out of it , and not that you will be a good thing to do for you to live here , and not often with you in the way of travelling with a great deal of rest , and it is to be found that the two boys have to look for them . i don \' t believe that the child has a bad heart and makes a little squeak of the mut']



### Using callbacks for text generation

We can also wrap the utilities in a callback, which allows you to print out a prediction
sequence for every epoch of the model! Here is an example of a callback for top-k search:

In [24]:

class TopKTextGenerator(keras.callbacks.Callback):
    """A callback to generate text from a trained model using top-k."""

    def __init__(self, k):
        self.sampler = keras_hub.samplers.TopKSampler(k)

    def on_epoch_end(self, epoch, logs=None):
        output_tokens = self.sampler(
            next=next,
            prompt=prompt_tokens,
            index=1,
        )
        txt = tokenizer.detokenize(output_tokens)
        print(f"Top-K search generated text: \n{txt}\n")


text_generation_callback = TopKTextGenerator(k=10)
# Dummy training loop to demonstrate callback.
model.fit(train_ds.take(1), verbose=2, epochs=2, callbacks=[text_generation_callback])

Epoch 1/2


2024-11-09 09:40:39.330675: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 4327584292250278623
2024-11-09 09:40:39.330770: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 6965722924101409755


Top-K search generated text: 
['[BOS] " it was not until she came to the palace in the hall , where the sun was shining . she sat looking at her mother \' s face , with a black hairy blue , which had been tied round a necklace . she had been a good many times and had been combed by the spruce tree . a long and low bush of her tail - - the white - haired - - but a little , with the black shadows on the floor . it looked like a little white bear in her eyes . her head was very thin , and she was a great , white and white beard , and her hair']

1/1 - 9s - 9s/step - loss: 3.7595 - perplexity: 42.9551
Epoch 2/2


2024-11-09 09:40:46.814181: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 4327584292250278623
2024-11-09 09:40:46.814265: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 6965722924101409755


Top-K search generated text: 
['[BOS] " i am not going to be allowed to have a little organized house at the same time , and when the preceding evening comes back with you ; but it is so well to be the case , and the other , when it comes to the conclusion that i will not be much more than you will not let the confinement of any kind . in that i am not to say to you , i have been in a prussy , for i cannot say that i cannot explain that it has to me . it may be that i have been more fortunate , in my life , and it may']

1/1 - 7s - 7s/step - loss: 3.8710 - perplexity: 48.0090


<keras.src.callbacks.history.History at 0x7f9c927026d0>

# GPT2-Model 

In [25]:
keras.mixed_precision.set_global_policy("mixed_float16")

# To speed up training and generation, we use preprocessor of length 128
# instead of full length 1024.
preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(
    "gpt2_base_en",
    sequence_length=128,
)
gpt2_lm = keras_nlp.models.GPT2CausalLM.from_preset(
    "gpt2_base_en", preprocessor=preprocessor
)



In [33]:
type(train_ds)

tensorflow.python.data.ops.prefetch_op._PrefetchDataset

In [36]:
num_epochs = 3


# Linearly decaying learning rate.
learning_rate = keras.optimizers.schedules.PolynomialDecay(
    5e-5,
    decay_steps= 200 * num_epochs,
    end_learning_rate=0.0,
)

loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)

gpt2_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    loss=loss,
    weighted_metrics=["accuracy"],
)

gpt2_lm.fit(raw_train_ds, epochs=num_epochs)

Epoch 1/3


W0000 00:00:1731167352.601774  285864 assert_op.cc:38] Ignoring Assert operator compile_loss/sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/assert_equal_1/Assert/Assert


   2327/Unknown [1m601s[0m 240ms/step - accuracy: 0.3069 - loss: 3.6951

W0000 00:00:1731167931.382240  285864 assert_op.cc:38] Ignoring Assert operator compile_loss/sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/assert_equal_1/Assert/Assert


   2444/Unknown [1m649s[0m 248ms/step - accuracy: 0.3071 - loss: 3.6933

2024-11-09 10:59:38.105590: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
	 [[{{node IteratorGetNext}}]]


[1m2445/2445[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m649s[0m 248ms/step - accuracy: 0.3071 - loss: 3.6933
Epoch 2/3
[1m2445/2445[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m593s[0m 240ms/step - accuracy: 0.3128 - loss: 3.6326
Epoch 3/3
[1m2445/2445[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m593s[0m 240ms/step - accuracy: 0.3129 - loss: 3.6343


<keras.src.callbacks.history.History at 0x7f9c903780d0>

# Bert Model For Text Generation

In [15]:
# Will Used a BertMaskedLM

keras_hub.models.BertMaskedLM(backbone, preprocessor=None, **kwargs)

Downloading from https://www.kaggle.com/api/v1/models/keras/bert/keras/bert_base_en/2/download/config.json...


100%|██████████| 510/510 [00:00<00:00, 546kB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/bert/keras/bert_base_en/2/download/model.weights.h5...


100%|██████████| 414M/414M [00:04<00:00, 89.8MB/s] 


Downloading from https://www.kaggle.com/api/v1/models/keras/bert/keras/bert_base_en/2/download/tokenizer.json...


100%|██████████| 548/548 [00:00<00:00, 1.33MB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/bert/keras/bert_base_en/2/download/assets/tokenizer/vocabulary.txt...


100%|██████████| 208k/208k [00:00<00:00, 7.86MB/s]


In [19]:

# Pretrained language model.
masked_lm = keras_hub.models.BertMaskedLM.from_preset(
    "bert_base_en_uncased",
)
#masked_lm.fit(x=raw_train_ds)

# Re-compile (e.g., with a new learning rate).
masked_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(5e-5),
    jit_compile=True,
)
# Access backbone programmatically (e.g., to change `trainable`).
masked_lm.backbone.trainable = False
# Fit again.
masked_lm.fit(x=raw_train_ds)

W0000 00:00:1731205344.972978  491620 assert_op.cc:38] Ignoring Assert operator compile_loss/sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/assert_equal_1/Assert/Assert


   2320/Unknown [1m1181s[0m 500ms/step - loss: 1.2632 - sparse_categorical_accuracy: 0.3731

W0000 00:00:1731206515.037613  491615 assert_op.cc:38] Ignoring Assert operator compile_loss/sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/assert_equal_1/Assert/Assert



   2444/Unknown [1m1254s[0m 504ms/step - loss: 1.2486 - sparse_categorical_accuracy: 0.3777

2024-11-09 21:43:06.775721: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
	 [[{{node IteratorGetNext}}]]
2024-11-09 21:43:06.775825: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 3076185181018547487
  self.gen.throw(typ, value, traceback)


[1m2445/2445[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1254s[0m 504ms/step - loss: 1.2483 - sparse_categorical_accuracy: 0.3777


<keras.src.callbacks.history.History at 0x7eff0c524c10>