<a href="https://colab.research.google.com/github/Anurag20072002/Data-Dreamers-5201/blob/main/Copy_of_gpt2_text_generation_with_kerasnlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GPT2 Text Generation with KerasNLP

**Author:** Chen Qian<br>
**Date created:** 2023/04/17<br>
**Last modified:** 2024/04/12<br>
**Description:** Use KerasNLP GPT2 model and `samplers` to do text generation.

In this tutorial, you will learn to use [KerasNLP](https://keras.io/keras_nlp/) to load a
pre-trained Large Language Model (LLM) - [GPT-2 model](https://openai.com/research/better-language-models)
(originally invented by OpenAI), finetune it to a specific text style, and
generate text based on users' input (also known as prompt). You will also learn
how GPT2 adapts quickly to non-English languages, such as Chinese.

##  Before we begin

Colab offers different kinds of runtimes. Make sure to go to **Runtime ->
Change runtime type** and choose the GPU Hardware Accelerator runtime
(which should have >12G host RAM and ~15G GPU RAM) since you will finetune the
GPT-2 model. Running this tutorial on CPU runtime will take hours.

## Install KerasNLP, Choose Backend and Import Dependencies

This example uses [Keras 3](https://keras.io/keras_3/) to work in any of
`"tensorflow"`, `"jax"` or `"torch"`. Support for Keras 3 is baked into
KerasNLP; simply change the `"KERAS_BACKEND"` environment variable to select
the backend of your choice. We select the JAX backend below.
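
As a minimal sketch of the selection step: the variable has to be set before `keras` is first imported, because Keras 3 reads it at import time. The helper name `select_backend` and the `SUPPORTED_BACKENDS` tuple are hypothetical and only cover the three backends named above.

```python
import os

# The three backends named in the text above (an assumption of this sketch).
SUPPORTED_BACKENDS = ("tensorflow", "jax", "torch")


def select_backend(name):
    """Hypothetical helper: set KERAS_BACKEND, refusing names outside the tuple."""
    if name not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"Unknown backend {name!r}; expected one of {SUPPORTED_BACKENDS}"
        )
    # Must run before `import keras`; changing it afterwards has no effect.
    os.environ["KERAS_BACKEND"] = name
    return name


select_backend("jax")
print(os.environ["KERAS_BACKEND"])
```

Validating the name up front turns a typo into an immediate error instead of a failure later at import time.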

In [None]:
!pip install git+https://github.com/keras-team/keras-nlp.git -q

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [None]:
import os

os.environ["KERAS_BACKEND"] = "jax"  # or "tensorflow" or "torch"

import keras_nlp
import keras
import tensorflow as tf
import time

keras.mixed_precision.set_global_policy("mixed_float16")

## Introduction to Generative Large Language Models (LLMs)

Large language models (LLMs) are a type of machine learning model that is
trained on a large corpus of text data to generate outputs for various natural
language processing (NLP) tasks, such as text generation, question answering,
and machine translation.

Generative LLMs are typically based on deep learning neural networks, such as
the [Transformer architecture](https://arxiv.org/abs/1706.03762) invented by
Google researchers in 2017, and are trained on massive amounts of text data,
often involving billions of words. These models, such as Google [LaMDA](https://blog.google/technology/ai/lamda/)
and [PaLM](https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html),
are trained with a large dataset from various data sources which allows them to
generate output for many tasks. The core of Generative LLMs is predicting the
next word in a sentence, often referred to as **Causal LM Pretraining**. In this
way LLMs can generate coherent text based on user prompts. For a more
pedagogical discussion on language models, you can refer to the
[Stanford CS324 LLM class](https://stanford-cs324.github.io/winter2022/lectures/introduction/).
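
To make "predicting the next word" concrete, here is a toy sketch that is not how GPT-2 works internally: it replaces the neural network with simple bigram counts, but the generation loop (predict the next word, append it, repeat) has the same shape. The corpus and the function names are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.split()
    follower_counts = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        follower_counts[current_word][next_word] += 1
    return follower_counts


def generate(follower_counts, prompt, max_words):
    """Repeatedly append the most frequent follower of the last word."""
    words = prompt.split()
    while len(words) < max_words:
        followers = follower_counts.get(words[-1])
        if not followers:
            break  # The last word was never seen with a follower.
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)


corpus = "the cat sat on the mat . the cat sat down ."
model = train_bigram_model(corpus)
print(generate(model, "the", max_words=3))  # -> the cat sat
```

An LLM differs in that it conditions on the whole preceding context rather than only the last word, and it predicts sub-word tokens rather than whole words.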

## Introduction to KerasNLP

Large Language Models are complex to build and expensive to train from scratch.
Luckily there are pretrained LLMs available for use right away. [KerasNLP](https://keras.io/keras_nlp/)
provides a large number of pre-trained checkpoints that allow you to experiment
with SOTA models without needing to train them yourself.

KerasNLP is a natural language processing library that supports users through
their entire development cycle. KerasNLP offers both pretrained models and
modularized building blocks, so developers can easily reuse pretrained models
or stack their own LLM.

In a nutshell, for generative LLM, KerasNLP offers:

- Pretrained models with `generate()` method, e.g.,
    `keras_nlp.models.GPT2CausalLM` and `keras_nlp.models.OPTCausalLM`.
- Sampler class that implements generation algorithms such as Top-K, Beam and
    contrastive search. These samplers can be used to generate text with
    custom models.

## Load a pre-trained GPT-2 model and generate some text

KerasNLP provides a number of pre-trained models, such as [Google
Bert](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html)
and [GPT-2](https://openai.com/research/better-language-models). You can see
the list of models available in the [KerasNLP repository](https://github.com/keras-team/keras-nlp/tree/master/keras_nlp/models).

It's very easy to load the GPT-2 model as you can see below:

In [None]:
# To speed up training and generation, we use a preprocessor with a sequence
# length of 128 instead of the full length of 1024.
preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(
    "gpt2_base_en",
    sequence_length=128,
)
gpt2_lm = keras_nlp.models.GPT2CausalLM.from_preset(
    "gpt2_base_en", preprocessor=preprocessor
)

Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/2/download/model.safetensors...
Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/2/download/model.safetensors.index.json...
Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/2/download/metadata.json...
100%|██████████| 141/141 [00:00<00:00, 124kB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/2/download/preprocessor.json...
Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/2/download/tokenizer.json...
100%|██████████| 448/448 [00:00<00:00, 494kB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/2/download/assets/tokenizer/vocabulary.json...
100%|██████████| 0.99M/0.99M [00:00<00:00, 1.80MB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/2/download/assets/tokenizer/merges.txt...
100%|█████

Once the model is loaded, you can use it to generate some text right away. Run
the cells below to give it a try. It's as simple as calling a single function
*generate()*:

In [None]:
start = time.time()

output = gpt2_lm.generate("My trip to Yosemite was", max_length=200)
print("\nGPT-2 output:")
print(output)

end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")


GPT-2 output:
My trip to Yosemite was one of those things where it's easy to see why the world is full of amazing places, but it's hard to see the whole picture when you have the whole world in front of you. I spent a great deal of time in Yosemite and I've never had to walk through the park in the whole day, and there were some amazing things I could not get a good look at. I've also had a lot of fun with the Yosemite National Park. It was one of those things where it's easy to see why the world is full of amazing places, but it's hard to see the whole picture when you have the whole world in front of you.

I've had a lot of fun with the Yosemite National Park. It was one of those things where it's easy to see why the world is full of amazing places, but it's hard to see the whole picture when you have the whole world in front of you.

I love the
TOTAL TIME ELAPSED: 10.02s


Try another one:

In [None]:
start = time.time()

output = gpt2_lm.generate("That Italian restaurant is", max_length=200)
print("\nGPT-2 output:")
print(output)

end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")


GPT-2 output:
That Italian restaurant is the best in town and the best in town for a good Italian meal! It's a great place to eat Italian food. We have Italian dishes on the menu, but we have no complaints here.

The menu is a lot of what you expect from Italian restaurants, but there are some really good things about this one. It has a great Italian restaurant with great food, and a good atmosphere. The food is very good. It's a nice place to go for lunch and dinner. The service is very friendly. The food is good and the atmosphere is good. The staff is nice, and the food is good. The service is very good and the atmosphere is good. The restaurant is a little crowded, but the service is good, the food is excellent, and the atmosphere is good.

This place is the perfect place for a quick dinner. The food is good and the atmosphere is nice. The place has an Italian vibe and I love the place
TOTAL TIME ELAPSED: 2.01s


Notice how much faster the second call is. This is because the computational
graph is [XLA compiled](https://www.tensorflow.org/xla) in the 1st run and
re-used in the 2nd behind the scenes.

The quality of the generated text looks OK, but we can improve it via
fine-tuning.

## More on the GPT-2 model from KerasNLP

Next up, we will actually fine-tune the model to update its parameters, but
before we do, let's take a look at the full set of tools we have for working
with GPT2.

The code of GPT2 can be found
[here](https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/models/gpt2/).
Conceptually the `GPT2CausalLM` can be hierarchically broken down into several
modules in KerasNLP, all of which have a *from_preset()* function that loads a
pretrained model:

- `keras_nlp.models.GPT2Tokenizer`: The tokenizer used by the GPT2 model, which is a
    [byte-pair encoder](https://huggingface.co/course/chapter6/5?fw=pt).
- `keras_nlp.models.GPT2CausalLMPreprocessor`: The preprocessor used for GPT2
    causal LM training. It does the tokenization along with other preprocessing
    work such as creating the labels and appending the end token.
- `keras_nlp.models.GPT2Backbone`: The GPT2 model, which is a stack of
    `keras_nlp.layers.TransformerDecoder` layers. This is usually just referred
    to as `GPT2`.
- `keras_nlp.models.GPT2CausalLM`: Wraps `GPT2Backbone`. It multiplies the
    output of `GPT2Backbone` by the embedding matrix to generate logits over
    vocab tokens.
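
The "creating the labels and appending the end token" step can be sketched in plain Python. This is an illustration under stated assumptions, not the KerasNLP implementation: the function name, the token ids, and the padding id `0` are made up, and the real preprocessor differs in details (for example, it may also prepend a start token).

```python
def make_causal_lm_example(token_ids, sequence_length, end_token_id, pad_token_id=0):
    """Build (inputs, labels, weights) for next-token training from one sequence."""
    # Append the end token, then truncate so inputs and labels both fit.
    ids = (list(token_ids) + [end_token_id])[: sequence_length + 1]
    num_real = len(ids)
    # Pad to sequence_length + 1 so that shifting leaves sequence_length positions.
    ids = ids + [pad_token_id] * (sequence_length + 1 - num_real)
    inputs = ids[:-1]  # What the model sees at each position.
    labels = ids[1:]   # The token it should predict at that position.
    # A position counts toward the loss only if its label is a real token.
    weights = [1 if i + 1 < num_real else 0 for i in range(sequence_length)]
    return inputs, labels, weights


# Hypothetical token ids for a three-token sentence; 99 stands in for the end token.
inputs, labels, weights = make_causal_lm_example([5, 6, 7], 6, end_token_id=99)
print(inputs)   # -> [5, 6, 7, 99, 0, 0]
print(labels)   # -> [6, 7, 99, 0, 0, 0]
print(weights)  # -> [1, 1, 1, 0, 0, 0]
```

The labels are simply the inputs shifted left by one position, which is what makes the task "predict the next token". The weights keep padded positions out of the loss.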

## Finetune on IMDB reviews dataset

Now that you know the GPT-2 model from KerasNLP, you can take one
step further and finetune the model so that it generates text in a specific
style, short or long, strict or casual. In this tutorial, we will use the IMDB
reviews dataset as an example.

In [None]:
import tensorflow_datasets as tfds

imdb_reviews_ds = tfds.load("imdb_reviews", split="train", as_supervised=True)

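Let's take a look inside sample data from the IMDB reviews TensorFlow Dataset.
Because we load it with `as_supervised=True`, each example is a pair of two
features:

- **text**: the text of the movie review.
- **label**: the sentiment label (0 for negative, 1 for positive).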

In [None]:
for document, label in imdb_reviews_ds:
    print(document.numpy())
    print(label.numpy())
    break

b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
0


In our case, we are performing next word prediction in a language model, so we
only need the review text, not the label.

In [None]:
train_ds = (
    imdb_reviews_ds.map(lambda document, _: document)
    .batch(32)
    .cache()
    .prefetch(tf.data.AUTOTUNE)
)

Now you can finetune the model using the familiar *fit()* function. Note that
`preprocessor` will be automatically called inside `fit` method since
`GPT2CausalLM` is a `keras_nlp.models.Task` instance.

This step takes quite a bit of GPU memory and a long time if we were to train
it all the way to a fully trained state. Here we just use part of the dataset
for demo purposes.

In [None]:
train_ds = train_ds.take(500)
num_epochs = 1

# Linearly decaying learning rate.
learning_rate = keras.optimizers.schedules.PolynomialDecay(
    5e-5,
    decay_steps=train_ds.cardinality() * num_epochs,
    end_learning_rate=0.0,
)
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
gpt2_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    loss=loss,
    weighted_metrics=["accuracy"],
)

gpt2_lm.fit(train_ds, epochs=num_epochs)

500/500 ━━━━━━━━━━━━━━━━━━━━ 203s 335ms/step - accuracy: 0.3225 - loss: 3.6373


<keras.src.callbacks.history.History at 0x7d4c25bef250>
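
The "linearly decaying learning rate" used in the training cell can be written out by hand. The sketch below assumes the schedule's default `power` of 1, in which case the polynomial decay reduces to a straight line from the initial rate to the end rate. The function name `linear_decay` is made up for illustration.

```python
def linear_decay(step, initial_learning_rate, decay_steps, end_learning_rate=0.0):
    """Learning rate at `step` for a polynomial decay with power 1."""
    # After decay_steps the rate stays at end_learning_rate.
    step = min(step, decay_steps)
    fraction_left = 1.0 - step / decay_steps
    return (initial_learning_rate - end_learning_rate) * fraction_left + end_learning_rate


# 500 batches per epoch times 1 epoch, matching the cell above.
decay_steps = 500 * 1
for step in (0, 250, 500):
    print(step, linear_decay(step, 5e-5, decay_steps))
```

With these settings the rate starts at `5e-5`, is half of that midway through training, and reaches `0.0` on the last step.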

After fine-tuning is finished, you can again generate text using the same
*generate()* function. This time, the text will be closer to the IMDB
movie-review writing style, and the generated length will be close to our
preset length in the training set.

In [None]:
start = time.time()

output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)

end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")


GPT-2 output:
I like basketball, but I don't like watching it. The movie is very bad and I don't like the characters. I like the acting, but it doesn't make sense. The acting sucks, but there are a few good ones. The only one that really made sense was when the two girls get together and start to kiss each other. I think they are going to get married soon, but it seems like it's not going to happen. I think that it is going to be a very bad movie.
TOTAL TIME ELAPSED: 7.23s


## Into the Sampling Method

In KerasNLP, we offer a few sampling methods, e.g., contrastive search,
Top-K and beam sampling. By default, our `GPT2CausalLM` uses Top-K search, but
you can choose your own sampling method.

Much like optimizers and activations, there are two ways to specify your custom
sampler:

- Use a string identifier, such as "greedy". This uses the sampler's default
configuration.
- Pass a `keras_nlp.samplers.Sampler` instance. This lets you use a custom
configuration.

In [None]:
# Use a string identifier.
gpt2_lm.compile(sampler="top_k")
output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)

# Use a `Sampler` instance. `GreedySampler` tends to repeat itself.
greedy_sampler = keras_nlp.samplers.GreedySampler()
gpt2_lm.compile(sampler=greedy_sampler)

output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)


GPT-2 output:
I like basketball, but I don't like the way it's done. I'm not a big fan of the NBA, but I'm a fan of basketball. I think this film is a great way to get some laughs out of this movie. The film was filmed on a rainy night with no sunlight and I had to sit and wait for the rain to fall. I think I was in a good mood, I just couldn't sit still. I was in awe of the film, I was amazed how it was made and I was so glad it was made. I think it has the potential of making a great movie, I hope to see

GPT-2 output:
I like basketball, but I don't like the movie. I don't like the movie. I don't like the movie. I don't like the movie. I don't like the movie. I don't like the movie. I don't like the movie. I don't like the movie. I don't like the movie. I don't like the movie. I don't like the movie. I don't like the movie. I don't like the movie. I don't like the movie. I don't like the movie. I don't like the movie. I don't like the movie. I don't like the movie. I don't like the 

For more details on KerasNLP `Sampler` class, you can check the code
[here](https://github.com/keras-team/keras-nlp/tree/master/keras_nlp/samplers).
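
To see why the two samplers above behave so differently, here is a plain-Python sketch of a single decoding step. It is not the KerasNLP implementation: the function names and the four logits are made up, and real samplers operate on batched tensors.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Always return the index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def top_k_pick(logits, k, rng):
    """Sample among the k highest-scoring tokens, weighted by probability."""
    top_indices = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probabilities = softmax([logits[i] for i in top_indices])
    return rng.choices(top_indices, weights=probabilities, k=1)[0]


logits = [2.0, 1.5, 0.3, -1.0]  # Made-up scores for a 4-token vocabulary.
rng = random.Random(0)
print(greedy_pick(logits))
print([top_k_pick(logits, k=2, rng=rng) for _ in range(10)])
```

Greedy decoding is deterministic, so once the model enters a loop it keeps producing the same continuation. Top-K adds randomness but restricts it to the most likely tokens.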

## Finetune on Chinese Poem Dataset

We can also finetune GPT2 on non-English datasets. For readers who know Chinese,
this part illustrates how to fine-tune GPT2 on a Chinese poem dataset to teach our
model to become a poet!

Because GPT2 uses a byte-pair encoder, and the original pretraining dataset
contains some Chinese characters, we can use the original vocab to finetune on
a Chinese dataset.
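
One reason the original vocab can still represent Chinese text is that GPT-2's byte-pair encoding starts from UTF-8 bytes, so any string can be broken down into known base symbols. The sketch below only shows that byte-level fallback, not the learned merges. The function name is made up for illustration.

```python
def to_byte_tokens(text):
    """Split text into its UTF-8 bytes, the base symbols of a byte-level BPE."""
    return list(text.encode("utf-8"))


for sample in ("rain", "雨"):
    byte_tokens = to_byte_tokens(sample)
    print(sample, "->", len(sample), "character(s),", len(byte_tokens), "byte(s)")
    # The bytes decode back to the original text, so nothing is lost.
    assert bytes(byte_tokens).decode("utf-8") == sample
```

A Chinese character takes several bytes, so Chinese text tends to use more tokens per character than English. A sequence length of 128 therefore covers less text.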

In [None]:
# Load the Chinese poetry dataset.
!git clone https://github.com/chinese-poetry/chinese-poetry.git

Cloning into 'chinese-poetry'...
remote: Enumerating objects: 7323, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 7323 (delta 5), reused 10 (delta 3), pack-reused 7309
Receiving objects: 100% (7323/7323), 236.98 MiB | 22.95 MiB/s, done.
Resolving deltas: 100% (5003/5003), done.
Updating files: 100% (2285/2285), done.


Load text from the JSON files. We only use 《全唐诗》 (Complete Tang Poems) for demo purposes.

In [None]:
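import os
import json

poem_dir = "chinese-poetry/全唐诗"
poem_collection = []
for file in os.listdir(poem_dir):
    # Keep only the JSON files that contain poems.
    if ".json" not in file or "poet" not in file:
        continue
    full_filename = os.path.join(poem_dir, file)
    # Read as UTF-8 explicitly so the platform default encoding is not used.
    with open(full_filename, "r", encoding="utf-8") as f:
        content = json.load(f)
        poem_collection.extend(content)

# Join each poem's lines into a single string.
paragraphs = ["".join(data["paragraphs"]) for data in poem_collection]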

Let's take a look at sample data.

In [None]:
print(paragraphs[0])

短檠三尺照座隅，眵昏兩目頭不梳。丈夫功業務廣大，安用事此牛尾書。多求舊聞助器識，欲駕萬里須舟輿。要堅志節在專苦，積螢照夜真前車。道鄉先生好門戶，髯季晚出充門閭。昻昻鷄群見野鶴，炯炯虎視嗟黔驢。如何天公不着眼，棄此異寶猶紛挐。我知造物自有意，將騁健駿先虛徐。金須百錬作鐘鼎，玉試三火真璠璵。來年明光再射策，聊取髙第酬三餘。古今人事自差别，見晚用速皆乘除。他年雲路着鞭穩，無忘過我中田廬。


Next, we convert the text to a TF dataset and use only part of the data
to train.

In [None]:
train_ds = (
    tf.data.Dataset.from_tensor_slices(paragraphs)
    .batch(16)
    .cache()
    .prefetch(tf.data.AUTOTUNE)
)

# Running through the whole dataset takes a long time, so only take `500`
# batches and run 1 epoch for demo purposes.
train_ds = train_ds.take(500)
num_epochs = 1

learning_rate = keras.optimizers.schedules.PolynomialDecay(
    5e-4,
    decay_steps=train_ds.cardinality() * num_epochs,
    end_learning_rate=0.0,
)
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
gpt2_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    loss=loss,
    weighted_metrics=["accuracy"],
)

gpt2_lm.fit(train_ds, epochs=num_epochs)

500/500 ━━━━━━━━━━━━━━━━━━━━ 143s 233ms/step - accuracy: 0.2356 - loss: 2.9264


<keras.src.callbacks.history.History at 0x7d4c26277580>

Let's check the result!

In [None]:
output = gpt2_lm.generate("昨夜雨疏风骤", max_length=200)
print(output)

昨夜雨疏风骤馬，萬渾書風萬秋。書頭秋秋萬風，居翠江倚汝頻。秋秋曲萬霜著，江風書爭欲樂。


## Titanic dataset

In [None]:
import json

# Load the Titanic text data from a local JSON file.
with open('/content/titanic.json', 'r', encoding='utf-8') as f:
    titanic_data = json.load(f)

In [None]:
# Create a TensorFlow dataset from the JSON data
import tensorflow as tf
texts = titanic_data["text"]
titanic_ds = tf.data.Dataset.from_tensor_slices(texts)

In [None]:
# Prepare the dataset for training
train_ds = (
    titanic_ds
    .batch(32)
    .cache()
    .prefetch(tf.data.AUTOTUNE)
)

In [None]:
# Take a subset for training (optional, depending on your needs)
train_ds = train_ds.take(1000)
num_epochs = 1

# Configure a dynamically decreasing learning rate based on linear decay
learning_rate = keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=5e-5,
    decay_steps=train_ds.cardinality() * num_epochs,
    end_learning_rate=0.0,
)

In [None]:
# Define the loss function
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Compile the model. This reuses the `gpt2_lm` instance loaded earlier, so
# training continues from the already fine-tuned weights.
gpt2_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    loss=loss,
    weighted_metrics=["accuracy"],
)
# Train the model
gpt2_lm.fit(train_ds, epochs=num_epochs)

2/2 ━━━━━━━━━━━━━━━━━━━━ 35s 18s/step - accuracy: 0.1468 - loss: 0.8197


<keras.src.callbacks.history.History at 0x7d4c0afc78e0>

In [None]:
start = time.time()
output = gpt2_lm.generate("I'm the king", max_length=200)
print("\nGPT-2 output:")
print(output)
end = time.time()
print(f"TOTAL TIME ELAPSED: {end-start:.2f}s")


GPT-2 output:
I'm the king of the world, I don
TOTAL TIME ELAPSED: 0.18s


## Real-time dataset

In [None]:
import json
with open('/content/realtime_dataset.json', 'r', encoding='utf-8') as f:
    realtime_data=json.load(f)

In [None]:
# Extract texts from the dataset
texts = [realtime_data.get('text', '')]  # Access the 'text' value directly, or provide an empty string if it doesn't exist

# Create a TensorFlow dataset from the texts
realtime_ds = tf.data.Dataset.from_tensor_slices(texts)

# Prepare the dataset for training
train_ds = (
    realtime_ds
    .batch(32)
    .cache()
    .prefetch(tf.data.AUTOTUNE)
)

In [None]:
# Take a subset for training (optional, depending on your needs)
train_ds = train_ds.take(1000)
num_epochs = 3

# Configure a dynamically decreasing learning rate based on linear decay
learning_rate = keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=5e-5,
    decay_steps=train_ds.cardinality() * num_epochs,
    end_learning_rate=0.0,
)

# Define the loss function
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Model compilation
gpt2_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    loss=loss,
    weighted_metrics=["accuracy"],
)

# Train the model
gpt2_lm.fit(train_ds, epochs=num_epochs)

Epoch 1/3
1/1 ━━━━━━━━━━━━━━━━━━━━ 21s 21s/step - accuracy: 0.2031 - loss: 4.5116
Epoch 2/3
1/1 ━━━━━━━━━━━━━━━━━━━━ 16s 16s/step - accuracy: 0.2578 - loss: 4.6216
Epoch 3/3
1/1 ━━━━━━━━━━━━━━━━━━━━ 2s 2s/step - accuracy: 0.2188 - loss: 4.6375


<keras.src.callbacks.history.History at 0x7d4c0c2318d0>

In [None]:
# Inference
start = time.time()
output = gpt2_lm.generate("the god is", max_length=200)
print("\nGPT-2 output:")
print(output)
end = time.time()
print(f"TOTAL TIME ELAPSED: {end-start:.2f}s")


GPT-2 output:
the god is it the best of the old, the most important of the world, and I can't say it doesn
TOTAL TIME ELAPSED: 0.33s
