## Installation: `keras-hub`

In [1]:
! pip install git+https://github.com/keras-team/keras-hub.git -q

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for keras-hub (pyproject.toml) ... done
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tf-keras 2.17.0 requires tensorflow<2.18,>=2.17, but you have tensorflow 2.18.0 which is incompatible.

Large language models (LLMs) are complex to build and expensive to train from scratch. Fortunately, pretrained LLMs are available for immediate use. KerasHub provides a large number of pretrained checkpoints that let you experiment with state-of-the-art (SOTA) models without training them yourself.

KerasHub is a natural language processing library that supports users through their entire development cycle. It offers both pretrained models and modular building blocks, so developers can easily reuse pretrained models or assemble their own LLM.

In a nutshell, for generative LLMs, KerasHub offers:

- Pretrained models with a `generate()` method, e.g., `keras_hub.models.GPT2CausalLM` and `keras_hub.models.OPTCausalLM`.
- A `Sampler` class that implements generation algorithms such as top-k, beam, and contrastive search. These samplers can be used to generate text with custom models.
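
To make the idea of a sampler concrete, here is a minimal pure-Python sketch of top-k sampling. It is an illustration only, not KerasHub's implementation: the function name `top_k_sample`, the made-up logits, and the five-token vocabulary are all assumptions for the example.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample a token id from the k highest-scoring logits."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits only (subtract the max for stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    # Draw one index in proportion to its renormalized probability.
    return rng.choices(top, weights=weights, k=1)[0]


logits = [2.0, 0.5, -1.0, 3.0, 0.1]  # made-up scores for a 5-token vocabulary
rng = random.Random(0)
samples = [top_k_sample(logits, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))  # only token ids 0 and 3 can appear
```

With `k=2`, only the two highest-scoring tokens are ever chosen; a smaller `k` makes the output more predictable, and a larger `k` makes it more varied.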

## Import

In [1]:
import os

os.environ["KERAS_BACKEND"] = "jax"  # or "tensorflow" or "torch"

import keras_hub
import keras
import tensorflow as tf
import time

keras.mixed_precision.set_global_policy("mixed_float16")

## Installation: `huggify-data`

In [3]:
! pip install huggify-data

Collecting huggify-data
  Downloading huggify_data-0.4.4-py3-none-any.whl.metadata (7.5 kB)
Collecting accelerate<0.21.0,>=0.20.3 (from huggify-data)
  Downloading accelerate-0.20.3-py3-none-any.whl.metadata (17 kB)
Collecting bitsandbytes==0.40.2 (from huggify-data)
  Downloading bitsandbytes-0.40.2-py3-none-any.whl.metadata (9.8 kB)
Collecting datasets==2.20.0 (from huggify-data)
  Downloading datasets-2.20.0-py3-none-any.whl.metadata (19 kB)
Collecting huggingface-hub==0.23.4 (from huggify-data)
  Downloading huggingface_hub-0.23.4-py3-none-any.whl.metadata (12 kB)
Collecting matplotlib==3.9.0 (from huggify-data)
  Downloading matplotlib-3.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting numpy==1.24.4 (from huggify-data)
  Downloading numpy-1.24.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Collecting peft==0.4.0 (from huggify-data)
  Downloading peft-0.4.0-py3-none-any.whl.metadata (21 kB)
Collecting pymupdf

## Acquire Foundation Model: GPT2

KerasHub provides a number of pretrained models, such as Google's BERT and OpenAI's GPT-2. You can see the list of available models in the KerasHub repository.

Loading the GPT-2 model takes only a few lines:

In [2]:
# To speed up training and generation, use a preprocessor with a sequence
# length of 128 instead of the full 1024.
preprocessor = keras_hub.models.GPT2CausalLMPreprocessor.from_preset(
    "gpt2_base_en",
    sequence_length=128,
)
gpt2_lm = keras_hub.models.GPT2CausalLM.from_preset(
    "gpt2_base_en", preprocessor=preprocessor
)

Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/3/download/config.json...


100%|██████████| 431/431 [00:00<00:00, 390kB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/3/download/tokenizer.json...


100%|██████████| 618/618 [00:00<00:00, 781kB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/3/download/assets/tokenizer/vocabulary.json...


100%|██████████| 0.99M/0.99M [00:00<00:00, 45.3MB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/3/download/assets/tokenizer/merges.txt...


100%|██████████| 446k/446k [00:00<00:00, 7.20MB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/3/download/model.weights.h5...


100%|██████████| 475M/475M [00:02<00:00, 193MB/s]


In [3]:
start = time.time()

output = gpt2_lm.generate("My trip to Yosemite was", max_length=200)
print("\nGPT-2 output:")
print(output)

end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")


GPT-2 output:
My trip to Yosemite was a bit of a roller coaster.

I've been hiking with my friends for the last couple months and I was surprised how many of us had the chance to get to see this place. It's not a big hike, but it was fun. It's a great place to get to know some of the people, the people who live here and the people who are here, and I really liked the views.

We arrived at the trailhead and the view of this area was breathtaking. It's a little bit of a bit of a walk, but there's a good deal of shade.

The hike up to the summit was very scenic and it was pretty clear. I was a bit worried about the wind, but it was pretty easy to get to the summit.

I was a bit nervous at first, but I felt like I had done something right. I was really looking forward to the day when I was backpacking in the
TOTAL TIME ELAPSED: 9.19s


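Try another prompt. This call finishes much faster than the first one (1.61s versus 9.19s), most likely because the first call includes one-time compilation of the generation function: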

In [4]:
start = time.time()

output = gpt2_lm.generate("That Italian restaurant is", max_length=200)
print("\nGPT-2 output:")
print(output)

end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")


GPT-2 output:
That Italian restaurant is called the "Souveillance Grill" in the city of Milan. The restaurant was named by Italian media as "the place to be for the night."

The restaurant is located inside the main dining room of the restaurant and has been open since the beginning of 2012. According to Italian news outlet Il Sole 24, it was opened in September 2014. According to Italian newspaper La Repubblica, it was named after a restaurant in the city of Milan called the "Souveillance Grill." The Italian newspaper reports that the restaurant has been open since the beginning of 2012.

"It's a very nice restaurant. I don't know about other restaurants that have been in the city of Milan," said owner Giovanni Pizzi.

The restaurant is located inside the main dining room, located in front of the main dining hall. The restaurant's name, according to Italian media, refers to a "Souveillance Grill." In fact,
TOTAL TIME ELAPSED: 1.61s


Now that you know how to use the GPT-2 model from KerasHub, you can go one step further and fine-tune it so that it generates text in a specific style: short or long, strict or casual. In this tutorial, we fine-tune it on question-answer pairs generated from a PDF.

## Scrape PDF using `huggify-data`

In [5]:
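from huggify_data.scrape_modules import PDFQnAGenerator

# Example usage: replace the path and the placeholder key with your own values.
pdf_path = "/content/Yiqiao Yin - List of Publications.pdf"
openai_api_key = "sk-xxx"  # placeholder, not a real key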
generator = PDFQnAGenerator(pdf_path, openai_api_key)
generator.process_scraped_content()
generator.generate_questions_answers()
df = generator.convert_to_dataframe()
print(df)

100%|██████████| 11/11 [00:05<00:00,  2.18it/s]

    index                                          questions  \
0       0  What is the email address associated with YIQI...   
1       1  What is the title of the paper co-authored by ...   
2       2  Who are the individuals included in the conten...   
3       3  What are the key advancements and applications...   
4       4  What is the significance or relevance of the c...   
5       5        What can you do with the content "Learn"?\n   
6       6  What forms of art do you enjoy exploring or cr...   
7       7  What is the title of the paper written by Shaw...   
8       8                       What do you want to learn?\n   
9       9  What might the missing letters in the content ...   
10     10  What are some of the recent publications by Yi...   

                                              answers  
0   YIQIAO YINB: Eagle0504@gmail.com : (+1) 585-9...  
1   Citizen ⋄www.Y-Yin.io/ ⋄YouTube.com/YiqiaoYin ...  
2                            Eby, LindaHill, Thelma J  
3   Mie




In [6]:
df.head(3)

|   | index | questions | answers |
|---|---|---|---|
| 0 | 0 | What is the email address associated with YIQI... | YIQIAO YINB: Eagle0504@gmail.com : (+1) 585-9... |
| 1 | 1 | What is the title of the paper co-authored by ... | Citizen ⋄www.Y-Yin.io/ ⋄YouTube.com/YiqiaoYin ... |
| 2 | 2 | Who are the individuals included in the conten... | Eby, LindaHill, Thelma J |


In [7]:
# Assuming 'df' is your existing dataframe with 'questions' and 'answers' columns

# Step 1: Create the 'combined' column
df['combined'] = df['questions'] + ' ' + df['answers']

# Step 2: Create a list of strings from the 'combined' column
paragraphs = df['combined'].tolist()

# Verify the result
print(paragraphs)

['What is the email address associated with YIQIAO YINB?\n YIQIAO YINB: Eagle0504@gmail.com \x07: (+1) 585-953-8396 ⋄U.S', 'What is the title of the paper co-authored by Yiqiao Yin and Keshav Rangan in 2024?\n Citizen ⋄www.Y-Yin.io/ ⋄YouTube.com/YiqiaoYin (100K+ Subs)PUBLICATIONS (SELECTED WORK)· Papers• 2024-03 | Keshav Rangan and Yiqiao Yin (2024), A Fine-tuning Enhanced RAG System with Quantized InfluenceMeasure as AI Judge, Scientific Reports (a Nature journal) 14 (27446), paper.• 2023-12 | Kieran Pichai and Yiqiao Yin (as mentor) (2024), A Retrieval-Augmented Generation Based LargeLanguage Model Benchmarked On a Novel Dataset, Journal of Student Research, 12(4).• 2023-04 | Xuan Di, Yiqiao Yin, Yongjie Fu, Zhaobin Mo, Shaw-Hwa Lo, Carolyn DiGuiseppi, David W', 'Who are the individuals included in the content "Eby, LindaHill, Thelma J"?\n Eby, LindaHill, Thelma J', 'What are the key advancements and applications of artificial intelligence in the field of medicine and image analysis 

In [8]:
type(paragraphs)

list

In [9]:
len(paragraphs)

11

In [26]:
paragraphs[7]

'What is the title of the paper written by Shaw-hwa Lo and Yiqiao Yin in December 2021 regarding an Interaction-based Recurrent Neural Network (IRNN)?\n Inte., 3(1), 01-11, paper.• 2021-12 | Shaw-hwa Lo and Yiqiao Yin (2021), An Interaction-based Recurrent Neural Network (IRNN) (Dec.,2021), Mach'

### Convert to `tf.data.Dataset`

Convert the list of strings into a batched, cached `tf.data.Dataset`. The dataset is small (11 examples), so all of it is used for training.

In [12]:
train_ds = (
    tf.data.Dataset.from_tensor_slices(paragraphs)
    .batch(16)
    .cache()
    .prefetch(tf.data.AUTOTUNE)
)

In [13]:
type(train_ds)
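
To see what `.batch(16)` does to a list of 11 strings, the following pure-Python sketch mimics the batching step. The helper `batch` and the stand-in `examples` list are hypothetical and only approximate the behavior of `tf.data.Dataset.batch`.

```python
import math


def batch(items, batch_size):
    """Split a list into consecutive batches; the last one may be smaller."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


examples = [f"paragraph {i}" for i in range(11)]  # stand-in for `paragraphs`
batches = batch(examples, batch_size=16)

print(len(batches))                   # 1
print(len(batches[0]))                # 11
print(math.ceil(len(examples) / 16))  # 1 step per epoch
```

Because all 11 examples fit into one batch, each training epoch consists of a single step, which is why the training log shows `1/1`.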

## Train `gpt2`

In [16]:
%%time

# `take(500)` caps training at 500 batches. This dataset has only one batch,
# so the cap has no effect here. Train for 100 epochs.
train_ds = train_ds.take(500)
num_epochs = 100

learning_rate = keras.optimizers.schedules.PolynomialDecay(
    5e-4,
    decay_steps=train_ds.cardinality() * num_epochs,
    end_learning_rate=0.0,
)
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
gpt2_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    loss=loss,
    weighted_metrics=["accuracy"],
)

gpt2_lm.fit(train_ds, epochs=num_epochs)

Epoch 1/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m24s[0m 24s/step - accuracy: 0.9143 - loss: 0.1582
Epoch 2/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m21s[0m 21s/step - accuracy: 0.8492 - loss: 0.2875
Epoch 3/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1s/step - accuracy: 0.8619 - loss: 0.2719
Epoch 4/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1s/step - accuracy: 0.8746 - loss: 0.2376
Epoch 5/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1s/step - accuracy: 0.8873 - loss: 0.1722
Epoch 6/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1s/step - accuracy: 0.9206 - loss: 0.1311
Epoch 7/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1s/step - accuracy: 0.9476 - loss: 0.0910
Epoch 8/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1s/step - accuracy: 0.9540 - loss: 0.0629
Epoch 9/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1

<keras.src.callbacks.history.History at 0x784b1d1d63b0>
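
The learning-rate schedule used above can be sketched in plain Python. This is a simplified re-implementation for illustration, assuming the default `power` of 1.0 (linear decay); the function name `polynomial_decay` and the variable `steps_per_epoch` are hypothetical and not part of Keras.

```python
def polynomial_decay(step, initial_lr, decay_steps, end_lr=0.0, power=1.0):
    """Learning rate at `step` for a polynomial decay schedule."""
    step = min(step, decay_steps)  # hold at end_lr once decay is finished
    fraction_left = 1.0 - step / decay_steps
    return (initial_lr - end_lr) * fraction_left ** power + end_lr


steps_per_epoch = 1  # 11 examples fit in one batch of up to 16
num_epochs = 100
decay_steps = steps_per_epoch * num_epochs

# Print the learning rate at the start, midpoint, and end of training.
for step in (0, 50, 100):
    print(step, polynomial_decay(step, 5e-4, decay_steps))
```

With one step per epoch and 100 epochs, the rate falls linearly from 5e-4 at the start to half that value at step 50 and reaches zero at step 100.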

## Inference

Because this is a text generation model, `generate()` returns the prompt followed by the generated continuation, so some post-processing is needed. The code below generates text and then splits the output on the prompt to isolate the actual response.

In [27]:
prompt = "What is the title of the paper written by Shaw-hwa Lo and Yiqiao Yin in December 2021?"
output = gpt2_lm.generate(prompt, max_length=200)
print("\nGPT-2 output:")
print(output)
response = output.split(prompt)[1]
print("\nGPT-2 post processed response:")
print(response)


GPT-2 output:
What is the title of the paper written by Shaw-hwa Lo and Yiqiao Yin in December 2021?
 Citizen ⋄www.Y-Yin.io/ ⋄YouTube.com/YiqiaoYin (100K+ Subs)PUBLICATIONS (SELECTED WORK)· Papers• 2021-12 | Shaw-hwa Lo and Yiqiao Yin (2021), A Fine-tuning Enhanced RAG System with Quantized InfluenceMeasure as AI Judge, Scientific Reports (a Nature journal) 14 (27446), paper.• 2021-12 | Shaw-hwa Lo and Yiqiao Yin (2021), A Fine-tuning Enhanced RAG System with Quantized InfluenceMeasure as AI Judge, Scientific Reports (a Nature journal) 14 (27446), paper.• 2021-12 | Shaw-hwa Lo and Yiqiao Yin (2021), A Fine-tuning Enhanced RAG System with Quantized

GPT-2 post processed response:

 Citizen ⋄www.Y-Yin.io/ ⋄YouTube.com/YiqiaoYin (100K+ Subs)PUBLICATIONS (SELECTED WORK)· Papers• 2021-12 | Shaw-hwa Lo and Yiqiao Yin (2021), A Fine-tuning Enhanced RAG System with Quantized InfluenceMeasure as AI Judge, Scientific Reports (a Nature journal) 14 (27446), paper.• 2021-12 | Shaw-hwa Lo and Yiqia
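
Note that `output.split(prompt)[1]` raises an `IndexError` if the prompt does not appear verbatim in the output. A slightly more defensive variant is sketched below; the helper name `extract_response` and the sample strings are made up for illustration.

```python
def extract_response(output, prompt):
    """Return the generated continuation without the echoed prompt."""
    if output.startswith(prompt):
        return output[len(prompt):].strip()
    # Fall back to the full text if the prompt is not echoed verbatim.
    return output.strip()


prompt = "What is the capital of France?"
output = prompt + "\n Paris is the capital."  # made-up stand-in for generate()
print(extract_response(output, prompt))  # Paris is the capital.
```

The same helper could be applied to the `output` and `prompt` variables from the cell above in place of the `split` call.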