<a href="https://colab.research.google.com/github/sourcesync/kagglex_gemma/blob/gw%2Finitial/colab/gemma_ft_dolly__with_context_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# This notebook demonstrates the following:
   * fine-tuning "gemma2_2b_en" on the dolly dataset
   * comparing prompt completions before and after fine-tuning this model
   * running the whole workflow successfully in Colab


# Get access to Gemma via your Kaggle account:
  * Log into your Kaggle account
  * Request access to Gemma models using your Kaggle account.  You can follow these instructions here: https://www.kaggle.com/code/nilaychauhan/get-started-with-gemma-using-kerasnlp
  * You need to wait for confirmation.  Note that this didn't take too long for me.
  * Create an API key in your Kaggle account; you will need it later.  You can follow these instructions here: https://christianjmills.com/posts/kaggle-obtain-api-key-tutorial/



# Ensure your Colab account can access Gemma:
  * Add your Kaggle username and API key to your Colab secrets as `KAGGLE_USERNAME` and `KAGGLE_KEY` (the configuration cell reads both).  You can follow these instructions here: https://drlee.io/how-to-use-secrets-in-google-colab-for-api-key-protection-a-guide-for-openai-huggingface-and-c1ec9e1277e0



# Select an AI hardware accelerator
  * Select hardware options near the top right of your Colab notebook
  * I tested with A100 and it worked well.  Note that I have a Colab Pro subscription.


# Install required python packages

In [1]:
%%time
!pip install -q -U keras-nlp
!pip install -q -U "keras>=3"

CPU times: user 46.7 ms, sys: 11.9 ms, total: 58.6 ms
Wall time: 8.33 s


# Import required python packages

In [1]:
import os

# Keras picks its backend when it is first imported, so set it before importing keras.
os.environ["KERAS_BACKEND"] = "jax"

import keras
import keras_nlp
from keras_nlp.models import GemmaBackbone, BertBackbone
from keras.models import load_model
from IPython.display import Markdown
import textwrap
from google.colab import userdata
import json
import random
import pprint
import gc

# Configure this notebook
* set up KERAS parameters recommended by Google
* integrate KAGGLE API secret key

In [2]:
os.environ["KERAS_BACKEND"] = "jax"  # Or "torch" or "tensorflow".
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"]="1.00" # Avoid memory fragmentation on JAX backend.
os.environ["KAGGLE_USERNAME"] = userdata.get('KAGGLE_USERNAME') # Link to KAGGLE API secret key
os.environ["KAGGLE_KEY"] = userdata.get('KAGGLE_KEY') # Link to KAGGLE API secret key
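It can help to confirm that both credentials are actually set before the model download is attempted. A minimal sketch of such a pre-flight check, assuming the credentials live in the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables as in the cell above; the helper name `missing_kaggle_credentials` is hypothetical and not part of any library:

```python
import os

def missing_kaggle_credentials(env=None):
    """Return the names of Kaggle credential variables that are unset or empty."""
    env = os.environ if env is None else env
    required = ("KAGGLE_USERNAME", "KAGGLE_KEY")
    return [name for name in required if not env.get(name)]

missing = missing_kaggle_credentials()
if missing:
    print("Missing Kaggle credentials:", ", ".join(missing))
```

Passing a plain dict as `env` makes the check easy to try without touching the real environment.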

# Retrieve the fine-tuning dataset

In [4]:
%%time
!wget -O databricks-dolly-15k.jsonl https://huggingface.co/datasets/databricks/databricks-dolly-15k/resolve/main/databricks-dolly-15k.jsonl
!pwd
!ls

--2024-09-27 18:23:58--  https://huggingface.co/datasets/databricks/databricks-dolly-15k/resolve/main/databricks-dolly-15k.jsonl
Resolving huggingface.co (huggingface.co)... 13.35.210.114, 13.35.210.66, 13.35.210.77, ...
Connecting to huggingface.co (huggingface.co)|13.35.210.114|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.hf.co/repos/34/ac/34ac588cc580830664f592597bb6d19d61639eca33dc2d6bb0b6d833f7bfd552/2df9083338b4abd6bceb5635764dab5d833b393b55759dffb0959b6fcbf794ec?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27databricks-dolly-15k.jsonl%3B+filename%3D%22databricks-dolly-15k.jsonl%22%3B&Expires=1727720638&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcyNzcyMDYzOH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5oZi5jby9yZXBvcy8zNC9hYy8zNGFjNTg4Y2M1ODA4MzA2NjRmNTkyNTk3YmI2ZDE5ZDYxNjM5ZWNhMzNkYzJkNmJiMGI2ZDgzM2Y3YmZkNTUyLzJkZjkwODMzMzhiNGFiZDZiY2ViNTYzNTc2NGRhYjVkODMzYjM5M2I1NTc1OWRmZmI

# Define some useful functions used later
* display_chat() function

In [3]:
def display_chat(prompt, response):
  '''Displays an LLM prompt and response in a pretty way.'''
  prompt = prompt.replace('\n\n','<br><br>')
  prompt = prompt.replace('\n','<br>')
  formatted_prompt = "<font size='+1' color='brown'>🙋‍♂️<blockquote>" + prompt + "</blockquote></font>"
  response = response.replace('•', '  *')
  response = textwrap.indent(response, '', predicate=lambda _: True)
  response = response.replace('\n\n','<br><br>')
  response = response.replace('\n','<br>')
  response = response.replace("```","")
  formatted_text = "<font size='+1' color='teal'>🤖<blockquote>" + response + "</blockquote></font>"
  return Markdown(formatted_prompt+formatted_text)

# Load the fine-tuning dataset

In [14]:
with open("/content/databricks-dolly-15k.jsonl") as file:
    # One JSON object per line (JSON Lines format).
    ft_dataset_all = [json.loads(ln) for ln in file]
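Each line of the downloaded file is one JSON object. The sketch below shows the same parsing step on a small in-memory sample, so it can be tried without the download. The two records are hypothetical and only reuse the field names seen in this notebook (`instruction`, `context`, `response`); `read_jsonl` is an illustrative helper:

```python
import io
import json

def read_jsonl(lines):
    """Parse an iterable of JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]

# Hypothetical two-record sample standing in for the downloaded file.
sample = io.StringIO(
    '{"instruction": "Say hi.", "context": "", "response": "Hi."}\n'
    '\n'
    '{"instruction": "Add 1 and 2.", "context": "", "response": "3"}\n'
)
records = read_jsonl(sample)
print(len(records), sorted(records[0]))
```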

# Decide on how much fine-tuning data to use
* Often this is determined experimentally
* I've found at least 1000 data points suffice in general

In [16]:
ft_data = ft_dataset_all[:1000]
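Taking the first 1000 records keeps whatever order the file happens to have. If a random subset is preferred when experimenting with the dataset size, a seeded sample keeps the choice reproducible. A sketch only; `sample_subset` and the seed value are hypothetical and not used elsewhere in this notebook:

```python
import random

def sample_subset(records, n, seed=0):
    """Return up to n records drawn without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(records, min(n, len(records)))

toy = list(range(10))
subset = sample_subset(toy, 3, seed=42)
print(subset)
```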

# Randomly sample one of the fine-tuning dataset items.  We will focus our prompt experiments on this one.


In [27]:
ran_choice = random.choice(range(len(ft_data)))
ft_record = ft_data[ran_choice]
print("ran choice=", ran_choice)
pprint.pp(ft_record)

ran choice= 314
{'instruction': 'From the passage provided, extract the languages in which '
                'Kishore Kumar provided his vocals as a playback singer. '
                'Separate them with a comma.',
 'context': 'Kishore Kumar (born Abhas Kumar Ganguly; pronunciation '
            '(help·info); 4 August 1929 – 13 October 1987) was an Indian '
            'playback singer and actor. He is widely regarded as one of the '
            'greatest, most influential and dynamic singers in the history of '
            'Indian music. He was one of the most popular singers in the '
            'Indian subcontinent, notable for his yodeling and ability to sing '
            'songs in different voices. He used to sing in different genres '
            'but some of his rare compositions, considered classics, were lost '
            'in time. According to his brother and legendary actor Ashok '
            'Kumar, Kishore Kumar was successful as a singer because his '
            '"voi

# Load the Gemma model

In [28]:
%%time
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma2_2b_en")
# Uncomment the following lines to use top-k sampling (draw from the 5 most likely tokens) instead of the default sampler.
#sampler = keras_nlp.samplers.TopKSampler(k=5, seed=2)
#gemma_lm.compile(sampler=sampler)

CPU times: user 10.3 s, sys: 10.2 s, total: 20.4 s
Wall time: 41.8 s
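The commented-out lines in the cell above switch the model to top-k sampling. As an illustration of the idea only (this is not the KerasNLP implementation), top-k sampling keeps the k highest-scoring tokens, applies a softmax over just those, and draws one at random. The function name and the toy logits below are hypothetical:

```python
import math
import random

def top_k_sample(logits, k, rng):
    """Draw a token index from the k highest logits, weighted by softmax probability."""
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits only (subtract the max for numerical stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(2)
toy_logits = [2.0, 0.5, -1.0, 3.0, 0.0]
draws = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(100)]
print(sorted(set(draws)))
```

With `k=1` this reduces to always picking the single most likely token.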


# Ask the model something related to the random record
* If it's not general knowledge (or in Gemma's pre-training data), we would not expect the model to "know" anything about it, or at least its answer may not be faithful to the record's context.

In [30]:
%%time
prompt = "What languages in which did Kishore Kumar provide his vocals as a playback singer?  Separate them with a comma."
completion = gemma_lm.generate(prompt,max_length=1024)
response = completion.replace(prompt, "")
display_chat(prompt, response)

CPU times: user 12.6 s, sys: 13.9 ms, total: 12.6 s
Wall time: 12.5 s


<font size='+1' color='brown'>🙋‍♂️<blockquote>What languages in which did Kishore Kumar provide his vocals as a playback singer?  Separate them with a comma.</blockquote></font><font size='+1' color='teal'>🤖<blockquote><br><br>What is the name of the song in which Kishore Kumar sang the song “<em>Aaj ki raat hoon tumse pyar karne ko</em>” in the movie “<em>Aradhana</em>”?<br><br>What is the name of the song in which Kishore Kumar sang the song “<em>Aaj ki raat hoon tumse pyar karne ko</em>” in the movie “<em>Aradhana</em>”?<br><br>What is the name of the song in which Kishore Kumar sang the song “<em>Aaj ki raat hoon tumse pyar karne ko</em>” in the movie “<em>Aradhana</em>”?<br><br>What is the name of the song in which Kishore Kumar sang the song “<em>Aaj ki raat hoon tumse pyar karne ko</em>” in the movie “<em>Aradhana</em>”?<br><br>What is the name of the song in which Kishore Kumar sang the song “<em>Aaj ki raat hoon tumse pyar karne ko</em>” in the movie “<em>Aradhana</em>”?<br><br>What is the name of the song in which Kishore Kumar sang the song “<em>Aaj ki raat hoon tumse pyar karne ko</em>” in the movie “<em>Aradhana</em>”?<br><br>What is the name of the song in which Kishore Kumar sang the song “<em>Aaj ki raat hoon tumse pyar karne ko</em>” in the movie “<em>Aradhana</em>”?<br><br>What is the name of the song in which Kishore Kumar sang the song “<em>Aaj ki raat hoon tumse pyar karne ko</em>” in the movie “<em>Aradhana</em>”?<br><br>What is the name of the song in which Kishore Kumar sang the song “<em>Aaj ki raat hoon tumse pyar karne ko</em>” in the movie “<em>Aradhana</em>”?<br><br>What is the name of the song in which Kishore Kumar sang the song “<em>Aaj ki raat hoon tumse pyar karne ko</em>” in the movie “<em>Aradhana</em>”?<br><br>What is the name of the song in which Kishore Kumar sang the song “<em>Aaj ki raat hoon tumse pyar karne ko</em>” in the movie “<em>Aradhana</em>”?<br><br>What is the name of the song in which Kishore Kumar sang the song 
“<em>Aaj ki raat hoon tumse pyar karne ko</em>” in the movie “<em>Aradhana</em>”?<br><br>What is the name of the song in which Kishore Kumar sang the song “<em>Aaj ki raat hoon tumse pyar karne ko</em>” in the movie “<em>Aradhana</em>”?<br><br>What is the name of the song in which Kishore Kumar sang the song “<em>Aaj ki raat hoon tumse pyar karne ko</em>” in the movie “<em>Aradhana</em>”?<br><br>What is the name of the song in which Kishore Kumar sang the song “<em>Aaj ki raat hoon tumse pyar karne ko</em>” in the movie “<em>Aradhana</em>”?<br><br>What is the name of the song in which Kishore Kumar sang the song “<em>Aaj ki raat hoon tumse pyar karne ko</em>” in the movie “<em>Aradhana</em>”?<br><br>What is the name of the song in which Kishore Kumar sang the song “<em>Aaj ki raat hoon tumse pyar karne ko</em>” in the movie “<em>Aradhana</em>”?<br><br>What is the name of the song in which Kishore Kumar sang the song “<em>Aaj ki raat hoon tumse pyar karne ko</em>” in the movie “<em>Aradhana</em>”?<br><br>What is the name of the song in which Kishore Kumar sang the song “<em>Aaj ki raat hoon tumse pyar karne ko</em>” in the movie “<em>Aradhana</em>”?<br><br>What is the name of the song in which Kishore Kumar sang the song “<em>Aaj ki raat hoon tumse pyar karne ko</em>” in the movie “<em>Aradhana</em>”?<br><br>What is the name of the song in which Kishore Kumar sang the song “<em>Aaj ki raat hoon tumse pyar karne ko</em>” in the movie “<em>Aradhana</em>”?<br><br>What is the name of the song in which Kishore Kumar sang the song “<em>Aaj ki raat hoon tumse pyar karne ko</em>” in the movie “<em>Aradhana</em>”?<br><br>What is the name of the song in which Kishore Kumar sang the song “<em>Aaj ki raat hoon tumse pyar karne ko</blockquote></font>

# At this point, the model's answer has some issues:
* the response isn't well formatted: instead of answering, the model repeats an unrelated question over and over
* the response seems to indicate the model doesn't know too much about the topic
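The base model's completion above repeats the same question many times. One rough way to put a number on that is to count duplicated paragraphs. This is a sketch under the assumption that paragraphs are separated by blank lines; `repeated_fraction` is a hypothetical helper, and the strings are toy inputs rather than the actual completion:

```python
from collections import Counter

def repeated_fraction(text):
    """Fraction of non-empty paragraphs that duplicate an earlier paragraph."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return 0.0
    counts = Counter(paragraphs)
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(paragraphs)

looping = "\n\n".join(["What is the name of the song?"] * 4)
varied = "Hindi, Bengali\n\nMarathi"
print(repeated_fraction(looping), repeated_fraction(varied))
```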

# First, let's prepare a fine-tuning dataset to deal with the response formatting issue (so-called "instruction following")
* we won't use the context field here


In [32]:
# Prompt template without the context field; reused later to build the test prompt.
template = "Instruction:\n{instruction}\n\nResponse:\n{response}"
data = [template.format(**item) for item in ft_data]
print("Here is what one item of the fine-tuning dataset looks like:")
pprint.pp(random.choice(data))

Here is what one item of the fine-tuning dataset looks like:
('Instruction:\n'
 'How do you make an iced matcha latter?\n'
 '\n'
 'Response:\n'
 'You will need 1-2 teaspoons of matcha powder, milk of your choice, 1 cup of '
 'hot water, and ice. In a cup or bowl, you will add your match powder and '
 'pour your hot water into it and use a whisk until the matcha powder is well '
 'incorporated, which should create a paste like consistency. Then you will '
 'take a glass and pour in your ice and milk and your matcha paste on top and '
 'stir.')
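`template.format(**item)` passes every field of the record as a keyword argument. `str.format` uses only the names that appear in the template and ignores the rest, which is why the `context` field is dropped here without any extra code. A small illustration with a hypothetical record:

```python
# Same template string as in the cell above.
demo_template = "Instruction:\n{instruction}\n\nResponse:\n{response}"

# Hypothetical record using the field names seen in this notebook.
demo_item = {
    "instruction": "Name a primary color.",
    "context": "This text is not referenced by the template.",
    "response": "Red.",
}
demo_text = demo_template.format(**demo_item)
print(demo_text)
```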


# Fine-tune the model for just proper "instruction following"

In [33]:
%%time
gemma_lm.backbone.enable_lora(rank=4)

# Limit the input sequence length to 256 (to control memory usage).
gemma_lm.preprocessor.sequence_length = 256
# Use AdamW (a common optimizer for transformer models).
optimizer = keras.optimizers.AdamW(
    learning_rate=5e-5,
    weight_decay=0.01,
)
# Exclude layernorm and bias terms from decay.
optimizer.exclude_from_weight_decay(var_names=["bias", "scale"])

gemma_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=optimizer,
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)
gemma_lm.fit(data, epochs=1, batch_size=1)

1000/1000 ━━━━━━━━━━━━━━━━━━━━ 162s 58ms/step - loss: 0.8271 - sparse_categorical_accuracy: 0.5427
CPU times: user 4min 41s, sys: 12.1 s, total: 4min 53s
Wall time: 2min 42s


<keras.src.callbacks.history.History at 0x7dd98ff19e10>
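`enable_lora(rank=4)` freezes the backbone's original weights and trains small low-rank adapters instead. For a single dense weight of shape `d_in x d_out`, a rank-`r` adapter adds two matrices of shape `d_in x r` and `r x d_out`. A back-of-the-envelope count, using hypothetical layer dimensions rather than Gemma's actual ones:

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters added by one rank-`rank` LoRA adapter on a d_in x d_out weight."""
    return d_in * rank + rank * d_out

# Hypothetical dimensions chosen only for illustration.
d_in, d_out, rank = 2048, 2048, 4
full = d_in * d_out
adapter = lora_trainable_params(d_in, d_out, rank)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {adapter / full:.4%}")
```

The adapter grows linearly with the rank, so raising `rank` trades memory and training time for more adapter capacity.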

# Now let's ask the fine-tuned model the same question
* we should expect better response formatting (ie, instruction following)
* we might expect it to answer correctly, since the response was included in the fine-tuning training set

In [34]:
prompt = template.format(
    instruction="What languages in which did Kishore Kumar provide his vocals as a playback singer?  Separate them with a comma.",
    response="",
)
completion = gemma_lm.generate(prompt)
response = completion.replace(prompt, "")
display_chat(prompt, response)

<font size='+1' color='brown'>🙋‍♂️<blockquote>Instruction:<br>What languages in which did Kishore Kumar provide his vocals as a playback singer?  Separate them with a comma.<br><br>Response:<br></blockquote></font><font size='+1' color='teal'>🤖<blockquote>Hindi, Bengali, Tamil, Telugu, Malayalam, Kannada, Marathi, Gujarati, Oriya, Assamese, Punjabi, Urdu, English, French, German, Spanish, Portuguese, Russian, Japanese, Chinese, Korean, Arabic, Persian, Dutch, Danish, Swedish, Norwegian, Finnish, Icelandic, Hebrew, Polish, Czech, Slovak, Hungarian, Romanian, Bulgarian, Serbian, Croatian, Bosnian, Slovenian, Lithuanian, Latvian, Estonian, Greek, Armenian, Turkish, Indonesian, Thai, Vietnamese, Cambodian, Lao, Burmese, Cambodian, Nepali, Bhutanese, Tibetan, Mongolian, Japanese, Korean, Chinese, Vietnamese, Thai, Cambodian, Lao, Burmese, Nepali, Bhutanese, Tibetan, Mongolian, Japanese, Korean, Chinese, Vietnamese, Thai, Cambodian, Lao, Burmese, Nepali, Bhutanese, Tibetan, Mongolian, Japanese, Korean, Chinese, Vietnamese, Thai, Cambodian, Lao, Burmese, Nepali, Bhutanese, Tibetan, Mongolian, Japanese, Korean, Chinese, Vietnamese, Thai, Cambodian, Lao, Burmese, Nepali, Bhutanese, Tibetan, Mongolian, Japanese, Korean, Chinese, Vietnamese</blockquote></font>
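`completion.replace(prompt, "")` removes every occurrence of the prompt, including any copy the model happens to repeat inside its answer. If only the leading echo should be dropped, a prefix-based strip is a possible alternative. A sketch; `strip_prompt` is a hypothetical helper and assumes the completion starts with the prompt verbatim:

```python
def strip_prompt(completion, prompt):
    """Remove the prompt only when it appears at the start of the completion."""
    if completion.startswith(prompt):
        return completion[len(prompt):]
    return completion

# Toy strings; the second copy of the prompt inside the answer is kept.
demo_prompt = "Q: hi\nA: "
demo_completion = "Q: hi\nA: hello. Q: hi\nA: again"
print(repr(strip_prompt(demo_completion, demo_prompt)))
```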

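# Fine-tuning on just the instruction and response fields of the QA dataset seems to have unlocked the model's ability to format its answer as requested. The answer itself is not reliable, though: it lists languages that do not appear in the record and ends in a repeating loop. Let's see what happens if we add the dataset context as well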

# Now let's create a fine-tuning dataset using the dataset context
* in this case, we assume the dataset context has ground-truth facts that the model should be using

In [35]:
# Prompt template with the context field; reused later to build the test prompts.
template = "Instruction:\n{instruction}\n\nContext:\n{context}\n\nResponse:\n{response}"
data = [template.format(**item) for item in ft_data]
print("Here is what one item of the fine-tuning dataset looks like:")
pprint.pp(random.choice(data))

Here is what one item of the fine-tuning dataset looks like:
('Instruction:\n'
 'According to this paragraph, tell me what is referred to as the most '
 'exciting collegiate sporting event.\n'
 '\n'
 'Context:\n'
 'In 2019, Virginia men\'s basketball won the NCAA Championship in "March '
 'Madness", the single-elimination national college basketball tournament '
 'considered by YouGov polled American viewers (as of the same year) to be the '
 'most exciting collegiate sporting event. In 2015, when Virginia first won '
 'its first Capital One Cup its teams won the 2014 College Cup, the 2015 '
 'College World Series, and the 2015 NCAA Tennis Championships. When it '
 'repeated the feat in 2019, the program won both March Madness and the 2019 '
 "Men's Lacrosse Championship.\n"
 '\n'
 'Response:\n'
 'What is referred to as the most exciting collegiate sporting event is when '
 "the Virginia men's basketball team won the NCAA Championship in 2019.")


# Reload base model and fine-tune on the new dataset which has context included

In [36]:
%%time

# unload previous model to make room for new model
gemma_lm = None
gc.collect()

gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma2_2b_en")
gemma_lm.backbone.enable_lora(rank=4)

# Limit the input sequence length to 256 (to control memory usage).
gemma_lm.preprocessor.sequence_length = 256
# Use AdamW (a common optimizer for transformer models).
optimizer = keras.optimizers.AdamW(
    learning_rate=5e-5,
    weight_decay=0.01,
)
# Exclude layernorm and bias terms from decay.
optimizer.exclude_from_weight_decay(var_names=["bias", "scale"])

gemma_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=optimizer,
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)
gemma_lm.fit(data, epochs=1, batch_size=1)

1000/1000 ━━━━━━━━━━━━━━━━━━━━ 139s 59ms/step - loss: 1.0938 - sparse_categorical_accuracy: 0.5654
CPU times: user 2min 23s, sys: 19.7 s, total: 2min 43s
Wall time: 2min 59s


<keras.src.callbacks.history.History at 0x7dd87d9291b0>
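With the context included, each training item is longer, and the fine-tuning cell caps inputs at 256 tokens, so a long context may be cut off before the response is reached. A rough way to see how many items are at risk is to count words per item. Whitespace-separated words are only a crude stand-in for tokens (a subword tokenizer will generally produce more tokens than words), and `count_over_budget` is a hypothetical helper:

```python
def count_over_budget(texts, budget):
    """Count texts whose whitespace-separated word count exceeds `budget`."""
    return sum(1 for text in texts if len(text.split()) > budget)

# Hypothetical items: one short, one 300 words long.
toy_items = ["short item", "word " * 300]
print(count_over_budget(toy_items, budget=256))
```

Running it on the formatted fine-tuning items would give a lower bound on how many are truncated.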

# Now let's ask the new fine-tuned model the same question
* we should expect better response formatting (ie, instruction following)
* we might expect it to answer correctly based on the context data, since it was part of the fine-tuning dataset

In [37]:
prompt = template.format(
    instruction="What languages in which did Kishore Kumar provide his vocals as a playback singer?  Separate them with a comma.",
    context="",
    response="",
)
completion = gemma_lm.generate(prompt)
response = completion.replace(prompt, "")
display_chat(prompt, response)

<font size='+1' color='brown'>🙋‍♂️<blockquote>Instruction:<br>What languages in which did Kishore Kumar provide his vocals as a playback singer?  Separate them with a comma.<br><br>Context:<br><br><br>Response:<br></blockquote></font><font size='+1' color='teal'>🤖<blockquote>Hindi, Bengali, Tamil, Telugu, Malayalam, Kannada, Marathi, Gujarati, Oriya, Assamese, Punjabi, Urdu, English, and Sanskrit</blockquote></font>

# Did the model incorporate the context during the fine-tuning process?  Now, let's supply the context in the prompt itself.



In [40]:
prompt = template.format(
    instruction="What languages in which did Kishore Kumar provide his vocals as a playback singer?  Separate them with a comma.",
    context=ft_record['context'],
    response="",
)
completion = gemma_lm.generate(prompt,max_length=1024)
response = completion.replace(prompt, "")
display_chat(prompt, response)

<font size='+1' color='brown'>🙋‍♂️<blockquote>Instruction:<br>What languages in which did Kishore Kumar provide his vocals as a playback singer?  Separate them with a comma.<br><br>Context:<br>Kishore Kumar (born Abhas Kumar Ganguly; pronunciation (help·info); 4 August 1929 – 13 October 1987) was an Indian playback singer and actor. He is widely regarded as one of the greatest, most influential and dynamic singers in the history of Indian music. He was one of the most popular singers in the Indian subcontinent, notable for his yodeling and ability to sing songs in different voices. He used to sing in different genres but some of his rare compositions, considered classics, were lost in time. According to his brother and legendary actor Ashok Kumar, Kishore Kumar was successful as a singer because his "voice hits the mike, straight, at its most sensitive point".<br><br>Besides Hindi, he sang in many other Indian languages, including Bengali, Marathi, Assamese, Gujarati, Kannada, Bhojpuri, Malayalam, Odia and Urdu. He also released a few non-film albums in multiple languages, especially in Bengali, which are noted as all-time classics.<br><br>He won 8 Filmfare Awards for Best Male Playback Singer and holds the record for winning the most Filmfare Awards in that category. He was awarded the Lata Mangeshkar Award by the Madhya Pradesh government in 1985.<br><br>Response:<br></blockquote></font><font size='+1' color='teal'>🤖<blockquote>Bengali, Marathi, Assamese, Gujarati, Kannada, Bhojpuri, Malayalam, Odia, Urdu</blockquote></font>
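One way to judge whether an answer is faithful to the context is to check each listed language against the passage. The sketch below uses plain substring matching, which is deliberately naive (it treats spelling variants such as "Oriya" and "Odia" as different). `unsupported_items` is a hypothetical helper, and the strings are copied from the outputs above:

```python
def unsupported_items(answer, passage):
    """Return comma-separated items from `answer` that do not appear in `passage`."""
    items = [part.strip() for part in answer.split(",")]
    # Drop a leading "and " such as in "..., English, and Sanskrit".
    items = [item[4:] if item.startswith("and ") else item for item in items]
    return [item for item in items if item and item not in passage]

passage = (
    "Besides Hindi, he sang in many other Indian languages, including Bengali, "
    "Marathi, Assamese, Gujarati, Kannada, Bhojpuri, Malayalam, Odia and Urdu."
)
no_context_answer = (
    "Hindi, Bengali, Tamil, Telugu, Malayalam, Kannada, Marathi, Gujarati, "
    "Oriya, Assamese, Punjabi, Urdu, English, and Sanskrit"
)
with_context_answer = (
    "Bengali, Marathi, Assamese, Gujarati, Kannada, Bhojpuri, Malayalam, Odia, Urdu"
)
print(unsupported_items(no_context_answer, passage))
print(unsupported_items(with_context_answer, passage))
```

The answer generated without context contains several languages the passage never mentions, while the answer generated with the context in the prompt contains none.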

# Let's ask the model slightly differently, telling it to explicitly use the provided context

In [42]:
prompt = template.format(
    instruction='From the passage provided, extract the languages in which '
                'Kishore Kumar provided his vocals as a playback singer. '
                'Separate them with a comma.',
    context=ft_record['context'],
    response="",
)
completion = gemma_lm.generate(prompt,max_length=1024)
response = completion.replace(prompt, "")
display_chat(prompt, response)

<font size='+1' color='brown'>🙋‍♂️<blockquote>Instruction:<br>From the passage provided, extract the languages in which Kishore Kumar provided his vocals as a playback singer. Separate them with a comma.<br><br>Context:<br>Kishore Kumar (born Abhas Kumar Ganguly; pronunciation (help·info); 4 August 1929 – 13 October 1987) was an Indian playback singer and actor. He is widely regarded as one of the greatest, most influential and dynamic singers in the history of Indian music. He was one of the most popular singers in the Indian subcontinent, notable for his yodeling and ability to sing songs in different voices. He used to sing in different genres but some of his rare compositions, considered classics, were lost in time. According to his brother and legendary actor Ashok Kumar, Kishore Kumar was successful as a singer because his "voice hits the mike, straight, at its most sensitive point".<br><br>Besides Hindi, he sang in many other Indian languages, including Bengali, Marathi, Assamese, Gujarati, Kannada, Bhojpuri, Malayalam, Odia and Urdu. He also released a few non-film albums in multiple languages, especially in Bengali, which are noted as all-time classics.<br><br>He won 8 Filmfare Awards for Best Male Playback Singer and holds the record for winning the most Filmfare Awards in that category. He was awarded the Lata Mangeshkar Award by the Madhya Pradesh government in 1985.<br><br>Response:<br></blockquote></font><font size='+1' color='teal'>🤖<blockquote>Languages in which Kishore Kumar provided his vocals as a playback singer: Hindi, Bengali, Marathi, Assamese, Gujarati, Kannada, Bhojpuri, Malayalam, Odia, Urdu</blockquote></font>