Published on November 19, 2024. By Marília Prata, mpwolke

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

#Does language shape how we think? 

Linguistic relativity & linguistic determinism -- Linguistics 101


https://www.youtube.com/watch?v=Df25r8pcuI8

In [None]:
#Read One parquet file. Obviously, it's big.

train = pd.read_parquet("../input/wsdm-cup-multilingual-chatbot-arena/train.parquet")
train.tail(2)

#Different languages. Different ways of thinking.

"When we talk about language, we often dig down to universal categories like nouns and verbs, consonants and vowels, phrases and sentences. We end up with these cross-language concepts that individual languages are built on almost as if the colorful diversity found in the world's languages is just icing on the strong unity of the linguistic cake and language is grounded in our way of thinking and processing information which is itself universal among humans. **So languages and cultures are superficial, but language and cognition run deep**. But this isn't the only way to look at language. What if the language we are brought up to speak actually relates to the way we look at reality? From this perspective a language is a particular way of conceptualizing the world, and has close ties to culture."

The **Sapir-Whorf Hypothesis** posits that language either determines or influences one's thought. In other words, people who speak different languages see the world differently, based on the language they use to describe it.

In the 1930s, Benjamin Lee Whorf talked about language this way. He argued that **different languages represent different ways of thinking about the world around us**. This view has come to be called linguistic relativity. Exploring the grammar of the Hopi language, he concluded that the Hopi have an entirely different concept of time than speakers of European languages do, and that the European concepts of "time" and "matter" are actually conditioned by language itself. One practical consequence of linguistic relativity: direct translation between languages isn't always possible.

"Since Hopi (Native American language) and English aren't simply ways of expressing the same thing in different words, you can't actually preserve thoughts or viewpoints when you translate between them. In its strongest expression, linguistic relativity - the idea that viewpoints vary from language to language - relies on linguistic determinism - the idea that language determines thought."

"In other words how people think doesn't just vary depending on their language but is actually grounded in - determined by - the specific language of their community."

"Linguistic relativity has been abandoned and criticized over the decades with critics aiming to show that perception and cognition are universal, not tied to language and culture, but some psychologists and anthropologists continue to argue that differences in a language's structure and words may play a role in determining how we think. Experiments on how speakers of different languages approach non-linguistic tasks continue to spark this debate. Thank you for joining me on this quick tour of linguistic relativity and linguistic determinism."

https://www.youtube.com/watch?v=Df25r8pcuI8

http://www.nativlang.com/linguistics/...

In [None]:
zh = train[(train['language']=='Chinese')].reset_index(drop=True)
zh.tail()

#Keras installation

In [None]:
#By Gabriel Preda https://www.kaggle.com/code/gpreda/fine-tuning-gemma-2-model-using-lora-and-keras/notebook
# Install Keras 3 last. See https://keras.io/getting_started/ for more details.
!pip install -q -U keras-nlp
# Quote the requirement so the shell does not treat ">=3" as an output redirect.
!pip install -q -U "keras>=3"
!pip install -q -U kagglehub

In [None]:
#By Gabriel Preda https://www.kaggle.com/code/gpreda/fine-tuning-gemma-2-model-using-lora-and-keras/notebook

os.environ["KERAS_BACKEND"] = "jax" # you can also use tensorflow or torch
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "1.00" # avoid memory fragmentation on JAX backend.
os.environ["JAX_PLATFORMS"] = ""
import keras
import keras_nlp
import kagglehub

### Kaggle Secrets

In [None]:
#By Gabriel Preda https://www.kaggle.com/code/gpreda/fine-tuning-gemma-2-model-using-lora-and-keras/notebook

#Create your own secret in the Kaggle Secrets panel, then copy its code snippet to the clipboard and paste it here
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("hf_licorne")

#Gabriel's line
#from kaggle_secrets import UserSecretsClient
#user_secrets = UserSecretsClient()
#os.environ["KAGGLE_USERNAME"] = user_secrets.get_secret("kaggle_username")
#os.environ["KAGGLE_KEY"] = user_secrets.get_secret("kaggle_key")

from tqdm.notebook import tqdm
tqdm.pandas() # progress bar for pandas

import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, Markdown

In [None]:
#By Gabriel Preda https://www.kaggle.com/code/gpreda/fine-tuning-gemma-2-model-using-lora-and-keras/notebook

class Config:
    seed = 42
    dataset_path = "/kaggle/input/wsdm-cup-multilingual-chatbot-arena"
    preset = "gemma2_2b_en" # name of pretrained Gemma 2
    sequence_length = 512 # max size of input sequence for training
    batch_size = 1 # size of the input batch in training
    lora_rank = 4 # rank for LoRA, higher means more trainable parameters
    learning_rate = 8e-5 # learning rate used in training
    epochs = 5 # number of epochs to train (the original notebook uses 10)

In [None]:
keras.utils.set_random_seed(Config.seed)

In [None]:
#Don't change anything on the template line. Only change the row fields passed to it (Category, Question, Answer)
#By Gabriel Preda https://www.kaggle.com/code/gpreda/fine-tuning-gemma-2-model-using-lora-and-keras/notebook

template = "\n\nCategory:\nkaggle-{Category}\n\nQuestion:\n{Question}\n\nAnswer:\n{Answer}"
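# Store the formatted text in a new "text" column so the original "prompt" column
# stays intact for the queries further down (they read row.prompt).
zh["text"] = zh.apply(lambda row: template.format(Category=row.response_b,
                                                  Question=row.prompt,
                                                  Answer=row.response_a), axis=1)
data = zh["text"].tolist()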

## Template utility function

In [None]:
#By Gabriel Preda https://www.kaggle.com/code/gpreda/fine-tuning-gemma-2-model-using-lora-and-keras/notebook

def colorize_text(text):
    for word, color in zip(["Category", "Question", "Answer"], ["blue", "red", "green"]):
        text = text.replace(f"\n\n{word}:", f"\n\n**<font color='{color}'>{word}:</font>**")
    return text
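
To see what the helper does, here is a minimal sketch that formats one sample with the same template and then colorizes it. The field values are hypothetical and not taken from the dataset; the function is repeated only so the snippet runs on its own.

```python
def colorize_text(text):
    # Same helper as above, repeated so this sketch is self-contained.
    for word, color in zip(["Category", "Question", "Answer"], ["blue", "red", "green"]):
        text = text.replace(f"\n\n{word}:", f"\n\n**<font color='{color}'>{word}:</font>**")
    return text

template = "\n\nCategory:\nkaggle-{Category}\n\nQuestion:\n{Question}\n\nAnswer:\n{Answer}"

# Hypothetical field values, used for illustration only.
sample = template.format(Category="greeting", Question="Hello?", Answer="Hi!")
colored = colorize_text(sample)
print(colored)
```

Each of the three section labels ends up wrapped in a bold, colored `<font>` tag, which is what `display(Markdown(...))` renders in the notebook.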

#Gemma Causal

In [None]:
#By Gabriel Preda https://www.kaggle.com/code/gpreda/fine-tuning-gemma-2-model-using-lora-and-keras/notebook

gemma_causal_lm = keras_nlp.models.GemmaCausalLM.from_preset(Config.preset)
gemma_causal_lm.summary()

#Define the specialized class

In [None]:
#By Gabriel Preda https://www.kaggle.com/code/gpreda/fine-tuning-gemma-2-model-using-lora-and-keras/notebook

class GemmaQA:
    def __init__(self, max_length=512):
        self.max_length = max_length
        self.prompt = template
        self.gemma_causal_lm = gemma_causal_lm
        
    def query(self, category, question):
        response = self.gemma_causal_lm.generate(
            self.prompt.format(
                Category=category,
                Question=question,
                Answer=""), 
            max_length=self.max_length)
        display(Markdown(colorize_text(response)))

In [None]:
x, y, sample_weight = gemma_causal_lm.preprocessor(data[0:2])

In [None]:
print(x, y)
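
The preprocessor call returns features, labels, and sample weights. For a causal language model, the labels are usually the input token ids shifted one position to the left, so each position learns to predict the next token. The sketch below illustrates that idea in plain Python. The token ids and the padding id 0 are made up for illustration and are not Gemma's real vocabulary; the actual preprocessor output also carries a padding mask alongside the token ids.

```python
# Hypothetical token ids for one padded sequence; 0 stands in for padding.
token_ids = [2, 15, 27, 31, 1, 0, 0]
padding_mask = [t != 0 for t in token_ids]

# Features: every token except the last one.
x = token_ids[:-1]
# Labels: every token except the first one, i.e. the "next token" at each position.
y = token_ids[1:]
# Positions whose label is padding get zero weight in the loss.
sample_weight = [1.0 if m else 0.0 for m in padding_mask[1:]]

for inp, target, w in zip(x, y, sample_weight):
    print(f"input={inp:>2}  target={target:>2}  weight={w}")
```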

#Perform fine-tuning with LoRA

In [None]:
#By Gabriel Preda https://www.kaggle.com/code/gpreda/fine-tuning-gemma-2-model-using-lora-and-keras/notebook

# Enable LoRA for the model and set the LoRA rank to the lora_rank as set in Config (4).
gemma_causal_lm.backbone.enable_lora(rank=Config.lora_rank)
gemma_causal_lm.summary()
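
As a rough intuition for why the trainable parameter count in the summary drops so sharply: LoRA keeps an original weight matrix of shape `(d_in, d_out)` frozen and trains two small matrices of shapes `(d_in, rank)` and `(rank, d_out)` instead. The sketch below only counts parameters for a single adapted matrix. The layer size is a hypothetical value, not read from Gemma, and `lora_param_count` is an illustrative helper rather than a Keras function.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter: (d_in x rank) + (rank x d_out)."""
    return rank * (d_in + d_out)

d_in, d_out = 2048, 2048  # hypothetical layer size, not taken from Gemma
frozen = d_in * d_out     # parameters in the original, frozen matrix

for rank in (4, 8, 16):
    added = lora_param_count(d_in, d_out, rank)
    print(f"rank={rank:>2}: {added:,} trainable vs {frozen:,} frozen ({added / frozen:.2%})")
```

The count grows linearly with the rank, which matches the comment in `Config` that a higher rank means more trainable parameters.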

#Gemma_causal_lm

Epochs!

In [None]:
#By Gabriel Preda https://www.kaggle.com/code/gpreda/fine-tuning-gemma-2-model-using-lora-and-keras/notebook

#set sequence length cf. config (512)
gemma_causal_lm.preprocessor.sequence_length = Config.sequence_length 

# Compile the model with loss, optimizer, and metric
gemma_causal_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(learning_rate=Config.learning_rate),
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)

# Train model
gemma_causal_lm.fit(data, epochs=Config.epochs, batch_size=Config.batch_size)
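
The loss passed to `compile` compares raw logits with integer token ids. A minimal plain-Python sketch of that computation follows, with hypothetical logits over a 4-token vocabulary. Normalising by the total weight is a simplification made here for illustration; Keras's exact reduction may differ.

```python
import math

def sparse_ce_from_logits(logits, target):
    """Cross-entropy for one position: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

# Hypothetical logits at three positions and the id of the correct next token.
logits = [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 0.0, 0.0], [0.3, 3.0, 0.2, 0.1]]
targets = [0, 2, 1]
weights = [1.0, 1.0, 0.0]  # the last position is treated as padding

losses = [sparse_ce_from_logits(l, t) for l, t in zip(logits, targets)]
mean_loss = sum(w * l for w, l in zip(weights, losses)) / sum(weights)
print(mean_loss)
```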

In [None]:
gemma_qa = GemmaQA()

### Prompt Chinese index 4307

日本人为什么智力高

Google Translate:

Why are Japanese people so intelligent?

In [None]:
#By Gabriel Preda https://www.kaggle.com/code/gpreda/fine-tuning-gemma-2-model-using-lora-and-keras/notebook

row = zh.iloc[4307]
gemma_qa.query(row.response_a,row.prompt)

### Prompt Chinese index 4309

你好，请你介绍下你自己

Google Translate:

Hello, introduce yourself

In [None]:
#By Gabriel Preda https://www.kaggle.com/code/gpreda/fine-tuning-gemma-2-model-using-lora-and-keras/notebook

row = zh.iloc[4309]
gemma_qa.query(row.response_a,row.prompt)

### Prompt Chinese index 5

任务概述：\n\n1. **选择数据集**：我们将选择一个至少包含30个特征的开源分类数据集

Google Translate:

Task Overview:\n\n1. **Select a dataset**: We will select an open source classification dataset with at least 30 features

In [None]:
#By Gabriel Preda https://www.kaggle.com/code/gpreda/fine-tuning-gemma-2-model-using-lora-and-keras/notebook

row = zh.iloc[5]
gemma_qa.query(row.response_a,row.prompt)

### Prompt Chinese index 6

以下是一些提升英语水平的有效方法：\n\n1.  **定期学习英语**：每天都进行点滴的英语学习，尤其是在工作或 leisure 时间。在学习英语前，考虑好自己的学习时间规划和强调重要内容。在第二个月上学之后，各日都安排一些学习时间。\n2.  **阅读外国 Literature**

Google Translate:

Here are some effective ways to improve your English:\n\n1. **Study English regularly**: Study English a little bit every day, especially during work or leisure time. Before learning English, think about your study schedule and emphasize important content. After the second month of school, arrange some study time every day.\n2. **Read foreign literature**

In [None]:
#By Gabriel Preda https://www.kaggle.com/code/gpreda/fine-tuning-gemma-2-model-using-lora-and-keras/notebook

row = zh.iloc[6]
gemma_qa.query(row.response_a,row.prompt)

### Prompt Chinese index 2

我们需要解决的问题是：一根12米长的电线被切成两片。如果然后使用较长的部分形成正方形的周长，如果原始电线在任意点切割，正方形面积大于4的概率是多少？

Google Translate:

The problem we need to solve is: A 12-meter-long wire is cut into two pieces. If the longer piece is then used to form the perimeter of a square, what is the probability that the area of the square is greater than 4 if the original wire is cut at any point?

In [None]:
#By Gabriel Preda https://www.kaggle.com/code/gpreda/fine-tuning-gemma-2-model-using-lora-and-keras/notebook

category = "让我们来分解这个问题" # response_a at row index 2; translation: "Let's break this down"
question = "我们需要解决的问题是：一根12米长的电线被切成两片。如果然后使用较长的部分形成正方形的周长，如果原始电线在任意点切割，正方形面积大于4的概率是多少？"  # prompt at row index 2; see the translation above
gemma_qa.query(category,question)

### Prompt Chinese index 6

以下是一些提升英语水平的有效方法：\n\n1. 定期学习英语：每天都进行点滴的英语学习  

Here are some effective ways to improve your English:\n\n1. Learn English regularly: Learn English a little bit every day  

In [None]:
#By Gabriel Preda https://www.kaggle.com/code/gpreda/fine-tuning-gemma-2-model-using-lora-and-keras/notebook

category = "定期学习英语" #model_a row index 6 Study english regularly
question = "以下是一些提升英语水平的有效方法：\n\n1. 定期学习英语：每天都进行点滴的英语学习"  #Here are some effective ways to improve your English:\n\n1. Learn English regularly: Learn English a little bit every day
gemma_qa.query(category,question)

### Prompt Chinese index 4309

你好，请你介绍下你自己 (Hello, please introduce yourself)

Google Translate:

介绍\n\n我是一款人工智能语言模型 (response_b)

Introduction\n\nI am an artificial intelligence language model

In [None]:
#By Gabriel Preda https://www.kaggle.com/code/gpreda/fine-tuning-gemma-2-model-using-lora-and-keras/notebook

category = "介绍\n\n我是一款人工智能语言模型" # Introduction\n\nI am an artificial intelligence language model
question = "你好，请你介绍下你自己"  #Hello, please introduce yourself Prompt 4309
gemma_qa.query(category,question)

### Prompt Chinese index 10


探索性问题:** 在未来的城市中，如何设计一个智能的交通系统

Exploratory question: How to design an intelligent transportation system in the future city?

In [None]:
#By Gabriel Preda https://www.kaggle.com/code/gpreda/fine-tuning-gemma-2-model-using-lora-and-keras/notebook

category = "让城市的交通更加高效、绿色和可持续" # Make urban transportation more efficient, green and sustainable
question = "探索性问题:** 在未来的城市中，如何设计一个智能的交通系统"  # Prompt 10 Exploratory question: How to design an intelligent transportation system in the future city?
gemma_qa.query(category,question)

#Save the model

In [None]:
preset_dir = "./gemma2_2b_en_kaggle_docs"  # forward slash: "\g" is an invalid escape and not a path separator on Linux
gemma_causal_lm.save_to_preset(preset_dir)

#Acknowledgements:

Gabriel Preda https://www.kaggle.com/code/gpreda/fine-tuning-gemma-2-model-using-lora-and-keras/notebook