Published on November 18, 2024. By Marília Prata, mpwolke.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go
import plotly.offline as py
import plotly.express as px


import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Competition Citation

@misc{wsdm-cup-multilingual-chatbot-arena,
    author = {Wei-lin Chiang and Evan Frick and Lisa Dunlap and Anastasios Angelopoulos and Joseph E. Gonzalez and Ion Stoica and Sohier Dane and Maggie Demkin and Nate Keating},
    title = {WSDM Cup - Multilingual Chatbot Arena},
    year = {2024},
    howpublished = {\url{https://kaggle.com/competitions/wsdm-cup-multilingual-chatbot-arena}},
    note = {Kaggle}
}

![](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQa9psK1q-xrMPxO-9oLH20PB8BCFu05HSgvg&s)WSDM - Call for Papers

## WSDM (Web Search and Data Mining)

"WSDM is a highly selective conference that includes invited talks, as well as refereed full papers. WSDM publishes original, high-quality papers related to search and data mining on the Web and the Social Web, with an emphasis on practical yet principled novel models of search and data mining, algorithm design and analysis, economic implications, and in-depth experimental analysis of accuracy and performance."

**List of Topics**

* Web Search

"Algorithms for web-scale search; Adversarial search; Search user interfaces and interaction; Distributed search, metasearch, peer-to-peer search; Local and mobile search; Multimedia web search, cross-lingual search; Query analysis and query processing; Search benchmarking and evaluation; Search user behavior and log analysis; Search with Foundation Models."


* Web Mining and Content Analysis

"Crawling and indexing web content; Web recommender systems and algorithms; Clustering, classification, and summarization of web data; Data, entity, event, and relationship extraction; Knowledge acquisition and automatic construction of knowledge bases; Large-scale graph analysis; Semantic search, faceted search, and knowledge graphs; Multimodal data mining; Scalable algorithms for mining web data; Opinion mining and sentiment analysis; Web traffic and log analysis; Web measurements, web evolution, and web models."

* Web of Things, Ubiquitous and Mobile Computing

"Algorithms for web-scale search; Adversarial search; Search user interfaces and interaction; Distributed search, metasearch, peer-to-peer search; Local and mobile search; Multimedia web search, cross-lingual search; Query analysis and query processing; Search benchmarking and evaluation; Search user behavior and log analysis; Search with Foundation Models."

* Privacy, Fairness, Interpretability

"Fairness and accountability in ranking, recommendations and advertising; Explainability in web systems; Model and algorithm transparency; Interpretable models of individual and social behavior; Web search and data mining under privacy constraints; Fairness and interpretability in applications of web mining for social good."

* Social Networks

"Link prediction and community detection; Social network analysis and graph algorithms; Computational social science; Influence and viral marketing in social networks; Social sensing; Searching social and real-time content; Social network dynamics; Sampling, experiments, and evaluation in social networks; Social media analysis: blogs and friendship networks; Social network analysis, theories, models and applications; Social reputation and trust."

* Intelligent Assistants

"Voice search, conversational search, and dialog systems; Personal assistants, dialogue models, and conversational interaction; Task-driven search; Zero-query and implicit search."

* Crowdsourcing and Human Computation

"Collaborative search and question answering; Human-in-the-Loop and Collaborative Human-AI systems."

* Emerging and Creative Applications

"Mental health and well-being support systems; Web mining for social good; Systems and algorithms for urban applications such as smart cities/buildings/etc; Online education systems; Monitoring and prevention of epidemics; Social chatbots."

* Information Integrity

"Systems and algorithms for monitoring and detection of misinformation and fake news; Prevalence and virality of misinformation; Misinformation sources and origins; Source and content credibility; Detecting and combating spamming, trolling, aggression, dog whistles, and toxic online behaviors; Methods for detecting and mitigating low quality and offensive content, bullying and hate speech; Polarization, extremism and radicalization; Echo chambers and filter bubbles."

* Foundation Models

"Use of large language models (LLMs) and other foundation models for web search and data mining, including but not limited to the following tasks: Generative question answering; Indexing and query analysis; Pre-training and self-supervised learning for web-based tasks; Development of new user interfaces and user experiences; Support to information integrity."

http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=181081&copyownerid=187788

## Train parquet: last two rows

In [None]:
#Read One parquet file. Obviously, it's big.

train = pd.read_parquet("../input/wsdm-cup-multilingual-chatbot-arena/train.parquet")
train.tail(2)

In [None]:
train.head()

## Test file: Spanish/English/Portuguese 

In [None]:
test = pd.read_parquet("../input/wsdm-cup-multilingual-chatbot-arena/test.parquet")
test.tail()

## Prompt at index 4: Please be Boring

In [None]:
train['prompt'][4]

## Understood?

In [None]:
train['response_b'][4]

## Alright, I'll be as boring as possible ; )  

In [None]:
train['response_a'][4]

In [None]:
#Two lines Required to Plot Plotly

import plotly.io as pio
pio.renderers.default = 'iframe'

In [None]:
#https://colab.research.google.com/drive/1KdwokPjirkTmpO_P1WByFNFiqxWQquwH#scrollTo=B_PYA7oVyaHO

fig = px.bar(train["winner"].value_counts(),
             title="Counts of Battle Winner Models", text_auto=True, height=400, color_discrete_sequence=['crimson'])
fig.update_layout(xaxis_title="Winner Models", yaxis_title="Count", 
                  showlegend=False)
fig

In [None]:
#https://colab.research.google.com/drive/1KdwokPjirkTmpO_P1WByFNFiqxWQquwH#scrollTo=B_PYA7oVyaHO

fig = px.bar(train["model_a"].value_counts(),
             title="Battle Count per Model in Position A", text_auto=True, height=400, color_discrete_sequence=['Chartreuse'])
fig.update_layout(xaxis_title="Model A", yaxis_title="Count", 
                  showlegend=False)
fig

In [None]:
#https://colab.research.google.com/drive/1KdwokPjirkTmpO_P1WByFNFiqxWQquwH#scrollTo=B_PYA7oVyaHO

fig = px.bar(train["model_b"].value_counts(),
             title="Battle Count per Model in Position B", text_auto=True, height=400, color_discrete_sequence=['coral'])
fig.update_layout(xaxis_title="Model B", yaxis_title="Count",
                  showlegend=False)
fig

In [None]:
#https://colab.research.google.com/drive/1KdwokPjirkTmpO_P1WByFNFiqxWQquwH#scrollTo=B_PYA7oVyaHO

fig = px.bar(pd.concat([train["model_a"], train["model_b"]]).value_counts(),
             title="Battle Count for Each Model", text_auto=True)
fig.update_layout(xaxis_title="model", yaxis_title="Battle Count", height=400,
                  showlegend=False)
fig  

In [None]:
#https://colab.research.google.com/drive/1KdwokPjirkTmpO_P1WByFNFiqxWQquwH#scrollTo=B_PYA7oVyaHO

def visualize_battle_count(train, title, show_num_models=30):
    ptbl = pd.pivot_table(train, index="model_a", columns="model_b", aggfunc="size",
                          fill_value=0)

    battle_counts = ptbl + ptbl.T
    ordering = battle_counts.sum().sort_values(ascending=False).index
    ordering = ordering[:show_num_models]
    fig = px.imshow(battle_counts.loc[ordering, ordering],
                    title=title, text_auto=True)
    fig.update_layout(xaxis_title="Model B",
                      yaxis_title="Model A",
                      xaxis_side="top", height=800, width=800,
                      title_y=0.07, title_x=0.5,
                      font=dict(size=10))
    fig.update_traces(hovertemplate=
                      "Model A: %{y}<br>Model B: %{x}<br>Count: %{z}<extra></extra>")
    return fig

fig = visualize_battle_count(train, title="Battle Count of Each Combination of Models", show_num_models=30)
fig
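A caveat about `ptbl + ptbl.T` in the helper above: pandas aligns the two tables by label, so a model that appears only as `model_a` or only as `model_b` would produce NaN cells in the sum. The sketch below guards against that by reindexing to the union of model names first. The toy frame and the name `symmetric_battle_counts` are illustrative and not part of the original notebook, and whether the guard is actually needed for this dataset was not checked here.

```python
import pandas as pd


def symmetric_battle_counts(battles: pd.DataFrame) -> pd.DataFrame:
    """Count battles per unordered model pair, regardless of A/B position."""
    # Union of every model seen in either position.
    models = sorted(set(battles["model_a"]) | set(battles["model_b"]))
    # Rows are model_a, columns are model_b, cells are battle counts.
    ptbl = pd.crosstab(battles["model_a"], battles["model_b"])
    # Make the table square so that adding the transpose never yields NaN.
    ptbl = ptbl.reindex(index=models, columns=models, fill_value=0)
    return ptbl + ptbl.T


# Toy data: "m3" never appears as model_a and "m1" never appears as model_b.
toy = pd.DataFrame({
    "model_a": ["m1", "m1", "m2"],
    "model_b": ["m2", "m3", "m3"],
})
counts = symmetric_battle_counts(toy)
print(counts)
```

The resulting frame can be passed to `px.imshow` in the same way as `battle_counts` in the function above.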

In [None]:
# Drop tied battles. Cast to str first, because .str on a non-string column raises
# "AttributeError: Can only use .str accessor with string values!"

battles_no_ties = train[~train["winner"].astype(str).str.contains("winner_tie")]

In [None]:
visualize_battle_count(battles_no_ties, "Battle Count for Each Combination of Models (without Ties)")

In [None]:
#https://www.kaggle.com/code/awsaf49/lmsys-kerasnlp-starter

model_train = pd.concat([train.model_a, train.model_b])
counts = model_train.value_counts().reset_index()
counts.columns = ['LLM', 'Count']

# Create a bar plot with custom styling using Plotly
fig = px.bar(counts, x='LLM', y='Count',
             title='Distribution of LLMs',
             color='Count', color_continuous_scale='turbo')

fig.update_layout(xaxis_tickangle=-45)  # Rotate x-axis labels for better readability

fig.show()

## Install Keras

In [None]:
# Install Keras 3 last. See https://keras.io/getting_started/ for more details.
!pip install -q -U keras-nlp datasets
!pip install -q -U keras

In [None]:
#By Solo https://www.kaggle.com/code/guru001/translator-of-tamil-thirukural-old-literature/notebook

# Set the backend before importing Keras
os.environ["KERAS_BACKEND"] = "jax"
# Avoid memory fragmentation on JAX backend.
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "1.00"

import keras_nlp
import keras

# Run at half precision.
#keras.config.set_floatx("bfloat16")

# Training Configurations
token_limit = 256
lora_name = "translator"
lora_rank = 4
lr_value = 1e-4
train_epoch = 5
model_id = "gemma2_instruct_2b_en"

## Tokenizing the Dataset

In [None]:
#By StackOverflow https://stackoverflow.com/questions/53982871/pandas-reading-first-n-rows-from-parquet-file

from pyarrow.parquet import ParquetFile
import pyarrow as pa 

pf = ParquetFile('../input/wsdm-cup-multilingual-chatbot-arena/train.parquet') 
first_eight_rows = next(pf.iter_batches(batch_size = 8)) 
#df = pa.Table.from_batches([first_eight_rows]).to_pandas()

## Memory Issues

I reduced the number of rows read from the parquet file because the notebook was allocating too much memory. With fewer rows, I could not even print train[1].

I also switched to Gemma 2 2B instead of the 9B model.

In [None]:
#By Solo https://www.kaggle.com/code/guru001/translator-of-tamil-thirukural-old-literature/notebook

tokenizer = keras_nlp.models.GemmaTokenizer.from_preset(model_id)
df = pa.Table.from_batches([first_eight_rows]).to_pandas()

# Note: from here on, train is a list of formatted strings, not the DataFrame loaded earlier.
train = []

for i, x in df.iterrows():
    # Gemma chat template: the prompt is the user turn, the response is the model turn.
    item = f"<start_of_turn>user\n{x['prompt']}<end_of_turn>\n<start_of_turn>model\n{x['response_b']}<end_of_turn>"
    length = len(tokenizer(item))
    # skip data if the token length is longer than our limit
    if length < token_limit:
        train.append(item)

print(len(train))
# Print up to the first two examples; fewer may survive the length filter.
for example in train[:2]:
    print(example)
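The formatting and length filter in the cell above can be tried without downloading any model. The sketch below assumes the prompt belongs in the user turn and the response in the model turn. The whitespace counter is only a stand-in for the Gemma tokenizer, so real token counts will differ, and all function names and toy rows are hypothetical.

```python
def format_gemma_turn(user_text, model_text):
    """Wrap one exchange in Gemma's start_of_turn / end_of_turn markers."""
    return (
        f"<start_of_turn>user\n{user_text}<end_of_turn>\n"
        f"<start_of_turn>model\n{model_text}<end_of_turn>"
    )


def whitespace_token_count(text):
    # Stand-in only: a real subword tokenizer would give different counts.
    return len(text.split())


def build_examples(rows, token_limit, count_tokens=whitespace_token_count):
    """Format each row and keep only those below the token limit."""
    kept = []
    for row in rows:
        item = format_gemma_turn(row["prompt"], row["response_b"])
        if count_tokens(item) < token_limit:
            kept.append(item)
    return kept


# Toy rows: the second response is deliberately too long to pass the filter.
toy_rows = [
    {"prompt": "Please be boring", "response_b": "Understood."},
    {"prompt": "Say more", "response_b": "word " * 50},
]
examples = build_examples(toy_rows, token_limit=10)
print(len(examples))
print(examples[0])
```

Passing `count_tokens=lambda s: len(tokenizer(s))` would swap in the real tokenizer from the cell above.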

## Load Gemma and generate before fine-tuning

In [None]:
#By Solo https://www.kaggle.com/code/guru001/translator-of-tamil-thirukural-old-literature/notebook

import time

gemma = keras_nlp.models.GemmaCausalLM.from_preset(model_id)
gemma.summary()

tick_start = 0


def tick():
    global tick_start
    tick_start = time.time()


def tock():
    print(f"TOTAL TIME ELAPSED: {time.time() - tick_start:.2f}s")


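def text_gen(prompt):
    tick()
    # Wrap the prompt in Gemma's chat template and leave the model turn open.
    model_input = f"<start_of_turn>user\n{prompt}<end_of_turn>\n<start_of_turn>model\n"
    output = gemma.generate(model_input, max_length=token_limit)
    print("\nGemma output:")
    print(output)
    # Report the elapsed time for this generation.
    tock()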


text_gen("Please be boring")
text_gen(
    "Alright, I'll be as boring as possible.Today, I woke up at 6:30 AM, just like every other weekday."
)
text_gen(
    "Do you understand? Please be boring."
)
text_gen("Once I arrived at the office, I settled into my cubicle and started working on the same project I've been working on for the past few weeks. The tasks were monotonous and required no creativity or problem-solving skills.")
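The `tick()` and `tock()` helpers above share a module-level global. One alternative, shown here only as a sketch, is a context manager that times whatever runs inside its `with` block. The name `timed` is hypothetical, and `time.sleep` stands in for the call to `gemma.generate`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label="TOTAL TIME ELAPSED"):
    """Time the enclosed block and print the elapsed seconds on exit."""
    start = time.perf_counter()
    result = {}
    try:
        yield result
    finally:
        # Runs even if the block raises, so the timing is always reported.
        result["seconds"] = time.perf_counter() - start
        print(f"{label}: {result['seconds']:.2f}s")


with timed() as t:
    time.sleep(0.05)  # placeholder for a slow call such as text generation
```

The elapsed time stays available afterwards in `t["seconds"]`, which makes it possible to log timings per prompt.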

## LoRA fine-tuning

In [None]:
#By Solo https://www.kaggle.com/code/guru001/translator-of-tamil-thirukural-old-literature/notebook

# Enable LoRA for the model and set the LoRA rank (4, 8 or 16).
gemma.backbone.enable_lora(rank=lora_rank)
gemma.summary()

# Limit the input sequence length (to control memory usage).
gemma.preprocessor.sequence_length = token_limit
# Use AdamW (a common optimizer for transformer models).
optimizer = keras.optimizers.AdamW(
    learning_rate=lr_value,
    weight_decay=0.01,
)
# Exclude layernorm and bias terms from decay.
optimizer.exclude_from_weight_decay(var_names=["bias", "scale"])

gemma.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=optimizer,
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)
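To get a feel for what the LoRA rank controls: for one adapted weight matrix of shape d_out x d_in, LoRA trains two small matrices of shapes d_out x r and r x d_in, which adds r * (d_in + d_out) trainable parameters. The short calculation below uses made-up dimensions, not Gemma's actual layer sizes, and says nothing about which layers `enable_lora` adapts.

```python
def lora_added_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds for one d_out x d_in weight matrix."""
    # Factor A is rank x d_in and factor B is d_out x rank.
    return rank * (d_in + d_out)


# Hypothetical layer size, chosen only for illustration.
d_in, d_out = 2048, 2048
full = d_in * d_out

for rank in (4, 8, 16):
    added = lora_added_params(d_in, d_out, rank)
    print(f"rank={rank:>2}: +{added:,} trainable params "
          f"({added / full:.2%} of the {full:,} frozen weights)")
```

The count grows linearly with the rank, so moving from rank 4 to rank 16 quadruples the trainable parameters per adapted matrix while staying small relative to the frozen weights.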

## Save LoRA weights after each epoch

In [None]:
#By Solo https://www.kaggle.com/code/guru001/translator-of-tamil-thirukural-old-literature/notebook

class CustomCallback(keras.callbacks.Callback):
  def on_epoch_end(self, epoch, logs=None):
    # The file name is the same every epoch, so only the latest LoRA weights are kept.
    model_name = f"/kaggle/working/{lora_name}_{lora_rank}_last.lora.h5"
    gemma.backbone.save_lora_weights(model_name)

    # Evaluate

    text_gen("Please be boring")
    text_gen(
      "I started working on the same project I've been working on for the past few weeks. The tasks were monotonous and required no creativity."
    )
    text_gen(
      "Be boring. Do you understand?"
    )
    text_gen("Indeed, be as boring as possible")

history = gemma.fit(train, epochs=train_epoch, batch_size=1, callbacks=[CustomCallback()])

import matplotlib.pyplot as plt
plt.plot(history.history['loss'])
plt.show()

![](https://m.media-amazon.com/images/I/31drt0p5OGL.jpg)https://www.amazon.sg/Boring-Notebook-Please-make-boring/dp/1688792139

## Acknowledgements

https://colab.research.google.com/drive/1KdwokPjirkTmpO_P1WByFNFiqxWQquwH#scrollTo=B_PYA7oVyaHO

Awsaf/Keras Team https://www.kaggle.com/code/awsaf49/lmsys-kerasnlp-starter

Solo https://www.kaggle.com/code/guru001/translator-of-tamil-thirukural-old-literature/notebook


Work with few parquet rows:
mpwolke https://www.kaggle.com/code/mpwolke/us-coast-guard

mpwolke https://www.kaggle.com/code/mpwolke/by-law-against-the-jungle-russian-mlsummarization