---

<div style="background: linear-gradient(45deg, #fa0cbb, #fa0519); padding: 20px; border-radius: 10px; box-shadow: 0 0 10px rgba(0, 0, 0, 0.5); text-align: center; background-color: green; color: white;">
  <h1 style="font-family: Arial, sans-serif; font-size: 32px; color: white;"> WSDM Cup - Multilingual Chatbot Arena </h1>
</div>


---

# Competition Overview
#### *This competition focuses on predicting user preferences between responses from large language models (LLMs) based on real-world data. Participants will use a dataset from Chatbot Arena where users compare two LLM responses. The goal is to address biases in preference predictions that can affect user satisfaction, such as position and verbosity biases. Various machine-learning techniques are encouraged to develop models that better align LLM outputs with individual user preferences. Successful entries will contribute to more personalized and effective AI-driven conversational systems.*


### **📝 Agenda**  
1. [📂 Import Necessary Libraries](#1)  
2. [📝 Dataset Info](#2)  
   - [Null Values](#2)
   - [Duplicates](#2)
   - [Data Descriptive Statistics](#2)
3. [🚀 Preprocessing](#3)  
   - [Language-Specific Preprocessing](#3)
   - [General Cleaning](#3)
4. [🔍 Feature Engineering](#4)  
   - [Generated Features](#4)
   - [Feature Descriptions](#4)
5. [🧠 Data Visualization and Insights](#5)  
6. [📦 Model Ensemble Learning](#6)  
   - [Model Selection](#6)
   - [Hyperparameter Tuning](#6)
   - [Evaluation Metrics](#6)  
7. [🔗 Conclusion and Future Work](#8)

---



<a id = "1"></a><br>

# <div style="text-align:center; border-radius:15px; padding:15px; margin:0; font-size:100%; font-family:Arial, sans-serif; background-color:#EB6A20; overflow:hidden; box-shadow:0 3px 6px rgba(0, 0, 0, 0.3);"><b> 1. Import Necessary Libraries </b></div>


In [None]:
!pip install langdetect pyvi janome konlpy jieba emoji textblob

In [None]:
import pandas as pd  


from langdetect.lang_detect_exception import LangDetectException 
from janome.tokenizer import Tokenizer as JapaneseTokenizer  
from pyvi.ViTokenizer import tokenize as ViTokenizer 
from langdetect import detect, DetectorFactory  
from sklearn.metrics import accuracy_score
from nltk.stem import WordNetLemmatizer 
import jieba as chinese_tokenizer  
from nltk.corpus import stopwords  
from bs4 import BeautifulSoup  
from textblob import TextBlob  
from konlpy.tag import Okt 

import unicodedata  
import emoji  
import re  

import warnings
warnings.filterwarnings("ignore")

<a id = "2"></a><br>

# <div style="text-align:center; border-radius:15px; padding:15px; margin:0; font-size:100%; font-family:Arial, sans-serif; background-color:#EB6A20; overflow:hidden; box-shadow:0 3px 6px rgba(0, 0, 0, 0.3);"><b> 2. Dataset Info </b></div>


In [None]:
%%time
def display_dataset_info(dataset, name):
    print("-----------------------------------------------------------------")
    print(f"{name} DataFrame Shape: Rows = {dataset.shape[0]}, Columns = {dataset.shape[1]}")
    
    # Numerical and categorical columns information
    num_cols = dataset.select_dtypes(include='number')
    cat_cols = dataset.select_dtypes(exclude='number')
    print(f"{name} DataFrame numeric columns size = {len(num_cols.columns)}, categorical columns size = {len(cat_cols.columns)}")
    
    # Missing values information
    total_missing = dataset.isnull().sum().sum()
    if total_missing > 0:
        missing_perc = (total_missing / (dataset.shape[0] * dataset.shape[1])) * 100
        print(f"There are a total of {total_missing} missing values in the {name} DataFrame ({missing_perc:.2f}% of all values).")
        print("Missing values per column:")
        print(dataset.isnull().sum().sort_values(ascending=False).head(10))
    else:
        print(f"There are no missing values in the {name} DataFrame.")
    
    # Duplicate rows information
    total_duplicates = dataset.duplicated().sum()
    if total_duplicates > 0:
        print(f"There are {total_duplicates} duplicate rows in the {name} DataFrame.")
    else:
        print(f"There are no duplicate rows in the {name} DataFrame.")   
    
    # Check for column data types
    print("\nColumn data types:")
    print(dataset.dtypes.value_counts())   
    
    print("-----------------------------------------------------------------")
    print("<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<")

train = pd.read_parquet('/kaggle/input/wsdm-cup-multilingual-chatbot-arena/train.parquet')
test = pd.read_parquet('/kaggle/input/wsdm-cup-multilingual-chatbot-arena/test.parquet')

In [None]:
%%time
display_dataset_info(train, "Train")  
display_dataset_info(test, "Test")

In [None]:
# A notebook cell only displays its last expression, so print each result explicitly
print(train['language'].value_counts())
print(train['winner'].value_counts())
    

In [None]:
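# Win rates per language and per model; printed explicitly so all three tables are shown
print(train.groupby('language')['winner'].value_counts(normalize=True))
print(train.groupby('model_a')['winner'].value_counts(normalize=True))
print(train.groupby('model_b')['winner'].value_counts(normalize=True))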
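
The overview names position bias and verbosity bias as the two effects to watch for. The sketch below is one rough way to check for them. It runs on a hypothetical toy frame whose column names match `train` but whose rows are made up; `bias_summary` is an illustrative helper, not part of the competition data or any library.

```python
import pandas as pd

# Hypothetical toy battles; in the notebook, `train` would be passed instead
toy = pd.DataFrame({
    "response_a": ["short answer", "a much longer and more detailed answer", "ok", "one two three four"],
    "response_b": ["a longer reply with more words", "brief", "fine then", "one two"],
    "winner": ["model_b", "model_a", "model_b", "model_a"],
})

def bias_summary(df: pd.DataFrame) -> dict:
    """Return simple position-bias and verbosity-bias indicators."""
    len_a = df["response_a"].str.split().str.len()
    len_b = df["response_b"].str.split().str.len()
    a_wins = df["winner"] == "model_a"
    # Position bias: how often the first-listed response wins
    a_win_rate = a_wins.mean()
    # Verbosity bias: how often the longer response wins (equal lengths are skipped)
    differs = len_a != len_b
    longer_wins = ((len_a > len_b) == a_wins)[differs].mean()
    return {"a_win_rate": float(a_win_rate), "longer_win_rate": float(longer_wins)}

summary = bias_summary(toy)
print(summary)
```

An `a_win_rate` far from 0.5 would hint at position bias, and a `longer_win_rate` well above 0.5 would hint at verbosity bias. Both are descriptive checks only.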


<a id = "3"></a><br>

# <div style="text-align:center; border-radius:15px; padding:15px; margin:0; font-size:100%; font-family:Arial, sans-serif; background-color:#EB6A20; overflow:hidden; box-shadow:0 3px 6px rgba(0, 0, 0, 0.3);"><b> 3. Preprocessing </b></div>


In [None]:
DetectorFactory.seed = 0

# Initialize tokenizers
japanese_tokenizer = JapaneseTokenizer()
korean_tokenizer = Okt()

# Preprocess function with extended language support
def preprocess_multilingual(text):
    # Ensure input is a string
    text = str(text).strip()

    # Detect the language of the text
    try:
        lang = detect(text)
    except LangDetectException:
        lang = "unknown"

    # Process text based on detected language
    if lang in ("zh-cn", "zh-tw"):  # Chinese (langdetect reports simplified and traditional separately)
        text = " ".join(chinese_tokenizer.cut(text))
        text = emoji.replace_emoji(text, replace="")
        text = re.sub(r'\d+', '', text)   
        text = re.sub(r'\s+', ' ', text).strip()

    elif lang == "ja":  # Japanese
        text = " ".join(token.surface for token in japanese_tokenizer.tokenize(text))
        text = emoji.replace_emoji(text, replace="")
        text = re.sub(r'\d+', '', text)   
        text = re.sub(r'\s+', ' ', text).strip()

    elif lang == "ko":  # Korean
        text = " ".join(korean_tokenizer.morphs(text))
        text = emoji.replace_emoji(text, replace="")
        text = re.sub(r'\d+', '', text)   
        text = re.sub(r'\s+', ' ', text).strip()

    elif lang == "vi":  # Vietnamese
        text = ViTokenizer(text)
        text = emoji.replace_emoji(text, replace="")
        text = re.sub(r'\d+', '', text)   
        text = re.sub(r'\s+', ' ', text).strip()

    elif lang == "de":  # German
        text = text.lower()
        text = emoji.replace_emoji(text, replace="")
        text = re.sub(r'\d+', '', text) 
        text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        text = re.sub(r'\s+', ' ', text).strip()

    elif lang == "en":  # English
        text = text.lower()
        text = text.replace('%', ' percent ').replace('$', ' dollar ').replace('₹', ' rupee ').replace('€', ' euro ').replace('@', ' at ')
        text = emoji.replace_emoji(text, replace="")
        text = re.sub(r'\d+', '', text)  
        text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        text = BeautifulSoup(text, "html.parser").get_text()
        text = re.sub(r'\W', ' ', text).strip()

    else:
        # For unknown or unsupported languages, keep it simple
        text = emoji.replace_emoji(text, replace="")
        text = re.sub(r'\d+', '', text)   
        text = re.sub(r'\s+', ' ', text).strip()

    return text
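
The German and English branches fold text to ASCII with `unicodedata.normalize('NFKD', ...)` followed by an ASCII encode that ignores errors. A minimal sketch of what that chain does to a few sample words follows. `ascii_fold` is a hypothetical helper name that only wraps the same two calls.

```python
import unicodedata

def ascii_fold(text: str) -> str:
    """Same normalization chain as the German and English branches above."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("utf-8", "ignore")

examples = ["über", "straße", "café", "日本語"]
folded = {word: ascii_fold(word) for word in examples}
print(folded)
```

Accented letters lose their diacritics, while characters with no ASCII decomposition (such as `ß` or CJK characters) are dropped entirely. This is worth keeping in mind if language detection sends non-Latin text into one of these branches.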


In [None]:

def preprocessing(text):
    # Initialize stopwords and lemmatizer
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()

    # Process response A
    preprocessing_text1 = text["response_a"].str.lower()
    preprocessing_text1 = preprocessing_text1.str.replace(r'[^a-zA-Z\s]', '', regex=True)
    preprocessing_text1 = preprocessing_text1.str.split()

    preprocessing_text1 = preprocessing_text1.apply(lambda words: [word for word in words if word not in stop_words])
    preprocessing_text1 = preprocessing_text1.apply(lambda words: [lemmatizer.lemmatize(word) for word in words])

    
    # Process response B
    preprocessing_text2 = text["response_b"].str.lower()
    preprocessing_text2 = preprocessing_text2.str.replace(r'[^a-zA-Z\s]', '', regex=True)
    preprocessing_text2 = preprocessing_text2.str.split()
    preprocessing_text2 = preprocessing_text2.apply(lambda words: [word for word in words if word not in stop_words])
    preprocessing_text2 = preprocessing_text2.apply(lambda words: [lemmatizer.lemmatize(word) for word in words])
     # Join the lists back to strings
    preprocessing_text1 = preprocessing_text1.str.join(' ')
    preprocessing_text2 = preprocessing_text2.str.join(' ')

    return preprocessing_text1, preprocessing_text2

# Optional English-only cleaning (disabled by default):
# train["processed_response_a"], train["processed_response_b"] = preprocessing(train)

In [None]:
%%time

# Clean all three text columns in place so later steps see consistent column names
train["prompt"] = train["prompt"].apply(preprocess_multilingual)
train["response_a"] = train["response_a"].apply(preprocess_multilingual)
train["response_b"] = train["response_b"].apply(preprocess_multilingual)

test["prompt"] = test["prompt"].apply(preprocess_multilingual)
test["response_a"] = test["response_a"].apply(preprocess_multilingual)
test["response_b"] = test["response_b"].apply(preprocess_multilingual)



<a id = "4"></a><br>

# <div style="text-align:center; border-radius:15px; padding:15px; margin:0; font-size:100%; font-family:Arial, sans-serif; background-color:#EB6A20; overflow:hidden; box-shadow:0 3px 6px rgba(0, 0, 0, 0.3);"><b> 4. Feature Engineering </b></div>

In [None]:

def process_responses(train):  
    # Calculate the character length of the prompt  
    train['prompt_length'] = train['prompt'].apply(len)  

    # Calculate the absolute difference in word counts between response_a and response_b  
    train['response_length_diff'] = abs(  
        train['response_a'].apply(lambda x: len(x.split())) - train['response_b'].apply(lambda x: len(x.split()))  
    )  

    # Get the word count of the winning response based on the 'winner' column  
    train['winner_length'] = train.apply(  
        lambda x: len(x['response_a'].split()) if x['winner'] == 'model_a' else len(x['response_b'].split()), axis=1  
    )  

    # Check if the language specified is mentioned in the prompt  
    train['language_match'] = train.apply(lambda x: 1 if x['language'] in x['prompt'] else 0, axis=1)  

    # Compute sentiment polarity of response_a and response_b using TextBlob  
    train['response_a_sentiment'] = train['response_a'].apply(lambda x: TextBlob(x).sentiment.polarity)  
    train['response_b_sentiment'] = train['response_b'].apply(lambda x: TextBlob(x).sentiment.polarity)  

    # Create binary flags to indicate whether model_a or model_b was the winner  
    train['model_a_winner'] = train['winner'].apply(lambda x: 1 if x == 'model_a' else 0)  
    train['model_b_winner'] = train['winner'].apply(lambda x: 1 if x == 'model_b' else 0)  

    # Check if the prompt contains a question mark ('?')  
    train['contains_question'] = train['prompt'].apply(lambda x: 1 if '?' in x else 0)  

    # Combine model_a and model_b values to create a unique identifier for the model pair  
    train['model_pair'] = train['model_a'] + "_" + train['model_b']  

    # Calculate the ratio of unique words to total words for response_a and response_b  
    train['response_a_unique_ratio'] = train['response_a'].apply(lambda x: len(set(x.split())) / len(x.split()) if len(x.split()) > 0 else 0)  
    train['response_b_unique_ratio'] = train['response_b'].apply(lambda x: len(set(x.split())) / len(x.split()) if len(x.split()) > 0 else 0)  

    # Check if either response_a or response_b contains code blocks (indicated by triple backticks ```)  
    train['contains_code'] = train.apply(  
        lambda x: 1 if '```' in x['response_a'] or '```' in x['response_b'] else 0, axis=1  
    )  

    return train
train = process_responses(train)
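
To make two of the generated features concrete, here is a small sketch on plain strings. `unique_ratio` mirrors the unique-word ratio above. `signed_length_diff` is a hypothetical variant of `response_length_diff` that keeps the sign, which may be useful because the absolute value hides which response was longer.

```python
def word_count(text: str) -> int:
    """Whitespace-based word count, as used in the features above."""
    return len(text.split())

def unique_ratio(text: str) -> float:
    """Share of distinct words; 0.0 for empty text to avoid division by zero."""
    words = text.split()
    return len(set(words)) / len(words) if words else 0.0

def signed_length_diff(response_a: str, response_b: str) -> int:
    """Hypothetical variant: positive when response A is longer, negative otherwise."""
    return word_count(response_a) - word_count(response_b)

a = "the cat sat on the mat"
b = "hello"
print(unique_ratio(a))
print(unique_ratio(""))
print(signed_length_diff(a, b), signed_length_diff(b, a))
```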

In [None]:
%%time
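numeric_cols = train.select_dtypes(include=["number"]).columns
# Apply the IQR rule only to continuous features: binary 0/1 flags often have
# IQR = 0, which would drop every row in the minority class
continuous_cols = [col for col in numeric_cols if train[col].nunique() > 2]
# Calculate Q1 (25th percentile) and Q3 (75th percentile)
Q1 = train[continuous_cols].quantile(0.25)
Q3 = train[continuous_cols].quantile(0.75)
# Calculate IQR (Interquartile Range)
IQR = Q3 - Q1
# Define lower and upper bounds for outlier removal
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Filter out rows where any continuous value is outside the bounds
is_outlier = ((train[continuous_cols] < lower_bound) | (train[continuous_cols] > upper_bound)).any(axis=1)
# .copy() so later column assignments do not write into a view of `train`
train_no_outliers = train[~is_outlier].copy()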
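
The IQR rule flags any value outside `[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]`. The toy sketch below, with made-up numbers and a hypothetical helper `iqr_outlier_mask`, shows the rule on a length-like column and on a mostly-zero binary flag.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

lengths = pd.Series([1, 2, 3, 4, 100])  # one extreme length
flag = pd.Series([0, 0, 0, 0, 1])       # rare binary flag

lengths_mask = iqr_outlier_mask(lengths).tolist()
flag_mask = iqr_outlier_mask(flag).tolist()
print(lengths_mask)  # [False, False, False, False, True]
print(flag_mask)     # [False, False, False, False, True]
```

For the flag, Q1 and Q3 are both 0, so the bounds collapse to 0 and the single `1` is treated as an outlier. That is why the rule suits continuous features better than 0/1 indicators.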

<a id = "5"></a><br>

# <div style="text-align:center; border-radius:15px; padding:15px; margin:0; font-size:100%; font-family:Arial, sans-serif; background-color:#EB6A20; overflow:hidden; box-shadow:0 3px 6px rgba(0, 0, 0, 0.3);"><b> 5. Visualizing Data Insights </b></div>

In [None]:
import matplotlib.pyplot as plt
import plotly.express as px  
import seaborn as sns


In [None]:
language_counts = train_no_outliers['language'].value_counts().reset_index()  
language_counts.columns = ['language', 'count']  

# Create the bar plot  
fig = px.bar(language_counts, x='language', y='count',   
             title='Language Count',   
             labels={'language': 'Language', 'count': 'Count'},  
             width=1800, height=600)  

# Show the figure  
fig.show()

In [None]:
# Plot the distribution of sentiment scores for response_a and response_b
fig = px.histogram(train_no_outliers, x="response_a_sentiment", nbins=30, color="winner", 
                   title="Distribution of Sentiment Scores for Response A", 
                   labels={"response_a_sentiment": "Sentiment Score", "winner": "Winner"})
fig.show()

fig = px.histogram(train_no_outliers, x="response_b_sentiment", nbins=30, color="winner", 
                   title="Distribution of Sentiment Scores for Response B", 
                   labels={"response_b_sentiment": "Sentiment Score", "winner": "Winner"})
fig.show()


In [None]:
fig = px.scatter(train_no_outliers, x='response_a_sentiment', y='response_length_diff', color='winner', 
                 title="Response A Sentiment vs. Response Length Difference", 
                 labels={"response_a_sentiment": "Response A Sentiment", 
                         "response_length_diff": "Response Length Difference", "winner": "Winner"})
fig.show()

fig = px.scatter(train_no_outliers, x='response_b_sentiment', y='response_length_diff', color='winner', 
                 title="Response B Sentiment vs. Response Length Difference", 
                 labels={"response_b_sentiment": "Response B Sentiment", 
                         "response_length_diff": "Response Length Difference", "winner": "Winner"})
fig.show()


In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Mean prompt length for each winner class (the correlation heatmap is drawn in a later cell)
barplot = sns.barplot(y=train_no_outliers["prompt_length"], x=train_no_outliers["winner"], palette="Paired", ax=axes[0, 0])

axes[0, 0].set_title("Mean Prompt Length by Winner")

sns.boxplot(data=train_no_outliers, x='winner', y='response_length_diff', palette='Set2', ax=axes[0, 1])
axes[0, 1].set_title("Response Length Difference by Winner")

sns.histplot(train_no_outliers['response_a_sentiment'], kde=True, color='skyblue', ax=axes[1, 0])
axes[1, 0].set_title("Distribution of Response A Sentiment")

sns.histplot(train_no_outliers['response_b_sentiment'], kde=True, color='salmon', ax=axes[1, 1])
axes[1, 1].set_title("Distribution of Response B Sentiment")

plt.tight_layout()
plt.show()


In [None]:
# Create subplots  
fig, ax = plt.subplots(1, 2, figsize=(15, 5))  

# Bar plot on the first subplot  
barplot = sns.barplot(x=train_no_outliers["language"].value_counts().nlargest(5).index,   
                      y=train_no_outliers["language"].value_counts().nlargest(5).values,   
                      palette="Paired", ax=ax[0])

ax[0].set_xlabel('Language')  
ax[0].set_ylabel('Count')  
ax[0].set_title('Top 5 Languages by Count')  
ax[0].tick_params(axis='x', rotation=90)  
ax[0].grid(axis='x', linestyle='--', alpha=0.7, color="black")  

# Pie chart on the second subplot  
train_no_outliers['winner'].value_counts().plot(kind='pie', autopct='%1.1f%%', ax=ax[1], startangle=90)  
ax[1].set_ylabel('')  # Hide the y-label for the pie chart  
ax[1].set_title('Winner Distribution')  

# Adjust layout  
plt.tight_layout()  
plt.show()

In [None]:
# Calculate correlations between numeric features  
corr = train[numeric_cols].corr()  

# Create the heatmap  
plt.figure(figsize=(10, 6))  
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)  
# Rotate the x-axis labels so long feature names stay readable  
plt.xticks(rotation=45, ha="right")  
plt.title("Correlation Heatmap of Features")  
plt.show()

<a id = "6"></a><br>

# <div style="text-align:center; border-radius:15px; padding:15px; margin:0; font-size:100%; font-family:Arial, sans-serif; background-color:#EB6A20; overflow:hidden; box-shadow:0 3px 6px rgba(0, 0, 0, 0.3);"><b> 6. Model Ensemble Learning </b></div>

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
import lightgbm as lgb
  

In [None]:
lgb_param_grid = {  
    'n_estimators': [100, 200],  
    'learning_rate': [0.01, 0.05],  
}  

xgb_param_grid = {  
    'n_estimators': [100, 200],  
    'learning_rate': [0.01, 0.05],  
    'max_depth': [3, 5],  
}  

catboost_param_grid = {  
    'iterations': [500, 1000],  
    'learning_rate': [0.01, 0.05],  
}  

rf_param_grid = {  
    'n_estimators': [100, 200],  
    'max_depth': [10, 20],  
}
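
Before running the searches it can help to estimate how many model fits the grids imply. The sketch below copies the grids so it is self-contained and assumes the `cv=3` setting used in the `GridSearchCV` calls in the next cell. The refit of each best model adds one more fit per search.

```python
from math import prod

# Same grids as above, copied here so the block runs on its own
grids = {
    "lgb": {"n_estimators": [100, 200], "learning_rate": [0.01, 0.05]},
    "xgb": {"n_estimators": [100, 200], "learning_rate": [0.01, 0.05], "max_depth": [3, 5]},
    "catboost": {"iterations": [500, 1000], "learning_rate": [0.01, 0.05]},
    "rf": {"n_estimators": [100, 200], "max_depth": [10, 20]},
}
cv_folds = 3  # assumed to match cv=3 in the GridSearchCV calls

# Number of parameter combinations per model
combos = {name: prod(len(values) for values in grid.values()) for name, grid in grids.items()}
# Each combination is fitted once per fold
total_cv_fits = sum(n * cv_folds for n in combos.values())

print(combos)
print(total_cv_fits)
```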

In [None]:
%%time

# Encode the low-cardinality categorical columns; each column gets its own encoder
label_columns = ['language', 'model_a', 'model_b', 'model_pair']  # Add more columns if necessary
for col in label_columns:
    train_no_outliers[col] = LabelEncoder().fit_transform(train_no_outliers[col])

# Encode the target as integers (XGBoost expects integer class labels)
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(train_no_outliers["winner"])

# Drop the target, the id, and the features derived from the target (label leakage),
# then keep numeric columns only so the raw text columns are excluded
leak_columns = ["winner_length", "model_a_winner", "model_b_winner"]
X = train_no_outliers.drop(columns=["winner", "id"] + leak_columns).select_dtypes(include="number")

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 1. Define individual models
lgb_model = lgb.LGBMClassifier()
xgb_model = XGBClassifier()
catboost_model = CatBoostClassifier(silent=True)
rf_model = RandomForestClassifier(random_state=42)

# 2. Perform hyperparameter tuning
lgb_search = GridSearchCV(lgb_model, lgb_param_grid, scoring='accuracy', cv=3, n_jobs=-1, verbose=0)
xgb_search = GridSearchCV(xgb_model, xgb_param_grid, scoring='accuracy', cv=3, n_jobs=-1, verbose=0)
catboost_search = GridSearchCV(catboost_model, catboost_param_grid, scoring='accuracy', cv=3, n_jobs=-1, verbose=0)
rf_search = GridSearchCV(rf_model, rf_param_grid, scoring='accuracy', cv=3, n_jobs=-1, verbose=0)

# Fit the searches; all four use the scaled features so they match the ensemble's input
lgb_search.fit(X_train_scaled, y_train)
xgb_search.fit(X_train_scaled, y_train)
catboost_search.fit(X_train_scaled, y_train)
rf_search.fit(X_train_scaled, y_train)

# Extract best models
best_lgb_model = lgb_search.best_estimator_
best_xgb_model = xgb_search.best_estimator_
best_catboost_model = catboost_search.best_estimator_
best_rf_model = rf_search.best_estimator_

# 3. Create an ensemble model using VotingClassifier
ensemble_model = VotingClassifier(estimators=[
    ('lgb', best_lgb_model),
    ('xgb', best_xgb_model),
    ('catboost', best_catboost_model),
    ('rf', best_rf_model)
], voting='soft')  # Use 'soft' voting for probabilities or 'hard' for majority voting

# 4. Train the ensemble model (VotingClassifier refits a clone of each estimator)
ensemble_model.fit(X_train_scaled, y_train)

# 5. Make predictions
ensemble_predictions = ensemble_model.predict(X_test_scaled)
ensemble_probabilities = ensemble_model.predict_proba(X_test_scaled)  # For probabilities (optional)
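
The difference between `voting='soft'` and `voting='hard'` is easiest to see on a single sample. The sketch below uses made-up probabilities for three hypothetical models. It illustrates the two rules by hand and does not call `VotingClassifier` itself.

```python
import numpy as np

# Hypothetical class probabilities for one sample, one row per model: [P(class 0), P(class 1)]
probas = np.array([
    [0.90, 0.10],
    [0.40, 0.60],
    [0.45, 0.55],
])

# Soft voting: average the probabilities, then pick the most probable class
soft_scores = probas.mean(axis=0)
soft_choice = int(soft_scores.argmax())

# Hard voting: each model votes for its own top class, and the majority wins
votes = probas.argmax(axis=1)
hard_choice = int(np.bincount(votes, minlength=2).argmax())

print(soft_scores, soft_choice, hard_choice)
```

Here one very confident model outweighs two lukewarm ones under soft voting (class 0), while hard voting follows the two-to-one majority (class 1).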


In [None]:
# Evaluate the model on training data
train_predictions = ensemble_model.predict(X_train_scaled)
train_accuracy = accuracy_score(y_train, train_predictions)

# Evaluate the model on test data
test_predictions = ensemble_model.predict(X_test_scaled)
test_accuracy = accuracy_score(y_test, test_predictions)

In [None]:
# Print performance metrics
print(f"Training Accuracy: {train_accuracy:.2f}")
print(f"Test Accuracy: {test_accuracy:.2f}")

# Check for overfitting/underfitting
if train_accuracy > test_accuracy + 0.1:
    print("The model is likely overfitting.")
elif train_accuracy < 0.7 and test_accuracy < 0.7:
    print("The model is likely underfitting.")
else:
    print("The model has a good balance between bias and variance.")

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix  

# Evaluate the model's performance  
accuracy = accuracy_score(y_test, ensemble_predictions)  
report = classification_report(y_test, ensemble_predictions)  
conf_matrix = confusion_matrix(y_test, ensemble_predictions)  

# Print out evaluation metrics  
print(f"Accuracy: {accuracy}")  
print("Classification Report:")  
print(report)  
print("Confusion Matrix:")  
print(conf_matrix)  
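
The 0.7 thresholds above are rules of thumb. One way to put the accuracy in context is to compare it with a majority-class baseline. The following is a minimal sketch with made-up labels; `majority_baseline_accuracy` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels) -> float:
    """Accuracy of always predicting the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(label == majority for label in test_labels) / len(test_labels)

# Toy labels for illustration only
y_train_toy = ["model_a", "model_b", "model_a", "model_a"]
y_test_toy = ["model_a", "model_b", "model_b", "model_a"]

baseline = majority_baseline_accuracy(y_train_toy, y_test_toy)
print(baseline)
```

On a roughly balanced two-class target the baseline sits near 0.5, so a test accuracy only slightly above that suggests the features carry little signal.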
