# WSDM Cup - Multilingual Chatbot Arena: All you need is a decision tree

The Multilingual Chatbot Arena competition "challenges you to predict which responses users will prefer in a head-to-head battle between chatbots powered by large language models."

This notebook shows what score a decision tree can reach using a single feature: the difference between the lengths of the two responses.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, plot_tree

In [None]:
train = pd.read_parquet('/kaggle/input/wsdm-cup-multilingual-chatbot-arena/train.parquet')
test = pd.read_parquet('/kaggle/input/wsdm-cup-multilingual-chatbot-arena/test.parquet')


We compute the single feature `len_b-len_a` as the difference in length (in characters) between the two response strings. A positive value means response B is longer.

In [None]:
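def add_features(df):
    """Add response lengths (in characters) and their difference, in place."""
    df['len_a'] = df.response_a.str.len()
    df['len_b'] = df.response_b.str.len()
    # Positive values mean response B is longer than response A
    df['len_b-len_a'] = df['len_b'] - df['len_a']
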
add_features(train)
add_features(test)
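
To make the feature concrete, here is a minimal sketch on two made-up rows. The toy texts are hypothetical; only the column names `response_a` and `response_b` come from the notebook. It also illustrates that `.str.len()` counts characters rather than bytes, which matters for multilingual text.

```python
import pandas as pd

# Hypothetical toy rows; only the column names match the competition data.
toy = pd.DataFrame({
    "response_a": ["Short answer.", "Grüße aus Köln"],
    "response_b": ["A somewhat longer answer.", "Hi"],
})

# .str.len() counts characters (Unicode code points), not bytes or tokens.
toy["len_a"] = toy["response_a"].str.len()
toy["len_b"] = toy["response_b"].str.len()
toy["len_b-len_a"] = toy["len_b"] - toy["len_a"]

# Byte length differs from character length for non-ASCII text.
utf8_bytes = len("Grüße aus Köln".encode("utf-8"))

print(toy[["len_a", "len_b", "len_b-len_a"]])
print("UTF-8 bytes in second response_a:", utf8_bytes)
```

In the first row response B is longer, so the feature is positive; in the second row it is negative.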


We cross-validate decision trees of depth 1 through 6. Depth 3 gives the best CV score, so we refit a depth-3 tree on the full training set and plot it.

In [None]:
# Cross-validate
for max_depth in range(1, 7):
    model = DecisionTreeClassifier(max_depth=max_depth)
    print(f"CV Accuracy with {max_depth=}: "
          f"{cross_val_score(model, train[['len_b-len_a']], train['winner'], scoring='accuracy').mean():.3f}")

# Refit and plot the tree
model = DecisionTreeClassifier(max_depth=3)
model.fit(train[['len_b-len_a']], train['winner'])
plt.figure(figsize=(15, 6))
plot_tree(model, ax=plt.gca(), filled=True, impurity=False, proportion=True,
          class_names=model.classes_, feature_names=model.feature_names_in_)
plt.show()
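
Instead of reading the best depth off the printed scores, the selection can also be done programmatically. The following is only a sketch under stated assumptions: the helper `cv_accuracy_by_depth` is a hypothetical name, and the data is a synthetic stand-in (not the competition data) in which the longer response wins 60 % of the time.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def cv_accuracy_by_depth(X, y, depths, cv=5):
    """Return {max_depth: mean cross-validated accuracy} for a decision tree."""
    scores = {}
    for depth in depths:
        model = DecisionTreeClassifier(max_depth=depth, random_state=0)
        scores[depth] = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    return scores


# Synthetic stand-in data (an assumption, not the competition data).
rng = np.random.default_rng(0)
diff = rng.normal(0, 1000, size=2000)      # plays the role of len_b - len_a
longer_is_b = diff > 0
flip = rng.random(2000) > 0.6              # the longer response loses 40 % of the time
winner = np.where(longer_is_b ^ flip, "model_a", "model_b")

scores = cv_accuracy_by_depth(diff.reshape(-1, 1), winner, depths=range(1, 7))
best_depth = max(scores, key=scores.get)   # depth with the highest mean accuracy
print(scores)
print("best depth:", best_depth)
```

On the real data one would pass `train[['len_b-len_a']]` and `train['winner']` instead of the synthetic arrays.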

# Interpretation

The decision tree model can be summarized in the following statements:

- `len_b < len_a - 8430`: Model B wins because the response of A is too long
- `len_a - 8430 ≤ len_b < len_a - 15`: Model A wins because its response is longer
- `len_a - 15 ≤ len_b < len_a + 8644`: Model B wins because its response is about the same length or longer
- `len_a + 8644 ≤ len_b`: Model A wins because the response of B is too long
  
In other words:
> - The longer response wins, unless the difference is too large: if it exceeds roughly 8500 characters, the shorter response wins.
> - If the response lengths are almost the same, response B wins.

Note that the extreme cases, where one response is far longer than the other, are rare: they make up only 0.5 % of the samples at either end of the scale.
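
The four rules can also be written down without a fitted model. The sketch below hard-codes the thresholds quoted in the list above; the function names are hypothetical, the labels `model_a` and `model_b` are assumed to match the values of the `winner` column, and a refit tree may yield slightly different thresholds.

```python
import numpy as np

# Thresholds on diff = len_b - len_a, taken from the rules above (assumed fixed here).
LOWER, NEAR, UPPER = -8430, -15, 8644


def rule_based_winner(len_a, len_b):
    """Apply the four length rules; the first matching condition decides."""
    diff = np.asarray(len_b) - np.asarray(len_a)
    return np.select(
        [diff < LOWER, diff < NEAR, diff < UPPER],
        ["model_b", "model_a", "model_b"],
        default="model_a",  # diff >= UPPER: response B is too long
    )


def extreme_share(len_a, len_b):
    """Fraction of samples below the lower and at or above the upper threshold."""
    diff = np.asarray(len_b) - np.asarray(len_a)
    return float(np.mean(diff < LOWER)), float(np.mean(diff >= UPPER))


# Made-up lengths covering each rule once, plus a small positive difference.
len_a = [100, 10000, 500, 500, 100]
len_b = [9000, 100, 300, 495, 120]
predictions = rule_based_winner(len_a, len_b)
shares = extreme_share(len_a, len_b)
print(predictions, shares)
```

Calling `extreme_share(train['len_a'], train['len_b'])` on the real data would be one way to check the share of extreme cases at each end.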

# Submission

In [None]:
submission = pd.DataFrame({'id': test.id,
                           'winner': model.predict(test[['len_b-len_a']])})
submission.to_csv('submission.csv', index=False)
!head submission.csv