# Introduction

In this notebook, we will employ the Elo rating system to evaluate the performance of large language models (LLMs).
The analysis is based on the pairwise battle results we collected from https://arena.lmsys.org between April 24 and May 22, 2023.
This crowdsourced data collection reflects some real-world use cases of LLMs.
Below, we present the calculation procedure along with some basic analyses.





In [None]:
from collections import defaultdict
import json, math, gdown
import numpy as np
import pandas as pd
import plotly.express as px
from tqdm import tqdm
pd.options.display.float_format = '{:.2f}'.format

# Obtaining and Cleaning the Tournament Data
We host the initial tournament results as a JSON file on Google Drive and download it with the `gdown` package. The data contains only voting results, without conversation histories, because releasing the conversations would raise privacy and toxicity concerns.

In [None]:
url = "https://drive.google.com/file/d/1gjs-APnGZjw8vmN5pwykV2SukS0KYE-z/view?usp=share_link"
filename = gdown.download(url, quiet=False, fuzzy=True)

Downloading...
From: https://drive.google.com/uc?id=1gjs-APnGZjw8vmN5pwykV2SukS0KYE-z
To: /content/clean_battle_20230522.json
100%|██████████| 8.39M/8.39M [00:00<00:00, 35.9MB/s]


In [None]:
raw_data = pd.read_json(filename).sort_values(ascending=True, by=["tstamp"])
raw_data

Unnamed: 0,model_a,model_b,win,anony,rounds,language,tstamp
0,alpaca-13b,dolly-v2-12b,model_b,False,2,English,1682295238.71
1,koala-13b,dolly-v2-12b,model_b,False,1,English,1682295376.63
2,koala-13b,dolly-v2-12b,model_b,False,2,English,1682295408.08
3,koala-13b,dolly-v2-12b,model_b,False,3,English,1682295434.13
4,vicuna-13b,koala-13b,model_a,False,1,English,1682295451.25
...,...,...,...,...,...,...,...
45094,vicuna-7b,mpt-7b-chat,tie (bothbad),True,1,French,1684773225.74
45095,RWKV-4-Raven-14B,claude-instant-v1,model_b,True,1,French,1684773284.15
45096,claude-v1,claude-instant-v1,model_b,True,1,English,1684773310.18
45097,vicuna-13b,claude-instant-v1,model_a,True,1,English,1684773317.56


## Removing the Non-Anonymous Pairings
The data also includes battles from the side-by-side chat web interface, where users could see (and select) which models were paired. These battles were **not anonymous**.
We include this data for others to study, but it was not part of the tournament, so we drop these records from the remaining analysis.
In addition, we mark all conversations in which a model reveals its own identity (e.g., a model says its name is Vicuna) as not anonymous.

In [None]:
raw_data['anony'].value_counts()

True     27016
False    18083
Name: anony, dtype: int64

In [None]:
battles = raw_data[raw_data['anony']].reset_index(drop=True)
battles

Unnamed: 0,model_a,model_b,win,anony,rounds,language,tstamp
0,chatglm-6b,koala-13b,model_b,True,1,English,1682351591.13
1,oasst-pythia-12b,alpaca-13b,tie,True,1,English,1682351654.67
2,koala-13b,oasst-pythia-12b,model_b,True,1,English,1682351708.94
3,vicuna-13b,oasst-pythia-12b,model_b,True,1,English,1682351785.19
4,vicuna-13b,koala-13b,model_a,True,1,English,1682351891.66
...,...,...,...,...,...,...,...
27011,oasst-pythia-12b,gpt-3.5-turbo,model_b,True,1,French,1684773194.12
27012,vicuna-7b,mpt-7b-chat,tie (bothbad),True,1,French,1684773225.74
27013,RWKV-4-Raven-14B,claude-instant-v1,model_b,True,1,French,1684773284.15
27014,claude-v1,claude-instant-v1,model_b,True,1,English,1684773310.18


# Exploratory Analysis

Before computing the Elo ratings, we first conduct some basic exploratory analysis to highlight a few key properties and caveats of this data.

## Significant Number of Ties

We allowed users to declare a tie between the two models. To collect additional data, later in the tournament we also allowed users to declare a tie in which both models were bad. There were a significant number of tied outcomes.

In [None]:
fig = px.bar(battles["win"].value_counts(),
             title="Counts of Battle Outcomes", text_auto=True, height=400)
fig.update_layout(xaxis_title="Battle Outcome", yaxis_title="Count",
                  showlegend=False)
fig

In [None]:
battles_no_ties = battles[~battles["win"].str.contains("tie")]

## Non-uniform Model Frequency

The model frequency is not uniform for the following reasons:
- Several different matching and sampling algorithms were used. We employed uniform sampling as well as weighted sampling methods, which assign greater weights to better models.
- Some new models were added later.


In [None]:
fig = px.bar(pd.concat([battles["model_a"], battles["model_b"]]).value_counts(),
             title="Battle Count for Each Model", text_auto=True)
fig.update_layout(xaxis_title="model", yaxis_title="Battle Count", height=400,
                  showlegend=False)
fig

We examine the number of pairings for each combination of models.

In [None]:
def visualize_battle_count(battles, title):
    ptbl = pd.pivot_table(battles, index="model_a", columns="model_b", aggfunc="size",
                          fill_value=0)
    battle_counts = ptbl + ptbl.T
    ordering = battle_counts.sum().sort_values(ascending=False).index
    fig = px.imshow(battle_counts.loc[ordering, ordering],
                    title=title, text_auto=True, width=600)
    fig.update_layout(xaxis_title="Model B",
                      yaxis_title="Model A",
                      xaxis_side="top", height=600, width=600,
                      title_y=0.07, title_x=0.5)
    fig.update_traces(hovertemplate=
                      "Model A: %{y}<br>Model B: %{x}<br>Count: %{z}<extra></extra>")
    return fig

fig = visualize_battle_count(battles, title="Battle Count of Each Combination of Models")
fig

### Battles Excluding Ties

In [None]:
visualize_battle_count(battles_no_ties, "Battle Count for Each Combination of Models (without Ties)")

### Counting Ties

In [None]:
visualize_battle_count(battles[battles['win'].str.contains("tie")], "Tie Count for Each Combination of Models")

## Inferred Language

We also inferred the language of each conversation using the `polyglot` package. This is only an estimate, but it will help guide future analysis. The vast majority of conversations were in English.

In [None]:
topk = 15
fig = px.bar(battles["language"].value_counts().head(topk),
             title=f"Battle Counts for the Top {topk} Languages",
             text_auto=True, height=400)
fig.update_layout(xaxis_title="Language", yaxis_title="Count", showlegend=False)
fig

## Number of Conversation Rounds

We also noticed that most conversations have only one round.

In [None]:
fig = px.histogram(battles["rounds"],
             title=f"Number of Conversation Rounds",
             text_auto=True, height=400)
fig.update_layout(xaxis_title="Rounds", yaxis_title="Count", showlegend=False)
fig

## Pairwise Win Fractions

Finally, we compute the pairwise win fractions. Because each model can appear as either Model A or Model B and can win in both positions, we sum its wins across both configurations and divide by the total number of battles between the two models.
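As a minimal sketch of this bookkeeping, the toy example below uses plain Python on a handful of made-up battles (the model names `x` and `y` and the `toy_battles` list are hypothetical, not part of the Arena data). It counts wins regardless of which seat the winner occupied and divides by the number of battles between the two models.

```python
from collections import Counter

# Hypothetical toy battles: (model_a, model_b, winner), ties already removed.
toy_battles = [
    ("x", "y", "model_a"),  # x wins as Model A
    ("x", "y", "model_b"),  # y wins as Model B
    ("y", "x", "model_b"),  # x wins as Model B
    ("y", "x", "model_b"),  # x wins as Model B
]

wins = Counter()      # wins[(winner, loser)]
pairings = Counter()  # pairings[{m1, m2}], independent of seat order

for model_a, model_b, win in toy_battles:
    winner, loser = (model_a, model_b) if win == "model_a" else (model_b, model_a)
    wins[(winner, loser)] += 1
    pairings[frozenset((model_a, model_b))] += 1

def win_fraction(row, col):
    """Fraction of row-vs-col battles won by `row`, in either seat."""
    return wins[(row, col)] / pairings[frozenset((row, col))]

print(win_fraction("x", "y"))  # x won 3 of 4 battles -> 0.75
print(win_fraction("y", "x"))  # y won 1 of 4 battles -> 0.25
```

The pivot-table implementation in the next cell performs the same computation for all model pairs at once.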

In [None]:
def compute_pairwise_win_fraction(battles):
    # Times each model wins as Model A
    a_win_ptbl = pd.pivot_table(
        battles[battles['win'] == "model_a"],
        index="model_a", columns="model_b", aggfunc="size", fill_value=0)

    # Table counting times each model wins as Model B
    b_win_ptbl = pd.pivot_table(
        battles[battles['win'] == "model_b"],
        index="model_a", columns="model_b", aggfunc="size", fill_value=0)

    # Table counting number of A-B pairs
    num_battles_ptbl = pd.pivot_table(battles,
        index="model_a", columns="model_b", aggfunc="size", fill_value=0)

    # Computing the proportion of wins for each model as A and as B
    # against all other models
    row_beats_col_freq = (
        (a_win_ptbl + b_win_ptbl.T) /
        (num_battles_ptbl + num_battles_ptbl.T)
    )

    # Arrange ordering according to proportion of wins
    prop_wins = row_beats_col_freq.mean(axis=1).sort_values(ascending=False)
    model_names = list(prop_wins.keys())
    row_beats_col = row_beats_col_freq.loc[model_names, model_names]
    return row_beats_col

def visualize_pairwise_win_fraction(battles, title):
    row_beats_col = compute_pairwise_win_fraction(battles)
    fig = px.imshow(row_beats_col, color_continuous_scale='RdBu',
                    text_auto=".2f", title=title)
    fig.update_layout(xaxis_title="Model B: Loser",
                  yaxis_title="Model A: Winner",
                  xaxis_side="top", height=600, width=600,
                  title_y=0.07, title_x=0.5)
    fig.update_traces(hovertemplate=
                  "Model A: %{y}<br>Model B: %{x}<br>Fraction of A Wins: %{z}<extra></extra>")

    return fig

In [None]:
fig = visualize_pairwise_win_fraction(battles_no_ties,
      title = "Fraction of Model A Wins for All Non-tied A vs. B Battles")
fig

## Preliminary Ranking

Using just the average win rate against all other models we can already compute an estimated leaderboard.
However, this method may not be as scalable as the Elo rating system that we will use later because this method requires data from all model combinations.

In [None]:
row_beats_col_freq = compute_pairwise_win_fraction(battles_no_ties)
fig = px.bar(row_beats_col_freq.mean(axis=1).sort_values(ascending=False),
             title="Average Win Rate Against All Other Models (Assuming Uniform Sampling and No Ties)",
             text_auto=".2f")
fig.update_layout(yaxis_title="Average Win Rate", xaxis_title="Model",
                  showlegend=False)
fig.write_html("average_win_rate.html", full_html=False, include_plotlyjs="cdn")
fig

# Elo Ratings

The [Elo rating system](https://en.wikipedia.org/wiki/Elo_rating_system) is a method for calculating the relative skill levels of players, which has been widely adopted in chess and other competitive games. The difference in the ratings between two players serves as a predictor of the outcome of a match. The Elo rating system works well for our case because we have multiple models and we run pairwise battles between them.
In this section, we present different methods for calculating Elo ratings.

### Compute Ratings
We first use the online linear update algorithm to compute Elo ratings.
We choose a small K-factor of 4 to make the Elo ratings more stable and less biased towards recent games.
This is the algorithm we used for our default leaderboard.
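
For reference, the per-battle update can be written out as below. The symbols are notation introduced here for illustration: $R_A$ and $R_B$ are the current ratings, $E_A$ and $E_B$ the expected scores, and $S_A$ the actual score of Model A (1 for a win, 0 for a loss, 0.5 for either kind of tie). The base 10 and scale 400 are the defaults of `compute_elo` in the next cell.

$$
E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}, \qquad
E_B = \frac{1}{1 + 10^{(R_A - R_B)/400}} = 1 - E_A
$$

$$
R_A \leftarrow R_A + K\,(S_A - E_A), \qquad
R_B \leftarrow R_B + K\,\bigl((1 - S_A) - E_B\bigr)
$$

With $K = 4$, a single battle between two equally rated models ($E_A = E_B = 0.5$) moves each rating by at most 2 points, which is why a small K-factor yields stable ratings.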

In [None]:
def compute_elo(battles, K=4, SCALE=400, BASE=10, INIT_RATING=1000):
    rating = defaultdict(lambda: INIT_RATING)

    for rd, model_a, model_b, win in battles[['model_a', 'model_b', 'win']].itertuples():
        ra = rating[model_a]
        rb = rating[model_b]
        ea = 1 / (1 + BASE ** ((rb - ra) / SCALE))
        eb = 1 / (1 + BASE ** ((ra - rb) / SCALE))
        if win == "model_a":
            sa = 1
        elif win == "model_b":
            sa = 0
        elif win == "tie" or win == "tie (bothbad)":
            sa = 0.5
        else:
            raise Exception(f"unexpected vote {win}")
        rating[model_a] += K * (sa - ea)
        rating[model_b] += K * (1 - sa - eb)

    return rating

In [None]:
def preety_print_elo_ratings(ratings):
    # Build the table from the `ratings` argument rather than a global variable
    df = pd.DataFrame([
        [n, ratings[n]] for n in ratings.keys()
    ], columns=["Model", "Elo rating"]).sort_values("Elo rating", ascending=False).reset_index(drop=True)
    # Round to the nearest integer and start the ranking at 1
    df["Elo rating"] = (df["Elo rating"] + 0.5).astype(int)
    df.index = df.index + 1
    return df

elo_ratings = compute_elo(battles)
preety_print_elo_ratings(elo_ratings)

Unnamed: 0,Model,Elo rating
1,gpt-4,1225
2,claude-v1,1195
3,claude-instant-v1,1153
4,gpt-3.5-turbo,1143
5,vicuna-13b,1054
6,palm-2,1042
7,vicuna-7b,1007
8,koala-13b,980
9,mpt-7b-chat,952
10,fastchat-t5-3b,941


### Predict Win Rates
Utilizing Elo ratings allows us to predict win probabilities. By comparing the predicted win rates with the actual win rates, we can gain insight into the accuracy and quality of the Elo rating system.
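
As a small illustration, the sketch below plugs two ratings from the leaderboard above (gpt-4 at 1225 and vicuna-13b at 1054) into the Elo expected-score formula. The helper name `expected_win_rate` is introduced here for illustration only; it mirrors the formula used inside `compute_elo` and ignores ties.

```python
def expected_win_rate(rating_a, rating_b, scale=400, base=10):
    """Elo-predicted probability that A beats B (ties ignored)."""
    return 1 / (1 + base ** ((rating_b - rating_a) / scale))

# Ratings taken from the leaderboard above.
p = expected_win_rate(1225, 1054)  # gpt-4 vs. vicuna-13b
print(f"{p:.2f}")  # roughly 0.73
```

A 171-point gap therefore corresponds to the higher-rated model winning a little under three out of four decisive battles.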






In [None]:
def predict_win_rate(elo_ratings, SCALE=400, BASE=10, INIT_RATING=1000):
    names = sorted(list(elo_ratings.keys()))
    wins = defaultdict(lambda: defaultdict(lambda: 0))
    for a in names:
        for b in names:
            ea = 1 / (1 + BASE ** ((elo_ratings[b] - elo_ratings[a]) / SCALE))
            wins[a][b] = ea
            wins[b][a] = 1 - ea

    data = {
        a: [wins[a][b] if a != b else np.nan for b in names]
        for a in names
    }

    df = pd.DataFrame(data, index=names)
    df.index.name = "model_a"
    df.columns.name = "model_b"
    return df.T

In [None]:
win_rate = predict_win_rate(compute_elo(battles))
ordered_models = win_rate.mean(axis=1).sort_values(ascending=False).index
fig = px.imshow(win_rate.loc[ordered_models, ordered_models],
                color_continuous_scale='RdBu', text_auto=".2f",
                title="Predicted Win Rate Using Elo Ratings for Model A in an A vs. B Battle")
fig.update_layout(xaxis_title="Model B",
                  yaxis_title="Model A",
                  xaxis_side="top", height=600, width=600,
                  title_y=0.07, title_x=0.5)
fig.update_traces(hovertemplate=
                  "Model A: %{y}<br>Model B: %{x}<br>Win Rate: %{z}<extra></extra>")
fig

### Compute Bootstrap Confidence Intervals for Elo Scores

The previous linear update method may be sensitive to the order of the battles. Here we use the bootstrap to estimate confidence intervals.
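
To see the order sensitivity concretely, the toy sketch below replays the same two battles in both orders with a stripped-down online update. The function `toy_online_elo` and the players `A` and `B` are hypothetical; it follows the same update rule as `compute_elo` but ignores ties.

```python
def toy_online_elo(battles, k=4, scale=400, base=10, init=1000):
    """Online Elo over (player_a, player_b, winner) tuples; no ties."""
    rating = {}
    for a, b, winner in battles:
        ra, rb = rating.get(a, init), rating.get(b, init)
        ea = 1 / (1 + base ** ((rb - ra) / scale))  # expected score of a
        sa = 1.0 if winner == a else 0.0            # actual score of a
        rating[a] = ra + k * (sa - ea)
        rating[b] = rb + k * ((1 - sa) - (1 - ea))
    return rating

# The same two battles (one win each), replayed in both orders.
order_1 = [("A", "B", "A"), ("A", "B", "B")]
order_2 = [("A", "B", "B"), ("A", "B", "A")]

r1 = toy_online_elo(order_1)
r2 = toy_online_elo(order_2)
print(r1["A"], r2["A"])  # A ends slightly below 1000 in order_1, slightly above in order_2
```

Whoever won most recently ends up marginally ahead, even though the set of results is identical. Resampling the battles many times averages out this effect and gives a sense of its magnitude.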


In [None]:
def get_bootstrap_result(battles, func_compute_elo, num_round):
    rows = []
    for i in tqdm(range(num_round), desc="bootstrap"):
        rows.append(func_compute_elo(battles.sample(frac=1.0, replace=True)))
    df = pd.DataFrame(rows)
    return df[df.median().sort_values(ascending=False).index]


In [None]:
BOOTSTRAP_ROUNDS = 1000

bootstrap_elo_lu = get_bootstrap_result(battles, compute_elo, BOOTSTRAP_ROUNDS)
bootstrap_lu_median = bootstrap_elo_lu.median().reset_index().set_axis(["model", "rating"], axis=1)
bootstrap_lu_median

bootstrap: 100%|██████████| 1000/1000 [01:18<00:00, 12.76it/s]


Unnamed: 0,model,rating
0,gpt-4,1251.16
1,claude-v1,1197.5
2,claude-instant-v1,1153.03
3,gpt-3.5-turbo,1146.38
4,vicuna-13b,1074.79
5,palm-2,1059.31
6,vicuna-7b,1012.25
7,koala-13b,999.31
8,mpt-7b-chat,960.05
9,RWKV-4-Raven-14B,948.49


In [None]:
def visualize_bootstrap_scores(df, title):
    bars = pd.DataFrame(dict(
        lower = df.quantile(.025),
        rating = df.quantile(.5),
        upper = df.quantile(.975))).reset_index(names="model").sort_values("rating", ascending=False)
    bars['error_y'] = bars['upper'] - bars["rating"]
    bars['error_y_minus'] = bars['rating'] - bars["lower"]
    bars['rating_rounded'] = np.round(bars['rating'], 2)
    fig = px.scatter(bars, x="model", y="rating", error_y="error_y",
                     error_y_minus="error_y_minus", text="rating_rounded",
                     title=title)
    fig.update_layout(xaxis_title="Model", yaxis_title="Rating")
    return fig

fig = visualize_bootstrap_scores(bootstrap_elo_lu, "Bootstrap of Elo Estimates")
fig

### Compute Bootstrap Confidence Intervals Assuming Uniform Sampling

We also study how the ratings will change if we only sample an equal number of battles for each model pair.
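
The idea can be sketched with the standard library alone. In the hypothetical example below, `sample_even` and the `toy` battle list are made up for illustration: battles are grouped by their ordered `(model_a, model_b)` pair, and the same number of battles is drawn from each group with replacement, so that rare pairings count as much as frequent ones.

```python
import random
from collections import Counter, defaultdict

def sample_even(battles, n_per_pair, seed=0):
    """Draw n_per_pair battles (with replacement) from each (model_a, model_b) group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for battle in battles:
        model_a, model_b, _ = battle
        groups[(model_a, model_b)].append(battle)
    resampled = []
    for group in groups.values():
        resampled.extend(rng.choices(group, k=n_per_pair))
    return resampled

# Hypothetical, heavily imbalanced toy data: 9 battles for one pair, 1 for the other.
toy = [("x", "y", "model_a")] * 9 + [("x", "z", "model_b")] * 1
even = sample_even(toy, n_per_pair=5)
pair_counts = Counter((a, b) for a, b, _ in even)
print(pair_counts)  # each pair now appears exactly 5 times
```

Sampling with replacement is what allows a pair with fewer than `n_per_pair` recorded battles to still contribute the full quota.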

In [None]:
def sample_battle_even(battles, n_per_battle):
    groups = battles.groupby(["model_a", "model_b"], as_index=False)
    resampled = (groups
                 .apply(lambda grp: grp.sample(n_per_battle, replace=True))
                 .reset_index(drop=True))
    return resampled

In [None]:
num_samples = 50
battles_even = sample_battle_even(battles, num_samples)
pd.pivot_table(battles_even, index="model_a", columns="model_b", aggfunc="size", fill_value=0)

model_a \ model_b,RWKV-4-Raven-14B,alpaca-13b,chatglm-6b,claude-instant-v1,claude-v1,dolly-v2-12b,fastchat-t5-3b,gpt-3.5-turbo,gpt-4,koala-13b,llama-13b,mpt-7b-chat,oasst-pythia-12b,palm-2,stablelm-tuned-alpha-7b,vicuna-13b,vicuna-7b
RWKV-4-Raven-14B,0,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50
alpaca-13b,50,0,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50
chatglm-6b,50,50,0,50,50,50,50,50,50,50,50,50,50,50,50,50,50
claude-instant-v1,50,50,50,0,50,50,50,50,50,50,0,50,50,50,50,50,50
claude-v1,50,50,50,50,0,50,50,50,50,50,50,50,50,50,50,50,50
dolly-v2-12b,50,50,50,50,50,0,50,50,50,50,50,50,50,50,50,50,50
fastchat-t5-3b,50,50,50,50,50,50,0,50,50,50,50,50,50,50,50,50,50
gpt-3.5-turbo,50,50,50,50,50,50,50,0,50,50,50,50,50,50,50,50,50
gpt-4,50,50,50,50,50,50,50,50,0,50,50,50,50,50,50,50,50
koala-13b,50,50,50,50,50,50,50,50,50,0,50,50,50,50,50,50,50


In [None]:
# Sampling Battles Evenly
def get_bootstrap_even_sample(battles, n_per_battle, func_compute_elo, num_round=BOOTSTRAP_ROUNDS):
    rows = []
    for n in tqdm(range(num_round), desc="sampling battles evenly"):
        resampled = sample_battle_even(battles, n_per_battle)
        rows.append(func_compute_elo(resampled))
    df = pd.DataFrame(rows)
    return df[df.median().sort_values(ascending=False).index]

In [None]:
# num_samples = int(np.min(pd.pivot_table(battles, index="model_a", columns="model_b", aggfunc="size", fill_value=1e10).values))
print("number of samples per battle pair:", num_samples)
bootstrap_even_lu = get_bootstrap_even_sample(battles, num_samples, compute_elo)

number of samples per battle pair: 50


sampling battles evenly: 100%|██████████| 1000/1000 [03:29<00:00,  4.78it/s]


In [None]:
fig = visualize_bootstrap_scores(bootstrap_even_lu, f"Bootstrap of Elo Estimates - Even sample")
fig

In [None]:
px.violin(bootstrap_even_lu.melt(), x="variable", y="value")

### Maximum Likelihood Estimation
Another way to fit Elo ratings is maximum likelihood estimation. Here, we provide an implementation based on logistic regression.
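
The connection to logistic regression can be made explicit as follows. The notation is introduced here for illustration: $R_i$ is the rating of model $i$, $\beta_i$ its regression coefficient, and $\sigma(z) = 1/(1 + e^{-z})$ the logistic function. The constants 400 and 1000 are the `SCALE` and `INIT_RATING` defaults of the function in the next cell.

$$
P(A \text{ beats } B)
= \frac{1}{1 + 10^{(R_B - R_A)/400}}
= \frac{1}{1 + e^{-\ln(10)\,(R_A - R_B)/400}}
= \sigma\bigl(\ln(10)\,(\beta_A - \beta_B)\bigr),
\qquad \beta_i = \frac{R_i - 1000}{400}
$$

Each battle thus becomes one row of the design matrix, with $+\ln(10)$ in the column of Model A, $-\ln(10)$ in the column of Model B, and zeros elsewhere, fitted without an intercept. The ratings are recovered as $R_i = 400\,\beta_i + 1000$. Only rating differences enter the likelihood, so the offset of 1000 is a convention rather than something estimated from the data.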

In [None]:
def compute_elo_mle(df, SCALE=400, BASE=10, INIT_RATING=1000):
    from sklearn.linear_model import LogisticRegression
    models = pd.concat([df["model_a"], df["model_b"]]).unique()
    models = pd.Series(np.arange(len(models)), index=models)
    p = len(models.index)
    n = df.shape[0]

    X = np.zeros([n, p])
    X[np.arange(n), models[df["model_a"]]] = +math.log(BASE)
    X[np.arange(n), models[df["model_b"]]] = -math.log(BASE)

    Y = np.zeros(n)
    Y[df["win"] == "model_a"] = 1.0

    lr = LogisticRegression(fit_intercept=False)
    lr.fit(X,Y)

    elo_scores = SCALE * lr.coef_[0] + INIT_RATING

    return pd.Series(elo_scores, index = models.index).sort_values(ascending=False)


In [None]:
compute_elo_mle(battles_no_ties)

gpt-4                     1345.83
claude-v1                 1271.86
claude-instant-v1         1249.98
gpt-3.5-turbo             1203.26
vicuna-13b                1109.58
palm-2                    1088.07
vicuna-7b                 1031.39
koala-13b                 1004.07
mpt-7b-chat                948.85
RWKV-4-Raven-14B           929.92
alpaca-13b                 901.16
oasst-pythia-12b           900.54
chatglm-6b                 845.36
fastchat-t5-3b             841.81
stablelm-tuned-alpha-7b    810.48
dolly-v2-12b               778.92
llama-13b                  738.95
dtype: float64

In [None]:
elo_mle_bootstrap = get_bootstrap_result(battles_no_ties, compute_elo_mle, 500)

bootstrap: 100%|██████████| 500/500 [01:07<00:00,  7.40it/s]


In [None]:
visualize_bootstrap_scores(elo_mle_bootstrap, "Bootstrap of MLE Elo Estimates")

# Language-specific Leaderboards
We present two language-specific leaderboards, by isolating the chat data into two subsets based on the language: (1) English-only and (2) Non-English.

## English-only

In [None]:
english_only_battles = battles[battles["language"] == "English"]
elo_ratings = compute_elo(english_only_battles)
preety_print_elo_ratings(elo_ratings)

Unnamed: 0,Model,Elo rating
1,gpt-4,1221
2,claude-v1,1186
3,claude-instant-v1,1144
4,gpt-3.5-turbo,1134
5,palm-2,1074
6,vicuna-13b,1059
7,vicuna-7b,1005
8,koala-13b,985
9,mpt-7b-chat,954
10,fastchat-t5-3b,952


## Non-English

In [None]:
non_english_battles = battles[battles["language"] != "English"]
elo_ratings = compute_elo(non_english_battles)
preety_print_elo_ratings(elo_ratings)

Unnamed: 0,Model,Elo rating
1,gpt-4,1270
2,claude-v1,1220
3,gpt-3.5-turbo,1183
4,vicuna-13b,1068
5,claude-instant-v1,1067
6,vicuna-7b,999
7,chatglm-6b,985
8,koala-13b,971
9,mpt-7b-chat,966
10,RWKV-4-Raven-14B,953


# Investigate PaLM 2 Lost Cases
Let us download a new version of the data with additional tags about PaLM 2.


In [None]:
url = "https://drive.google.com/file/d/1as81BWjao55YFWU_cr_OZl2ymOAXhZNq/view?usp=share_link"
filename = gdown.download(url, quiet=False, fuzzy=True)
raw_data = pd.read_json(filename).sort_values(ascending=True, by=["tstamp"])
battles = raw_data[raw_data['anony']].reset_index(drop=True)
battles

Downloading...
From: https://drive.google.com/uc?id=1as81BWjao55YFWU_cr_OZl2ymOAXhZNq
To: /content/clean_battle_20230522_more_tag.json
100%|██████████| 11.8M/11.8M [00:00<00:00, 54.9MB/s]


Unnamed: 0,model_a,model_b,win,anony,rounds,language,tstamp,model_a_is_palm_refusal,model_b_is_palm_refusal
0,chatglm-6b,koala-13b,model_b,True,1,English,1682351591.13,False,False
1,oasst-pythia-12b,alpaca-13b,tie,True,1,English,1682351654.67,False,False
2,koala-13b,oasst-pythia-12b,model_b,True,1,English,1682351708.94,False,False
3,vicuna-13b,oasst-pythia-12b,model_b,True,1,English,1682351785.19,False,False
4,vicuna-13b,koala-13b,model_a,True,1,English,1682351891.66,False,False
...,...,...,...,...,...,...,...,...,...
27011,oasst-pythia-12b,gpt-3.5-turbo,model_b,True,1,French,1684773194.12,False,False
27012,vicuna-7b,mpt-7b-chat,tie (bothbad),True,1,French,1684773225.74,False,False
27013,RWKV-4-Raven-14B,claude-instant-v1,model_b,True,1,French,1684773284.15,False,False
27014,claude-v1,claude-instant-v1,model_b,True,1,English,1684773310.18,False,False


## Losing Rate to Non-proprietary Chatbots

We calculate the rate at which PaLM 2 loses battles to non-proprietary models, i.e., models other than the top four: gpt-4, claude-v1, claude-instant-v1, and gpt-3.5-turbo.

Due to the lack of multilingual abilities and the strong regulations in the currently offered PaLM 2 versions, we observe that it loses more Arena battles to open-source chatbots.

In [None]:
def calculate_losing_rate(model_name):
  model_battles = battles[(battles['model_a'] == model_name) | (battles['model_b'] == model_name)]
  model_battles = model_battles[model_battles["win"] != "tie"]
  model_battles = model_battles[model_battles["win"] != "tie (bothbad)"]

  model_lost_battles = model_battles[((model_battles["model_a"] == model_name) & (model_battles["win"] == "model_b")) | ((model_battles["model_b"] == model_name) & (model_battles["win"] == "model_a"))]
  print(f"{model_name} losing rate in non-tie battles: {model_lost_battles.shape[0] / model_battles.shape[0]}")

  strong_models = ["gpt-4", "claude-v1", "claude-instant-v1", "gpt-3.5-turbo", "palm-2"]
  model_lost_battles_to_weak_chatbots = model_lost_battles[(~model_lost_battles["model_a"].isin(strong_models)) | (~model_lost_battles["model_b"].isin(strong_models))]
  lost_rate_to_weak_chatbots = model_lost_battles_to_weak_chatbots.shape[0] / model_battles.shape[0]
  print(f"{model_name} losing rate to chatbots that are not one of [gpt-4, claude-v1, claude-instant-v1, gpt-3.5-turbo] is {lost_rate_to_weak_chatbots}")

calculate_losing_rate("palm-2")

palm-2 losing rate in non-tie battles: 0.4375484871993794
palm-2 losing rate to chatbots that are not one of [gpt-4, claude-v1, claude-instant-v1, gpt-3.5-turbo] is 0.21644685802948022


We can also calculate the above statistics for GPT-3.5 as a reference.

In [None]:
calculate_losing_rate("gpt-3.5-turbo")

gpt-3.5-turbo losing rate in non-tie battles: 0.24959586162301972
gpt-3.5-turbo losing rate to chatbots that are not one of [gpt-4, claude-v1, claude-instant-v1, gpt-3.5-turbo] is 0.1280310378273521


## Losing Rate due to Refusal
Below we estimate the rate at which PaLM 2 loses Arena battles due to refusal answers.

In [None]:
model_name = "palm-2"
model_battles = battles[(battles['model_a'] == model_name) | (battles['model_b'] == model_name)]
model_battles = model_battles[model_battles["win"] != "tie"]
model_battles = model_battles[model_battles["win"] != "tie (bothbad)"]

model_lost_battles = model_battles[((model_battles["model_a"] == model_name) & (model_battles["win"] == "model_b")) | ((model_battles["model_b"] == model_name) & (model_battles["win"] == "model_a"))]
model_lost_battles_due_to_refusal = model_lost_battles[(model_lost_battles["model_a_is_palm_refusal"] == True) | (model_lost_battles["model_b_is_palm_refusal"] == True)]
model_lost_battles_due_to_refusal

n_lost_battles = model_lost_battles.shape[0]
n_lost_battles_due_to_refusal = model_lost_battles_due_to_refusal.shape[0]
print(f"Among all {n_lost_battles} battles, {model_name} loses {n_lost_battles_due_to_refusal} ({n_lost_battles_due_to_refusal / n_lost_battles * 100}%) battles due to providing a refusal answer.")

strong_models = ["gpt-4", "claude-v1", "claude-instant-v1", "gpt-3.5-turbo", "palm-2"]
model_lost_battles_to_weak_chatbots = model_lost_battles[(~model_lost_battles["model_a"].isin(strong_models)) | (~model_lost_battles["model_b"].isin(strong_models))]
model_lost_battles_to_weak_chatbots_due_to_refusal = model_lost_battles_to_weak_chatbots[(model_lost_battles_to_weak_chatbots["model_a_is_palm_refusal"] == True) | (model_lost_battles_to_weak_chatbots["model_b_is_palm_refusal"] == True)]
n_lost_battles_to_weak_chatbots = model_lost_battles_to_weak_chatbots.shape[0]
n_lost_battles_to_weak_chatbots_due_to_refusal = model_lost_battles_to_weak_chatbots_due_to_refusal.shape[0]
print(f"Among all {n_lost_battles_to_weak_chatbots} battles to chatbots that are not one of {strong_models}, {model_name} loses {n_lost_battles_to_weak_chatbots_due_to_refusal} ({n_lost_battles_to_weak_chatbots_due_to_refusal / n_lost_battles_to_weak_chatbots * 100}%) battles due to providing a refusal answer.")


Among all 564 battles, palm-2 loses 118 (20.921985815602838%) battles due to providing a refusal answer.
Among all 279 battles to chatbots that are not one of ['gpt-4', 'claude-v1', 'claude-instant-v1', 'gpt-3.5-turbo', 'palm-2'], palm-2 loses 86 (30.824372759856633%) battles due to providing a refusal answer.


## Elo Ratings if Removing Non-English and Refusal Conversations

We remove all non-English conversations and all conversations for which PaLM 2 didn't provide an answer, and calculate the Elo rating of each model on the filtered data. This rating represents a hypothetical upper bound on PaLM 2's Elo in the Arena.

In [None]:
english_battles = battles[battles["language"] == "English"]
english_and_non_refusal_battles = english_battles[english_battles["model_a_is_palm_refusal"] != True]
english_and_non_refusal_battles = english_and_non_refusal_battles[english_and_non_refusal_battles["model_b_is_palm_refusal"] != True]
elo_ratings = compute_elo(english_and_non_refusal_battles)
preety_print_elo_ratings(elo_ratings)

Unnamed: 0,Model,Elo rating
1,gpt-4,1222
2,claude-v1,1187
3,claude-instant-v1,1145
4,gpt-3.5-turbo,1136
5,palm-2,1101
6,vicuna-13b,1058
7,vicuna-7b,1006
8,koala-13b,987
9,mpt-7b-chat,952
10,fastchat-t5-3b,948


# Links



Some good resources to learn more about Elo rating systems:
- Wikipedia https://en.wikipedia.org/wiki/Elo_rating_system
- An introduction video https://www.youtube.com/watch?v=AsYfbmp0To0
- A FiveThirtyEight article https://fivethirtyeight.com/methodology/how-our-nfl-predictions-work/
