# Introduction

In this notebook, we will employ the Elo rating system to evaluate the performance of large language models (LLMs).
The analysis is based on the pairwise battle results we collected from https://arena.lmsys.org between April 24 and May 1, 2023.
Because the data is crowdsourced, it reflects some real-world use cases of LLMs.
Below, we present the calculation procedure along with some basic analyses.





In [None]:
from collections import defaultdict
import json, math, gdown
import numpy as np
import pandas as pd
import plotly.express as px
from tqdm import tqdm
pd.options.display.float_format = '{:.2f}'.format

# Obtaining and Cleaning the Tournament Data
We are hosting the initial tournament results as a JSON file on Google Drive and use the `gdown` package to download it. The data contains only the voting results, without conversation histories, because releasing the conversations would raise privacy and toxicity concerns.

In [None]:
url = "https://drive.google.com/file/d/1DljLStkGSSwXecevGzsqD6cGxm3LERxG/view?usp=sharing"
gdown.download(url, quiet=False, fuzzy=True)

Downloading...
From: https://drive.google.com/uc?id=1DljLStkGSSwXecevGzsqD6cGxm3LERxG
To: /content/clean_battle_20230501.json
100%|██████████| 1.67M/1.67M [00:00<00:00, 110MB/s]


'clean_battle_20230501.json'

In [None]:
raw_data = pd.read_json("clean_battle_20230501.json").sort_values(ascending=True, by=["tstamp"])
raw_data

Unnamed: 0,model_a,model_b,win,anony,tstamp,language
761,alpaca-13b,dolly-v2-12b,model_b,False,1682295238.71,English
762,koala-13b,dolly-v2-12b,model_b,False,1682295376.63,English
763,koala-13b,dolly-v2-12b,model_b,False,1682295408.08,English
764,koala-13b,dolly-v2-12b,model_b,False,1682295434.13,English
765,vicuna-13b,koala-13b,model_a,False,1682295451.25,English
...,...,...,...,...,...,...
9775,vicuna-13b,koala-13b,model_a,False,1682982812.03,English
9598,chatglm-6b,stablelm-tuned-alpha-7b,model_a,True,1682982923.13,English
9776,vicuna-13b,koala-13b,model_a,True,1682982943.31,English
9777,alpaca-13b,dolly-v2-12b,model_b,True,1682983185.82,English


## Removing the Non-Anonymous Pairings
The lmsys data also includes battles from the side-by-side chat web interface, where users could see (and select) which models were paired. These battles were **not anonymous**. We have included this data for others to study, but it was not part of the tournament, so we drop these records from the remaining analysis.

In [None]:
raw_data['anony'].value_counts()

False    5152
True     4689
Name: anony, dtype: int64

In [None]:
battles = raw_data[raw_data['anony']].reset_index(drop=True)
battles

Unnamed: 0,model_a,model_b,win,anony,tstamp,language
0,chatglm-6b,koala-13b,model_b,True,1682351591.13,English
1,oasst-pythia-12b,alpaca-13b,tie,True,1682351654.67,English
2,koala-13b,oasst-pythia-12b,model_b,True,1682351708.94,English
3,vicuna-13b,oasst-pythia-12b,model_b,True,1682351785.19,English
4,vicuna-13b,koala-13b,model_a,True,1682351891.66,English
...,...,...,...,...,...,...
4684,chatglm-6b,alpaca-13b,tie (bothbad),True,1682982673.43,English
4685,chatglm-6b,stablelm-tuned-alpha-7b,model_a,True,1682982923.13,English
4686,vicuna-13b,koala-13b,model_a,True,1682982943.31,English
4687,alpaca-13b,dolly-v2-12b,model_b,True,1682983185.82,English


# Exploratory Analysis

Before computing the Elo ratings, we first conduct some basic exploratory analysis to highlight a few key properties and caveats of this data.

## Significant Number of Ties

We allowed the user to declare a tie between the pairs of models.  To collect additional data, later in the tournament we also allowed the user to declare a tie in which both models were bad.  There were a significant number of tied outcomes. 

In [None]:
fig = px.bar(battles["win"].value_counts(),
             title="Counts of Battle Outcomes", text_auto=True, height=400)
fig.update_layout(xaxis_title="Battle Outcome", yaxis_title="Count", 
                  showlegend=False)
# save for blog
fig.write_html("outcome_counts.html", full_html=False, include_plotlyjs="cdn")
fig

In [None]:
battles_no_ties = battles[~battles["win"].str.contains("tie")]

## Non-uniform Model Frequency

When we initially launched the tournament, we had prior information on the likely ranking based on our benchmarks and chose to pair models according to this ranking. We gave preference to what we believed would be strong pairings based on this ranking. However, we later switched to uniform sampling to get better overall coverage of the rankings.
Towards the end of the tournament, we also introduced a new model `fastchat-t5-3b`.
All of these factors result in a non-uniform model frequency.

In [None]:
fig = px.bar(pd.concat([battles["model_a"], battles["model_b"]]).value_counts(),
             title="Battle Count for Each Model", text_auto=True)
fig.update_layout(xaxis_title="model", yaxis_title="Battle Count", 
                  showlegend=False)
# save the plot for the blog:
fig.write_html("model_counts.html", full_html=False, include_plotlyjs="cdn")
fig

We examine the number of pairings for each combination of models.

In [None]:
def visualize_battle_count(battles, title):
    ptbl = pd.pivot_table(battles, index="model_a", columns="model_b", aggfunc="size", 
                          fill_value=0)
    battle_counts = ptbl + ptbl.T
    ordering = battle_counts.sum().sort_values(ascending=False).index
    fig = px.imshow(battle_counts.loc[ordering, ordering], 
                    title=title, text_auto=True, width=600)
    fig.update_layout(xaxis_title="Model B", 
                      yaxis_title="Model A",
                      xaxis_side="top",
                      title_y=0.07, title_x=0.5)
    fig.update_traces(hovertemplate=
                      "Model A: %{y}<br>Model B: %{x}<br>Count: %{z}<extra></extra>")
    return fig

fig = visualize_battle_count(battles, title="Battle Count of Each Combination of Models")
# save for blog
fig.write_html("battle_counts.html", full_html=True, include_plotlyjs="cdn")
fig

### Battles Excluding Ties

In [None]:
visualize_battle_count(battles_no_ties, "Battle Count for Each Combination of Models (without Ties)")

### Counting Ties

In [None]:
visualize_battle_count(battles[battles['win'].str.contains("tie")], "Tie Count for Each Combination of Models")

## Inferred Language

We also inferred the language of each conversation using the `polyglot` package. This is only an estimate, but it will help guide future analysis. The vast majority of conversations were in English.

In [None]:
topk = 15
fig = px.bar(battles["language"].value_counts().head(topk),
             title=f"Battle Counts for the Top {topk} Languages", 
             text_auto=True, height=400)
fig.update_layout(xaxis_title="Language", yaxis_title="Count", showlegend=False)
fig.write_html("language_counts.html", full_html=False, include_plotlyjs="cdn")
fig

## Pairwise Win Fractions

Finally, we compute the pairwise win fractions. Because each model can appear as either Model A or Model B and can win in either position, we sum its wins in both configurations and divide by the total number of battles between the two models.
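As a quick sanity check on the counting, the sketch below uses a hypothetical four-battle table with made-up models `x` and `y` (not taken from the dataset) and assumes ties have already been removed. It counts the wins for `x` regardless of which side it played on.

```python
import pandas as pd

# Hypothetical battles between two made-up models "x" and "y" (no ties).
toy = pd.DataFrame({
    "model_a": ["x", "x", "y", "y"],
    "model_b": ["y", "y", "x", "x"],
    "win":     ["model_a", "model_b", "model_b", "model_b"],
})

# Map each row to the name of the winning model, regardless of position:
# keep model_a where Model A won, otherwise take model_b.
winner = toy["model_a"].where(toy["win"] == "model_a", toy["model_b"])

x_wins = (winner == "x").sum()   # 1 win as Model A + 2 wins as Model B
x_fraction = x_wins / len(toy)   # divide by all x-vs-y battles
print(x_fraction)  # 0.75
```

Counting only the rows where `x` is Model A would give 1 win out of 2, which is why both configurations have to be combined.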

In [None]:
def compute_pairwise_win_fraction(battles):
    # Times each model wins as Model A
    a_win_ptbl = pd.pivot_table(
        battles[battles['win'] == "model_a"], 
        index="model_a", columns="model_b", aggfunc="size", fill_value=0)

    # Table counting times each model wins as Model B
    b_win_ptbl = pd.pivot_table(
        battles[battles['win'] == "model_b"], 
        index="model_a", columns="model_b", aggfunc="size", fill_value=0)

    # Table counting number of A-B pairs
    num_battles_ptbl = pd.pivot_table(battles, 
        index="model_a", columns="model_b", aggfunc="size", fill_value=0)

    # Computing the proportion of wins for each model as A and as B 
    # against all other models
    row_beats_col_freq = (
        (a_win_ptbl + b_win_ptbl.T) / 
        (num_battles_ptbl + num_battles_ptbl.T)
    )

    # Arrange ordering according to proportion of wins
    prop_wins = row_beats_col_freq.mean(axis=1).sort_values(ascending=False)
    model_names = list(prop_wins.keys())
    row_beats_col = row_beats_col_freq.loc[model_names, model_names]
    return row_beats_col
  
def visualize_pairwise_win_fraction(battles, title):
    row_beats_col = compute_pairwise_win_fraction(battles)
    fig = px.imshow(row_beats_col, color_continuous_scale='RdBu',
                    text_auto=".2f", title=title)
    fig.update_layout(xaxis_title="Model B", 
                  yaxis_title="Model A",
                  xaxis_side="top",
                  title_y=0.07, title_x=0.5)
    fig.update_traces(hovertemplate=
                  "Model A: %{y}<br>Model B: %{x}<br>Fraction of A Wins: %{z}<extra></extra>")

    return fig   

In [None]:
fig = visualize_pairwise_win_fraction(battles_no_ties,
      title = "Fraction of Model A Wins for All Non-tied A vs. B Battles")
fig.write_html("row_beats_column_probs.html", full_html=False, include_plotlyjs="cdn")
fig

## Preliminary Ranking

Using just the average win rate against all other models we can already compute an estimated leaderboard.
However, this method may not be as scalable as the Elo rating system that we will use later because this method requires data from all model combinations.
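The dependence on full coverage can be illustrated with a small sketch. The win fractions below are made up, and the models `x`, `y`, and `z` are hypothetical. A pair that never met is represented as `NaN`, which is what a 0/0 division in the win-fraction table would produce.

```python
import numpy as np
import pandas as pd

# Hypothetical win fractions (row beats column); "x" and "z" never met,
# so their cells are NaN, as are the diagonal entries.
frac = pd.DataFrame(
    [[np.nan, 0.6, np.nan],
     [0.4, np.nan, 0.9],
     [np.nan, 0.1, np.nan]],
    index=["x", "y", "z"], columns=["x", "y", "z"])

# DataFrame.mean skips NaN by default, so each row is averaged
# over a different set of opponents.
avg = frac.mean(axis=1)
opponents = frac.notna().sum(axis=1)
print(pd.DataFrame({"avg_win_rate": avg, "n_opponents": opponents}))
```

Here the average for `x` rests on a single opponent while the average for `y` rests on two, so the two numbers are not directly comparable.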

In [None]:
row_beats_col_freq = compute_pairwise_win_fraction(battles_no_ties)
fig = px.bar(row_beats_col_freq.mean(axis=1).sort_values(ascending=False),
             title="Average Win Rate Against All Other Models (Assuming Uniform Sampling and No Ties)", 
             text_auto=".2f")
fig.update_layout(yaxis_title="Average Win Rate", xaxis_title="Model",
                  showlegend=False)
fig.write_html("average_win_rate.html", full_html=False, include_plotlyjs="cdn")
fig

# Elo Ratings

The [Elo rating system](https://en.wikipedia.org/wiki/Elo_rating_system) is a method for calculating the relative skill levels of players, which has been widely adopted in chess, sports, and MOBA games. The difference in the ratings between two players serves as a predictor of the outcome of a match. The Elo rating system works well for our case because we have multiple models and we run pairwise battles between them.
Next, we compute the Elo ratings of these models.
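To make the update rule concrete, here is a minimal sketch of a single Elo update, using the same defaults as this notebook's `compute_elo` (K=32, scale 400, base 10). The helper names `expected_score` and `update` are illustrative only and are not part of the notebook.

```python
def expected_score(ra, rb, scale=400, base=10):
    """Predicted probability that a player rated `ra` beats one rated `rb`."""
    return 1 / (1 + base ** ((rb - ra) / scale))


def update(ra, rb, score_a, k=32):
    """Return both ratings after one game; score_a is 1, 0.5, or 0."""
    ea = expected_score(ra, rb)
    delta = k * (score_a - ea)  # what A gains, B loses
    return ra + delta, rb - delta


# Two equally rated models: each is expected to win half the time.
print(expected_score(1000, 1000))  # 0.5

# A 400-point gap corresponds to 10:1 odds for the stronger model.
print(round(expected_score(1400, 1000), 3))  # 0.909

# A win between equally rated models transfers K/2 = 16 points.
print(update(1000, 1000, score_a=1))  # (1016.0, 984.0)
```

Because the two expected scores sum to 1, the points gained by one model equal the points lost by the other, so the total rating is conserved.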

### Compute Ratings

In [None]:
def compute_elo(battles, K=32, SCALE=400, BASE=10, INIT_RATING=1000):
    rating = defaultdict(lambda: INIT_RATING)
 
    for rd, model_a, model_b, win in battles[['model_a', 'model_b', 'win']].itertuples():
        ra = rating[model_a]
        rb = rating[model_b]
        ea = 1 / (1 + BASE ** ((rb - ra) / SCALE))
        eb = 1 / (1 + BASE ** ((ra - rb) / SCALE))
        if win == "model_a":
            sa = 1
        elif win == "model_b":
            sa = 0
        elif win == "tie" or win == "tie (bothbad)":
            sa = 0.5
        else:
            raise Exception(f"unexpected vote {win}")
        rating[model_a] += K * (sa - ea)
        rating[model_b] += K * (1 - sa - eb)
    
    return rating

In [None]:
elo_ratings = compute_elo(battles)
df = pd.DataFrame([
    [n, elo_ratings[n]] for n in elo_ratings.keys()
], columns=["Model", "Elo rating"]).sort_values("Elo rating", ascending=False).reset_index(drop=True)
df["Elo rating"] = df["Elo rating"].astype(int)
df

Unnamed: 0,Model,Elo rating
0,vicuna-13b,1169
1,koala-13b,1082
2,oasst-pythia-12b,1065
3,alpaca-13b,1008
4,chatglm-6b,985
5,fastchat-t5-3b,951
6,dolly-v2-12b,944
7,llama-13b,932
8,stablelm-tuned-alpha-7b,858


In [None]:
def predict_win_rate(elo_ratings, K=32, SCALE=400, BASE=10, INIT_RATING=1000):
    names = sorted(list(elo_ratings.keys()))
    wins = defaultdict(lambda: defaultdict(lambda: 0))
    for a in names:
        for b in names:
            ea = 1 / (1 + BASE ** ((elo_ratings[b] - elo_ratings[a]) / SCALE))
            wins[a][b] = ea
            wins[b][a] = 1 - ea
      
    data = {
        a: [wins[a][b] if a != b else np.nan for b in names]
        for a in names
    }

    df = pd.DataFrame(data, index=names)
    df.index.name = "model_a"
    df.columns.name = "model_b"
    return df.T

In [None]:
win_rate = predict_win_rate(compute_elo(battles))
ordered_models = win_rate.mean(axis=1).sort_values(ascending=False).index
fig = px.imshow(win_rate.loc[ordered_models, ordered_models], 
                color_continuous_scale='RdBu', text_auto=".2f",
                title="Predicted Win Rate Using Elo Ratings for Model A in an A vs. B Battle")
fig.update_layout(xaxis_title="Model B", 
                  yaxis_title="Model A",
                  xaxis_side="top",
                  title_y=0.07, title_x=0.5)
fig.update_traces(hovertemplate=
                  "Model A: %{y}<br>Model B: %{x}<br>Win Rate: %{z}<extra></extra>")
fig.write_html("elo_predicted_win_rate.html", full_html=False, include_plotlyjs="cdn")
fig

### Compute Bootstrap Confidence Intervals for Elo Scores

The previous linear update method may be sensitive to the order of the battles. Here we use bootstrapping to estimate confidence intervals: each round resamples the battles with replacement, which also shuffles their order.
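The order sensitivity can be seen on a tiny example. The sketch below is a simplified stand-in for the sequential update (the function `toy_elo` and the three games are hypothetical, not part of the notebook) that replays the same results in two different orders.

```python
from collections import defaultdict


def toy_elo(results, k=32, scale=400, base=10, init=1000):
    """Sequential Elo over (model_a, model_b, score_a) tuples."""
    rating = defaultdict(lambda: init)
    for a, b, sa in results:
        ea = 1 / (1 + base ** ((rating[b] - rating[a]) / scale))
        delta = k * (sa - ea)
        rating[a] += delta
        rating[b] -= delta
    return dict(rating)


# Hypothetical results: model "x" beats "y" twice, then loses once.
games = [("x", "y", 1), ("x", "y", 1), ("x", "y", 0)]

forward = toy_elo(games)
backward = toy_elo(list(reversed(games)))

# Same set of games, different order, different final ratings.
print(forward["x"], backward["x"])
```

The most recent games carry more weight, so `x` ends higher when its wins come last. Aggregating over many reshuffled resamples dampens this effect.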


In [None]:
def get_bootstrap_result(battles, func_compute_elo, num_round):
    rows = []
    for i in tqdm(range(num_round), desc="bootstrap"):
        rows.append(func_compute_elo(battles.sample(frac=1.0, replace=True)))
    df = pd.DataFrame(rows)
    return df[df.median().sort_values(ascending=False).index]


In [None]:
BOOTSTRAP_ROUNDS = 1000

bootstrap_elo_lu = get_bootstrap_result(battles, compute_elo, BOOTSTRAP_ROUNDS)
bootstrap_lu_median = bootstrap_elo_lu.median().reset_index().set_axis(["model", "rating"], axis=1)
bootstrap_lu_median

bootstrap: 100%|██████████| 1000/1000 [00:26<00:00, 37.61it/s]


Unnamed: 0,model,rating
0,vicuna-13b,1172.9
1,koala-13b,1091.47
2,alpaca-13b,1018.72
3,oasst-pythia-12b,1008.28
4,chatglm-6b,975.62
5,fastchat-t5-3b,956.63
6,stablelm-tuned-alpha-7b,946.23
7,dolly-v2-12b,933.25
8,llama-13b,896.0


In [None]:
def visualize_bootstrap_scores(df, title):
    bars = pd.DataFrame(dict(
        lower = df.quantile(.025),
        rating = df.quantile(.5),
        upper = df.quantile(.975))).reset_index(names="model").sort_values("rating", ascending=False)
    bars['error_y'] = bars['upper'] - bars["rating"]
    bars['error_y_minus'] = bars['rating'] - bars["lower"]
    bars['rating_rounded'] = np.round(bars['rating'], 2)
    fig = px.scatter(bars, x="model", y="rating", error_y="error_y", 
                      error_y_minus="error_y_minus", text="rating_rounded", 
                      title=title)
    fig.update_layout(xaxis_title="Model", yaxis_title="Rating")
    return fig

fig = visualize_bootstrap_scores(bootstrap_elo_lu, "Bootstrap of Elo Estimates")
fig.write_html("bootstrap_elo_estimates.html", full_html=False, include_plotlyjs="cdn")
fig

### Compute Bootstrap Confidence Intervals Assuming Uniform Sampling

In [None]:
def sample_battle_even(battles, n_per_battle):
    groups = battles.groupby(["model_a", "model_b"], as_index=False)
    resampled = (groups
                 .apply(lambda grp: grp.sample(n_per_battle, replace=True))
                 .reset_index(drop=True))
    return resampled

In [None]:
num_samples = 50
battles_even = sample_battle_even(battles, num_samples)
pd.pivot_table(battles_even, index="model_a", columns="model_b", aggfunc="size", fill_value=0)

model_a / model_b,alpaca-13b,chatglm-6b,dolly-v2-12b,fastchat-t5-3b,koala-13b,llama-13b,oasst-pythia-12b,stablelm-tuned-alpha-7b,vicuna-13b
alpaca-13b,0,50,50,50,50,50,50,50,50
chatglm-6b,50,0,50,50,50,50,50,50,50
dolly-v2-12b,50,50,0,50,50,50,50,50,50
fastchat-t5-3b,50,50,50,0,50,50,50,50,50
koala-13b,50,50,50,50,0,50,50,50,50
llama-13b,50,50,50,50,50,0,50,50,50
oasst-pythia-12b,50,50,50,50,50,50,0,50,50
stablelm-tuned-alpha-7b,50,50,50,50,50,50,50,0,50
vicuna-13b,50,50,50,50,50,50,50,50,0


In [None]:
# Sampling Battles Evenly
def get_bootstrap_even_sample(battles, n_per_battle, func_compute_elo, num_round=BOOTSTRAP_ROUNDS):
    rows = []
    for n in tqdm(range(num_round), desc="sampling battles evenly"):
        resampled = sample_battle_even(battles, n_per_battle)
        rows.append(func_compute_elo(resampled))
    df = pd.DataFrame(rows)
    return df[df.median().sort_values(ascending=False).index]

In [None]:
# num_samples = int(np.min(pd.pivot_table(battles, index="model_a", columns="model_b", aggfunc="size", fill_value=1e10).values))
num_samples = 50
print("number of samples per battle pair:", num_samples)
bootstrap_even_lu = get_bootstrap_even_sample(battles, num_samples, compute_elo)

number of samples per battle pair: 50


sampling battles evenly: 100%|██████████| 1000/1000 [00:48<00:00, 20.42it/s]


In [None]:
fig = visualize_bootstrap_scores(bootstrap_even_lu, "Bootstrap of Elo Estimates - Even sample")
fig.write_html("bootstrap_elo_estimates_uniform_sampling.html", full_html=False, include_plotlyjs="cdn")
fig

In [None]:
px.violin(bootstrap_even_lu.melt(), x="variable", y="value")

# Links



Some good resources to learn more about Elo rating systems:
- Wikipedia https://en.wikipedia.org/wiki/Elo_rating_system
- An introduction video https://www.youtube.com/watch?v=AsYfbmp0To0
- A FiveThirtyEight article https://fivethirtyeight.com/methodology/how-our-nfl-predictions-work/
