# Title: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

#### Members' Names :
Syed Ali Javed    
Sameer Ul Haq

####  Emails:
syedali.javed@torontomu.ca  
sulhaq@torontomu.ca

# Introduction:

This paper introduces a new method for evaluating chatbot models using Large Language Models (LLMs) as judges. Specifically, it explores whether GPT-4 — a model trained using human feedback — can fairly and reliably evaluate other chat assistants.

To do this, the research introduced:

- MT-Bench – a benchmark of multi-turn, open-ended questions
- Chatbot Arena – a platform where users vote between chatbot responses

The aim of the research was to replace costly human judgment with an LLM-as-a-judge approach.

#### Problem Description:

The paper argues that current benchmarks such as MMLU test how well a language model knows facts by asking closed-ended questions (for example, multiple-choice questions). Real conversations with chatbots, however, are open-ended, often have multiple valid answers, and involve multi-turn dialogue.

As a result, a chatbot can score high on these old benchmarks but still give poor or unhelpful answers in real conversations. We don’t have a good way to measure how helpful or human-aligned a chatbot really is.


#### Context of the Problem:

As LLMs are now used for writing, coding, reasoning, and chatting, their evaluation needs to measure how well they match human expectations — not just correctness.

Manual evaluation by humans is accurate, but it’s also expensive, slow, and unscalable.

LLMs like GPT-4 are trained using Reinforcement Learning from Human Feedback (RLHF), which teaches them to follow instructions in ways that humans prefer. This makes GPT-4 a strong candidate for judging chatbot responses.

But this had never been studied in detail. This paper fills that gap.

#### Limitations of Other Approaches:

The authors categorize existing LLM evaluation benchmarks into three groups:

- Core knowledge – MMLU, ARC, HumanEval. These test core knowledge and skills through short, specific answers.
- Instruction-following – Flan, Self-Instruct. These check whether a model follows task instructions, but lack diversity and realism.
- Conversational – CoQA, OpenAssistant. These simulate dialogues but lack challenge and don't scale.

All of them fail to assess open-ended dialogue, multi-turn reasoning, or human preferences such as helpfulness, clarity, and creativity.

Hence, they are inadequate for evaluating modern LLMs in real settings.

#### Solution:

The proposed solution is to use a strong LLM, GPT-4, as the judge of other chat assistants and to check its verdicts against human preferences. Two resources supply those human preferences:

- MT-Bench – a benchmark of multi-turn, open-ended questions
- Chatbot Arena – a platform where users vote between chatbot responses

The authors compare GPT-4's evaluations with human preferences and find that GPT-4 agrees with human experts over 80% of the time, a level comparable to the agreement between humans. This opens the door to scalable, automated chatbot evaluation.


# Background


| Reference | Explanation | Dataset/Input | Weakness |
| --- | --- | --- | --- |
| Hendrycks et al., 2020 [1] | Evaluates factual recall through multiple-choice questions | 57 knowledge tasks | Benchmarks like MMLU cannot effectively distinguish between base and aligned models (e.g., GPT-3). They primarily focus on closed-ended questions with short responses. |
| Reddy, Chen, & Manning, 2019 [2] | Conversational QA benchmark simulating dialogue | CoQA dataset | Conversational benchmarks like CoQA fall short in challenging the capabilities of the latest chatbots. |



# Methodology

Benchmarks Introduced:

- MT-Bench – 80 carefully designed multi-turn questions in 8 categories:

    Writing, Roleplay, Reasoning, Math, Coding, Extraction, STEM, Humanities

    Each question has two turns, which simulates a back-and-forth exchange: the benchmark evaluates not just the first answer but also how the model continues the dialogue. This helps test memory, coherence, and follow-up accuracy.


- Chatbot Arena – A live web platform where users vote between two anonymous chatbot responses. 30K conversations were collected, with 3K used for controlled analysis.
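
The relationship between the 80 questions and the 8 categories can be sketched as a simple index calculation. The sketch assumes, as the plotting code later in this notebook does, that question IDs run from 81 to 160 and that each consecutive block of ten belongs to one category; `category_of` is a hypothetical helper name used only for illustration.

```python
# Assumption: MT-Bench question IDs run from 81 to 160, ten per category,
# in the order listed above (mirrors the indexing used later in this notebook).
CATEGORIES = ["Writing", "Roleplay", "Reasoning", "Math",
              "Coding", "Extraction", "STEM", "Humanities"]


def category_of(question_id: int) -> str:
    """Map an MT-Bench question ID to its category name."""
    if not 81 <= question_id <= 160:
        raise ValueError(f"question_id {question_id} is outside 81..160")
    return CATEGORIES[(question_id - 81) // 10]


print(category_of(81))   # Writing
print(category_of(118))  # Math
print(category_of(160))  # Humanities
```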


Evaluation Setup:

The authors test 6 models: GPT-4, GPT-3.5, Claude, Vicuna, Alpaca, LLaMA.
They use two judge types:

- LLM judges (GPT-4, GPT-3.5, Claude)
- Humans (58 experts for MT-Bench, 2,114 crowd users for Arena)

Three judgment modes:

- Pairwise Comparison: GPT-4 sees the question, Answer A, and Answer B → picks the better answer or declares a tie

- Single Answer Grading: GPT-4 scores one answer on a 1–10 scale

- Reference-Guided: A reference answer is given → GPT-4 compares both answers to it.
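
In single-answer grading, the judge's free-text verdict has to be reduced to a number. The following is a minimal sketch, assuming the judge was instructed to include its rating in double brackets such as `[[7]]`; the bracket convention and the `parse_rating` helper are assumptions for illustration, not details taken from the paper. Returning -1 for an unparsable verdict is consistent with the `score >= 0` filter used in the analysis code later in this notebook.

```python
import re

# Assumption: the judge includes its rating in the form "Rating: [[7]]".
RATING_PATTERN = re.compile(r"\[\[(\d+(?:\.\d+)?)\]\]")


def parse_rating(judgment: str) -> float:
    """Return the first [[n]] rating in the judge's text, or -1.0 if none is found."""
    match = RATING_PATTERN.search(judgment)
    if match is None:
        return -1.0
    return float(match.group(1))


print(parse_rating("The answer is accurate and concise. Rating: [[8]]"))  # 8.0
print(parse_rating("The judge forgot to give a rating."))                 # -1.0
```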



Evaluation Results:

- GPT-4 agrees with expert humans 85% of the time, which is higher than human–human agreement (81%)
- GPT-4 is also more decisive: it casts fewer tie votes than Claude or GPT-3.5
- When humans initially disagreed with GPT-4, they still judged its verdict reasonable in 75% of cases
- In 34% of those cases, they were willing to change their vote after seeing GPT-4's reasoning
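
An agreement rate of this kind is the fraction of items on which two judges cast the same vote. The sketch below is a simplified illustration with made-up votes; the `agreement` helper and its option to drop ties are assumptions, not the paper's exact protocol.

```python
from typing import Sequence


def agreement(votes_a: Sequence[str], votes_b: Sequence[str],
              include_ties: bool = True) -> float:
    """Fraction of items on which two judges cast the same vote.

    Each vote is "A", "B", or "tie". With include_ties=False, items where
    either judge voted "tie" are dropped before comparing.
    """
    if len(votes_a) != len(votes_b):
        raise ValueError("both judges must vote on the same items")
    pairs = list(zip(votes_a, votes_b))
    if not include_ties:
        pairs = [(a, b) for a, b in pairs if a != "tie" and b != "tie"]
    if not pairs:
        return float("nan")
    return sum(a == b for a, b in pairs) / len(pairs)


# Made-up votes for illustration only.
gpt4_votes = ["A", "B", "tie", "A", "B"]
human_votes = ["A", "B", "A", "B", "B"]
print(agreement(gpt4_votes, human_votes))                      # 0.6
print(agreement(gpt4_votes, human_votes, include_ties=False))  # 0.75
```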


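Fixes Applied (to counter judge weaknesses such as position bias and unreliable grading of math and reasoning questions):

- Swap answer order – judge both orderings and count the result as a tie if the decision flips
- Chain-of-thought prompting – GPT-4 solves the problem itself before judging
- Reference-guided grading – use GPT-4's own independently generated answer as the reference
- Few-shot examples – add judgment examples to the prompt to improve consistency (from 65.0% to 77.5%)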
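
The answer-order swap can be sketched as a small decision rule. The code assumes each judging pass reports the winner as `"model_1"`, `"model_2"`, or `"tie"` relative to the original model order, which matches the `g1_winner` and `g2_winner` fields read later in this notebook; `resolve_pairwise` is a hypothetical helper name.

```python
def resolve_pairwise(g1_winner: str, g2_winner: str) -> str:
    """Combine two judgments of the same pair, made with the answer order swapped.

    A model wins only if it is preferred in both orderings; any disagreement
    or explicit tie is counted as a tie (the conservative rule).
    """
    if g1_winner == g2_winner and g1_winner in ("model_1", "model_2"):
        return g1_winner
    return "tie"


print(resolve_pairwise("model_1", "model_1"))  # model_1
print(resolve_pairwise("model_1", "model_2"))  # tie
print(resolve_pairwise("tie", "model_2"))      # tie
```

Declaring a win only when both orderings agree trades some decisiveness for robustness: a judge that always favors the first position ends up producing ties rather than spurious wins.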



# Implementation

This section reproduces the MT-Bench evaluation with the FastChat `llm_judge` scripts: it checks for a GPU, installs FastChat, downloads the pre-generated model answers and judgments, generates answers for `vicuna-7b-v1.5`, has GPT-4 judge them, and plots the per-category scores.

In [1]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Wed Apr 16 00:41:22 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   36C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [2]:
import torch
torch.cuda.is_available()

True

In [3]:
!unzip /FastChat.zip -d /src

Archive:  /FastChat.zip
   creating: /src/FastChat/
   creating: /src/FastChat/.github/
  inflating: /src/FastChat/.github/PULL_REQUEST_TEMPLATE.md  
   creating: /src/FastChat/.github/workflows/
  inflating: /src/FastChat/.github/workflows/python-package.yml  
  inflating: /src/FastChat/.gitignore  
  inflating: /src/FastChat/.pylintrc  
   creating: /src/FastChat/assets/
  inflating: /src/FastChat/assets/demo_narrow.gif  
  inflating: /src/FastChat/assets/qa_browser.png  
  inflating: /src/FastChat/assets/screenshot_cli.png  
  inflating: /src/FastChat/assets/screenshot_gui.png  
  inflating: /src/FastChat/assets/server_arch.png  
  inflating: /src/FastChat/assets/vicuna_logo.jpeg  
   creating: /src/FastChat/data/
  inflating: /src/FastChat/data/dummy_conversation.json  
   creating: /src/FastChat/docker/
  inflating: /src/FastChat/docker/docker-compose.yml  
  inflating: /src/FastChat/docker/Dockerfile  
   creating: /src/FastChat/docs/
  inflating: /src/FastChat/docs/arena.md  
  

In [5]:
# Go into FastChat and install
%cd /src/FastChat
!pip install -e ".[model_worker,llm_judge]"

/src/FastChat
Obtaining file:///src/FastChat
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting fastapi (from fschat==0.2.36)
  Downloading fastapi-0.115.12-py3-none-any.whl.metadata (27 kB)
Collecting markdown2[all] (from fschat==0.2.36)
  Downloading markdown2-2.5.3-py3-none-any.whl.metadata (2.1 kB)
Collecting nh3 (from fschat==0.2.36)
  Downloading nh3-0.2.21-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.0 kB)
Collecting pydantic-settings (from fschat==0.2.36)
  Downloading pydantic_settings-2.8.1-py3-none-any.whl.metadata (3.5 kB)
Collecting shortuuid (from fschat==0.2.36)
  Downloading shortuuid-1.0.13-py3-none-any.whl.metadata (5.8 kB)
Collecting tiktoken (from fschat==0.2.36)
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manyl

In [6]:
#Download and Review Pre-Generated Model Answers and Judgments on MT-bench Questions

!python3 /src/FastChat/fastchat/llm_judge/download_mt_bench_pregenerated.py

wget -q --show-progress -O data/mt_bench/model_answer/alpaca-13b.jsonl https://huggingface.co/spaces/lmsys/mt-bench/resolve/main/data/mt_bench/model_answer/alpaca-13b.jsonl
wget -q --show-progress -O data/mt_bench/model_answer/baize-v2-13b.jsonl https://huggingface.co/spaces/lmsys/mt-bench/resolve/main/data/mt_bench/model_answer/baize-v2-13b.jsonl
wget -q --show-progress -O data/mt_bench/model_answer/chatglm-6b.jsonl https://huggingface.co/spaces/lmsys/mt-bench/resolve/main/data/mt_bench/model_answer/chatglm-6b.jsonl
wget -q --show-progress -O data/mt_bench/model_answer/claude-instant-v1.jsonl https://huggingface.co/spaces/lmsys/mt-bench/resolve/main/data/mt_bench/model_answer/claude-instant-v1.jsonl
wget -q --show-progress -O data/mt_bench/model_answer/claude-v1.jsonl https://huggingface.co/spaces/lmsys/mt-bench/resolve/main/data/mt_bench/model_answer/claude-v1.jsonl
wget -q --show-progress -O data/mt_bench/model_answer/dolly-v2-12b.jsonl https://huggingface.co/spaces/lmsys/mt-bench/r

In [13]:
#Generate model answers to MT-bench questions

!python /src/FastChat/fastchat/llm_judge/gen_model_answer.py --model-path lmsys/vicuna-7b-v1.5 --model-id vicuna-7b-v1.5

2025-04-16 01:25:28.911350: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1744766728.946109   16607 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1744766728.956828   16607 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-04-16 01:25:28.991145: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Output to data/mt_bench/model_answer/vicuna-7b-v1.5.jsonl
tokenizer_config.json: 100% 749/749 [00:00<00:00, 6.59MB/s]

In [19]:
import os
os.environ["OPENAI_API_KEY"] = ""

In [20]:
!python /src/FastChat/fastchat/llm_judge/gen_judgment.py --model-list vicuna-7b-v1.5 --parallel 2

2025-04-16 02:45:13.929376: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1744771513.952535   36739 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1744771513.959121   36739 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-04-16 02:45:13.981893: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Stats:
{
    "bench_name": "mt_bench",
    "mode": "single",
    "judge": "gpt-4",
    "baseline": null,
    "model_l

In [26]:
!python /src/FastChat/fastchat/llm_judge/show_result.py --model-list vicuna-7b-v1.5 vicuna-13b-v1.3 alpaca-13b llama-13b claude-v1 gpt-3.5-turbo gpt-4

Mode: single
Input file: data/mt_bench/model_judgment/gpt-4_single.jsonl

########## First turn ##########
                        score
model           turn         
gpt-4           1     8.95625
claude-v1       1     8.15000
gpt-3.5-turbo   1     8.07500
vicuna-13b-v1.3 1     6.81250
vicuna-7b-v1.5  1     6.71875
alpaca-13b      1     4.97500
llama-13b       1     3.26250

########## Second turn ##########
                       score
model           turn        
gpt-4           2     9.0250
gpt-3.5-turbo   2     7.8125
claude-v1       2     7.6500
vicuna-13b-v1.3 2     5.9625
vicuna-7b-v1.5  2     5.5125
alpaca-13b      2     4.0875
llama-13b       2     1.9500

########## Average ##########
                    score
model                    
gpt-4            8.990625
gpt-3.5-turbo    7.943750
claude-v1        7.900000
vicuna-13b-v1.3  6.387500
vicuna-7b-v1.5   6.115625
alpaca-13b       4.531250
llama-13b        2.606250
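
For the rows printed above, the "Average" value equals the mean of the two per-turn scores. A quick check with three of those rows (values copied from this output; the dictionary layout is only for illustration):

```python
# Per-turn scores copied from the show_result.py output above.
turn_scores = {
    "gpt-4":          (8.95625, 9.0250),
    "vicuna-7b-v1.5": (6.71875, 5.5125),
    "llama-13b":      (3.26250, 1.9500),
}

for model, (first, second) in turn_scores.items():
    average = (first + second) / 2
    print(f"{model:<15} {average:.6f}")
```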


In [25]:
!python /src/FastChat/fastchat/llm_judge/show_result.py

Mode: single
Input file: data/mt_bench/model_judgment/gpt-4_single.jsonl

########## First turn ##########
                                    score
model                       turn         
gpt-4                       1     8.95625
claude-v1                   1     8.15000
gpt-3.5-turbo               1     8.07500
claude-instant-v1           1     7.80000
vicuna-33b-v1.3             1     7.45625
wizardlm-30b                1     7.13125
wizardlm-13b                1     7.11875
oasst-sft-7-llama-30b       1     7.10625
Llama-2-13b-chat            1     7.06250
tulu-30b                    1     7.01875
Llama-2-70b-chat            1     6.98750
guanaco-33b                 1     6.88125
vicuna-13b-v1.3             1     6.81250
guanaco-65b                 1     6.78125
vicuna-7b-v1.5              1     6.71875
palm-2-chat-bison-001       1     6.71250
vicuna-7b-v1.3              1     6.69375
mpt-30b-chat                1     6.67500
nous-hermes-13b             1     6.43125
Llama-2-7b-

In [27]:
!python /src/FastChat/fastchat/llm_judge/show_result.py --mode pairwise-all

Mode: pairwise-all
Input file: data/mt_bench/model_judgment/gpt-4_pair.jsonl
                              win  loss  ...  loss_rate  win_rate_adjusted
model                                    ...                              
gpt-4                         111     7  ...   0.043750           0.825000
gpt-3.5-turbo                2854   771  ...   0.160759           0.717160
claude-v1                      75    27  ...   0.168750           0.650000
vicuna-33b-v1.3                70    42  ...   0.262500           0.587500
claude-instant-v1              64    40  ...   0.250000           0.575000
wizardlm-30b                   37    63  ...   0.393750           0.418750
guanaco-65b                    38    68  ...   0.425000           0.406250
guanaco-33b                    42    72  ...   0.450000           0.406250
vicuna-13b-v1.3                33    73  ...   0.456250           0.375000
mpt-30b-chat                   29    78  ...   0.487500           0.346875
vicuna-7b-v1.3         
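
The `win_rate_adjusted` column is consistent with counting each tie as half a win. The sketch below reconstructs it from the gpt-4 and claude-v1 rows above. The number of games is inferred from `loss / loss_rate` because the tie column is elided in the printed output, so the formula should be read as an inference from these numbers rather than a statement of how `show_result.py` is implemented; `adjusted_win_rate` is a hypothetical helper.

```python
def adjusted_win_rate(win: int, loss: int, total: int) -> float:
    """Win rate in which each tie counts as half a win."""
    tie = total - win - loss
    return (win + 0.5 * tie) / total


# (win, loss, loss_rate) copied from the pairwise-all output above.
rows = {"gpt-4": (111, 7, 0.043750), "claude-v1": (75, 27, 0.168750)}

for model, (win, loss, loss_rate) in rows.items():
    total = round(loss / loss_rate)  # inferred number of games
    print(model, total, adjusted_win_rate(win, loss, total))
```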

In [29]:
# gpt-4_single.jsonl is already on disk under /src/FastChat/data/mt_bench/model_judgment/
# (wget only accepts URLs, not local paths), so only the pairwise file is downloaded here.
!wget https://huggingface.co/spaces/lmsys/mt-bench/resolve/main/data/mt_bench/model_judgment/gpt-4_pair.jsonl

--2025-04-16 03:19:41--  https://huggingface.co/spaces/lmsys/mt-bench/resolve/main/data/mt_bench/model_judgment/gpt-4_pair.jsonl
Resolving huggingface.co (huggingface.co)... 18.164.174.118, 18.164.174.17, 18.164.174.55, ...
Connecting to huggingface.co (huggingface.co)|18.164.174.118|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.hf.co/repos/12/2b/122bd8e9eccbb3acc98acf73e0ecef3c96f24dcdb5f6639074ed304eb19f9cd4/d662c0b7d1d297f0494fcb4cc09fe8f054fa22d75deb4754a483a921984bc585?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27gpt-4_pair.jsonl%3B+filename%3D%22gpt-4_pair.jsonl%22%3B&Expires=1744777181&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc0NDc3NzE4MX19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5oZi5jby9yZXBvcy8xMi8yYi8xMjJiZDhlOWVjY2JiM2FjYzk4YWNmNzNlMGVjZWYzYzk2ZjI0ZGNkYjVmNjYzOTA3NGVkMzA0ZWIxOWY5Y2Q0L2Q2

In [30]:
!pip install -U plotly kaleido

Collecting plotly
  Downloading plotly-6.0.1-py3-none-any.whl.metadata (6.7 kB)
Collecting kaleido
  Downloading kaleido-0.2.1-py2.py3-none-manylinux1_x86_64.whl.metadata (15 kB)
Downloading plotly-6.0.1-py3-none-any.whl (14.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.8/14.8 MB 102.0 MB/s eta 0:00:00
Downloading kaleido-0.2.1-py2.py3-none-manylinux1_x86_64.whl (79.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79.9/79.9 MB 9.4 MB/s eta 0:00:00
Installing collected packages: kaleido, plotly
  Attempting uninstall: plotly
    Found existing installation: plotly 5.24.1
    Uninstalling plotly-5.24.1:
      Successfully uninstalled plotly-5.24.1
Successfully installed kaleido-0.2.1 plotly-6.0.1


In [34]:
import json
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go


CATEGORIES = ["Writing", "Roleplay", "Reasoning", "Math", "Coding", "Extraction", "STEM", "Humanities"]


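def get_model_df():
    """Load GPT-4 single-answer judgments and tag each row with its category."""
    q2result = []
    # A context manager closes the file even if parsing fails.
    with open("/src/FastChat/data/mt_bench/model_judgment/gpt-4_single.jsonl", "r") as fin:
        for line in fin:
            obj = json.loads(line)
            # Question IDs 81-160 map to the 8 categories in blocks of ten.
            obj["category"] = CATEGORIES[(obj["question_id"] - 81) // 10]
            q2result.append(obj)
    return pd.DataFrame(q2result)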

def toggle(res_str):
    if res_str == "win":
        return "loss"
    elif res_str == "loss":
        return "win"
    return "tie"

def get_model_df_pair():
    """Load GPT-4 pairwise judgments and reduce each pair of verdicts to win/loss/tie."""
    q2result = []
    with open("gpt-4_pair.jsonl", "r") as fin:
        for line in fin:
            obj = json.loads(line)

            result = {}
            result["qid"] = str(obj["question_id"])
            result["turn"] = str(obj["turn"])
            # g1_winner and g2_winner are the two verdicts for the same pair
            # (assumed to come from judging with the answer order swapped);
            # model_1 wins or loses only if both verdicts agree.
            if obj["g1_winner"] == "model_1" and obj["g2_winner"] == "model_1":
                result["result"] = "win"
            elif obj["g1_winner"] == "model_2" and obj["g2_winner"] == "model_2":
                result["result"] = "loss"
            else:
                result["result"] = "tie"
            result["category"] = CATEGORIES[(obj["question_id"] - 81) // 10]
            result["model"] = obj["model_1"]
            q2result.append(result)

    return pd.DataFrame(q2result)

df = get_model_df()
df_pair = get_model_df_pair()

In [35]:
df

| | question_id | model | judge | user_prompt | judgment | score | turn | tstamp | category |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 81 | alpaca-13b | [gpt-4, single-v1] | [Instruction]\nPlease act as an impartial judg... | The assistant's response is relevant and accur... | 7.0 | 1 | 1.687222e+09 | Writing |
| 1 | 81 | alpaca-13b | [gpt-4, single-v1-multi-turn] | <\|The Start of Assistant A's Conversation with... | The assistant failed to follow the user's inst... | 1.0 | 2 | 1.687222e+09 | Writing |
| 2 | 82 | alpaca-13b | [gpt-4, single-v1] | [Instruction]\nPlease act as an impartial judg... | The assistant's response is concise, professio... | 8.0 | 1 | 1.687224e+09 | Writing |
| 3 | 82 | alpaca-13b | [gpt-4, single-v1-multi-turn] | <\|The Start of Assistant A's Conversation with... | The AI assistant's self-evaluation is accurate... | 7.0 | 2 | 1.687223e+09 | Writing |
| 4 | 83 | alpaca-13b | [gpt-4, single-v1] | [Instruction]\nPlease act as an impartial judg... | The assistant's response is very relevant and ... | 10.0 | 1 | 1.687222e+09 | Writing |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5595 | 90 | vicuna-7b-v1.5 | [gpt-4, single-v1] | [Instruction]\nPlease act as an impartial judg... | The assistant's response is excellent. It has ... | 10.0 | 1 | 1.744772e+09 | Writing |
| 5596 | 118 | vicuna-7b-v1.5 | [gpt-4, single-math-v1] | [Instruction]\nPlease act as an impartial judg... | The assistant's answer is incorrect. The assis... | 1.0 | 1 | 1.744772e+09 | Math |
| 5597 | 134 | vicuna-7b-v1.5 | [gpt-4, single-v1-multi-turn] | <\|The Start of Assistant A's Conversation with... | The assistant's response is incorrect. The ass... | 2.0 | 2 | 1.744772e+09 | Extraction |
| 5598 | 158 | vicuna-7b-v1.5 | [gpt-4, single-v1] | [Instruction]\nPlease act as an impartial judg... | The assistant's response is highly relevant, a... | 10.0 | 1 | 1.744772e+09 | Humanities |


In [36]:
df_pair

| | qid | turn | result | category | model |
| --- | --- | --- | --- | --- | --- |
| 0 | 81 | 1 | loss | Writing | alpaca-13b |
| 1 | 81 | 2 | loss | Writing | alpaca-13b |
| 2 | 82 | 1 | loss | Writing | alpaca-13b |
| 3 | 82 | 2 | loss | Writing | alpaca-13b |
| 4 | 83 | 1 | loss | Writing | alpaca-13b |
| ... | ... | ... | ... | ... | ... |
| 4795 | 158 | 2 | tie | Humanities | wizardlm-30b |
| 4796 | 159 | 1 | loss | Humanities | wizardlm-30b |
| 4797 | 159 | 2 | win | Humanities | wizardlm-30b |
| 4798 | 160 | 1 | loss | Humanities | wizardlm-30b |


In [37]:
all_models = df["model"].unique()
print(all_models)
scores_all = []
for model in all_models:
    for cat in CATEGORIES:
        # filter category/model, and score format error (<1% case)
        res = df[(df["category"]==cat) & (df["model"]==model) & (df["score"] >= 0)]
        score = res["score"].mean()

        # # pairwise result
        # res_pair = df_pair[(df_pair["category"]==cat) & (df_pair["model"]==model)]["result"].value_counts()
        # wincnt = res_pair["win"] if "win" in res_pair.index else 0
        # tiecnt = res_pair["tie"] if "tie" in res_pair.index else 0
        # winrate = wincnt/res_pair.sum()
        # winrate_adjusted = (wincnt + tiecnt)/res_pair.sum()
        # # print(winrate_adjusted)

        # scores_all.append({"model": model, "category": cat, "score": score, "winrate": winrate, "wtrate": winrate_adjusted})
        scores_all.append({"model": model, "category": cat, "score": score})

['alpaca-13b' 'baize-v2-13b' 'chatglm-6b' 'claude-instant-v1' 'claude-v1'
 'dolly-v2-12b' 'falcon-40b-instruct' 'fastchat-t5-3b' 'gpt-3.5-turbo'
 'gpt-4' 'gpt4all-13b-snoozy' 'guanaco-33b' 'guanaco-65b'
 'h2ogpt-oasst-open-llama-13b' 'koala-13b' 'llama-13b' 'mpt-30b-chat'
 'mpt-30b-instruct' 'mpt-7b-chat' 'nous-hermes-13b'
 'oasst-sft-4-pythia-12b' 'oasst-sft-7-llama-30b' 'palm-2-chat-bison-001'
 'rwkv-4-raven-14b' 'stablelm-tuned-alpha-7b' 'tulu-30b' 'vicuna-13b-v1.3'
 'vicuna-33b-v1.3' 'vicuna-7b-v1.3' 'wizardlm-13b' 'wizardlm-30b'
 'Llama-2-7b-chat' 'Llama-2-13b-chat' 'Llama-2-70b-chat' 'vicuna-7b-v1.5']


In [40]:
target_models = ["vicuna-7b-v1.5", "Llama-2-7b-chat", "Llama-2-13b-chat", "Llama-2-70b-chat", "gpt-3.5-turbo", "claude-v1", "gpt-4"]

# keep only the per-category scores of the models selected for the radar chart
scores_target = [entry for entry in scores_all if entry["model"] in target_models]

# sort by target_models
scores_target = sorted(scores_target, key=lambda x: target_models.index(x["model"]), reverse=True)

df_score = pd.DataFrame(scores_target)
df_score = df_score[df_score["model"].isin(target_models)]

rename_map = {"llama-13b": "LLaMA-13B",
              "alpaca-13b": "Alpaca-13B",
              "vicuna-33b-v1.3": "Vicuna-33B",
              "vicuna-13b-v1.3": "Vicuna-13B",
              "gpt-3.5-turbo": "GPT-3.5-turbo",
              "claude-v1": "Claude-v1",
              "gpt-4": "GPT-4"}

# map raw model IDs to display names; IDs missing from rename_map are left unchanged
df_score["model"] = df_score["model"].replace(rename_map)

fig = px.line_polar(df_score, r = 'score', theta = 'category', line_close = True, category_orders = {"category": CATEGORIES},
                    color = 'model', markers=True, color_discrete_sequence=px.colors.qualitative.Pastel)

fig.show()

# Conclusion and Future Direction

This study shows that GPT-4 can reliably judge chatbot responses, aligning with human preferences over 80% of the time. It introduces MT-Bench and Chatbot Arena as scalable evaluation tools for the LLM community.

But the paper also notes limitations:

- Focuses only on helpfulness, not honesty or safety
- Combines all aspects of helpfulness (accuracy, creativity) into one score

Future work includes:

- Adding safety and honesty criteria
- Building open-source GPT-like judges
- Expanding MT-Bench to cover more task categories

# References:

[1]:  Author names: Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt

Title of the paper: MEASURING MASSIVE MULTITASK LANGUAGE UNDERSTANDING

Conference Name and Year:  ICLR 2021


[2]:  Author names: Siva Reddy, Danqi Chen, Christopher D. Manning

Title of the paper: CoQA: A Conversational Question Answering Challenge

Conference Name and Year: Transactions of the Association for Computational Linguistics (TACL), 2019