<a href="https://colab.research.google.com/github/vgitclt/ECE57000/blob/main/JudgeLLM_Final_GitHubSafe.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook Author: Student ECE57000

# **LLM-as-a-judge** is a common technique to evaluate LLM-powered products.
It grew in popularity for a reason: it is a practical alternative to costly human evaluation when assessing open-ended text outputs. We can see it as an automated testing tool for an LLM's effectiveness.
Judging generated texts is tricky, whether it's a “simple” summary or a chatbot conversation. Metrics like accuracy don’t work well because there are many ways to be “right” without exactly matching the example answer. And things like style or tone are subjective and hard to pin down.
Humans can handle these nuances, but manually reviewing every response doesn’t scale. LLM-as-a-judge emerged as an alternative: you can use LLMs to evaluate the generated texts. Interestingly, the LLM is both the source of the problem and the solution!
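
As a rough illustration of the idea, the sketch below builds a pairwise judging prompt and normalizes a judge's one-letter reply to the `model_a` / `model_b` / `tie` labels used by the human judgments later in this notebook. The prompt wording and the helper names `build_pairwise_prompt` and `parse_verdict` are assumptions made for this example, not part of any library.

```python
def build_pairwise_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Assemble a pairwise-comparison prompt for a judge LLM (hypothetical wording)."""
    return (
        "You are an impartial judge. Compare the two answers to the question "
        "and reply with a single letter: A, B, or T for a tie.\n\n"
        f"Question: {question}\n\n"
        f"Answer A: {answer_a}\n\n"
        f"Answer B: {answer_b}\n\n"
        "Verdict:"
    )


def parse_verdict(reply: str) -> str:
    """Map a short judge reply to 'model_a', 'model_b', 'tie', or 'unknown'."""
    token = reply.strip().lower().rstrip(".")
    mapping = {
        "a": "model_a",
        "b": "model_b",
        "t": "tie",
        "tie": "tie",
        "model_a": "model_a",
        "model_b": "model_b",
    }
    # Anything that is not a clean one-token verdict is reported as unknown
    # rather than guessed.
    return mapping.get(token, "unknown")


prompt = build_pairwise_prompt("What is 2 + 2?", "4", "5")
verdict = parse_verdict(" A. ")
print(verdict)
```

Normalizing the reply this way makes a judge output such as `b` directly comparable with a human label such as `model_b`.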

In [None]:
!pip install llama-index
!pip install llama-index-llms-huggingface
!pip install llama-index-embeddings-huggingface
!pip install llama-index-embeddings-huggingface-api


## 1. Set up the Hugging Face LLMs we want to run inference on, since we have no local resources or setup
1. Llama 3
2. Mistral
3. DeepSeek

**Note: `HF_TOKEN` must be set in your Colab secrets, and access to the models must be granted. Paid Colab compute for an A100 GPU is recommended, as the free T4 GPUs have limitations and timeouts.**



In [None]:
# Read the Hugging Face token from Colab secrets and load the tokenizers
from huggingface_hub import login
from google.colab import userdata
hf_token = userdata.get('HF_TOKEN')
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    token=hf_token,
)

stopping_ids = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

tokenizer2 = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    token=hf_token,
)

# "<|eot_id|>" is a Llama 3 special token and is not expected in the Mistral
# vocabulary, so only the EOS token is used as a stop ID here.
stopping_ids2 = [tokenizer2.eos_token_id]

tokenizer3 = AutoTokenizer.from_pretrained(
    "deepseek-ai/deepseek-llm-7b-chat",
    token=hf_token,
)

# "<|eot_id|>" is a Llama 3 special token and is not expected in the DeepSeek
# vocabulary, so only the EOS token is used as a stop ID here.
stopping_ids3 = [tokenizer3.eos_token_id]

import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

## 2. Instantiate the LLMs with `HuggingFaceLLM`, passing the keyword arguments and tokenizers

In [None]:
# generate_kwargs parameters are taken from https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

import torch
from llama_index.llms.huggingface import HuggingFaceLLM

# Optional quantization to 4bit
# import torch
# from transformers import BitsAndBytesConfig

# quantization_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_compute_dtype=torch.float16,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_use_double_quant=True,
# )

#Instantiate LLama3 LLM
llm = HuggingFaceLLM(
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
    model_kwargs={
        "token": hf_token,
        "torch_dtype": torch.bfloat16,  # comment this line and uncomment below to use 4bit
        # "quantization_config": quantization_config
    },
    generate_kwargs={
        "do_sample": True,
        "temperature": 0.4,
        "top_p": 0.9,
    },
    tokenizer_name="meta-llama/Meta-Llama-3-8B-Instruct",
    tokenizer_kwargs={"token": hf_token},
    stopping_ids=stopping_ids,
)

#Instantiate Mistral LLM
llm2 = HuggingFaceLLM(
    model_name="mistralai/Mistral-7B-Instruct-v0.3",
    model_kwargs={
        "token": hf_token,
        "torch_dtype": torch.bfloat16,  # comment this line and uncomment below to use 4bit
        # "quantization_config": quantization_config
    },
    generate_kwargs={
        "do_sample": True,
        "temperature": 0.4,
        "top_p": 0.9,
    },
    tokenizer_name="mistralai/Mistral-7B-Instruct-v0.3",
    tokenizer_kwargs={"token": hf_token},
    stopping_ids=stopping_ids2,
)

#Instantiate Deepseek LLM
llm3 = HuggingFaceLLM(
    model_name="deepseek-ai/deepseek-llm-7b-chat",
    model_kwargs={
        "token": hf_token,
        "torch_dtype": torch.bfloat16,  # comment this line and uncomment below to use 4bit
        # "quantization_config": quantization_config
    },
    generate_kwargs={
        "do_sample": True,
        "temperature": 0.4,
        "top_p": 0.9,
    },
    tokenizer_name="deepseek-ai/deepseek-llm-7b-chat",
    tokenizer_kwargs={"token": hf_token},
    stopping_ids=stopping_ids3,  # use the DeepSeek stop IDs, not Mistral's
)




## 3. Import the LlamaIndex ReAct agent framework.

LlamaIndex is an open-source data orchestration framework that simplifies building large language model (LLM) applications. It provides tools for data ingestion, indexing, and retrieval, enabling context-rich AI applications through a Retrieval-Augmented Generation (RAG) pipeline. It is designed to make it easier to connect diverse data sources to LLMs, so that applications can access and leverage external knowledge.

The **LlamaIndex ReAct** agent is an agent-based chat mode that uses a reasoning-and-acting loop to answer questions, leveraging tools and external knowledge sources to achieve more precise answers.
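
To make the reasoning-and-acting loop concrete, here is a minimal, library-free sketch. It does not use LlamaIndex: `scripted_llm` is a hypothetical stand-in that returns canned replies, and `react_loop` is an illustrative helper, so this shows only the general Thought / Action / Observation cycle, not how `ReActAgent` is implemented.

```python
import json
import re
from typing import Callable, Dict


def add(a: int, b: int) -> int:
    return a + b


def multiply(a: int, b: int) -> int:
    return a * b


TOOLS: Dict[str, Callable[..., int]] = {"add": add, "multiply": multiply}


def scripted_llm(transcript: str) -> str:
    """Stand-in for a real LLM: picks a canned reply from the observation count."""
    n_obs = transcript.count("Observation:")
    if n_obs == 0:
        return 'Thought: I need to add first.\nAction: add\nAction Input: {"a": 121, "b": 2}'
    if n_obs == 1:
        return 'Thought: Now multiply the sum by 5.\nAction: multiply\nAction Input: {"a": 123, "b": 5}'
    return "Thought: I can answer without using any more tools.\nAnswer: 615"


def react_loop(question: str, llm: Callable[[str], str], max_iterations: int = 5) -> str:
    """Alternate between model replies and tool calls until an Answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_iterations):
        reply = llm(transcript)
        transcript += reply + "\n"
        answer = re.search(r"Answer:\s*(.*)", reply)
        if answer:
            return answer.group(1).strip()
        action = re.search(r"Action:\s*(\w+)", reply)
        action_input = re.search(r"Action Input:\s*(\{.*\})", reply)
        if not action or not action_input or action.group(1) not in TOOLS:
            return "Sorry, I cannot answer your query."
        # Run the requested tool and feed the result back as an observation.
        result = TOOLS[action.group(1)](**json.loads(action_input.group(1)))
        transcript += f"Observation: {result}\n"
    return "Sorry, I cannot answer your query."


final_answer = react_loop("What is (121 + 2) * 5?", scripted_llm)
print(final_answer)
```

The `max_iterations` cap plays the same role as in an agent framework: it bounds how many reason/act rounds are attempted before giving up.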

In [None]:
import json
from typing import Sequence, List
from llama_index.core.llms import ChatMessage
from llama_index.core.tools import BaseTool, FunctionTool
from llama_index.core.agent import ReActAgent

import nest_asyncio

nest_asyncio.apply()

## 4. Define math agent tools for the LLM to use when questions involve math

In [None]:
# Agent tools for math questions, used when the judge LLM needs to evaluate math answers.

def multiply(a: int, b: int) -> int:
    """Multiply two integers and returns the result integer"""
    return a * b


def add(a: int, b: int) -> int:
    """Add two integers and returns the result integer"""
    return a + b


def subtract(a: int, b: int) -> int:
    """Subtract two integers and returns the result integer"""
    return a - b


def divide(a: int, b: int) -> float:
    """Divides two integers and returns the result as a float"""
    # True division returns a float, so the return annotation is float.
    return a / b

multiply_tool = FunctionTool.from_defaults(fn=multiply)
add_tool = FunctionTool.from_defaults(fn=add)
subtract_tool = FunctionTool.from_defaults(fn=subtract)
divide_tool = FunctionTool.from_defaults(fn=divide)

In [None]:
#React agent using math tools sample
agent = ReActAgent.from_tools(
    [multiply_tool, add_tool, subtract_tool, divide_tool],
    llm=llm,
    verbose=True,
)

In [None]:
#response = agent.chat("What is (121 + 2) * 5?")
#print(str(response))

In [None]:
#response = agent.chat("What is (100/5)*2-5+10 ?")
#print(str(response))
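
When the agent queries above are enabled, it helps to know what the tools should return. The sketch below redefines the four plain functions (without the `FunctionTool` wrappers) and composes them by hand for the two commented-out questions; it is a sanity check on the arithmetic only, not a run of the agent.

```python
def multiply(a, b):
    return a * b


def add(a, b):
    return a + b


def subtract(a, b):
    return a - b


def divide(a, b):
    return a / b  # true division, so the result is a float


# (121 + 2) * 5
expected_first = multiply(add(121, 2), 5)

# (100 / 5) * 2 - 5 + 10, evaluated left to right with normal precedence
expected_second = add(subtract(multiply(divide(100, 5), 2), 5), 10)

print(expected_first)   # 615
print(expected_second)  # 45.0
```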

## 5. Download the MT-Bench data from Hugging Face, where human judges have provided expert verdicts on LLMs' answers to questions in multiple categories.

In [None]:
# import packages for hugging face datasets
!pip install datasets

import argparse
import json
import os
import pandas as pd
import numpy as np
from datasets import load_dataset

#HuggingFace data set downloaded into JSON format
dataset = load_dataset("lmsys/mt_bench_human_judgments")
dataset["human"].to_json("human_judgments.json")
dataset["gpt4_pair"].to_json("gpt4_pair_judgments.json")

df = pd.DataFrame(dataset["human"])
print(dataset["human"])
df.head(5)
#question = df['conversation_a']
#print(question)
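
MT-Bench question IDs run from 81 to 160, with ten questions per category, so a category label can be derived from the ID alone. A minimal sketch of that lookup follows; the helper name `category_for` is hypothetical, and the list order matches the `CATEGORIES` list used in this notebook.

```python
CATEGORIES = ["Writing", "Roleplay", "Reasoning", "Math",
              "Coding", "Extraction", "STEM", "Humanities"]


def category_for(question_id: int) -> str:
    """Return the MT-Bench category for a question ID in the range 81..160."""
    if not 81 <= question_id <= 160:
        raise ValueError(f"question_id {question_id} is outside 81..160")
    # IDs 81-90 map to index 0, 91-100 to index 1, and so on.
    return CATEGORIES[(question_id - 81) // 10]


for qid in (81, 90, 91, 111, 160):
    print(qid, category_for(qid))
```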

## 6. Read data from MT-Bench and use the ReAct agent to check whether the LLM judge agrees with the human expert

After the judging step, Mistral and DeepSeek act as student LLMs: each answers the question posed, and we compare how close each student's answer is to the winning answer using text similarity. This is the main code, but because of GPU unpredictability we cannot run it for all 3355 rows of data, so we showcase 1 row to demonstrate the concept.

***On a paid A100 GPU in Google Colab, a single LLM inference takes anywhere from 30 seconds to 15 minutes, due to usage limits and resource demand caps.***
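
The student comparison relies on cosine similarity between a student's answer and the winning answer. The notebook uses TF-IDF vectors from scikit-learn for this; the sketch below deliberately swaps in plain term-frequency vectors (no IDF weighting) so that it runs with the standard library only. The function names and sample texts are illustrative assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity_tf(text_a: str, text_b: str) -> float:
    """Cosine similarity between raw term-frequency vectors of two texts."""
    vec_a, vec_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


reference = "The quick brown fox jumps over the lazy dog."
student_1 = "A fast brown fox leaps over a sleepy dog."
student_2 = "Stock markets fell sharply today."

score_1 = cosine_similarity_tf(student_1, reference)
score_2 = cosine_similarity_tf(student_2, reference)

# Compare the numeric scores, not formatted strings.
student_winner = "student_1" if score_1 > score_2 else "student_2"
print(f"{score_1:.4f} {score_2:.4f} {student_winner}")
```

Keeping the scores as floats until the final comparison avoids accidentally comparing formatted strings.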

In [None]:
import json
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from llama_index.core import PromptTemplate

react_system_header_str = """\

You are designed to help with a variety of tasks, from answering questions \
    to providing summaries to other types of questions, acting as an LLM judge.

Keep the response to 250 words.

## Tools
You have access to a search tool. You are responsible for using
that tool in any sequence you deem appropriate to complete the task at hand.
This may require breaking the task into subtasks and using different searches
to complete each subtask.

You have access to the following tools:
{tool}

## Output Format
To answer the question, please use the following format.

```
Thought: I need to use a tool to help me answer the question.
Action: tool name (one of {tool_names}) if using a tool.
Action Input: the input to the tool, in a JSON format representing the kwargs (e.g. {{"input": "hello world", "num_beams": 5}})
```

Please ALWAYS start with a Thought.

Please use a valid JSON format for the Action Input. Do NOT do this {{'input': 'hello world', 'num_beams': 5}}.

If this format is used, the user will respond in the following format:

```
Observation: tool response
```

You should keep repeating the above format until you have enough information
to answer the question without using any more tools. At that point, you MUST respond
in the one of the following two formats:

```
Thought: I can answer without using any more tools.
Answer: [your answer here]
```

```
Thought: I cannot answer the question with the provided tools.
Answer: Sorry, I cannot answer your query.
```

## Additional Rules
- The answer MUST contain a sequence of bullet points that explain how you arrived at the answer. This can include aspects of the previous conversation history.
- You MUST obey the function signature of each tool. Do NOT pass in no arguments if the function expects arguments.

## Current Conversation
Below is the current conversation consisting of interleaving human and assistant messages.

"""
react_system_prompt = PromptTemplate(react_system_header_str)
#print(agent.get_prompts())


def compare(a: str, b: str) -> str:
    """compares two strings and returns the result string for the LLM"""
    return "which is better answer a or b? answer in 1 letter"

def answer(a: str) -> str:
    """Returns the instruction string asking the LLM to answer a topic that will be judged"""
    return "You are an LLM that is asked to answer a topic that will be judged, please give your best answer in 500 words or less"

CATEGORIES = ["Writing", "Roleplay", "Reasoning", "Math", "Coding", "Extraction", "STEM", "Humanities"]

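def get_model_df():
    """Load the human judgments (JSON lines) and attach a category to each row."""
    q2result = []
    # Use a context manager so the file is closed after reading.
    with open("human_judgments.json", "r") as fin:
        for line in fin:
            obj = json.loads(line)
            # Question IDs start at 81, with ten questions per category.
            obj["category"] = CATEGORIES[(obj["question_id"] - 81) // 10]
            q2result.append(obj)
    return pd.DataFrame(q2result)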

df = get_model_df()

# Looping through all 3355 rows for LLM inference is VERY challenging, so only preview the first rows.
for index, row in df.iterrows():
    print(f"Index: {index}, Row: {row.to_dict()}")
    if index == 10:
        break

num_rows = df.shape[0]

print(f"\nNumber of rows in original human evaluated data set: {num_rows}")

# So we use only the first row to demonstrate the idea instead of looping through every row.
conv_a = df['conversation_a'].iloc[0]
conv_b = df['conversation_b'].iloc[0]

# Each conversation is a list of messages: index 0 is the user question,
# index 1 is the model's first-turn answer.
ResponseA_Q = conv_a[0]['content']
a_qa = conv_a[1]  # first-turn answer from model A
b_qa = conv_b[1]  # first-turn answer from model B
ResponseA = "Answer A:" + a_qa['content']
ResponseB = "Answer B:" + b_qa['content']

chatIn = ResponseA + "\n" + ResponseB + "\n"
chatIn2 = ResponseA_Q + "\n" + "summarize in 700 words"
#print(chatIn)
compare_tool = FunctionTool.from_defaults(fn=compare)

# LLM will judge the content of 2 answers and predict the best answer that we will compare to human judgements.
#chat_mode="react"
agent = ReActAgent.from_tools([compare_tool],
                 llm=llm,
                 verbose=True,max_iterations=1)
response = agent.chat(chatIn)
agent_res = str(response)
print("\nStart Llama3 with ReAct agent response-----------------------------------")
print(agent_res.lower())
print("---------------------------------------------------------------------End\n")

# Now use 2 other LLMs to answer the topic; these LLMs act as students.

res1 = llm2.complete(chatIn2)
print("Start Mistral LLM response-----------------------------------")
print(res1)
print("----------------------------------------------------------end\n")
res2 = llm3.complete(chatIn2)
print("Start DeepSeek LLM response-----------------------------------")
print(res2)
print("----------------------------------------------------------end\n")
winner = df['winner'].iloc[0]
print(winner)
qid = df['question_id'].iloc[0]
print(qid)
cat = df['category'].iloc[0]
print(cat)


import pandas as pd

# Initialize an empty DataFrame
df_judge = pd.DataFrame()
df_student = pd.DataFrame()

#response = "B"  # Replace with actual response logic, for testing only use


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# # Define the two texts to compare, for testing only use
# text1 = "The quick brown fox jumps over the lazy dog."
# text2 = "A fast brown fox leaps over a sleepy dog."

# Student answers from Mistral and DeepSeek
mistral_text1 = res1.text
deepseek_text2 = res2.text

# Reference text: the answer preferred by the human judge
if winner == "model_a":
    text2 = a_qa['content']
else:
    text2 = b_qa['content']
print(text2)


def tfidf_cosine(candidate: str, reference: str) -> float:
    """Return the TF-IDF cosine similarity between two texts."""
    tfidf_matrix = TfidfVectorizer().fit_transform([candidate, reference])
    return float(cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0])


# Keep the scores numeric (rounded to 2 decimals) so the comparison below
# compares numbers rather than formatted strings.
similarity_score_model_a = round(tfidf_cosine(mistral_text1, text2), 2)
print(f"Cosine Similarity (Mistral): {similarity_score_model_a:.2f}")

similarity_score_model_b = round(tfidf_cosine(deepseek_text2, text2), 2)
print(f"Cosine Similarity (DeepSeek): {similarity_score_model_b:.2f}")

if similarity_score_model_a > similarity_score_model_b:
    student_winner = "model_a"
else:
    student_winner = "model_b"

# Simulate data generation in a loop for judge data capture
for i in range(1):
    new_row = {"LLMWinner": agent_res.lower(), "HumanWinner": winner, "QuestionID": qid, "Category": cat}  # Define a new row
    df_judge = pd.concat([df_judge, pd.DataFrame([new_row])], ignore_index=True)  # Append the new row

# Simulate data generation in a loop for student data capture
for i in range(1):
    new_row = {"Model_A": "Mistral", "Model_B": "DeepSeek", "Winner": student_winner, "SimilarityToWinner_model_a": similarity_score_model_a, "SimilarityToWinner_model_b": similarity_score_model_b, "Category": cat}  # Define a new row
    df_student = pd.concat([df_student, pd.DataFrame([new_row])], ignore_index=True)  # Append the new row

# Print the resulting DataFrame
print(df_judge)
print(df_student)



Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Index: 0, Row: {'question_id': 81, 'model_a': 'alpaca-13b', 'model_b': 'gpt-3.5-turbo', 'winner': 'model_b', 'judge': 'author_2', 'conversation_a': [{'content': 'Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.', 'role': 'user'}, {'content': 'I recently had the pleasure of visiting Hawaii and it quickly became one of my favorite places. From the stunning beaches to the lush mountains, this place has it all. The people are incredibly friendly and the culture is alive and well. One of the highlights of my trip was visiting the Polynesian Cultural Center. Here, I was able to learn about the culture of the native Hawaiian people and try my hand at traditional crafts and activities. I also had a chance to explore some of the natural wonders of the island, including the breathtaking Hanauma Bay and the majestic Waimea Canyon. Whether you’re looking for a relaxing beach vacation or an adventure filled with culture 

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Thought: The current language of the user is English. I need to compare the two answers to identify the similarities and differences between them.
Action: compare
Action Input: {'a': 'A:I recently had the pleasure of visiting Hawaii and it quickly became one of my favorite places. From the stunning beaches to the lush mountains, this place has it all. The people are incredibly friendly and the culture is alive and well. One of the highlights of my trip was visiting the Polynesian Cultural Center. Here, I was able to learn about the culture of the native Hawaiian people and try my hand at traditional crafts and activities. I also had a chance to explore some of the natural wonders of the island, including the breathtaking Hanauma Bay and the majestic Waimea Canyon. Whether you’re looking for a relaxing beach vacation or an adventure filled with culture and nature, Hawaii is the perfect destination.', 'b': 'Aloha! I recently had the pleasure of embarking on a trip to the b

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Thought: I will compare the two answers based on various factors such as coherence, grammar, and content.
Action: compare
Action Input: {'a': 'A:I recently had the pleasure of visiting Hawaii and it quickly became one of my favorite places. From the stunning beaches to the lush mountains, this place has it all. The people are incredibly friendly and the culture is alive and well. One of the highlights of my trip was visiting the Polynesian Cultural Center. Here, I was able to learn about the culture of the native Hawaiian people and try my hand at traditional crafts and activities. I also had a chance to explore some of the natural wonders of the island, including the breathtaking Hanauma Bay and the majestic Waimea Canyon. Whether you’re looking for a relaxing beach vacation or an adventure filled with culture and nature, Hawaii is the perfect destination.', 'b': 'Aloha! I recently had the pleasure of embarking on a trip to the beautiful island of Hawaii, and let me tel

Setting `pad_token_id` to `eos_token_id`:100001 for open-end generation.


Start Mistral LLM response-----------------------------------
or less

Title: Aloha from Paradise: Unforgettable Hawaiian Adventures

Aloha! Just returned from a breathtaking journey to the enchanting islands of Hawaii, where the warm Aloha spirit, stunning landscapes, and rich culture left an indelible mark on my heart. This tropical paradise offers a unique blend of adventure, relaxation, and cultural immersion that is sure to captivate any traveler.

My adventure began on the island of Oahu, home to the vibrant city of Honolulu and the iconic Waikiki Beach. I spent my days soaking up the sun on the pristine sands, surfing the legendary waves, and exploring the historic Pearl Harbor. The USS Arizona Memorial, a poignant tribute to the lives lost during the attack on December 7, 1941, was a humbling and moving experience that provided a glimpse into Hawaii's significant role in American history.

Next, I ventured to the lush, verdant island of Kauai, often referred to as the "Garden I

## 7. Original implementation code to analyze agreement between GPT-4 and humans, and between humans.

In [None]:
def get_judge_name(judge):
    if isinstance(judge, list) and judge[0] == "gpt-4" and judge[1].startswith("pair"):
        return "gpt4-pair"
    if judge.startswith("expert"):
        return "human"
    if judge.startswith("author"):
        return "author"
    return judge


def revert(vote):
    if vote == "model_a":
        return "model_b"
    elif vote == "model_b":
        return "model_a"
    return vote


def get_mt_bench_votes_data(raw_votes):
    data = [{}, {}]

    for judge_votes in raw_votes:
        for vote in judge_votes:
            turn = vote["turn"] - 1
            if vote["model_a"] < vote["model_b"]:
                key = (vote["question_id"], vote["model_a"], vote["model_b"])
                winner = vote["winner"]
            else:
                key = (vote["question_id"], vote["model_b"], vote["model_a"])
                winner = revert(vote["winner"])
            judge = get_judge_name(vote["judge"])
            if key not in data[turn]:
                data[turn][key] = {}
            if judge not in data[turn][key]:
                data[turn][key][judge] = []
            data[turn][key][judge].append(winner)

    return data


def convertvote(vote):
    if "tie" in vote:
        return "tie"
    return vote


def equalvote(vote1, vote2):
    if "tie" in vote1 and "tie" in vote2:
        return True
    return vote1 == vote2


# data: Dict[qid -> List[vote]]
def get_mt_bench_agreement(data, judge1, judge2, ban):
    if judge1.startswith("gpt4") and judge2 == "human":
        stats = [0, 0]
        for votes in data.values():
            if judge1 not in votes or judge2 not in votes: continue
            assert len(votes[judge1]) == 1
            if convertvote(votes[judge1][0]) in ban: continue
            for v in votes[judge2]:
                if convertvote(v) in ban: continue
                stats[1] += 1
                stats[0] += equalvote(votes[judge1][0], v)
        return stats[0], stats[1]
    elif judge1 == "human" and judge2 == "human":
        stats = [0, 0]
        for votes in data.values():
            if "human" not in votes: continue
            for i in range(len(votes["human"]) - 1):
                for j in range(i + 1, len(votes["human"])):
                    if convertvote(votes["human"][i]) in ban or convertvote(votes["human"][j]) in ban:
                        continue
                    stats[1] += 1
                    stats[0] += equalvote(votes["human"][i], votes["human"][j])
        return stats[0], stats[1]
    else:
        raise Exception("Unsupported judges.")


def run_mt_bench_agreement(judges, votefiles):
    # votes[i]: List of votes
    votes = []
    for filename in votefiles:
        data = []
        for line in open(filename, "r"):
            data.append(json.loads(line))
        votes.append(data)

    data = get_mt_bench_votes_data(votes)

    agree, total = get_mt_bench_agreement(data[0], judges[0], judges[1], ban=[])
    print(f"turn 1 with tie. #total: {total}, #agree: {agree}, ratio: {agree/total:.2f}")
    agree, total = get_mt_bench_agreement(data[0], judges[0], judges[1], ban=["tie"])
    print(f"turn 1 without tie. #total: {total}, #agree: {agree}, ratio: {agree/total:.2f}")
    agree, total = get_mt_bench_agreement(data[1], judges[0], judges[1], ban=[])
    print(f"turn 2 with tie. #total: {total}, #agree: {agree}, ratio: {agree/total:.2f}")
    agree, total = get_mt_bench_agreement(data[1], judges[0], judges[1], ban=["tie"])
    print(f"turn 2 without tie. #total: {total}, #agree: {agree}, ratio: {agree/total:.2f}")
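
The agreement statistic computed above is the fraction of vote pairs on which two judges agree, reported once with ties counted and once with any pair containing a tie dropped. The sketch below reproduces that logic on a handful of made-up vote pairs; the data and the helper names `is_tie`, `votes_agree`, and `agreement` are hypothetical.

```python
def is_tie(vote: str) -> bool:
    """Treat any vote label containing 'tie' as a tie."""
    return "tie" in vote


def votes_agree(vote1: str, vote2: str) -> bool:
    """Two ties agree with each other; otherwise the labels must match exactly."""
    if is_tie(vote1) and is_tie(vote2):
        return True
    return vote1 == vote2


def agreement(pairs, include_ties: bool = True):
    """Return (agree, total) over (judge_vote, human_vote) pairs."""
    agree = total = 0
    for judge_vote, human_vote in pairs:
        if not include_ties and (is_tie(judge_vote) or is_tie(human_vote)):
            continue  # drop any pair in which either side voted tie
        total += 1
        agree += votes_agree(judge_vote, human_vote)
    return agree, total


# Made-up votes for illustration only.
toy_pairs = [
    ("model_a", "model_a"),
    ("model_b", "model_a"),
    ("tie", "tie"),
    ("model_a", "tie"),
    ("model_b", "model_b"),
]

with_ties = agreement(toy_pairs, include_ties=True)
without_ties = agreement(toy_pairs, include_ties=False)
print(with_ties, without_ties)
```

Dropping ties removes the pairs where a tie meets a decisive vote, which is one reason the ratio without ties tends to be higher.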

## 8. Checking agreement (win/lose or tie) between GPT-4 and humans

In [None]:
# Compute agreement between GPT-4 and humans
run_mt_bench_agreement(["gpt4_pair", "human"], ["gpt4_pair_judgments.json", "human_judgments.json"])

turn 1 with tie. #total: 1343, #agree: 886, ratio: 0.66
turn 1 without tie. #total: 859, #agree: 727, ratio: 0.85
turn 2 with tie. #total: 1325, #agree: 871, ratio: 0.66
turn 2 without tie. #total: 864, #agree: 731, ratio: 0.85


## 9. Checking agreement (win/lose or tie) between humans

In [None]:
# Compute agreement between humans and humans
run_mt_bench_agreement(["human", "human"], ["human_judgments.json"])

turn 1 with tie. #total: 721, #agree: 454, ratio: 0.63
turn 1 without tie. #total: 479, #agree: 388, ratio: 0.81
turn 2 with tie. #total: 707, #agree: 471, ratio: 0.67
turn 2 without tie. #total: 474, #agree: 388, ratio: 0.82
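
The printed ratios in sections 8 and 9 can be re-derived from the printed counts as `agree / total`. The sketch below copies the counts from the outputs above and recomputes each ratio to two decimals; nothing here is new data.

```python
# (label, agree, total, ratio as printed in the notebook output)
reported = [
    ("GPT-4 vs human, turn 1 with tie", 886, 1343, "0.66"),
    ("GPT-4 vs human, turn 1 without tie", 727, 859, "0.85"),
    ("GPT-4 vs human, turn 2 with tie", 871, 1325, "0.66"),
    ("GPT-4 vs human, turn 2 without tie", 731, 864, "0.85"),
    ("human vs human, turn 1 with tie", 454, 721, "0.63"),
    ("human vs human, turn 1 without tie", 388, 479, "0.81"),
    ("human vs human, turn 2 with tie", 471, 707, "0.67"),
    ("human vs human, turn 2 without tie", 388, 474, "0.82"),
]

recomputed = {label: f"{agree / total:.2f}" for label, agree, total, _ in reported}

for label, agree, total, printed in reported:
    print(f"{label}: {agree}/{total} = {recomputed[label]} (printed: {printed})")
```

On these numbers, with ties excluded, GPT-4 agrees with humans (0.85) at least as often as humans agree with each other (0.81 to 0.82).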


## 10. Plot the judgment data. Due to GPU constraints we were able to showcase only 1 row; the intention is to reproduce the plot in cell 11 from the original implementation.

In [None]:
import json
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

import plotly.express as px
#df = px.data.wind()
print(df_judge)
fig = px.scatter_polar(df_judge, r="QuestionID", theta="Category")
fig.show()


  LLMWinner HumanWinner  QuestionID Category
0         b     model_b          81  Writing


In [None]:
!wget https://huggingface.co/spaces/lmsys/mt-bench/resolve/main/data/mt_bench/model_judgment/gpt-4_single.jsonl
!wget https://huggingface.co/spaces/lmsys/mt-bench/resolve/main/data/mt_bench/model_judgment/gpt-4_pair.jsonl


## 11. Original implementation code that plots how well the chosen models performed in each category, with GPT-4 as the judge (GPT-4 also evaluates itself).

In [None]:
import json
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go


CATEGORIES = ["Writing", "Roleplay", "Reasoning", "Math", "Coding", "Extraction", "STEM", "Humanities"]


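def get_model_df():
    """Load GPT-4 single-answer scores (JSON lines) and attach a category to each row."""
    q2result = []
    # Use a context manager so the file is closed after reading.
    with open("gpt-4_single.jsonl", "r") as fin:
        for line in fin:
            obj = json.loads(line)
            # Question IDs start at 81, with ten questions per category.
            obj["category"] = CATEGORIES[(obj["question_id"] - 81) // 10]
            q2result.append(obj)
    return pd.DataFrame(q2result)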

def toggle(res_str):
    if res_str == "win":
        return "loss"
    elif res_str == "loss":
        return "win"
    return "tie"

def get_model_df_pair():
    """Load GPT-4 pairwise judgments and reduce the two games to win/loss/tie."""
    q2result = []
    # Use a context manager so the file is closed after reading.
    with open("gpt-4_pair.jsonl", "r") as fin:
        for line in fin:
            obj = json.loads(line)

            result = {}
            result["qid"] = str(obj["question_id"])
            result["turn"] = str(obj["turn"])
            # A win or loss is recorded only when both games agree; otherwise it is a tie.
            if obj["g1_winner"] == "model_1" and obj["g2_winner"] == "model_1":
                result["result"] = "win"
            elif obj["g1_winner"] == "model_2" and obj["g2_winner"] == "model_2":
                result["result"] = "loss"
            else:
                result["result"] = "tie"
            result["category"] = CATEGORIES[(obj["question_id"] - 81) // 10]
            result["model"] = obj["model_1"]
            q2result.append(result)

    return pd.DataFrame(q2result)

df = get_model_df()
df_pair = get_model_df_pair()
all_models = df["model"].unique()
print(all_models)
scores_all = []
for model in all_models:
    for cat in CATEGORIES:
        # filter category/model, and score format error (<1% case)
        res = df[(df["category"]==cat) & (df["model"]==model) & (df["score"] >= 0)]
        score = res["score"].mean()
        scores_all.append({"model": model, "category": cat, "score": score})

target_models = ["Llama-2-7b-chat", "Llama-2-13b-chat", "Llama-2-70b-chat", "gpt-3.5-turbo", "claude-v1", "gpt-4"]

scores_target = [scores_all[i] for i in range(len(scores_all)) if scores_all[i]["model"] in target_models]

# sort by target_models
scores_target = sorted(scores_target, key=lambda x: target_models.index(x["model"]), reverse=True)

df_score = pd.DataFrame(scores_target)
df_score = df_score[df_score["model"].isin(target_models)]

rename_map = {"llama-13b": "LLaMA-13B",
              "alpaca-13b": "Alpaca-13B",
              "vicuna-33b-v1.3": "Vicuna-33B",
              "vicuna-13b-v1.3": "Vicuna-13B",
              "gpt-3.5-turbo": "GPT-3.5-turbo",
              "claude-v1": "Claude-v1",
              "gpt-4": "GPT-4"}

for k, v in rename_map.items():
    df_score.replace(k, v, inplace=True)

fig = px.line_polar(df_score, r = 'score', theta = 'category', line_close = True, category_orders = {"category": CATEGORIES},
                    color = 'model', markers=True, color_discrete_sequence=px.colors.qualitative.Pastel)

fig.show()

['alpaca-13b' 'baize-v2-13b' 'chatglm-6b' 'claude-instant-v1' 'claude-v1'
 'dolly-v2-12b' 'falcon-40b-instruct' 'fastchat-t5-3b' 'gpt-3.5-turbo'
 'gpt-4' 'gpt4all-13b-snoozy' 'guanaco-33b' 'guanaco-65b'
 'h2ogpt-oasst-open-llama-13b' 'koala-13b' 'llama-13b' 'mpt-30b-chat'
 'mpt-30b-instruct' 'mpt-7b-chat' 'nous-hermes-13b'
 'oasst-sft-4-pythia-12b' 'oasst-sft-7-llama-30b' 'palm-2-chat-bison-001'
 'rwkv-4-raven-14b' 'stablelm-tuned-alpha-7b' 'tulu-30b' 'vicuna-13b-v1.3'
 'vicuna-33b-v1.3' 'vicuna-7b-v1.3' 'wizardlm-13b' 'wizardlm-30b'
 'Llama-2-7b-chat' 'Llama-2-13b-chat' 'Llama-2-70b-chat']
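
In `get_model_df_pair`, each record carries two verdicts, `g1_winner` and `g2_winner` (presumably one per game of the same comparison), and the result is a win or loss only when both agree. The sketch below isolates that rule and tallies a few made-up records; the helper name `combine_games` and the sample data are hypothetical.

```python
from collections import Counter


def combine_games(g1_winner: str, g2_winner: str) -> str:
    """Reduce two game verdicts to 'win', 'loss', or 'tie' from model_1's view."""
    if g1_winner == "model_1" and g2_winner == "model_1":
        return "win"
    if g1_winner == "model_2" and g2_winner == "model_2":
        return "loss"
    # Disagreement between the games, or any explicit tie, counts as a tie.
    return "tie"


# Made-up verdict pairs for illustration only.
toy_records = [
    ("model_1", "model_1"),
    ("model_1", "model_2"),
    ("model_2", "model_2"),
    ("tie", "model_1"),
    ("model_1", "model_1"),
]

tally = Counter(combine_games(g1, g2) for g1, g2 in toy_records)
print(dict(tally))
```

Requiring both games to agree is a conservative choice: a verdict that flips between the two games is not counted for either model.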
