<div class="align-center">
<a href="https://oumi.ai/"><img src="https://oumi.ai/docs/en/latest/_static/logo/header_logo.png" height="200"></a>

[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://oumi.ai/docs/en/latest/index.html)
[![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi)
[![GitHub Repo stars](https://img.shields.io/github/stars/oumi-ai/oumi)](https://github.com/oumi-ai/oumi)
<a target="_blank" href="https://colab.research.google.com/github/oumi-ai/oumi/blob/main/notebooks/Oumi - Evaluation with MT Bench.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</div>

👋 Welcome to Open Universal Machine Intelligence (Oumi)!

🚀 Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models - from [data preparation](https://oumi.ai/docs/en/latest/resources/datasets/datasets.html) and [training](https://oumi.ai/docs/en/latest/user_guides/train/train.html) to [evaluation](https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html) and [deployment](https://oumi.ai/docs/en/latest/user_guides/launch/launch.html). Whether you're developing on a laptop, launching large scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.

🤝 Make sure to join our [Discord community](https://discord.gg/oumi) to get help, share your experiences, and contribute to the project! If you are interested in joining one of the community's open-science efforts, check out our [open collaboration](https://oumi.ai/community) page.

⭐ If you like Oumi and you would like to support it, please give it a star on [GitHub](https://github.com/oumi-ai/oumi).

# Evaluation with MT Bench

This notebook shows how to run end-to-end evaluations of your trained model with [MT Bench](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge). Evaluating with MT Bench is a two-step process. In the first step, we run inference with your model to generate answers to the 80 multi-turn MT Bench questions. In the second step, we generate judgments (GPT-4 is the default judge) comparing your model's answers against reference answers. Each answer receives a score from 1 to 10, based on factors such as helpfulness, relevance, accuracy, depth, creativity, and level of detail.

## Machine requirements

❗**NOTICE:** This notebook must be run on a machine with CUDA support, because MT Bench only runs on CUDA. If running on Google Colab, you can use the free T4 GPU runtime (Colab Menu: `Runtime` -> `Change runtime type`).

In [1]:
import torch

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Number of GPUs: {torch.cuda.device_count()}")
    print(f"GPU type: {torch.cuda.get_device_name()}")
else:
    print("Error! MT Bench will NOT run on a machine without CUDA.")

CUDA version: 12.1
Number of GPUs: 1
GPU type: NVIDIA A100-SXM4-40GB


If your local machine cannot run this notebook, you can instead run this notebook on a cloud platform. The following snippet demonstrates how to open a VSCode instance backed by a GCP node with an A100 GPU, from which the notebook can be run. For installation details, please refer to the [gcloud CLI](https://cloud.google.com/sdk/docs/install) page.

```bash
! gcloud auth application-default login  # Authenticate with GCP.

# The required GPU count depends on your model.
# Here we use 1 A100 40GB GPU.
! make gcpcode ARGS="--resources.accelerators A100:1"
```

## Prerequisites and Configuration

First, start by cloning the [FastChat](https://github.com/lm-sys/FastChat) repo, which includes the MT Bench framework.


In [2]:
FAST_CHAT_REPO = "/tmp/oumi/FastChat"  # Folder to clone to.
! git clone https://github.com/lm-sys/FastChat.git $FAST_CHAT_REPO

Cloning into '/tmp/oumi/FastChat'...
remote: Enumerating objects: 8425, done.
remote: Counting objects: 100% (226/226), done.
remote: Compressing objects: 100% (115/115), done.
remote: Total 8425 (delta 164), reused 154 (delta 108), pack-reused 8199 (from 1)
Receiving objects: 100% (8425/8425), 34.48 MiB | 37.25 MiB/s, done.
Resolving deltas: 100% (6406/6406), done.


Then, navigate to that folder and pip install FastChat with its `model_worker` and `llm_judge` extras, along with the `jsonlines` package.

In [3]:
import os

os.chdir(FAST_CHAT_REPO)
! pip install -q -e ".[model_worker,llm_judge]" jsonlines

A judge is needed to compare your model's responses against the reference responses and calculate the score. By default, the judge is GPT-4. To access GPT-4 models, an OpenAI API key is required. Details on creating an OpenAI account and generating a key can be found at [OpenAI's quickstart webpage](https://platform.openai.com/docs/quickstart).

In [4]:
os.environ["OPENAI_API_KEY"] = ""  # NOTE: Set your OpenAI API key here

<b>⚠️ Cost considerations</b>: To get an accurate estimate of the cost to judge 160 examples (80 x 2-turn conversations) with GPT-4, please visit [OpenAI's pricing](https://openai.com/api/pricing/) page. The cost for judging Llama 3.2 1B IT responses is <b>$5.10</b> as of December 2024. Since this notebook is sample code, we will only annotate and judge 3 x 2-turn conversations, reducing the GPT-4 judgment cost proportionally to roughly <b>20¢</b>.

In [5]:
NUM_EXAMPLES = 3
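
To get a feel for how the judgment cost changes with `NUM_EXAMPLES`, the sketch below scales the quoted full-benchmark figure linearly. Linear scaling is an assumption (actual cost depends on answer lengths and current pricing), and the helper name `estimate_judgment_cost_usd` is hypothetical, not part of FastChat.

```python
FULL_BENCHMARK_CONVERSATIONS = 80  # MT Bench has 80 two-turn conversations.
FULL_BENCHMARK_COST_USD = 5.10  # Figure quoted above (Llama 3.2 1B IT, December 2024).


def estimate_judgment_cost_usd(num_conversations: int) -> float:
    """Scale the quoted full-benchmark cost linearly to a subset of conversations."""
    if not 0 <= num_conversations <= FULL_BENCHMARK_CONVERSATIONS:
        raise ValueError("num_conversations must be between 0 and 80")
    # Assumption: every conversation costs about the same to judge.
    return FULL_BENCHMARK_COST_USD * num_conversations / FULL_BENCHMARK_CONVERSATIONS


print(f"Estimated cost for 3 conversations: ${estimate_judgment_cost_usd(3):.2f}")
```

Treat the result as a rough order-of-magnitude check before launching a full run, not as a billing prediction.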

Finally, point to your model (`MODEL_PATH`). MT Bench supports HuggingFace repo IDs and paths to local folders that contain your model.
Also, please provide a human-friendly custom `MODEL_DISPLAY_NAME` for your model; it will be used to uniquely reference your model when generating judgments or inspecting scores.

In [6]:
MODEL_PATH = "HuggingFaceTB/SmolLM2-135M-Instruct"
MODEL_DISPLAY_NAME = "my_model"

## Step 1: Run inference

Navigate to the LLM judge folder and run inference, passing in your model path and model ID as shown below. Since this is sample code, we run inference only for the first `NUM_EXAMPLES` examples.

Additional arguments to consider (more details [here](https://github.com/lm-sys/FastChat/blob/1cd4b74fa00d1a60852ea9c88e4cc4fc070e4512/fastchat/llm_judge/gen_model_answer.py#L209C1-L271C6)):
- You can change the location of the output file by setting `--answer-file=<file path>`.
- You can restrict the max number of generated tokens by your model by setting `--max-new-token=<number of tokens>`.
- You can specify the model revision to be loaded by `--revision=<model revision>`.
- You can set the number of GPUs to be used when running inference with your model with `--num-gpus-per-model=<num GPUs>` (if not set, the default is 1).
- You can restrict the GPU memory used when running inference by `--max-gpu-memory=<max memory>`.
- You can overwrite the default `dtype` with `--dtype=<dtype>` (if not set, the default is to use float16 on GPU, float32 on CPU).
- You can run inference on a subset of the examples by setting the index of the first question with `--question-begin=<question index>` and the index of the last question with `--question-end=<question index>`.
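
When several of these options are combined, it can help to assemble the command programmatically. The following is a minimal sketch: the helper `build_inference_command` is hypothetical, it covers only three of the flags listed above, and the value `512` is an arbitrary illustration.

```python
import shlex
from typing import List, Optional


def build_inference_command(
    model_path: str,
    model_id: str,
    question_end: Optional[int] = None,
    max_new_token: Optional[int] = None,
    dtype: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for gen_model_answer.py, skipping unset options."""
    command = [
        "python", "gen_model_answer.py",
        "--model-path", model_path,
        "--model-id", model_id,
    ]
    optional_flags = {
        "--question-end": question_end,
        "--max-new-token": max_new_token,
        "--dtype": dtype,
    }
    for flag, value in optional_flags.items():
        if value is not None:  # Only pass flags the caller actually set.
            command += [flag, str(value)]
    return command


command = build_inference_command(
    "HuggingFaceTB/SmolLM2-135M-Instruct", "my_model", question_end=3, max_new_token=512
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run` from the LLM judge folder, or printed and pasted into a notebook cell after a `!`.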

In [7]:
LLM_JUDGE_FOLDER = f"{FAST_CHAT_REPO}/fastchat/llm_judge"
os.chdir(LLM_JUDGE_FOLDER)

! python gen_model_answer.py \
    --model-path $MODEL_PATH \
    --model-id $MODEL_DISPLAY_NAME \
    --question-end $NUM_EXAMPLES

Output to data/mt_bench/model_answer/my_model.jsonl
  0%|                                                     | 0/3 [00:00<?, ?it/s]
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
 33%|███████████████                              | 1/3 [00:05<00:10,  5.26s/it]
 67%|██████████████████████████████               | 2/3 [00:17<00:09,  9.49s/it]
100%|█████████████████████████████████████████████| 3/3 [00:44<00:00, 14.84s/it]


Inspect the inference results. 
The default output filename is `<MODEL_DISPLAY_NAME>.jsonl`. 
Note that the question IDs of the 80 multi-turn questions are 81-160. 

In [8]:
import jsonlines

INFERENCE_RESULTS_JSONL = (
    f"{LLM_JUDGE_FOLDER}/data/mt_bench/model_answer/{MODEL_DISPLAY_NAME}.jsonl"
)

with jsonlines.open(INFERENCE_RESULTS_JSONL) as jsonl_examples:
    for example in jsonl_examples:
        for turn in range(2):
            print(f"-----[ question={example['question_id']} turn={turn} ]-----")
            print(example["choices"][0]["turns"][turn], "\n\n")

-----[ question=81 turn=0 ]-----
I'd be happy to help you write a travel blog post about your recent trip to Hawaii. Here's a draft:
**Title:** A Tropical Paradise Awaits: Exploring the Best of Hawaii
**Introduction:**
As I stepped off the plane in Honolulu, I couldn't help but feel a sense of excitement and relaxation wash over me. The warm tropical air, the sound of ukulele music drifting through the airport, and the stunning beaches beckoning me to come and explore – I knew that I was in for an unforgettable adventure. My recent trip to Hawaii was a journey of discovery, where I immersed myself in the rich cultural heritage, breathtaking natural beauty, and warm hospitality of this enchanting island.

**Must-see Attractions:**
One of the first things that caught my attention was the historic Pearl Harbor. A somber reminder of the island's complex history, the USS Arizona Memorial is a poignant tribute to the lives lost during the attack. From there, I headed to the iconic Diamond He
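
Before paying for judgments, it can be worth checking that every question received two non-empty turns. The sketch below summarizes a model-answer file using only the fields read in the cell above (`question_id` and `choices[0]["turns"]`). The helper `summarize_answers` and the sample records are hypothetical stand-ins, not actual MT Bench output.

```python
import io
import json


def summarize_answers(jsonl_stream) -> list:
    """Return per-question turn counts and answer lengths from a model-answer JSONL stream."""
    summary = []
    for line in jsonl_stream:
        if not line.strip():
            continue  # Skip blank lines.
        record = json.loads(line)
        turns = record["choices"][0]["turns"]
        summary.append(
            {
                "question_id": record["question_id"],
                "num_turns": len(turns),
                # Indices of turns for which the model produced no text.
                "empty_turns": [i for i, text in enumerate(turns) if not text.strip()],
                "chars_per_turn": [len(text) for text in turns],
            }
        )
    return summary


# Hypothetical stand-in for the contents of `<MODEL_DISPLAY_NAME>.jsonl`.
sample = io.StringIO(
    '{"question_id": 81, "choices": [{"turns": ["A draft post.", ""]}]}\n'
    '{"question_id": 82, "choices": [{"turns": ["An email.", "A critique."]}]}\n'
)
for row in summarize_answers(sample):
    print(row)
```

To run it on real results, pass `open(INFERENCE_RESULTS_JSONL)` instead of the in-memory sample.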

## Step 2: Judge the model answers

In this notebook, we demonstrate the recommended "single-answer" grading mode, where the judge assigns each turn a score from 1 to 10. There are two additional grading options, where the judged model is compared pairwise to a single baseline model (`pairwise-baseline`) or multiple baseline models (`pairwise-all`) and win rates are generated. For more details, please read FastChat's [other grading options](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge#other-grading-options) section.

The command to invoke the GPT-4 judge to score each answer (single-answer grading) is shown below. Note that the `echo -ne '\n'` prefix is required because we are invoking the shell via a notebook and that script (`gen_judgment.py`) requires human verification by pressing "Enter". Piping the `\n` character into the script emulates pressing "Enter" right after executing `gen_judgment.py`. Also, note that we are only judging the first `NUM_EXAMPLES` examples, since this notebook is sample code. 

Additional arguments to consider (more details [here](https://github.com/lm-sys/FastChat/blob/1cd4b74fa00d1a60852ea9c88e4cc4fc070e4512/fastchat/llm_judge/gen_judgment.py#L170)):
- You can point to a different file of judge prompts by setting `--judge-file=<file path>` (default is `data/judge_prompts.jsonl`).
- You can enable multiple concurrent API calls to the judge by setting `--parallel=<number of concurrent API calls>` (default is 1).
- You can use a different judge model by setting `--judge-model=<judge model name>` (default is `gpt-4`). This option is not documented, and its scores might not be very informative if you are interested in comparative results, since published results rely on the default judge.
- You can change the model that generated the reference answers by setting `--baseline-model=<baseline model name>` (default is `gpt-3.5-turbo`). This option is also not documented, since the reference answers are used for comparative analysis.
- You can judge only a subset of the answers by setting `--first-n=<number of answers to judge>`. This flag is mainly used for debugging purposes; you can use it to reduce your judgment costs when testing the MT Bench framework.

In [9]:
os.chdir(LLM_JUDGE_FOLDER)

! echo -ne '\n' \
    | python gen_judgment.py \
        --model-list $MODEL_DISPLAY_NAME \
        --first-n $NUM_EXAMPLES

Stats:
{
    "bench_name": "mt_bench",
    "mode": "single",
    "judge": "gpt-4",
    "baseline": null,
    "model_list": [
        "my_model"
    ],
    "total_num_questions": 3,
    "total_num_matches": 6,
    "output_path": "data/mt_bench/model_judgment/gpt-4_single.jsonl"
}
  0%|                                                     | 0/6 [00:00<?, ?it/s]question: 81, turn: 1, model: my_model, score: 10, judge: ('gpt-4', 'single-v1')
 17%|███████▌                                     | 1/6 [00:06<00:31,  6.25s/it]question: 82, turn: 1, model: my_model, score: 10, judge: ('gpt-4', 'single-v1')
 33%|███████████████                              | 2/6 [00:12<00:24,  6.12s/it]question: 83, turn: 1, model: my_model, score: 9, judge: ('gpt-4', 'single-v1')
 50%|██████████████████████▌                      | 3/6 [00:15<00:14,  4.83s/it]question: 81, turn: 2, model: my_model, score: 1, judge: ('gpt-4', 'single-v1-multi-turn')
 67%|██████████████████████████████               | 4/6 [00:18<00:0
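
The `echo -ne '\n'` pipe described above can also be expressed in Python by sending a newline to the child process's standard input. This is a minimal sketch, not how FastChat itself is invoked: `run_with_enter` is a hypothetical helper, and the child command is a tiny stand-in script that, like `gen_judgment.py`, waits for "Enter" before continuing.

```python
import subprocess
import sys


def run_with_enter(command: list) -> subprocess.CompletedProcess:
    """Run `command`, writing a single newline to stdin to emulate pressing Enter."""
    return subprocess.run(
        command,
        input="\n",  # Plays the role of `echo -ne '\n' |`.
        text=True,
        capture_output=True,
        check=True,  # Raise if the child exits with a non-zero status.
    )


# Stand-in for gen_judgment.py: block on Enter, then report success.
demo = run_with_enter([sys.executable, "-c", "input(); print('confirmed')"])
print(demo.stdout.strip())
```

To apply the same idea to the judge, replace the stand-in command with the `gen_judgment.py` argument list and run it from the LLM judge folder.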

Inspect the judgments and the scores for each model answer. 
The default output filename is `gpt-4_single.jsonl`.

In [10]:
JUDGE_RESULTS_JSONL = (
    f"{LLM_JUDGE_FOLDER}/data/mt_bench/model_judgment/gpt-4_single.jsonl"
)

with jsonlines.open(JUDGE_RESULTS_JSONL) as jsonl_examples:
    for example in jsonl_examples:
        print(
            f"question={example['question_id']} "
            f"turn={example['turn']} "
            f"score={example['score']}\n"
            f"judgement: {example['judgment'][:300]}..."
        )

question=81 turn=1 score=10
judgement: The assistant's response is highly relevant, accurate, and detailed. It provides a comprehensive and engaging draft for a travel blog post about a recent trip to Hawaii. The assistant highlights cultural experiences, must-see attractions, and even provides some useful tips for future travelers. The ...
question=82 turn=1 score=10
judgement: The assistant's response is highly relevant, accurate, and detailed. It provides a professional and concise draft of an email that addresses the user's request. The assistant has included all the specific points the user wanted to ask about, such as data analysis, presentation style, and clarity of ...
question=83 turn=1 score=9
judgement: The assistant's response is highly relevant, accurate, and detailed. It provides a clear and comprehensive outline for a blog post comparing two smartphone models. The assistant covers all the key points requested by the user, including features, performance, and user experie

Retrieve your aggregate judgment score (with per-turn breakdown), as shown below.

In [11]:
! python show_result.py --model-list $MODEL_DISPLAY_NAME

Mode: single
Input file: data/mt_bench/model_judgment/gpt-4_single.jsonl

########## First turn ##########
                  score
model    turn          
my_model 1     9.666667

########## Second turn ##########
                  score
model    turn          
my_model 2     5.333333

########## Average ##########
          score
model          
my_model    7.5


Alternatively, you can programmatically calculate the judgment score as follows.

In [12]:
import pandas as pd

df_judge_results = pd.read_json(JUDGE_RESULTS_JSONL, lines=True)
df_judge_results = df_judge_results.loc[
    (df_judge_results["model"] == MODEL_DISPLAY_NAME)
    & (df_judge_results["score"] != -1)
]
overall_score = df_judge_results["score"].mean()
print(overall_score)

2.1666666666666665
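
The per-turn breakdown can be computed programmatically as well. The sketch below uses only the standard library and the `model`, `turn`, and `score` fields of the judgment file. The helper `per_turn_scores` is hypothetical, and the sample records are made up for illustration; they are not results from this notebook.

```python
import json
from collections import defaultdict
from statistics import mean


def per_turn_scores(jsonl_lines, model_id: str) -> dict:
    """Average judge scores per turn for one model, skipping records with score -1."""
    scores_by_turn = defaultdict(list)
    for line in jsonl_lines:
        if not line.strip():
            continue  # Skip blank lines.
        record = json.loads(line)
        if record["model"] == model_id and record["score"] != -1:
            scores_by_turn[record["turn"]].append(record["score"])
    return {turn: mean(scores) for turn, scores in sorted(scores_by_turn.items())}


# Hypothetical records, not results from this notebook.
sample_lines = [
    '{"model": "my_model", "turn": 1, "score": 8}',
    '{"model": "my_model", "turn": 1, "score": 6}',
    '{"model": "my_model", "turn": 2, "score": 4}',
    '{"model": "my_model", "turn": 2, "score": -1}',
    '{"model": "other_model", "turn": 1, "score": 10}',
]
print(per_turn_scores(sample_lines, "my_model"))
```

On real data, pass `open(JUDGE_RESULTS_JSONL)` and `MODEL_DISPLAY_NAME` instead of the sample values.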


## [Optional] Retain your configuration for reproducibility

To be able to reproduce your evaluation run in the future, save the configuration of your evaluation together with your evaluation metrics.

In [13]:
import datetime
import json

import git  # Provided by the GitPython package.

evaluation_config_dict = {
    "fast_chat_repo": {
        "repo_tag": str(git.Repo(FAST_CHAT_REPO).tags[-1]),
        "commit_hash": git.Repo(FAST_CHAT_REPO).head.commit.hexsha,
    },
    "configs": {
        "model_path": MODEL_PATH,
        "model_id": MODEL_DISPLAY_NAME,
    },
    "timestamp": str(datetime.datetime.now()),
    "eval_metrics": {"score": overall_score},
}

evaluation_config_json = json.dumps(evaluation_config_dict, indent=2)
with open("./evaluation_config.json", "w") as output_file:
    output_file.write(evaluation_config_json)
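
When revisiting a run later, the saved file can be loaded back and checked for the expected fields. The following is a minimal sketch: `load_evaluation_config` is a hypothetical helper, and the round-trip demo uses placeholder values rather than real results.

```python
import json
import tempfile
from pathlib import Path

# Top-level keys written by the cell above.
REQUIRED_KEYS = {"fast_chat_repo", "configs", "timestamp", "eval_metrics"}


def load_evaluation_config(path) -> dict:
    """Load a saved evaluation config and check that the expected top-level keys exist."""
    config = json.loads(Path(path).read_text())
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise KeyError(f"Missing keys in evaluation config: {sorted(missing)}")
    return config


# Round-trip demo with placeholder values (not real results).
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "evaluation_config.json"
    config_path.write_text(
        json.dumps(
            {
                "fast_chat_repo": {"repo_tag": "<tag>", "commit_hash": "<hash>"},
                "configs": {"model_path": "<model path>", "model_id": "my_model"},
                "timestamp": "<timestamp>",
                "eval_metrics": {"score": 0.0},
            }
        )
    )
    loaded = load_evaluation_config(config_path)
    print(loaded["configs"]["model_id"])
```

Checking out the recorded `commit_hash` in the FastChat clone before rerunning helps ensure the same judge prompts and scripts are used.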