# Evaluation with Alpaca Eval 2.0

This notebook shows how to run end-to-end (E2E) evaluations for your trained model, using Oumi inference to generate the responses and [Alpaca Eval 2.0](https://github.com/tatsu-lab/alpaca_eval) to automatically calculate win rates against GPT-4 Turbo (or another reference model of your choice).

## Prerequisites and Configuration

First, install the [Alpaca Eval package](https://pypi.org/project/alpaca-eval/).


In [None]:
%pip install -U -q alpaca_eval

Calculating win rates means comparing your model's responses against the reference responses, which requires an annotator (judge). By default, the annotator is GPT-4 Turbo (annotator config: [weighted_alpaca_eval_gpt4_turbo](https://github.com/tatsu-lab/alpaca_eval?tab=readme-ov-file#alpacaeval-20)). Accessing the latest GPT-4 models, including GPT-4 Turbo, requires an OpenAI API key. Details on creating an OpenAI account and generating a key can be found on [OpenAI's quickstart webpage](https://platform.openai.com/docs/quickstart).

In [2]:
import os

os.environ["OPENAI_API_KEY"] = ""  # Set your OpenAI API key here
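
Hardcoding a key in a notebook cell makes it easy to leak when the notebook is shared. As an alternative, here is a minimal sketch that keeps any key already exported in your shell and otherwise prompts for one. `ensure_api_key` is a hypothetical helper name used for illustration; it is not part of Oumi or Alpaca Eval.

```python
import os
from getpass import getpass


def ensure_api_key(var_name="OPENAI_API_KEY", prompt_fn=getpass):
    """Return the key from the environment, prompting only if it is unset or empty."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        # Nothing usable in the environment: ask without echoing the input.
        key = prompt_fn(f"Enter a value for {var_name}: ").strip()
        if not key:
            raise ValueError(f"{var_name} must not be empty.")
        # Store it so that libraries reading the environment can find it.
        os.environ[var_name] = key
    return key
```

Calling `ensure_api_key()` in place of the assignment above leaves an existing `OPENAI_API_KEY` untouched and only prompts when none is set.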

<b>⚠️ Cost considerations</b>: Running a standard Alpaca Eval 2.0 evaluation (with the [weighted_alpaca_eval_gpt4_turbo](https://github.com/tatsu-lab/alpaca_eval/blob/main/src/alpaca_eval/evaluators_configs/README.md) config), which annotates 805 examples with GPT-4 Turbo, costs <b>$3.5</b>. The sample code in this notebook annotates only 3 of the 805 examples and costs less than <b>0.5¢</b>.

In [3]:
NUM_EXAMPLES = 3  # Replace with 805 for full dataset evaluation.

Define your model and the maximum number of tokens it supports (used during generation). You can point to any model on Hugging Face, to a local folder that contains your model, or to any other model format that Oumi inference supports. Also provide a human-friendly display name for your model, which is used when the model appears in leaderboards.


In [4]:
MODEL_NAME = "bartowski/Llama-3.2-1B-Instruct-GGUF"
MODEL_DISPLAY_NAME = "MyLlamaTestModel"
MODEL_MAX_TOKENS = 8192

## Step 1: Retrieve Alpaca dataset

Alpaca Eval 2.0 requires model responses for the [tatsu-lab/alpaca_eval](https://huggingface.co/datasets/tatsu-lab/alpaca_eval) dataset.

In [None]:
from oumi.datasets.evaluation import AlpacaEvalDataset

alpaca_dataset = AlpacaEvalDataset(dataset_name="tatsu-lab/alpaca_eval").conversations()

Since this notebook contains sample code, we will only run inference for the first `NUM_EXAMPLES` (of 805) from the dataset. 

In [None]:
alpaca_dataset = alpaca_dataset[:NUM_EXAMPLES]  # For testing purposes, reduce examples.

for index, conversation in enumerate(alpaca_dataset):
    print(index, conversation.messages)

## Step 2: Run inference

First, define all the relevant parameters and configs required for inference.

In [7]:
from oumi.core.configs import GenerationParams, InferenceConfig, ModelParams

generation_params = GenerationParams(max_new_tokens=MODEL_MAX_TOKENS)
model_params = ModelParams(model_name=MODEL_NAME, model_max_length=MODEL_MAX_TOKENS)
inference_config = InferenceConfig(model=model_params, generation=generation_params)

Then, choose an inference engine that your model is compatible with. For more information on this, see Oumi's [inference documentation](https://oumi.ai/docs/latest/user_guides/infer/infer.html). 

In [None]:
from oumi.inference import LlamaCppInferenceEngine

inference_engine = LlamaCppInferenceEngine(model_params)

Next, run inference to get responses from your model for the prompts contained in the `alpaca_dataset`.

In [None]:
responses = inference_engine.infer(alpaca_dataset, inference_config)

Then, convert the responses from Oumi format (list of `Conversation`s) to Alpaca format (list of `dict`s, where the data is contained under the keys `instruction` and `output`). Create a DataFrame from the data and add a new column "`generator`", which captures the human-readable name of the model the responses were produced with. 

In [10]:
import pandas as pd

from oumi.datasets.evaluation import utils

responses_json = utils.conversations_to_alpaca_format(responses)
responses_df = pd.DataFrame(responses_json)
responses_df["generator"] = MODEL_DISPLAY_NAME
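
To make the target format concrete, here is a minimal sketch of the kind of mapping this conversion performs. It uses a stand-in `Message` dataclass and a hypothetical `to_alpaca_records` function rather than Oumi's own `Conversation` type, and it assumes each conversation holds one user prompt followed by one assistant reply. It illustrates the data shape only and is not Oumi's implementation.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Stand-in for a chat message (hypothetical, not Oumi's class)."""

    role: str  # "user" or "assistant"
    content: str


def to_alpaca_records(conversations):
    """Map each conversation to a dict with `instruction` and `output` keys."""
    records = []
    for messages in conversations:
        # The first user message is the prompt; the last assistant message is the reply.
        user = next(m for m in messages if m.role == "user")
        assistant = next(m for m in reversed(messages) if m.role == "assistant")
        records.append({"instruction": user.content, "output": assistant.content})
    return records


example = [[Message("user", "Name a prime number."), Message("assistant", "7")]]
print(to_alpaca_records(example))
```

Each resulting dict becomes one row of the DataFrame, with `instruction` and `output` as columns.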

Your DataFrame should look as follows.

In [None]:
responses_df

## Step 3: Run Alpaca Eval 2.0

You can kick off evaluations as shown below. 

The default annotator for Alpaca Eval 2.0 is <b>GPT-4 Turbo</b>. While Alpaca Eval 1.0 used a binary preference, Alpaca Eval 2.0 uses the logprobs to compute a continuous preference, resulting in a <b>weighted</b> win rate. The default annotator config of Alpaca Eval 2.0 is thus `weighted_alpaca_eval_gpt4_turbo`. You can use other annotators (judges) as well; see the [Annotators configs](https://github.com/tatsu-lab/alpaca_eval/blob/main/src/alpaca_eval/evaluators_configs/README.md) page for details and relevant costs. However, the Alpaca Eval 2.0 leaderboard is established with GPT-4 Turbo as the reference annotator, so results obtained with other annotators are less informative if you want to compare against it.

In [None]:
from alpaca_eval import evaluate

ANNOTATORS_CONFIG = "weighted_alpaca_eval_gpt4_turbo"

df_leaderboard, annotations = evaluate(
    model_outputs=responses_df,
    annotators_config=ANNOTATORS_CONFIG,
    is_return_instead_of_print=True,
)
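
The difference between a binary and a weighted win rate can be illustrated with a small sketch. It assumes the judge exposes one log-probability for "the model's answer is better" and one for "the reference answer is better". The function names are hypothetical, and the code illustrates the idea rather than reproducing Alpaca Eval's implementation.

```python
import math


def preference_from_logprobs(logprob_model, logprob_reference):
    """Probability that the judge prefers the model, given the two logprobs."""
    # Softmax over the two options; subtracting the max keeps exp() stable.
    shift = max(logprob_model, logprob_reference)
    p_model = math.exp(logprob_model - shift)
    p_reference = math.exp(logprob_reference - shift)
    return p_model / (p_model + p_reference)


def binary_win_rate(preferences):
    """Percentage of examples where the model is strictly preferred."""
    return 100.0 * sum(p > 0.5 for p in preferences) / len(preferences)


def weighted_win_rate(preferences):
    """Mean preference as a percentage, so confident wins count for more."""
    return 100.0 * sum(preferences) / len(preferences)


# Illustrative logprob pairs (model, reference) for three examples.
logprob_pairs = [
    (math.log(0.9), math.log(0.1)),
    (math.log(0.6), math.log(0.4)),
    (math.log(0.2), math.log(0.8)),
]
preferences = [preference_from_logprobs(m, r) for m, r in logprob_pairs]
print(f"binary:   {binary_win_rate(preferences):.1f}%")
print(f"weighted: {weighted_win_rate(preferences):.1f}%")
```

In this toy example the model wins two of three comparisons, but one of those wins is a narrow 0.6 preference, so the weighted figure comes out lower than the binary one.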

## Step 4: Inspect the metrics

Once the evaluation process completes, you can inspect the metrics produced, as shown below.

In [None]:
metrics = df_leaderboard.loc[MODEL_DISPLAY_NAME]

print(f"Metrics for `{MODEL_DISPLAY_NAME}`")
for metric, value in metrics.items():
    print(f" - {metric}={value}")

## [Optional] Retain your configuration for reproducibility

To be able to reproduce your evaluation run in the future, save the configuration of your evaluation together with your evaluation metrics.

In [15]:
import json
from importlib.metadata import version

evaluation_config_dict = {
    "packages": {
        "alpaca_eval": version("alpaca_eval"),
        "oumi": version("oumi"),
    },
    "configs": {
        "inference_config": str(inference_config),
        "annotators_config": ANNOTATORS_CONFIG,
    },
    "eval_metrics": metrics.to_dict(),
}

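evaluation_config_json = json.dumps(evaluation_config_dict, indent=2)
os.makedirs("./output", exist_ok=True)  # Create the output directory if missing (`os` was imported above).
with open("./output/evaluation_config.json", "w") as output_file:
    output_file.write(evaluation_config_json)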
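
When you later revisit a saved run, a quick first check is whether the installed package versions still match the ones that were recorded. Here is a minimal sketch, assuming the `packages` layout written above. `find_version_mismatches` is a hypothetical helper, and the version strings are made up for illustration.

```python
import json


def find_version_mismatches(saved_packages, current_packages):
    """Return {package: (saved, current)} for every package whose version differs."""
    return {
        name: (saved, current_packages.get(name))
        for name, saved in saved_packages.items()
        if current_packages.get(name) != saved
    }


# Illustrative data only; in practice, load the saved JSON file and query the
# installed versions with `importlib.metadata.version`.
saved_config = json.loads('{"packages": {"alpaca_eval": "0.6.2", "oumi": "0.1.0"}}')
current_versions = {"alpaca_eval": "0.6.2", "oumi": "0.2.0"}
print(find_version_mismatches(saved_config["packages"], current_versions))
```

An empty result means the recorded package versions match the current environment.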