<div class="align-center">
<a href="https://oumi.ai/"><img src="https://oumi.ai/docs/en/latest/_static/logo/header_logo.png" height="200"></a>

[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://oumi.ai/docs/en/latest/index.html)
[![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi)
[![GitHub Repo stars](https://img.shields.io/github/stars/oumi-ai/oumi)](https://github.com/oumi-ai/oumi)
<a target="_blank" href="https://colab.research.google.com/github/oumi-ai/oumi/blob/main/notebooks/Oumi - Evaluation with AlpacaEval 2.0.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</div>

👋 Welcome to Open Universal Machine Intelligence (Oumi)!

🚀 Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models - from [data preparation](https://oumi.ai/docs/en/latest/resources/datasets/datasets.html) and [training](https://oumi.ai/docs/en/latest/user_guides/train/train.html) to [evaluation](https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html) and [deployment](https://oumi.ai/docs/en/latest/user_guides/launch/launch.html). Whether you're developing on a laptop, launching large-scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.

🤝 Make sure to join our [Discord community](https://discord.gg/oumi) to get help, share your experiences, and contribute to the project! If you are interested in joining one of the community's open-science efforts, check out our [open collaboration](https://oumi.ai/community) page.

⭐ If you like Oumi and you would like to support it, please give it a star on [GitHub](https://github.com/oumi-ai/oumi).

# Evaluation with AlpacaEval 2.0

This notebook demonstrates how to run end-to-end evaluations for your trained model with [AlpacaEval 2.0](https://github.com/tatsu-lab/alpaca_eval). AlpacaEval is an LLM-based automatic evaluation suite that is fast, cheap, replicable, and validated against 20K human annotations.

Evaluating with AlpacaEval is a 2-step process:
1. **Inference**: Generate model responses for 805 AlpacaEval prompts using Oumi's inference engine
2. **Judgment**: Use GPT-4 Turbo as a judge to compare your model's responses against reference responses and calculate win rates
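
To make the idea of a win rate concrete, the sketch below computes a plain, unweighted win rate from a list of pairwise judge decisions. It is an illustration only: the function name `simple_win_rate` and the label convention (1 = reference preferred, 2 = model preferred) are assumptions made for this example, and the length-controlled win rate reported by AlpacaEval 2.0 involves more than this simple average.

```python
def simple_win_rate(preferences: list[int]) -> float:
    """Return the percentage of comparisons won by the model.

    Assumed label convention (for this sketch only):
    1 = judge preferred the reference, 2 = judge preferred the model.
    """
    if not preferences:
        raise ValueError("No preferences to aggregate.")
    wins = sum(1 for p in preferences if p == 2)
    return 100.0 * wins / len(preferences)


# Hypothetical judge decisions for 5 prompts (not real evaluation data)
example_preferences = [2, 1, 1, 2, 1]
print(f"Win rate: {simple_win_rate(example_preferences):.1f}%")
```

Here the model wins 2 of 5 comparisons, so the sketch reports a 40.0% win rate.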

**Resources:**
- [AlpacaEval V2.0 Paper](https://arxiv.org/abs/2404.04475)
- [AlpacaEval Dataset](https://huggingface.co/datasets/tatsu-lab/alpaca_eval)
- [Leaderboard](https://tatsu-lab.github.io/alpaca_eval/)
- [Official Repository](https://github.com/tatsu-lab/alpaca_eval)

## Prerequisites and Configuration

First, install the required packages. The `alpaca_eval` package requires Python >= 3.10.

In [None]:
! pip install -q oumi alpaca_eval pandas

AlpacaEval uses GPT-4 Turbo as the default judge. To access GPT-4 models, an OpenAI API key is required. Details on creating an OpenAI account and generating a key can be found at [OpenAI's quickstart webpage](https://platform.openai.com/docs/quickstart).

In [None]:
import os

os.environ["OPENAI_API_KEY"] = ""  # NOTE: Set your OpenAI API key here
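
Because the judgment step calls the OpenAI API, it can help to confirm that the key is set before spending time on inference. The check below is a minimal sketch; `has_openai_api_key` is a hypothetical helper written for this notebook, not part of Oumi or AlpacaEval.

```python
import os


def has_openai_api_key() -> bool:
    """Return True if OPENAI_API_KEY is set to a non-blank value."""
    return bool(os.environ.get("OPENAI_API_KEY", "").strip())


if not has_openai_api_key():
    print("OPENAI_API_KEY is not set; the judgment step is likely to fail without it.")
```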

**⚠️ Cost considerations**: AlpacaEval 2.0 uses GPT-4 Turbo to judge 805 examples. Please visit [OpenAI's pricing](https://openai.com/api/pricing/) page for current costs. Since this notebook is sample code, we will only evaluate a small subset of examples to reduce costs.

In [None]:
NUM_EXAMPLES = 5  # Set to None to evaluate all 805 examples

Configure your model. You can use a HuggingFace model ID or a path to a local model.

In [None]:
MODEL_PATH = "HuggingFaceTB/SmolLM2-135M-Instruct"
MODEL_DISPLAY_NAME = "my_model"

## Step 1: Load the AlpacaEval Dataset

Load the AlpacaEval dataset using Oumi's `AlpacaEvalDataset` class. This dataset contains 805 open-ended prompts for evaluating instruction-following capabilities.

In [None]:
from oumi.datasets import AlpacaEvalDataset

# Load the dataset
dataset = AlpacaEvalDataset()

print(f"Dataset size: {len(dataset)} examples")

# Preview a sample prompt using the conversation() method
sample_conv = dataset.conversation(0)
print("\nSample prompt:")
print(sample_conv.messages[0].content)

## Step 2: Extract Prompts for Inference

Extract the prompts from the dataset conversations to prepare for inference.

In [None]:
# Determine number of examples to evaluate (None means the full dataset)
if NUM_EXAMPLES is None:
    num_to_evaluate = len(dataset)
else:
    num_to_evaluate = min(NUM_EXAMPLES, len(dataset))

# Extract prompts from conversations
prompts = [dataset.conversation(i).messages[0].content for i in range(num_to_evaluate)]

print(f"Extracted {len(prompts)} prompts for evaluation")
print("\nFirst few prompts:")
for i, prompt in enumerate(prompts[:3]):
    print(f"  {i + 1}. {prompt[:80]}...")
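
The subset-size rule used in this step can be isolated into a small function, which makes the edge cases easier to see. This is a sketch under stated assumptions: `resolve_num_examples` is a hypothetical helper, and clamping out-of-range values is a design choice made here rather than something Oumi or AlpacaEval prescribes.

```python
from typing import Optional


def resolve_num_examples(requested: Optional[int], dataset_size: int) -> int:
    """Return how many examples to evaluate.

    None means "use the full dataset"; any other value is clamped
    to the range [0, dataset_size].
    """
    if requested is None:
        return dataset_size
    return max(0, min(requested, dataset_size))


print(resolve_num_examples(5, 805))     # small subset
print(resolve_num_examples(None, 805))  # full dataset
print(resolve_num_examples(1000, 805))  # request larger than the dataset
```

Comparing against `None` explicitly, rather than relying on truthiness, keeps a request of `0` from being silently treated as "evaluate everything".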

## Step 3: Run Inference with Oumi

Generate model responses for the AlpacaEval prompts using Oumi's inference capabilities.

In [None]:
from oumi import infer
from oumi.core.configs import (
    GenerationParams,
    InferenceConfig,
    ModelParams,
)

# Configure inference
config = InferenceConfig(
    model=ModelParams(
        model_name=MODEL_PATH,
        trust_remote_code=True,
    ),
    generation=GenerationParams(
        max_new_tokens=2048,
        temperature=0.7,
        top_p=0.9,
    ),
)

print(f"Running inference on {len(prompts)} prompts...")
print(f"Model: {MODEL_PATH}")

In [None]:
# Run inference - this returns Conversation objects with both prompt and response
responses = infer(config, prompts)

print(f"\nGenerated {len(responses)} responses")

In [None]:
# Inspect a sample response
print("Sample response:")
print(f"Prompt: {responses[0].messages[0].content[:200]}...")
print(f"\nResponse: {responses[0].messages[-1].content[:500]}...")

## Step 4: Format Responses for AlpacaEval

Convert the Oumi conversation format to the AlpacaEval format required by the evaluation framework.

In [None]:
import pandas as pd

from oumi.datasets.evaluation.utils import conversations_to_alpaca_format

# Convert to AlpacaEval format
alpaca_format_responses = conversations_to_alpaca_format(responses)

# Add generator name to each response (required by AlpacaEval)
for response in alpaca_format_responses:
    response["generator"] = MODEL_DISPLAY_NAME

# Convert to DataFrame for AlpacaEval
responses_df = pd.DataFrame(alpaca_format_responses)

print(f"Formatted {len(responses_df)} responses for AlpacaEval")
print(f"\nColumns: {list(responses_df.columns)}")
responses_df.head()
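
For reference, the sketch below shows the record shape this step is aiming for, using plain strings instead of Oumi `Conversation` objects. The helper `to_alpaca_record` is hypothetical and is not part of Oumi or AlpacaEval; the `instruction`, `output`, and `generator` keys are assumed here to be the fields AlpacaEval reads from model outputs.

```python
def to_alpaca_record(prompt: str, response: str, generator: str) -> dict[str, str]:
    """Build one AlpacaEval-style record from a prompt/response pair."""
    return {
        "instruction": prompt,
        "output": response,
        "generator": generator,
    }


# Hypothetical prompt and response, for illustration only
record = to_alpaca_record(
    prompt="What is the capital of France?",
    response="The capital of France is Paris.",
    generator="my_model",
)
print(record)
```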

## Step 5: Run AlpacaEval Judgment

Use the AlpacaEval framework to judge your model's responses against reference responses. The judge (GPT-4 Turbo by default) compares each model response with the corresponding reference response, and AlpacaEval aggregates these comparisons into win rates.

In [None]:
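# Set AlpacaEval 2.0 configuration before importing alpaca_eval, in case the
# package reads this variable at import time
os.environ["IS_ALPACA_EVAL_2"] = "True"

import alpaca_eval  # noqa: E402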

# Run evaluation
# Note: This will make API calls to the judge model (GPT-4 Turbo)
print("Running AlpacaEval judgment...")
print("This may take a few minutes depending on the number of examples.\n")

result = alpaca_eval.evaluate(
    model_outputs=responses_df,
    annotators_config="weighted_alpaca_eval_gpt4_turbo",
    is_return_instead_of_print=True,
    max_instances=num_to_evaluate,
)

print("\nEvaluation complete!")

## Step 6: View Results

Examine the evaluation results including win rates and other metrics.

In [None]:
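# Display results
# With is_return_instead_of_print=True, evaluate() may return a tuple whose
# first element is the leaderboard; handle both a tuple and a single object.
if result is not None:
    leaderboard = result[0] if isinstance(result, tuple) else result
    print("=" * 50)
    print("AlpacaEval 2.0 Results")
    print("=" * 50)
    print(leaderboard)
else:
    print("Results were printed above.")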

## [Optional] Save Results for Reproducibility

Save the configuration and results for future reference.

In [None]:
import datetime
import json

# Save configuration and results
evaluation_config_dict = {
    "model": {
        "model_path": MODEL_PATH,
        "model_display_name": MODEL_DISPLAY_NAME,
    },
    "alpaca_eval": {
        "version": "2.0",
        "annotator": "weighted_alpaca_eval_gpt4_turbo",
        "num_examples": num_to_evaluate,
    },
    "timestamp": str(datetime.datetime.now()),
}

# Save to file
output_path = "alpaca_eval_config.json"
with open(output_path, "w") as f:
    json.dump(evaluation_config_dict, f, indent=2)

print(f"Configuration saved to {output_path}")

In [None]:
# Save model responses
responses_path = f"{MODEL_DISPLAY_NAME}_alpaca_eval_responses.json"
responses_df.to_json(responses_path, orient="records", indent=2)
print(f"Responses saved to {responses_path}")
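
When results are saved for reproducibility, it is worth checking that the file can be read back with the expected fields. The sketch below does this with the standard library only; `load_and_check_responses` is a hypothetical helper, and the demo writes a throwaway file with one made-up record rather than touching the notebook's real outputs.

```python
import json
import os
import tempfile


def load_and_check_responses(path: str, required_keys: set[str]) -> list[dict]:
    """Load a JSON list of records and verify each has the required keys."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    for i, rec in enumerate(records):
        missing = required_keys - set(rec)
        if missing:
            raise ValueError(f"Record {i} is missing keys: {sorted(missing)}")
    return records


# Demo with a throwaway file containing one hypothetical record
demo_records = [{"instruction": "Say hi.", "output": "Hi!", "generator": "my_model"}]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_responses.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump(demo_records, f)
    loaded = load_and_check_responses(
        demo_path, {"instruction", "output", "generator"}
    )

print(f"Loaded {len(loaded)} valid record(s)")
```

To check the file written above, the same function could be called with `responses_path` in place of `demo_path`.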

# 🧭 What's Next?

Congrats on finishing this notebook! Feel free to check out our other [notebooks](https://github.com/oumi-ai/oumi/tree/main/notebooks) in the [Oumi GitHub](https://github.com/oumi-ai/oumi), and give us a star! You can also join the Oumi community over on [Discord](https://discord.gg/oumi).

📰 Want to keep up with news from Oumi? Subscribe to our [Substack](https://blog.oumi.ai/) and [YouTube](https://www.youtube.com/@Oumi_AI)!

⚡ Interested in building custom AI in hours, not months? Apply to get [early access](https://oumi-ai.typeform.com/early-access) to the Oumi Platform, or [chat with us](https://calendly.com/d/ctcx-nps-47m/chat-with-us-get-early-access-to-the-oumi-platform) to learn more!