<div class="align-center">
<a href="https://oumi.ai/"><img src="https://oumi.ai/docs/en/latest/_static/logo/header_logo.png" height="200"></a>

[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://oumi.ai/docs/en/latest/index.html)
[![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi)
[![GitHub Repo stars](https://img.shields.io/github/stars/oumi-ai/oumi)](https://github.com/oumi-ai/oumi)
<a target="_blank" href="https://colab.research.google.com/github/oumi-ai/oumi/blob/main/configs/projects/halloumi/halloumi_eval_notebook.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</div>

👋 Welcome to Open Universal Machine Intelligence (Oumi)!

🚀 Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models, from [data preparation](https://oumi.ai/docs/en/latest/resources/datasets/datasets.html) and [training](https://oumi.ai/docs/en/latest/user_guides/train/train.html) to [evaluation](https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html) and [deployment](https://oumi.ai/docs/en/latest/user_guides/launch/launch.html). Whether you're developing on a laptop, launching large-scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.

🤝 Make sure to join our [Discord community](https://discord.gg/oumi) to get help, share your experiences, and contribute to the project! If you are interested in joining one of the community's open-science efforts, check out our [open collaboration](https://oumi.ai/community) page.

⭐ If you like Oumi and you would like to support it, please give it a star on [GitHub](https://github.com/oumi-ai/oumi).

# Evaluating LLMs as Hallucination Classifiers

This notebook demonstrates how we evaluated numerous Large Language Models (LLMs) as hallucination classifiers. 
To see our full evaluation results and learn more about our [groundedness benchmark](https://huggingface.co/datasets/oumi-ai/oumi-groundedness-benchmark), 
please see our [technical overview](https://oumi.ai/blog/posts/introducing-halloumi)!

## Prerequisites and Environment

### Install Oumi

First, let's install Oumi. You can find more detailed instructions about Oumi installation [here](https://oumi.ai/docs/en/latest/get_started/installation.html).


In [None]:
%pip install oumi

### API Keys

To make remote API calls to OpenAI, Gemini, and Anthropic, you will need the following keys. You only need to set the keys for the providers whose models you plan to evaluate.

In [None]:
import os

os.environ["OPENAI_API_KEY"] = ""  # Set your OpenAI API key here.
os.environ["GEMINI_API_KEY"] = ""  # Set your Gemini API key here.
os.environ["ANTHROPIC_API_KEY"] = ""  # Set your Anthropic API key here.

# Set your GCP project ID and region to query Llama 3.1 405B on Vertex AI.
REGION = ""  # Set your GCP region here.
PROJECT_ID = ""  # Set your GCP project id here.
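
Because the keys above default to empty strings, a missing key would otherwise surface only once a remote call fails. The following is a minimal sketch of an up-front check; `missing_env_vars` is a hypothetical helper written for this illustration and is not part of Oumi.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names in `names` that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]


# Hypothetical usage: list only the keys for the providers you plan to query.
required = ["OPENAI_API_KEY", "GEMINI_API_KEY", "ANTHROPIC_API_KEY"]
missing = missing_env_vars(required)
if missing:
    print(f"Unset API keys (fine if you skip those providers): {missing}")
```

Passing a plain `dict` as `env` makes the helper easy to try out without touching the real environment.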

### Limit the Number of Examples

This notebook limits the total number of examples to 10, for testing purposes. To use the full dataset, set the number of examples to `None`.

In [None]:
NUM_EXAMPLES = 10  # Replace with `None` for full dataset evaluation.
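
As a rough sketch of how such a limit behaves, assuming it is applied by slicing a list of prompts: a falsy value such as `None` leaves the list untouched, while an integer keeps only the first entries. `limit_examples` and the toy data are hypothetical and serve only as illustration.

```python
def limit_examples(items, num_examples):
    """Return the first `num_examples` items, or all items if it is None or 0."""
    return items[:num_examples] if num_examples else items


# Toy data, for illustration only.
prompts = ["p0", "p1", "p2", "p3"]
labels = [0, 1, 1, 0]

subset = limit_examples(prompts, 2)
# zip() stops at the shorter sequence, so the first labels stay paired
# with the truncated prompts.
pairs = list(zip(subset, labels))
print(pairs)
```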

## Evaluation Dataset

You can load the test split of [Oumi's Groundedness Benchmark](https://huggingface.co/datasets/oumi-ai/oumi-groundedness-benchmark) as follows. 

In [None]:
from oumi.datasets.sft.prompt_response import PromptResponseDataset

groundedness_dataset = PromptResponseDataset(
    hf_dataset_path="oumi-ai/oumi-groundedness-benchmark",
    response_column="",
    split="test",
)

The following code snippet extracts the labels of the dataset and converts them from strings (`SUPPORTED` or `UNSUPPORTED`) to integers (`0` for supported, `1` for unsupported).

In [None]:
groundedness_labels_str = groundedness_dataset.data.label.tolist()
groundedness_labels = [
    0 if label == "SUPPORTED" else 1 for label in groundedness_labels_str
]
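
Note that the list comprehension above maps every value other than `SUPPORTED` to `1`, so a misspelled or missing label would silently count as unsupported. A stricter variant might look like the sketch below; `LABEL_TO_INT` and `encode_labels` are hypothetical names, and the sample labels are made up.

```python
from collections import Counter

LABEL_TO_INT = {"SUPPORTED": 0, "UNSUPPORTED": 1}


def encode_labels(labels_str):
    """Map label strings to ints, raising on anything unexpected."""
    unknown = sorted(set(labels_str) - set(LABEL_TO_INT))
    if unknown:
        raise ValueError(f"Unexpected labels: {unknown}")
    return [LABEL_TO_INT[label] for label in labels_str]


# Made-up labels, purely for illustration.
example = ["SUPPORTED", "UNSUPPORTED", "SUPPORTED"]
encoded = encode_labels(example)
# Counting the encoded labels gives a quick view of the class balance.
print(encoded, Counter(encoded))
```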

## Evaluation

### Prediction Extraction

The following function extracts the prediction from a free-text response generated by an LLM.

In [None]:
def _extract_prediction(response):
    """Returns: 0 if supported, 1 if unsupported."""
    is_unsupported = "<|unsupported|>" in response or "<unsupported>" in response
    return 1 if is_unsupported else 0
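
To make the rule concrete, the sketch below restates it as a standalone function (`extract_prediction`, without the leading underscore, so the block runs on its own) and applies it to a few made-up responses. One consequence worth noting: a response that contains neither tag is counted as supported.

```python
def extract_prediction(response: str) -> int:
    """Same rule as above: 1 if an 'unsupported' tag appears, else 0."""
    is_unsupported = "<|unsupported|>" in response or "<unsupported>" in response
    return 1 if is_unsupported else 0


# Made-up responses, for illustration only.
samples = [
    "The claim matches the context. <|supported|>",
    "The context never mentions this. <|unsupported|>",
    "<unsupported> The date is contradicted.",
    "I am not sure.",  # No tag at all: falls through to 0 (supported).
]
print([extract_prediction(sample) for sample in samples])
```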

### Custom Evaluation Function

The following code snippet demonstrates how we register a custom evaluation function that produces the F1 score and the Balanced Accuracy score, given a set of prompts and their corresponding labels. The prompts have been provided in Oumi format (a list of `Conversation`s), while the labels are a list of `int`s (`0` or `1`). An inference engine is also instantiated by Oumi, based on the provided configuration, discussed below.

In [None]:
from oumi.core.evaluation.metrics import bacc_score, f1_score
from oumi.core.registry import register_evaluation_function


@register_evaluation_function("hallucination_classification")
def hallucination_classification(inference_engine, conversations, labels):
    """Hallucination classification evaluation function."""
    # Run inference to append the model response into each conversation.
    conversations = inference_engine.infer(conversations)

    y_true, y_pred = [], []
    for conversation, label in zip(conversations, labels):
        # Extract the model response from the conversation.
        response = conversation.last_message()

        # Extract the prediction from the response.
        prediction = _extract_prediction(response.content)

        # Record the predictions together with their labels.
        y_pred.append(prediction)
        y_true.append(label)

    # Compute relevant metrics and return them.
    return {
        "f1": f1_score(y_true, y_pred, average="macro"),
        "bacc": bacc_score(y_true, y_pred),
    }
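
For readers unfamiliar with the two metrics, here is a plain-Python sketch of what macro-averaged F1 and balanced accuracy compute for binary labels. It is an illustration of the standard definitions, not Oumi's implementation; edge-case handling (for example, a class that never appears) may differ from `f1_score` and `bacc_score`, and the function names are hypothetical.

```python
def _per_class_counts(y_true, y_pred, cls):
    """Count true positives, false positives, and false negatives for `cls`."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    return tp, fp, fn


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in classes:
        tp, fp, fn = _per_class_counts(y_true, y_pred, cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def balanced_accuracy(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class recalls."""
    recalls = []
    for cls in classes:
        tp, _, fn = _per_class_counts(y_true, y_pred, cls)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return sum(recalls) / len(recalls)


# Toy labels, for illustration only.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
print(f"balanced accuracy = {balanced_accuracy(y_true, y_pred):.3f}")
```

Both metrics weight the two classes equally, which matters when supported and unsupported examples are not evenly represented.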

### Evaluation Configurations

The following configurations represent the models we have evaluated.

In [None]:
model_names = [
    "halloumi 8b",
    # Uncomment any models you wish to evaluate - you can evaluate multiple at once.
    # "gpt_4o",
    # "o1_preview",
    # "gemini_pro",
    # "llama_405b",
    # "sonnet",
]

configs = {
    "gpt_4o": """
      model:
        model_name: "gpt-4o"

      inference_engine: OPENAI

      inference_remote_params:
        api_key_env_varname: "OPENAI_API_KEY"
        max_retries: 3
        num_workers: 100
        politeness_policy: 60
        connection_timeout: 300

      generation:
        max_new_tokens: 8192
        temperature: 0.0

      tasks:
        - evaluation_backend: custom
          task_name: hallucination_classification
      """,
    "o1_preview": """
      model:
        model_name: "o1-preview"

      inference_engine: OPENAI

      inference_remote_params:
        api_key_env_varname: "OPENAI_API_KEY"
        max_retries: 3
        num_workers: 100
        politeness_policy: 60
        connection_timeout: 300

      generation:
        max_new_tokens: 8192
        temperature: 1.0

      tasks:
        - evaluation_backend: custom
          task_name: hallucination_classification
      """,
    "gemini_pro": """
      model:
        model_name: "gemini-1.5-pro"

      inference_engine: GOOGLE_GEMINI

      inference_remote_params:
        api_key_env_varname: "GEMINI_API_KEY"
        max_retries: 3
        num_workers: 2
        politeness_policy: 60
        connection_timeout: 300

      generation:
        max_new_tokens: 8192
        temperature: 0.0

      tasks:
        - evaluation_backend: custom
          task_name: hallucination_classification
      """,
    "llama_405b": f"""
      model:
        model_name: "meta/llama-3.1-405b-instruct-maas"

      inference_engine: GOOGLE_VERTEX

      inference_remote_params:
        api_url: "https://{REGION}-aiplatform.googleapis.com/v1beta1/projects/{PROJECT_ID}/locations/{REGION}/endpoints/openapi/chat/completions"
        max_retries: 3
        num_workers: 10
        politeness_policy: 60
        connection_timeout: 300

      generation:
        max_new_tokens: 8192
        temperature: 0.0

      tasks:
        - evaluation_backend: custom
          task_name: hallucination_classification
      """,
    "sonnet": """
      model:
        model_name: "claude-3-5-sonnet-latest"

      inference_engine: ANTHROPIC

      inference_remote_params:
        api_key_env_varname: "ANTHROPIC_API_KEY"
        max_retries: 3
        num_workers: 5
        politeness_policy: 65
        connection_timeout: 300

      generation:
        max_new_tokens: 8192
        temperature: 0.0

      tasks:
        - evaluation_backend: custom
          task_name: hallucination_classification
      """,
    "halloumi 8b": """
      model:
        model_name: "halloumi"

      inference_engine: REMOTE

      inference_remote_params:
        api_url: "https://api.oumi.ai/chat/completions"
        max_retries: 3
        connection_timeout: 300

      generation:
        max_new_tokens: 8192
        temperature: 0.0

      tasks:
        - evaluation_backend: custom
          task_name: hallucination_classification
      """,
}
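
The configurations above differ only in the model name, the inference engine, the remote parameters, and (for `o1_preview`) the temperature. As a sketch, a small helper could generate them from those fields. `make_config` is hypothetical, and whether `EvaluationConfig.from_str` accepts its output exactly as it accepts the hand-written strings is an assumption you should verify before relying on it.

```python
import json


def make_config(model_name, engine, remote_params, temperature=0.0):
    """Build a YAML string with the same layout as the entries above."""
    # json.dumps quotes strings and leaves numbers bare, which is valid YAML
    # for simple scalar values like these.
    params = "\n".join(
        f"  {key}: {json.dumps(value)}" for key, value in remote_params.items()
    )
    return (
        "model:\n"
        f"  model_name: {json.dumps(model_name)}\n"
        "\n"
        f"inference_engine: {engine}\n"
        "\n"
        "inference_remote_params:\n"
        f"{params}\n"
        "\n"
        "generation:\n"
        "  max_new_tokens: 8192\n"
        f"  temperature: {temperature}\n"
        "\n"
        "tasks:\n"
        "  - evaluation_backend: custom\n"
        "    task_name: hallucination_classification\n"
    )


# Reproduces the fields of the "gpt_4o" entry.
gpt_4o_yaml = make_config(
    "gpt-4o",
    "OPENAI",
    {
        "api_key_env_varname": "OPENAI_API_KEY",
        "max_retries": 3,
        "num_workers": 100,
        "politeness_policy": 60,
        "connection_timeout": 300,
    },
)
print(gpt_4o_yaml)
```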

### Running the Evaluation

Putting everything together, we now run the evaluation for each model sequentially. For each model, we first create the evaluation configuration, then convert the dataset prompts into Oumi format (i.e., a list of `Conversation`s), and finally start the evaluation.

In [None]:
from oumi.core.configs import EvaluationConfig
from oumi.core.evaluation import Evaluator

results = {model_name: {} for model_name in model_names}

for model_name in model_names:
    # Create the evaluation config from the YAML string.
    config_yaml: str = configs[model_name]
    config = EvaluationConfig.from_str(config_yaml)

    # Each model has a different (but very similar) prompt, so we set the prompt
    # column to the one for the relevant model before running the evaluation.
    groundedness_dataset.prompt_column = f"{model_name} prompt"

    # Convert the prompts into Oumi conversations.
    conversations = groundedness_dataset.conversations()
    labels = groundedness_labels
    if NUM_EXAMPLES:
        # Truncate prompts and labels together so they stay aligned.
        conversations = conversations[:NUM_EXAMPLES]
        labels = labels[:NUM_EXAMPLES]

    # Run the evaluation.
    evaluator = Evaluator()
    evaluator_out = evaluator.evaluate(
        config=config,
        conversations=conversations,
        labels=labels,
    )

    # Record the results.
    results[model_name] = evaluator_out[0].get_results()

### Inspect the Results

Next, we inspect the performance metrics for each evaluated model. 

In [None]:
for model_name in model_names:
    print(
        f"{model_name}: BACC = {results[model_name]['bacc'].value:0.1%}"
        f" ± {results[model_name]['bacc'].ci:0.1%}"
    )

for model_name in model_names:
    print(
        f"{model_name}: F1 = {results[model_name]['f1'].value:0.1%}"
        f" ± {results[model_name]['f1'].ci:0.1%}"
    )
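
When several models are evaluated at once, it can help to print both metrics on one line and rank the models. The sketch below assumes only what the cell above uses, namely that each result entry exposes `.value` and `.ci`. `Metric` is a hypothetical stand-in for that entry type, and the numbers are placeholders, not real evaluation results.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    """Hypothetical stand-in for a result entry exposing `.value` and `.ci`."""

    value: float
    ci: float


def format_row(model_name, metrics):
    """Format one model's BACC and F1 on a single line."""
    bacc, f1 = metrics["bacc"], metrics["f1"]
    return (
        f"{model_name:<12} BACC = {bacc.value:0.1%} ± {bacc.ci:0.1%}"
        f" | F1 = {f1.value:0.1%} ± {f1.ci:0.1%}"
    )


# Placeholder numbers, not real evaluation results.
demo_results = {
    "model_a": {"bacc": Metric(0.5, 0.1), "f1": Metric(0.4, 0.2)},
    "model_b": {"bacc": Metric(0.75, 0.05), "f1": Metric(0.7, 0.1)},
}

# Rank models by balanced accuracy, best first.
ranked = sorted(
    demo_results.items(), key=lambda item: item[1]["bacc"].value, reverse=True
)
for name, metrics in ranked:
    print(format_row(name, metrics))
```

The `0.1%` format specifier multiplies the value by 100 and keeps one decimal place, so `0.75` is printed as `75.0%`.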

To wrap things up, we discussed how to use Oumi's custom evaluation framework to evaluate multiple open- and closed-source LLMs as hallucination classifiers, using the [newly introduced groundedness benchmark](https://huggingface.co/datasets/oumi-ai/oumi-groundedness-benchmark).

If you'd like to try inference on HallOumi, please also check out our [inference notebook](https://github.com/oumi-ai/oumi/blob/main/configs/projects/halloumi/halloumi_inference_notebook.ipynb).