<div class="align-center">
<a href="https://oumi.ai/"><img src="https://oumi.ai/docs/en/latest/_static/logo/header_logo.png" height="200"></a>

[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://oumi.ai/docs/en/latest/index.html)
[![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi)
[![GitHub Repo stars](https://img.shields.io/github/stars/oumi-ai/oumi)](https://github.com/oumi-ai/oumi)
<a target="_blank" href="https://colab.research.google.com/github/oumi-ai/oumi/blob/main/notebooks/Oumi - Evaluation with AlpacaEval 2.0.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</div>

👋 Welcome to Open Universal Machine Intelligence (Oumi)!

🚀 Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models - from [data preparation](https://oumi.ai/docs/en/latest/resources/datasets/datasets.html) and [training](https://oumi.ai/docs/en/latest/user_guides/train/train.html) to [evaluation](https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html) and [deployment](https://oumi.ai/docs/en/latest/user_guides/launch/launch.html). Whether you're developing on a laptop, launching large scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.

🤝 Make sure to join our [Discord community](https://discord.gg/oumi) to get help, share your experiences, and contribute to the project! If you are interested in joining one of the community's open-science efforts, check out our [open collaboration](https://oumi.ai/community) page.

⭐ If you like Oumi and you would like to support it, please give it a star on [GitHub](https://github.com/oumi-ai/oumi).

# Evaluation of a Hallucination Classifier 

This notebook demonstrates how to evaluate a Large Language Model (LLM) classifier using Oumi's custom evaluation. In particular, we focus on a binary hallucination classifier for open-book question-answering (Q&A). Given a context document and a hypothesis, the classifier’s task is to determine whether the hypothesis is grounded in the context (i.e., not hallucinated). The context can be thought of as a text snippet retrieved using Retrieval-Augmented Generation (RAG), while the hypothesis represents an LLM-generated response. 

The underlying model used for this assessment is a generative LLM, which produces responses in free text. As a result, the prompt must explicitly specify the desired format for the response, and post-processing logic is needed to extract the prediction from the model's free-form response.

## Prerequisites and Environment

❗**NOTICE:** We recommend running this notebook on a GPU. If running on Google Colab, you can use the free T4 GPU runtime (Colab Menu: `Runtime` -> `Change runtime type`).

### Oumi Installation

First, let's install Oumi. You can find more detailed instructions about Oumi installation [here](https://oumi.ai/docs/en/latest/get_started/installation.html).

In [1]:
%pip install oumi

### Tutorial Directory Setup

Next, we will create a directory for the tutorial, to store the evaluation configuration and the experimental results.

In [2]:
from pathlib import Path

tutorial_dir = "custom_eval_tutorial"

Path(tutorial_dir).mkdir(parents=True, exist_ok=True)

### Limit the Number of Examples

This notebook limits the number of examples to 10 so that it runs quickly for testing purposes. The ANLI split we will use (`test_r3`) contains 1,200 examples. To evaluate on the full split, set `NUM_EXAMPLES` to `None`.

In [3]:
NUM_EXAMPLES = 10  # Replace with `None` for full dataset evaluation.

## Evaluation Dataset

### Base Data: ANLI

We will use the [Adversarial Natural Language Inference (ANLI)](https://arxiv.org/abs/1910.14599) dataset by Meta for our evaluations. ANLI was collected via an iterative, adversarial human-and-model-in-the-loop procedure. Each example consists of a `context` (stored in the `premise` field), a `hypothesis`, and a `label`. The `context` is a short snippet of text that is treated as the "ground truth", serving as our reference for validating whether a hypothesis is supported. The `hypothesis` is a claim to be checked against the `context`, while the `label` indicates whether the hypothesis is an entailment (`0`), neutral (`1`), or a contradiction (`2`) with respect to the `context`. For our experimental analysis, we consider any hypothesis that is not an entailment to be hallucinated.

In [4]:
import datasets

anli_dataset = datasets.load_dataset("facebook/anli", split="test_r3")

print(anli_dataset[0])

{'uid': 'b0e63408-53af-4b46-b33d-bf5ba302949f', 'premise': "It is Sunday today, let's take a look at the most popular posts of the last couple of days. Most of the articles this week deal with the iPhone, its future version called the iPhone 8 or iPhone Edition, and new builds of iOS and macOS. There are also some posts that deal with the iPhone rival called the Galaxy S8 and some other interesting stories. The list of the most interesting articles is available below. Stay tuned for more rumors and don't forget to follow us on Twitter.", 'hypothesis': 'The day of the passage is usually when Christians praise the lord together', 'label': 0, 'reason': "Sunday is considered Lord's Day"}


### Designing the Prompt

For ANLI to be understood by an LLM, we need to convert the list of (`context`, `hypothesis`) tuples to a list of prompts, providing the model with an actionable request. When designing LLM prompts, each prompt typically consists of a static part and a dynamic part (the actual content). The static text describes the input and output formats to the model, as well as the criteria for generating the desired answer. The content is the `context` and `hypothesis` fields coming from ANLI. Below, you can see the template we used to generate the prompts for our dataset.

In [5]:
prompt_template = """You will be given a premise and a hypothesis.
Determine if the hypothesis is <|supported|> or <|unsupported|> based on the premise.

premise: {premise}

hypothesis: {hypothesis}

You are allowed to think out loud.
Ensure that your final answer is formatted as <|supported|> or <|unsupported|>.
"""

### Creating the Evaluation Dataset

The code snippet below demonstrates how we convert the ANLI dataset to an evaluation dataset (a list of Oumi `Conversation`s). Each conversation consists of a `conversation_id` (set to ANLI's `uid`), a single user `message` (added to the `messages` list) that contains the prompt, and metadata. We retain the ANLI label in each conversation's metadata after converting it to a binary label: `0` corresponds to a supported hypothesis, and `1` corresponds to an unsupported hypothesis (hallucination).

In [6]:
from oumi.datasets import TextSftJsonLinesDataset


# Convert the ANLI label to a binary label.
def _convert_label_to_binary(label):
    if label == 0:
        return 0  # supported
    elif label in [1, 2]:
        return 1  # unsupported (hallucination)
    else:
        raise ValueError(f"Invalid label: {label}")


# Convert the ANLI dataset into a list of conversations (oumi.core.types.Conversation).
evaluation_dataset = []
for anli_example in anli_dataset:
    prompt = prompt_template.format(
        premise=anli_example["premise"], hypothesis=anli_example["hypothesis"]
    )
    message = {"role": "user", "content": prompt}
    metadata = {"label": _convert_label_to_binary(anli_example["label"])}
    conversation = {
        "conversation_id": anli_example["uid"],
        "messages": [message],
        "metadata": metadata,
    }
    evaluation_dataset.append(conversation)

# Limit the number of examples, if requested.
if NUM_EXAMPLES:
    evaluation_dataset = evaluation_dataset[:NUM_EXAMPLES]

# Read the dataset into an Oumi class.
MY_DATASET = TextSftJsonLinesDataset(data=evaluation_dataset)

[2025-03-12 11:09:11,968][oumi][rank0][pid:90050][MainThread][INFO]][base_map_dataset.py:91] Creating map dataset (type: TextSftJsonLinesDataset)... dataset_name: 'custom'
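For reference, here is a minimal sketch of what a single record in `evaluation_dataset` looks like before it is handed to `TextSftJsonLinesDataset`. The record values and the shortened prompt are made up for illustration; only the dictionary keys mirror the cell above.

```python
import json

# Hypothetical ANLI-style record (values are made up for illustration).
toy_anli_example = {
    "uid": "example-0001",
    "premise": "The store opens at 9am on weekdays.",
    "hypothesis": "The store is open at 10am on Saturdays.",
    "label": 1,  # ANLI: 0 = entailment, 1 = neutral, 2 = contradiction
}

# Collapse the three ANLI labels into two: 0 = supported, 1 = unsupported.
binary_label = 0 if toy_anli_example["label"] == 0 else 1

# A shortened stand-in for the real prompt, to keep the output readable.
toy_prompt = (
    f"premise: {toy_anli_example['premise']}\n"
    f"hypothesis: {toy_anli_example['hypothesis']}"
)

# Same keys as the conversations built in the cell above.
toy_conversation = {
    "conversation_id": toy_anli_example["uid"],
    "messages": [{"role": "user", "content": toy_prompt}],
    "metadata": {"label": binary_label},
}

print(json.dumps(toy_conversation, indent=2))
```

Keeping the label in `metadata` rather than in a message means the model never sees the answer, while the evaluation function can still look it up after inference.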


## Evaluation

This section discusses how we can leverage Oumi's custom evaluation to run inference on the prompts, extract the predictions from the model responses, and finally calculate metrics to assess our classification quality.

### Step 1: Specify How to Extract the Prediction

Given our prompt design (discussed earlier in this notebook), the response is expected to contain either the `<|supported|>` or the `<|unsupported|>` tag. The simplest extraction we can perform is to look for either of these tags in the response. Note that if the response contains both tags or neither of them, the model's assessment is ambiguous.

The function below returns `0` if the hypothesis is supported, `1` if the hypothesis is NOT supported, and `-1` when the LLM's response is ambiguous (i.e., no assessment is available).

In [7]:
def _extract_prediction(response):
    is_unsupported = "<|unsupported|>" in response
    is_supported = "<|supported|>" in response
    if is_unsupported == is_supported:
        return -1
    return 0 if is_supported else 1

### Step 2: Define the Custom Evaluation Function

Our evaluation framework supports custom evaluation functions, which can be registered with the decorator `@register_evaluation_function`, and retrieved in our evaluation config by directly referencing the registered function name (`my_custom_evaluation`) in the `task_name` (see `YAML` code in Step 3).

Below, we define a custom evaluation function and register it as `my_custom_evaluation`. The evaluation function first runs inference, using the `inference_engine` that has been instantiated based on the `YAML` configuration (`inference_engine: NATIVE`; see Step 3). During inference, the engine appends the LLM's (assistant's) response to each conversation as its last message. Thus, we extract the predictions from the last messages and store them in a list (`y_pred`). We also store the corresponding labels in a list (`y_true`). The labels can be found in the conversations' metadata, as discussed earlier (see section: Creating the Evaluation Dataset).

Once we have the lists of predictions (`y_pred`) and labels (`y_true`), we can compute any relevant metrics (F1, Precision, Recall, BACC, etc). In this notebook we choose the Balanced Accuracy (BACC) from the sklearn library. The final step is to return all computed metrics in a dict.
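To make the metric concrete: balanced accuracy is the average of the per-class recall values. The sketch below computes it by hand on made-up labels and compares it with plain accuracy. The helper `balanced_accuracy` is a hypothetical stand-in written for illustration; the notebook itself uses `balanced_accuracy_score` from sklearn.

```python
def balanced_accuracy(y_true, y_pred):
    """Average of per-class recall, over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        # Indices of the examples whose true label is `cls`.
        idx = [i for i, label in enumerate(y_true) if label == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Made-up labels: 2 supported (0) and 6 unsupported (1) examples.
y_true = [0, 0, 1, 1, 1, 1, 1, 1]
# A degenerate classifier that always answers "unsupported".
y_pred = [1, 1, 1, 1, 1, 1, 1, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("Accuracy:", accuracy)  # 0.75
print("BACC:", balanced_accuracy(y_true, y_pred))  # 0.5
```

Plain accuracy rewards the degenerate classifier for guessing the majority class, while BACC stays at chance level. This matters here because two of the three ANLI labels map to "unsupported", so the binary classes need not be balanced.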

Note that evaluation function registration has been designed with flexibility as the primary goal. There are **no requirements on what inputs to use**: both Oumi's inference engine and the input dataset are provided only for user convenience. In practice, you can **use your own inference** code or skip inference altogether. You also don't have to pass a dataset into this function; you can **read data from any source you want** (a file, a database, your custom function) or not use a dataset at all during evaluation.

In [8]:
from sklearn.metrics import balanced_accuracy_score

from oumi.core.registry import register_evaluation_function


@register_evaluation_function("my_custom_evaluation")
def my_custom_evaluation(inference_engine, dataset):
    """Custom evaluation function registered as `my_custom_evaluation`."""
    # Run inference to generate the model responses.
    conversations = inference_engine.infer(dataset.conversations())

    y_true, y_pred = [], []
    for conversation in conversations:
        # Extract the assistant's (LLM's) response from the conversation.
        response = conversation.last_message()

        # Extract the prediction from the response.
        prediction = _extract_prediction(response.content)

        # Record the valid predictions (!= -1) together with their labels.
        if prediction != -1:
            y_pred.append(prediction)
            y_true.append(conversation.metadata["label"])

    # Compute any relevant metrics (such as Balanced Accuracy).
    bacc = balanced_accuracy_score(y_true, y_pred)

    return {"bacc": bacc}

### Step 3: Set the Evaluation Configuration

The easiest way to configure our custom evaluation is by writing a `YAML` file, as shown below. This file identifies the model to evaluate and sets the generation parameters, the inference engine to use, and the output directory. Most importantly, it points to a single evaluation task (under `tasks`) describing the custom task to run:
- `evaluation_backend`: `custom`, indicating a custom evaluation.
- `task_name`: `my_custom_evaluation`, pointing to the name of the custom evaluation function that we registered (see Step 2).

Note: Evaluating **any open or closed source model** is a **simple config change**!
- This example evaluates SmolLM2, a small (135M-parameter) model that we use for testing purposes.
- For open-source or custom models, just set the `model_name` to your desired HuggingFace model ID, or point to a local path. 
- For closed-source models, besides setting the `model_name` (e.g. `gpt-4o`), the only additional steps are to set the `inference_engine` to the [relevant type](https://oumi.ai/docs/en/latest/user_guides/infer/configuration.html#engine-selection) (`OPENAI`, `ANTHROPIC`, `GOOGLE_GEMINI`, `DEEPSEEK`, etc.) and to set an API key (if needed) in [remote params](https://oumi.ai/docs/en/latest/user_guides/infer/configuration.html#remote-configuration). Visit [our documentation](https://oumi.ai/docs/en/latest/user_guides/infer/inference_engines.html) for more details.

In [9]:
yaml_content = """
model:
  model_name: "HuggingFaceTB/SmolLM2-135M-Instruct"
  model_max_length: 2048
  torch_dtype_str: "bfloat16"
  trust_remote_code: True

generation:
  batch_size: 1

tasks:
  # List of tasks to evaluate (only a single task in this case).
  - evaluation_backend: custom
    task_name: my_custom_evaluation

inference_engine: NATIVE

output_dir: "tutorial/output"
"""

with open(f"{tutorial_dir}/custom_eval.yaml", "w") as f:
    f.write(yaml_content)

### Step 4: Run the Evaluation

Putting everything together: Once we have constructed the dataset, registered an evaluation function, and set the evaluation config, we can run the evaluation as shown below.

In [10]:
from oumi.core.configs import EvaluationConfig
from oumi.core.evaluation import Evaluator

config = EvaluationConfig.from_yaml(str(Path(tutorial_dir) / "custom_eval.yaml"))

evaluator = Evaluator()
results = evaluator.evaluate(config, dataset=MY_DATASET)

[2025-03-12 11:09:36,709][oumi][rank0][pid:90050][MainThread][INFO]][models.py:208] Building model using device_map: auto (DeviceRankInfo(world_size=1, rank=0, local_world_size=1, local_rank=0))...
[2025-03-12 11:09:36,850][oumi][rank0][pid:90050][MainThread][INFO]][models.py:276] Using model class: <class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'> to instantiate model.
[2025-03-12 11:09:37,497][oumi][rank0][pid:90050][MainThread][INFO]][models.py:482] Using the model's built-in chat template for model 'HuggingFaceTB/SmolLM2-135M-Instruct'.
[2025-03-12 11:09:37,518][oumi][rank0][pid:90050][MainThread][INFO]][native_text_inference_engine.py:140] Setting EOS token id to `2`


Generating Model Responses: 100%|██████████| 10/10 [00:12<00:00,  1.21s/it]


### Step 5: Inspect the Results

The code below shows how we can inspect the list of `results`. Here, we have a single result returned by the custom evaluation function, since our evaluation config only defined one task. 

Although BACC is perfect (`1.0`) here, SmolLM2 135M is actually NOT a great hallucination classifier. The perfect score is an artifact of evaluating on only 10 examples, and of scoring only the responses from which a prediction could be extracted (ambiguous responses are skipped in Step 2). We selected SmolLM2 in this notebook because of its small size, to optimize for execution speed. We strongly recommend larger models for accurate classifications.

In [11]:
custom_task_results: dict = results[0].get_results()

print("BACC:", custom_task_results["bacc"])
print("Execution duration in sec:", results[0].elapsed_time_sec)

BACC: 1.0
Execution duration in sec: 12
