<div class="align-center">
<a href="https://oumi.ai/"><img src="https://oumi.ai/docs/en/latest/_static/logo/header_logo.png" height="200"></a>

[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://oumi.ai/docs/en/latest/index.html)
[![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi)
[![GitHub Repo stars](https://img.shields.io/github/stars/oumi-ai/oumi)](https://github.com/oumi-ai/oumi)
<a target="_blank" href="https://colab.research.google.com/github/oumi-ai/oumi/blob/main/configs/projects/halloumi/halloumi_inference_notebook.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</div>

👋 Welcome to Open Universal Machine Intelligence (Oumi)!

🚀 Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models, from [data preparation](https://oumi.ai/docs/en/latest/resources/datasets/datasets.html) and [training](https://oumi.ai/docs/en/latest/user_guides/train/train.html) to [evaluation](https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html) and [deployment](https://oumi.ai/docs/en/latest/user_guides/launch/launch.html). Whether you're developing on a laptop, launching large-scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.

🤝 Make sure to join our [Discord community](https://discord.gg/oumi) to get help, share your experiences, and contribute to the project! If you are interested in joining one of the community's open-science efforts, check out our [open collaboration](https://oumi.ai/community) page.

⭐ If you like Oumi and you would like to support it, please give it a star on [GitHub](https://github.com/oumi-ai/oumi).

# HallOumi Inference

This notebook demonstrates how to run inference with HallOumi 8B, either locally or through the Oumi API. For more details on HallOumi, please read our [GitHub documentation](https://github.com/oumi-ai/oumi/blob/main/configs/projects/halloumi/README.md) and our [blog post](https://oumi.ai/blog/posts/introducing-halloumi).

## Prerequisites

Install Oumi to use our inference engines. Detailed installation instructions are available [here](https://oumi.ai/docs/en/latest/get_started/installation.html). If you're running this notebook on a CUDA-compatible GPU and want to use vLLM for inference, also install the optional Oumi `[gpu]` dependencies.


In [None]:
%pip install oumi
# %pip install oumi[gpu]

Install the `nltk` library and download `punkt_tab`. These are needed for sentence splitting.

In [None]:
%pip install nltk

In [None]:
import nltk

nltk.download("punkt_tab")

## Helper Functions

The following function creates the prompt for HallOumi from a `context` document, a `request` to a language model, and the model's `response`. HallOumi's objective is to determine whether the language model hallucinated, meaning that the `response` cannot be grounded in the provided `context`.

In [1]:
from nltk.tokenize import sent_tokenize


def create_prompt(context: str, request: str, response: str) -> str:
    """Generates a prompt for the generative HallOumi model."""

    def _split_into_sentences(text: str) -> list[str]:
        sentences = sent_tokenize(text.strip())
        return [s.strip() for s in sentences if s.strip()]

    def _annotate_sentences(sentences: list[str], annotation_char: str) -> str:
        annotated_sentences = []
        for idx, sentence in enumerate(sentences, start=1):
            annotated_sentences.append(
                f"<|{annotation_char}{idx}|><{sentence}><end||{annotation_char}>"
            )
        return "".join(annotated_sentences)

    # Context: Split it into sentences and annotate them.
    context_sentences = _split_into_sentences(context)
    annotated_context_sentences = _annotate_sentences(context_sentences, "s")
    annotated_context = f"<|context|>{annotated_context_sentences}<end||context>"

    # Request: Annotate the request.
    annotated_request = f"<|request|><{request.strip()}><end||request>"

    # Response: Split it into sentences and annotate them.
    response_sentences = _split_into_sentences(response)
    annotated_response_sentences = _annotate_sentences(response_sentences, "r")
    annotated_response = f"<|response|>{annotated_response_sentences}<end||response>"

    # Combine all parts into the final prompt.
    return f"{annotated_context}{annotated_request}{annotated_response}"
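To make the prompt format concrete, the sketch below rebuilds a prompt by hand for a small example. It assumes the sentences are already split, so it does not need `nltk`, and `annotate` is a hypothetical stand-in that mirrors the `_annotate_sentences` helper above rather than a function provided by Oumi.

```python
# Stand-in for `_annotate_sentences`; sentences are pre-split, so nltk is not needed.
def annotate(sentences: list[str], annotation_char: str) -> str:
    return "".join(
        f"<|{annotation_char}{idx}|><{sentence}><end||{annotation_char}>"
        for idx, sentence in enumerate(sentences, start=1)
    )


context_sentences = ["Today is a sunny day.", "The weather is nice."]
request = "What is the weather like today?"
response_sentences = ["The weather is sunny."]

# Same three-part layout as `create_prompt`: context, request, response.
example_prompt = (
    f"<|context|>{annotate(context_sentences, 's')}<end||context>"
    f"<|request|><{request}><end||request>"
    f"<|response|>{annotate(response_sentences, 'r')}<end||response>"
)
print(example_prompt)
```

Context sentences are tagged `<|s1|>`, `<|s2|>`, and so on, while response sentences are tagged `<|r1|>`, `<|r2|>`, and so on. These are the IDs that HallOumi's citations and claims refer back to.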

The following function is used to extract a list of `Claim`s from HallOumi's response. The `Claim` class encapsulates the prediction (`supported`), the `rationale`, a list of `subclaims` that the claim consists of, and their corresponding `citations`.

In [2]:
import contextlib
from dataclasses import dataclass, field


@dataclass
class Claim:
    claim_id: int = -1
    claim_string: str = ""
    subclaims: list[str] = field(default_factory=list)
    citations: list[int] = field(default_factory=list)
    rationale: str = ""
    supported: bool = True


def get_claims_from_response(response: str) -> list[Claim]:
    """Extracts claims from the response string."""

    def _get_claim_id_from_subsegment(subsegment: str) -> int:
        claim_id_part = subsegment.split("|")[1]
        claim_id_no_r = claim_id_part.lstrip("r")
        return int(claim_id_no_r)

    def _get_claim_citations_from_subsegment(subsegment: str) -> list[int]:
        citation_segments = subsegment.split(",")
        citations = []
        for citation_segment in citation_segments:
            citation = citation_segment.replace("|", "").replace("s", "").strip()
            # Skip malformed citations (including malformed ranges) instead of raising.
            with contextlib.suppress(ValueError):
                if "-" in citation:
                    start, end = map(int, citation.split("-"))
                    citations.extend(range(start, end + 1))
                elif "to" in citation:
                    start, end = map(int, citation.split("to"))
                    citations.extend(range(start, end + 1))
                else:
                    citations.append(int(citation))
        return citations

    def _get_claim_from_segment(segment: str) -> Claim:
        claim_segments = segment.split("><")
        claim = Claim()
        claim.claim_id = _get_claim_id_from_subsegment(claim_segments[0])
        claim.claim_string = claim_segments[1]

        subclaims = []
        claim_progress_index = 3  # start parsing subclaims from index 3
        for i in range(claim_progress_index, len(claim_segments)):
            subsegment = claim_segments[i]
            if subsegment.startswith("end||subclaims"):
                claim_progress_index = i + 1
                break
            subclaims.append(subsegment)

        citation_index = -1
        rationale_index = -1
        label_index = -1

        for i in range(claim_progress_index, len(claim_segments)):
            subsegment = claim_segments[i]
            if subsegment.startswith("|cite|"):
                citation_index = i + 1
            elif subsegment.startswith("|explain|"):
                rationale_index = i + 1
            elif subsegment.startswith("|supported|") or subsegment.startswith(
                "|unsupported|"
            ):
                label_index = i

        claim.subclaims = subclaims
        claim.citations = (
            _get_claim_citations_from_subsegment(claim_segments[citation_index])
            if citation_index != -1
            else []
        )
        claim.rationale = (
            claim_segments[rationale_index] if rationale_index != -1 else ""
        )
        claim.supported = (
            claim_segments[label_index].startswith("|supported|")
            if label_index != -1
            else True
        )
        return claim

    segments = response.split("<end||r>")
    claims = []
    for segment in segments:
        if segment.strip():
            claim = _get_claim_from_segment(segment)
            claims.append(claim)
    return claims
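The parser relies on fixed positions after splitting each segment on `><`. The sketch below walks through one segment to show why sub-claims start at index 3. The `sample_segment` string is hypothetical: its tag layout is inferred from the parsing code above, not copied from real HallOumi output, so the actual ordering of tags may differ.

```python
# Hypothetical raw segment: the tag layout is inferred from the parser above,
# not copied from real model output.
sample_segment = (
    "<|r1|><The weather is sunny.>"
    "<|subclaims|><The weather is described.><It is sunny.><end||subclaims>"
    "<|cite|><|s1|><end||cite>"
    "<|supported|>"
    "<|explain|><The first sentence says the day is sunny.><end||explain>"
)

# Splitting on "><" strips the inner angle brackets between adjacent fields.
parts = sample_segment.split("><")
for index, part in enumerate(parts):
    print(index, repr(part))

# Index 0 holds the claim ID tag, index 1 the claim text, and index 2 the
# `|subclaims|` marker, so sub-claims start at index 3.
claim_id = int(parts[0].split("|")[1].lstrip("r"))
claim_string = parts[1]
subclaims = parts[3 : parts.index("end||subclaims")]
print(claim_id, claim_string, subclaims)
```

One consequence of this design is that any sentence containing the literal text `><` would shift the indices, so the parser implicitly assumes that sequence never appears inside a claim or rationale.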

## Inference Walkthrough

### Dataset

Let's start by defining a toy dataset, where each example consists of a `context` document, a `request` to the language model, and the model's `response`.

In [3]:
toy_dataset = [
    {
        "context": "Today is a sunny day. The weather is nice.",
        "request": "What is the weather like today?",
        "response": "The weather is sunny.",
    },
    {
        "context": "James is a software engineer. He works at a tech company.",
        "request": "What does James do for a living?",
        "response": "He is a hardware engineer. He loves his job.",
    },
]

We then convert these examples to a list of prompts.

In [4]:
prompts = []
for example in toy_dataset:
    prompt = create_prompt(
        context=example["context"],
        request=example["request"],
        response=example["response"],
    )
    prompts.append(prompt)

### Running Inference

The next step is to instantiate an inference config. To call our API, use the remote config (`remote_config_str`) below. To download the model and run inference on your own machine, use the local config (`local_config_str`) instead. Without a GPU, you must use the `NATIVE` engine, and inference will be very slow (the CPU run shown below took over 30 minutes for two prompts). With a CUDA-compatible GPU, set the engine to `VLLM` to take full advantage of the GPU and speed up inference.

In [5]:
from oumi.core.configs import InferenceConfig

remote_config_str = """
model:
    model_name: "halloumi"

generation:
    max_new_tokens: 8192
    temperature: 0.0

remote_params:
    api_url: "https://api.oumi.ai/chat/completions"
    max_retries: 3
    connection_timeout: 300

engine: REMOTE
"""

local_config_str = """
model:
    model_name: "oumi-ai/HallOumi-8B"
    model_max_length: 8192
    trust_remote_code: true

generation:
    max_new_tokens: 8192
    temperature: 0.0

engine: NATIVE  # Set to VLLM, if you have a CUDA-compatible GPU.
"""

inference_config = InferenceConfig.from_str(local_config_str)
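If you prefer not to edit the `engine` line by hand, the choice between `NATIVE` and `VLLM` could be made programmatically. The following is a minimal sketch under stated assumptions: `pick_engine` is a hypothetical helper (not part of Oumi), and the two checks are rough heuristics, namely that the `vllm` package is importable and that `nvidia-smi` is on the `PATH`. Neither check guarantees that vLLM will actually run on your GPU.

```python
import importlib.util
import shutil


def pick_engine(has_vllm: bool, has_nvidia_gpu: bool) -> str:
    """Returns "VLLM" only when both vLLM and an NVIDIA GPU appear available."""
    return "VLLM" if has_vllm and has_nvidia_gpu else "NATIVE"


# Rough heuristics (assumptions): vLLM is importable, and `nvidia-smi` is on PATH.
has_vllm = importlib.util.find_spec("vllm") is not None
has_nvidia_gpu = shutil.which("nvidia-smi") is not None

engine = pick_engine(has_vllm, has_nvidia_gpu)

# Same settings as `local_config_str`, with the engine filled in.
auto_config_str = f"""
model:
    model_name: "oumi-ai/HallOumi-8B"
    model_max_length: 8192
    trust_remote_code: true

generation:
    max_new_tokens: 8192
    temperature: 0.0

engine: {engine}
"""
print(auto_config_str)
# To use it: inference_config = InferenceConfig.from_str(auto_config_str)
```

Falling back to `NATIVE` whenever either check fails keeps the notebook runnable on CPU-only machines, at the cost of slow inference.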

Using this inference config, run inference on the prompts, as shown below.

In [6]:
from oumi import infer

inference_results = infer(
    config=inference_config,
    inputs=prompts,
)

INFO 04-03 21:30:36 [__init__.py:256] Automatically detected platform cpu.
[2025-04-03 21:30:37,123][oumi][rank0][pid:3776][MainThread][INFO]][models.py:208] Building model using device_map: auto (DeviceRankInfo(world_size=1, rank=0, local_world_size=1, local_rank=0))...
[2025-04-03 21:30:37,633][oumi][rank0][pid:3776][MainThread][INFO]][models.py:276] Using model class: <class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'> to instantiate model.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]



[2025-04-03 21:30:48,103][oumi][rank0][pid:3776][MainThread][INFO]][models.py:482] Using the model's built-in chat template for model 'oumi-ai/HallOumi-8B'.
[2025-04-03 21:30:48,119][oumi][rank0][pid:3776][MainThread][INFO]][native_text_inference_engine.py:140] Setting EOS token id to `128009`


Generating Model Responses: 100%|██████████| 2/2 [33:43<00:00, 1011.80s/it]


### Inspecting the results

Once inference completes, the last step is to iterate over the inference results: read each response and extract its claims, along with their predictions, sub-claims, citations, and rationales.

In [7]:
for result_index, result in enumerate(inference_results):
    # The model's response is the last message of the result (a `Conversation` object).
    response = str(result.last_message().content)

    claims = get_claims_from_response(response)
    for claim_index, claim in enumerate(claims):
        print(f"[example={result_index}, claim={claim_index}]: `{claim.claim_string}`")
        print(f"  - Supported? {claim.supported}")
        print(f"  - Sub-claims: {claim.subclaims}")
        print(f"  - Citations: {claim.citations}")
        print(f"  - Rationale: {claim.rationale}\n")

[example=0, claim=0]: `The weather is sunny.`
  - Supported? True
  - Sub-claims: ['The current weather conditions are being described.', 'The description of the weather is that it is sunny.']
  - Citations: [1]
  - Rationale: The first sentence explicitly states that "Today is a sunny day", which directly supports the claim that the weather is sunny.

[example=1, claim=0]: `He is a hardware engineer. `
  - Supported? False
  - Sub-claims: ['James works in the field of engineering.', 'James specifically works with hardware.']
  - Citations: [1]
  - Rationale: The document actually states that James is a software engineer, not a hardware engineer.

[example=1, claim=1]: `He loves his job.`
  - Supported? False
  - Sub-claims: ['James has a positive sentiment towards his job.', 'James enjoys his work.']
  - Citations: []
  - Rationale: There is no information provided in the context about James's feelings towards his job.
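
The per-claim predictions can also be rolled up into a single verdict per response. The sketch below is one possible way to do that, assuming a response counts as hallucinated as soon as any of its claims is unsupported. `ClaimSummary`, `unsupported_claims`, and `is_hallucinated` are hypothetical names introduced for illustration; `ClaimSummary` is a minimal stand-in for the `Claim` class so that the snippet runs on its own.

```python
from dataclasses import dataclass


@dataclass
class ClaimSummary:
    """Minimal stand-in for `Claim`, keeping only the fields used here."""

    claim_string: str
    supported: bool


def unsupported_claims(claims: list[ClaimSummary]) -> list[str]:
    """Returns the text of every claim marked as unsupported."""
    return [claim.claim_string for claim in claims if not claim.supported]


def is_hallucinated(claims: list[ClaimSummary]) -> bool:
    """Flags a response as soon as one claim is unsupported (assumed rule)."""
    return any(not claim.supported for claim in claims)


# Predictions copied from the output above.
example_0 = [ClaimSummary("The weather is sunny.", True)]
example_1 = [
    ClaimSummary("He is a hardware engineer.", False),
    ClaimSummary("He loves his job.", False),
]

print(is_hallucinated(example_0), unsupported_claims(example_0))
print(is_hallucinated(example_1), unsupported_claims(example_1))
```

With the real results, the same two functions would work on the `Claim` objects returned by `get_claims_from_response`, since they only read `claim_string` and `supported`.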

