<div class="align-center">
<a href="https://oumi.ai/"><img src="https://oumi.ai/docs/en/latest/_static/logo/header_logo.png" height="200"></a>

[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://oumi.ai/docs/en/latest/index.html)
[![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi)
[![GitHub Repo stars](https://img.shields.io/github/stars/oumi-ai/oumi)](https://github.com/oumi-ai/oumi)
</div>

👋 Welcome to Open Universal Machine Intelligence (Oumi)!

🚀 Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models - from [data preparation](https://oumi.ai/docs/en/latest/resources/datasets/datasets.html) and [training](https://oumi.ai/docs/en/latest/user_guides/train/train.html) to [evaluation](https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html) and [deployment](https://oumi.ai/docs/en/latest/user_guides/launch/launch.html). Whether you're developing on a laptop, launching large-scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.

🤝 Make sure to join our [Discord community](https://discord.gg/oumi) to get help, share your experiences, and contribute to the project! If you are interested in joining one of the community's open-science efforts, check out our [open collaboration](https://oumi.ai/community) page.

⭐ If you like Oumi and you would like to support it, please give it a star on [GitHub](https://github.com/oumi-ai/oumi).

# Bulk Inference

This notebook demonstrates how to run bulk inference against various LLM APIs.

# Prerequisites

## Machine Requirements

Since we're calling APIs remotely, no GPU or special hardware is needed. You may need internet access when calling an API that isn't accessible via your local network.

## Oumi Installation

First, let's install Oumi. You can find more detailed instructions about Oumi installation [here](https://oumi.ai/docs/en/latest/get_started/installation.html).


In [None]:
%pip install oumi

## API Keys

Each model provider requires its own API key, which must be set before you call that provider.
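
Notebook outputs are often saved and shared, so it can be safer to report whether each key is set than to print its value. The following is a minimal sketch of that idea using only the standard library. The helper `key_status` is hypothetical (it is not part of Oumi), and it assumes the `<MY_..._API_KEY>` placeholder convention used in this notebook.

```python
import os
from collections.abc import Mapping

# Environment variables used in this notebook.
PROVIDER_KEYS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GEMINI_API_KEY")


def key_status(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Describe an API key without revealing its full value (hypothetical helper)."""
    value = env.get(name, "")
    # Treat unset variables and "<MY_..._API_KEY>" placeholders as missing.
    if not value or (value.startswith("<") and value.endswith(">")):
        return "missing"
    # Show only the last 4 characters so the key can be recognized, not reused.
    return f"set (ends with ...{value[-4:]})"


for name in PROVIDER_KEYS:
    print(f"{name}: {key_status(name)}")
```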

In [None]:
import os

from oumi.core.configs import InferenceConfig
from oumi.core.types import Conversation, Message, Role

# Set up API keys for different providers
if (
    not os.environ.get("ANTHROPIC_API_KEY")
    or os.environ.get("ANTHROPIC_API_KEY") == "<MY_ANTHROPIC_API_KEY>"
):
    os.environ["ANTHROPIC_API_KEY"] = "<MY_ANTHROPIC_API_KEY>"

if (
    not os.environ.get("OPENAI_API_KEY")
    or os.environ.get("OPENAI_API_KEY") == "<MY_OPENAI_API_KEY>"
):
    os.environ["OPENAI_API_KEY"] = "<MY_OPENAI_API_KEY>"

if (
    not os.environ.get("GEMINI_API_KEY")
    or os.environ.get("GEMINI_API_KEY") == "<MY_GEMINI_API_KEY>"
):
    os.environ["GEMINI_API_KEY"] = "<MY_GEMINI_API_KEY>"

# Display which API keys are configured
anthropic_key = os.environ.get("ANTHROPIC_API_KEY")
openai_key = os.environ.get("OPENAI_API_KEY")
google_key = os.environ.get("GEMINI_API_KEY")

print(f"Using Anthropic API Key: '{anthropic_key}'")
print(f"Using OpenAI API Key: '{openai_key}'")
print(f"Using Google API Key: '{google_key}'")

### Setting up the config file

Note: This section writes the config file to the current working directory.

Alternatively, you can initialize the parameter classes (`ModelParams`, `GenerationParams`) directly.
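
If you run this tutorial as a plain Python script, the `%%writefile` cell magic used below is not available. A minimal sketch of writing the same file with the standard library follows. The helper name `write_config` is hypothetical, and the YAML text simply mirrors the active settings from the config cell in this notebook.

```python
from pathlib import Path

# Same active settings as the notebook's YAML cell (commented alternatives omitted).
CONFIG_TEXT = """\
model:
  model_name: "claude-sonnet-4-20250514"

generation:
  max_new_tokens: 512

remote_params:
  num_workers: 5
  politeness_policy: 60

engine: "ANTHROPIC"
"""


def write_config(path: str) -> Path:
    """Write the inference config to `path` and return it as a Path (hypothetical helper)."""
    target = Path(path)
    target.write_text(CONFIG_TEXT, encoding="utf-8")
    return target
```

Calling `write_config("api_tutorial_inference_config.yaml")` would then produce the file that `InferenceConfig.from_yaml` reads later in the notebook.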

In [None]:
config_path = "api_tutorial_inference_config.yaml"

In [None]:
%%writefile api_tutorial_inference_config.yaml

model:
  model_name: "claude-sonnet-4-20250514"
  # model_name: "gpt-4o-mini"
  # model_name: "gemini-2.0-flash"

generation:
  max_new_tokens: 512

remote_params:
  num_workers: 5 # max number of workers to run in parallel
  politeness_policy: 60 # each worker waits 60 seconds before sending its next request

engine: "ANTHROPIC"
# engine: "OPENAI"
# engine: "GOOGLE_GEMINI"

### Load the model and the inference engine

In [None]:
from oumi.builders.inference_engines import build_inference_engine

config = InferenceConfig.from_yaml(config_path)
model_params = config.model
remote_params = config.remote_params
engine_type = config.engine
if not engine_type:
    # Raise instead of calling exit(), which would shut down the notebook kernel.
    raise ValueError(
        "No inference engine specified. Set `engine` in your config file."
    )

inference_engine = build_inference_engine(engine_type, model_params, remote_params)

### Preprocessing our inputs

The inference engine expects a list of conversations, where each conversation is a list of messages.

See the [Conversation](https://github.com/oumi-ai/oumi/blob/38b3d2b27407be5fc9be5a1dd88f9ad518f3491c/src/oumi/core/types/turn.py#L109) class for more details.

Tip: you can visualize how the conversation is rendered as a prompt with the following:

```python
inference_engine.apply_chat_template(conversation, tokenize=False)
```

In [None]:
from datasets import load_dataset

ds = load_dataset("fka/awesome-chatgpt-prompts", split="train")

prompts = [sample["prompt"] for sample in ds]  # type: ignore

conversations = [
    Conversation(
        messages=[
            Message(role=Role.USER, content=prompt),
        ]
    )
    for prompt in prompts
]

print(len(conversations))
print(conversations[0])

In [None]:
# Limit the number of conversations to 10 for testing

inference_conversations = conversations[:10]

In [None]:
print(f"Running inference for {len(inference_conversations)} conversations")

generations = inference_engine.infer(
    input=inference_conversations,
    inference_config=config,
)

In [None]:
for conversation in generations:
    print(repr(conversation))
    print()

### Inference Features
#### Parallel Inference
Under the hood, Oumi parallelizes your inference, sending up to `num_workers` requests concurrently.

For example, if `num_workers = 5`, Oumi processes up to 5 requests at a time.

For maximum throughput, pass all your prompts to the engine in a single call.
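
To illustrate what a cap of `num_workers` concurrent requests means, here is a minimal standard-library sketch. It is not Oumi's implementation: the thread pool is a stand-in chosen for illustration, and `fake_request` and `infer_all` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_request(prompt: str) -> str:
    """Hypothetical stand-in for a remote API call."""
    return f"response to: {prompt}"


def infer_all(prompts: list[str], num_workers: int = 5) -> list[str]:
    """Process prompts with at most `num_workers` requests in flight."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() returns results in input order even though calls run concurrently.
        return list(pool.map(fake_request, prompts))


responses = infer_all([f"prompt {i}" for i in range(12)])
print(len(responses))
```

Handing the pool the full list at once is what keeps all workers busy; submitting prompts one call at a time would leave most of them idle.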

#### Rate Limiting
In addition to parallelization, Oumi supports rate limiting through `politeness_policy`.

For example, if `politeness_policy = 60`, then each worker waits 60 seconds before submitting its next request.

This helps you stay within the requests-per-minute quotas set by API providers, so your inference job is less likely to fail on rate-limit errors.
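
The relationship between these two settings and a requests-per-minute quota can be sketched with simple arithmetic. This assumes the behavior described above (each worker sends one request, then waits `politeness_policy` seconds) and ignores request latency, so it gives an upper bound rather than an exact rate. Both function names are hypothetical.

```python
import math


def max_requests_per_minute(num_workers: int, politeness_policy: float) -> float:
    """Upper bound on requests per minute, ignoring request latency."""
    if politeness_policy <= 0:
        return math.inf  # no enforced wait between requests
    return num_workers * 60.0 / politeness_policy


def politeness_for_quota(num_workers: int, rpm_quota: float) -> float:
    """Smallest wait (in seconds) that keeps `num_workers` within `rpm_quota`."""
    return num_workers * 60.0 / rpm_quota


# The config in this notebook: 5 workers, 60-second politeness policy.
print(max_requests_per_minute(5, 60))  # 5.0
# Hypothetical quota of 50 requests per minute with 5 workers.
print(politeness_for_quota(5, 50))  # 6.0
```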

#### Adaptive Throughput
By default, Oumi adapts its parallelization (the number of active workers) based on the error rate, up to the maximum set in `num_workers`.

To start, Oumi runs inference at 50% of `num_workers`, then scales up as requests succeed.

Once inference begins getting failed requests, it enters a `backoff` state, in which the number of active workers is reduced by some fraction.

If inference continues to experience a high error rate after entering `backoff`, Oumi continues to back off.

After a period of stability and successful requests, Oumi enters a `warmup` state and begins increasing the number of active workers up to the maximum `num_workers`.

This option can be disabled by setting `use_adaptive_concurrency = False` in `RemoteParams`.
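
The backoff and warmup cycle described above can be pictured with a toy controller. This is only a sketch, not Oumi's implementation: the class `AdaptiveWorkers`, the 10% error threshold, the 0.5 backoff factor, and the one-worker-per-window warmup step are all assumptions chosen for illustration.

```python
class AdaptiveWorkers:
    """Toy controller for the number of active workers (illustrative only)."""

    def __init__(
        self,
        num_workers: int,
        backoff_factor: float = 0.5,   # assumed value
        error_threshold: float = 0.1,  # assumed value
    ):
        self.max_workers = num_workers
        self.backoff_factor = backoff_factor
        self.error_threshold = error_threshold
        self.active = max(1, num_workers // 2)  # start at about 50% of num_workers
        self.state = "warmup"

    def update(self, successes: int, failures: int) -> int:
        """Adjust the worker count after one window of requests."""
        total = successes + failures
        error_rate = failures / total if total else 0.0
        if error_rate > self.error_threshold:
            # High error rate: shrink the number of active workers.
            self.state = "backoff"
            self.active = max(1, int(self.active * self.backoff_factor))
        else:
            # Stable window: grow by one worker, capped at num_workers.
            self.state = "warmup"
            self.active = min(self.max_workers, self.active + 1)
        return self.active


controller = AdaptiveWorkers(num_workers=8)
for window in [(10, 0), (10, 0), (5, 5), (5, 5), (10, 0)]:
    controller.update(*window)
    print(controller.state, controller.active)
```

In this sketch a single clean window counts as the "period of stability"; a real controller would likely require a longer streak before scaling back up.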

#### Progress Saving
Whether or not you specify an output directory, Oumi automatically saves results to your local disk as it receives them, so your inference job loses no progress if the API goes down.

Additionally, if you rerun inference with the same dataset, model, and generation parameters, Oumi resumes from where it left off.

If you need to access the file, it is saved in `output_path/scratch`.

If the config doesn't specify an output path, the file is instead saved in your home directory under `~/.cache/oumi/tmp`.
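
The save-as-you-go and resume behavior can be sketched in a few lines. This is an illustration of the general technique, not Oumi's code: the JSON Lines format, the `run_with_resume` helper, and keying saved results by prompt text are all assumptions, and the "response" is a hypothetical stand-in for an API call.

```python
import json
import tempfile
from pathlib import Path


def run_with_resume(prompts: list[str], scratch_file: Path) -> list[dict]:
    """Append each result to `scratch_file`; skip prompts already saved there."""
    done: dict[str, dict] = {}
    if scratch_file.exists():
        with scratch_file.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    record = json.loads(line)
                    done[record["prompt"]] = record

    with scratch_file.open("a", encoding="utf-8") as f:
        for prompt in prompts:
            if prompt in done:
                continue  # already completed in an earlier run
            # Hypothetical stand-in for a remote API call.
            record = {"prompt": prompt, "response": f"response to: {prompt}"}
            f.write(json.dumps(record) + "\n")
            f.flush()  # persist each result as soon as it arrives
            done[prompt] = record
    return [done[p] for p in prompts]


with tempfile.TemporaryDirectory() as tmp:
    scratch = Path(tmp) / "scratch.jsonl"
    run_with_resume(["a", "b"], scratch)  # first run stops after two prompts
    results = run_with_resume(["a", "b", "c"], scratch)  # rerun only computes "c"
    saved_lines = scratch.read_text(encoding="utf-8").splitlines()

print(len(results), len(saved_lines))
```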