# Bloom: Automated Behavioral Evaluations for LLMs

**Bloom** is an open-source agentic framework from Anthropic for generating behavioral evaluations of frontier AI models. It lets researchers automatically assess how AI systems exhibit specific behaviors (sycophancy, self-preservation, political bias, etc.) without hand-building an evaluation pipeline for each one.

## Key Features

- **Automated scenario generation**: Generates diverse test scenarios for any specified behavior
- **Four-stage pipeline**: Understanding → Ideation → Rollout → Judgment
- **Multi-provider support**: Works with Anthropic, OpenAI, OpenRouter, and more via LiteLLM
- **W&B integration**: Scale experiments with Weights & Biases sweeps
- **Validated results**: Judge scores correlate strongly with human judgment (Spearman correlation of 0.86 with Claude Opus 4.1)

## Resources

- [GitHub Repository](https://github.com/safety-research/bloom)
- [Anthropic Research Page](https://www.anthropic.com/research/bloom)
- [Alignment Science Blog](https://alignment.anthropic.com/2025/bloom-auto-evals/)

## Installation

Bloom can be installed directly from GitHub or from a local clone. The cell below installs a local clone in editable mode into the current environment.

In [None]:
# Install bloom from the local clone (editable mode for development)
# If you haven't cloned it, use: pip install git+https://github.com/safety-research/bloom.git
!pip install -e ./bloom

## Environment Setup

Bloom requires API keys for the LLM providers you want to use. Set these as environment variables.

In [None]:
import os

# Set your API keys (uncomment and fill in as needed)
# os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-key"
# os.environ["OPENAI_API_KEY"] = "your-openai-key"
# os.environ["OPENROUTER_API_KEY"] = "your-openrouter-key"

# Or load from a .env file
from pathlib import Path

env_file = Path(".env")
if env_file.exists():
    with open(env_file) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, value = line.split("=", 1)
                os.environ[key.strip()] = value.strip().strip('"').strip("'")
    print("Loaded environment variables from .env")
else:
    print("No .env file found - make sure API keys are set")
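Before making any API calls, it can help to confirm that the keys you need are actually present. The helper below is a minimal sketch, not part of Bloom: `missing_api_keys` and the `required_keys` list are hypothetical names, and which keys you need depends on the providers your configuration uses.

```python
import os


def missing_api_keys(required, env=None):
    """Return the names in `required` that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


# Hypothetical selection: list only the providers your config actually uses.
required_keys = ["ANTHROPIC_API_KEY"]

missing = missing_api_keys(required_keys)
if missing:
    print(f"Missing API keys: {', '.join(missing)}")
else:
    print("All required API keys are set")
```

Passing a plain dict as `env` makes the check easy to exercise without touching the real environment.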

## The Four-Stage Pipeline

Bloom's evaluation pipeline consists of four stages:

1. **Understanding**: Analyzes behavior descriptions and example transcripts to establish measurement context
2. **Ideation**: Generates evaluation scenarios (situations, simulated users, system prompts, environments)
3. **Rollout**: Executes scenarios in parallel, simulating user and tool responses
4. **Judgment**: Scores transcripts for behavior presence; meta-judge produces suite-level analysis

The sections below set up a workspace, then show how to run these stages individually or all at once.

## Initialize a Workspace

First, let's initialize a Bloom workspace, which creates the configuration files we need.

In [None]:
# Initialize bloom workspace (creates bloom-data/ directory with config files)
!bloom init

This creates:
- `bloom-data/seed.yaml` - Main configuration file
- `bloom-data/models.json` - Model ID mappings
- `bloom-data/behaviors.json` - Behavior definitions
- `.env` - Template for API keys

## Explore the Configuration

Let's look at the seed configuration that controls the evaluation.

In [None]:
import yaml
from pathlib import Path

seed_path = Path("bloom-data/seed.yaml")
if seed_path.exists():
    with open(seed_path) as f:
        seed_config = yaml.safe_load(f)
    
    print("=== Seed Configuration ===")
    print(yaml.dump(seed_config, default_flow_style=False, sort_keys=False))
else:
    print("Run 'bloom init' first to create the configuration files")

## Key Configuration Parameters

| Parameter | Description |
|-----------|-------------|
| `behavior.name` | Target behavior (e.g., "sycophancy", "self-preservation") |
| `behavior.examples` | Example transcripts to guide generation (optional) |
| `ideation.total_evals` | Number of scenarios to generate |
| `rollout.target` | Model to evaluate (LiteLLM ID or short name) |
| `rollout.modality` | "conversation" or "simenv" (with tool calls) |
| `debug` | Enable verbose output |
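A quick sanity check on a loaded seed dictionary can catch typos before any API calls are made. This is a sketch under stated assumptions: `get_nested` and `check_seed` are hypothetical helpers, and treating `behavior.name` and `rollout.target` as required is our assumption based on the table above, not a rule taken from Bloom's own validation.

```python
def get_nested(config, dotted_key, default=None):
    """Look up a dotted key such as 'rollout.target' in nested dicts."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


def check_seed(config):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    # Assumption: these two keys are needed for any run.
    for key in ("behavior.name", "rollout.target"):
        if not get_nested(config, key):
            problems.append(f"{key} is missing or empty")
    total = get_nested(config, "ideation.total_evals")
    if total is not None and (not isinstance(total, int) or total < 1):
        problems.append("ideation.total_evals should be a positive integer")
    modality = get_nested(config, "rollout.modality")
    if modality is not None and modality not in ("conversation", "simenv"):
        problems.append(f"rollout.modality has unexpected value {modality!r}")
    return problems


example = {
    "behavior": {"name": "sycophancy"},
    "ideation": {"total_evals": 10},
    "rollout": {"target": "anthropic/claude-sonnet-4-20250514", "modality": "conversation"},
}
print(check_seed(example) or "No problems found")
```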

## Explore Available Behaviors

Bloom comes with pre-defined behaviors you can evaluate. Let's see what's available.

In [None]:
from bloom.data import get_bundled_behaviors, list_bundled_examples

# List all bundled behaviors
behaviors = get_bundled_behaviors()
print("=== Available Behaviors ===")
for name, description in behaviors.items():
    print(f"\n{name}:")
    if isinstance(description, dict):
        print(f"  {description.get('description', 'No description')[:100]}...")
    else:
        print(f"  {str(description)[:100]}...")

In [None]:
# List example transcripts for behaviors
print("=== Available Example Transcripts ===")
examples = list_bundled_examples()
for example in examples:
    print(f"  - {example}")

## Running the Pipeline Programmatically

You can run Bloom either via CLI or programmatically. Here's how to use the Python API.

In [None]:
from bloom.utils import load_config

# Load the configuration
config = load_config("bloom-data/seed.yaml")
print(f"Loaded config for behavior: {config.get('behavior', {}).get('name', 'unknown')}")
print(f"Target model: {config.get('rollout', {}).get('target', 'unknown')}")

### Run Individual Stages

You can run stages individually for more control. This is useful for:
- Debugging specific stages
- Reusing understanding/ideation across multiple models
- Resuming from a checkpoint
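
Resuming from a checkpoint amounts to running the remaining stages in order. The snippet below is a minimal sketch of that bookkeeping; `stages_from` is a hypothetical helper (not a Bloom function), and the stage names mirror the CLI subcommands.

```python
# Stage order as described in the four-stage pipeline.
STAGES = ["understanding", "ideation", "rollout", "judgment"]


def stages_from(start):
    """Return the stages to run, in order, beginning at `start`."""
    if start not in STAGES:
        raise ValueError(f"Unknown stage {start!r}; expected one of {STAGES}")
    return STAGES[STAGES.index(start):]


# Reusing existing understanding/ideation output for a new target model:
print(stages_from("rollout"))  # ['rollout', 'judgment']
```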

In [None]:
# Import individual stages
from bloom.stages.step1_understanding import run_understanding
from bloom.stages.step2_ideation import run_ideation
from bloom.stages.step3_rollout import run_rollout
from bloom.stages.step4_judgment import run_judgment

print("Stage functions imported successfully")
print("\nTo run a stage:")
print("  run_understanding(config=config, config_dir='bloom-data')")
print("  run_ideation(config=config, config_dir='bloom-data')")
print("  await run_rollout(config=config, config_dir='bloom-data')  # async")
print("  await run_judgment(config=config, config_dir='bloom-data')  # async")
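
The rollout and judgment stages are async, so how you call them depends on where the code runs. In a notebook cell you can typically use `await` directly, because the kernel already runs an event loop; in a plain script you need `asyncio.run`. The sketch below uses `fake_stage`, a hypothetical stand-in coroutine, so that it runs without API keys.

```python
import asyncio


async def fake_stage(name):
    """Hypothetical stand-in for an async stage such as run_rollout."""
    await asyncio.sleep(0)  # yield to the event loop, as a real API call would
    return f"{name} done"


# In a script (no event loop running yet), drive the coroutine with asyncio.run.
# In a notebook cell, write `result = await fake_stage("rollout")` instead,
# since asyncio.run raises RuntimeError when a loop is already running.
result = asyncio.run(fake_stage("rollout"))
print(result)
```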

### Run the Full Pipeline

For a complete evaluation, use `run_pipeline`, which orchestrates all four stages.

In [None]:
from bloom.core import run_pipeline

# Uncomment to run the full pipeline (requires API keys and will make API calls)
# run_pipeline(config=config, config_dir="bloom-data")

print("To run the full pipeline, uncomment the line above.")
print("Or use the CLI: bloom run bloom-data")

## CLI Commands Reference

Bloom's CLI covers workspace setup, full pipeline runs, individual stages, and interactive chat.

In [None]:
# Show available commands
!bloom --help

### Common CLI Commands

```bash
# Initialize workspace
bloom init

# Run full pipeline
bloom run bloom-data
bloom run bloom-data --debug  # verbose output

# Run individual stages
bloom understanding bloom-data
bloom ideation bloom-data
bloom rollout bloom-data
bloom judgment bloom-data

# Interactive chat (for testing models)
bloom chat --model anthropic/claude-sonnet-4-20250514
bloom chat --model openai/gpt-4o --reasoning-effort high
```

## Example: Configuring an Evaluation

Let's create a simple configuration to evaluate a model for sycophancy.

In [None]:
import yaml

# Example configuration for evaluating sycophancy
example_config = {
    "behavior": {
        "name": "sycophancy",
        "examples": ["sycophancy/ethical", "sycophancy/factual"]  # optional
    },
    "ideation": {
        "total_evals": 10,  # number of scenarios to generate
    },
    "rollout": {
        "target": "anthropic/claude-sonnet-4-20250514",  # model to evaluate
        "modality": "conversation",  # or "simenv" for tool-use scenarios
    },
    "debug": True  # enable verbose output
}

print("=== Example Sycophancy Evaluation Config ===")
print(yaml.dump(example_config, default_flow_style=False))
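
To compare several models on the same behavior, one option is to derive a config per target from a shared base. This is a minimal sketch: `with_target` is a hypothetical helper, and the model IDs are the ones used in the CLI examples in this notebook.

```python
import copy


def with_target(base_config, target):
    """Return a deep copy of `base_config` with rollout.target replaced."""
    config = copy.deepcopy(base_config)
    config.setdefault("rollout", {})["target"] = target
    return config


base = {
    "behavior": {"name": "sycophancy"},
    "ideation": {"total_evals": 10},
    "rollout": {"target": "anthropic/claude-sonnet-4-20250514", "modality": "conversation"},
}

targets = ["anthropic/claude-sonnet-4-20250514", "openai/gpt-4o"]
variants = [with_target(base, target) for target in targets]

for variant in variants:
    print(variant["rollout"]["target"])
```

The deep copy keeps each variant independent, so editing one never changes the base or its siblings.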

## Viewing Results

Results are saved to `bloom-results/{behavior_name}/`. You can view transcripts using the bloom-viewer.

In [None]:
from pathlib import Path
import json

results_dir = Path("bloom-results")
if results_dir.exists():
    print("=== Available Results ===")
    for behavior_dir in results_dir.iterdir():
        if behavior_dir.is_dir():
            print(f"\n{behavior_dir.name}/")
            files = sorted(behavior_dir.iterdir())  # list the directory once
            for file in files[:5]:  # show first 5 files
                print(f"  - {file.name}")
            remaining = len(files) - 5
            if remaining > 0:
                print(f"  ... and {remaining} more files")
else:
    print("No results yet. Run 'bloom run bloom-data' to generate results.")
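
For a quick overview of what a run produced, you can count files per behavior by extension. This sketch assumes nothing about the contents of Bloom's result files; `count_by_suffix` is a hypothetical helper that only inspects file names.

```python
from collections import Counter
from pathlib import Path


def count_by_suffix(results_dir):
    """Count files under each behavior directory, grouped by file extension."""
    summary = {}
    root = Path(results_dir)
    if not root.is_dir():
        return summary  # nothing to report yet
    for behavior_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        files = [f for f in behavior_dir.rglob("*") if f.is_file()]
        summary[behavior_dir.name] = Counter(f.suffix or "(none)" for f in files)
    return summary


for behavior, counts in count_by_suffix("bloom-results").items():
    print(behavior, dict(counts))
```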

In [None]:
# To view results with the interactive viewer:
# npx @isha-gpt/bloom-viewer --port 8080 --dir ./bloom-results

print("To view results interactively, run:")
print("  npx @isha-gpt/bloom-viewer --port 8080 --dir ./bloom-results")

## Understanding Bloom's Orchestrators

Bloom uses two types of orchestrators to manage conversations:

1. **ConversationOrchestrator**: Basic text-based conversations
2. **SimEnvOrchestrator**: Simulated environment with tool/function calling

The orchestrator is selected based on the `rollout.modality` setting.

In [None]:
from bloom.orchestrators.ConversationOrchestrator import ConversationOrchestrator
from bloom.orchestrators.SimEnvOrchestrator import SimEnvOrchestrator

print("ConversationOrchestrator: For basic conversation evaluations")
print("  - Text-based exchanges between evaluator and target model")
print("  - No tool support")
print()
print("SimEnvOrchestrator: For tool-use evaluations")
print("  - Supports function/tool calling")
print("  - Simulates tool responses dynamically")
print("  - Use modality: 'simenv' in config")
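
The selection by `rollout.modality` can be pictured as a simple lookup. The mapping below is a hypothetical illustration of the idea, not Bloom's actual dispatch code, and it returns class names as strings so it runs without Bloom installed.

```python
# Hypothetical mapping for illustration; Bloom's real selection logic may differ.
ORCHESTRATOR_BY_MODALITY = {
    "conversation": "ConversationOrchestrator",
    "simenv": "SimEnvOrchestrator",
}


def orchestrator_name(config):
    """Return the orchestrator class name implied by rollout.modality."""
    modality = config.get("rollout", {}).get("modality")
    try:
        return ORCHESTRATOR_BY_MODALITY[modality]
    except KeyError:
        raise ValueError(f"Unsupported rollout.modality: {modality!r}") from None


print(orchestrator_name({"rollout": {"modality": "simenv"}}))
```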

## Benchmark Behaviors

Anthropic has released benchmark results across 16 frontier models for these behaviors:

- **Delusional sycophancy**: Model agrees with demonstrably false user beliefs
- **Instructed long-horizon sabotage**: Model follows instructions to subtly sabotage a task over a long interaction
- **Self-preservation**: Model takes actions to prevent being modified or shut down
- **Self-preferential bias**: Model favors its own outputs over others

You can use these as starting points for your own evaluations.

## Next Steps

1. **Configure your evaluation**: Edit `bloom-data/seed.yaml` to target a specific behavior
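2. **Set API keys**: Add your keys to `.env`, then load them into your shell. If the file uses plain `KEY=value` lines, run `set -a; source .env; set +a`, because `source .env` alone sets the variables without exporting them to child processes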
3. **Run the pipeline**: `bloom run bloom-data --debug`
4. **View results**: Use the bloom-viewer or examine JSON files directly
5. **Scale with W&B**: Use sweep configs from `bloom/examples/sweeps/` for large experiments

For more details, see the [Bloom README](https://github.com/safety-research/bloom) and [technical documentation](https://alignment.anthropic.com/2025/bloom-auto-evals/).