---
title: "Eval Hub API Examples"
subtitle: "Comprehensive guide to using the Evaluation Hub REST API"
author: "Evaluation Service Team"
format:
  html:
    toc: true
    toc-depth: 3
    code-fold: false
    theme: cosmo
  ipynb:
    output-file: api_examples.ipynb
jupyter: python3
---

# Eval Hub API Examples

This notebook demonstrates how to interact with the Evaluation Hub REST API running on `localhost:8000`.

## Setup and Dependencies

In [1]:
import json
import time
from uuid import uuid4

import requests

# Configuration
BASE_URL = "http://localhost:8000"
API_BASE = f"{BASE_URL}/api/v1"

# Helper function for pretty printing JSON responses
def print_json(data):
    print(json.dumps(data, indent=2, default=str))

# Helper function for API requests
def api_request(method: str, endpoint: str, **kwargs) -> requests.Response:
    """Send a request to the API and print a summary of the response."""
    url = f"{API_BASE}{endpoint}"
    kwargs.setdefault("timeout", 30)  # avoid hanging indefinitely if the service is down
    response = requests.request(method, url, **kwargs)

    print(f"{method.upper()} {url}")
    print(f"Status: {response.status_code}")

    if response.headers.get('content-type', '').startswith('application/json'):
        print("Response:")
        print_json(response.json())
    else:
        print(f"Response: {response.text}")

    print("-" * 50)
    return response

## Health Check

First, let's verify the service is running:

In [2]:
response = api_request("GET", "/health")

if response.status_code == 200:
    health_data = response.json()
    print("✅ Service is healthy!")
    print(f"Version: {health_data['version']}")
    print(f"Uptime: {health_data['uptime_seconds']:.1f} seconds")
else:
    print("❌ Service is not responding correctly")

GET http://localhost:8000/api/v1/health
Status: 200
Response:
{
  "status": "healthy",
  "version": "0.1.0",
  "timestamp": "2025-11-08T23:18:29.974675",
  "components": {
    "mlflow": {
      "status": "healthy",
      "tracking_uri": "http://localhost:5000"
    }
  },
  "uptime_seconds": 1.1920928955078125e-06,
  "active_evaluations": 0
}
--------------------------------------------------
✅ Service is healthy!
Version: 0.1.0
Uptime: 0.0 seconds
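
The health payload above also reports a status per component. A minimal sketch of a stricter check, assuming (from the sample response only) that every entry under `components` carries its own `status` field; `is_fully_healthy` is a hypothetical helper, not part of the API:

```python
def is_fully_healthy(health: dict) -> bool:
    """Return True only if the service and every listed component report 'healthy'."""
    if health.get("status") != "healthy":
        return False
    components = health.get("components", {})
    return all(c.get("status") == "healthy" for c in components.values())


# Sample payload copied from the response shown above (timestamps omitted).
sample_health = {
    "status": "healthy",
    "version": "0.1.0",
    "components": {
        "mlflow": {"status": "healthy", "tracking_uri": "http://localhost:5000"}
    },
    "active_evaluations": 0,
}

print(is_fully_healthy(sample_health))
```

In the notebook this could be called as `is_fully_healthy(response.json())` after the health request.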


## Provider Management

### List All Providers

In [3]:
response = api_request("GET", "/providers")

if response.status_code == 200:
    providers_data = response.json()
    print(f"Found {providers_data['total_providers']} providers:")
    for provider in providers_data['providers']:
        print(f"  - {provider['provider_name']} ({provider['provider_id']})")
        print(f"    Type: {provider['provider_type']}")
        print(f"    Benchmarks: {provider['benchmark_count']}")

GET http://localhost:8000/api/v1/providers
Status: 200
Response:
{
  "providers": [
    {
      "provider_id": "lm_evaluation_harness",
      "provider_name": "LM Evaluation Harness",
      "description": "Comprehensive evaluation framework for language models with 167 benchmarks",
      "provider_type": "builtin",
      "benchmark_count": 168
    },
    {
      "provider_id": "ragas",
      "provider_name": "RAGAS",
      "description": "Retrieval Augmented Generation Assessment framework",
      "provider_type": "builtin",
      "benchmark_count": 4
    },
    {
      "provider_id": "garak",
      "provider_name": "Garak",
      "description": "LLM vulnerability scanner and red-teaming framework",
      "provider_type": "builtin",
      "benchmark_count": 4
    }
  ],
  "total_providers": 3,
  "total_benchmarks": 176
}
--------------------------------------------------
Found 3 providers:
  - LM Evaluation Harness (lm_evaluation_harness)
    Type: builtin
    Benchmarks: 168
  - RAGAS
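
The response above carries both per-provider counts and overall totals. A small sketch that cross-checks them, assuming `total_benchmarks` is meant to equal the sum of the `benchmark_count` fields (which holds for the sample: 168 + 4 + 4 = 176); `check_provider_totals` is a hypothetical helper:

```python
def check_provider_totals(providers_data: dict) -> bool:
    """Compare the reported totals with what the provider list itself adds up to."""
    providers = providers_data["providers"]
    listed_benchmarks = sum(p["benchmark_count"] for p in providers)
    return (
        len(providers) == providers_data["total_providers"]
        and listed_benchmarks == providers_data["total_benchmarks"]
    )


# Counts copied from the response shown above.
sample_providers = {
    "providers": [
        {"provider_id": "lm_evaluation_harness", "benchmark_count": 168},
        {"provider_id": "ragas", "benchmark_count": 4},
        {"provider_id": "garak", "benchmark_count": 4},
    ],
    "total_providers": 3,
    "total_benchmarks": 176,
}

print(check_provider_totals(sample_providers))
```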

### Get Specific Provider Details

In [4]:
# Get details for the lm_evaluation_harness provider
provider_id = "lm_evaluation_harness"
response = api_request("GET", f"/providers/{provider_id}")

if response.status_code == 200:
    provider = response.json()
    print(f"Provider: {provider['provider_name']}")
    print(f"Description: {provider['description']}")
    print(f"Number of benchmarks: {len(provider['benchmarks'])}")

GET http://localhost:8000/api/v1/providers/lm_evaluation_harness
Status: 200
Response:
{
  "provider_id": "lm_evaluation_harness",
  "provider_name": "LM Evaluation Harness",
  "description": "Comprehensive evaluation framework for language models with 167 benchmarks",
  "provider_type": "builtin",
  "benchmarks": [
    {
      "benchmark_id": "arc_easy",
      "name": "ARC Easy",
      "description": "ARC Easy evaluation benchmark - AI2 Reasoning Challenge (Easy)",
      "category": "reasoning",
      "metrics": [
        "accuracy",
        "acc_norm"
      ],
      "num_few_shot": 0,
      "dataset_size": 2376,
      "tags": [
        "reasoning",
        "science",
        "lm_eval"
      ]
    },
    {
      "benchmark_id": "AraDiCE_boolq_lev",
      "name": "Aradice Boolq Lev",
      "description": "Aradice Boolq Lev evaluation benchmark",
      "category": "general",
      "metrics": [
        "accuracy"
      ],
      "num_few_shot": 0,
      "dataset_size": 3270,
      "tags":

## Benchmark Discovery

### List All Benchmarks

In [5]:
response = api_request("GET", "/benchmarks")

if response.status_code == 200:
    benchmarks_data = response.json()
    print(f"Total benchmarks available: {benchmarks_data['total_count']}")

    # Show first 5 benchmarks
    for benchmark in benchmarks_data['benchmarks'][:5]:
        print(f"  - {benchmark['name']} ({benchmark['benchmark_id']})")
        print(f"    Category: {benchmark['category']}")
        print(f"    Provider: {benchmark['provider_id']}")

GET http://localhost:8000/api/v1/benchmarks
Status: 200
Response:
{
  "benchmarks": [
    {
      "benchmark_id": "lm_evaluation_harness::arc_easy",
      "provider_id": "lm_evaluation_harness",
      "name": "ARC Easy",
      "description": "ARC Easy evaluation benchmark - AI2 Reasoning Challenge (Easy)",
      "category": "reasoning",
      "metrics": [
        "accuracy",
        "acc_norm"
      ],
      "num_few_shot": 0,
      "dataset_size": 2376,
      "tags": [
        "reasoning",
        "science",
        "lm_eval"
      ]
    },
    {
      "benchmark_id": "lm_evaluation_harness::AraDiCE_boolq_lev",
      "provider_id": "lm_evaluation_harness",
      "name": "Aradice Boolq Lev",
      "description": "Aradice Boolq Lev evaluation benchmark",
      "category": "general",
      "metrics": [
        "accuracy"
      ],
      "num_few_shot": 0,
      "dataset_size": 3270,
      "tags": [
        "general",
        "lm_eval"
      ]
    },
    {
      "benchmark_id": "lm_evaluat
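
In the responses shown, `/benchmarks` returns IDs of the form `provider_id::benchmark_id`, while the provider endpoints return the bare benchmark ID. A sketch for converting between the two, assuming `::` is always the separator (inferred from the sample output, not from a specification); both function names are hypothetical:

```python
from typing import Tuple

SEPARATOR = "::"  # inferred from the sample IDs above, not from a documented spec


def split_benchmark_id(qualified_id: str) -> Tuple[str, str]:
    """Split 'provider_id::benchmark_id' into its two parts."""
    provider_id, sep, benchmark_id = qualified_id.partition(SEPARATOR)
    if not (sep and provider_id and benchmark_id):
        raise ValueError(
            f"Expected 'provider_id{SEPARATOR}benchmark_id', got {qualified_id!r}"
        )
    return provider_id, benchmark_id


def qualify_benchmark_id(provider_id: str, benchmark_id: str) -> str:
    """Build the qualified form from a provider ID and a bare benchmark ID."""
    return f"{provider_id}{SEPARATOR}{benchmark_id}"


print(split_benchmark_id("lm_evaluation_harness::arc_easy"))
```

The split form matches the path parameters used by `/evaluations/benchmarks/{provider_id}/{benchmark_id}` later in this notebook.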

### Filter Benchmarks by Category

In [6]:
response = api_request("GET", "/benchmarks", params={"category": "math"})

if response.status_code == 200:
    math_benchmarks = response.json()
    print(f"Math benchmarks: {math_benchmarks['total_count']}")
    for benchmark in math_benchmarks['benchmarks']:
        print(f"  - {benchmark['name']}: {benchmark['description']}")

GET http://localhost:8000/api/v1/benchmarks
Status: 200
Response:
{
  "benchmarks": [
    {
      "benchmark_id": "lm_evaluation_harness::AraDiCE_ArabicMMLU_primary_stem_math_egy",
      "provider_id": "lm_evaluation_harness",
      "name": "Aradice Arabicmmlu Primary Stem Math Egy",
      "description": "Aradice Arabicmmlu Primary Stem Math Egy evaluation benchmark",
      "category": "math",
      "metrics": [
        "exact_match",
        "accuracy"
      ],
      "num_few_shot": 0,
      "dataset_size": 14042,
      "tags": [
        "math",
        "lm_eval"
      ]
    },
    {
      "benchmark_id": "lm_evaluation_harness::arabic_leaderboard_arabic_mmlu_college_mathematics_light",
      "provider_id": "lm_evaluation_harness",
      "name": "Arabic Leaderboard Arabic Mmlu College Mathematics Light",
      "description": "Arabic Leaderboard Arabic Mmlu College Mathematics Light evaluation benchmark",
      "category": "math",
      "metrics": [
        "exact_match",
        "accu

### Get Provider-Specific Benchmarks

In [7]:
provider_id = "lm_evaluation_harness"
response = api_request("GET", f"/providers/{provider_id}/benchmarks")

if response.status_code == 200:
    benchmarks = response.json()
    print(f"Benchmarks for {provider_id}: {len(benchmarks)}")

    # Group by category
    categories = {}
    for benchmark in benchmarks:
        # Create the list on first use, then append the benchmark name
        categories.setdefault(benchmark['category'], []).append(benchmark['name'])

    for category, names in categories.items():
        print(f"\n{category.title()}: {len(names)} benchmarks")
        print(f"  Examples: {', '.join(names[:3])}")

GET http://localhost:8000/api/v1/providers/lm_evaluation_harness/benchmarks
Status: 200
Response:
[
  {
    "benchmark_id": "arc_easy",
    "provider_id": "lm_evaluation_harness",
    "provider_name": "LM Evaluation Harness",
    "name": "ARC Easy",
    "description": "ARC Easy evaluation benchmark - AI2 Reasoning Challenge (Easy)",
    "category": "reasoning",
    "metrics": [
      "accuracy",
      "acc_norm"
    ],
    "num_few_shot": 0,
    "dataset_size": 2376,
    "tags": [
      "reasoning",
      "science",
      "lm_eval"
    ],
    "provider_type": "builtin"
  },
  {
    "benchmark_id": "AraDiCE_boolq_lev",
    "provider_id": "lm_evaluation_harness",
    "provider_name": "LM Evaluation Harness",
    "name": "Aradice Boolq Lev",
    "description": "Aradice Boolq Lev evaluation benchmark",
    "category": "general",
    "metrics": [
      "accuracy"
    ],
    "num_few_shot": 0,
    "dataset_size": 3270,
    "tags": [
      "general",
      "lm_eval"
    ],
    "provider_type"

## Collections

### List Available Collections

In [8]:
response = api_request("GET", "/collections")

if response.status_code == 200:
    collections = response.json()
    print(f"Available collections: {collections['total_collections']}")

    for collection in collections['collections']:
        print(f"\n📁 {collection['name']} ({collection['collection_id']})")
        print(f"   Description: {collection['description']}")
        print(f"   Benchmarks: {len(collection['benchmarks'])}")
        for benchmark_ref in collection['benchmarks'][:3]:  # Show first 3
            print(f"     - {benchmark_ref['provider_id']}::{benchmark_ref['benchmark_id']}")

GET http://localhost:8000/api/v1/collections
Status: 200
Response:
{
  "collections": [
    {
      "collection_id": "healthcare_safety_v1",
      "name": "Healthcare Safety Collection v1",
      "description": "Comprehensive healthcare AI safety evaluation suite",
      "benchmarks": [
        {
          "provider_id": "lm_evaluation_harness",
          "benchmark_id": "truthfulqa"
        },
        {
          "provider_id": "lm_evaluation_harness",
          "benchmark_id": "pubmedqa"
        },
        {
          "provider_id": "lm_evaluation_harness",
          "benchmark_id": "medmcqa"
        },
        {
          "provider_id": "garak",
          "benchmark_id": "bias_detection"
        },
        {
          "provider_id": "garak",
          "benchmark_id": "pii_leakage"
        }
      ]
    },
    {
      "collection_id": "automotive_safety_v1",
      "name": "Automotive Safety Collection v1",
      "description": "Automotive AI safety and reliability evaluation suite",
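
A collection can mix benchmarks from several providers, as `healthcare_safety_v1` does above. A short sketch that counts a collection's benchmarks per provider, using the reference shape shown in the response; `benchmarks_per_provider` is a hypothetical helper:

```python
from collections import Counter


def benchmarks_per_provider(collection: dict) -> Counter:
    """Count how many benchmark references each provider contributes."""
    return Counter(ref["provider_id"] for ref in collection["benchmarks"])


# Benchmark references copied from the healthcare_safety_v1 collection above.
healthcare = {
    "collection_id": "healthcare_safety_v1",
    "benchmarks": [
        {"provider_id": "lm_evaluation_harness", "benchmark_id": "truthfulqa"},
        {"provider_id": "lm_evaluation_harness", "benchmark_id": "pubmedqa"},
        {"provider_id": "lm_evaluation_harness", "benchmark_id": "medmcqa"},
        {"provider_id": "garak", "benchmark_id": "bias_detection"},
        {"provider_id": "garak", "benchmark_id": "pii_leakage"},
    ],
}

for provider_id, count in benchmarks_per_provider(healthcare).items():
    print(f"{provider_id}: {count}")
```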


## Model Server Management

### List All Model Servers

In [9]:
response = api_request("GET", "/servers")

if response.status_code == 200:
    servers_data = response.json()
    print(f"Total servers: {servers_data['total_servers']}")
    print(f"Runtime servers: {len(servers_data.get('runtime_servers', []))}")
    
    print("\n📋 Model Servers:")
    for server in servers_data.get('servers', []):
        print(f"  - {server['server_id']}")
        print(f"    Type: {server['server_type']}")
        print(f"    Base URL: {server['base_url']}")
        print(f"    Models: {server['model_count']}")
        print(f"    Status: {server['status']}")

GET http://localhost:8000/api/v1/servers
Status: 200
Response:
{
  "servers": [
    {
      "server_id": "vllm",
      "server_type": "vllm",
      "base_url": "http://vllm-server.test.svc.cluster.local:8000",
      "model_count": 1,
      "status": "active",
      "tags": [
        "runtime"
      ],
      "created_at": "2025-11-08T23:18:51.920032"
    }
  ],
  "total_servers": 1,
  "runtime_servers": [
    {
      "server_id": "vllm",
      "server_type": "vllm",
      "base_url": "http://vllm-server.test.svc.cluster.local:8000",
      "model_count": 1,
      "status": "active",
      "tags": [
        "runtime"
      ],
      "created_at": "2025-11-08T23:18:51.920032"
    }
  ]
}
--------------------------------------------------
Total servers: 1
Runtime servers: 1

📋 Model Servers:
  - vllm
    Type: vllm
    Base URL: http://vllm-server.test.svc.cluster.local:8000
    Models: 1
    Status: active


### List Only Active Servers

In [10]:
# requests would serialize a Python False as "False"; send the lowercase form explicitly
response = api_request("GET", "/servers", params={"include_inactive": "false"})

if response.status_code == 200:
    servers_data = response.json()
    print(f"Active servers: {servers_data['total_servers']}")
    for server in servers_data.get('servers', []):
        print(f"  - {server['server_id']} - {server['status']}")

GET http://localhost:8000/api/v1/servers
Status: 200
Response:
{
  "servers": [
    {
      "server_id": "vllm",
      "server_type": "vllm",
      "base_url": "http://vllm-server.test.svc.cluster.local:8000",
      "model_count": 1,
      "status": "active",
      "tags": [
        "runtime"
      ],
      "created_at": "2025-11-08T23:18:54.380045"
    }
  ],
  "total_servers": 1,
  "runtime_servers": [
    {
      "server_id": "vllm",
      "server_type": "vllm",
      "base_url": "http://vllm-server.test.svc.cluster.local:8000",
      "model_count": 1,
      "status": "active",
      "tags": [
        "runtime"
      ],
      "created_at": "2025-11-08T23:18:54.380045"
    }
  ]
}
--------------------------------------------------
Active servers: 1
  - vllm - active


### Get Server by ID

In [11]:
# Get details for a specific model server
server_id = "vllm"  # Replace with an actual server ID from your system
response = api_request("GET", f"/servers/{server_id}")

if response.status_code == 200:
    server = response.json()
    print(f"Server ID: {server['server_id']}")
    print(f"Type: {server['server_type']}")
    print(f"Base URL: {server['base_url']}")
    print(f"Status: {server['status']}")
    
    print(f"\n📦 Models on this server ({len(server['models'])}):")
    for model in server['models']:
        print(f"  - {model['model_name']}")
        print(f"    Status: {model['status']}")
        if model.get('description'):
            print(f"    Description: {model['description']}")
    
    if server.get('tags'):
        print(f"\nTags: {', '.join(server['tags'])}")
elif response.status_code == 404:
    print(f"❌ Server '{server_id}' not found")

GET http://localhost:8000/api/v1/servers/vllm
Status: 200
Response:
{
  "server_id": "vllm",
  "server_type": "vllm",
  "base_url": "http://vllm-server.test.svc.cluster.local:8000",
  "api_key_required": true,
  "models": [
    {
      "model_name": "vllm",
      "description": null,
      "capabilities": null,
      "config": null,
      "status": "active",
      "tags": [
        "runtime"
      ]
    }
  ],
  "server_config": null,
  "status": "active",
  "tags": [
    "runtime"
  ],
  "created_at": "2025-11-08T23:18:56.893617",
  "updated_at": "2025-11-08T23:18:56.893618"
}
--------------------------------------------------
Server ID: vllm
Type: vllm
Base URL: http://vllm-server.test.svc.cluster.local:8000
Status: active

📦 Models on this server (1):
  - vllm
    Status: active

Tags: runtime


### Get Model by Server and Name

In [12]:
# Look up a model by fetching its server and searching the server's model list
server_id = "vllm"
model_name = "tinyllama"  # Replace with an actual model name

response = api_request("GET", f"/servers/{server_id}")

if response.status_code == 200:
    server = response.json()
    # Pick the first model whose name matches, or None if there is no match
    model = next((m for m in server['models'] if m['model_name'] == model_name), None)
    
    if model:
        print(f"✅ Found model: {model['model_name']}")
        print(f"   Server: {server['server_id']}")
        print(f"   Status: {model['status']}")
        if model.get('description'):
            print(f"   Description: {model['description']}")
        if model.get('capabilities'):
            print(f"   Capabilities: {model['capabilities']}")
    else:
        print(f"❌ Model '{model_name}' not found on server '{server_id}'")
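elif response.status_code == 404:
    print(f"❌ Server '{server_id}' not found")
else:
    print(f"❌ Failed to fetch server '{server_id}': {response.text}")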

GET http://localhost:8000/api/v1/servers/vllm
Status: 200
Response:
{
  "server_id": "vllm",
  "server_type": "vllm",
  "base_url": "http://vllm-server.test.svc.cluster.local:8000",
  "api_key_required": true,
  "models": [
    {
      "model_name": "vllm",
      "description": null,
      "capabilities": null,
      "config": null,
      "status": "active",
      "tags": [
        "runtime"
      ]
    }
  ],
  "server_config": null,
  "status": "active",
  "tags": [
    "runtime"
  ],
  "created_at": "2025-11-08T23:18:59.483593",
  "updated_at": "2025-11-08T23:18:59.483594"
}
--------------------------------------------------
❌ Model 'tinyllama' not found on server 'vllm'
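
The lookup above fails because the only model registered on this server is named `vllm`. A sketch of a lookup that reports the available names when there is no match, using the server payload shape shown above; `find_model` is a hypothetical helper:

```python
def find_model(server: dict, model_name: str) -> dict:
    """Return the matching model entry, or raise with the names that do exist."""
    models = server.get("models", [])
    for model in models:
        if model["model_name"] == model_name:
            return model
    available = ", ".join(m["model_name"] for m in models) or "none"
    raise LookupError(
        f"Model {model_name!r} not found on server {server['server_id']!r}; "
        f"available: {available}"
    )


# Trimmed copy of the server payload shown above.
sample_server = {
    "server_id": "vllm",
    "models": [{"model_name": "vllm", "status": "active", "tags": ["runtime"]}],
}

try:
    find_model(sample_server, "tinyllama")
except LookupError as exc:
    print(exc)
```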


### Register a New Model Server

In [None]:
# Register a model server with models
new_server = {
    "server_id": "groq-server",
    "server_type": "openai-compatible",
    "base_url": "https://api.groq.com/openai/v1",
    "api_key_required": True,
    "models": [
        {
            "model_name": "llama-3.1-70b",
            "description": "Meta's Llama 3.1 70B model",
            "status": "active",
            "tags": ["groq", "llama", "70b"]
        },
        {
            "model_name": "llama-3.1-8b",
            "description": "Meta's Llama 3.1 8B model",
            "status": "active",
            "tags": ["groq", "llama", "8b"]
        }
    ],
    "server_config": {
        "temperature": 0.7,
        "max_tokens": 2048,
        "timeout": 60,
        "retry_attempts": 3
    },
    "status": "active",
    "tags": ["groq", "openai-compatible", "fast"]
}

print("📝 Registering new model server...")
print_json(new_server)

response = api_request("POST", "/servers", json=new_server)

if response.status_code == 201:
    registered_server = response.json()
    print("✅ Model server registered successfully!")
    print(f"Server ID: {registered_server['server_id']}")
    print(f"Models: {len(registered_server['models'])}")
    print(f"Created at: {registered_server.get('created_at', 'N/A')}")
else:
    print(f"❌ Failed to register server: {response.text}")

### Register a vLLM Server

In [None]:
# Register a vLLM server
vllm_server = {
    "server_id": "local-vllm",
    "server_type": "vllm",
    # Must point at the vLLM server; change the port if Eval Hub itself occupies localhost:8000
    "base_url": "http://localhost:8000",
    "api_key_required": False,
    "models": [
        {
            "model_name": "llama-2-7b",
            "description": "Llama 2 7B running on local vLLM server",
            "status": "active",
            "tags": ["vllm", "local", "llama-2"]
        }
    ],
    "status": "active",
    "tags": ["vllm", "local"]
}

print("📝 Registering vLLM server...")
response = api_request("POST", "/servers", json=vllm_server)

if response.status_code == 201:
    print(f"✅ vLLM server registered: {response.json()['server_id']}")
else:
    print("⚠️ Registration failed; the server ID may already exist")
    print(f"Response: {response.text}")
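
Registration may fail when the server ID is already taken. One way to make the cell safe to re-run is to check `GET /servers/{server_id}` first and only POST when the server is missing. A sketch under that assumption; `register_if_missing` is a hypothetical helper, and `request_fn` stands for any callable with the same signature as `api_request` from the setup cell:

```python
from typing import Any, Callable


def register_if_missing(request_fn: Callable[..., Any], server: dict) -> Any:
    """POST the server only when GET /servers/{id} does not already return 200.

    Returns the response of whichever request was made last.
    """
    existing = request_fn("GET", f"/servers/{server['server_id']}")
    if existing.status_code == 200:
        return existing  # already registered; nothing to do
    return request_fn("POST", "/servers", json=server)


# Usage inside this notebook (requires the running service):
# response = register_if_missing(api_request, vllm_server)
```

This is not atomic: another client could register the same ID between the two requests, so the POST result still needs to be checked.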

### Update a Model Server

In [None]:
# Update server details
server_id = "groq-server"  # Replace with an actual server ID

update_request = {
    "status": "active",
    "tags": ["groq", "openai-compatible", "fast", "updated"]
}

print(f"📝 Updating server: {server_id}")
print_json(update_request)

response = api_request("PUT", f"/servers/{server_id}", json=update_request)

if response.status_code == 200:
    updated_server = response.json()
    print("✅ Server updated successfully!")
    print(f"Server ID: {updated_server['server_id']}")
    print(f"Tags: {', '.join(updated_server.get('tags', []))}")
elif response.status_code == 404:
    print(f"❌ Server '{server_id}' not found")
else:
    print(f"❌ Failed to update server: {response.text}")

### Delete a Model Server

In [None]:
# Delete a model server (runtime servers cannot be deleted via API)
server_id = "groq-server"  # Replace with an actual server ID

print(f"🗑️ Deleting server: {server_id}")
response = api_request("DELETE", f"/servers/{server_id}")

if response.status_code == 200:
    result = response.json()
    print(f"✅ {result.get('message', 'Server deleted successfully')}")
elif response.status_code == 404:
    print(f"❌ Server '{server_id}' not found")
elif response.status_code == 400:
    print("❌ Cannot delete runtime server (configured via environment variables)")
    print(f"Response: {response.text}")
else:
    print(f"❌ Failed to delete server: {response.text}")
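
The cell above treats a 400 as an attempt to delete a runtime server. The `GET /servers` response shown earlier lists runtime servers separately, so the check can be done before sending the DELETE. A sketch, assuming the `runtime_servers` list identifies exactly the servers the API refuses to delete (inferred from the sample output); `deletable_server_ids` is a hypothetical helper:

```python
def deletable_server_ids(servers_data: dict) -> list:
    """IDs listed under 'servers' that do not also appear under 'runtime_servers'."""
    runtime_ids = {s["server_id"] for s in servers_data.get("runtime_servers", [])}
    return [
        s["server_id"]
        for s in servers_data.get("servers", [])
        if s["server_id"] not in runtime_ids
    ]


sample_servers = {
    "servers": [
        {"server_id": "vllm", "tags": ["runtime"]},
        {"server_id": "groq-server", "tags": ["groq"]},  # hypothetical registered server
    ],
    "runtime_servers": [{"server_id": "vllm", "tags": ["runtime"]}],
}

print(deletable_server_ids(sample_servers))
```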

### Reload Runtime Servers

In [None]:
# Reload model servers configured via environment variables
print("🔄 Reloading runtime servers from environment variables...")
response = api_request("POST", "/servers/reload")

if response.status_code == 200:
    result = response.json()
    print(f"✅ {result.get('message', 'Runtime servers reloaded successfully')}")
    
    # List servers again to see any new runtime servers
    print("\n📋 Updated server list:")
    list_response = api_request("GET", "/servers")
    if list_response.status_code == 200:
        servers_data = list_response.json()
        print(f"Total servers: {servers_data['total_servers']}")
        print(f"Runtime servers: {len(servers_data.get('runtime_servers', []))}")
else:
    print(f"❌ Failed to reload servers: {response.text}")

## Basic Evaluation Examples

### Single Benchmark Evaluation from Builtin Provider (Simplified API)

In [13]:
# Example: Run a single benchmark using the simplified API (Llama Stack compatible)
provider_id = "lm_evaluation_harness"
benchmark_id = "arc_easy"

single_benchmark_request = {
    "model": {
        "server": "vllm",
        "name": "tinyllama"
    },
    "model_configuration": {
        "temperature": 0.0,
        "max_tokens": 512
    },
    "timeout_minutes": 30,
    "retry_attempts": 1,
    "limit": 100,  # Limit to 100 samples for faster execution
    "num_fewshot": 0,
    "experiment_name": "Single Benchmark - ARC Easy",
    "tags": {
        "example_type": "single_benchmark",
        "provider": "lm_evaluation_harness",
        "benchmark": "arc_easy"
    }
}

print("📝 Creating single benchmark evaluation request...")
print(f"Provider ID: {provider_id}")
print(f"Benchmark ID: {benchmark_id}")
print_json(single_benchmark_request)

response = api_request("POST", f"/evaluations/benchmarks/{provider_id}/{benchmark_id}", json=single_benchmark_request)

if response.status_code == 202:
    evaluation_response = response.json()
    request_id = evaluation_response["request_id"]
    print("✅ Single benchmark evaluation created successfully!")
    print(f"Request ID: {request_id}")
    print(f"Status: {evaluation_response['status']}")
    print(f"Experiment URL: {evaluation_response.get('experiment_url', 'N/A')}")
else:
    print("❌ Failed to create evaluation")
    print(f"Error: {response.text}")

📝 Creating single benchmark evaluation request...
Provider ID: lm_evaluation_harness
Benchmark ID: arc_easy
{
  "model": {
    "server": "vllm",
    "name": "tinyllama"
  },
  "model_configuration": {
    "temperature": 0.0,
    "max_tokens": 512
  },
  "timeout_minutes": 30,
  "retry_attempts": 1,
  "limit": 100,
  "num_fewshot": 0,
  "experiment_name": "Single Benchmark - ARC Easy",
  "tags": {
    "example_type": "single_benchmark",
    "provider": "lm_evaluation_harness",
    "benchmark": "arc_easy"
  }
}
POST http://localhost:8000/api/v1/evaluations/benchmarks/lm_evaluation_harness/arc_easy
Status: 202
Response:
{
  "request_id": "a24b471e-edcb-4130-8b74-07219d25e108",
  "status": "pending",
  "total_evaluations": 0,
  "completed_evaluations": 0,
  "failed_evaluations": 0,
  "results": [],
  "aggregated_metrics": {},
  "experiment_url": "http://localhost:5000/#/experiments/exp_c6a74a5e",
  "created_at": "2025-11-08T23:19:07.245302Z",
  "updated_at": "2025-11-08T23:19:07.245493",
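
The 202 response only confirms that the evaluation was queued (`"status": "pending"`). A sketch of a polling loop for waiting on the result. The status endpoint is not shown in this section, so `fetch_status` is a caller-supplied function that returns the evaluation payload for a request ID; the terminal status names are also assumptions, since only `pending` appears in the output above:

```python
import time
from typing import Callable

# Assumed terminal status names; adjust to whatever the service actually reports.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_evaluation(
    fetch_status: Callable[[str], dict],
    request_id: str,
    poll_seconds: float = 5.0,
    timeout_seconds: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the evaluation reaches a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        payload = fetch_status(request_id)
        if payload.get("status") in TERMINAL_STATUSES:
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Evaluation {request_id} still {payload.get('status')!r} "
                f"after {timeout_seconds} seconds"
            )
        sleep(poll_seconds)
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays.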
 

### Simple Evaluation with Risk Category

In [None]:
# Create a simple evaluation request using risk category
evaluation_request = {
    "request_id": str(uuid4()),
    "experiment_name": "Simple Risk-Based Evaluation",
    "evaluations": [
        {
            "name": "GPT-4 Mini Low Risk Evaluation",
            "description": "Basic evaluation using low risk benchmarks",
            "model": {
                "server": "default",
                "name": "default"
            },
            "model_configuration": {
                "temperature": 0.0,
                "max_tokens": 512
            },
            "risk_category": "low",
            "timeout_minutes": 30,
            "retry_attempts": 1
        }
    ],
    "tags": {
        "example_type": "risk_category",
        "complexity": "simple"
    }
}

print("📝 Creating evaluation request...")
print_json(evaluation_request)

response = api_request("POST", "/evaluations", json=evaluation_request)

if response.status_code == 202:
    evaluation_response = response.json()
    request_id = evaluation_response["request_id"]
    print("✅ Evaluation created successfully!")
    print(f"Request ID: {request_id}")
    print(f"Status: {evaluation_response['status']}")
    print(f"Experiment URL: {evaluation_response.get('experiment_url', 'N/A')}")
else:
    print("❌ Failed to create evaluation")
    print(f"Error: {response.text}")

### Evaluation with Explicit Backend Configuration

In [None]:
# Create an evaluation with explicit backend configuration
explicit_evaluation = {
    "request_id": str(uuid4()),
    "experiment_name": "Explicit Backend Configuration",
    "evaluations": [
        {
            "name": "LM-Eval Harness Evaluation",
            "description": "Evaluation with explicit lm-evaluation-harness configuration",
            "model": {
                "server": "default",
                "name": "default"
            },
            "model_configuration": {
                "temperature": 0.1,
                "max_tokens": 256,
                "top_p": 0.95
            },
            "backends": [
                {
                    "name": "lm-eval-backend",
                    "type": "lm-evaluation-harness",
                    "config": {
                        "batch_size": 1,
                        "device": "cpu"
                    },
                    "benchmarks": [
                        {
                            "name": "arc_easy",
                            "tasks": ["arc_easy"],
                            "config": {
                                "num_fewshot": 5,
                                "limit": 50
                            }
                        },
                        {
                            "name": "hellaswag",
                            "tasks": ["hellaswag"],
                            "config": {
                                "num_fewshot": 10,
                                "limit": 100
                            }
                        }
                    ]
                }
            ],
            "timeout_minutes": 45,
            "retry_attempts": 2
        }
    ],
    "tags": {
        "example_type": "explicit_backend",
        "complexity": "intermediate"
    }
}

print("📝 Creating evaluation with explicit backend...")
response = api_request("POST", "/evaluations", json=explicit_evaluation)

if response.status_code == 202:
    explicit_response = response.json()
    explicit_request_id = explicit_response["request_id"]
    print("✅ Explicit evaluation created!")
    print(f"Request ID: {explicit_request_id}")
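
The benchmark entries in the request above all share one shape: a name, a task list, and a config with `num_fewshot` and `limit`. A sketch of a small builder that cuts down on the nesting; `benchmark_entry` is a hypothetical helper, and it assumes one task per entry, as in the examples here:

```python
from typing import Optional


def benchmark_entry(
    task: str, num_fewshot: int, limit: int, name: Optional[str] = None
) -> dict:
    """Build one entry for a backend's 'benchmarks' list."""
    return {
        "name": name or task,  # the examples above reuse the task name
        "tasks": [task],
        "config": {"num_fewshot": num_fewshot, "limit": limit},
    }


benchmarks = [
    benchmark_entry("arc_easy", num_fewshot=5, limit=50),
    benchmark_entry("hellaswag", num_fewshot=10, limit=100),
]
print(benchmarks[0])
```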

## NeMo Evaluator Integration

### Single NeMo Evaluator Container

In [None]:
# Example with single NeMo Evaluator container
nemo_single_evaluation = {
    "request_id": str(uuid4()),
    "experiment_name": "NeMo Evaluator Single Container",
    "evaluations": [
        {
            "name": "GPT-4 via NeMo Evaluator",
            "description": "Remote evaluation using NeMo Evaluator container",
            "model": {
                "server": "default",
                "name": "default"
            },
            "model_configuration": {
                "temperature": 0.0,
                "max_tokens": 512,
                "top_p": 0.95
            },
            "backends": [
                {
                    "name": "remote-nemo-evaluator",
                    "type": "nemo-evaluator",
                    "config": {
                        "endpoint": "localhost",
                        "port": 3825,
                        "model_endpoint": "https://api.openai.com/v1/chat/completions",
                        "endpoint_type": "chat",
                        "api_key_env": "OPENAI_API_KEY",
                        "timeout_seconds": 1800,
                        "max_retries": 2,
                        "verify_ssl": False,
                        "framework_name": "eval-hub-example",
                        "parallelism": 1,
                        "limit_samples": 25,
                        "temperature": 0.0,
                        "top_p": 0.95
                    },
                    "benchmarks": [
                        {
                            "name": "mmlu_pro_sample",
                            "tasks": ["mmlu_pro"],
                            "config": {
                                "limit": 25,
                                "num_fewshot": 5
                            }
                        }
                    ]
                }
            ],
            "timeout_minutes": 60,
            "retry_attempts": 1
        }
    ],
    "tags": {
        "example_type": "nemo_evaluator_single",
        "complexity": "advanced",
        "backend": "remote_container"
    }
}

print("📝 Creating NeMo Evaluator evaluation...")
print("Note: This requires a running NeMo Evaluator container on localhost:3825")

response = api_request("POST", "/evaluations", json=nemo_single_evaluation)

if response.status_code == 202:
    nemo_response = response.json()
    nemo_request_id = nemo_response["request_id"]
    print("✅ NeMo evaluation created!")
    print(f"Request ID: {nemo_request_id}")
else:
    print("⚠️ NeMo evaluation failed (container may not be running)")
    print(f"Response: {response.text}")
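
Since this example depends on a container listening on `localhost:3825`, it can help to test the port before submitting. A sketch using only the standard library; `is_port_open` is a hypothetical helper, and a successful TCP connection only shows that something is listening there, not that it is a NeMo Evaluator:

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures
        return False


if __name__ == "__main__":
    if is_port_open("localhost", 3825):
        print("Port 3825 is accepting connections")
    else:
        print("Nothing is listening on localhost:3825")
```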

### Multi-Container NeMo Evaluator Setup

In [None]:
# Example with multiple specialized NeMo Evaluator containers
nemo_multi_evaluation = {
    "request_id": str(uuid4()),
    "experiment_name": "Multi-Container NeMo Evaluation",
    "evaluations": [
        {
            "name": "Distributed LLaMA Evaluation",
            "description": "Multi-container evaluation across specialized endpoints",
            "model": {
                "server": "default",
                "name": "default"
            },
            "model_configuration": {
                "temperature": 0.1,
                "max_tokens": 512,
                "top_p": 0.95
            },
            "backends": [
                {
                    "name": "academic-evaluator",
                    "type": "nemo-evaluator",
                    "config": {
                        "endpoint": "academic-eval.example.com",
                        "port": 3825,
                        "model_endpoint": "https://api.groq.com/openai/v1/chat/completions",
                        "endpoint_type": "chat",
                        "api_key_env": "GROQ_API_KEY",
                        "timeout_seconds": 3600,
                        "framework_name": "eval-hub-academic",
                        "parallelism": 2
                    },
                    "benchmarks": [
                        {
                            "name": "mmlu_pro",
                            "tasks": ["mmlu_pro"],
                            "config": {"limit": 100, "num_fewshot": 5}
                        },
                        {
                            "name": "arc_challenge",
                            "tasks": ["arc_challenge"],
                            "config": {"limit": 200, "num_fewshot": 25}
                        }
                    ]
                },
                {
                    "name": "math-evaluator",
                    "type": "nemo-evaluator",
                    "config": {
                        "endpoint": "math-eval.example.com",
                        "port": 3825,
                        "model_endpoint": "https://api.groq.com/openai/v1/chat/completions",
                        "endpoint_type": "chat",
                        "api_key_env": "GROQ_API_KEY",
                        "temperature": 0.0,
                        "parallelism": 1,
                        "framework_name": "eval-hub-math"
                    },
                    "benchmarks": [
                        {
                            "name": "gsm8k",
                            "tasks": ["gsm8k"],
                            "config": {"limit": 100, "num_fewshot": 8}
                        },
                        {
                            "name": "math",
                            "tasks": ["hendrycks_math"],
                            "config": {"limit": 50, "num_fewshot": 4}
                        }
                    ]
                }
            ],
            "timeout_minutes": 120,
            "retry_attempts": 2
        }
    ],
    "tags": {
        "example_type": "nemo_evaluator_multi",
        "complexity": "expert",
        "backend": "distributed_containers"
    }
}

print("📝 Creating multi-container NeMo evaluation...")
print("Note: This is a hypothetical example with multiple remote containers")
print_json(nemo_multi_evaluation)

## Evaluation Status Monitoring

### Check Evaluation Status

In [None]:
# Function to check evaluation status
def check_evaluation_status(request_id: str):
    response = api_request("GET", f"/evaluations/{request_id}")

    if response.status_code == 200:
        status_data = response.json()
        print(f"📊 Evaluation Status for {request_id}")
        print(f"Status: {status_data['status']}")
        print(f"Progress: {status_data.get('progress_percentage', 0):.1f}%")
        print(f"Total evaluations: {status_data.get('total_evaluations', 0)}")
        print(f"Completed: {status_data.get('completed_evaluations', 0)}")
        print(f"Failed: {status_data.get('failed_evaluations', 0)}")

        if status_data.get('results'):
            print(f"Results available: {len(status_data['results'])}")

        return status_data
    else:
        print(f"❌ Failed to get status: {response.text}")
        return None

# Check status of a previously created evaluation (if one exists)
if 'request_id' in globals():
    check_evaluation_status(request_id)
else:
    print("No evaluation request_id available to check")

### Monitor Evaluation Progress

In [None]:
# Function to monitor evaluation until completion
def monitor_evaluation(request_id: str, max_wait_time: int = 300):
    """Monitor an evaluation until completion or timeout."""
    start_time = time.time()

    while time.time() - start_time < max_wait_time:
        status_data = check_evaluation_status(request_id)

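        if not status_data:
            # The status lookup failed, so stop here instead of reporting a timeout
            print("❌ Could not retrieve status, stopping monitoring")
            return None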

        status = status_data['status']

        if status in ['completed', 'failed', 'cancelled']:
            print(f"🏁 Evaluation {status}!")

            if status == 'completed' and status_data.get('results'):
                print("\n📊 Results Summary:")
                for result in status_data['results'][:3]:  # Show first 3 results
                    print(f"  - {result['benchmark_name']}: {result['status']}")
                    if result.get('metrics'):
                        for metric, value in list(result['metrics'].items())[:2]:
                            print(f"    {metric}: {value}")

            return status_data

        print(f"⏳ Still {status}, waiting...")
        time.sleep(10)

    print(f"⏰ Monitoring timed out after {max_wait_time} seconds")
    return None

# Example usage (uncomment if you have a running evaluation)
# monitor_evaluation(request_id)
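
`monitor_evaluation` polls at a fixed 10-second interval. For long-running evaluations, a growing interval keeps the number of status requests down. The following is a minimal sketch of that idea, assuming the same terminal statuses (`completed`, `failed`, `cancelled`) used above. `poll_with_backoff`, `fetch_status`, and all of its parameters are hypothetical names for illustration and are not part of the Eval Hub API.

```python
import time
from typing import Callable, Optional

# Terminal statuses, matching the ones checked in monitor_evaluation.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def poll_with_backoff(
    fetch_status: Callable[[], Optional[dict]],
    max_wait_time: float = 300.0,
    initial_delay: float = 2.0,
    max_delay: float = 60.0,
    factor: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Poll fetch_status() until a terminal status, a failed lookup, or a timeout.

    Hypothetical helper: the delay starts at initial_delay and is multiplied
    by factor after each poll, capped at max_delay.
    """
    deadline = time.monotonic() + max_wait_time
    delay = initial_delay

    while True:
        status_data = fetch_status()
        if status_data is None:
            return None  # lookup failed, stop polling
        if status_data.get("status") in TERMINAL_STATUSES:
            return status_data

        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None  # timed out before reaching a terminal status
        # Never sleep past the deadline.
        sleep(min(delay, remaining))
        delay = min(delay * factor, max_delay)
```

With the helper from this notebook, a call could look like `poll_with_backoff(lambda: check_evaluation_status(request_id))`. The `sleep` parameter exists only so the loop can be exercised without real waiting.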

## List All Evaluations

In [None]:
response = api_request("GET", "/evaluations")

if response.status_code == 200:
    evaluations = response.json()
    print(f"📋 Active evaluations: {len(evaluations)}")

    for eval_resp in evaluations:
        print(f"\n🔍 {eval_resp['request_id']}")
        print(f"   Status: {eval_resp['status']}")
        print(f"   Progress: {eval_resp.get('progress_percentage', 0):.1f}%")
        print(f"   Created: {eval_resp['created_at']}")

## System Metrics

In [None]:
response = api_request("GET", "/metrics/system")

if response.status_code == 200:
    metrics = response.json()
    print("📊 System Metrics:")
    print(f"  Active evaluations: {metrics['active_evaluations']}")
    print(f"  Running tasks: {metrics['running_tasks']}")
    print(f"  Total requests: {metrics['total_requests']}")

    if metrics.get('status_breakdown'):
        print("\n  Status breakdown:")
        for status, count in metrics['status_breakdown'].items():
            print(f"    {status}: {count}")

    if metrics.get('memory_usage'):
        print("\n  Memory usage:")
        print(f"    Active evaluations: {metrics['memory_usage']['active_evaluations_mb']:.1f} MB")

## Evaluation Management

### Cancel an Evaluation

In [None]:
# Function to cancel an evaluation
def cancel_evaluation(request_id: str):
    response = api_request("DELETE", f"/evaluations/{request_id}")

    if response.status_code == 200:
        result = response.json()
        print(f"✅ {result['message']}")
        return True
    else:
        print(f"❌ Failed to cancel: {response.text}")
        return False

# Example usage (uncomment if you want to cancel an evaluation)
# cancel_evaluation(request_id)

## Error Handling Examples

### Invalid Request Handling

In [None]:
# Example of invalid request to demonstrate error handling
invalid_request = {
    "request_id": "invalid-uuid-format",
    "evaluations": [
        {
            "name": "",  # Invalid: empty name
            "model": {
                "server": "",  # Invalid: empty server
                "name": ""  # Invalid: empty model name
            },
            "backends": []  # Invalid: no backends
        }
    ]
}

print("📝 Testing error handling with invalid request...")
response = api_request("POST", "/evaluations", json=invalid_request)

if response.status_code >= 400:
    print("✅ Error handling working correctly")
    error_data = response.json()
    print(f"Status code: {response.status_code}")
    print(f"Error message: {error_data.get('detail', 'Unknown error')}")
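
The cell above prints `detail` as returned. If the service reports validation failures as a list of structured entries, flattening them into one line per problem is easier to read. The sketch below assumes, without confirmation from this notebook, that `detail` is either a plain string or a list of objects with `loc` and `msg` keys. `format_error_detail` and the sample payload are hypothetical and are not taken from a real Eval Hub response.

```python
from typing import Any, List


def format_error_detail(detail: Any) -> List[str]:
    """Flatten an error `detail` value into printable lines.

    Assumes (not confirmed by this notebook) that `detail` is either a plain
    string or a list of dicts with `loc` and `msg` keys.
    """
    if detail is None:
        return ["Unknown error"]
    if isinstance(detail, str):
        return [detail]
    if isinstance(detail, list):
        lines = []
        for item in detail:
            if isinstance(item, dict):
                # Join the location path, e.g. ["body", "name"] -> "body.name"
                location = ".".join(str(part) for part in item.get("loc", []))
                message = item.get("msg", "Unknown error")
                lines.append(f"{location}: {message}" if location else str(message))
            else:
                lines.append(str(item))
        return lines
    # Any other shape: fall back to its string form
    return [str(detail)]


# Illustrative payload only; not captured from a real Eval Hub response.
sample_detail = [
    {"loc": ["body", "evaluations", 0, "name"], "msg": "name must not be empty"},
]
for line in format_error_detail(sample_detail):
    print(f"  - {line}")
```

In the cell above, this would be called as `format_error_detail(error_data.get('detail'))`.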

### Non-existent Resource Handling

In [None]:
# Test accessing non-existent evaluation
fake_request_id = str(uuid4())
print(f"🔍 Testing access to non-existent evaluation: {fake_request_id}")

response = api_request("GET", f"/evaluations/{fake_request_id}")

if response.status_code == 404:
    print("✅ 404 handling working correctly")
    error_data = response.json()
    print(f"Error: {error_data['detail']}")

## Advanced Examples

### Batch Evaluation Requests

In [None]:
# Create multiple evaluations for comparison
batch_requests = []

models_to_compare = ["gpt-4o-mini", "gpt-3.5-turbo"]
risk_levels = ["low", "medium"]

for model in models_to_compare:
    for risk in risk_levels:
        batch_request = {
            "request_id": str(uuid4()),
            "experiment_name": f"Batch Comparison - {model} - {risk} risk",
            "evaluations": [
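                {
                    "name": f"{model} {risk} risk evaluation",
                    # Note: every request references the same "default" model here;
                    # the compared model name is recorded only in the names and tags.
                    "model": {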
                        "server": "default",
                        "name": "default"
                    },
                    "model_configuration": {
                        "temperature": 0.0,
                        "max_tokens": 256
                    },
                    "risk_category": risk,
                    "timeout_minutes": 30
                }
            ],
            "tags": {
                "batch_id": "model_comparison_001",
                "model": model,
                "risk_level": risk
            }
        }
        batch_requests.append(batch_request)

print(f"📦 Creating {len(batch_requests)} batch evaluations...")

batch_results = []
for i, request in enumerate(batch_requests):
    print(f"\n📝 Creating batch request {i+1}/{len(batch_requests)}")
    response = api_request("POST", "/evaluations", json=request)

    if response.status_code == 202:
        created = response.json()  # parse the response body once
        batch_results.append(created)
        print(f"✅ Batch {i+1} created: {created['request_id']}")
    else:
        print(f"❌ Batch {i+1} failed")

print(f"\n📊 Successfully created {len(batch_results)} batch evaluations")

### Configuration Validation

In [None]:
# Test various configuration combinations
test_configs = [
    {
        "name": "High timeout test",
        "config": {"timeout_minutes": 120, "retry_attempts": 5},
        "expected": "success"
    },
    {
        "name": "Zero timeout test",
        "config": {"timeout_minutes": 0, "retry_attempts": 1},
        "expected": "validation_error"
    },
    {
        "name": "Negative retry test",
        "config": {"timeout_minutes": 30, "retry_attempts": -1},
        "expected": "validation_error"
    }
]

for test in test_configs:
    print(f"\n🧪 Testing: {test['name']}")

    test_request = {
        "request_id": str(uuid4()),
        "experiment_name": test['name'],
        "evaluations": [
            {
                "name": "Config test",
                "model": {
                    "server": "default",
                    "name": "default"
                },
                "risk_category": "low",
                **test['config']
            }
        ]
    }

    response = api_request("POST", "/evaluations", json=test_request)

    if test['expected'] == "success" and response.status_code == 202:
        print("✅ Test passed")
    elif test['expected'] == "validation_error" and response.status_code >= 400:
        print("✅ Validation correctly rejected invalid config")
    else:
        print(f"❌ Unexpected result: {response.status_code}")
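
The same checks can be run on the client before a request is sent, which avoids a round trip for obviously invalid values. This is a minimal sketch, assuming the rules implied by the test cases above (`timeout_minutes` must be positive and `retry_attempts` must not be negative). The server's actual validation rules are not shown in this notebook, and `precheck_evaluation_config` is a hypothetical helper.

```python
from typing import Dict, List


def precheck_evaluation_config(config: Dict[str, int]) -> List[str]:
    """Return a list of problems found in a config; an empty list means none.

    Assumed rules (inferred from the test cases above, not from the server's
    schema): timeout_minutes must be > 0 and retry_attempts must be >= 0.
    """
    problems = []
    timeout = config.get("timeout_minutes")
    retries = config.get("retry_attempts")

    if timeout is not None and timeout <= 0:
        problems.append(f"timeout_minutes must be positive, got {timeout}")
    if retries is not None and retries < 0:
        problems.append(f"retry_attempts must not be negative, got {retries}")
    return problems


# Same configs as the test cases above.
for candidate in (
    {"timeout_minutes": 120, "retry_attempts": 5},
    {"timeout_minutes": 0, "retry_attempts": 1},
    {"timeout_minutes": 30, "retry_attempts": -1},
):
    issues = precheck_evaluation_config(candidate)
    print(candidate, "->", issues or "ok")
```

A client-side check like this complements server-side validation and does not replace it, since the server remains the source of truth for what is accepted.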

## Summary

This notebook demonstrated comprehensive usage of the Eval Hub API including:

- ✅ **Basic Operations**: Health checks, provider/benchmark discovery
- ✅ **Model Management**: Register, list, update, and delete models
- ✅ **Simple Evaluations**: Risk category-based evaluations
- ✅ **Advanced Evaluations**: Explicit backend configuration
- ✅ **NeMo Integration**: Single and multi-container setups
- ✅ **Monitoring**: Status checking and progress tracking
- ✅ **Management**: Cancellation and system metrics
- ✅ **Error Handling**: Validation and error responses
- ✅ **Batch Operations**: Multiple evaluation management

For production use, remember to:

- Use proper API keys and authentication
- Configure appropriate timeouts for your evaluation complexity
- Monitor resource usage and system metrics
- Handle errors gracefully in your applications
- Use the async evaluation mode for long-running evaluations
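
As an illustration of the timeout and error-handling points, here is a minimal sketch of a request wrapper built on `requests`. `safe_get_json` and its default timeout are hypothetical choices for this example and are not part of the Eval Hub API or client.

```python
from typing import Any, Optional

import requests


def safe_get_json(url: str, timeout_seconds: float = 10.0) -> Optional[Any]:
    """GET a URL and return the parsed JSON body, or None if anything fails.

    Hypothetical helper for illustration only.
    """
    try:
        # Without an explicit timeout, requests can wait indefinitely.
        response = requests.get(url, timeout=timeout_seconds)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
        return response.json()
    except requests.exceptions.Timeout:
        print(f"Request to {url} timed out after {timeout_seconds}s")
    except requests.exceptions.ConnectionError:
        print(f"Could not connect to {url}")
    except requests.exceptions.HTTPError as exc:
        print(f"Server returned an error: {exc}")
    except requests.exceptions.RequestException as exc:
        print(f"Request failed: {exc}")
    except ValueError:
        # Depending on the requests version, invalid JSON may surface
        # as a plain ValueError instead of a RequestException.
        print("Response body was not valid JSON")
    return None
```

The `api_request` helper defined at the top of this notebook forwards extra keyword arguments to `requests.request`, so a timeout can also be set there, for example `api_request("GET", "/health", timeout=10)`.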

The Eval Hub provides a powerful and flexible API for orchestrating machine learning model evaluations across multiple backends and evaluation frameworks.