---
title: "Eval Hub API Examples"
subtitle: "Comprehensive guide to using the Evaluation Hub REST API"
author: "Evaluation Service Team"
format:
  html:
    toc: true
    toc-depth: 3
    code-fold: false
    theme: cosmo
  ipynb:
    output-file: api_examples.ipynb
jupyter: python3
---

# Eval Hub API Examples

This notebook demonstrates how to interact with the Evaluation Hub REST API running on `localhost:8000`.

## Setup and Dependencies

In [None]:
import json
import time
from uuid import uuid4

import requests

# Configuration
BASE_URL = "http://localhost:8000"
API_BASE = f"{BASE_URL}/api/v1"

# Helper function for pretty printing JSON responses
def print_json(data):
    print(json.dumps(data, indent=2, default=str))

# Helper function for API requests
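def api_request(method: str, endpoint: str, **kwargs) -> requests.Response:
    """Send an API request, print the status and body, and return the response."""
    url = f"{API_BASE}{endpoint}"
    kwargs.setdefault("timeout", 30)  # Avoid hanging forever if the service is unreachable
    response = requests.request(method, url, **kwargs)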

    print(f"{method.upper()} {url}")
    print(f"Status: {response.status_code}")

    if response.headers.get('content-type', '').startswith('application/json'):
        print("Response:")
        print_json(response.json())
    else:
        print(f"Response: {response.text}")

    print("-" * 50)
    return response

## Health Check

First, let's verify the service is running:

In [None]:
response = api_request("GET", "/health")

if response.status_code == 200:
    health_data = response.json()
    print("✅ Service is healthy!")
    print(f"Version: {health_data['version']}")
    print(f"Uptime: {health_data['uptime_seconds']:.1f} seconds")
else:
    print("❌ Service is not responding correctly")
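
A single health check can fail spuriously while the service is still starting. The sketch below polls until a check succeeds or a deadline passes. `wait_until_healthy` and its parameters are hypothetical helpers, not part of the Evaluation Hub API; the check is passed in as a callable so the loop can be tried without a running service.

```python
import time
from typing import Callable


def wait_until_healthy(
    check: Callable[[], bool],
    timeout_seconds: float = 30.0,
    interval_seconds: float = 1.0,
) -> bool:
    """Call `check` repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        try:
            if check():
                return True
        except Exception as exc:  # e.g. connection refused while the service starts
            print(f"Health check raised {exc!r}; retrying")
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_seconds)


# Demo with a stand-in check that succeeds on the third call.
attempts = {"count": 0}


def fake_check() -> bool:
    attempts["count"] += 1
    return attempts["count"] >= 3


healthy = wait_until_healthy(fake_check, timeout_seconds=5.0, interval_seconds=0.01)
print(f"Healthy: {healthy} after {attempts['count']} attempts")
```

Against a live service, the check could be something like `lambda: requests.get(f"{API_BASE}/health", timeout=5).status_code == 200`.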

## Provider Management

### List All Providers

In [None]:
response = api_request("GET", "/providers")

if response.status_code == 200:
    providers_data = response.json()
    print(f"Found {providers_data['total_providers']} providers:")
    for provider in providers_data['providers']:
        print(f"  - {provider['provider_name']} ({provider['provider_id']})")
        print(f"    Type: {provider['provider_type']}")
        print(f"    Benchmarks: {provider['benchmark_count']}")

### Get Specific Provider Details

In [None]:
# Get details for the lm_evaluation_harness provider
provider_id = "lm_evaluation_harness"
response = api_request("GET", f"/providers/{provider_id}")

if response.status_code == 200:
    provider = response.json()
    print(f"Provider: {provider['provider_name']}")
    print(f"Description: {provider['description']}")
    print(f"Number of benchmarks: {len(provider['benchmarks'])}")

## Benchmark Discovery

### List All Benchmarks

In [None]:
response = api_request("GET", "/benchmarks")

if response.status_code == 200:
    benchmarks_data = response.json()
    print(f"Total benchmarks available: {benchmarks_data['total_count']}")

    # Show first 5 benchmarks
    for benchmark in benchmarks_data['benchmarks'][:5]:
        print(f"  - {benchmark['name']} ({benchmark['benchmark_id']})")
        print(f"    Category: {benchmark['category']}")
        print(f"    Provider: {benchmark['provider_id']}")

### Filter Benchmarks by Category

In [None]:
response = api_request("GET", "/benchmarks", params={"category": "math"})

if response.status_code == 200:
    math_benchmarks = response.json()
    print(f"Math benchmarks: {math_benchmarks['total_count']}")
    for benchmark in math_benchmarks['benchmarks']:
        print(f"  - {benchmark['name']}: {benchmark['description']}")

### Get Provider-Specific Benchmarks

In [None]:
provider_id = "lm_evaluation_harness"
response = api_request("GET", f"/providers/{provider_id}/benchmarks")

if response.status_code == 200:
    benchmarks = response.json()
    print(f"Benchmarks for {provider_id}: {len(benchmarks)}")

    # Group by category
    categories = {}
    for benchmark in benchmarks:
        category = benchmark['category']
        if category not in categories:
            categories[category] = []
        categories[category].append(benchmark['name'])

    for category, names in categories.items():
        print(f"\n{category.title()}: {len(names)} benchmarks")
        print(f"  Examples: {', '.join(names[:3])}")

## Collections

### List Available Collections

In [None]:
response = api_request("GET", "/collections")

if response.status_code == 200:
    collections = response.json()
    print(f"Available collections: {collections['total_collections']}")

    for collection in collections['collections']:
        print(f"\n📁 {collection['name']} ({collection['collection_id']})")
        print(f"   Description: {collection['description']}")
        print(f"   Benchmarks: {len(collection['benchmarks'])}")
        for benchmark_ref in collection['benchmarks'][:3]:  # Show first 3
            print(f"     - {benchmark_ref['provider_id']}::{benchmark_ref['benchmark_id']}")

### Create a Custom Collection

In [None]:
# Create a collection of available lm-evaluation-harness benchmarks for coding and reasoning evaluation
coding_reasoning_collection = {
    "collection_id": "coding_reasoning_v1",
    "name": "Coding & Reasoning Collection v1",
    "description": "A curated collection of coding and reasoning benchmarks using available lm-evaluation-harness tasks",
    "tags": ["coding", "reasoning", "v1"],
    "benchmarks": [
        {
            "provider_id": "lm_evaluation_harness",
            "benchmark_id": "arc_easy",
            "weight": 1.5,
            "config": {
                "num_fewshot": 25,
                "limit": 100
            }
        },
        {
            "provider_id": "lm_evaluation_harness",
            "benchmark_id": "humaneval",
            "weight": 2.0,  # Higher weight for coding benchmark
            "config": {
                "num_fewshot": 0,
                "limit": 50
            }
        },
        {
            "provider_id": "lm_evaluation_harness",
            "benchmark_id": "mbpp",
            "weight": 2.0,
            "config": {
                "num_fewshot": 0,
                "limit": 50
            }
        },
        {
            "provider_id": "lm_evaluation_harness",
            "benchmark_id": "bbh",
            "weight": 1.5,  # Big-bench hard for reasoning
            "config": {
                "num_fewshot": 3,
                "limit": 100
            }
        }
    ],
    "metadata": {
        "created_by": "evaluation_team",
        "use_case": "coding_reasoning_assessment",
        "difficulty": "intermediate_to_hard",
        "estimated_duration_minutes": 30
    }
}

print("📝 Creating custom coding & reasoning collection...")
print_json(coding_reasoning_collection)

response = api_request("POST", "/collections", json=coding_reasoning_collection)

if response.status_code == 201:
    created_collection = response.json()
    print("✅ Collection created successfully!")
    print(f"Collection ID: {created_collection['collection_id']}")
    print(f"Total benchmarks: {len(created_collection['benchmarks'])}")
    print(f"Created at: {created_collection.get('created_at', 'N/A')}")

    # Store collection ID for later use
    coding_reasoning_collection_id = created_collection['collection_id']
else:
    print(f"❌ Failed to create collection: {response.text}")
    coding_reasoning_collection_id = "coding_reasoning_v1"  # Fallback for examples
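
Before posting a collection, it can help to catch obvious payload mistakes locally. The sketch below is a hypothetical client-side check. `validate_collection_payload` and the rules it enforces (required keys, positive weights, no duplicate benchmark references) are assumptions drawn from the example payload above, not the service's actual validation rules.

```python
# Assumed required fields, inferred from the example payload (not from an API spec).
REQUIRED_COLLECTION_KEYS = ("collection_id", "name", "benchmarks")
REQUIRED_BENCHMARK_KEYS = ("provider_id", "benchmark_id")


def validate_collection_payload(payload: dict) -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for key in REQUIRED_COLLECTION_KEYS:
        if not payload.get(key):
            problems.append(f"missing or empty field: {key}")

    seen = set()
    for index, benchmark in enumerate(payload.get("benchmarks", [])):
        for key in REQUIRED_BENCHMARK_KEYS:
            if not benchmark.get(key):
                problems.append(f"benchmarks[{index}]: missing {key}")
        weight = benchmark.get("weight", 1.0)
        if not isinstance(weight, (int, float)) or weight <= 0:
            problems.append(f"benchmarks[{index}]: weight must be a positive number")
        ref = (benchmark.get("provider_id"), benchmark.get("benchmark_id"))
        if ref in seen:
            problems.append(f"benchmarks[{index}]: duplicate reference {ref[0]}::{ref[1]}")
        seen.add(ref)
    return problems


good_payload = {
    "collection_id": "demo_v1",
    "name": "Demo",
    "benchmarks": [
        {"provider_id": "lm_evaluation_harness", "benchmark_id": "arc_easy", "weight": 1.5},
    ],
}
bad_payload = {
    "collection_id": "demo_v2",  # "name" is deliberately missing
    "benchmarks": [
        {"provider_id": "lm_evaluation_harness", "benchmark_id": "arc_easy", "weight": 0},
        {"provider_id": "lm_evaluation_harness", "benchmark_id": "arc_easy"},
    ],
}

good_problems = validate_collection_payload(good_payload)
bad_problems = validate_collection_payload(bad_payload)
print(good_problems)
for problem in bad_problems:
    print(f"  - {problem}")
```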

### Create a Language Understanding Collection

In [None]:
# Create a collection focused on language understanding and modeling
language_collection = {
    "collection_id": "language_understanding_v1",
    "name": "Language Understanding Collection v1",
    "description": "Collection of language modeling and comprehension benchmarks",
    "tags": ["language", "understanding", "comprehension"],
    "benchmarks": [
        {
            "provider_id": "lm_evaluation_harness",
            "benchmark_id": "lambada_openai",
            "weight": 1.0,
            "config": {
                "num_fewshot": 0,
                "limit": 200
            }
        },
        {
            "provider_id": "lm_evaluation_harness",
            "benchmark_id": "blimp",
            "weight": 1.5,  # Grammar and linguistic knowledge
            "config": {
                "num_fewshot": 0,
                "limit": 100
            }
        },
        {
            "provider_id": "lm_evaluation_harness",
            "benchmark_id": "arc_easy",  # For basic reasoning
            "weight": 1.0,
            "config": {
                "num_fewshot": 25,
                "limit": 100
            }
        }
    ],
    "metadata": {
        "created_by": "nlp_team",
        "use_case": "language_understanding_assessment",
        "difficulty": "beginner_to_intermediate",
        "focus_areas": ["language_modeling", "grammar", "comprehension"]
    }
}

print("📝 Creating language understanding collection...")
response = api_request("POST", "/collections", json=language_collection)

if response.status_code == 201:
    language_collection_id = response.json()['collection_id']
    print(f"✅ Language collection created: {language_collection_id}")
else:
    print("⚠️ Language collection creation failed (may already exist)")
    language_collection_id = "language_understanding_v1"  # Fallback

### List Collections (Including New Ones)

In [None]:
# Refresh the collections list to see our new collections
response = api_request("GET", "/collections")

if response.status_code == 200:
    collections = response.json()
    print(f"📁 Total collections available: {collections['total_collections']}")

    # Show all collections with details
    for collection in collections['collections']:
        print(f"\n📁 {collection['name']}")
        print(f"   ID: {collection['collection_id']}")
        # The collections created above carry provider IDs per benchmark, so a
        # top-level provider_id may be absent
        print(f"   Provider: {collection.get('provider_id', 'N/A')}")
        print(f"   Benchmarks: {len(collection['benchmarks'])}")
        print(f"   Tags: {', '.join(collection.get('tags', []))}")
        if collection.get('metadata', {}).get('difficulty'):
            print(f"   Difficulty: {collection['metadata']['difficulty']}")

### Get Specific Collection Details

In [None]:
# Get detailed information about our coding & reasoning collection
collection_id = coding_reasoning_collection_id
response = api_request("GET", f"/collections/{collection_id}")

if response.status_code == 200:
    collection = response.json()
    print(f"📋 Collection: {collection['name']}")
    print(f"Description: {collection['description']}")
    # A top-level provider_id may be absent; providers are set per benchmark
    print(f"Provider: {collection.get('provider_id', 'N/A')}")

    print(f"\n🎯 Benchmarks ({len(collection['benchmarks'])}):")
    total_weight = sum(b.get('weight', 1.0) for b in collection['benchmarks'])

    for benchmark in collection['benchmarks']:
        weight = benchmark.get('weight', 1.0)
        weight_pct = (weight / total_weight) * 100
        print(f"  - {benchmark['benchmark_id']} (weight: {weight}, {weight_pct:.1f}%)")
        if benchmark.get('config'):
            config = benchmark['config']
            print(f"    Config: {config.get('num_fewshot', 0)} shots, limit {config.get('limit', 'unlimited')}")

    if collection.get('metadata'):
        metadata = collection['metadata']
        print("\n📊 Metadata:")
        print(f"  Estimated duration: {metadata.get('estimated_duration_minutes', 'unknown')} minutes")
        print(f"  Difficulty: {metadata.get('difficulty', 'unknown')}")
        print(f"  Use case: {metadata.get('use_case', 'unknown')}")
elif response.status_code == 404:
    print(f"❌ Collection '{collection_id}' not found")

## Collection-Based Evaluations

### Execute Evaluation Using a Collection

In [None]:
# Create an evaluation request using our coding & reasoning collection
# Now with native collection_id support in eval-hub!

collection_evaluation = {
    "request_id": str(uuid4()),
    "experiment_name": f"Coding & Reasoning Collection Evaluation - {coding_reasoning_collection_id}",
    "evaluations": [
        {
            "name": "TinyLlama Coding & Reasoning",
            "description": f"Evaluation using {coding_reasoning_collection_id} collection with automatic expansion",
            "model": {
                "server": "vllm",  # Use the vLLM server from our Kubernetes setup
                "name": "tinyllama",
                "configuration": {
                    "temperature": 0.0,  # Deterministic for benchmarking
                    "max_tokens": 512,
                    "top_p": 0.95
                }
            },
            "collection_id": coding_reasoning_collection_id,  # ✨ Native collection support!
            "timeout_minutes": 60,  # Allow more time for collection execution
            "retry_attempts": 1
        }
    ],
    "tags": {
        "evaluation_type": "collection",
        "collection_id": coding_reasoning_collection_id,
        "model_family": "llama",
        "evaluation_scope": "coding_reasoning"
    }
}

print("📝 Creating collection-based evaluation...")
print(f"Collection ID: {coding_reasoning_collection_id}")
print("✨ Using native collection_id support - automatic expansion!")

print_json(collection_evaluation)

response = api_request("POST", "/evaluations", json=collection_evaluation)

if response.status_code == 202:
    collection_eval_response = response.json()
    collection_request_id = collection_eval_response["request_id"]
    print("✅ Collection evaluation created successfully!")
    print(f"Request ID: {collection_request_id}")
    print(f"Status: {collection_eval_response['status']}")
    print(f"Experiment URL: {collection_eval_response.get('experiment_url', 'N/A')}")

    # Automatic collection expansion process:
    # 1. ✅ Eval-hub automatically looks up the collection by ID
    # 2. ✅ Extracts all benchmarks from the collection
    # 3. ✅ Groups benchmarks by provider
    # 4. ✅ Creates appropriate backend configurations
    # 5. 🔄 Will execute with proper weights and configurations
    print("\n✨ Native Collection Processing:")
    print(f"  ✅ Collection ID: {coding_reasoning_collection_id}")
    print("  ✅ Automatic backend expansion by eval-hub")
    print("  ✅ Benchmark configs and weights preserved")
    print("  🔄 Execution: Creating CR and running evaluation")
else:
    print("❌ Failed to create collection evaluation")
    print(f"Error: {response.text}")
    collection_request_id = None
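
Step 3 of the expansion listed in the comments above (grouping benchmarks by provider) happens inside eval-hub. Purely as an illustration of the idea, and not the service's implementation, the grouping can be sketched on the client side. `group_benchmarks_by_provider` is a hypothetical helper, and the second provider in the sample data is made up.

```python
from collections import defaultdict


def group_benchmarks_by_provider(benchmarks: list) -> dict:
    """Map each provider_id to the list of benchmark_ids that reference it."""
    grouped = defaultdict(list)
    for benchmark in benchmarks:
        grouped[benchmark["provider_id"]].append(benchmark["benchmark_id"])
    return dict(grouped)


sample_benchmarks = [
    {"provider_id": "lm_evaluation_harness", "benchmark_id": "arc_easy"},
    {"provider_id": "lm_evaluation_harness", "benchmark_id": "humaneval"},
    {"provider_id": "example_provider", "benchmark_id": "example_benchmark"},  # hypothetical
]

grouped = group_benchmarks_by_provider(sample_benchmarks)
for provider_id, benchmark_ids in grouped.items():
    print(f"{provider_id}: {', '.join(benchmark_ids)}")
```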

### Execute Multiple Collections in Parallel

In [None]:
# Create evaluations for both collections to compare different reasoning approaches
collections_comparison = []

for collection_id, collection_name in [
    (coding_reasoning_collection_id, "Coding & Reasoning"),
    (language_collection_id, "Language Understanding")
]:
    comparison_eval = {
        "request_id": str(uuid4()),
        "experiment_name": f"{collection_name} Collection - Model Comparison",
        "evaluations": [
            {
                "name": f"TinyLlama {collection_name} Evaluation",
                "description": f"Comparative evaluation using {collection_id} collection",
                "model": {
                    "server": "vllm",
                    "name": "tinyllama",
                    "configuration": {
                        "temperature": 0.0,
                        "max_tokens": 512
                    }
                },
                "collection_id": collection_id,  # Just reference the collection!
                "timeout_minutes": 90,
                "retry_attempts": 1
            }
        ],
        "tags": {
            "evaluation_type": "collection_comparison",
            "collection_id": collection_id,
            "comparison_group": "coding_vs_language",
            "model": "tinyllama"
        }
    }
    collections_comparison.append(comparison_eval)

print(f"📦 Creating {len(collections_comparison)} collection comparison evaluations...")

comparison_request_ids = []
comparison_names = []  # Names of the collections whose requests were accepted
collection_names = ["Coding & Reasoning", "Language Understanding"]
for collection_name, eval_request in zip(collection_names, collections_comparison):
    print(f"\n📝 Creating {collection_name} collection evaluation...")

    response = api_request("POST", "/evaluations", json=eval_request)

    if response.status_code == 202:
        comparison_response = response.json()
        comparison_request_ids.append(comparison_response["request_id"])
        comparison_names.append(collection_name)
        print(f"✅ {collection_name} evaluation created: {comparison_response['request_id']}")
    else:
        print(f"❌ Failed to create {collection_name} evaluation")

print(f"\n📊 Created {len(comparison_request_ids)} collection comparisons")
# Pair each ID with the name recorded at creation time, so the labels stay
# correct even if an earlier request failed
for collection_type, req_id in zip(comparison_names, comparison_request_ids):
    print(f"  - {collection_type}: {req_id}")

## Collection Results Management

### Monitor Collection Evaluation Progress

In [None]:
# Function specifically for monitoring collection-based evaluations
def monitor_collection_evaluation(request_id: str, collection_id: str):
    """Monitor a collection-based evaluation with collection-specific details."""
    print(f"🔍 Monitoring collection evaluation: {collection_id}")
    print(f"Request ID: {request_id}")

    response = api_request("GET", f"/evaluations/{request_id}")

    if response.status_code == 200:
        status_data = response.json()
        print("\n📊 Collection Evaluation Status:")
        print(f"Status: {status_data['status']}")
        print(f"Progress: {status_data.get('progress_percentage', 0):.1f}%")

        # Collection-specific information
        if status_data.get('collection_id'):
            print(f"Collection: {status_data['collection_id']}")

        # Show benchmark-level progress if available
        if status_data.get('results'):
            print(f"\n📋 Benchmark Progress ({len(status_data['results'])} completed):")
            for result in status_data['results']:
                benchmark_name = result.get('benchmark_name', 'unknown')
                result_status = result.get('status', 'unknown')
                print(f"  - {benchmark_name}: {result_status}")

                # Show key metrics if available
                if result.get('metrics'):
                    metrics = result['metrics']
                    key_metrics = []
                    for metric_name, metric_value in list(metrics.items())[:2]:  # Show first 2 metrics
                        if isinstance(metric_value, (int, float)):
                            key_metrics.append(f"{metric_name}: {metric_value:.3f}")
                    if key_metrics:
                        print(f"    Metrics: {', '.join(key_metrics)}")

        return status_data
    else:
        print(f"❌ Failed to get collection evaluation status: {response.text}")
        return None

# Monitor our collection evaluation if it exists
if 'collection_request_id' in locals() and collection_request_id:
    monitor_collection_evaluation(collection_request_id, coding_reasoning_collection_id)
else:
    print("No active collection evaluation to monitor")
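
`monitor_collection_evaluation` takes a single snapshot. To wait for an evaluation to finish, the status can be polled in a loop. The sketch below assumes `completed` and `failed` are the terminal statuses (both values appear elsewhere in this notebook; the service may define others). The `pending` and `running` values in the demo are made up. `poll_until_terminal` is a hypothetical helper that takes the fetch function as a parameter so it can be exercised without a running service.

```python
import time
from typing import Callable, Optional

TERMINAL_STATUSES = {"completed", "failed"}  # assumed terminal states


def poll_until_terminal(
    fetch_status: Callable[[], Optional[dict]],
    max_polls: int = 60,
    interval_seconds: float = 5.0,
) -> Optional[dict]:
    """Poll until a terminal status is seen; return None if max_polls is exhausted."""
    for attempt in range(1, max_polls + 1):
        status_data = fetch_status()
        status = status_data.get("status") if status_data else None
        print(f"Poll {attempt}: status={status}")
        if status in TERMINAL_STATUSES:
            return status_data
        if attempt < max_polls:
            time.sleep(interval_seconds)
    return None


# Demo with canned responses instead of real API calls.
fake_responses = iter([
    {"status": "pending"},   # illustrative status value
    {"status": "running"},   # illustrative status value
    {"status": "completed"},
])
final_status = poll_until_terminal(lambda: next(fake_responses), max_polls=5, interval_seconds=0.01)
print(final_status)
```

With a live service, `fetch_status` could be `lambda: monitor_collection_evaluation(request_id, collection_id)`, which returns the status dictionary or `None`.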

### Retrieve Complete Collection Results

In [None]:
# Function to get comprehensive collection results
def get_collection_results(request_id: str, format_for_analysis: bool = True):
    """
    Retrieve and format results for a collection-based evaluation.

    Args:
        request_id: The evaluation request ID
        format_for_analysis: Whether to format results for analysis
    """
    print("📊 Retrieving collection evaluation results...")

    response = api_request("GET", f"/evaluations/{request_id}")

    if response.status_code != 200:
        print(f"❌ Failed to retrieve results: {response.text}")
        return None

    eval_data = response.json()

    if eval_data.get('status') != 'completed':
        print(f"⏳ Evaluation not completed. Status: {eval_data.get('status')}")
        return None

    collection_id = eval_data.get('collection_id')
    print("✅ Collection evaluation completed!")
    print(f"Collection ID: {collection_id}")
    print(f"Total benchmarks: {len(eval_data.get('results', []))}")

    # Aggregate collection-level metrics
    results = eval_data.get('results', [])

    if not results:
        print("No results found")
        return eval_data

    print("\n📋 Collection Results Summary:")

    # Calculate weighted average scores based on collection benchmark weights
    total_weighted_score = 0
    total_weight = 0
    benchmark_scores = {}

    for result in results:
        benchmark_name = result.get('benchmark_name', 'unknown')
        benchmark_status = result.get('status', 'unknown')

        print(f"\n  📊 {benchmark_name}: {benchmark_status}")

        if result.get('metrics'):
            metrics = result['metrics']

            # Extract primary accuracy metric (common across lm-eval-harness benchmarks)
            primary_score = None
            for metric_name in ['acc', 'acc_norm', 'exact_match', 'score']:
                if metric_name in metrics:
                    primary_score = metrics[metric_name]
                    break

            if primary_score is not None:
                if isinstance(primary_score, dict) and 'value' in primary_score:
                    score_value = primary_score['value']
                else:
                    score_value = primary_score

                benchmark_scores[benchmark_name] = score_value
                print(f"    Primary score: {score_value:.3f}")

                # Get benchmark weight from collection (default 1.0)
                weight = 1.0  # Default weight
                # Note: In a real implementation, you'd look up the weight from the collection definition

                total_weighted_score += score_value * weight
                total_weight += weight

            # Show additional metrics
            other_metrics = []
            for metric_name, metric_value in metrics.items():
                if metric_name not in ['acc', 'acc_norm', 'exact_match', 'score']:
                    if isinstance(metric_value, (int, float)):
                        other_metrics.append(f"{metric_name}: {metric_value:.3f}")
                    elif isinstance(metric_value, dict) and 'value' in metric_value:
                        other_metrics.append(f"{metric_name}: {metric_value['value']:.3f}")

            if other_metrics:
                print(f"    Other metrics: {', '.join(other_metrics[:3])}")  # Show first 3

    # Calculate collection-level aggregate score
    collection_avg_score = None  # Stays None when no benchmark reported a primary score
    if total_weight > 0:
        collection_avg_score = total_weighted_score / total_weight
        print(f"\n🎯 Collection Aggregate Score: {collection_avg_score:.3f}")
        print(f"   (Weighted average across {len(benchmark_scores)} benchmarks)")

    if format_for_analysis:
        # Format results for further analysis
        analysis_format = {
            "collection_id": collection_id,
            "evaluation_id": request_id,
            "status": eval_data['status'],
            "completed_at": eval_data.get('updated_at'),
            "aggregate_score": collection_avg_score,
            "benchmark_scores": benchmark_scores,
            "benchmark_count": len(results),
            "successful_benchmarks": len([r for r in results if r.get('status') == 'completed']),
            "failed_benchmarks": len([r for r in results if r.get('status') == 'failed']),
            "raw_results": results
        }

        print("\n📋 Analysis Format Summary:")
        print(f"  Successful: {analysis_format['successful_benchmarks']}/{analysis_format['benchmark_count']}")
        print(f"  Success rate: {(analysis_format['successful_benchmarks']/analysis_format['benchmark_count']*100):.1f}%")

        return analysis_format

    return eval_data

# Example usage with a completed evaluation
if 'collection_request_id' in locals() and collection_request_id:
    print(f"📊 Attempting to retrieve results for: {collection_request_id}")
    collection_results = get_collection_results(collection_request_id)
else:
    print("📝 No collection evaluation request ID available for result retrieval")
    print("📖 Example of what collection results would look like:")

    # Show example collection results structure
    example_collection_results = {
        "collection_id": "academic_reasoning_v1",
        "evaluation_id": "12345678-1234-1234-1234-123456789012",
        "status": "completed",
        "aggregate_score": 0.742,
        "benchmark_scores": {
            "arc_easy": 0.753,
            "arc_challenge": 0.462,
            "hellaswag": 0.789,
            "mmlu": 0.654
        },
        "benchmark_count": 4,
        "successful_benchmarks": 4,
        "failed_benchmarks": 0
    }

    print_json(example_collection_results)
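
`get_collection_results` uses a weight of 1.0 for every benchmark and notes that a real implementation would look the weight up from the collection definition. One way to do that is sketched below. It assumes each result's `benchmark_name` matches a `benchmark_id` in the collection, which the API may not guarantee. `weighted_aggregate` is a hypothetical helper, and the scores are illustrative values, not measured results.

```python
from typing import Optional


def weighted_aggregate(benchmark_scores: dict, collection: dict) -> Optional[float]:
    """Weighted mean of scores, using per-benchmark weights from the collection."""
    # Build a benchmark_id -> weight lookup; benchmarks without a weight default to 1.0.
    weights = {
        b["benchmark_id"]: b.get("weight", 1.0)
        for b in collection.get("benchmarks", [])
    }
    total_weighted_score = 0.0
    total_weight = 0.0
    for name, score in benchmark_scores.items():
        weight = weights.get(name, 1.0)  # Unknown benchmarks also default to 1.0
        total_weighted_score += score * weight
        total_weight += weight
    return total_weighted_score / total_weight if total_weight > 0 else None


collection_definition = {
    "benchmarks": [
        {"benchmark_id": "arc_easy", "weight": 1.5},
        {"benchmark_id": "humaneval", "weight": 2.0},
    ]
}
illustrative_scores = {"arc_easy": 0.8, "humaneval": 0.5}

aggregate = weighted_aggregate(illustrative_scores, collection_definition)
print(f"Weighted aggregate: {aggregate:.3f}")
```

Here the aggregate is (0.8 × 1.5 + 0.5 × 2.0) / 3.5, so the higher-weighted coding benchmark pulls the result below the unweighted mean of 0.65.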

### Compare Collection Performance

In [None]:
# Function to compare results across multiple collection evaluations
def compare_collection_results(request_ids: list, collection_names: list = None):
    """Compare results across multiple collection evaluations."""

    if collection_names is None:
        collection_names = [f"Collection {i+1}" for i in range(len(request_ids))]

    print(f"📊 Comparing {len(request_ids)} collection evaluations...")

    comparison_data = []

    for i, request_id in enumerate(request_ids):
        collection_name = collection_names[i]
        print(f"\n🔍 Retrieving results for {collection_name}...")

        results = get_collection_results(request_id, format_for_analysis=True)

        if results:
            comparison_data.append({
                "name": collection_name,
                "request_id": request_id,
                "collection_id": results.get('collection_id'),
                "aggregate_score": results.get('aggregate_score'),
                "benchmark_scores": results.get('benchmark_scores', {}),
                "success_rate": results.get('successful_benchmarks', 0) / max(results.get('benchmark_count', 1), 1),
                "benchmark_count": results.get('benchmark_count', 0)
            })

    if not comparison_data:
        print("❌ No valid results to compare")
        return None

    print("\n📊 Collection Performance Comparison:")
    print(f"{'Collection':<25} {'Aggregate':<10} {'Success Rate':<12} {'Benchmarks':<10}")
    print("-" * 60)

    for data in comparison_data:
        # Compare against None so that a legitimate score of 0.0 is still displayed
        aggregate = f"{data['aggregate_score']:.3f}" if data['aggregate_score'] is not None else "N/A"
        success_rate = f"{data['success_rate']*100:.1f}%"  # Always numeric; 0.0% is a valid value
        benchmarks = str(data['benchmark_count'])

        print(f"{data['name']:<25} {aggregate:<10} {success_rate:<12} {benchmarks:<10}")

    # Show benchmark-by-benchmark comparison if there are common benchmarks
    all_benchmarks = set()
    for data in comparison_data:
        all_benchmarks.update(data['benchmark_scores'].keys())

    if all_benchmarks:
        print("\n📋 Benchmark-by-Benchmark Comparison:")

        for benchmark in sorted(all_benchmarks):
            print(f"\n  {benchmark}:")
            for data in comparison_data:
                score = data['benchmark_scores'].get(benchmark)
                score_str = f"{score:.3f}" if score is not None else "N/A"
                print(f"    {data['name']:<20}: {score_str}")

    return comparison_data

# Example usage with comparison request IDs
if 'comparison_request_ids' in locals() and comparison_request_ids:
    collection_comparison = compare_collection_results(
        comparison_request_ids,
        ["Coding & Reasoning", "Language Understanding"]
    )
else:
    print("📝 No comparison evaluations available")
    print("📖 This would compare performance across different collections")
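
The list returned by `compare_collection_results` can also be ranked by aggregate score. The sketch below puts entries without a score last. `rank_by_aggregate` is a hypothetical helper and the sample entries are illustrative, not real results.

```python
def rank_by_aggregate(comparison_data: list) -> list:
    """Sort entries by aggregate_score, highest first; entries without a score go last."""
    return sorted(
        comparison_data,
        key=lambda entry: (
            entry.get("aggregate_score") is None,     # False sorts before True
            -(entry.get("aggregate_score") or 0.0),   # Higher scores first
        ),
    )


sample_comparison = [
    {"name": "Language Understanding", "aggregate_score": 0.61},
    {"name": "Unfinished Run", "aggregate_score": None},
    {"name": "Coding & Reasoning", "aggregate_score": 0.68},
]

ranked = rank_by_aggregate(sample_comparison)
for position, entry in enumerate(ranked, start=1):
    score = entry["aggregate_score"]
    score_str = f"{score:.3f}" if score is not None else "N/A"
    print(f"{position}. {entry['name']}: {score_str}")
```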

### Export Collection Results

In [None]:
# Function to export collection results for external analysis
def export_collection_results(results_data: dict, filename: str = None):
    """Export collection results to JSON file for external analysis."""

    if filename is None:
        collection_id = results_data.get('collection_id', 'unknown')
        timestamp = time.strftime("%Y%m%d_%H%M%S")
        filename = f"collection_results_{collection_id}_{timestamp}.json"

    # Prepare export format
    export_data = {
        "export_metadata": {
            # Use gmtime() so the timestamp really is UTC, as the label claims
            "export_timestamp": time.strftime("%Y-%m-%d %H:%M:%S UTC", time.gmtime()),
            "eval_hub_version": "v1",
            "format_version": "1.0"
        },
        "collection_evaluation": results_data
    }

    # Write to file (json is already imported in the setup cell)
    with open(filename, 'w') as f:
        json.dump(export_data, f, indent=2, default=str)

    print(f"💾 Collection results exported to: {filename}")
    print("📊 Export contains:")
    print(f"  - Collection ID: {results_data.get('collection_id', 'N/A')}")
    print(f"  - Benchmarks: {results_data.get('benchmark_count', 0)}")
    print(f"  - Aggregate score: {results_data.get('aggregate_score', 'N/A')}")

    return filename

# Example export
example_export_data = {
    "collection_id": "coding_reasoning_v1",
    "aggregate_score": 0.678,
    "benchmark_count": 4,
    "benchmark_scores": {"arc_easy": 0.753, "humaneval": 0.645, "mbpp": 0.672, "bbh": 0.642}
}

print("📝 Example collection results export:")
export_filename = export_collection_results(example_export_data)
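
To read an exported file back for analysis, the wrapper can be unpacked again. The sketch below assumes the file layout written by `export_collection_results` (`export_metadata` plus `collection_evaluation`). `load_collection_results` is a hypothetical helper, and the demo writes a small file to a temporary directory so that it runs on its own.

```python
import json
import os
import tempfile


def load_collection_results(filename: str) -> dict:
    """Load an export file and return the collection_evaluation payload."""
    with open(filename) as f:
        export_data = json.load(f)
    if "collection_evaluation" not in export_data:
        raise ValueError(f"{filename} does not look like a collection results export")
    return export_data["collection_evaluation"]


# Demo: write a minimal export file, then read it back.
demo_export = {
    "export_metadata": {"format_version": "1.0"},
    "collection_evaluation": {"collection_id": "coding_reasoning_v1", "benchmark_count": 4},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "collection_results_demo.json")
    with open(demo_path, "w") as f:
        json.dump(demo_export, f)
    loaded_results = load_collection_results(demo_path)

print(loaded_results["collection_id"])
```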

## Model Server Management

### List All Model Servers

In [None]:
response = api_request("GET", "/servers")

if response.status_code == 200:
    servers_data = response.json()
    print(f"Total servers: {servers_data['total_servers']}")
    print(f"Runtime servers: {len(servers_data.get('runtime_servers', []))}")

    print("\n📋 Model Servers:")
    for server in servers_data.get('servers', []):
        print(f"  - {server['server_id']}")
        print(f"    Type: {server['server_type']}")
        print(f"    Base URL: {server['base_url']}")
        print(f"    Models: {server['model_count']}")
        print(f"    Status: {server['status']}")

### List Only Active Servers

In [None]:
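# requests would serialize a Python False as the string "False"; send lowercase "false" explicitly
response = api_request("GET", "/servers", params={"include_inactive": "false"})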

if response.status_code == 200:
    servers_data = response.json()
    print(f"Active servers: {servers_data['total_servers']}")
    for server in servers_data.get('servers', []):
        print(f"  - {server['server_id']} - {server['status']}")

### Get Server by ID

In [None]:
# Get details for a specific model server
server_id = "vllm"  # Replace with an actual server ID from your system
response = api_request("GET", f"/servers/{server_id}")

if response.status_code == 200:
    server = response.json()
    print(f"Server ID: {server['server_id']}")
    print(f"Type: {server['server_type']}")
    print(f"Base URL: {server['base_url']}")
    print(f"Status: {server['status']}")

    print(f"\n📦 Models on this server ({len(server['models'])}):")
    for model in server['models']:
        print(f"  - {model['model_name']}")
        print(f"    Status: {model['status']}")
        if model.get('description'):
            print(f"    Description: {model['description']}")

    if server.get('tags'):
        print(f"\nTags: {', '.join(server['tags'])}")
elif response.status_code == 404:
    print(f"❌ Server '{server_id}' not found")

### Get Model by Server and Name

In [None]:
# Get a specific model by getting the server and finding the model in its models list
server_id = "vllm"
model_name = "vllm"  # Replace with actual model name

response = api_request("GET", f"/servers/{server_id}")

if response.status_code == 200:
    server = response.json()
    model = None
    for m in server['models']:
        if m['model_name'] == model_name:
            model = m
            break

    if model:
        print(f"‚úÖ Found model: {model['model_name']}")
        print(f"   Server: {server['server_id']}")
        print(f"   Status: {model['status']}")
        if model.get('description'):
            print(f"   Description: {model['description']}")
        if model.get('capabilities'):
            print(f"   Capabilities: {model['capabilities']}")
    else:
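        print(f"❌ Model '{model_name}' not found on server '{server_id}'")
elif response.status_code == 404:
    print(f"❌ Server '{server_id}' not found")
else:
    print(f"❌ Unexpected status {response.status_code} while fetching server '{server_id}'")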
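The lookup loop above can be folded into a small reusable helper. The sketch below assumes the server payload has the same `models` / `model_name` shape used in this notebook; `find_model` and the sample payload are hypothetical and not part of the Eval Hub API.

```python
from typing import Any, Optional


def find_model(server: dict[str, Any], model_name: str) -> Optional[dict[str, Any]]:
    """Return the first model entry whose 'model_name' matches, or None."""
    return next(
        (m for m in server.get("models", []) if m.get("model_name") == model_name),
        None,
    )


# Sample payload mirroring the fields printed above (values are illustrative only)
sample_server = {
    "server_id": "vllm",
    "models": [
        {"model_name": "tinyllama", "status": "active"},
        {"model_name": "llama-2-7b", "status": "inactive"},
    ],
}

found = find_model(sample_server, "llama-2-7b")
missing = find_model(sample_server, "does-not-exist")
print(found)
print(missing)
```

Returning `None` for a missing model keeps the caller's `if model:` check unchanged.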

### Register a New Model Server

In [None]:
# Register a model server with models
new_server = {
    "server_id": "groq-server",
    "server_type": "openai-compatible",
    "base_url": "https://api.groq.com/openai/v1",
    "api_key_required": True,
    "models": [
        {
            "model_name": "llama-3.1-70b",
            "description": "Meta's Llama 3.1 70B model",
            "status": "active",
            "tags": ["groq", "llama", "70b"]
        },
        {
            "model_name": "llama-3.1-8b",
            "description": "Meta's Llama 3.1 8B model",
            "status": "active",
            "tags": ["groq", "llama", "8b"]
        }
    ],
    "server_config": {
        "temperature": 0.7,
        "max_tokens": 2048,
        "timeout": 60,
        "retry_attempts": 3
    },
    "status": "active",
    "tags": ["groq", "openai-compatible", "fast"]
}

print("📝 Registering new model server...")
print_json(new_server)

response = api_request("POST", "/servers", json=new_server)

if response.status_code == 201:
    registered_server = response.json()
    print("✅ Model server registered successfully!")
    print(f"Server ID: {registered_server['server_id']}")
    print(f"Models: {len(registered_server['models'])}")
    print(f"Created at: {registered_server.get('created_at', 'N/A')}")
else:
    print(f"❌ Failed to register server: {response.text}")

### Register a vLLM Server

In [None]:
# Register a vLLM server
vllm_server = {
    "server_id": "local-vllm",
    "server_type": "vllm",
    "base_url": "http://localhost:8000",
    "api_key_required": False,
    "models": [
        {
            "model_name": "llama-2-7b",
            "description": "Llama 2 7B running on local vLLM server",
            "status": "active",
            "tags": ["vllm", "local", "llama-2"]
        }
    ],
    "status": "active",
    "tags": ["vllm", "local"]
}

print("📝 Registering vLLM server...")
response = api_request("POST", "/servers", json=vllm_server)

if response.status_code == 201:
    print(f"✅ vLLM server registered: {response.json()['server_id']}")
else:
    print("⚠️ Note: This may fail if the server ID already exists")
    print(f"Response: {response.text}")
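Before posting a registration payload, a quick local check can catch obvious mistakes such as a missing field or a duplicated model name. This is only a sketch: `check_server_payload` and `EXPECTED_FIELDS` are hypothetical, and the field list simply reflects what the examples above send, not the service's actual schema.

```python
from typing import Any

# Fields that every registration example in this notebook supplies.
# Assumption: the service's real schema may require more or fewer fields.
EXPECTED_FIELDS = ("server_id", "server_type", "base_url", "models")


def check_server_payload(payload: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = [f"missing field: {name}" for name in EXPECTED_FIELDS if name not in payload]

    seen: set[str] = set()
    for model in payload.get("models", []):
        name = model.get("model_name", "")
        if not name:
            problems.append("model entry without a model_name")
        elif name in seen:
            problems.append(f"duplicate model_name: {name}")
        seen.add(name)
    return problems


good_payload = {
    "server_id": "local-vllm",
    "server_type": "vllm",
    "base_url": "http://localhost:8000",
    "models": [{"model_name": "llama-2-7b"}],
}
bad_payload = {
    "server_id": "broken",
    "models": [{"model_name": "a"}, {"model_name": "a"}, {}],
}

print(check_server_payload(good_payload))
print(check_server_payload(bad_payload))
```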

## Basic Evaluation Examples

### Single Benchmark Evaluation from Built-in Provider (Simplified API)

In [None]:
# Example: Run a single benchmark using the simplified API (Llama Stack compatible)
provider_id = "lm_evaluation_harness"
benchmark_id = "arc_easy"

single_benchmark_request = {
    "model": {
        "server": "vllm",
        "name": "tinyllama"
    },
    "model_configuration": {
        "temperature": 0.0,
        "max_tokens": 512
    },
    "timeout_minutes": 30,
    "retry_attempts": 1,
    "limit": 100,  # Limit to 100 samples for faster execution
    "num_fewshot": 0,
    "experiment_name": "Single Benchmark - ARC Easy",
    "tags": {
        "example_type": "single_benchmark",
        "provider": "lm_evaluation_harness",
        "benchmark": "arc_easy"
    }
}

print("📝 Creating single benchmark evaluation request...")
print(f"Provider ID: {provider_id}")
print(f"Benchmark ID: {benchmark_id}")
print_json(single_benchmark_request)

response = api_request("POST", f"/evaluations/benchmarks/{provider_id}/{benchmark_id}", json=single_benchmark_request)

if response.status_code == 202:
    evaluation_response = response.json()
    request_id = evaluation_response["request_id"]
    print("✅ Single benchmark evaluation created successfully!")
    print(f"Request ID: {request_id}")
    print(f"Status: {evaluation_response['status']}")
    print(f"Experiment URL: {evaluation_response.get('experiment_url', 'N/A')}")
else:
    print("❌ Failed to create evaluation")
    print(f"Error: {response.text}")

### Simple Evaluation with Risk Category

In [None]:
# Create a simple evaluation request using risk category
evaluation_request = {
    "request_id": str(uuid4()),
    "experiment_name": "Simple Risk-Based Evaluation",
    "evaluations": [
        {
            "name": "GPT-4 Mini Low Risk Evaluation",
            "description": "Basic evaluation using low risk benchmarks",
            "model": {
                "server": "default",
                "name": "default"
            },
            "model_configuration": {
                "temperature": 0.0,
                "max_tokens": 512
            },
            "risk_category": "low",
            "timeout_minutes": 30,
            "retry_attempts": 1
        }
    ],
    "tags": {
        "example_type": "risk_category",
        "complexity": "simple"
    }
}

print("📝 Creating evaluation request...")
print_json(evaluation_request)

response = api_request("POST", "/evaluations", json=evaluation_request)

if response.status_code == 202:
    evaluation_response = response.json()
    request_id = evaluation_response["request_id"]
    print("✅ Evaluation created successfully!")
    print(f"Request ID: {request_id}")
    print(f"Status: {evaluation_response['status']}")
    print(f"Experiment URL: {evaluation_response.get('experiment_url', 'N/A')}")
else:
    print("❌ Failed to create evaluation")
    print(f"Error: {response.text}")
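Because the risk-category request body is repeated with small variations later in this notebook, it may help to build it with a function. The sketch below copies the field names from the example above; `build_risk_evaluation_request` and its default values are hypothetical.

```python
from typing import Any, Optional
from uuid import uuid4


def build_risk_evaluation_request(
    experiment_name: str,
    risk_category: str,
    server: str = "default",
    model_name: str = "default",
    tags: Optional[dict[str, str]] = None,
) -> dict[str, Any]:
    """Assemble a request body shaped like the risk-category example above."""
    return {
        "request_id": str(uuid4()),  # Fresh ID per call
        "experiment_name": experiment_name,
        "evaluations": [
            {
                "name": f"{experiment_name} ({risk_category} risk)",
                "model": {"server": server, "name": model_name},
                "model_configuration": {"temperature": 0.0, "max_tokens": 512},
                "risk_category": risk_category,
                "timeout_minutes": 30,
                "retry_attempts": 1,
            }
        ],
        "tags": dict(tags or {}),
    }


request_body = build_risk_evaluation_request(
    "Smoke test", "low", tags={"example_type": "risk_category"}
)
print(request_body["evaluations"][0]["name"])
```

The result could then be passed as `json=request_body` to the `api_request` helper defined in the setup cell.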

### Evaluation with Explicit Backend Configuration

In [None]:
# Create an evaluation with explicit backend configuration
explicit_evaluation = {
    "request_id": str(uuid4()),
    "experiment_name": "Explicit Backend Configuration",
    "evaluations": [
        {
            "name": "LM-Eval Harness Evaluation",
            "description": "Evaluation with explicit lm-evaluation-harness configuration",
            "model": {
                "server": "default",
                "name": "default"
            },
            "model_configuration": {
                "temperature": 0.1,
                "max_tokens": 256,
                "top_p": 0.95
            },
            "backends": [
                {
                    "name": "lm-eval-backend",
                    "type": "lm-evaluation-harness",
                    "config": {
                        "batch_size": 1,
                        "device": "cpu"
                    },
                    "benchmarks": [
                        {
                            "name": "arc_easy",
                            "tasks": ["arc_easy"],
                            "config": {
                                "num_fewshot": 5,
                                "limit": 50
                            }
                        },
                        {
                            "name": "hellaswag",
                            "tasks": ["hellaswag"],
                            "config": {
                                "num_fewshot": 10,
                                "limit": 100
                            }
                        }
                    ]
                }
            ],
            "timeout_minutes": 45,
            "retry_attempts": 2
        }
    ],
    "tags": {
        "example_type": "explicit_backend",
        "complexity": "intermediate"
    }
}

print("📝 Creating evaluation with explicit backend...")
response = api_request("POST", "/evaluations", json=explicit_evaluation)

if response.status_code == 202:
    explicit_response = response.json()
    explicit_request_id = explicit_response["request_id"]
    print("✅ Explicit evaluation created!")
    print(f"Request ID: {explicit_request_id}")

## NeMo Evaluator Integration

### Single NeMo Evaluator Container

In [None]:
# Example with single NeMo Evaluator container
nemo_single_evaluation = {
    "request_id": str(uuid4()),
    "experiment_name": "NeMo Evaluator Single Container",
    "evaluations": [
        {
            "name": "GPT-4 via NeMo Evaluator",
            "description": "Remote evaluation using NeMo Evaluator container",
            "model": {
                "server": "default",
                "name": "default"
            },
            "model_configuration": {
                "temperature": 0.0,
                "max_tokens": 512,
                "top_p": 0.95
            },
            "backends": [
                {
                    "name": "remote-nemo-evaluator",
                    "type": "nemo-evaluator",
                    "config": {
                        "endpoint": "localhost",
                        "port": 3825,
                        "model_endpoint": "https://api.openai.com/v1/chat/completions",
                        "endpoint_type": "chat",
                        "api_key_env": "OPENAI_API_KEY",
                        "timeout_seconds": 1800,
                        "max_retries": 2,
                        "verify_ssl": False,
                        "framework_name": "eval-hub-example",
                        "parallelism": 1,
                        "limit_samples": 25,
                        "temperature": 0.0,
                        "top_p": 0.95
                    },
                    "benchmarks": [
                        {
                            "name": "mmlu_pro_sample",
                            "tasks": ["mmlu_pro"],
                            "config": {
                                "limit": 25,
                                "num_fewshot": 5
                            }
                        }
                    ]
                }
            ],
            "timeout_minutes": 60,
            "retry_attempts": 1
        }
    ],
    "tags": {
        "example_type": "nemo_evaluator_single",
        "complexity": "advanced",
        "backend": "remote_container"
    }
}

print("📝 Creating NeMo Evaluator evaluation...")
print("Note: This requires a running NeMo Evaluator container on localhost:3825")

response = api_request("POST", "/evaluations", json=nemo_single_evaluation)

if response.status_code == 202:
    nemo_response = response.json()
    nemo_request_id = nemo_response["request_id"]
    print("✅ NeMo evaluation created!")
    print(f"Request ID: {nemo_request_id}")
else:
    print("⚠️ NeMo evaluation failed (container may not be running)")
    print(f"Response: {response.text}")

### Multi-Container NeMo Evaluator Setup

In [None]:
# Example with multiple specialized NeMo Evaluator containers
nemo_multi_evaluation = {
    "request_id": str(uuid4()),
    "experiment_name": "Multi-Container NeMo Evaluation",
    "evaluations": [
        {
            "name": "Distributed LLaMA Evaluation",
            "description": "Multi-container evaluation across specialized endpoints",
            "model": {
                "server": "default",
                "name": "default"
            },
            "model_configuration": {
                "temperature": 0.1,
                "max_tokens": 512,
                "top_p": 0.95
            },
            "backends": [
                {
                    "name": "academic-evaluator",
                    "type": "nemo-evaluator",
                    "config": {
                        "endpoint": "academic-eval.example.com",
                        "port": 3825,
                        "model_endpoint": "https://api.groq.com/openai/v1/chat/completions",
                        "endpoint_type": "chat",
                        "api_key_env": "GROQ_API_KEY",
                        "timeout_seconds": 3600,
                        "framework_name": "eval-hub-academic",
                        "parallelism": 2
                    },
                    "benchmarks": [
                        {
                            "name": "mmlu_pro",
                            "tasks": ["mmlu_pro"],
                            "config": {"limit": 100, "num_fewshot": 5}
                        },
                        {
                            "name": "arc_challenge",
                            "tasks": ["arc_challenge"],
                            "config": {"limit": 200, "num_fewshot": 25}
                        }
                    ]
                },
                {
                    "name": "math-evaluator",
                    "type": "nemo-evaluator",
                    "config": {
                        "endpoint": "math-eval.example.com",
                        "port": 3825,
                        "model_endpoint": "https://api.groq.com/openai/v1/chat/completions",
                        "endpoint_type": "chat",
                        "api_key_env": "GROQ_API_KEY",
                        "temperature": 0.0,
                        "parallelism": 1,
                        "framework_name": "eval-hub-math"
                    },
                    "benchmarks": [
                        {
                            "name": "gsm8k",
                            "tasks": ["gsm8k"],
                            "config": {"limit": 100, "num_fewshot": 8}
                        },
                        {
                            "name": "math",
                            "tasks": ["hendrycks_math"],
                            "config": {"limit": 50, "num_fewshot": 4}
                        }
                    ]
                }
            ],
            "timeout_minutes": 120,
            "retry_attempts": 2
        }
    ],
    "tags": {
        "example_type": "nemo_evaluator_multi",
        "complexity": "expert",
        "backend": "distributed_containers"
    }
}

print("📝 Creating multi-container NeMo evaluation...")
print("Note: This is a hypothetical example with multiple remote containers")
print_json(nemo_multi_evaluation)
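A deeply nested request like the one above is easier to review as a flat summary of which backend runs which benchmarks. The helper below is a hypothetical sketch that relies only on the `evaluations` / `backends` / `benchmarks` nesting shown in this notebook; the sample request is a trimmed-down copy for illustration.

```python
from typing import Any


def summarize_backends(request: dict[str, Any]) -> list[tuple[str, str, list[str]]]:
    """List (evaluation name, backend name, benchmark names) for every backend in a request."""
    summary = []
    for evaluation in request.get("evaluations", []):
        for backend in evaluation.get("backends", []):
            benchmark_names = [b["name"] for b in backend.get("benchmarks", [])]
            summary.append((evaluation["name"], backend["name"], benchmark_names))
    return summary


# Trimmed-down request with the same nesting as nemo_multi_evaluation
sample_request = {
    "evaluations": [
        {
            "name": "Distributed LLaMA Evaluation",
            "backends": [
                {
                    "name": "academic-evaluator",
                    "benchmarks": [{"name": "mmlu_pro"}, {"name": "arc_challenge"}],
                },
                {
                    "name": "math-evaluator",
                    "benchmarks": [{"name": "gsm8k"}, {"name": "math"}],
                },
            ],
        }
    ]
}

backend_summary = summarize_backends(sample_request)
for evaluation_name, backend_name, benchmarks in backend_summary:
    print(f"{evaluation_name} -> {backend_name}: {', '.join(benchmarks)}")
```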

## Evaluation Status Monitoring

### Check Evaluation Status

In [None]:
# Function to check evaluation status
def check_evaluation_status(request_id: str):
    response = api_request("GET", f"/evaluations/{request_id}")

    if response.status_code == 200:
        status_data = response.json()
        print(f"📊 Evaluation Status for {request_id}")
        print(f"Status: {status_data['status']}")
        print(f"Progress: {status_data.get('progress_percentage', 0):.1f}%")
        print(f"Total evaluations: {status_data.get('total_evaluations', 0)}")
        print(f"Completed: {status_data.get('completed_evaluations', 0)}")
        print(f"Failed: {status_data.get('failed_evaluations', 0)}")

        if status_data.get('results'):
            print(f"Results available: {len(status_data['results'])}")

        return status_data
    else:
        print(f"❌ Failed to get status: {response.text}")
        return None

# Check status of a previously created evaluation (if one exists).
# The membership test never raises NameError, so handle the missing case explicitly.
if 'request_id' in globals():
    check_evaluation_status(request_id)
else:
    print("No evaluation request_id available to check")

### Monitor Evaluation Progress

In [None]:
# Function to monitor evaluation until completion
def monitor_evaluation(request_id: str, max_wait_time: int = 300):
    """Monitor an evaluation until completion or timeout."""
    start_time = time.time()

    while time.time() - start_time < max_wait_time:
        status_data = check_evaluation_status(request_id)

        if not status_data:
            break

        status = status_data['status']

        if status in ['completed', 'failed', 'cancelled']:
            print(f"🏁 Evaluation {status}!")

            if status == 'completed' and status_data.get('results'):
                print("\n📊 Results Summary:")
                for result in status_data['results'][:3]:  # Show first 3 results
                    print(f"  - {result['benchmark_name']}: {result['status']}")
                    if result.get('metrics'):
                        for metric, value in list(result['metrics'].items())[:2]:
                            print(f"    {metric}: {value}")

            return status_data

        print(f"⏳ Still {status}, waiting...")
        time.sleep(10)

    print(f"⏰ Monitoring timed out after {max_wait_time} seconds")
    return None

# Example usage (uncomment if you have a running evaluation)
# monitor_evaluation(request_id)
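The monitor above polls at a fixed 10-second interval. For long-running evaluations, exponential backoff reduces the number of requests. The sketch below is one possible variant, not the service's recommended approach: `poll_until_done` is hypothetical, the status fetcher and sleep function are injected so the loop can be exercised without a server, and the `pending` and `running` statuses in the simulated sequence are assumed names.

```python
import time
from typing import Any, Callable, Optional

# Terminal statuses taken from monitor_evaluation above
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def poll_until_done(
    fetch_status: Callable[[], Optional[dict[str, Any]]],
    max_wait_time: float = 300.0,
    initial_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict[str, Any]]:
    """Poll fetch_status with exponential backoff until a terminal status or timeout."""
    deadline = time.monotonic() + max_wait_time
    delay = initial_delay
    while time.monotonic() < deadline:
        status_data = fetch_status()
        if status_data is None:
            return None  # Lookup failed; give up instead of retrying blindly
        if status_data.get("status") in TERMINAL_STATUSES:
            return status_data
        sleep(delay)
        delay = min(delay * 2, max_delay)  # Double the wait, capped at max_delay
    return None


# Simulated status sequence standing in for real API responses (status names assumed)
fake_responses = iter([{"status": "pending"}, {"status": "running"}, {"status": "completed"}])
recorded_delays: list[float] = []

final_status = poll_until_done(
    fetch_status=lambda: next(fake_responses),
    sleep=recorded_delays.append,  # Record delays instead of actually sleeping
)
print(final_status, recorded_delays)
```

Against a live service, `fetch_status` could be `lambda: check_evaluation_status(request_id)`.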

## List All Evaluations

In [None]:
response = api_request("GET", "/evaluations")

if response.status_code == 200:
    evaluations = response.json()
    print(f"📋 Active evaluations: {len(evaluations)}")

    for eval_resp in evaluations:
        print(f"\n🔍 {eval_resp['request_id']}")
        print(f"   Status: {eval_resp['status']}")
        print(f"   Progress: {eval_resp.get('progress_percentage', 0):.1f}%")
        print(f"   Created: {eval_resp['created_at']}")
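When the list grows, a per-status tally is often more useful than printing every entry. The sketch below assumes each item carries a `status` field as in the loop above; `count_by_status` and the sample data are hypothetical.

```python
from collections import Counter
from typing import Any


def count_by_status(evaluations: list[dict[str, Any]]) -> Counter:
    """Tally evaluations by their 'status' field; missing statuses count as 'unknown'."""
    return Counter(e.get("status", "unknown") for e in evaluations)


# Illustrative list shaped like the GET /evaluations response used above
sample_evaluations = [
    {"request_id": "a", "status": "running"},
    {"request_id": "b", "status": "completed"},
    {"request_id": "c", "status": "running"},
    {"request_id": "d"},
]

status_counts = count_by_status(sample_evaluations)
for status, count in status_counts.most_common():
    print(f"{status}: {count}")
```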

## System Metrics

In [None]:
response = api_request("GET", "/metrics/system")

if response.status_code == 200:
    metrics = response.json()
    print("📊 System Metrics:")
    print(f"  Active evaluations: {metrics['active_evaluations']}")
    print(f"  Running tasks: {metrics['running_tasks']}")
    print(f"  Total requests: {metrics['total_requests']}")

    if metrics.get('status_breakdown'):
        print("\n  Status breakdown:")
        for status, count in metrics['status_breakdown'].items():
            print(f"    {status}: {count}")

    if metrics.get('memory_usage'):
        print("\n  Memory usage:")
        print(f"    Active evaluations: {metrics['memory_usage']['active_evaluations_mb']:.1f} MB")

## Evaluation Management

### Cancel an Evaluation

In [None]:
# Function to cancel an evaluation
def cancel_evaluation(request_id: str):
    response = api_request("DELETE", f"/evaluations/{request_id}")

    if response.status_code == 200:
        result = response.json()
        print(f"✅ {result['message']}")
        return True
    else:
        print(f"❌ Failed to cancel: {response.text}")
        return False

# Example usage (uncomment if you want to cancel an evaluation)
# cancel_evaluation(request_id)

## Error Handling Examples

### Invalid Request Handling

In [None]:
# Example of invalid request to demonstrate error handling
invalid_request = {
    "request_id": "invalid-uuid-format",
    "evaluations": [
        {
            "name": "",  # Invalid: empty name
            "model": {
                "server": "",  # Invalid: empty server
                "name": ""  # Invalid: empty model name
            },
            "backends": []  # Invalid: no backends
        }
    ]
}

print("📝 Testing error handling with invalid request...")
response = api_request("POST", "/evaluations", json=invalid_request)

if response.status_code >= 400:
    print("✅ Error handling working correctly")
    error_data = response.json()
    print(f"Status code: {response.status_code}")
    print(f"Error message: {error_data.get('detail', 'Unknown error')}")
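The `detail` field read above may be a plain string or, in FastAPI-style validation responses, a list of objects with `loc` and `msg` keys. Whether Eval Hub returns exactly that shape is an assumption, so the hypothetical helper below handles both cases and falls back to `str()` for anything else. The sample bodies are invented for illustration.

```python
from typing import Any


def format_error_detail(error_data: dict[str, Any]) -> list[str]:
    """Flatten an error body's 'detail' field into printable lines."""
    detail = error_data.get("detail", "Unknown error")
    if isinstance(detail, str):
        return [detail]
    if isinstance(detail, list):
        lines = []
        for item in detail:
            if isinstance(item, dict):
                # Join the location path, e.g. ["body", "evaluations", 0] -> "body.evaluations.0"
                location = ".".join(str(part) for part in item.get("loc", []))
                lines.append(f"{location or '<root>'}: {item.get('msg', 'invalid value')}")
            else:
                lines.append(str(item))
        return lines
    return [str(detail)]


# Invented sample bodies: a validation-style list and a plain string
sample_validation_error = {
    "detail": [
        {"loc": ["body", "evaluations", 0, "name"], "msg": "field is empty", "type": "value_error"}
    ]
}
sample_not_found = {"detail": "Evaluation not found"}

validation_lines = format_error_detail(sample_validation_error)
not_found_lines = format_error_detail(sample_not_found)
print(validation_lines)
print(not_found_lines)
```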

### Non-existent Resource Handling

In [None]:
# Test accessing non-existent evaluation
fake_request_id = str(uuid4())
print(f"🔍 Testing access to non-existent evaluation: {fake_request_id}")

response = api_request("GET", f"/evaluations/{fake_request_id}")

if response.status_code == 404:
    print("✅ 404 handling working correctly")
    error_data = response.json()
    print(f"Error: {error_data['detail']}")

## Advanced Examples

### Batch Evaluation Requests

In [None]:
# Create multiple evaluations for comparison
batch_requests = []

models_to_compare = ["gpt-4o-mini", "gpt-3.5-turbo"]
risk_levels = ["low", "medium"]

for model in models_to_compare:
    for risk in risk_levels:
        batch_request = {
            "request_id": str(uuid4()),
            "experiment_name": f"Batch Comparison - {model} - {risk} risk",
            "evaluations": [
                {
                    "name": f"{model} {risk} risk evaluation",
                    "model": {
                        "server": "default",
                        "name": "default"
                    },
                    "model_configuration": {
                        "temperature": 0.0,
                        "max_tokens": 256
                    },
                    "risk_category": risk,
                    "timeout_minutes": 30
                }
            ],
            "tags": {
                "batch_id": "model_comparison_001",
                "model": model,
                "risk_level": risk
            }
        }
        batch_requests.append(batch_request)

print(f"📦 Creating {len(batch_requests)} batch evaluations...")

batch_results = []
for i, request in enumerate(batch_requests):
    print(f"\n📝 Creating batch request {i+1}/{len(batch_requests)}")
    response = api_request("POST", "/evaluations", json=request)

    if response.status_code == 202:
        batch_results.append(response.json())
        print(f"✅ Batch {i+1} created: {response.json()['request_id']}")
    else:
        print(f"❌ Batch {i+1} failed")

print(f"\n📊 Successfully created {len(batch_results)} batch evaluations")
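The nested loops that build the batch can also be written with `itertools.product`, which scales more cleanly when a third dimension (for example, temperature) is added. This sketch only enumerates the combinations and experiment names and makes no API calls.

```python
from itertools import product

models_to_compare = ["gpt-4o-mini", "gpt-3.5-turbo"]
risk_levels = ["low", "medium"]

# product() yields every (model, risk) pair in the same order as the nested loops
combinations = list(product(models_to_compare, risk_levels))
experiment_names = [
    f"Batch Comparison - {model} - {risk} risk" for model, risk in combinations
]

for name in experiment_names:
    print(name)
```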

### Configuration Validation

In [None]:
# Test various configuration combinations
test_configs = [
    {
        "name": "High timeout test",
        "config": {"timeout_minutes": 120, "retry_attempts": 5},
        "expected": "success"
    },
    {
        "name": "Zero timeout test",
        "config": {"timeout_minutes": 0, "retry_attempts": 1},
        "expected": "validation_error"
    },
    {
        "name": "Negative retry test",
        "config": {"timeout_minutes": 30, "retry_attempts": -1},
        "expected": "validation_error"
    }
]

for test in test_configs:
    print(f"\n🧪 Testing: {test['name']}")

    test_request = {
        "request_id": str(uuid4()),
        "experiment_name": test['name'],
        "evaluations": [
            {
                "name": "Config test",
                "model": {
                    "server": "default",
                    "name": "default"
                },
                "risk_category": "low",
                **test['config']
            }
        ]
    }

    response = api_request("POST", "/evaluations", json=test_request)

    if test['expected'] == "success" and response.status_code == 202:
        print("✅ Test passed")
    elif test['expected'] == "validation_error" and response.status_code >= 400:
        print("✅ Validation correctly rejected invalid config")
    else:
        print(f"❌ Unexpected result: {response.status_code}")
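The same expectations can be checked locally before a request is sent. The rules in this sketch (a positive timeout and a non-negative retry count) are inferred from the expected outcomes in the test table above and are an assumption about the server's validation, not a copy of it. `precheck_config` is a hypothetical helper.

```python
def precheck_config(config: dict) -> list[str]:
    """Return reasons a config would likely be rejected; an empty list if none are found."""
    problems = []
    timeout = config.get("timeout_minutes")
    retries = config.get("retry_attempts")
    if timeout is not None and timeout <= 0:
        problems.append("timeout_minutes must be positive")
    if retries is not None and retries < 0:
        problems.append("retry_attempts must not be negative")
    return problems


# The three configurations exercised in the cell above
candidates = (
    {"timeout_minutes": 120, "retry_attempts": 5},
    {"timeout_minutes": 0, "retry_attempts": 1},
    {"timeout_minutes": 30, "retry_attempts": -1},
)
for candidate in candidates:
    print(candidate, "->", precheck_config(candidate) or "ok")
```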

## Retrieving and Formatting Results

### Get Evaluation Results in NeMo Evaluator Format

In [None]:
# Function to format eval-hub results to NeMo Evaluator compatible format
def format_to_nemo_evaluator(eval_hub_result):
    """
    Convert eval-hub result format to NeMo Evaluator EvaluationResult format.

    Expected NeMo format:
    {
        "tasks": {
            "task_name": {
                "metrics": {
                    "metric_name": {
                        "scores": {
                            "score_name": {
                                "value": float,
                                "stats": {
                                    "count": int,
                                    "sum": float,
                                    "mean": float,
                                    "stderr": float,
                                    "min": float,
                                    "max": float,
                                    "variance": float,
                                    "stddev": float
                                }
                            }
                        }
                    }
                }
            }
        },
        "groups": { ... }  # Same structure as tasks
    }
    """
    nemo_result = {
        "tasks": {},
        "groups": {}
    }

    # Extract benchmark results from eval-hub format
    if 'results' in eval_hub_result:
        for result in eval_hub_result['results']:
            benchmark_name = result.get('benchmark_name', 'unknown_benchmark')
            metrics = result.get('metrics', {})

            # Convert metrics to NeMo format
            nemo_metrics = {}
            for metric_name, metric_value in metrics.items():
                if isinstance(metric_value, (int, float)):
                    # Simple scalar metric
                    nemo_metrics[metric_name] = {
                        "scores": {
                            metric_name: {
                                "value": float(metric_value),
                                "stats": {
                                    "count": 1,
                                    "sum": float(metric_value),
                                    "mean": float(metric_value),
                                    "stderr": 0.0,
                                    "min": float(metric_value),
                                    "max": float(metric_value),
                                    "variance": 0.0,
                                    "stddev": 0.0
                                }
                            }
                        }
                    }
                elif isinstance(metric_value, dict) and 'value' in metric_value:
                    # Structured metric with stats
                    stats = metric_value.get('stats', {})
                    nemo_metrics[metric_name] = {
                        "scores": {
                            metric_name: {
                                "value": float(metric_value['value']),
                                "stats": {
                                    "count": stats.get('count', 1),
                                    "sum": stats.get('sum', metric_value['value']),
                                    "mean": stats.get('mean', metric_value['value']),
                                    "stderr": stats.get('stderr', 0.0),
                                    "min": stats.get('min', metric_value['value']),
                                    "max": stats.get('max', metric_value['value']),
                                    "variance": stats.get('variance', 0.0),
                                    "stddev": stats.get('stddev', 0.0)
                                }
                            }
                        }
                    }

            # Add to tasks
            nemo_result["tasks"][benchmark_name] = {"metrics": nemo_metrics}

            # Add to groups (using provider as group name)
            provider_id = result.get('provider_id', 'unknown_provider')
            if provider_id not in nemo_result["groups"]:
                nemo_result["groups"][provider_id] = {"metrics": {}}

            # Keep the first value seen per metric at group level (no averaging is done here)
            for metric_name, metric_data in nemo_metrics.items():
                if metric_name not in nemo_result["groups"][provider_id]["metrics"]:
                    nemo_result["groups"][provider_id]["metrics"][metric_name] = metric_data

    return nemo_result

# Example: Get results for a completed evaluation
def get_evaluation_results_nemo_format(request_id: str):
    """Get evaluation results and format them for NeMo Evaluator compatibility."""
    print(f"🔍 Retrieving results for evaluation: {request_id}")

    response = api_request("GET", f"/evaluations/{request_id}")

    if response.status_code != 200:
        print(f"❌ Failed to get evaluation results: {response.text}")
        return None

    eval_data = response.json()

    # Check if evaluation is completed
    if eval_data.get('status') != 'completed':
        print(f"⏳ Evaluation not completed yet. Status: {eval_data.get('status')}")
        print(f"📊 Progress: {eval_data.get('progress_percentage', 0):.1f}%")
        return None

    print("✅ Evaluation completed successfully!")
    print(f"📊 Total evaluations: {eval_data.get('total_evaluations', 0)}")
    print(f"✅ Completed: {eval_data.get('completed_evaluations', 0)}")
    print(f"❌ Failed: {eval_data.get('failed_evaluations', 0)}")

    # Format to NeMo Evaluator structure
    nemo_formatted = format_to_nemo_evaluator(eval_data)

    print("\n🎯 Results formatted for NeMo Evaluator:")
    print_json(nemo_formatted)

    return nemo_formatted

# Example usage with a completed evaluation
# Replace with an actual request_id from a completed evaluation
example_request_id = "00000000-0000-0000-0000-000000000000"  # Placeholder

print("📝 Example: Retrieving evaluation results...")
print(f"Note: Replace '{example_request_id}' with the actual request ID of a completed evaluation")

# Simulated example of what the formatted result would look like
example_nemo_result = {
    "tasks": {
        "arc_easy": {
            "metrics": {
                "acc": {
                    "scores": {
                        "acc": {
                            "value": 0.7534,
                            "stats": {
                                "count": 2376,
                                "sum": 1790.0,
                                "mean": 0.7534,
                                "stderr": 0.0088,
                                "min": 0.0,
                                "max": 1.0,
                                "variance": 0.1856,
                                "stddev": 0.4307
                            }
                        }
                    }
                },
                "acc_norm": {
                    "scores": {
                        "acc_norm": {
                            "value": 0.7447,
                            "stats": {
                                "count": 2376,
                                "sum": 1769.0,
                                "mean": 0.7447,
                                "stderr": 0.0089,
                                "min": 0.0,
                                "max": 1.0,
                                "variance": 0.1902,
                                "stddev": 0.4361
                            }
                        }
                    }
                }
            }
        },
        "humaneval": {
            "metrics": {
                "pass_at_1": {
                    "scores": {
                        "pass_at_1": {
                            "value": 0.6451,
                            "stats": {
                                "count": 164,
                                "sum": 105.8,
                                "mean": 0.6451,
                                "stderr": 0.0374,
                                "min": 0.0,
                                "max": 1.0,
                                "variance": 0.229,
                                "stddev": 0.4784
                            }
                        }
                    }
                },
                "bleu": {
                    "scores": {
                        "bleu": {
                            "value": 0.1234,
                            "stats": {
                                "count": 164,
                                "sum": 20.24,
                                "mean": 0.1234,
                                "stderr": 0.0156,
                                "min": 0.0,
                                "max": 1.0,
                                "variance": 0.0399,
                                "stddev": 0.1998
                            }
                        }
                    }
                }
            }
        }
    },
    "groups": {
        "lm_evaluation_harness": {
            "metrics": {
                "avg_score": {
                    "scores": {
                        "avg_score": {
                            "value": 0.6993,
                            "stats": {
                                "count": 2,
                                "sum": 1.3986,
                                "mean": 0.6993,
                                "stderr": 0.0542,
                                "min": 0.6451,
                                "max": 0.7534,
                                "variance": 0.0058,
                                "stddev": 0.0765
                            }
                        }
                    }
                }
            }
        }
    }
}

print("\n📄 Example NeMo Evaluator formatted result:")
print_json(example_nemo_result)
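
The `stats` blocks in the example follow the usual relationships between summary statistics: `mean` is `sum / count`, `variance` is `stddev ** 2`, and `stderr` is `stddev / sqrt(count)`. Below is a minimal sketch of a sanity check for one such block, assuming those conventions hold for your backend. The helper name `check_stats` and the tolerance are illustrative and not part of the Eval Hub API.

```python
import math

def check_stats(stats, tol=1e-3):
    """Return a list of inconsistencies found in one NeMo-style stats block."""
    problems = []
    count = stats["count"]
    if count <= 0:
        return ["count must be positive"]
    # Each comparison uses an absolute tolerance because the values are rounded
    if not math.isclose(stats["mean"], stats["sum"] / count, abs_tol=tol):
        problems.append("mean != sum / count")
    if not math.isclose(stats["variance"], stats["stddev"] ** 2, abs_tol=tol):
        problems.append("variance != stddev ** 2")
    if not math.isclose(stats["stderr"], stats["stddev"] / math.sqrt(count), abs_tol=tol):
        problems.append("stderr != stddev / sqrt(count)")
    if not stats["min"] <= stats["mean"] <= stats["max"]:
        problems.append("mean outside [min, max]")
    return problems

# Values copied from the example stats block with count 164 and mean 0.6451
sample_stats = {
    "count": 164, "sum": 105.8, "mean": 0.6451, "stderr": 0.0374,
    "min": 0.0, "max": 1.0, "variance": 0.229, "stddev": 0.4784,
}
print(check_stats(sample_stats))
```

An empty list means the block is internally consistent within the chosen tolerance.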

### Save Results to File

In [None]:
# Save NeMo-formatted results to a JSON file
def save_nemo_results(nemo_result, filename="nemo_evaluation_results.json"):
    """Save NeMo Evaluator formatted results to a JSON file."""
    # `json` is already imported in the setup cell
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(nemo_result, f, indent=2)

    print(f"💾 Results saved to {filename}")

# Example usage
# save_nemo_results(example_nemo_result, "coding_reasoning_collection_results.json")
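
To confirm that a saved file can be read back unchanged, a round trip through a temporary directory is usually enough. The sketch below is self-contained and uses a tiny stand-in result; `load_nemo_results` is a hypothetical counterpart to the save function, not something the Eval Hub provides.

```python
import json
import os
import tempfile

def load_nemo_results(filename):
    """Load a NeMo-formatted results file from disk (hypothetical helper)."""
    with open(filename, encoding="utf-8") as f:
        return json.load(f)

# Round-trip a tiny stand-in result through a temporary file
tiny_result = {"tasks": {}, "groups": {}}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "nemo_evaluation_results.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(tiny_result, f, indent=2)
    reloaded = load_nemo_results(path)

print(reloaded == tiny_result)
```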

### Validate NeMo Format Compatibility

In [None]:
# Function to validate NeMo Evaluator format compatibility
def validate_nemo_format(result_dict):
    """
    Validate that the result dictionary conforms to NeMo Evaluator format.

    Returns: (is_valid: bool, errors: list)
    """
    errors = []

    # Check top-level structure
    if not isinstance(result_dict, dict):
        errors.append("Result must be a dictionary")
        return False, errors

    if "tasks" not in result_dict:
        errors.append("Missing required 'tasks' field")

    if "groups" not in result_dict:
        errors.append("Missing required 'groups' field")

    # Validate tasks structure
    if "tasks" in result_dict:
        tasks = result_dict["tasks"]
        if not isinstance(tasks, dict):
            errors.append("'tasks' must be a dictionary")
        else:
            for task_name, task_data in tasks.items():
                if not isinstance(task_data, dict):
                    errors.append(f"Task '{task_name}' must be a dictionary")
                    continue

                if "metrics" not in task_data:
                    errors.append(f"Task '{task_name}' missing 'metrics' field")
                    continue

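                metrics = task_data["metrics"]
                if not isinstance(metrics, dict):
                    errors.append(f"Task '{task_name}' 'metrics' must be a dictionary")
                    continue

                for metric_name, metric_data in metrics.items():
                    # Guard against non-dict entries before membership tests
                    if not isinstance(metric_data, dict) or "scores" not in metric_data:
                        errors.append(f"Metric '{metric_name}' in task '{task_name}' missing 'scores'")
                        continue

                    scores = metric_data["scores"]
                    if not isinstance(scores, dict):
                        errors.append(f"Metric '{metric_name}' in task '{task_name}' 'scores' must be a dictionary")
                        continue

                    for score_name, score_data in scores.items():
                        if not isinstance(score_data, dict):
                            errors.append(f"Score '{score_name}' must be a dictionary")
                            continue
                        if "value" not in score_data:
                            errors.append(f"Score '{score_name}' missing 'value'")
                        if "stats" not in score_data:
                            errors.append(f"Score '{score_name}' missing 'stats'")
                        elif not isinstance(score_data["stats"], dict):
                            errors.append(f"Score '{score_name}' stats must be a dictionary")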

    is_valid = len(errors) == 0
    return is_valid, errors

# Validate the example result
is_valid, validation_errors = validate_nemo_format(example_nemo_result)

print(f"🔍 NeMo format validation: {'✅ Valid' if is_valid else '❌ Invalid'}")
if validation_errors:
    print("Validation errors:")
    for error in validation_errors:
        print(f"  - {error}")
else:
    print("✅ All format requirements satisfied")
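
The validator checks that `groups` is present but does not descend into it. In the example result, each group carries the same nesting as a task (`metrics`, then `scores`, then `value` and `stats`), so a similar walk can cover it. Below is a minimal sketch under that assumption; `validate_groups` is a hypothetical helper, not part of the Eval Hub or NeMo Evaluator APIs.

```python
def validate_groups(result_dict):
    """Return a list of structural errors found under result_dict['groups']."""
    errors = []
    groups = result_dict.get("groups")
    if not isinstance(groups, dict):
        return ["'groups' must be a dictionary"]

    for group_name, group_data in groups.items():
        metrics = group_data.get("metrics") if isinstance(group_data, dict) else None
        if not isinstance(metrics, dict):
            errors.append(f"Group '{group_name}' missing 'metrics' dictionary")
            continue
        for metric_name, metric_data in metrics.items():
            scores = metric_data.get("scores") if isinstance(metric_data, dict) else None
            if not isinstance(scores, dict):
                errors.append(f"Metric '{metric_name}' in group '{group_name}' missing 'scores'")
                continue
            for score_name, score_data in scores.items():
                if not isinstance(score_data, dict):
                    errors.append(f"Score '{score_name}' in group '{group_name}' must be a dictionary")
                    continue
                for field in ("value", "stats"):
                    if field not in score_data:
                        errors.append(f"Score '{score_name}' in group '{group_name}' missing '{field}'")
    return errors

# A well-formed group and one whose score lacks 'stats'
good = {"groups": {"lm_evaluation_harness": {"metrics": {"avg_score": {"scores": {
    "avg_score": {"value": 0.6993, "stats": {"count": 2}}}}}}}}
bad = {"groups": {"lm_evaluation_harness": {"metrics": {"avg_score": {"scores": {
    "avg_score": {"value": 0.6993}}}}}}}

print(validate_groups(good))
print(validate_groups(bad))
```

The errors returned here can be appended to the list produced by the task validator to report on both sections at once.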

## Summary

This notebook demonstrated comprehensive usage of the Eval Hub API including:

- ✅ **Basic Operations**: Health checks, provider/benchmark discovery
- ✅ **Collection Management**: Create custom collections, list collections, and get detailed collection information
- ✅ **Collection-Based Evaluations**: Execute evaluations using collections with automatic provider task aggregation
- ✅ **Collection Results**: Monitor collection progress, retrieve aggregate results, and compare collection performance
- ✅ **Model Management**: Register, list, update, and delete models
- ✅ **Simple Evaluations**: Risk category-based evaluations
- ✅ **Advanced Evaluations**: Explicit backend configuration
- ✅ **NeMo Integration**: Single and multi-container setups
- ✅ **Monitoring**: Status checking and progress tracking
- ✅ **Management**: Cancellation and system metrics
- ✅ **Error Handling**: Validation and error responses
- ✅ **Batch Operations**: Multiple evaluation management
- ✅ **Result Formatting**: NeMo Evaluator compatible result transformation and validation

For production use, remember to:

- Use proper API keys and authentication
- Configure appropriate timeouts for your evaluation complexity
- Monitor resource usage and system metrics
- Handle errors gracefully in your applications
- Use the async evaluation mode for long-running evaluations
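
One way to handle transient failures gracefully is to wrap each API call in a retry loop with exponential backoff. The sketch below uses only the standard library and takes any zero-argument callable; the name `call_with_retries`, its parameters, and the default retry settings are assumptions for illustration, not Eval Hub features.

```python
import time

def call_with_retries(func, max_attempts=3, base_delay=0.5,
                      retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))

# Demonstration with a stand-in function that fails twice, then succeeds
attempts = {"n": 0}

def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

recorded_delays = []
result = call_with_retries(flaky_call, sleep=recorded_delays.append)
print(result, recorded_delays)
```

With `requests`, the callable could be something like `lambda: requests.get(url, headers=headers, timeout=30)` with `retry_on=(requests.ConnectionError, requests.Timeout)`. The exact headers depend on how authentication is configured in your deployment.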

The Eval Hub API orchestrates machine learning model evaluations across multiple backends and evaluation frameworks.