# LightEval Framework Adapter - External Client Demo

**Architecture Overview:**
- 📦 **Container**: Pre-built container running LightEval + EvalHub SDK adapter (exposes REST API)
- 💻 **This Notebook**: External client making HTTP requests to the container

```
┌─────────────────────┐    HTTP/REST    ┌──────────────────────────┐
│   Jupyter Notebook  │ ──────────────▶ │        Container         │
│   (External Client) │                 │  ┌─────────────────────┐ │
│                     │ ◀────────────── │  │ LightEval + Adapter │ │
│ - Make requests     │    JSON API     │  │   (Port 8000)       │ │
│ - Display results   │                 │  └─────────────────────┘ │
│ - Test endpoints    │                 └──────────────────────────┘
└─────────────────────┘
```

## What This Demo Shows

1. **Connecting** to a pre-built LightEval adapter container
2. **Making HTTP requests** from this notebook to test all API endpoints
3. **Submitting evaluation jobs** and monitoring their progress
4. **Retrieving results** and handling different response types
5. **Testing error handling** and edge cases

## Prerequisites

- **Pre-built container**: The container must already be built and running
- **Python with `requests`**: Used as the HTTP client
- **Network access**: Needed to reach the container's API
- **Container listening on port 8000**: The default port configuration

### Building and Running the Container

If you haven't built the container yet, use these commands:

```bash
# Build the container (from evalhub-sdk root directory)
podman build -t evalhub/lighteval-adapter:latest -f examples/lighteval_adapter/Dockerfile .

# Run the container
podman run -d \
  --name lighteval-adapter \
  -p 8000:8000 \
  --health-cmd='curl -f http://localhost:8000/api/v1/health || exit 1' \
  --health-interval=30s \
  --health-timeout=10s \
  --health-start-period=30s \
  --health-retries=3 \
  evalhub/lighteval-adapter:latest

# Check container status
podman ps
```

**Note**: This notebook assumes the container is already built and running.

In [1]:
# 📋 IMPORTANT: This notebook runs OUTSIDE the container!
#
# This notebook makes HTTP requests to a running LightEval adapter container.
# The container contains:
#   - LightEval evaluation framework
#   - EvalHub SDK adapter wrapper
#   - REST API server (FastAPI)
#
# This notebook acts as an external client testing the API.
# Make sure the container is already built and running before executing this notebook.

import json
import time
from typing import Any

import requests

print("External client notebook started")
print("Connecting to running containerized LightEval adapter")
print("All communication via HTTP/REST API")
print("Make sure container is running on port 8000!")

External client notebook started
Connecting to running containerized LightEval adapter
All communication via HTTP/REST API
Make sure container is running on port 8000!


## Configuration

In [2]:
# Configuration
CONTAINER_NAME = "lighteval-adapter"
IMAGE_NAME = "evalhub/lighteval-adapter"
IMAGE_TAG = "latest"
FULL_IMAGE_NAME = f"{IMAGE_NAME}:{IMAGE_TAG}"
CONTAINER_PORT = 8000
HOST_PORT = 8000
BASE_URL = f"http://localhost:{HOST_PORT}"
API_BASE = f"{BASE_URL}/api/v1"

print(f"Container: {CONTAINER_NAME}")
print(f"Image: {FULL_IMAGE_NAME}")
print(f"API Base URL: {API_BASE}")

# Helper function for making HTTP requests
def make_request(
    method: str, endpoint: str, data: dict[str, Any] | None = None
) -> dict[str, Any] | list[Any]:
    """Make an HTTP request to the containerized adapter and return the parsed response.

    JSON bodies are returned as parsed (a dict or a list), non-JSON bodies as
    {"text": ...}, and request failures as {"error": ...}.
    """
    url = f"{API_BASE}{endpoint}"
    print(f"\n🌐 External HTTP Request: {method.upper()} {url}")

    try:
        if method.lower() == 'get':
            response = requests.get(url, timeout=10)
        elif method.lower() == 'post':
            response = requests.post(url, json=data, timeout=10)
        elif method.lower() == 'delete':
            response = requests.delete(url, timeout=10)
        else:
            raise ValueError(f"Unsupported method: {method}")

        print(f"📡 Response Status: {response.status_code}")

        if response.headers.get('content-type', '').startswith('application/json'):
            result = response.json()
            print(f"📋 JSON Response: {json.dumps(result, indent=2)}")
            return result
        else:
            print(f"📄 Text Response: {response.text}")
            return {"text": response.text}

    except Exception as e:
        print(f"❌ Request Error: {e}")
        return {"error": str(e)}

Container: lighteval-adapter
Image: evalhub/lighteval-adapter:latest
API Base URL: http://localhost:8000/api/v1


In [3]:
# Container build commands (run these manually in terminal)
build_command = f"podman build -t {FULL_IMAGE_NAME} -f examples/lighteval_adapter/Dockerfile ."
run_command = f"podman run -d --name {CONTAINER_NAME} -p {HOST_PORT}:{CONTAINER_PORT} {FULL_IMAGE_NAME}"

print("To build and run the container, execute these commands in your terminal:")
print(f"1. Build: {build_command}")
print(f"2. Run: {run_command}")
print("3. Then continue with this notebook")

To build and run the container, execute these commands in your terminal:
1. Build: podman build -t evalhub/lighteval-adapter:latest -f examples/lighteval_adapter/Dockerfile .
2. Run: podman run -d --name lighteval-adapter -p 8000:8000 evalhub/lighteval-adapter:latest
3. Then continue with this notebook


In [4]:
# Wait for the container to be ready
print("Waiting for container to be ready...")
max_retries = 30
retry_count = 0

while retry_count < max_retries:
    try:
        response = requests.get(f"{API_BASE}/health", timeout=5)
        if response.status_code == 200:
            print("✅ Container is ready!")
            break
    except requests.exceptions.RequestException:
        pass

    retry_count += 1
    print(f"Retry {retry_count}/{max_retries}...")
    time.sleep(2)

if retry_count >= max_retries:
    print("❌ Container failed to become ready")
    print("Tip: Check container logs with: podman logs lighteval-adapter")
    raise RuntimeError("Container not ready after maximum retries")

Waiting for container to be ready...
✅ Container is ready!


In [5]:
# Check if the container is running
import subprocess


def check_container_status():
    """Check if the lighteval-adapter container is running."""
    try:
        result = subprocess.run(
            ["podman", "ps", "--filter", "name=lighteval-adapter", "--format", "{{.Names}} {{.Status}}"],
            capture_output=True,
            text=True,
            check=True
        )
        if "lighteval-adapter" in result.stdout:
            print(f"✅ Container found: {result.stdout.strip()}")
            return True
        else:
            print("❌ Container 'lighteval-adapter' not found in running containers")
            return False
    except subprocess.CalledProcessError as e:
        print(f"❌ Failed to check container status: {e}")
        return False
    except FileNotFoundError:
        print("❌ Podman not found. Make sure podman is installed and in PATH.")
        return False

# Check container status
container_running = check_container_status()

if not container_running:
    print("\nTo start the container, run these commands:")
    print("podman run -d \\")
    print("  --name lighteval-adapter \\")
    print("  -p 8000:8000 \\")
    print("  --health-cmd='curl -f http://localhost:8000/api/v1/health || exit 1' \\")
    print("  --health-interval=30s \\")
    print("  evalhub/lighteval-adapter:latest")
    print("\nThen re-run this notebook.")

✅ Container found: lighteval-adapter Up 10 seconds


In [6]:
# Wait for API to be fully ready (optional)
print("Waiting for API to be fully ready...")
max_retries = 10
retry_count = 0

while retry_count < max_retries:
    try:
        response = requests.get(f"{API_BASE}/health", timeout=5)
        if response.status_code == 200:
            print("✅ API is ready!")
            break
    except requests.exceptions.RequestException:
        pass

    retry_count += 1
    print(f"Retry {retry_count}/{max_retries}...")
    time.sleep(1)

if retry_count >= max_retries:
    print("❌ API not responding after retries")
    print("Check container logs: podman logs lighteval-adapter")
else:
    print("🚀 Ready to test LightEval adapter endpoints!")

Waiting for API to be fully ready...
✅ API is ready!
🚀 Ready to test LightEval adapter endpoints!
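
The two readiness loops above differ only in their retry count and delay, so they could share one helper. The following is a minimal sketch, assuming a hypothetical `wait_until` function (not part of the notebook or the EvalHub SDK) that accepts any zero-argument probe. The `sleep` parameter is injectable so the helper can be exercised without a running container.

```python
import time
from typing import Callable


def wait_until(
    probe: Callable[[], bool],
    max_retries: int = 30,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() up to max_retries times; return True on the first success."""
    for _ in range(max_retries):
        try:
            if probe():
                return True
        except Exception:
            # Treat probe errors (e.g. connection refused) as "not ready yet".
            pass
        sleep(delay)
    return False


# Stand-in probe that succeeds on the third call. In the notebook, the probe
# would wrap the health check, e.g.
# lambda: requests.get(f"{API_BASE}/health", timeout=5).status_code == 200
attempts = {"count": 0}


def fake_probe() -> bool:
    attempts["count"] += 1
    return attempts["count"] >= 3


ready = wait_until(fake_probe, max_retries=5, delay=0.0, sleep=lambda _: None)
print(ready, attempts["count"])
```

Returning a boolean instead of raising leaves the caller free to decide whether a timeout is fatal, as in the first loop, or only a warning, as in the second.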


## Step 2: Test the Adapter Endpoints via HTTP

🌐 **Now we'll test the running LightEval adapter from this external notebook**

The container is running the LightEval framework + EvalHub adapter on port 8000.  
This notebook acts as an external client making HTTP requests to test all endpoints.

In [7]:
# Get framework information
info_response = make_request('GET', '/info')

if 'framework_id' in info_response:
    print(f"✅ Framework: {info_response['name']}")
    print(f"   Version: {info_response['version']}")
    print(f"   Supported benchmarks: {len(info_response.get('supported_benchmarks', []))}")
    print(f"   Supported models: {', '.join(info_response.get('supported_model_types', []))}")
else:
    print("❌ Failed to get framework info")


🌐 External HTTP Request: GET http://localhost:8000/api/v1/info
📡 Response Status: 200
📋 JSON Response: {
  "framework_id": "lighteval",
  "name": "LightEval Framework Adapter",
  "version": "1.0.0",
  "description": "LightEval is a lightweight evaluation framework for language models",
  "supported_benchmarks": [
    {
      "benchmark_id": "hellaswag",
      "name": "HellaSwag",
      "description": "LightEval task: HellaSwag",
      "category": "Commonsense reasoning",
      "tags": [],
      "metrics": [
        "accuracy"
      ],
      "dataset_size": null,
      "supports_few_shot": true,
      "default_few_shot": null,
      "custom_config_schema": null
    },
    {
      "benchmark_id": "arc:easy",
      "name": "ARC Easy",
      "description": "LightEval task: ARC Easy",
      "category": "Scientific reasoning",
      "tags": [],
      "metrics": [
        "accuracy"
      ],
      "dataset_size": null,
      "supports_few_shot": true,
      "default_few_shot": null,

In [8]:
# Test initial connection to the container API
try:
    response = requests.get(f"{API_BASE}/health", timeout=5)
    if response.status_code == 200:
        print("✅ Successfully connected to LightEval adapter API!")
        print(f"   API Base URL: {API_BASE}")
        print(f"   Health check: {response.status_code}")
    else:
        print(f"API responded with status code: {response.status_code}")
        print("Container may be starting up. Wait a moment and try again.")
except requests.exceptions.RequestException as e:
    print(f"❌ Failed to connect to API at {API_BASE}")
    print(f"Error: {e}")
    print("\nMake sure the container is running on port 8000.")

✅ Successfully connected to LightEval adapter API!
   API Base URL: http://localhost:8000/api/v1
   Health check: 200


In [9]:
# List available benchmarks
benchmarks_response = make_request('GET', '/benchmarks')

if isinstance(benchmarks_response, list) and benchmarks_response:
    print(f"✅ Found {len(benchmarks_response)} benchmarks:")
    for i, benchmark in enumerate(benchmarks_response[:5]):  # Show first 5
        print(f"   {i+1}. {benchmark['name']} ({benchmark['benchmark_id']})")
        print(f"      Category: {benchmark.get('category', 'N/A')}")
        print(f"      Metrics: {', '.join(benchmark.get('metrics', []))}")

    if len(benchmarks_response) > 5:
        print(f"   ... and {len(benchmarks_response) - 5} more")
else:
    print("❌ No benchmarks found")


🌐 External HTTP Request: GET http://localhost:8000/api/v1/benchmarks
📡 Response Status: 200
📋 JSON Response: [
  {
    "benchmark_id": "hellaswag",
    "name": "HellaSwag",
    "description": "LightEval task: HellaSwag",
    "category": "Commonsense reasoning",
    "tags": [],
    "metrics": [
      "accuracy"
    ],
    "dataset_size": null,
    "supports_few_shot": true,
    "default_few_shot": null,
    "custom_config_schema": null
  },
  {
    "benchmark_id": "arc:easy",
    "name": "ARC Easy",
    "description": "LightEval task: ARC Easy",
    "category": "Scientific reasoning",
    "tags": [],
    "metrics": [
      "accuracy"
    ],
    "dataset_size": null,
    "supports_few_shot": true,
    "default_few_shot": null,
    "custom_config_schema": null
  },
  {
    "benchmark_id": "arc:challenge",
    "name": "ARC Challenge",
    "description": "LightEval task: ARC Challenge",
    "category": "Scientific reasoning",
    "tags": [],
    "metrics": [
      "accuracy"
    ]
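
Each entry in the `/benchmarks` response carries a `category` field, so the list can be grouped on the client side. The following is a minimal sketch, assuming a hypothetical `group_by_category` helper that is not part of the notebook. The sample data is a trimmed copy of the entries shown above.

```python
from collections import defaultdict
from typing import Any


def group_by_category(benchmarks: list[dict[str, Any]]) -> dict[str, list[str]]:
    """Map each category to the benchmark IDs listed under it."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for benchmark in benchmarks:
        # Fall back to "N/A" when the category is missing or null.
        category = benchmark.get("category") or "N/A"
        grouped[category].append(benchmark["benchmark_id"])
    return dict(grouped)


# Trimmed copy of the entries shown in the output above.
sample = [
    {"benchmark_id": "hellaswag", "category": "Commonsense reasoning"},
    {"benchmark_id": "arc:easy", "category": "Scientific reasoning"},
    {"benchmark_id": "arc:challenge", "category": "Scientific reasoning"},
]
print(group_by_category(sample))
```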

### Health Check

In [11]:
# Test health endpoint
health_response = make_request('GET', '/health')

if health_response.get('status') == 'healthy':
    print("✅ Adapter is healthy!")
else:
    print("❌ Adapter health check failed")


🌐 External HTTP Request: GET http://localhost:8000/api/v1/health
📡 Response Status: 200
📋 JSON Response: {
  "status": "healthy",
  "framework_id": "lighteval",
  "version": "1.0.0",
  "dependencies": {
    "lighteval": {
      "status": "available"
    }
  },
  "memory_usage": null,
  "gpu_usage": null,
  "uptime_seconds": 3600.0,
  "last_evaluation_time": null,
  "metadata": {}
}
✅ Adapter is healthy!


### Benchmark Details

In [12]:
# Get details for a specific benchmark
if isinstance(benchmarks_response, list) and benchmarks_response:
    test_benchmark_id = benchmarks_response[0]['benchmark_id']
    benchmark_detail = make_request('GET', f'/benchmarks/{test_benchmark_id}')

    if 'benchmark_id' in benchmark_detail:
        print(f"✅ Benchmark details for '{test_benchmark_id}':")
        print(f"   Name: {benchmark_detail['name']}")
        print(f"   Description: {benchmark_detail.get('description', 'N/A')}")
        print(f"   Category: {benchmark_detail.get('category', 'N/A')}")
        print(f"   Metrics: {', '.join(benchmark_detail.get('metrics', []))}")
    else:
        print("❌ Failed to get benchmark details")


🌐 External HTTP Request: GET http://localhost:8000/api/v1/benchmarks/hellaswag
📡 Response Status: 200
📋 JSON Response: {
  "benchmark_id": "hellaswag",
  "name": "HellaSwag",
  "description": "LightEval task: HellaSwag",
  "category": "Commonsense reasoning",
  "tags": [],
  "metrics": [
    "accuracy"
  ],
  "dataset_size": null,
  "supports_few_shot": true,
  "default_few_shot": null,
  "custom_config_schema": null
}
✅ Benchmark details for 'hellaswag':
   Name: HellaSwag
   Description: LightEval task: HellaSwag
   Category: Commonsense reasoning
   Metrics: accuracy


## Step 4: Submit Evaluation Jobs

Now we'll submit some evaluation jobs to test the adapter functionality.

In [13]:
# Submit an evaluation job
if isinstance(benchmarks_response, list) and benchmarks_response:
    test_benchmark_id = benchmarks_response[0]['benchmark_id']

    evaluation_request = {
        "benchmark_id": test_benchmark_id,
        "model": {
            "name": "gpt2",
            "provider": "huggingface",
            "parameters": {
                "temperature": 0.1,
                "max_tokens": 100
            }
        },
        "num_examples": 10,
        "experiment_name": "demo_evaluation"
    }

    print(f"Submitting evaluation for benchmark: {test_benchmark_id}")
    job_response = make_request('POST', '/evaluations', evaluation_request)

    if 'job_id' in job_response:
        job_id = job_response['job_id']
        print("✅ Job submitted successfully!")
        print(f"   Job ID: {job_id}")
        print(f"   Status: {job_response['status']}")
        print(f"   Submitted at: {job_response['submitted_at']}")
    else:
        print("❌ Failed to submit evaluation job")
        job_id = None
else:
    print("❌ No benchmarks available for testing")
    job_id = None

Submitting evaluation for benchmark: hellaswag

🌐 External HTTP Request: POST http://localhost:8000/api/v1/evaluations
📡 Response Status: 200
📋 JSON Response: {
  "job_id": "58047dd4-711c-4a95-9bec-e366de422fea",
  "status": "pending",
  "evaluation_status": null,
  "request": {
    "benchmark_id": "hellaswag",
    "model": {
      "name": "gpt2",
      "provider": "huggingface",
      "parameters": {
        "temperature": 0.1,
        "max_tokens": 100
      },
      "device": null,
      "batch_size": null
    },
    "num_examples": 10,
    "num_few_shot": null,
    "random_seed": 42,
    "benchmark_config": {},
    "experiment_name": "demo_evaluation",
    "tags": {},
    "priority": 0
  },
  "submitted_at": "2025-12-15T10:00:23.920291Z",
  "started_at": null,
  "completed_at": null,
  "progress": null,
  "current_step": null,
  "total_steps": null,
  "completed_steps": null,
  "error_message": null,
  "error_details": null,
  "estimated_duration": null,
  "actual_duration

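The request body above is a plain dictionary, so a small builder can keep repeated submissions consistent. The following is a minimal sketch, assuming a hypothetical `build_evaluation_request` helper that is not part of the EvalHub SDK. It uses only the field names that appear in the cell above and omits optional fields that are not supplied.

```python
from typing import Any, Optional


def build_evaluation_request(
    benchmark_id: str,
    model_name: str,
    provider: str = "huggingface",
    num_examples: int = 10,
    experiment_name: Optional[str] = None,
    **model_parameters: Any,
) -> dict[str, Any]:
    """Assemble a request body with the same fields as the cell above."""
    model: dict[str, Any] = {"name": model_name, "provider": provider}
    if model_parameters:
        # Extra keyword arguments become the model's "parameters" object.
        model["parameters"] = dict(model_parameters)
    payload: dict[str, Any] = {
        "benchmark_id": benchmark_id,
        "model": model,
        "num_examples": num_examples,
    }
    if experiment_name is not None:
        payload["experiment_name"] = experiment_name
    return payload


payload = build_evaluation_request(
    "hellaswag",
    "gpt2",
    experiment_name="demo_evaluation",
    temperature=0.1,
    max_tokens=100,
)
print(payload)
```

The resulting dictionary can be passed as the `data` argument of `make_request('POST', '/evaluations', payload)`.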
### Monitor Job Progress

In [14]:
# Monitor job progress
if job_id:
    print(f"Monitoring job {job_id}...")

    max_wait_time = 300  # 5 minutes
    start_time = time.time()

    while time.time() - start_time < max_wait_time:
        status_response = make_request('GET', f'/evaluations/{job_id}')

        if 'status' in status_response:
            status = status_response['status']
            progress = status_response.get('progress', 0)

            print(f"Status: {status}, Progress: {progress*100 if progress else 0:.1f}%")

            if status in ['completed', 'failed', 'cancelled']:
                if status == 'completed':
                    print("✅ Job completed successfully!")
                    print(f"   Completed at: {status_response.get('completed_at')}")
                elif status == 'failed':
                    print("❌ Job failed!")
                    print(f"   Error: {status_response.get('error_message', 'Unknown error')}")
                else:
                    print("⏸️ Job was cancelled")
                break
        else:
            print("❌ Failed to get job status")
            break

        time.sleep(5)  # Wait 5 seconds before checking again

    else:
        print("⏰ Job monitoring timed out")
else:
    print("⚠️ No job to monitor")

Monitoring job 58047dd4-711c-4a95-9bec-e366de422fea...

🌐 External HTTP Request: GET http://localhost:8000/api/v1/evaluations/58047dd4-711c-4a95-9bec-e366de422fea
📡 Response Status: 200
📋 JSON Response: {
  "job_id": "58047dd4-711c-4a95-9bec-e366de422fea",
  "status": "running",
  "evaluation_status": null,
  "request": {
    "benchmark_id": "hellaswag",
    "model": {
      "name": "gpt2",
      "provider": "huggingface",
      "parameters": {
        "temperature": 0.1,
        "max_tokens": 100
      },
      "device": null,
      "batch_size": null
    },
    "num_examples": 10,
    "num_few_shot": null,
    "random_seed": 42,
    "benchmark_config": {},
    "experiment_name": "demo_evaluation",
    "tags": {},
    "priority": 0
  },
  "submitted_at": "2025-12-15T10:00:23.920291Z",
  "started_at": "2025-12-15T10:00:23.920557Z",
  "completed_at": null,
  "progress": 0.3,
  "current_step": null,
  "total_steps": null,
  "completed_steps": null,
  "error_message": null,
  "er

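The status payload carries ISO 8601 timestamps (`submitted_at`, `started_at`, `completed_at`), which can be turned into durations on the client side. The following is a minimal sketch, assuming the timestamps always use the `Z` suffix shown above. `parse_timestamp` and `queue_delay` are hypothetical helpers, not part of the notebook.

```python
from datetime import datetime, timedelta
from typing import Any, Optional


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing 'Z' into an aware datetime."""
    # datetime.fromisoformat() accepts 'Z' only from Python 3.11 onward,
    # so swap it for an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def queue_delay(status: dict[str, Any]) -> Optional[timedelta]:
    """Time between submission and start, or None if the job has not started."""
    if not status.get("started_at"):
        return None
    return parse_timestamp(status["started_at"]) - parse_timestamp(status["submitted_at"])


# Timestamps copied from the status response above.
status = {
    "submitted_at": "2025-12-15T10:00:23.920291Z",
    "started_at": "2025-12-15T10:00:23.920557Z",
}
print(queue_delay(status))
```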
### Retrieve Results

In [15]:
# Get evaluation results
if job_id:
    results_response = make_request('GET', f'/evaluations/{job_id}/results')

    if 'results' in results_response:
        print(f"✅ Evaluation results for job {job_id}:")
        print(f"   Benchmark: {results_response['benchmark_id']}")
        print(f"   Model: {results_response['model_name']}")
        print(f"   Overall Score: {results_response.get('overall_score', 'N/A')}")
        print(f"   Examples Evaluated: {results_response.get('num_examples_evaluated', 'N/A')}")
        print(f"   Duration: {results_response.get('duration_seconds', 'N/A')} seconds")

        print("\n   Detailed Results:")
        for result in results_response['results']:
            metric_name = result['metric_name']
            metric_value = result['metric_value']
            metric_type = result.get('metric_type', 'unknown')
            num_samples = result.get('num_samples', 'N/A')

            print(f"     {metric_name}: {metric_value} ({metric_type})")
            if num_samples != 'N/A':
                print(f"       Samples: {num_samples}")
    else:
        print("❌ No results available yet or job not completed")
else:
    print("⚠️ No job ID to get results for")


🌐 External HTTP Request: GET http://localhost:8000/api/v1/evaluations/58047dd4-711c-4a95-9bec-e366de422fea/results
📡 Response Status: 200
📋 JSON Response: {
  "job_id": "58047dd4-711c-4a95-9bec-e366de422fea",
  "benchmark_id": "hellaswag",
  "model_name": "gpt2",
  "results": [
    {
      "metric_name": "accuracy",
      "metric_value": 0.81,
      "metric_type": "float",
      "confidence_interval": null,
      "num_samples": 10,
      "metadata": {}
    }
  ],
  "overall_score": 0.81,
  "num_examples_evaluated": 10,
  "evaluation_metadata": {},
  "completed_at": "2025-12-15T10:00:25.949669Z",
  "duration_seconds": 120.0
}
✅ Evaluation results for job 58047dd4-711c-4a95-9bec-e366de422fea:
   Benchmark: hellaswag
   Model: gpt2
   Overall Score: 0.81
   Examples Evaluated: 10
   Duration: 120.0 seconds

   Detailed Results:
     accuracy: 0.81 (float)
       Samples: 10
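
For downstream comparison it can help to flatten the `results` list into a simple name-to-value mapping. The following is a minimal sketch, assuming a hypothetical `metrics_by_name` helper that is not part of the notebook. The sample is a trimmed copy of the payload shown above.

```python
from typing import Any


def metrics_by_name(results_response: dict[str, Any]) -> dict[str, float]:
    """Flatten the 'results' list into a {metric_name: metric_value} mapping."""
    return {
        result["metric_name"]: result["metric_value"]
        for result in results_response.get("results", [])
    }


# Trimmed copy of the results payload shown above.
sample = {
    "benchmark_id": "hellaswag",
    "model_name": "gpt2",
    "results": [
        {"metric_name": "accuracy", "metric_value": 0.81, "num_samples": 10},
    ],
}
print(metrics_by_name(sample))
```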


## Step 5: Test Multiple Evaluations

Let's submit multiple evaluation jobs to test concurrent processing.

In [16]:
# Submit multiple evaluation jobs
if isinstance(benchmarks_response, list) and len(benchmarks_response) >= 2:
    job_ids = []

    for i, benchmark in enumerate(benchmarks_response[:3]):  # Test with up to 3 benchmarks
        evaluation_request = {
            "benchmark_id": benchmark['benchmark_id'],
            "model": {
                "name": "gpt2",
                "provider": "huggingface"
            },
            "num_examples": 5,
            "experiment_name": f"batch_evaluation_{i+1}"
        }

        job_response = make_request('POST', '/evaluations', evaluation_request)

        if 'job_id' in job_response:
            job_ids.append(job_response['job_id'])
            print(f"✅ Submitted job {i+1}: {job_response['job_id']}")

    print(f"\nSubmitted {len(job_ids)} jobs for concurrent processing")
else:
    print("⚠️ Not enough benchmarks available for batch testing")
    job_ids = []


🌐 External HTTP Request: POST http://localhost:8000/api/v1/evaluations
📡 Response Status: 200
📋 JSON Response: {
  "job_id": "dc5f5549-c75d-4865-acb3-7ceda6bba4b4",
  "status": "pending",
  "evaluation_status": null,
  "request": {
    "benchmark_id": "hellaswag",
    "model": {
      "name": "gpt2",
      "provider": "huggingface",
      "parameters": {},
      "device": null,
      "batch_size": null
    },
    "num_examples": 5,
    "num_few_shot": null,
    "random_seed": 42,
    "benchmark_config": {},
    "experiment_name": "batch_evaluation_1",
    "tags": {},
    "priority": 0
  },
  "submitted_at": "2025-12-15T10:00:32.327101Z",
  "started_at": null,
  "completed_at": null,
  "progress": null,
  "current_step": null,
  "total_steps": null,
  "completed_steps": null,
  "error_message": null,
  "error_details": null,
  "estimated_duration": null,
  "actual_duration": null
}
✅ Submitted job 1: dc5f5549-c75d-4865-acb3-7ceda6bba4b4

🌐 External HTTP Request: POST http

In [17]:
# Check status of all jobs
if job_ids:
    print("\nChecking status of all submitted jobs:")

    for job_id in job_ids:
        status_response = make_request('GET', f'/evaluations/{job_id}')
        if 'status' in status_response:
            status = status_response['status']
            benchmark_id = status_response['request']['benchmark_id']
            progress = status_response.get('progress', 0)
            print(f"   Job {job_id}: {benchmark_id} - {status} ({progress*100 if progress else 0:.1f}%)")
        else:
            print(f"   Job {job_id}: Failed to get status")
else:
    print("⚠️ No batch jobs to check")


Checking status of all submitted jobs:

🌐 External HTTP Request: GET http://localhost:8000/api/v1/evaluations/dc5f5549-c75d-4865-acb3-7ceda6bba4b4
📡 Response Status: 200
📋 JSON Response: {
  "job_id": "dc5f5549-c75d-4865-acb3-7ceda6bba4b4",
  "status": "running",
  "evaluation_status": null,
  "request": {
    "benchmark_id": "hellaswag",
    "model": {
      "name": "gpt2",
      "provider": "huggingface",
      "parameters": {},
      "device": null,
      "batch_size": null
    },
    "num_examples": 5,
    "num_few_shot": null,
    "random_seed": 42,
    "benchmark_config": {},
    "experiment_name": "batch_evaluation_1",
    "tags": {},
    "priority": 0
  },
  "submitted_at": "2025-12-15T10:00:32.327101Z",
  "started_at": "2025-12-15T10:00:32.327318Z",
  "completed_at": null,
  "progress": 0.3,
  "current_step": null,
  "total_steps": null,
  "completed_steps": null,
  "error_message": null,
  "error_details": null,
  "estimated_duration": null,
  "actual_duration": nul

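When several jobs run at once, a per-status tally is easier to read than the full status payloads. The following is a minimal sketch, assuming hypothetical `summarize_statuses` and `all_finished` helpers that are not part of the notebook. The terminal statuses are the ones checked in the monitoring cell (`completed`, `failed`, `cancelled`), and the snapshot data is made up for illustration.

```python
from collections import Counter
from typing import Any


def summarize_statuses(status_responses: list[dict[str, Any]]) -> Counter:
    """Count jobs per status; responses without a 'status' key count as 'unknown'."""
    return Counter(response.get("status", "unknown") for response in status_responses)


def all_finished(status_responses: list[dict[str, Any]]) -> bool:
    """True once every job reports one of the terminal statuses."""
    terminal = {"completed", "failed", "cancelled"}
    return all(response.get("status") in terminal for response in status_responses)


# Hypothetical snapshot of three batch jobs, for illustration only.
snapshot = [{"status": "running"}, {"status": "running"}, {"status": "completed"}]
print(summarize_statuses(snapshot))
print(all_finished(snapshot))
```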
## Step 6: Test Job Cancellation

## Step 7: Test Error Handling

In [20]:
# Test with invalid benchmark ID
print("Testing error handling with invalid benchmark ID:")
invalid_request = {
    "benchmark_id": "non_existent_benchmark",
    "model": {
        "name": "gpt2"
    },
    "num_examples": 10
}

error_response = make_request('POST', '/evaluations', invalid_request)
# The adapter reports errors under 'error_type'/'error_message', so check for that key as well
got_error = any(key in error_response for key in ('error', 'detail', 'error_type'))
print("Expected error response received ✅" if got_error else "Unexpected response ❌")

Testing error handling with invalid benchmark ID:

🌐 External HTTP Request: POST http://localhost:8000/api/v1/evaluations
📡 Response Status: 404
📋 JSON Response: {
  "error_type": "NotFound",
  "error_message": "Resource not found",
  "path": "/api/v1/evaluations"
}
Expected error response received ✅
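
The 404 payload above carries `error_type` and `error_message` rather than `error` or `detail`, so an error check needs to look for those keys too. Successful job payloads also include an `error_message` key set to `null`, which means key presence alone is not a reliable signal for that field. The following is a minimal sketch, assuming a hypothetical `is_error_response` helper that is not part of the notebook.

```python
from typing import Any

# 'error' is set by make_request when the request itself fails, 'detail' is
# FastAPI's default error key, and 'error_type' appears in the adapter's 404
# payload above. 'error_message' is left out on purpose because job status
# payloads include it with a null value.
ERROR_KEYS = ("error", "detail", "error_type")


def is_error_response(payload: Any) -> bool:
    """Return True if payload is a dict with a non-null value under an error key."""
    if not isinstance(payload, dict):
        return False
    return any(payload.get(key) is not None for key in ERROR_KEYS)


not_found = {
    "error_type": "NotFound",
    "error_message": "Resource not found",
    "path": "/api/v1/evaluations",
}
pending_job = {"job_id": "abc", "status": "pending", "error_message": None}

print(is_error_response(not_found))
print(is_error_response(pending_job))
```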


## Step 9: Cleanup

Stop and remove the container when we're done.

In [22]:
# Stop and remove the container
print("Cleaning up...")
print("To stop and remove the container, run these commands in your terminal:")
print(f"   podman stop {CONTAINER_NAME}")
print(f"   podman rm {CONTAINER_NAME}")
print(f"   podman rmi {FULL_IMAGE_NAME}  # Optional: remove image")

Cleaning up...
To stop and remove the container, run these commands in your terminal:
   podman stop lighteval-adapter
   podman rm lighteval-adapter
   podman rmi evalhub/lighteval-adapter:latest  # Optional: remove image
