# OpenAI-Compatible Batch Inference Demo

This notebook demonstrates how to use the batch inference API with OpenAI-compatible endpoints.

In [None]:
!pip install requests

## 1. Import Libraries and Setup

In [None]:
import requests
import json
import time
import logging
from IPython.display import JSON, display

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Configuration
BASE_URL = "http://localhost:8000"  # Change if your server runs elsewhere

## 2. Load Sample Data
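
The loading cell in this section expects `sample_batch.jsonl` to hold one JSON object per line, each with a `prompt` key. If you do not have such a file yet, the sketch below shows one way to generate it. The helper name `write_sample_batch` and the example prompts are hypothetical and only illustrate the expected format.

```python
import json
from pathlib import Path

# Hypothetical prompts, for illustration only.
EXAMPLE_PROMPTS = [
    "What is the capital of France?",
    "Explain batch inference in one sentence.",
    "Write a haiku about queues.",
]


def write_sample_batch(path, prompts):
    """Write one JSON object per line, each with a single "prompt" key."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for prompt in prompts:
            f.write(json.dumps({"prompt": prompt}) + "\n")
    return path
```

Calling `write_sample_batch("sample_batch.jsonl", EXAMPLE_PROMPTS)` once before the loading cell produces a file that cell can read.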

In [None]:
# Load sample prompts from JSONL file
with open('sample_batch.jsonl', 'r', encoding='utf-8') as f:
    # Skip blank lines; each remaining line is a JSON object with a "prompt" key
    prompts = [json.loads(line)['prompt'] for line in f if line.strip()]

logger.info(f"Loaded {len(prompts)} sample prompts:")
for i, prompt in enumerate(prompts, 1):
    logger.info(f"{i}. {prompt}")

## 3. Submit Batch Job

In [None]:
# Submit batch job
batch_request = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "input": [{"prompt": m} for m in prompts],
    "max_tokens": 100,
    "temperature": 0.7
}

logger.info("Submitting batch job...")
response = requests.post(
    f"{BASE_URL}/v1/batches",
    json=batch_request,
    timeout=30,  # Avoid hanging indefinitely if the server is unreachable
)

if response.ok:  # Accept any successful status, not only 200
    data = response.json()
    batch_id = data["id"]
    logger.info(f"Batch created with ID: {batch_id}")
    logger.info(f"Status: {data['status']}")
    logger.info(f"Created at: {data['created_at']}")
else:
    logger.error(f"Failed to create batch: {response.status_code}")
    logger.error(response.text)
    # Stop here: the cells below need batch_id, which was never set
    response.raise_for_status()

## 4. Monitor Job Progress

In [None]:
# Function to check job status
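def check_job_status(batch_id):
    """Return the batch status payload as a dict, or None on failure."""
    try:
        response = requests.get(f"{BASE_URL}/v1/batches/{batch_id}", timeout=10)
    except requests.RequestException as exc:
        logger.error(f"Request failed: {exc}")
        return None
    if response.status_code == 200:
        return response.json()
    logger.error(f"Failed to get status: {response.status_code}")
    return None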

# Monitor job progress
logger.info("Monitoring job progress...")
for i in range(10):  # Check up to 10 times, 2 seconds apart
    status_data = check_job_status(batch_id)
    if status_data:
        status = status_data["status"]
        logger.info(f"Check {i+1}: {status}")
        if status in ("completed", "failed"):
            break
    time.sleep(2.0)
else:
    # for/else: this branch runs only if the loop finished without a break
    logger.warning("Job did not reach a terminal status after 10 checks")
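
The fixed-count loop above can also be written as a reusable helper with an explicit deadline. The following is a minimal sketch, not part of the server or its client: the name `wait_for_batch`, its parameters, and the assumption that `completed` and `failed` are the only terminal statuses are all taken from this notebook or made up for illustration. The status fetcher is passed in as a function so the helper can be tried without a running server.

```python
import time

# Assumed terminal statuses, matching the loop in this notebook.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_batch(fetch_status, batch_id, timeout=60.0, interval=2.0,
                   sleep=time.sleep):
    """Poll fetch_status(batch_id) until a terminal status or the deadline.

    fetch_status must return a dict with a "status" key, or None on error.
    Returns the terminal status dict, or None if the deadline would pass
    before the next check.
    """
    deadline = time.monotonic() + timeout
    while True:
        data = fetch_status(batch_id)
        if data and data.get("status") in TERMINAL_STATUSES:
            return data
        # Give up rather than sleep past the deadline.
        if time.monotonic() + interval > deadline:
            return None
        sleep(interval)
```

With the function defined earlier in this section, a call would look like `final = wait_for_batch(check_job_status, batch_id)`; a `None` result means the job did not finish in time.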

## 5. Retrieve Results

In [None]:
# Get final results
logger.info("Retrieving final results...")
response = requests.get(
    f"{BASE_URL}/v1/batches/{batch_id}/results",
    timeout=30,  # Avoid hanging indefinitely if the server is unreachable
)

if response.status_code == 200:
    data = response.json()
    results = data.get("data", [])
    logger.info(f"Retrieved {len(results)} results:")
    display(JSON(results))

    # Show the first three results
    for i, result in enumerate(results[:3], 1):
        prompt = result.get("prompt", "")
        response_text = result.get("response", "")
        tokens = result.get("tokens", 0)
        logger.info(f"--- Result {i} ---")
        logger.info(f"Prompt: {prompt}")
        logger.info(f"Response: {response_text}")
        logger.info(f"Tokens: {tokens}")
else:
    logger.error(f"Failed to get results: {response.status_code}")
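
Once the results are loaded, a quick aggregate can help sanity-check the batch. The sketch below assumes each result is a dict with an optional integer `tokens` field, as read by the cell above; the helper name `summarize_results` is hypothetical.

```python
def summarize_results(results):
    """Return the result count, total tokens, and mean tokens per result."""
    # Missing "tokens" fields count as 0, mirroring result.get("tokens", 0).
    token_counts = [r.get("tokens", 0) for r in results]
    total = sum(token_counts)
    return {
        "count": len(results),
        "total_tokens": total,
        "mean_tokens": total / len(results) if results else 0.0,
    }
```

In the notebook this could be logged with `logger.info(summarize_results(results))` after the results cell has run.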

## 6. Summary

This notebook demonstrated the complete OpenAI-compatible batch inference workflow:

1. **Batch Creation** - Submit prompts via POST `/v1/batches`
2. **Job Processing** - Background worker processes jobs asynchronously
3. **Status Monitoring** - Poll GET `/v1/batches/{id}` until the job completes or fails
4. **Result Retrieval** - Get processed results via GET `/v1/batches/{id}/results`

The system uses:
- **File-based job storage** (JSON + JSONL format)
- **In-memory queue** for job scheduling
- **Background worker** for asynchronous processing
- **Mock inference engine** for demonstration purposes
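
To make the queue-and-worker pattern listed above concrete, here is a toy sketch using only the Python standard library. It is not the server's actual implementation: the function names, the job tuple layout, and the echo-style mock are assumptions chosen for illustration, and file-based storage is replaced by a plain dict.

```python
import queue
import threading


def mock_inference(prompt):
    """Stand-in for a real model call; it only echoes the prompt."""
    return f"[mock] {prompt}"


def worker(job_queue, results):
    """Process (batch_id, prompts) jobs until a None sentinel arrives."""
    while True:
        job = job_queue.get()
        if job is None:
            job_queue.task_done()
            break
        batch_id, prompts = job
        results[batch_id] = [
            {"prompt": p, "response": mock_inference(p)} for p in prompts
        ]
        job_queue.task_done()


def run_demo():
    """Start one background worker, submit one job, and wait for it."""
    job_queue = queue.Queue()  # In-memory queue for job scheduling
    results = {}               # Stands in for file-based job storage
    thread = threading.Thread(
        target=worker, args=(job_queue, results), daemon=True
    )
    thread.start()
    job_queue.put(("batch-1", ["hello", "world"]))
    job_queue.put(None)  # Sentinel: tell the worker to exit
    job_queue.join()     # Block until every queued item is processed
    thread.join()
    return results
```

Calling `run_demo()` returns a dict keyed by batch ID, which is roughly the shape a status or results endpoint would read from.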

**Production upgrades** would include:
- Real Ray Data + vLLM integration
- Redis/RabbitMQ for job queuing
- Database for job metadata
- SLA-aware scheduling
- Docker/Kubernetes deployment