# Synth GEPA Demo - Banking77 (Production)

Self-contained notebook for prompt optimization using GEPA.

Banking77 is an intent-classification task: given a customer service request, the model must assign exactly one of 77 fixed intent labels.

**Run in Google Colab:** [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/synth-laboratories/synth-ai/blob/main/demos/gepa_banking77/demo_prod.ipynb)

**What this demo does:**
1. Spins up a local API that runs the Banking77 classification pipeline
2. Creates a Cloudflare tunnel to expose it to the internet
3. Runs GEPA prompt optimization via Synth
4. Compares baseline vs optimized prompts on held-out data

In [None]:
# Step 0: Install dependencies (run this first on Colab)
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print("Running in Google Colab - installing dependencies...")
    !pip install -q synth-ai httpx fastapi uvicorn datasets nest_asyncio
    
    # Install cloudflared
    !wget -q https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64 -O /usr/local/bin/cloudflared
    !chmod +x /usr/local/bin/cloudflared
    !cloudflared --version
    
    print("Dependencies installed!")
else:
    print("Not in Colab - assuming dependencies are already installed")
    print("Required: pip install synth-ai httpx fastapi uvicorn datasets nest_asyncio")
    print("Required: brew install cloudflare/cloudflare/cloudflared (macOS)")

## Step 1: Setup

To run prompt optimization, we need to spin up services that expose our business logic to the optimizer over HTTP. Because the baseline and optimized prompts are each served by their own local API during the final evaluation, we reserve two distinct ports.
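
Before starting the servers, it can help to confirm that both ports are actually free. The helper below is a minimal sketch using only the standard library; `is_port_free` is a hypothetical name introduced here and is not part of the Synth SDK (the notebook itself relies on the SDK's `kill_port` later on).

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 only when a listener accepted the connection.
        return sock.connect_ex((host, port)) != 0


# The two ports used by this notebook.
for candidate in (8001, 8002):
    print(candidate, "free" if is_port_free(candidate) else "in use")
```

If a port is reported as in use, either stop the process that holds it or pick a different port number in the configuration cell.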

In [None]:
# Step 1: Imports, Config, and Backend Health Check
import os, sys, time, asyncio, json, threading
from typing import Any, Optional

import httpx
import uvicorn
from fastapi import FastAPI, Request, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from datasets import load_dataset

# Production backend
SYNTH_API_BASE = 'https://api.usesynth.ai'
LOCAL_API_PORT = 8001
OPTIMIZED_LOCAL_API_PORT = 8002

print(f'Backend: {SYNTH_API_BASE}')
print(f'Local API Ports: {LOCAL_API_PORT}, {OPTIMIZED_LOCAL_API_PORT}')

# Check backend health
r = httpx.get(f'{SYNTH_API_BASE}/health', timeout=30)
if r.status_code == 200:
    print(f'Backend health: {r.json()}')
else:
    print(f'WARNING: Backend returned status {r.status_code}')
    print(f'Response: {r.text[:200]}...' if len(r.text) > 200 else f'Response: {r.text}')
    raise RuntimeError(f'Backend not healthy: status {r.status_code}')

## Step 2: Authentication

To start any job on Synth, we need an API key. Normally you would use your own. For simplicity, this notebook mints a demo API key whenever `SYNTH_API_KEY` is not set; the demo key is scoped to these demos and should not be used anywhere else.
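
The "environment variable first, demo key as fallback" logic can be isolated into a small function, which makes it easy to test without touching the network. This is a sketch under assumptions: `resolve_api_key` and its `mint_demo_key` callback are hypothetical names for illustration, not SDK functions.

```python
import os
from typing import Callable, Mapping


def resolve_api_key(
    mint_demo_key: Callable[[], str],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Prefer SYNTH_API_KEY from the environment; otherwise mint a demo key."""
    key = env.get("SYNTH_API_KEY", "").strip()
    # An empty or whitespace-only value is treated as "not set".
    return key or mint_demo_key()
```

To use your own key, set `SYNTH_API_KEY` in the environment before running the cell below; the fallback is then never invoked.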

In [None]:
# Step 2: Get API Key (use env var or mint demo key)
API_KEY = os.environ.get('SYNTH_API_KEY', '')

if not API_KEY:
    print('No SYNTH_API_KEY found, minting demo key...')
    resp = httpx.post(f'{SYNTH_API_BASE}/api/demo/keys', json={'ttl_hours': 4}, timeout=30)
    resp.raise_for_status()
    API_KEY = resp.json()['api_key']
    print(f'Demo API Key: {API_KEY[:25]}...')
else:
    print(f'Using SYNTH_API_KEY: {API_KEY[:20]}...')

## Step 3: Environment Key

Because the integrity of your business logic matters, the local API you serve during optimization should require authentication. The SDK has functions to help you mint and register an environment key, which we then use to authenticate requests from the Synth optimization job to our local API.
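
For intuition, checking a presented key against the expected one usually comes down to a constant-time string comparison. The sketch below is illustrative only and is not the SDK's middleware; `is_authorized` is a hypothetical helper, and how the key is transported (for example, which header carries it) is left out because the source does not specify it.

```python
import hmac
from typing import Optional


def is_authorized(presented_key: Optional[str], expected_key: str) -> bool:
    """Compare keys in constant time; a missing key is always rejected."""
    if not presented_key or not expected_key:
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(presented_key.encode(), expected_key.encode())
```

In this notebook you do not need to write this yourself: the local API created in Step 4 reads `ENVIRONMENT_API_KEY` from the environment so the SDK can validate incoming requests.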

In [None]:
# Step 3: Mint and Upload Environment Key (using SDK helpers)
from synth_ai.sdk.learning.rl import mint_environment_api_key, setup_environment_api_key

ENVIRONMENT_API_KEY = mint_environment_api_key()
print(f'Minted env key: {ENVIRONMENT_API_KEY[:12]}...{ENVIRONMENT_API_KEY[-4:]}')

result = setup_environment_api_key(SYNTH_API_BASE, API_KEY, token=ENVIRONMENT_API_KEY)
print(f'Uploaded env key: {result}')

## Step 4: Local API Definition

Here we define a FastAPI service that serves our Banking77 classification pipeline.

**Key Design: Clean Separation**

The code is split into two parts:

1. **`classify_banking77_query()`** - A clean classification pipeline using the standard OpenAI Python client. This is normal business logic with NO Synth-specific code. It could be used anywhere.

2. **`create_banking77_local_api()`** - A thin wrapper that:
   - Sets `OPENAI_BASE_URL` to route calls through Synth's inference proxy
   - Calls the clean classification pipeline
   - Computes rewards
   - Returns `RolloutResponse` with just metrics (no trace)

**Why no trace?** Synth's inference proxy intercepts all OpenAI calls and reconstructs traces server-side. Your code doesn't need to manually build trace objects.
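
The comments in the rollout code below describe the inference URL as a base URL of the form `{interceptor_base}/v1/{trial_id}/{correlation_id}`, to which the OpenAI client appends `/chat/completions`. The following sketch only illustrates that string composition; `chat_completions_url` is a hypothetical helper and the example host is made up.

```python
def chat_completions_url(interceptor_base: str, trial_id: str, correlation_id: str) -> str:
    """Show the final URL a chat completion request would be sent to."""
    # Base URL handed to the rollout via the policy config.
    base_url = f"{interceptor_base.rstrip('/')}/v1/{trial_id}/{correlation_id}"
    # The client adds the endpoint path on top of the base URL.
    return f"{base_url}/chat/completions"


print(chat_completions_url("https://proxy.example", "trial-1", "corr-9"))
```

Because the trial and correlation identifiers are embedded in the path, the proxy can attribute each intercepted call to the right rollout without any help from your pipeline code.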

**Required Routes:**

| Route | Method | Description |
|-------|--------|-------------|
| `/health` | GET | Health check |
| `/info` | GET | Static metadata |
| `/task_info` | GET | Task instances for seeds |
| `/rollout` | POST | Execute classification and return reward |

In [None]:
# Step 4: Define Banking77 Local API
import json
import os

from datasets import load_dataset
from openai import AsyncOpenAI

# Synth-AI SDK LocalAPI imports
from synth_ai.sdk.localapi import LocalAPIConfig, create_local_api
from synth_ai.sdk.task.contracts import RolloutMetrics, RolloutRequest, RolloutResponse, TaskInfo

# ============================================================================
# PART 1: Core Classification Pipeline (NO Synth logic)
# ============================================================================
# This is your business logic - a normal OpenAI-based classification pipeline.
# It uses the standard openai library and has no knowledge of Synth.
# The only "magic" is that OPENAI_BASE_URL can be set externally to route
# calls through Synth's inference proxy for trace reconstruction.

APP_ID = "banking77"
APP_NAME = "Banking77 Intent Classification"

BANKING77_LABELS = [
    "activate_my_card", "age_limit", "apple_pay_or_google_pay", "atm_support", "automatic_top_up",
    "balance_not_updated_after_bank_transfer", "balance_not_updated_after_cheque_or_cash_deposit",
    "beneficiary_not_allowed", "cancel_transfer", "card_about_to_expire", "card_acceptance",
    "card_arrival", "card_delivery_estimate", "card_linking", "card_not_working",
    "card_payment_fee_charged", "card_payment_not_recognised", "card_payment_wrong_exchange_rate",
    "card_swallowed", "cash_withdrawal_charge", "cash_withdrawal_not_recognised", "change_pin",
    "compromised_card", "contactless_not_working", "country_support", "declined_card_payment",
    "declined_cash_withdrawal", "declined_transfer", "direct_debit_payment_not_recognised",
    "disposable_card_limits", "edit_personal_details", "exchange_charge", "exchange_rate",
    "exchange_via_app", "extra_charge_on_statement", "failed_transfer", "fiat_currency_support",
    "get_disposable_virtual_card", "get_physical_card", "getting_spare_card", "getting_virtual_card",
    "lost_or_stolen_card", "lost_or_stolen_phone", "order_physical_card", "passcode_forgotten",
    "pending_card_payment", "pending_cash_withdrawal", "pending_top_up", "pending_transfer",
    "pin_blocked", "receiving_money", "Refund_not_showing_up", "request_refund",
    "reverted_card_payment?", "supported_cards_and_currencies", "terminate_account",
    "top_up_by_bank_transfer_charge", "top_up_by_card_charge", "top_up_by_cash_or_cheque",
    "top_up_failed", "top_up_limits", "top_up_reverted", "topping_up_by_card",
    "transaction_charged_twice", "transfer_fee_charged", "transfer_into_account",
    "transfer_not_received_by_recipient", "transfer_timing", "unable_to_verify_identity",
    "verify_my_identity", "verify_source_of_funds", "verify_top_up", "virtual_card_not_working",
    "visa_or_mastercard", "why_verify_identity", "wrong_amount_of_cash_received",
    "wrong_exchange_rate_for_cash_withdrawal",
]

TOOL_NAME = "banking77_classify"
TOOL_SCHEMA = {
    "type": "function",
    "function": {
        "name": TOOL_NAME,
        "description": "Return the predicted banking77 intent label.",
        "parameters": {
            "type": "object",
            "properties": {"intent": {"type": "string"}},
            "required": ["intent"],
        },
    },
}


def format_available_intents(label_names: list) -> str:
    """Format the list of available intents for the prompt."""
    return "\n".join(f"{i+1}. {l}" for i, l in enumerate(label_names))


# =============================================================================
# CORE PIPELINE: This is the main classification logic.
# =============================================================================
# This function is completely independent of Synth. It's just a normal
# async function that calls OpenAI and returns a prediction. You could
# use this exact code in any application.

async def classify_banking77_query(
    query: str,
    system_prompt: str,
    available_intents: str,
    model: str = "gpt-4o-mini",
    api_key: str | None = None,
) -> str:
    """Classify a banking query into an intent using OpenAI.
    
    This is the CORE PIPELINE - clean async code with NO Synth-specific logic.
    It uses whatever OPENAI_BASE_URL is set in the environment, allowing
    calls to be routed through Synth's inference proxy for trace reconstruction.
    
    Args:
        query: The customer query to classify
        system_prompt: System prompt for the model
        available_intents: Formatted list of available intents
        model: Model to use (e.g., "gpt-4o-mini")
        api_key: Optional API key (uses OPENAI_API_KEY env var if not provided)
    
    Returns:
        The predicted intent label
    """
    # Standard async OpenAI client - uses OPENAI_BASE_URL from environment
    client = AsyncOpenAI(api_key=api_key) if api_key else AsyncOpenAI()
    
    user_msg = (
        f"Customer Query: {query}\n\n"
        f"Available Intents:\n{available_intents}\n\n"
        f"Classify this query into one of the above banking intents using the tool call."
    )
    
    response = await client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_msg},
        ],
        tools=[TOOL_SCHEMA],
        tool_choice={"type": "function", "function": {"name": TOOL_NAME}},
    )
    
    # Extract intent from tool call
    tool_call = response.choices[0].message.tool_calls[0]
    args = json.loads(tool_call.function.arguments)
    return args["intent"]


# ============================================================================
# PART 2: Dataset Loader
# ============================================================================

class Banking77Dataset:
    """Lazy Hugging Face dataset loader for Banking77."""
    def __init__(self):
        self._cache = {}
        self._label_names = None

    def _load_split(self, split: str):
        if split not in self._cache:
            ds = load_dataset("banking77", split=split, trust_remote_code=False)
            self._cache[split] = ds
            if self._label_names is None and hasattr(ds.features.get("label"), "names"):
                self._label_names = ds.features["label"].names
        return self._cache[split]

    def ensure_ready(self, splits):
        for split in splits:
            self._load_split(split)

    def size(self, split: str) -> int:
        return len(self._load_split(split))

    def sample(self, *, split: str, index: int) -> dict:
        ds = self._load_split(split)
        idx = index % len(ds)
        row = ds[idx]
        label_idx = int(row.get("label", 0))
        label_text = self._label_names[label_idx] if self._label_names and label_idx < len(self._label_names) else f"label_{label_idx}"
        return {"index": idx, "split": split, "text": str(row.get("text", "")), "label": label_text}

    @property
    def label_names(self) -> list:
        if self._label_names is None:
            self._load_split("train")
        return self._label_names or []


# ============================================================================
# PART 3: Local API (thin Synth integration layer)
# ============================================================================
# This is the wrapper that:
# 1. Sets OPENAI_BASE_URL to route calls through Synth's inference proxy
# 2. Calls the clean classification pipeline
# 3. Computes rewards
# 4. Returns RolloutResponse (NO trace - Synth reconstructs it server-side)

def create_banking77_local_api(system_prompt: str, env_api_key: str):
    """Create a Banking77 local API.
    
    The classification pipeline is completely clean - no Synth logic.
    Synth integration happens via OPENAI_BASE_URL routing.
    """
    from fastapi import Request
    
    # Set ENVIRONMENT_API_KEY so SDK auth middleware can validate it
    os.environ["ENVIRONMENT_API_KEY"] = env_api_key
    
    # Initialize dataset
    dataset = Banking77Dataset()
    dataset.ensure_ready(["train", "test"])
    
    async def run_rollout(request: RolloutRequest, fastapi_request: Request) -> RolloutResponse:
        """Execute a rollout for the banking77 classification task."""
        split = request.env.config.get("split", "train")
        seed = request.env.seed
        
        # Get sample from dataset
        sample = dataset.sample(split=split, index=seed)
        
        # Route OpenAI calls through Synth's inference proxy for trace reconstruction.
        # The inference_url is a base URL in the format:
        #   {interceptor_base}/v1/{trial_id}/{correlation_id}
        # The OpenAI SDK automatically appends /chat/completions to this base URL.
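        inference_url = request.policy.config.get("inference_url")
        if not inference_url:
            # os.environ only accepts strings; fail with a clear message instead of a TypeError
            raise ValueError("request.policy.config is missing 'inference_url'")
        os.environ["OPENAI_BASE_URL"] = inference_url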
        
        # Extract API key from request (Synth provides this)
        api_key = request.policy.config.get("api_key")
        
        # Call the CORE PIPELINE (no Synth logic inside)
        predicted_intent = await classify_banking77_query(
            query=sample["text"],
            system_prompt=system_prompt,
            available_intents=format_available_intents(dataset.label_names),
            model=request.policy.config.get("model", "gpt-4o-mini"),
            api_key=api_key,
        )
        
        # Compute reward (exact match after normalizing case, underscores, and surrounding whitespace)
        expected_intent = sample["label"]
        is_correct = (
            predicted_intent.lower().replace("_", " ").strip() 
            == expected_intent.lower().replace("_", " ").strip()
        )
        reward = 1.0 if is_correct else 0.0
        
        # Return just metrics - NO trace (Synth reconstructs from intercepted calls)
        return RolloutResponse(
            run_id=request.run_id,
            metrics=RolloutMetrics(outcome_reward=reward),
            trace=None,  # Synth server reconstructs traces from inference proxy calls
            trace_correlation_id=request.policy.config.get("trace_correlation_id"),
        )
    
    def provide_taskset_description():
        """Returns static metadata about the dataset."""
        return {
            "splits": ["train", "test"],
            "sizes": {"train": dataset.size("train"), "test": dataset.size("test")},
        }
    
    def provide_task_instances(seeds):
        """Returns per-seed task metadata."""
        for seed in seeds:
            sample = dataset.sample(split="train", index=seed)
            yield TaskInfo(
                task={"id": APP_ID, "name": APP_NAME},
                dataset={"id": APP_ID, "split": sample["split"], "index": sample["index"]},
                inference={"tool": TOOL_NAME},
                limits={"max_turns": 1},
                task_metadata={"query": sample["text"], "expected_intent": sample["label"]},
            )
    
    return create_local_api(LocalAPIConfig(
        app_id=APP_ID,
        name=APP_NAME,
        description=f"{APP_NAME} local API for classifying customer queries into banking intents.",
        provide_taskset_description=provide_taskset_description,
        provide_task_instances=provide_task_instances,
        rollout=run_rollout,
        cors_origins=["*"],
    ))

print('Banking77 local API defined')
print('  - classify_banking77_query(): Core pipeline (async), no Synth logic')
print('  - create_banking77_local_api(): Thin wrapper that routes calls via OPENAI_BASE_URL')

## Step 5: Start Baseline Local API

Now we'll spin up the baseline local API and expose it via a Cloudflare tunnel. The SDK provides `TunneledLocalAPI` for clean one-liner tunnel setup that handles rotation, cloudflared process management, and DNS verification automatically.
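
The cell below waits for the local server to report healthy before it provisions the tunnel. The general pattern is a poll-with-deadline loop, sketched here with the standard library only. This is not the implementation of the SDK's `wait_for_health_check`; `wait_until` is a hypothetical helper that accepts any zero-argument check.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll `check` until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Using `time.monotonic()` rather than wall-clock time keeps the deadline stable even if the system clock is adjusted while waiting.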

In [None]:
# Step 5: Start Baseline Local API with Cloudflare Tunnel
import nest_asyncio
nest_asyncio.apply()

# Import tunnel helpers from SDK - TunneledLocalAPI handles rotation, cloudflared, and DNS verification
from synth_ai.sdk.tunnels import TunneledLocalAPI, TunnelBackend, kill_port, wait_for_health_check
from synth_ai.sdk.task import run_server_background

BASELINE_SYSTEM_PROMPT = "You are an expert banking assistant that classifies customer queries into banking intents. Given a customer message, respond with exactly one intent label from the provided list using the `banking77_classify` tool."
USER_PROMPT = "Customer Query: {query}\n\nAvailable Intents:\n{available_intents}\n\nClassify this query into one of the above banking intents using the tool call."

# Create baseline local API using SDK abstractions
baseline_app = create_banking77_local_api(BASELINE_SYSTEM_PROMPT, ENVIRONMENT_API_KEY)

# Kill port if in use, start server in background
kill_port(LOCAL_API_PORT)
run_server_background(baseline_app, LOCAL_API_PORT)

# Wait for local health check
print(f'Waiting for baseline local API on port {LOCAL_API_PORT}...')
await wait_for_health_check("localhost", LOCAL_API_PORT, ENVIRONMENT_API_KEY, timeout=30.0)
print('Baseline local API ready!')

# Create managed tunnel - handles rotation, cloudflared process, and DNS verification
print('\nProvisioning Cloudflare tunnel for baseline...')
baseline_tunnel = await TunneledLocalAPI.create(
    local_port=LOCAL_API_PORT,
    backend=TunnelBackend.CloudflareManagedTunnel,
    api_key=API_KEY,
    env_api_key=ENVIRONMENT_API_KEY,
    reason="baseline_notebook",
    backend_url=SYNTH_API_BASE,
    progress=True,
)
BASELINE_LOCAL_API_URL = baseline_tunnel.url

print(f'\nBaseline local API URL: {BASELINE_LOCAL_API_URL}')

## Step 6: Run GEPA Optimization

This kicks off the GEPA prompt optimization job on the Synth backend. GEPA evolves the system prompt over multiple generations, evaluating each candidate against the training seeds via our tunneled local API. Because evolutionary prompt optimization requires multiple sequential rounds of evaluation, expect the job to take roughly (time per evaluation round) x (number of generations).
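
As a back-of-the-envelope illustration of that estimate, the sketch below assumes rollouts run in waves of `max_concurrent` and that every generation evaluates the same number of rollouts. Both are simplifying assumptions rather than a description of how GEPA schedules work, and the 4 seconds per rollout is a made-up figure; mutation calls and queueing are ignored.

```python
import math


def estimate_gepa_seconds(
    rollouts_per_round: int,
    max_concurrent: int,
    seconds_per_rollout: float,
    num_generations: int,
) -> float:
    """Rough lower bound: (time per evaluation round) x (number of generations)."""
    waves = math.ceil(rollouts_per_round / max_concurrent)
    seconds_per_round = waves * seconds_per_rollout
    return seconds_per_round * num_generations


# max_concurrent=5 and num_generations=2 mirror the config below;
# 30 rollouts per round and 4.0 s per rollout are illustrative assumptions.
estimate = estimate_gepa_seconds(30, 5, 4.0, 2)
print(f"Estimated optimization time: about {estimate:.0f} s")
```

Treat the result as an order-of-magnitude guide only; actual run time depends on model latency and backend load.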

In [None]:
# Step 6: Run GEPA Optimization (using SDK)
from synth_ai.sdk.api.train.prompt_learning import PromptLearningJob

def run_gepa():
    config_body = {
        'prompt_learning': {
            'algorithm': 'gepa',
            'run_local': False,  # Run on remote backend
            'local_api_url': BASELINE_LOCAL_API_URL,
            'local_api_key': ENVIRONMENT_API_KEY,
            'env_name': 'banking77',
            'initial_prompt': {
                'messages': [
                    {'role': 'system', 'order': 0, 'pattern': BASELINE_SYSTEM_PROMPT},
                    {'role': 'user', 'order': 1, 'pattern': USER_PROMPT},
                ],
                'wildcards': {'query': 'REQUIRED', 'available_intents': 'OPTIONAL'},
            },
            'policy': {'model': 'gpt-4.1-nano', 'provider': 'openai', 'temperature': 0.0, 'max_completion_tokens': 256},
            'gepa': {
                'env_name': 'banking77',
                'evaluation': {'seeds': list(range(30)), 'validation_seeds': list(range(50, 56))},
                'rollout': {'budget': 50, 'max_concurrent': 5, 'minibatch_size': 5},
                'mutation': {'rate': 0.3, 'llm_model': 'gpt-4.1-nano'},
                'population': {'initial_size': 3, 'num_generations': 2, 'children_per_generation': 2},
                'archive': {'size': 5, 'pareto_set_size': 10},
                'token': {'counting_model': 'gpt-4'},
            },
        },
    }

    print(f'Creating GEPA job (local_api_url={BASELINE_LOCAL_API_URL})...')
    
    # Use SDK instead of raw httpx - both file-based and dict-based configs
    # go through the same PromptLearningConfig Pydantic validation
    pl_job = PromptLearningJob.from_dict(
        config_dict=config_body,
        backend_url=SYNTH_API_BASE,
        api_key=API_KEY,
        local_api_key=ENVIRONMENT_API_KEY,
        skip_health_check=True,  # Tunnel DNS may not have propagated yet
    )
    
    job_id = pl_job.submit()
    print(f'Job ID: {job_id}')

    # SDK handles polling with built-in progress printing
    # Returns typed PromptLearningResult with .succeeded, .best_score, etc.
    result = pl_job.poll_until_complete(timeout=3600.0, interval=3.0, progress=True)
    
    print(f'\nFINAL: {result.status.value}')
    
    if result.succeeded:
        print(f'BEST SCORE: {result.best_score}')
    elif result.failed:
        print(f'ERROR: {result.error}')
    
    return result

# Sync function - no await needed
result = run_gepa()

## Step 7: Evaluation

Now we evaluate both the baseline prompt we started with and the optimized prompt. The evaluation uses a separate set of examples from the dataset's test split, so it tells us whether the optimizer improved the prompt in general or merely overfit to the training set.
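
With only 20 evaluation seeds, each example moves the score by 5 percentage points, so small differences between the two prompts may be noise. A quick way to gauge this is the standard error of a proportion, sketched below for 0/1 rewards. The reward list in the example is hypothetical, not output from this demo.

```python
import math
from typing import Sequence, Tuple


def mean_and_stderr(rewards: Sequence[float]) -> Tuple[float, float]:
    """Mean of 0/1 rewards and the binomial standard error sqrt(p * (1 - p) / n)."""
    n = len(rewards)
    if n == 0:
        raise ValueError("need at least one reward")
    mean = sum(rewards) / n
    stderr = math.sqrt(mean * (1.0 - mean) / n)
    return mean, stderr


# Hypothetical run: 14 correct out of 20.
example_rewards = [1.0] * 14 + [0.0] * 6
mean, stderr = mean_and_stderr(example_rewards)
print(f"accuracy = {mean:.1%} +/- {stderr:.1%}")
```

If the lift reported at the end of the evaluation cell is smaller than about one standard error, consider rerunning with more eval seeds before drawing conclusions.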

In [None]:
# Step 7: Run Formal Eval Jobs (Baseline vs Optimized)
from synth_ai.sdk.api.eval import EvalJob, EvalJobConfig, EvalResult
from synth_ai.sdk.learning.prompt_learning_client import PromptLearningClient
from synth_ai.sdk.tunnels import TunneledLocalAPI, TunnelBackend, kill_port, wait_for_health_check
from synth_ai.sdk.task import run_server_background

EVAL_SEEDS = list(range(100, 120))  # 20 held-out test samples

def run_eval_job(local_api_url: str, local_api_key: str, seeds: list[int], mode: str) -> EvalResult:
    """Run an eval job using the SDK and wait for completion."""
    config = EvalJobConfig(
        local_api_url=local_api_url,
        backend_url=SYNTH_API_BASE,
        api_key=API_KEY,
        local_api_key=local_api_key,
        env_name='banking77',
        seeds=seeds,
        policy_config={'model': 'gpt-4.1-nano', 'provider': 'openai'},
        env_config={'split': 'test'},
        concurrency=10,
    )

    job = EvalJob(config)
    job_id = job.submit()
    print(f'  {mode} eval job: {job_id}')

    # SDK handles polling with built-in progress printing
    # Returns typed EvalResult with .succeeded, .mean_score, etc.
    return job.poll_until_complete(timeout=600.0, interval=2.0, progress=True)


def extract_system_prompt(prompt_results) -> str:
    """Extract system prompt from the best optimized prompt."""
    sections = prompt_results.top_prompts[0]['template']['sections']
    return next(s['content'] for s in sections if s['role'] == 'system')


# Use typed result from Step 6 (PromptLearningResult with .succeeded, .job_id, etc.)
if result.succeeded:
    print("GEPA Job Succeeded!\n")
    
    # Use PromptLearningClient to retrieve the optimized prompt
    pl_client = PromptLearningClient(SYNTH_API_BASE, API_KEY)
    prompt_results = await pl_client.get_prompts(result.job_id)
    
    # Extract optimized system prompt from top_prompts
    optimized_system = extract_system_prompt(prompt_results)
    best_train_reward = prompt_results.best_score
    
    print('=' * 60)
    print('BASELINE SYSTEM PROMPT')
    print('=' * 60)
    print(BASELINE_SYSTEM_PROMPT)
    
    print('\n' + '=' * 60)
    print('OPTIMIZED SYSTEM PROMPT (from GEPA)')
    print('=' * 60)
    print(optimized_system[:800] + "..." if len(optimized_system) > 800 else optimized_system)
    
    print('\n' + '=' * 60)
    print('GEPA TRAINING RESULTS')
    print('=' * 60)
    print(f"Best Train Reward: {best_train_reward:.1%}")
    
    print('\n' + '=' * 60)
    print(f'FORMAL EVAL JOBS (test split, seeds {EVAL_SEEDS[0]}-{EVAL_SEEDS[-1]})')
    print('=' * 60)
    
    # Start optimized local API on different port using SDK abstractions
    print(f'\nStarting optimized local API on port {OPTIMIZED_LOCAL_API_PORT}...')
    optimized_app = create_banking77_local_api(optimized_system, ENVIRONMENT_API_KEY)
    
    kill_port(OPTIMIZED_LOCAL_API_PORT)
    run_server_background(optimized_app, OPTIMIZED_LOCAL_API_PORT)
    await wait_for_health_check("localhost", OPTIMIZED_LOCAL_API_PORT, ENVIRONMENT_API_KEY, timeout=30.0)
    print('Optimized local API ready!')
    
    # Create managed tunnel - TunneledLocalAPI handles rotation, cloudflared, and DNS verification
    print('\nProvisioning Cloudflare tunnel for optimized...')
    optimized_tunnel = await TunneledLocalAPI.create(
        local_port=OPTIMIZED_LOCAL_API_PORT,
        backend=TunnelBackend.CloudflareManagedTunnel,
        api_key=API_KEY,
        env_api_key=ENVIRONMENT_API_KEY,
        reason="optimized_notebook",
        backend_url=SYNTH_API_BASE,
        progress=True,
    )
    OPTIMIZED_LOCAL_API_URL = optimized_tunnel.url
    
    # Run baseline eval - SDK handles progress printing
    print('\nRunning BASELINE eval job...')
    baseline_result = run_eval_job(
        local_api_url=BASELINE_LOCAL_API_URL,
        local_api_key=ENVIRONMENT_API_KEY,
        seeds=EVAL_SEEDS,
        mode='baseline'
    )
    
    if baseline_result.succeeded:
        print(f'  Baseline eval reward: {baseline_result.mean_score:.1%}')
    else:
        print(f'  Baseline eval failed: {baseline_result.error}')
    
    # Run optimized eval - SDK handles progress printing
    print('\nRunning OPTIMIZED eval job...')
    optimized_result = run_eval_job(
        local_api_url=OPTIMIZED_LOCAL_API_URL,
        local_api_key=ENVIRONMENT_API_KEY,
        seeds=EVAL_SEEDS,
        mode='optimized'
    )
    
    if optimized_result.succeeded:
        print(f'  Optimized eval reward: {optimized_result.mean_score:.1%}')
    else:
        print(f'  Optimized eval failed: {optimized_result.error}')
    
    # Final comparison (only if both succeeded)
    if baseline_result.succeeded and optimized_result.succeeded:
        print('\n' + '=' * 60)
        print('FINAL COMPARISON')
        print('=' * 60)
        print("Training:")
        print(f"  Best Train Reward: {best_train_reward:.1%}")
        
        print(f"\nEval (seeds {EVAL_SEEDS[0]}-{EVAL_SEEDS[-1]}, held-out):")
        print(f"  Baseline Reward:  {baseline_result.mean_score:.1%}")
        print(f"  Optimized Reward: {optimized_result.mean_score:.1%}")
        
        eval_lift = optimized_result.mean_score - baseline_result.mean_score
        print(f"  Lift:             {eval_lift:+.1%}")
        
        if eval_lift > 0:
            print("\n>>> OPTIMIZATION GENERALIZES TO HELD-OUT DATA!")
        elif eval_lift == 0:
            print("\n=== Same performance on held-out data")
        else:
            print("\n<<< Baseline better on held-out (possible overfitting)")
else:
    print(f"Job failed: {result.status.value}")
    if result.error:
        print(f"Error: {result.error}")

## Step 8: Cleanup

Terminate cloudflared tunnel processes to free up resources.
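
For context, stopping a background process generally means asking it to exit and force-killing it only if it does not comply. The sketch below shows that pattern with the standard library. It is not the implementation of the SDK's `cleanup_all`; `stop_process` is a hypothetical helper, and the child process here is just a sleeping Python interpreter standing in for a long-running tool.

```python
import subprocess
import sys


def stop_process(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Ask a child process to exit, then force-kill it if it ignores the request."""
    if proc.poll() is None:
        proc.terminate()
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
    return proc.returncode


# Stand-in for a long-running background process.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = stop_process(child)
print("child exited with code", exit_code)
```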

In [None]:
# Step 8: Cleanup
from synth_ai.sdk.tunnels import cleanup_all

print('Cleaning up cloudflared processes...')
cleanup_all()
print('Demo complete!')