# Synth GEPA Demo - Banking77 (Production)

Self-contained notebook for prompt optimization using GEPA.

Banking77 is an intent-classification task: the model labels each customer service request with one of 77 fixed intents.

**Run in Google Colab:** [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/synth-laboratories/synth-ai/blob/main/demos/gepa_banking77/demo_prod.ipynb)

**What this demo does:**
1. Spins up a local API that runs the Banking77 classification pipeline
2. Creates a Cloudflare tunnel to expose it to the internet
3. Runs GEPA prompt optimization via Synth
4. Compares baseline vs optimized prompts on held-out data

In [None]:
# Step 0: Install dependencies (run this first on Colab)
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print("Running in Google Colab - installing dependencies...")
    !pip install -q synth-ai httpx fastapi uvicorn datasets nest_asyncio
    
    # Install cloudflared
    !wget -q https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64 -O /usr/local/bin/cloudflared
    !chmod +x /usr/local/bin/cloudflared
    !cloudflared --version
    
    print("Dependencies installed!")
else:
    print("Not in Colab - assuming dependencies are already installed")
    print("Required: pip install synth-ai httpx fastapi uvicorn datasets nest_asyncio")
    print("Required: brew install cloudflare/cloudflare/cloudflared (macOS)")

## Step 1: Setup

To complete prompt optimization, we'll spin up services that expose our business logic to the optimizer over HTTP. Because the final evaluation needs both the baseline and the optimized prompt served at the same time, we'll pick two distinct ports.
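Before binding, it can help to confirm that both ports are actually free. The following is a minimal standard-library sketch; `port_is_free` is a hypothetical helper for illustration and is not part of the Synth SDK (the notebook itself later relies on the SDK's `kill_port`).

```python
import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is currently listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 only when a listener accepted the connection
        return sock.connect_ex((host, port)) != 0

# The two ports this notebook uses for the baseline and optimized services
for port in (8001, 8002):
    print(port, "free" if port_is_free(port) else "in use")
```

A check like this only reports the state at the moment it runs; another process could still claim the port before the server binds to it.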

In [None]:
# Step 1: Imports, Config, and Backend Health Check
import os, sys, time, asyncio, json, threading
from typing import Any, Optional

import httpx
import uvicorn
from fastapi import FastAPI, Request, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from datasets import load_dataset

# Production backend
SYNTH_API_BASE = 'https://api.usesynth.ai'
TASK_APP_PORT = 8001
OPTIMIZED_TASK_APP_PORT = 8002

print(f'Backend: {SYNTH_API_BASE}')
print(f'Task App Ports: {TASK_APP_PORT}, {OPTIMIZED_TASK_APP_PORT}')

# Check backend health
r = httpx.get(f'{SYNTH_API_BASE}/health', timeout=30)
if r.status_code == 200:
    print(f'Backend health: {r.json()}')
else:
    print(f'WARNING: Backend returned status {r.status_code}')
    print(f'Response: {r.text[:200]}...' if len(r.text) > 200 else f'Response: {r.text}')
    raise RuntimeError(f'Backend not healthy: status {r.status_code}')

## Step 2: Authentication

To start any job on Synth, we need an API key. Normally you would use your own; for simplicity, this notebook mints a demo key that is scoped to these demos and should not be used anywhere else.
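When a notebook prints a key for confirmation, showing only a short suffix keeps most of the secret out of saved cell output. A minimal sketch, assuming a hypothetical `mask_secret` helper and a made-up key string (neither comes from the Synth SDK):

```python
def mask_secret(secret: str, visible: int = 4) -> str:
    """Return the secret with all but the last `visible` characters starred out."""
    if len(secret) <= visible:
        # Too short to reveal anything safely: mask it entirely
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]

# Made-up key, for illustration only
print(mask_secret("sk_demo_1234567890abcdef"))
```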

In [None]:
# Step 2: Get API Key (use env var or mint demo key)
API_KEY = os.environ.get('SYNTH_API_KEY', '')

if not API_KEY:
    print('No SYNTH_API_KEY found, minting demo key...')
    resp = httpx.post(f'{SYNTH_API_BASE}/api/demo/keys', json={'ttl_hours': 4}, timeout=30)
    resp.raise_for_status()
    API_KEY = resp.json()['api_key']
    print(f'Demo API Key: {API_KEY[:25]}...')
else:
    print(f'Using SYNTH_API_KEY: {API_KEY[:20]}...')

## Step 3: Environment Key

Because your business logic's integrity matters, the local API you serve during optimization should require authentication. The SDK provides functions to mint and register an environment key, which the Synth optimization job then uses to authenticate its requests to our local API.
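To make the idea concrete, here is a rough sketch of what key-based authentication involves: generating a random key and comparing a presented key in constant time. This is not how the SDK implements it; `mint_key`, `is_authorized`, and the `x-api-key` header name are assumptions chosen for illustration.

```python
import hmac
import secrets

def mint_key(num_bytes: int = 32) -> str:
    """Generate a random hex key (hypothetical stand-in for the SDK's minting helper)."""
    return secrets.token_hex(num_bytes)

def is_authorized(headers: dict, expected_key: str) -> bool:
    """Compare the presented key with the expected key in constant time."""
    presented = headers.get("x-api-key", "")  # header name is an assumption
    return hmac.compare_digest(presented.encode(), expected_key.encode())

key = mint_key()
print(is_authorized({"x-api-key": key}, key))      # matching key
print(is_authorized({"x-api-key": "wrong"}, key))  # mismatched key
```

Using `hmac.compare_digest` rather than `==` avoids leaking how many leading characters matched through response timing.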

In [None]:
# Step 3: Mint and Upload Environment Key (using SDK helpers)
from synth_ai.sdk.learning.rl import mint_environment_api_key, setup_environment_api_key

ENVIRONMENT_API_KEY = mint_environment_api_key()
print(f'Minted env key: {ENVIRONMENT_API_KEY[:12]}...{ENVIRONMENT_API_KEY[-4:]}')

result = setup_environment_api_key(SYNTH_API_BASE, API_KEY, token=ENVIRONMENT_API_KEY)
print(f'Uploaded env key: {result}')

## Step 4: Local API Definition

Here, we're going to define a FastAPI service that will serve our Banking77 classification pipeline for the optimizer to use.

The service needs these routes:

| Route | Method | Input | Output | Description |
|-------|--------|-------|--------|-------------|
| `/` | GET | - | `str` | Root endpoint, returns welcome message |
| `/health` | GET | - | `{"status": "ok"}` | Health check for liveness probes |
| `/info` | GET | - | `TaskInfo` | Static metadata about the task app |
| `/task_info` | GET | `?seeds=0,1,2` | `list[TaskInfo]` | Task instances for specific seeds |
| `/rollout` | POST | `RolloutRequest` | `RolloutResponse` | Execute a rollout and return metrics + trace |

**Key Classes:**
- `RolloutRequest`: Contains `run_id`, `env` (seed, config), `policy` (inference_url, model settings)
- `RolloutResponse`: Contains `metrics` (scores), `trace` (messages, response), `trace_correlation_id`
- `TaskInfo`: Static metadata (task descriptor, dataset info, rubric, inference defaults, limits)

The SDK's `LocalAPIConfig` + `create_local_api()` handles all the routing boilerplate; you just provide callbacks for `rollout`, `describe_taskset`, and `provide_task_instances`.
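As an illustration of the `/task_info` input shown in the table, a comma-separated `seeds` value such as `0,1,2` could be turned into a list of integers as follows. `parse_seeds` is a hypothetical helper; the SDK does its own parsing, and its exact behavior is not shown in this notebook.

```python
def parse_seeds(raw: str) -> list[int]:
    """Parse a comma-separated seeds query value such as "0,1,2"."""
    seeds = []
    for part in raw.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate trailing commas and empty input
        seeds.append(int(part))  # raises ValueError on non-integer input
    return seeds

print(parse_seeds("0,1,2"))  # [0, 1, 2]
```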

In [None]:
# Step 4: Define Banking77 Local API using SDK Abstractions
import json
import os
from urllib.parse import urlparse, urlunparse

# Synth-AI SDK LocalAPI imports (preferred over task.* namespace)
from synth_ai.sdk.localapi import (
    LocalAPIConfig,
    create_local_api,
    RolloutResponseBuilder,
)
from synth_ai.sdk.localapi.helpers import (
    call_chat_completion_api,
    create_http_client_hooks,
    extract_api_key,
    normalize_chat_completion_url,
)
from synth_ai.sdk.task.contracts import (
    RolloutMetrics,
    RolloutRequest,
    RolloutResponse,
    TaskInfo,
)

# ============================================================================
# Business Logic (inline for Colab compatibility)
# ============================================================================

BANKING77_LABELS = [
    "activate_my_card", "age_limit", "apple_pay_or_google_pay", "atm_support", "automatic_top_up",
    "balance_not_updated_after_bank_transfer", "balance_not_updated_after_cheque_or_cash_deposit",
    "beneficiary_not_allowed", "cancel_transfer", "card_about_to_expire", "card_acceptance",
    "card_arrival", "card_delivery_estimate", "card_linking", "card_not_working",
    "card_payment_fee_charged", "card_payment_not_recognised", "card_payment_wrong_exchange_rate",
    "card_swallowed", "cash_withdrawal_charge", "cash_withdrawal_not_recognised", "change_pin",
    "compromised_card", "contactless_not_working", "country_support", "declined_card_payment",
    "declined_cash_withdrawal", "declined_transfer", "direct_debit_payment_not_recognised",
    "disposable_card_limits", "edit_personal_details", "exchange_charge", "exchange_rate",
    "exchange_via_app", "extra_charge_on_statement", "failed_transfer", "fiat_currency_support",
    "get_disposable_virtual_card", "get_physical_card", "getting_spare_card", "getting_virtual_card",
    "lost_or_stolen_card", "lost_or_stolen_phone", "order_physical_card", "passcode_forgotten",
    "pending_card_payment", "pending_cash_withdrawal", "pending_top_up", "pending_transfer",
    "pin_blocked", "receiving_money", "Refund_not_showing_up", "request_refund",
    "reverted_card_payment?", "supported_cards_and_currencies", "terminate_account",
    "top_up_by_bank_transfer_charge", "top_up_by_card_charge", "top_up_by_cash_or_cheque",
    "top_up_failed", "top_up_limits", "top_up_reverted", "topping_up_by_card",
    "transaction_charged_twice", "transfer_fee_charged", "transfer_into_account",
    "transfer_not_received_by_recipient", "transfer_timing", "unable_to_verify_identity",
    "verify_my_identity", "verify_source_of_funds", "verify_top_up", "virtual_card_not_working",
    "visa_or_mastercard", "why_verify_identity", "wrong_amount_of_cash_received",
    "wrong_exchange_rate_for_cash_withdrawal",
]

TOOL_NAME = "banking77_classify"
TOOL_SCHEMA = {
    "type": "function",
    "function": {
        "name": TOOL_NAME,
        "description": "Return the predicted banking77 intent label.",
        "parameters": {"type": "object", "properties": {"intent": {"type": "string"}}, "required": ["intent"]},
    },
}

class Banking77Dataset:
    """Lazy Hugging Face dataset loader for Banking77."""
    def __init__(self):
        self._cache = {}
        self._label_names = None

    def _load_split(self, split: str):
        if split not in self._cache:
            ds = load_dataset("banking77", split=split, trust_remote_code=False)
            self._cache[split] = ds
            if self._label_names is None and hasattr(ds.features.get("label"), "names"):
                self._label_names = ds.features["label"].names
        return self._cache[split]

    def ensure_ready(self, splits):
        for split in splits:
            self._load_split(split)

    def size(self, split: str) -> int:
        return len(self._load_split(split))

    def sample(self, *, split: str, index: int) -> dict:
        ds = self._load_split(split)
        idx = index % len(ds)
        row = ds[idx]
        label_idx = int(row.get("label", 0))
        label_text = self._label_names[label_idx] if self._label_names and label_idx < len(self._label_names) else f"label_{label_idx}"
        return {"index": idx, "split": split, "text": str(row.get("text", "")), "label": label_text}

    @property
    def label_names(self) -> list:
        if self._label_names is None:
            self._load_split("train")
        return self._label_names or []

class Banking77Scorer:
    """Scorer for Banking77 intent classification."""
    @staticmethod
    def normalize_intent(intent: str) -> str:
        return intent.lower().replace("_", " ").strip()

    @classmethod
    def score(cls, predicted: str, expected: str) -> tuple:
        is_correct = cls.normalize_intent(predicted) == cls.normalize_intent(expected)
        return is_correct, 1.0 if is_correct else 0.0

def format_available_intents(label_names: list) -> str:
    return "\n".join(f"{i+1}. {l}" for i, l in enumerate(label_names))

# ============================================================================
# Local API Factory using SDK abstractions
# ============================================================================

def create_banking77_local_api(system_prompt: str, env_api_key: str):
    """Factory to create a Banking77 local API using SDK abstractions."""
    from fastapi import Request
    
    # Set ENVIRONMENT_API_KEY so SDK auth middleware can validate it
    os.environ["ENVIRONMENT_API_KEY"] = env_api_key
    
    # Initialize dataset
    dataset = Banking77Dataset()
    dataset.ensure_ready(["train", "test"])
    
    # Create HTTP client lifecycle hooks
    startup_http_client, shutdown_http_client = create_http_client_hooks(
        timeout=60.0,
        log_prefix="banking77_local_api",
    )
    
    async def rollout_executor(request: RolloutRequest, fastapi_request: Request) -> RolloutResponse:
        """Execute a rollout for the banking77 classification task."""
        policy_config = request.policy.config or {}
        split = (request.env.config or {}).get("split", "train")
        seed = request.env.seed or 0
        
        # Get sample from dataset
        sample = dataset.sample(split=split, index=seed)
        intents_list = format_available_intents(dataset.label_names)
        
        # Build messages
        user_msg = f"Customer Query: {sample['text']}\n\nAvailable Intents:\n{intents_list}\n\nClassify this query into one of the above banking intents using the tool call."
        messages = [{"role": "system", "content": system_prompt}, {"role": "user", "content": user_msg}]
        
        # Extract API key using SDK helper
        api_key = extract_api_key(fastapi_request, policy_config)
        
        # Get HTTP client from app state
        http_client = getattr(fastapi_request.app.state, "http_client", None)
        
        # Call chat completion API using SDK helper
        response_text, response_json, tool_calls = await call_chat_completion_api(
            policy_config=policy_config,
            messages=messages,
            tools=[TOOL_SCHEMA],
            tool_choice="required",
            api_key=api_key,
            http_client=http_client,
            enable_dns_preresolution=True,
            expected_tool_name=TOOL_NAME,
            log_prefix="[LOCAL_API]",
        )
        
        # Extract predicted intent from tool calls
        predicted_intent = ""
        if tool_calls:
            for tc in tool_calls:
                if tc.get("function", {}).get("name") == TOOL_NAME:
                    args_str = tc.get("function", {}).get("arguments", "{}")
                    try:
                        args = json.loads(args_str)
                        predicted_intent = args.get("intent", "")
                    except Exception:
                        pass
        elif response_text:
            predicted_intent = response_text.strip().split()[0] if response_text.strip() else ""
        
        if not predicted_intent:
            predicted_intent = "__NO_PREDICTION__"
        
        # Score prediction using business logic
        expected_intent = sample["label"]
        is_correct, reward = Banking77Scorer.score(predicted_intent, expected_intent)
        
        # Build trace correlation ID
        trace_correlation_id = policy_config.get("trace_correlation_id")
        if not trace_correlation_id:
            from urllib.parse import urlsplit, parse_qs
            try:
                parsed = urlsplit(policy_config.get("inference_url", ""))
                cid_vals = parse_qs(parsed.query or "").get("cid", [])
                if cid_vals:
                    trace_correlation_id = cid_vals[0]
            except Exception:
                pass
        
        # Build trace
        trace = {
            "messages": messages,
            "response": response_json,
            "correlation_id": trace_correlation_id,
            "model": response_json.get("model") if isinstance(response_json, dict) else None,
            "metadata": {"env": "banking77", "split": sample["split"], "index": sample["index"], "correct": is_correct}
        }
        
        metrics = RolloutMetrics(
            episode_returns=[reward],
            mean_return=reward,
            num_steps=1,
            num_episodes=1,
            outcome_score=reward,
            events_score=reward,
            details={"correct": is_correct},
        )
        
        return RolloutResponse(
            run_id=request.run_id,
            branches={},
            metrics=metrics,
            aborted=False,
            trace_correlation_id=trace_correlation_id,
            trace=trace,
            pipeline_metadata={"inference_url": policy_config.get("inference_url", "")},
        )
    
    # Define taskset descriptor
    def describe_taskset():
        return {
            "id": "banking77",
            "name": "Banking77 Intent Classification",
            "splits": ["train", "test"],
            "num_labels": len(dataset.label_names),
            "sizes": {"train": dataset.size("train"), "test": dataset.size("test")},
        }
    
    # Define task instance provider (use plain dicts to avoid pydantic model unpacking issues)
    def provide_task_instances(seeds):
        base_dataset = {"id": "banking77", "splits": ["train", "test"]}
        base_task_meta = {"format": "tool_call"}
        for seed in seeds:
            sample = dataset.sample(split="train", index=seed)
            yield TaskInfo(
                task={"id": "banking77", "name": "Banking77 Intent Classification", "version": "1.0.0"},
                environment="banking77",
                dataset={**base_dataset, "split": sample["split"], "index": sample["index"]},
                rubric={"version": "1", "criteria_count": 1},
                inference={"supports_proxy": True, "tool": TOOL_NAME},
                limits={"max_turns": 1},
                task_metadata={**base_task_meta, "query": sample["text"], "expected_intent": sample["label"]},
            )
    
    # Build LocalAPIConfig
    config = LocalAPIConfig(
        app_id="banking77",
        name="Banking77 Intent Classification",
        description="Banking77 local API for classifying customer queries into banking intents.",
        base_task_info=TaskInfo(
            task={"id": "banking77", "name": "Banking77 Intent Classification", "version": "1.0.0"},
            environment="banking77",
            dataset={"id": "banking77", "splits": ["train", "test"]},
            rubric={"version": "1", "criteria_count": 1},
            inference={"supports_proxy": True, "tool": TOOL_NAME},
            limits={"max_turns": 1},
            task_metadata={"format": "tool_call"},
        ),
        describe_taskset=describe_taskset,
        provide_task_instances=provide_task_instances,
        rollout=rollout_executor,
        app_state={"banking77_dataset": dataset},
        cors_origins=["*"],
        startup_hooks=[startup_http_client],
        shutdown_hooks=[shutdown_http_client],
    )
    
    # Create the FastAPI app using SDK
    return create_local_api(config)

print('Banking77 local API factory defined (using SDK abstractions)')

## Step 5: Start Baseline Local API

Now we'll spin up the baseline local API and expose it via a Cloudflare tunnel. The SDK provides helpers for tunnel management (`rotate_tunnel`, `connect_managed_tunnel`) and process lifecycle (`track_process`, `kill_port`).
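Waiting for a freshly started server usually comes down to polling a probe until it succeeds or a deadline passes. The sketch below shows that pattern in generic form; `wait_until_healthy` is a hypothetical helper and is not the SDK's `wait_for_health_check`, whose internals are not shown here.

```python
import time
from typing import Callable

def wait_until_healthy(probe: Callable[[], bool], timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Call `probe` repeatedly until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if probe():
                return True
        except Exception:
            pass  # treat errors (e.g. connection refused) as "not ready yet"
        time.sleep(interval)
    return False

# Example with a fake probe that becomes healthy on its third call
calls = {"n": 0}
def fake_probe() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_until_healthy(fake_probe, timeout=2.0, interval=0.01))  # True
```

In practice the probe would issue a request to the service's `/health` route and check for a successful status.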

In [None]:
# Step 5: Start Baseline Local API with Cloudflare Tunnel
import nest_asyncio
nest_asyncio.apply()

# Import tunnel helpers from SDK
from synth_ai.sdk.tunnels import (
    rotate_tunnel,
    connect_managed_tunnel,  # High-level: start + wait + verify
    track_process,
    kill_port,
    wait_for_health_check,
)
from synth_ai.sdk.task import run_server_background

BASELINE_SYSTEM_PROMPT = "You are an expert banking assistant that classifies customer queries into banking intents. Given a customer message, respond with exactly one intent label from the provided list using the `banking77_classify` tool."
USER_PROMPT = "Customer Query: {query}\n\nAvailable Intents:\n{available_intents}\n\nClassify this query into one of the above banking intents using the tool call."

# Create baseline local API using SDK abstractions
baseline_app = create_banking77_local_api(BASELINE_SYSTEM_PROMPT, ENVIRONMENT_API_KEY)

# Kill port if in use, start server in background
kill_port(TASK_APP_PORT)
run_server_background(baseline_app, TASK_APP_PORT)

# Wait for local health check
print(f'Waiting for baseline local API on port {TASK_APP_PORT}...')
await wait_for_health_check("localhost", TASK_APP_PORT, ENVIRONMENT_API_KEY, timeout=30.0)
print('Baseline local API ready!')

# Get tunnel from backend
print('\nProvisioning Cloudflare tunnel for baseline...')
baseline_tunnel = await rotate_tunnel(API_KEY, TASK_APP_PORT, reason="baseline_notebook")
BASELINE_TUNNEL_HOSTNAME = baseline_tunnel['hostname']
BASELINE_LOCAL_API_URL = f'https://{BASELINE_TUNNEL_HOSTNAME}'

# Connect tunnel with automatic waiting and verification (handles all the timing)
baseline_proc = await connect_managed_tunnel(
    baseline_tunnel['tunnel_token'],
    BASELINE_TUNNEL_HOSTNAME,
    api_key=ENVIRONMENT_API_KEY,
)
track_process(baseline_proc)

print(f'\nBaseline local API URL: {BASELINE_LOCAL_API_URL}')

## Step 6: Run GEPA Optimization

This kicks off the GEPA prompt optimization job on the Synth backend. GEPA will evolve the system prompt over multiple generations, evaluating each candidate against the training seeds via our tunneled local API. Because evolutionary prompt optimization requires multiple sequential rounds of evaluation, expect the run to take roughly (time per evaluation) × (number of generations).
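The rule of thumb above can be turned into a quick back-of-the-envelope calculation. This is a simplified sketch: it assumes rollouts run in fixed-size concurrent batches and ignores mutation calls, multiple candidates per generation, and queueing. The function name and the 2-second rollout time are assumptions for illustration.

```python
import math

def estimate_wall_clock(seconds_per_rollout: float, rollouts_per_eval: int,
                        max_concurrent: int, num_generations: int) -> float:
    """Rough estimate: generations run sequentially, each as one batched evaluation."""
    batches = math.ceil(rollouts_per_eval / max_concurrent)
    time_per_eval = batches * seconds_per_rollout
    return time_per_eval * num_generations

# Hypothetical: 2 s per rollout, 30 seeds per evaluation, 5 concurrent, 2 generations
print(estimate_wall_clock(2.0, 30, 5, 2))  # 24.0
```

Treat the result as a lower bound; real runs add overhead that this estimate leaves out.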

In [None]:
# Step 6: Run GEPA Optimization (using tunnel URL)
async def run_gepa():
    config_body = {
        'prompt_learning': {
            'algorithm': 'gepa',
            'run_local': False,  # Run on remote backend
            'task_app_url': BASELINE_LOCAL_API_URL,
            'task_app_api_key': ENVIRONMENT_API_KEY,
            'env_name': 'banking77',
            'initial_prompt': {
                'messages': [
                    {'role': 'system', 'order': 0, 'pattern': BASELINE_SYSTEM_PROMPT},
                    {'role': 'user', 'order': 1, 'pattern': USER_PROMPT},
                ],
                'wildcards': {'query': 'REQUIRED', 'available_intents': 'OPTIONAL'},
            },
            'policy': {'model': 'gpt-4.1-nano', 'provider': 'openai', 'temperature': 0.0, 'max_completion_tokens': 256},
            'gepa': {
                'env_name': 'banking77',
                'evaluation': {'seeds': list(range(30)), 'validation_seeds': list(range(50, 56))},
                'rollout': {'budget': 50, 'max_concurrent': 5, 'minibatch_size': 5},
                'mutation': {'rate': 0.3, 'llm_model': 'gpt-4.1-nano'},
                'population': {'initial_size': 3, 'num_generations': 2, 'children_per_generation': 2},
                'archive': {'size': 5, 'pareto_set_size': 10},
                'token': {'counting_model': 'gpt-4'},
            },
        },
    }

    print(f'Creating GEPA job (local_api_url={BASELINE_LOCAL_API_URL})...')
    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.post(
            f'{SYNTH_API_BASE}/api/prompt-learning/online/jobs',
            json={'algorithm': 'gepa', 'config_body': config_body},
            headers={'Authorization': f'Bearer {API_KEY}'}
        )
        if resp.status_code != 200:
            print(f'ERROR: {resp.status_code} - {resp.text[:500]}')
            resp.raise_for_status()
        job_id = resp.json()['job_id']
    print(f'Job ID: {job_id}')

    print('Polling...')
    start = time.time()
    last_status = None
    job = None
    
    while True:
        async with httpx.AsyncClient(timeout=30) as client:
            resp = await client.get(
                f'{SYNTH_API_BASE}/api/prompt-learning/online/jobs/{job_id}',
                headers={'Authorization': f'Bearer {API_KEY}'}
            )
            resp.raise_for_status()
            job = resp.json()
        
        status = job['status']
        elapsed = int(time.time() - start)
        best = job.get('best_train_score') or job.get('best_score')
        
        if status != last_status or elapsed % 15 == 0:
            print(f'    [{elapsed}s] {status} (best={best})')
            last_status = status
        
        if status in ['succeeded', 'failed', 'cancelled']:
            break
        await asyncio.sleep(3)

    print(f'\nFINAL: {status}')
    if status == 'succeeded':
        best = job.get('best_score') or job.get('best_train_score')
        print(f'BEST SCORE: {best}')
    elif status == 'failed':
        print(f'ERROR: {job.get("error")}')
    
    return job

job = await run_gepa()

## Step 7: Evaluation

Now we'll run an evaluation for both the baseline prompt we started with and the optimized prompt. It uses a separate, held-out set of samples, so it tells us whether the optimizer improved the prompt in general or merely overfit to the training set.
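The comparison itself is simple arithmetic: with 0/1 rewards, the mean reward is accuracy, and the lift is the difference between the two accuracies on the same seeds. A minimal sketch with made-up rewards; the function names are illustrative and not part of the SDK:

```python
def accuracy(rewards: list[float]) -> float:
    """Mean reward; with 0/1 rewards this is plain accuracy."""
    if not rewards:
        raise ValueError("no rewards to average")
    return sum(rewards) / len(rewards)

def lift(baseline: list[float], optimized: list[float]) -> float:
    """Absolute accuracy difference (optimized minus baseline)."""
    return accuracy(optimized) - accuracy(baseline)

# Made-up per-seed rewards, for illustration only
baseline_rewards = [1.0, 0.0, 1.0, 0.0]
optimized_rewards = [1.0, 1.0, 1.0, 0.0]
print(f"{lift(baseline_rewards, optimized_rewards):+.1%}")  # +25.0%
```

With only 20 held-out samples, each sample moves accuracy by 5 percentage points, so small lifts in either direction should be read with caution.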

In [None]:
# Step 7: Run Formal Eval Jobs (Baseline vs Optimized)
from synth_ai.sdk.learning.prompt_learning_client import PromptLearningClient

EVAL_SEEDS = list(range(100, 120))  # 20 held-out test samples

async def run_eval_job(local_api_url: str, local_api_key: str, seeds: list, mode: str) -> dict:
    """Run an eval job and wait for completion."""
    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.post(
            f'{SYNTH_API_BASE}/api/eval/jobs',
            json={
                'task_app_url': local_api_url,
                'task_app_api_key': local_api_key,
                'env_name': 'banking77',
                'seeds': seeds,
                'env_config': {'split': 'test'},
                'policy': {'model': 'gpt-4.1-nano', 'provider': 'openai'},
                'mode': mode,
                'max_concurrent': 10,
            },
            headers={'Authorization': f'Bearer {API_KEY}'}
        )
        if resp.status_code != 200:
            print(f'ERROR creating {mode} eval job: {resp.status_code} - {resp.text[:300]}')
            return {'status': 'failed', 'error': resp.text}
        
        job_id = resp.json()['job_id']
        print(f'  {mode} eval job: {job_id}')
    
    start = time.time()
    while True:
        async with httpx.AsyncClient(timeout=30) as client:
            resp = await client.get(
                f'{SYNTH_API_BASE}/api/eval/jobs/{job_id}',
                headers={'Authorization': f'Bearer {API_KEY}'}
            )
            if resp.status_code != 200:
                print(f'  Error polling: {resp.status_code}')
                await asyncio.sleep(2)
                continue
            job = resp.json()
        
        status = job.get('status', '')
        elapsed = int(time.time() - start)
        
        if status in ['completed', 'failed']:
            break
        
        if elapsed % 10 == 0:
            results = job.get('results') or {}
            completed = results.get('completed', 0)
            total = results.get('total', len(seeds))
            print(f'    [{elapsed}s] {status} ({completed}/{total})')
        
        await asyncio.sleep(2)
    
    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.get(
            f'{SYNTH_API_BASE}/api/eval/jobs/{job_id}/results',
            headers={'Authorization': f'Bearer {API_KEY}'}
        )
        if resp.status_code != 200:
            return {'status': 'failed', 'error': f'Failed to get results: {resp.status_code}'}
        return resp.json()

if job['status'] == 'succeeded':
    print("GEPA Job Succeeded!\n")
    
    # Use PromptLearningClient to properly retrieve the optimized prompt from events
    pl_client = PromptLearningClient(SYNTH_API_BASE, API_KEY)
    prompt_results = await pl_client.get_prompts(job['job_id'])
    
    # Extract optimized system prompt from best_prompt
    optimized_system = None
    best_prompt = prompt_results.best_prompt
    if best_prompt:
        # best_prompt may have 'messages' list with role/pattern structure
        messages = best_prompt.get('messages', [])
        for msg in messages:
            if msg.get('role') == 'system':
                optimized_system = msg.get('pattern') or msg.get('content')
                break
        
        # If messages is empty, try 'sections' structure (alternative format)
        if not optimized_system:
            sections = best_prompt.get('sections', [])
            for sec in sections:
                if sec.get('role') == 'system':
                    optimized_system = sec.get('content')
                    break
    
    # Fallback: try extracting from top_prompts full_text
    if not optimized_system and prompt_results.top_prompts:
        top_prompt = prompt_results.top_prompts[0]
        full_text = top_prompt.get('full_text', '')
        # Extract system content from full_text format: "[system | system_prompt]\ncontent"
        if '[system' in full_text.lower():
            import re
            match = re.search(r'\[system[^\]]*\]\s*\n(.*?)(?:\n\n\[|$)', full_text, re.DOTALL | re.IGNORECASE)
            if match:
                optimized_system = match.group(1).strip()
    
    # If we still don't have optimized_system, use baseline for comparison
    if not optimized_system:
        print("NOTE: Could not extract optimized prompt from job details, using baseline for comparison")
        optimized_system = BASELINE_SYSTEM_PROMPT
    
    # Get scores from prompt_results
    best_score = prompt_results.best_score
    
    print('=' * 60)
    print('BASELINE SYSTEM PROMPT')
    print('=' * 60)
    print(BASELINE_SYSTEM_PROMPT)
    
    print('\n' + '=' * 60)
    print('OPTIMIZED SYSTEM PROMPT (from GEPA)')
    print('=' * 60)
    print(optimized_system[:800] + "..." if len(optimized_system) > 800 else optimized_system)
    
    print('\n' + '=' * 60)
    print('GEPA TRAINING RESULTS')
    print('=' * 60)
    if best_score is not None:  # a score of 0.0 is still a valid result
        print(f"Best Train Score: {best_score:.1%}")
    
    print('\n' + '=' * 60)
    print(f'FORMAL EVAL JOBS (test split, seeds {EVAL_SEEDS[0]}-{EVAL_SEEDS[-1]})')
    print('=' * 60)
    
    # Start optimized local API on different port using SDK abstractions
    print(f'\nStarting optimized local API on port {OPTIMIZED_TASK_APP_PORT}...')
    optimized_app = create_banking77_local_api(optimized_system, ENVIRONMENT_API_KEY)
    
    kill_port(OPTIMIZED_TASK_APP_PORT)
    run_server_background(optimized_app, OPTIMIZED_TASK_APP_PORT)
    await wait_for_health_check("localhost", OPTIMIZED_TASK_APP_PORT, ENVIRONMENT_API_KEY, timeout=30.0)
    print('Optimized local API ready!')
    
    # Get tunnel for optimized
    print('\nProvisioning Cloudflare tunnel for optimized...')
    optimized_tunnel = await rotate_tunnel(API_KEY, OPTIMIZED_TASK_APP_PORT, reason="optimized_notebook")
    OPTIMIZED_TUNNEL_HOSTNAME = optimized_tunnel['hostname']
    OPTIMIZED_LOCAL_API_URL = f'https://{OPTIMIZED_TUNNEL_HOSTNAME}'
    
    # Connect tunnel with automatic waiting and verification (handles all the timing)
    optimized_proc = await connect_managed_tunnel(
        optimized_tunnel['tunnel_token'],
        OPTIMIZED_TUNNEL_HOSTNAME,
        api_key=ENVIRONMENT_API_KEY,
    )
    track_process(optimized_proc)
    
    # Run baseline eval
    print('\nRunning BASELINE eval job...')
    baseline_results = await run_eval_job(
        local_api_url=BASELINE_LOCAL_API_URL,
        local_api_key=ENVIRONMENT_API_KEY,
        seeds=EVAL_SEEDS,
        mode='baseline'
    )
    
    baseline_summary = baseline_results.get('summary', {})
    baseline_eval_score = baseline_summary.get('mean_score')
    print(f'  Baseline eval: {baseline_eval_score:.1%}' if baseline_eval_score is not None else '  Baseline eval: N/A')
    
    # Run optimized eval
    print('\nRunning OPTIMIZED eval job...')
    optimized_results = await run_eval_job(
        local_api_url=OPTIMIZED_LOCAL_API_URL,
        local_api_key=ENVIRONMENT_API_KEY,
        seeds=EVAL_SEEDS,
        mode='optimized'
    )
    
    optimized_summary = optimized_results.get('summary', {})
    optimized_eval_score = optimized_summary.get('mean_score')
    print(f'  Optimized eval: {optimized_eval_score:.1%}' if optimized_eval_score is not None else '  Optimized eval: N/A')
    
    # Final comparison
    print('\n' + '=' * 60)
    print('FINAL COMPARISON')
    print('=' * 60)
    print(f"Training:")
    if best_score is not None:  # a score of 0.0 is still a valid result
        print(f"  Best Score: {best_score:.1%}")
    
    print(f"\nEval (seeds {EVAL_SEEDS[0]}-{EVAL_SEEDS[-1]}, held-out):")
    if baseline_eval_score is not None:
        print(f"  Baseline:  {baseline_eval_score:.1%}")
    if optimized_eval_score is not None:
        print(f"  Optimized: {optimized_eval_score:.1%}")
    if baseline_eval_score is not None and optimized_eval_score is not None:
        eval_lift = optimized_eval_score - baseline_eval_score
        print(f"  Lift:      {eval_lift:+.1%}")
        
        if eval_lift > 0:
            print("\n>>> OPTIMIZATION GENERALIZES TO HELD-OUT DATA!")
        elif eval_lift == 0:
            print("\n=== Same performance on held-out data")
        else:
            print("\n<<< Baseline better on held-out (possible overfitting)")
else:
    print(f"Job did not succeed: {job.get('status')}")

## Step 8: Cleanup

Terminate cloudflared tunnel processes to free up resources.
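For context, stopping a child process cleanly usually means asking it to exit, waiting briefly, and force-killing it only if it does not comply. The sketch below shows that pattern with the standard library; `stop_process` is a hypothetical helper and is not how the SDK's `cleanup_all` is necessarily implemented. A sleeping Python child stands in for `cloudflared`.

```python
import subprocess
import sys

def stop_process(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Terminate a child process, escalating to kill if it does not exit in time."""
    if proc.poll() is None:      # still running
        proc.terminate()         # polite request (SIGTERM on POSIX)
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()          # forceful stop
            proc.wait()
    return proc.returncode

# Demonstration with a harmless sleeping child instead of cloudflared
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
stop_process(child)
print(child.poll() is not None)  # True
```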

In [None]:
# Step 8: Cleanup
from synth_ai.sdk.tunnels import cleanup_all

print('Cleaning up cloudflared processes...')
cleanup_all()
print('Demo complete!')