# Synth GEPA Demo - Style Matching

This notebook demonstrates **style-matching prompt optimization** using Synth's GEPA algorithm.

We use a small, fixed dataset of essay prompts and gold examples defined directly in this notebook.
The task app scores outputs with a zero-shot contrastive verifier. We then **optimize a verifier graph**
(via Graph Evolve) and compare results for the baseline and optimized verifiers.

**Run in Google Colab:** [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/synth-laboratories/synth-ai/blob/main/demos/style_matching/style_matching_prompt_optimization.ipynb)

**Structure:**
1. **Setup** - Install dependencies and configure API keys
2. **Dataset + Task App** - Define a fixed dataset and a local task app
3. **Optimize Verifier** - Evolve a verifier graph using graph-evolve jobs
4. **Local Server + Tunnel** - Expose the task app to the Synth backend
5. **Optimize Prompts** - Run GEPA with baseline vs optimized verifiers
6. **Heldout Evaluation** - Score four combinations on a heldout set using eval jobs


In [1]:
# Step 0: Install dependencies (run this first on Colab)
import sys

IN_COLAB = "google.colab" in sys.modules

if IN_COLAB:
    print("Running in Google Colab - installing dependencies...")
    !pip install -q synth-ai httpx fastapi uvicorn nest_asyncio

    # Install cloudflared
    !wget -q https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64 -O /usr/local/bin/cloudflared
    !chmod +x /usr/local/bin/cloudflared
    !cloudflared --version

    print("Dependencies installed!")
else:
    print("Not in Colab - assuming dependencies are already installed")
    print("Required: pip install synth-ai httpx fastapi uvicorn nest_asyncio")
    print("Required: brew install cloudflare/cloudflare/cloudflared (macOS)")

Not in Colab - assuming dependencies are already installed
Required: pip install synth-ai httpx fastapi uvicorn nest_asyncio
Required: brew install cloudflare/cloudflare/cloudflared (macOS)
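
Outside Colab the install cell only assumes that everything is present. A minimal sketch of an explicit check follows, assuming the import names match the packages listed above (`synth_ai` for `synth-ai`) and that `cloudflared` should be on `PATH`. The helper name `find_missing_dependencies` is made up for this example.

```python
import importlib.util
import shutil
from typing import Dict, List, Sequence

# Import and binary names taken from the install cell above.
REQUIRED_MODULES = ("synth_ai", "httpx", "fastapi", "uvicorn", "nest_asyncio")
REQUIRED_BINARIES = ("cloudflared",)


def find_missing_dependencies(
    modules: Sequence[str] = REQUIRED_MODULES,
    binaries: Sequence[str] = REQUIRED_BINARIES,
) -> Dict[str, List[str]]:
    """Return the top-level modules and PATH binaries that cannot be found."""
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    missing_binaries = [b for b in binaries if shutil.which(b) is None]
    return {"modules": missing_modules, "binaries": missing_binaries}


missing = find_missing_dependencies()
if missing["modules"] or missing["binaries"]:
    print(f"Missing dependencies: {missing}")
else:
    print("All dependencies found")
```

This only checks that the names resolve; it does not verify versions.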


## Step 1: Setup

Configure all imports, API keys, and environment keys in one place.
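
The setup cell below reads keys from a `.env` file with a small hand-written parser. The sketch here applies the same rules to an in-memory string so they can be checked in isolation. The function name `parse_env_text` and the sample values are hypothetical.

```python
from typing import Dict


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks, comments, and lines without '='."""
    parsed: Dict[str, str] = {}
    for line in text.splitlines():
        if not line or line.lstrip().startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)  # split on the first '=' only
        key = key.strip()
        value = value.strip().strip('"').strip("'")  # drop surrounding quotes
        if key:
            parsed[key] = value
    return parsed


# Demo values only; none of these are real credentials.
sample = (
    "# demo values only\n"
    'SYNTH_API_KEY="sk_demo_123"\n'
    "BACKEND_BASE_URL=http://localhost:8000\n"
    "not a pair\n"
)
print(parse_env_text(sample))
```

These rules do not handle `export KEY=value` lines or trailing inline comments, so a `.env` file that uses either would need a fuller parser.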


In [2]:
# Step 1: Setup - All imports, config, and API keys
import os
import json
import asyncio
from typing import Any, Dict, List, Optional
from pathlib import Path

import httpx
import nest_asyncio
from fastapi import FastAPI, Header, HTTPException, Request
from fastapi.middleware.cors import CORSMiddleware

nest_asyncio.apply()

from synth_ai.sdk.api.eval import EvalJob, EvalJobConfig
from synth_ai.sdk.api.train.prompt_learning import PromptLearningJob
from synth_ai.sdk.learning.prompt_learning_client import PromptLearningClient
from synth_ai.sdk.learning.rl import mint_environment_api_key, setup_environment_api_key
from synth_ai.sdk.task import run_server_background
from synth_ai.sdk.tunnels import (
    TunneledLocalAPI,
    TunnelBackend,
    kill_port,
    wait_for_health_check,
    cleanup_all,
    find_available_port,
    is_port_available,
)

from synth_ai.products.graph_evolve import GraphOptimizationClient, GraphOptimizationConfig
from synth_ai.products.graph_evolve.config import (
    EvolutionConfig,
    SeedsConfig,
    LimitsConfig,
    ProposerConfig,
)


def _load_env_file(path: str, override: bool = True) -> None:
    env_path = Path(path)
    if not env_path.exists():
        return
    for line in env_path.read_text().splitlines():
        if not line or line.lstrip().startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        if key and (override or key not in os.environ):
            os.environ[key] = value


_load_env_file("synth-ai/.env", override=True)

USE_LOCAL_BACKEND = True
SYNTH_API_BASE = "http://localhost:8000" if USE_LOCAL_BACKEND else "https://api.usesynth.ai"
os.environ["BACKEND_BASE_URL"] = SYNTH_API_BASE


def _get_org_id() -> str:
    headers = {"Authorization": f"Bearer {API_KEY}"}
    urls = [f"{SYNTH_API_BASE}/api/v1/me", f"{SYNTH_API_BASE}/me"]
    for url in urls:
        resp = httpx.get(url, headers=headers, timeout=30)
        if resp.status_code == 404:
            continue
        resp.raise_for_status()
        data = resp.json()
        org_id = data.get("org_id") or data.get("orgId")
        if org_id:
            return str(org_id)
    raise RuntimeError("Unable to resolve org_id from /api/v1/me or /me")


LOCAL_TASK_PORT = 8132
LOCAL_TASK_HOST = "127.0.0.1"


def _validate_api_key(api_key: str) -> bool:
    if not api_key:
        return False
    headers = {"Authorization": f"Bearer {api_key}"}
    try:
        resp = httpx.get(f"{SYNTH_API_BASE}/api/v1/me", headers=headers, timeout=10)
    except Exception:
        return False
    return resp.status_code == 200


print(f"Backend: {SYNTH_API_BASE}")

# Check backend health
r = httpx.get(f"{SYNTH_API_BASE}/health", timeout=30)
if r.status_code != 200:
    raise RuntimeError(f"Backend not healthy: status {r.status_code}")
print(f"Backend health: {r.json()}")

# Get API Key (use env var or mint demo key)
API_KEY = os.environ.get("SYNTH_API_KEY", "").strip()
if not API_KEY or not _validate_api_key(API_KEY):
    print("")
    print("SYNTH_API_KEY missing or invalid for this backend; minting demo key...")
    resp = httpx.post(f"{SYNTH_API_BASE}/api/demo/keys", json={"ttl_hours": 4}, timeout=30)
    resp.raise_for_status()
    API_KEY = resp.json()["api_key"]
    print(f"Demo API Key: {API_KEY[:25]}...")
else:
    print("")
    print("Using SYNTH_API_KEY:")
    print(f"{API_KEY[:20]}...")

# Set API key in environment for SDK to use
os.environ["SYNTH_API_KEY"] = API_KEY

# Mint and upload environment key (for local API authentication)
ENVIRONMENT_API_KEY = mint_environment_api_key()
print("")
print("Minted env key:")
print(f"{ENVIRONMENT_API_KEY[:12]}...{ENVIRONMENT_API_KEY[-4:]}")

try:
    result = setup_environment_api_key(SYNTH_API_BASE, API_KEY, token=ENVIRONMENT_API_KEY)
    print(f"Uploaded env key: {result}")
except Exception as exc:
    print(f"Env key upload failed (continuing locally): {exc}")

print("")
print("=" * 50)
print("SETUP COMPLETE")
print("=" * 50)


Backend: http://localhost:8000
Backend health: {'status': 'ok', 'database': 'connected', 'details': {}}

Using SYNTH_API_KEY:
sk_live_ace8b968-a52...

Minted env key:
7e7aacb2c83e...0df3


Uploaded env key: {'stored': True, 'id': '08a0c070-3c1a-4a99-9386-50a67a65c3ce', 'name': 'ENVIRONMENT_API_KEY', 'updated_at': '2026-01-05T20:01:51.974189Z'}

SETUP COMPLETE
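
The setup cell prints truncated keys so they can be recognized without being fully exposed. A small helper along the following lines could make that masking uniform. It is a sketch, the name `mask_secret` is hypothetical, and the default lengths are an arbitrary choice.

```python
def mask_secret(secret: str, prefix: int = 4, suffix: int = 4) -> str:
    """Show only the first `prefix` and last `suffix` characters of a secret."""
    if len(secret) <= prefix + suffix:
        return "*" * len(secret)  # too short to reveal any part of it
    # Slice from an explicit index so that suffix=0 yields an empty tail.
    return f"{secret[:prefix]}...{secret[len(secret) - suffix:]}"


print(mask_secret("0123456789abcdef"))  # made-up value, not a real key
```

Shorter prefixes leak less of the key into saved notebook outputs, at the cost of making keys harder to tell apart.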


## Step 2: Dataset + Task App

We define a fixed dataset of essay prompts and gold examples. The task app generates essays using the
candidate prompt and scores outputs with the zero-shot contrastive verifier.
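
A candidate prompt reaches the task app as a list of sections, each with a `role`, an `order`, and a `content` or `pattern` string containing `{placeholders}`. The sketch below mirrors how the cell that follows renders them, with unknown placeholders filled by empty strings. The section contents and the `{audience}` placeholder are invented for illustration.

```python
from typing import Any, Dict, List


class _BlankMissing(dict):
    """dict that returns an empty string for unknown placeholders."""

    def __missing__(self, key: str) -> str:
        return ""


def render_sections(
    sections: List[Dict[str, Any]], values: Dict[str, Any]
) -> List[Dict[str, str]]:
    """Sort sections by `order` and fill `{placeholders}` from `values`."""
    messages = []
    for section in sorted(sections, key=lambda s: s.get("order", 0)):
        content = section.get("content") or section.get("pattern") or ""
        if content:
            rendered = str(content).format_map(_BlankMissing(values))
            messages.append({"role": str(section.get("role", "user")), "content": rendered})
    return messages


# Hypothetical candidate prompt; {audience} has no value and renders as "".
candidate_sections = [
    {"role": "user", "order": 1, "pattern": "Title: {title}\nNotes:\n{notes}\nAudience: {audience}"},
    {"role": "system", "order": 0, "content": "Write in a direct, builder-first voice."},
]
for message in render_sections(
    candidate_sections, {"title": "Focus Beats Optionality", "notes": "- pick one wedge"}
):
    print(message)
```

Because this relies on `str.format_map`, a candidate prompt containing literal braces (for example a JSON snippet) would be read as placeholders or raise `ValueError`, so such braces need to be doubled.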


In [3]:
# Step 2: Fixed dataset + task app definition

SYSTEM_PROMPT = (
    "You are a thoughtful essayist with a direct, builder-first voice. "
    "Write crisp, opinionated essays with concrete examples, minimal fluff, "
    "and a short, memorable closing line. Aim for ~500 words (roughly 450-650)."
)

USER_PROMPT_TEMPLATE = (
    "Title: {title}\n"
    "Outline:\n{outline}\n\n"
    "Notes:\n{notes}\n\n"
    "Target length: ~500 words.\n\n"
    "Write the essay now."
)

TASKS: List[Dict[str, Any]] = [
    {
        "id": "outline_speed",
        "input": {
            "title": "Shipping Fast Without Burning Out",
            "outline": (
                "1. Why speed compounds learning\n"
                "2. The tradeoff between velocity and quality\n"
                "3. How to keep teams aligned under pressure\n"
                "4. Practical rituals that preserve momentum\n"
                "5. Closing: pace as a competitive advantage"
            ),
            "notes": ["short feedback loops", "protect maker time", "use small bets"],
        },
    },
    {
        "id": "outline_focus",
        "input": {
            "title": "Focus Beats Optionality",
            "outline": (
                "1. Optionality feels safe but slows decisions\n"
                "2. Focus creates a compounding advantage\n"
                "3. The cost of context switching in small teams\n"
                "4. Saying no as a leadership skill\n"
                "5. Closing: clarity is leverage"
            ),
            "notes": ["pick one wedge", "eliminate parallel bets", "repeat a simple story"],
        },
    },
    {
        "id": "outline_learning",
        "input": {
            "title": "Learning in Public",
            "outline": (
                "1. Writing as a forcing function\n"
                "2. How sharing drafts accelerates feedback\n"
                "3. The credibility flywheel\n"
                "4. Risks of over-sharing (and how to avoid them)\n"
                "5. Closing: publish to clarify"
            ),
            "notes": [
                "ship drafts, not polished essays",
                "ask for specific feedback",
                "be concrete about failures",
            ],
        },
    },
    {
        "id": "outline_quality",
        "input": {
            "title": "Quality as a Constraint",
            "outline": (
                "1. The myth that quality slows you down\n"
                "2. Cheap fixes vs durable systems\n"
                "3. When to accept rough edges\n"
                "4. How to build taste in a team\n"
                "5. Closing: quality is a habit"
            ),
            "notes": ["use guardrails", "invest in tooling", "make quality visible"],
        },
    },
]

GOLD_EXAMPLES: List[Dict[str, Any]] = [
    {
        "summary": "Direct, builder tone with concrete examples",
        "gold_score": 0.95,
        "gold_reasoning": "Short sentences, decisive stance, concrete advice, crisp closing line.",
        "trace": {
            "session_id": "gold-1",
            "session_time_steps": [
                {
                    "step_id": "1",
                    "step_index": 0,
                    "events": [
                        {
                            "event_type": "runtime",
                            "event_id": 1,
                            "type": "user_message",
                            "content": "Write about shipping fast.",
                        },
                        {
                            "event_type": "runtime",
                            "event_id": 2,
                            "type": "assistant_message",
                            "content": (
                                "Speed is a learning engine. Ship a small bet, watch users react, then sharpen the next move. "
                                "Protect maker time, keep scope tight, and treat every release as feedback, not theater. "
                                "Momentum beats motivation."
                            ),
                        },
                    ],
                }
            ],
        },
    },
    {
        "summary": "Opinionated essay with a sharp closing line",
        "gold_score": 0.92,
        "gold_reasoning": "Clear thesis, direct claims, practical advice, and a memorable end.",
        "trace": {
            "session_id": "gold-2",
            "session_time_steps": [
                {
                    "step_id": "1",
                    "step_index": 0,
                    "events": [
                        {
                            "event_type": "runtime",
                            "event_id": 1,
                            "type": "user_message",
                            "content": "Write about focus.",
                        },
                        {
                            "event_type": "runtime",
                            "event_id": 2,
                            "type": "assistant_message",
                            "content": (
                                "Optionality feels safe, but it dilutes learning. Pick one wedge, cut parallel bets, and ship. "
                                "Small teams win by saying no early and often. Clarity is leverage."
                            ),
                        },
                    ],
                }
            ],
        },
    },
    {
        "summary": "Concrete, tactical quality guidance with guardrails",
        "gold_score": 0.96,
        "gold_reasoning": "Direct stance, concrete guardrails, and a crisp closing line.",
        "trace": {
            "session_id": "gold-3",
            "session_time_steps": [
                {
                    "step_id": "1",
                    "step_index": 0,
                    "events": [
                        {
                            "event_type": "runtime",
                            "event_id": 1,
                            "type": "user_message",
                            "content": "Write about quality as a constraint.",
                        },
                        {
                            "event_type": "runtime",
                            "event_id": 2,
                            "type": "assistant_message",
                            "content": (
                                "Quality is the guardrail that keeps speed from turning into chaos. Ship small, test the riskiest paths, "
                                "and make rollback cheap. Fix root causes once, then automate the prevention. "
                                "Quality is a habit, not a milestone."
                            ),
                        },
                    ],
                }
            ],
        },
    },
]

# Baseline verifier (zero-shot contrastive)
BASELINE_VERIFIER_JOB_ID = "zero_shot_verifier_contrastive_single"
VERIFIER_JOB_ID = BASELINE_VERIFIER_JOB_ID
VERIFIER_MODEL = "gpt-4.1-nano"

ROLLOUT_LOG: List[Dict[str, Any]] = []


def _verify_api_key(x_api_key: Optional[str]) -> bool:
    if not ENVIRONMENT_API_KEY:
        return True
    return x_api_key == ENVIRONMENT_API_KEY


def _format_notes(notes: List[str]) -> str:
    if not notes:
        return "- (none)"
    return "\n".join(f"- {note}" for note in notes)


def _safe_format(text: str, values: Dict[str, Any]) -> str:
    class _DefaultDict(dict):
        def __missing__(self, key: str) -> str:
            return ""

    return text.format_map(_DefaultDict(values))


def _render_messages_from_sections(
    sections: List[Dict[str, Any]], values: Dict[str, Any]
) -> List[Dict[str, str]]:
    rendered = []
    for section in sorted(sections, key=lambda s: s.get("order", 0)):
        role = section.get("role", "user")
        content = section.get("content") or section.get("pattern") or ""
        if content:
            rendered.append({"role": str(role), "content": _safe_format(str(content), values)})
    return rendered


def _build_messages(
    task_input: Dict[str, Any], prompt_sections: Optional[List[Dict[str, Any]]] = None
) -> List[Dict[str, str]]:
    notes_text = _format_notes(task_input.get("notes", []))
    values = {
        "title": task_input.get("title", ""),
        "outline": task_input.get("outline", ""),
        "notes": notes_text,
    }
    if prompt_sections:
        return _render_messages_from_sections(prompt_sections, values)

    user_prompt = USER_PROMPT_TEMPLATE.format(
        title=values["title"],
        outline=values["outline"],
        notes=values["notes"],
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ]


def _build_inference_url(inference_url: str) -> str:
    """Append /chat/completions to the base URL, keeping any query string last."""
    if "?" in inference_url:
        base, query = inference_url.split("?", 1)
        return f"{base.rstrip('/')}/chat/completions?{query}"
    return f"{inference_url.rstrip('/')}/chat/completions"


def _extract_completion(data: Dict[str, Any]) -> str:
    try:
        return data["choices"][0]["message"]["content"] or ""
    except (KeyError, IndexError, TypeError):
        return ""


def _extract_verifier_score(result: Dict[str, Any]) -> float:
    """Return outcome_review.total, else the mean event_reviews total, else total, else 0.0."""
    output = result.get("output", result)
    if isinstance(output, dict):
        outcome_review = output.get("outcome_review")
        if isinstance(outcome_review, dict) and isinstance(
            outcome_review.get("total"), (int, float)
        ):
            return float(outcome_review["total"])
        event_reviews = output.get("event_reviews")
        if isinstance(event_reviews, list) and event_reviews:
            totals = [rev.get("total") for rev in event_reviews if isinstance(rev, dict)]
            totals = [t for t in totals if isinstance(t, (int, float))]
            if totals:
                return float(sum(totals) / len(totals))
        if isinstance(output.get("total"), (int, float)):
            return float(output["total"])
    return 0.0


async def _call_policy_llm(messages: List[Dict[str, str]], policy_config: Dict[str, Any]) -> str:
    inference_url = policy_config.get("inference_url")
    if not inference_url:
        raise RuntimeError("policy.config.inference_url is required")

    url = _build_inference_url(inference_url)
    model = policy_config.get("model", "gpt-4.1-nano")
    model_lower = model.lower()
    is_gpt5 = "gpt-5" in model_lower

    headers = {"Content-Type": "application/json"}
    api_key = policy_config.get("api_key")
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    elif ENVIRONMENT_API_KEY:
        headers["X-API-Key"] = ENVIRONMENT_API_KEY
        headers["Authorization"] = f"Bearer {ENVIRONMENT_API_KEY}"

    payload = {"model": model, "messages": messages}
    if is_gpt5:
        payload["max_completion_tokens"] = int(policy_config.get("max_completion_tokens", 16000))
        reasoning_effort = policy_config.get("reasoning_effort")
        if reasoning_effort:
            payload["reasoning_effort"] = reasoning_effort
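    else:
        payload["temperature"] = float(policy_config.get("temperature", 0.7))
        # Sent as `max_tokens` for non-GPT-5 models; accept either config key.
        payload["max_tokens"] = int(
            policy_config.get("max_tokens", policy_config.get("max_completion_tokens", 1200))
        )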

    async with httpx.AsyncClient(timeout=120.0) as client:
        response = await client.post(url, headers=headers, json=payload)
        response.raise_for_status()
        return _extract_completion(response.json())


async def _score_with_verifier(
    session_trace: Dict[str, Any], verifier_job_id: Optional[str] = None
) -> float:
    job_id = verifier_job_id or VERIFIER_JOB_ID
    payload = {
        "job_id": job_id,
        "input": {
            "trace": session_trace,
            "gold_examples": GOLD_EXAMPLES,
            "candidate_score": 0.5,
            "candidate_reasoning": "Auto-evaluated from style-matching task app",
            "options": {"model": VERIFIER_MODEL},
        },
    }

    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }

    async with httpx.AsyncClient(timeout=120.0) as client:
        response = await client.post(
            f"{SYNTH_API_BASE.rstrip('/')}/api/graphs/completions",
            headers=headers,
            json=payload,
        )
        if response.status_code != 200:
            raise RuntimeError(
                f"Verifier failed: HTTP {response.status_code} {response.text[:500]}"
            )
        result = response.json()

    return _extract_verifier_score(result)


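app = FastAPI(title="Style Matching Task App")
# Auth uses the X-API-Key header rather than cookies, so credentialed CORS is not needed.
# Browsers also reject a wildcard origin on credentialed requests.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=False,
    allow_methods=["*"],
    allow_headers=["*"],
)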


@app.get("/health")
async def health() -> Dict[str, str]:
    return {"status": "ok", "task_app": "style_matching"}


@app.get("/task_info")
async def task_info() -> Dict[str, Any]:
    return {
        "taskset": {
            "name": "style_matching",
            "description": "Style matching demo task app",
            "size": len(TASKS),
        }
    }


@app.get("/tasks")
async def list_tasks() -> Dict[str, Any]:
    return {"tasks": TASKS, "gold_examples": GOLD_EXAMPLES}


@app.get("/rollouts")
async def list_rollouts(x_api_key: Optional[str] = Header(None)) -> Dict[str, Any]:
    if not _verify_api_key(x_api_key):
        raise HTTPException(status_code=401, detail="Unauthorized")
    return {"rollouts": ROLLOUT_LOG}


@app.post("/rollout")
async def rollout(request: Request, x_api_key: Optional[str] = Header(None)) -> Dict[str, Any]:
    if not _verify_api_key(x_api_key):
        raise HTTPException(status_code=401, detail="Unauthorized")

    try:
        data = await request.json()
    except Exception:
        raise HTTPException(status_code=400, detail="Invalid JSON")

    run_id = data.get("run_id")
    env = data.get("env", {})
    policy = data.get("policy", {})
    policy_config = policy.get("config", {})

    trace_correlation_id = policy_config.get("trace_correlation_id")

    env_config = env.get("config", {}) or {}
    verifier_job_id = env_config.get("verifier_job_id") or VERIFIER_JOB_ID
    prompt_sections = env_config.get("prompt_sections")

    seed = int(env.get("seed") or 0)  # tolerate a missing or null seed
    task = TASKS[seed % len(TASKS)]
    task_input = task["input"]

    messages = _build_messages(task_input, prompt_sections=prompt_sections)

    try:
        essay = await _call_policy_llm(messages, policy_config)
    except Exception as exc:
        essay = f"[error: {exc}]"

    session_trace = {
        "session_id": f"style-matching-{task['id']}",
        "session_time_steps": [
            {
                "step_id": "1",
                "step_index": 0,
                "events": [
                    {
                        "event_type": "runtime",
                        "event_id": 1,
                        "type": "user_message",
                        "content": messages[-1]["content"],
                    },
                    {
                        "event_type": "runtime",
                        "event_id": 2,
                        "type": "assistant_message",
                        "content": essay,
                    },
                ],
            }
        ],
    }

    score = await _score_with_verifier(session_trace, verifier_job_id=verifier_job_id)

    ROLLOUT_LOG.append(
        {
            "run_id": run_id,
            "seed": seed,
            "task_id": task["id"],
            "title": task_input.get("title", ""),
            "essay": essay,
            "score": score,
        }
    )

    return {
        "metrics": {
            "mean_return": score,
            "outcome_score": score,
            "num_steps": 1,
            "details": {"verifier_score": score},
        },
        "trajectories": [
            {"steps": [{"observation": task_input, "action": {"essay": essay}, "reward": score}]}
        ],
        "metadata": {"task_id": task["id"]},
        "trace_correlation_id": trace_correlation_id or "",
    }


print(f"Task dataset size: {len(TASKS)}")

Task dataset size: 4
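
The task app reduces each verifier response to a single score by checking fields in a fixed order. The worked example below uses a compact, standalone version of that logic (`extract_score` is a name chosen here) on three made-up responses shaped like the fields the task app reads.

```python
from typing import Any, Dict


def extract_score(result: Dict[str, Any]) -> float:
    """Pick outcome_review.total, else mean event total, else total, else 0.0."""
    output = result.get("output", result)
    if not isinstance(output, dict):
        return 0.0
    outcome = output.get("outcome_review")
    if isinstance(outcome, dict) and isinstance(outcome.get("total"), (int, float)):
        return float(outcome["total"])
    reviews = output.get("event_reviews")
    if isinstance(reviews, list):
        totals = [
            r["total"]
            for r in reviews
            if isinstance(r, dict) and isinstance(r.get("total"), (int, float))
        ]
        if totals:
            return float(sum(totals) / len(totals))
    if isinstance(output.get("total"), (int, float)):
        return float(output["total"])
    return 0.0


# Hypothetical verifier responses, not real API output.
with_outcome = {"output": {"outcome_review": {"total": 0.8}, "event_reviews": [{"total": 0.2}]}}
events_only = {"output": {"event_reviews": [{"total": 0.5}, {"total": 1.0}]}}
no_totals = {"output": {"summary": "no totals"}}

print(extract_score(with_outcome))  # outcome_review wins over event_reviews
print(extract_score(events_only))   # mean of the event totals
print(extract_score(no_totals))     # falls through to 0.0
```

The silent `0.0` fallback means a malformed verifier response is indistinguishable from a genuinely poor essay, which is worth keeping in mind when reading low scores in the rollout log.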


## Step 3: Optimize Verifier Graph

We use Graph Evolve to optimize a verifier graph on a small, fixed calibration set.
This graph will serve as an alternative verifier in the GEPA loop.
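
One plausible way to judge a verifier on the calibration set is the mean absolute error between its scores and the gold labels. This is an assumption made for illustration; the notebook does not state which metric Graph Evolve optimizes. The gold labels below match the calibration examples defined in the next cell, while the predictions are invented.

```python
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float], gold: Sequence[float]) -> float:
    """Average absolute gap between verifier scores and gold labels."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and the same length")
    return sum(abs(p - g) for p, g in zip(predicted, gold)) / len(gold)


gold_scores = [0.95, 0.92, 0.93, 0.91, 0.10, 0.05]
# Made-up verifier outputs, used only to exercise the function.
hypothetical_predictions = [0.90, 0.85, 0.95, 0.80, 0.30, 0.10]
print(f"MAE: {mean_absolute_error(hypothetical_predictions, gold_scores):.3f}")
```

A lower value means the verifier tracks the gold labels more closely. With only six examples the number is a rough signal at best.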


In [4]:
# Step 3: Optimize a verifier graph with Graph Evolve


def _make_trace(user_text: str, assistant_text: str) -> Dict[str, Any]:
    return {
        "session_id": "trace",
        "session_time_steps": [
            {
                "step_id": "1",
                "step_index": 0,
                "events": [
                    {
                        "event_type": "runtime",
                        "event_id": 1,
                        "type": "user_message",
                        "content": user_text,
                    },
                    {
                        "event_type": "runtime",
                        "event_id": 2,
                        "type": "assistant_message",
                        "content": assistant_text,
                    },
                ],
            }
        ],
    }


VERIFIER_EXAMPLES = [
    {
        "task_id": "good_speed",
        "trace": _make_trace(
            "Write about shipping fast.",
            "Speed compounds learning. Ship small bets, learn fast, keep scope tight, and protect maker time.",
        ),
        "score": 0.95,
    },
    {
        "task_id": "good_focus",
        "trace": _make_trace(
            "Write about focus.",
            "Optionality dilutes learning. Pick one wedge, cut parallel bets, and repeat a simple story.",
        ),
        "score": 0.92,
    },
    {
        "task_id": "good_quality",
        "trace": _make_trace(
            "Write about quality as a constraint.",
            "Quality is a guardrail. Ship small, test risky paths, and make rollback cheap.",
        ),
        "score": 0.93,
    },
    {
        "task_id": "good_learning",
        "trace": _make_trace(
            "Write about learning in public.",
            "Publish drafts to accelerate feedback, build credibility, and clarify thinking.",
        ),
        "score": 0.91,
    },
    {
        "task_id": "bad_rambling",
        "trace": _make_trace(
            "Write about focus.",
            "Focus is important because focus is important. You should focus on focusing and focus on focus.",
        ),
        "score": 0.10,
    },
    {
        "task_id": "bad_vague",
        "trace": _make_trace(
            "Write about quality.",
            "Quality is good. Teams should be good and do good things to make quality good.",
        ),
        "score": 0.05,
    },
]

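# Seeds index into VERIFIER_EXAMPLES. With this split, training covers the four
# high-scoring examples and validation covers the two low-scoring ones.
VERIFIER_TRAIN_SEEDS = list(range(4))
VERIFIER_VAL_SEEDS = list(range(4, len(VERIFIER_EXAMPLES)))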

verifier_dataset = {
    "tasks": [
        {
            "task_id": example["task_id"],
            "input": {"trace": example["trace"], "gold_examples": GOLD_EXAMPLES},
        }
        for example in VERIFIER_EXAMPLES
    ],
    "gold_outputs": [
        {"task_id": example["task_id"], "output": {}, "score": example["score"]}
        for example in VERIFIER_EXAMPLES
    ],
    "metadata": {
        "name": "style_matching_verifier",
        "task_description": "Score style-matching quality using strict verifier-style outputs. Return event_reviews, outcome_review, event_totals; use a 1.0 baseline with deductions for discrepancies (generic outputs < 0.3).",
        "input_schema": {
            "type": "object",
            "properties": {"trace": {"type": "object"}, "gold_examples": {"type": "array"}},
            "required": ["trace", "gold_examples"],
        },
        "output_schema": {
            "type": "object",
            "properties": {
                "event_reviews": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "criteria": {"type": "object"},
                            "total": {"type": "number"},
                            "summary": {"type": "string"},
                        },
                        "required": ["criteria", "total"],
                    },
                },
                "outcome_review": {
                    "type": "object",
                    "properties": {
                        "criteria": {"type": "object"},
                        "total": {"type": "number"},
                        "summary": {"type": "string"},
                    },
                    "required": ["criteria", "total"],
                },
                "event_totals": {"type": "array", "items": {"type": "number"}},
                "score": {"type": "number"},
            },
            "required": ["event_reviews", "outcome_review", "event_totals"],
        },
        "output_config": {
            "format": "json",
            "strict": True,
            "extract_from": ["(root)"],
            "schema": {
                "type": "object",
                "properties": {
                    "event_reviews": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "criteria": {"type": "object"},
                                "total": {"type": "number"},
                                "summary": {"type": "string"},
                            },
                            "required": ["criteria", "total"],
                        },
                    },
                    "outcome_review": {
                        "type": "object",
                        "properties": {
                            "criteria": {"type": "object"},
                            "total": {"type": "number"},
                            "summary": {"type": "string"},
                        },
                        "required": ["criteria", "total"],
                    },
                    "event_totals": {"type": "array", "items": {"type": "number"}},
                    "score": {"type": "number"},
                },
                "required": ["event_reviews", "outcome_review", "event_totals"],
            },
        },
        "domain": "text",
    },
}

verifier_config = GraphOptimizationConfig(
    algorithm="graph_evolve",
    dataset_name="style_matching_verifier",
    graph_type="verifier",
    graph_structure="single_prompt",
    topology_guidance=(
        "Single-node VerifierGraph. Use one DagNode (e.g., judge_style) with template_transform. "
        "Set output_mapping to copy event_reviews, outcome_review, event_totals, score to root. "
        "Include verdict_weights and aggregation_policy: weighted_average."
    ),
    allowed_policy_models=["gpt-4.1-nano", "gpt-4o-mini"],
    evolution=EvolutionConfig(num_generations=3, children_per_generation=2),
    proposer=ProposerConfig(model="gpt-4.1", temperature=0.0),
    seeds=SeedsConfig(train=VERIFIER_TRAIN_SEEDS, validation=VERIFIER_VAL_SEEDS),
    limits=LimitsConfig(max_spend_usd=5.0, timeout_seconds=3600),
    verifier_mode="contrastive",
    verifier_model=VERIFIER_MODEL,
    dataset=verifier_dataset,
    output_schema=verifier_dataset["metadata"]["output_schema"],
    output_config=verifier_dataset["metadata"]["output_config"],
    task_description="Score style-matching quality using strict verifier-style outputs. Return event_reviews, outcome_review, event_totals; use a 1.0 baseline with deductions for discrepancies (generic outputs < 0.3).",
    problem_spec=(
        "You are generating a VerifierGraph. The final output MUST be a JSON object with: "
        "event_reviews (list of per-event review objects with criteria, total, summary), "
        "outcome_review (object with criteria, total, summary), and event_totals (list of numbers). "
        "Include a top-level score if helpful, but the verifier contract is based on outcome_review.total and event_totals. "
        "Make totals floats in [0,1]. "
        "Scoring policy must be strict: start at 1.0 and deduct for every discrepancy vs gold examples. "
        "Generic/standard outputs should score below 0.3. "
        "Deduction guide: obvious/giveaway discrepancy deduct 0.15-0.3, major discrepancy 0.08-0.15, minor 0.02-0.08."
    ),
)


async def run_verifier_optimization():
    async with GraphOptimizationClient(SYNTH_API_BASE, API_KEY) as client:
        job_id = await client.start_job(verifier_config)
        print(f"Graph evolve job: {job_id}")

    async with httpx.AsyncClient(timeout=90.0) as http:
        headers = {"Authorization": f"Bearer {API_KEY}"}
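        # Poll every 2 seconds for up to 900 attempts (about 30 minutes of waiting),
        # which is shorter than the job's 3600-second `timeout_seconds` limit.
        for _ in range(900):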
            try:
                status_resp = await http.get(
                    f"{SYNTH_API_BASE}/graph-evolve/jobs/{job_id}/status", headers=headers
                )
                if status_resp.status_code == 404:
                    await asyncio.sleep(2.0)
                    continue
                status_resp.raise_for_status()
                status = status_resp.json().get("status")
                if status in {"completed", "failed", "cancelled"}:
                    result_resp = await http.get(
                        f"{SYNTH_API_BASE}/graph-evolve/jobs/{job_id}/result", headers=headers
                    )
                    result_resp.raise_for_status()
                    return job_id, result_resp.json()
            except httpx.HTTPStatusError:
                await asyncio.sleep(2.0)
                continue
            await asyncio.sleep(2.0)

    raise RuntimeError(f"Graph evolve job {job_id} did not complete in time")


ORG_ID = _get_org_id()
print(f"Resolved org_id: {ORG_ID}")


async def save_verifier_graph(job_id: str) -> str:
    headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
    payload = {
        "name": f"style-matching-verifier-{job_id[:8]}",
        "org_id": ORG_ID,
        "kind": "verifier",
    }
    async with httpx.AsyncClient(timeout=30.0) as client:
        resp = await client.post(
            f"{SYNTH_API_BASE}/graph-evolve/jobs/{job_id}/save-graph",
            headers=headers,
            json=payload,
        )
        if resp.status_code >= 400:
            raise RuntimeError(f"save-graph failed ({resp.status_code}): {resp.text[:500]}")
        data = resp.json()
    graph_id = data.get("graph_id") or data.get("id")
    if not graph_id:
        raise RuntimeError(f"save-graph did not return graph_id: {data}")
    return str(graph_id)


OPTIMIZED_VERIFIER_JOB_ID, OPTIMIZED_VERIFIER_RESULT = await run_verifier_optimization()
status = OPTIMIZED_VERIFIER_RESULT.get("status")
if status != "completed":
    error_msg = OPTIMIZED_VERIFIER_RESULT.get("error") or "unknown error"
    raise RuntimeError(f"Graph evolve job failed with status={status}: {error_msg}")
print("\nVerifier optimization complete")
OPTIMIZED_VERIFIER_GRAPH_ID = await save_verifier_graph(OPTIMIZED_VERIFIER_JOB_ID)
print(f"Optimized verifier job id: {OPTIMIZED_VERIFIER_JOB_ID}")
print(f"Optimized verifier graph id: {OPTIMIZED_VERIFIER_GRAPH_ID}")
print(f"Best score: {OPTIMIZED_VERIFIER_RESULT.get('best_score')}")

Resolved org_id: e77ef3a8-677d-4ddd-92d6-0f114d6bbdaf


Graph evolve job: graph_evolve_4aca6751f48a



Verifier optimization complete


Optimized verifier job id: graph_evolve_4aca6751f48a
Optimized verifier graph id: 5bd0d553-40ba-4aca-a157-126009dee1bd
Best score: 0.675
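
The deduction guide in `problem_spec` above (start at 1.0, subtract per discrepancy) implies a score range for any given mix of discrepancies. The sketch below only illustrates that arithmetic; the real verifier is an evolved graph, and the function name `score_bounds` and its counting interface are hypothetical.

```python
from typing import Dict, Tuple

# Deduction ranges copied from the problem_spec text; everything else is illustrative.
DEDUCTION_RANGES = {
    "obvious": (0.15, 0.30),
    "major": (0.08, 0.15),
    "minor": (0.02, 0.08),
}


def score_bounds(counts: Dict[str, int]) -> Tuple[float, float]:
    """Return the (lowest, highest) score consistent with the deduction guide."""
    min_deduction = sum(DEDUCTION_RANGES[kind][0] * n for kind, n in counts.items())
    max_deduction = sum(DEDUCTION_RANGES[kind][1] * n for kind, n in counts.items())
    # Start at 1.0, subtract the deductions, and clamp at 0 so totals stay in [0, 1].
    lowest = max(0.0, 1.0 - max_deduction)
    highest = max(0.0, 1.0 - min_deduction)
    return round(lowest, 4), round(highest, 4)


low, high = score_bounds({"major": 1, "minor": 2})
print(f"score range: {low:.2f} to {high:.2f}")
```

An output with one major and two minor discrepancies would land somewhere in that range, depending on how harshly each discrepancy is judged.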


## Step 4: Start Task App + Tunnel

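Run the task app locally and wait for its health check to pass, then create a Cloudflare tunnel so the Synth backend can call it. When `USE_LOCAL_BACKEND` is set, the tunnel is skipped and the task app URL stays on localhost.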
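
Both the DNS wait and the health-check wait in the cell below follow the same deadline-polling pattern. A minimal sketch of that pattern, assuming a hypothetical helper `wait_until` and a fake health check (neither is part of the notebook), and using `time.monotonic()` rather than the `time.time()` the cell uses:

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Poll check() until it returns True or the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False


# Hypothetical stand-in for a health check that succeeds on the third attempt.
attempts = {"n": 0}


def fake_health_check() -> bool:
    attempts["n"] += 1
    return attempts["n"] >= 3


ready = wait_until(fake_health_check, timeout=5.0, interval=0.01)
print("ready" if ready else "timed out")
```

Returning a boolean instead of raising leaves the decision about what a timeout means to the caller; the notebook's own helpers raise `RuntimeError` instead.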


In [5]:
# Step 4: Start task app and create tunnel

import asyncio
import socket
import threading
import time


def wait_for_system_dns(hostname: str, timeout: float = 90.0, interval: float = 3.0) -> None:
    deadline = time.time() + timeout
    last_exc = None
    while time.time() < deadline:
        try:
            socket.gethostbyname(hostname)
            return
        except Exception as exc:
            last_exc = exc
            time.sleep(interval)
    raise RuntimeError(f"System DNS did not resolve {hostname} within {timeout}s: {last_exc}")


def _task_app_healthcheck(host: str, port: int) -> bool:
    try:
        resp = httpx.get(
            f"http://{host}:{port}/health",
            headers={"X-API-Key": ENVIRONMENT_API_KEY},
            timeout=5,
        )
        return resp.status_code == 200
    except Exception:
        return False


def _wait_for_task_app(host: str, port: int, timeout: float = 30.0) -> None:
    deadline = time.time() + timeout
    while time.time() < deadline:
        if _task_app_healthcheck(host, port):
            return
        time.sleep(1.0)
    raise RuntimeError(f"Task app health check failed after {timeout}s")


task_app_thread = None
_task_app_lock = threading.Lock()


def _start_task_app() -> None:
    global LOCAL_TASK_PORT
    global task_app_thread

    kill_port(LOCAL_TASK_PORT)
    if not is_port_available(LOCAL_TASK_PORT):
        LOCAL_TASK_PORT = find_available_port(LOCAL_TASK_PORT + 1)
        print(f"Port in use; switched to {LOCAL_TASK_PORT}")

    task_app_thread = run_server_background(app, LOCAL_TASK_PORT, host=LOCAL_TASK_HOST)
    _wait_for_task_app(LOCAL_TASK_HOST, LOCAL_TASK_PORT, timeout=30.0)


def _start_task_app_monitor(interval: float = 5.0) -> threading.Thread:
    def _monitor() -> None:
        while True:
            time.sleep(interval)
            with _task_app_lock:
                if _task_app_healthcheck(LOCAL_TASK_HOST, LOCAL_TASK_PORT):
                    continue
                print("Task app health check failed; restarting...")
                try:
                    _start_task_app()
                except Exception as exc:
                    print(f"Task app restart failed: {exc}")

    thread = threading.Thread(target=_monitor, daemon=True, name="task-app-monitor")
    thread.start()
    return thread


print(f"Starting task app on port {LOCAL_TASK_PORT}...")
with _task_app_lock:
    _start_task_app()
print("Task app ready!")
monitor_thread = _start_task_app_monitor()

if USE_LOCAL_BACKEND:
    print("Using localhost task app URL (no tunnel)")
    TASK_APP_URL = f"http://{LOCAL_TASK_HOST}:{LOCAL_TASK_PORT}"
    tunnel = None
else:
    print("")
    print("Provisioning Cloudflare tunnel...")
    try:
        tunnel = await TunneledLocalAPI.create(
            local_port=LOCAL_TASK_PORT,
            backend=TunnelBackend.CloudflareManagedTunnel,
            api_key=API_KEY,
            env_api_key=ENVIRONMENT_API_KEY,
            backend_url=SYNTH_API_BASE,
            reason="style_matching_notebook",
            progress=True,
        )
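        print(f"Waiting for system DNS to resolve {tunnel.hostname}...")
        # wait_for_system_dns is synchronous and returns None, so run it in a worker thread
        # rather than awaiting its return value.
        await asyncio.to_thread(wait_for_system_dns, tunnel.hostname)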
    except Exception as exc:
        print(f"Managed tunnel failed or DNS unresolved ({exc}). Falling back to quick tunnel...")
        tunnel = await TunneledLocalAPI.create(
            local_port=LOCAL_TASK_PORT,
            backend=TunnelBackend.CloudflareQuickTunnel,
            env_api_key=ENVIRONMENT_API_KEY,
            progress=True,
        )

if tunnel is not None:
    TASK_APP_URL = tunnel.url
print(f"Task app URL: {TASK_APP_URL}")

Starting task app on port 8132...
Waiting for task app health check...


Task app ready!
Using localhost task app URL (no tunnel)
Task app URL: http://127.0.0.1:8132


## Step 5: Run GEPA With Baseline vs Optimized Verifiers

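We run GEPA twice with the same config: once with the baseline verifier and once with the optimized verifier graph. `run_gepa_with_verifier` sets the global `VERIFIER_JOB_ID` before each submission, and we keep the best prompt from each run.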
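
The comparison is only meaningful if the seed splits do not overlap. A small sketch of that check, using the seed ranges from the GEPA config below and the heldout seeds from Step 6; the helper name `check_disjoint` is hypothetical:

```python
from typing import Dict, List

# Seed ranges copied from this notebook's GEPA config and heldout cell.
TRAIN_SEEDS = list(range(13))
VALIDATION_SEEDS = list(range(13, 17))
HELDOUT_SEEDS = [20, 21, 22, 23]


def check_disjoint(splits: Dict[str, List[int]]) -> Dict[str, int]:
    """Raise ValueError if any two splits share a seed; otherwise return split sizes."""
    names = list(splits)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            overlap = set(splits[first]) & set(splits[second])
            if overlap:
                raise ValueError(f"{first} and {second} share seeds: {sorted(overlap)}")
    return {name: len(seeds) for name, seeds in splits.items()}


sizes = check_disjoint(
    {"train": TRAIN_SEEDS, "validation": VALIDATION_SEEDS, "heldout": HELDOUT_SEEDS}
)
print(sizes)
```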


In [6]:
# Step 5: Run GEPA (baseline vs optimized verifier)

POLICY_MODEL = "gpt-4.1-nano"


def _make_gepa_config(task_app_url: str) -> Dict[str, Any]:
    return {
        "prompt_learning": {
            "algorithm": "gepa",
            "task_app_url": task_app_url,
            "task_app_api_key": ENVIRONMENT_API_KEY,
            "env_name": "style-matching",
            "initial_prompt": {
                "messages": [
                    {"role": "system", "order": 0, "pattern": SYSTEM_PROMPT},
                    {"role": "user", "order": 1, "pattern": USER_PROMPT_TEMPLATE},
                ],
                "wildcards": {"title": "REQUIRED", "outline": "REQUIRED", "notes": "REQUIRED"},
            },
            "policy": {
                "inference_mode": "synth_hosted",
                "model": POLICY_MODEL,
                "provider": "openai",
                "temperature": 0.7,
                "max_completion_tokens": 1200,
            },
            "gepa": {
                "env_name": "style-matching",
                "evaluation": {"seeds": list(range(13)), "validation_seeds": list(range(13, 17))},
                "rollout": {"budget": 48, "max_concurrent": 3, "minibatch_size": 3},
                "mutation": {"rate": 0.3},
                "population": {
                    "initial_size": 3,
                    "num_generations": 3,
                    "children_per_generation": 2,
                },
                "archive": {"size": 5, "pareto_set_size": 10},
                "token": {"counting_model": "gpt-4"},
            },
            "verifier": {"enabled": False, "reward_source": "task_app"},
        }
    }


async def run_gepa_with_verifier(label: str, verifier_job_id: str):
    """Set the global VERIFIER_JOB_ID, submit a GEPA job, and block until it finishes."""
    global VERIFIER_JOB_ID
    VERIFIER_JOB_ID = verifier_job_id
    print(f"\nRunning GEPA ({label}) with verifier: {verifier_job_id}")

    config_body = _make_gepa_config(TASK_APP_URL)
    job = PromptLearningJob.from_dict(
        config_dict=config_body,
        backend_url=SYNTH_API_BASE,
        task_app_api_key=ENVIRONMENT_API_KEY,
    )

    job_id = job.submit()
    print(f"GEPA job id ({label}): {job_id}")

    result = job.poll_until_complete(timeout=3600.0, interval=3.0, progress=True)
    print(f"GEPA finished ({label}): {result.status.value}")
    if result.failed:
        error_msg = result.error
        if not error_msg:
            try:
                client = PromptLearningClient(SYNTH_API_BASE, API_KEY)
                events = await client.get_events(job_id, limit=5000)
            except Exception as exc:
                error_msg = f"Failed to fetch job events: {exc}"
            else:
                for event in reversed(events):
                    event_type = str(event.get("type", ""))
                    if event_type in {"prompt.learning.failed", "mipro.job.failed", "job.failed"}:
                        error_msg = event.get("message") or event.get("data")
                        break
        raise RuntimeError(f"GEPA job failed ({label}): {error_msg or 'unknown error'}")
    return job_id, result


baseline_gepa_job_id, baseline_gepa_result = await run_gepa_with_verifier(
    "baseline", BASELINE_VERIFIER_JOB_ID
)

optimized_gepa_job_id, optimized_gepa_result = await run_gepa_with_verifier(
    "optimized", OPTIMIZED_VERIFIER_JOB_ID
)

# Fetch prompts
pl_client = PromptLearningClient(SYNTH_API_BASE, API_KEY)

baseline_prompts = await pl_client.get_prompts(baseline_gepa_job_id)
optimized_prompts = await pl_client.get_prompts(optimized_gepa_job_id)


def _select_prompt(result):
    best = result.best_prompt
    if best:
        return best
    top = result.top_prompts or []
    if top:
        return top[0]
    raise RuntimeError("No prompts returned from GEPA job")


baseline_prompt_obj = _select_prompt(baseline_prompts)
optimized_prompt_obj = _select_prompt(optimized_prompts)

print("\nBaseline best score:", baseline_prompts.best_score)
print("Optimized best score:", optimized_prompts.best_score)


Running GEPA (baseline) with verifier: zero_shot_verifier_contrastive_single


GEPA job id (baseline): pl_b926ded654954042


[00:00] queued | score: --                    

[00:03] running | score: --                    

[07:23] succeeded | score: 0.28                    
GEPA finished (baseline): succeeded

Running GEPA (optimized) with verifier: graph_evolve_4aca6751f48a


GEPA job id (optimized): pl_eee7add542264148


[00:00] queued | score: --                    

[00:03] running | score: --                    

[04:59] succeeded | score: 0.85                    
GEPA finished (optimized): succeeded



Baseline best score: 0.27849999999999997
Optimized best score: 0.8499999999999999


## Step 6: Heldout Evaluation (4 Final Scores)

We evaluate the best prompt from each GEPA run on a heldout set (seeds 20-23) using **eval jobs**, scoring each prompt with both verifiers. This gives four combinations:

1. Baseline prompt + baseline verifier
2. Baseline prompt + optimized verifier
3. Optimized prompt + baseline verifier
4. Optimized prompt + optimized verifier
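
The four combinations are the Cartesian product of prompts and verifiers. A short sketch that enumerates them with `itertools.product` and builds the same label format the cell below uses as result keys (the list names `PROMPTS` and `VERIFIERS` are illustrative):

```python
from itertools import product

PROMPTS = ["baseline", "optimized"]
VERIFIERS = ["baseline", "optimized"]

# product() varies the last argument fastest, which matches the numbered list above.
labels = [f"{prompt}_prompt__{verifier}_verifier" for prompt, verifier in product(PROMPTS, VERIFIERS)]

for label in labels:
    print(label)
```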


In [7]:
# Step 6: Heldout evaluation via eval jobs

HELDOUT_SEEDS = [20, 21, 22, 23]
EVAL_POLICY_MODEL = "gpt-4.1-nano"


def _extract_prompt_sections(prompt_obj: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the prompt's template sections, falling back to its raw messages."""
    template = prompt_obj.get("template") or {}
    sections = template.get("sections")
    if isinstance(sections, list) and sections:
        return sections
    messages = prompt_obj.get("messages")
    if isinstance(messages, list) and messages:
        return messages
    # Neither form is present; the eval job then receives no prompt sections.
    return []


async def run_eval_job(
    label: str, prompt_sections: List[Dict[str, Any]], verifier_job_id: str
) -> float:
    config = EvalJobConfig(
        task_app_url=TASK_APP_URL,
        backend_url=SYNTH_API_BASE,
        api_key=API_KEY,
        task_app_api_key=ENVIRONMENT_API_KEY,
        env_name="style-matching",
        seeds=HELDOUT_SEEDS,
        policy_config={"model": EVAL_POLICY_MODEL, "provider": "openai"},
        env_config={
            "prompt_sections": prompt_sections,
            "verifier_job_id": verifier_job_id,
        },
        concurrency=3,
    )

    job = EvalJob(config)
    job_id = job.submit()
    print(f"Eval job ({label}): {job_id}")

    result = job.poll_until_complete(timeout=1800.0, interval=3.0, progress=True)
    if not result.succeeded:
        raise RuntimeError(f"Eval job failed ({label}): {result.error}")
    return result.mean_score or 0.0


baseline_prompt = baseline_prompt_obj
optimized_prompt = optimized_prompt_obj

baseline_sections = _extract_prompt_sections(baseline_prompt)
optimized_sections = _extract_prompt_sections(optimized_prompt)

results = {}
results["baseline_prompt__baseline_verifier"] = await run_eval_job(
    "baseline_prompt__baseline_verifier", baseline_sections, BASELINE_VERIFIER_JOB_ID
)
results["baseline_prompt__optimized_verifier"] = await run_eval_job(
    "baseline_prompt__optimized_verifier", baseline_sections, OPTIMIZED_VERIFIER_JOB_ID
)
results["optimized_prompt__baseline_verifier"] = await run_eval_job(
    "optimized_prompt__baseline_verifier", optimized_sections, BASELINE_VERIFIER_JOB_ID
)
results["optimized_prompt__optimized_verifier"] = await run_eval_job(
    "optimized_prompt__optimized_verifier", optimized_sections, OPTIMIZED_VERIFIER_JOB_ID
)

print("\nHeldout results (mean verifier score):")
for k, v in results.items():
    print(f"- {k}: {v:.3f}")

Eval job (baseline_prompt__baseline_verifier): eval_cd3258b0559148fb


[00:00] running | 0/4 completed


[00:03] running | 0/4 completed


[00:06] running | 0/4 completed


[00:09] running | 0/4 completed


[00:13] running | 0/4 completed


[00:16] running | 0/4 completed


[00:19] running | 0/4 completed


[00:22] running | 0/4 completed


[00:25] completed | mean_score: 0.35


Eval job (baseline_prompt__optimized_verifier): eval_d70138772e7e48a5


[00:00] running | 0/4 completed


[00:03] running | 0/4 completed


[00:06] running | 0/4 completed


[00:09] running | 0/4 completed


[00:12] running | 0/4 completed


[00:16] running | 0/4 completed


[00:19] completed | mean_score: 0.85


Eval job (optimized_prompt__baseline_verifier): eval_5a282dd87db84afb
[00:00] running | 0/4 completed


[00:03] running | 0/4 completed


[00:06] running | 0/4 completed


[00:09] running | 0/4 completed


[00:12] running | 0/4 completed


[00:16] running | 0/4 completed


[00:19] running | 0/4 completed


[00:22] completed | mean_score: 0.19


Eval job (optimized_prompt__optimized_verifier): eval_9eef1a1fe97e458e
[00:00] running | 0/4 completed


[00:03] running | 0/4 completed


[00:06] running | 0/4 completed


[00:09] running | 0/4 completed


[00:12] running | 0/4 completed


[00:15] completed | mean_score: 0.85



Heldout results (mean verifier score):
- baseline_prompt__baseline_verifier: 0.346
- baseline_prompt__optimized_verifier: 0.850
- optimized_prompt__baseline_verifier: 0.190
- optimized_prompt__optimized_verifier: 0.850
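
One way to read the four heldout scores printed above is to compare, for each verifier, the optimized prompt against the baseline prompt. A minimal sketch using those printed values; the function name `prompt_gain_by_verifier` is hypothetical:

```python
from typing import Dict

# Scores copied from the heldout results printed above.
heldout_results = {
    "baseline_prompt__baseline_verifier": 0.346,
    "baseline_prompt__optimized_verifier": 0.850,
    "optimized_prompt__baseline_verifier": 0.190,
    "optimized_prompt__optimized_verifier": 0.850,
}


def prompt_gain_by_verifier(results: Dict[str, float]) -> Dict[str, float]:
    """For each verifier, return optimized-prompt score minus baseline-prompt score."""
    gains = {}
    for verifier in ("baseline", "optimized"):
        before = results[f"baseline_prompt__{verifier}_verifier"]
        after = results[f"optimized_prompt__{verifier}_verifier"]
        gains[verifier] = round(after - before, 3)
    return gains


gains = prompt_gain_by_verifier(heldout_results)
for verifier, gain in gains.items():
    print(f"{verifier} verifier: {gain:+.3f}")
```

With only four heldout seeds, differences of this size should be read as rough indications rather than precise estimates.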


In [8]:
# Cleanup
print("Cleaning up tunnels...")
cleanup_all()
print("Done.")

Cleaning up tunnels...
Done.
