# GEPA Summarization Optimization with LLM Judge Evaluation
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/Evals/GEPA_Optimization.ipynb)

## Introduction

This notebook demonstrates how to optimize summarization prompts using GEPA (Generate, Evaluate, Propose, Adapt) with the our Evaluations API. We'll:

1. Load the CNN/DailyMail dataset containing news articles
2. Start with a baseline summarization prompt
3. Use an optimizer LLM to iteratively improve the prompt
4. Compare prompts head-to-head using a judge model
5. Track improvement over multiple iterations

**Concepts Covered:**
- **GEPA Optimization**: Iterative prompt engineering using LLM feedback
- **LLM-as-a-Judge**: Using a language model to evaluate and compare outputs
- **Batch Evaluation**: Efficient comparison of multiple summaries
- **Prompt Engineering**: Systematic improvement of instruction prompts

## 📦 Setup and Installation

In [1]:
!pip install -qU together dspy-ai datasets tqdm

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m86.4/86.4 kB[0m [31m3.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m120.3/120.3 kB[0m [31m6.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m512.3/512.3 kB[0m [31m26.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.8/1.8 MB[0m [31m37.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m285.2/285.2 kB[0m [31m17.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m110.5/110.5 kB[0m [31m7.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m47.7/47.7 MB[0m [31m19.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m46.7/46.7 kB[0m [31m3.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [2]:
import together
import json
import random
import os
import re
import time
from pathlib import Path
from typing import List, Dict, Tuple
from datetime import datetime

import dspy
from datasets import load_dataset
from tqdm import tqdm

## ⚙️ Configuration

Set up your API key and configure the models we'll use:
- **Summarizer Model**: Generates the summaries
- **Judge Model**: Evaluates which summary is better
- **Optimizer Model**: Proposes improvements to the prompt

In [3]:
# Set your Together AI API key from Colab secrets
from google.colab import userdata
TOGETHER_API_KEY = userdata.get('TOGETHER_API_KEY')
print("✓ API key loaded from Colab secrets")

client = together.Client(api_key=TOGETHER_API_KEY)

# Model configuration
SUMMARIZER_MODEL = "openai/gpt-oss-20b"
JUDGE_MODEL = "deepseek-ai/DeepSeek-V3"
OPTIMIZER_MODEL = "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo"

# Data splits
TRAIN_SIZE = 150
VAL_SIZE = 300
TEST_SIZE = 300

RANDOM_SEED = 42

print("✓ Configuration complete")

✓ API key loaded from Colab secrets
✓ Configuration complete


## 📝 Baseline and Judge Prompts

We start with a simple baseline prompt for summarization. The GEPA process will iteratively improve this prompt based on performance feedback.

In [4]:
BASELINE_PROMPT = """Summarize this news article in 3-5 key points.

Write a brief summary covering:
- The main news event
- Key people or organizations involved
- Important details or outcomes
- Any significant context

Keep it to 3-5 sentences total."""

JUDGE_PROMPT = """Compare these two summaries of the same news article.

Which summary better:
- Captures the main news story
- Includes important details
- Is clear and concise
- Avoids unnecessary information

Choose A or B and explain why briefly."""

print("Baseline Prompt:")
print(BASELINE_PROMPT)
print("\nJudge Prompt:")
print(JUDGE_PROMPT)

Baseline Prompt:
Summarize this news article in 3-5 key points.

Write a brief summary covering:
- The main news event
- Key people or organizations involved
- Important details or outcomes
- Any significant context

Keep it to 3-5 sentences total.

Judge Prompt:
Compare these two summaries of the same news article.

Which summary better:
- Captures the main news story
- Includes important details
- Is clear and concise
- Avoids unnecessary information

Choose A or B and explain why briefly.


## 📂 Loading the CNN/DailyMail Dataset

The CNN/DailyMail dataset contains news articles paired with human-written highlights. We'll use the articles as our source text and split the data into train, validation, and test sets.

**Dataset Structure:**
- `article`: The full news article text
- `highlights`: Human-written bullet-point summary
- We'll use the articles for summarization and evaluate our generated summaries

In [5]:
def load_and_split_data():
    """Load CNN/DailyMail dataset for summarization."""
    print("\n" + "=" * 80)
    print("📂 LOADING DATA")
    print("=" * 80)

    print("Loading CNN/DailyMail dataset...")
    dataset = load_dataset("abisee/cnn_dailymail", "3.0.0")
    data = dataset['test']

    print(f"✓ Loaded {len(data)} examples")
    print(f"  Sample article: {data[0]['article'][:100]}...")
    print(f"  Sample highlights: {data[0]['highlights'][:100]}...")

    all_data = []
    for i, item in enumerate(data):
        all_data.append({
            'id': f"cnn_{i}",
            'text': item['article'],
            'reference_summary': item['highlights']
        })

    print(f"✓ Converted to {len(all_data)} items")

    random.seed(RANDOM_SEED)
    random.shuffle(all_data)

    train_data = all_data[:TRAIN_SIZE]
    val_data = all_data[TRAIN_SIZE:TRAIN_SIZE + VAL_SIZE]
    test_data = all_data[TRAIN_SIZE + VAL_SIZE:TRAIN_SIZE + VAL_SIZE + TEST_SIZE]

    print(f"✓ Split: Train={len(train_data)}, Val={len(val_data)}, Test={len(test_data)}")

    assert len(val_data) > 0, "Val data is empty!"
    assert len(test_data) > 0, "Test data is empty!"

    return train_data, val_data, test_data

# Load the data
train_data, val_data, test_data = load_and_split_data()


📂 LOADING DATA
Loading CNN/DailyMail dataset...


README.md: 0.00B [00:00, ?B/s]

3.0.0/train-00000-of-00003.parquet:   0%|          | 0.00/257M [00:00<?, ?B/s]

3.0.0/train-00001-of-00003.parquet:   0%|          | 0.00/257M [00:00<?, ?B/s]

3.0.0/train-00002-of-00003.parquet:   0%|          | 0.00/259M [00:00<?, ?B/s]

3.0.0/validation-00000-of-00001.parquet:   0%|          | 0.00/34.7M [00:00<?, ?B/s]

3.0.0/test-00000-of-00001.parquet:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/287113 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13368 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11490 [00:00<?, ? examples/s]

✓ Loaded 11490 examples
  Sample article: (CNN)The Palestinian Authority officially became the 123rd member of the International Criminal Cour...
  Sample highlights: Membership gives the ICC jurisdiction over alleged crimes committed in Palestinian territories since...
✓ Converted to 11490 items
✓ Split: Train=150, Val=300, Test=300


## 🤖 Summarization Module

We create a DSPy module that wraps our summarization task. This module can be configured with different instruction prompts, which is key to the GEPA optimization process.

In [6]:
class Summarizer(dspy.Signature):
    """Generate a summary."""
    text = dspy.InputField()
    summary = dspy.OutputField()


class SummarizationModule(dspy.Module):
    """Summarization module."""

    def __init__(self, instructions=None):
        super().__init__()
        self.instructions = instructions or BASELINE_PROMPT

        if instructions:
            class CustomSummarizer(dspy.Signature):
                __doc__ = instructions
                text = dspy.InputField()
                summary = dspy.OutputField()

            self.predictor = dspy.Predict(CustomSummarizer)
        else:
            self.predictor = dspy.Predict(Summarizer)

    def forward(self, text):
        return self.predictor(text=text)

print("✓ Summarization module defined")

✓ Summarization module defined


## 📊 Batch Summary Generation

This function generates summaries for a batch of articles using a given prompt. It includes error handling and progress tracking.

In [7]:
def generate_summaries_batch(
        summarizer: SummarizationModule,
        data: List[Dict],
        desc: str = "Generating"
) -> List[Dict]:
    """Generate summaries for a batch of texts."""
    results = []
    errors = 0
    error_details = []

    # Print the prompt being used (first item only)
    if len(data) > 0:
        print(f"  Using prompt: {summarizer.instructions[:100]}...")

    for item in tqdm(data, desc=desc):
        try:
            pred = summarizer(text=item['text'][:5000])

            if pred is None:
                raise ValueError("Model returned None")

            if hasattr(pred, 'summary') and pred.summary:
                summary = pred.summary
            elif isinstance(pred, str):
                summary = pred
            else:
                print(f"\n  DEBUG: pred type={type(pred)}, hasattr summary={hasattr(pred, 'summary')}")
                raise ValueError(f"Cannot extract summary from {type(pred)}")

            summary = summary.strip()
            if len(summary) < 20:
                raise ValueError("Summary too short")

        except Exception as e:
            errors += 1
            error_details.append(str(e)[:100])

            if errors <= 5:
                print(f"\n⚠️  Error: {str(e)[:80]}")

            summary = "Error generating summary."

        results.append({
            'id': item['id'],
            'text': item['text'],
            'summary': summary
        })

    if errors > 0:
        print(f"\n⚠️  Total errors: {errors}/{len(data)} ({errors / len(data) * 100:.1f}%)")
        from collections import Counter
        common_errors = Counter(error_details).most_common(3)
        print(f"  Most common errors:")
        for err, count in common_errors:
            print(f"    - {err[:60]}... ({count}x)")

    return results

print("✓ Batch generation function defined")

✓ Batch generation function defined


## 🧠 Optimizer LLM Wrapper

This wrapper allows us to use an LLM to propose improvements to our summarization prompt based on current performance.

In [8]:
class SimpleOptimizerLM:
    """Wrapper for optimizer LLM."""

    def __init__(self, model: str, api_key: str):
        self.client = together.Client(api_key=api_key)
        self.model = model

    def __call__(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
            max_tokens=4000
        )
        return response.choices[0].message.content

print("✓ Optimizer LLM wrapper defined")

✓ Optimizer LLM wrapper defined


## 🤔 Reflection and Prompt Improvement

This function uses the optimizer LLM to analyze the current prompt and performance, then propose an improved version.

**Key Constraints:**
- Keep prompts under 150 words for clarity
- Focus on simple, direct instructions
- Target 4-6 sentence summaries
- Avoid overly complex requirements

In [9]:
def reflect_and_improve_prompt(
        current_prompt: str,
        current_score: float,
        optimizer_lm: SimpleOptimizerLM,
        iteration: int
) -> str:
    """Use LLM to propose improved prompt."""

    print(f"\n🤔 REFLECTION (Iteration {iteration})")

    reflection_prompt = f"""You are optimizing a summarization prompt for CNN/DailyMail news articles.

Current Prompt:
```
{current_prompt}
```

Current Performance: {current_score:.1%} win rate

Your task: Propose a SIMPLE improved version that generates better summaries.

CRITICAL CONSTRAINTS:
- Keep the prompt under 150 words
- Make it clear and direct (NOT overly complex)
- Target 4-6 sentence summaries
- Avoid excessive instructions or formatting requirements
- The prompt should be easy for the model to follow

Focus on:
- Should it emphasize different aspects (accuracy, brevity, completeness)?
- Are the current guidelines clear?
- Is anything missing or unnecessary?

Output ONLY the improved prompt within ``` blocks. Keep it simple and clear."""

    response = optimizer_lm(reflection_prompt)

    # Extract prompt
    match = re.search(r'```(.*?)```', response, re.DOTALL)
    if match:
        new_prompt = match.group(1).strip()
        # Remove language tags
        for tag in ['markdown', 'text', 'python', 'plaintext']:
            if new_prompt.startswith(f'{tag}\n'):
                new_prompt = '\n'.join(new_prompt.split('\n')[1:])

        # Validate length (reject if too long)
        word_count = len(new_prompt.split())
        if word_count > 200:
            print(f"  ⚠️  Generated prompt too long ({word_count} words), using current")
            return current_prompt

        print(f"✓ Generated new prompt ({word_count} words)")
        return new_prompt

    print("⚠️  Could not extract prompt")
    return current_prompt

print("✓ Reflection function defined")

✓ Reflection function defined


## 🔄 Head-to-Head Prompt Comparison

This function compares two prompts by:
1. Generating summaries with both prompts
2. Creating a comparison dataset
3. Using the Together AI evaluation API with a judge model
4. Computing win rates

The evaluation uses a two-pass approach to eliminate position bias.

In [10]:
def compare_two_prompts_on_batch(
        data: List[Dict],
        prompt_a: str,
        prompt_b: str,
        summarizer_lm: dspy.LM,
        eval_name: str
) -> Tuple[float, float, Dict]:
    """
    Compare two summarization prompts.

    1. Generate summaries with prompt A
    2. Generate summaries with prompt B
    3. Use judge to compare them
    4. Return win rate for prompt A
    """

    print(f"\n{'=' * 80}")
    print(f"🔄 COMPARING PROMPTS: {eval_name}")
    print(f"{'=' * 80}")

    # Step 1: Generate with both prompts
    dspy.configure(lm=summarizer_lm)

    summarizer_a = SummarizationModule(prompt_a)
    summarizer_b = SummarizationModule(prompt_b)

    print("Generating summaries with Prompt A...")
    summaries_a = generate_summaries_batch(summarizer_a, data, "Prompt A")

    print("Generating summaries with Prompt B...")
    summaries_b = generate_summaries_batch(summarizer_b, data, "Prompt B")

    # Step 2: Prepare comparison data
    temp_file = f"temp_compare_{eval_name}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.jsonl"

    with open(temp_file, 'w') as f:
        for summary_a, summary_b in zip(summaries_a, summaries_b):
            formatted = {
                "prompt": f"Source article: {summary_a['text'][:5000]}",
                "model_a_output": summary_a['summary'],
                "model_b_output": summary_b['summary'],
                "id": summary_a['id']
            }
            f.write(json.dumps(formatted) + '\n')

    # Step 3: Upload and evaluate
    print("📤 Uploading for comparison...")
    file_response = client.files.upload(file=temp_file, purpose="eval")
    file_id = file_response.id

    print("🚀 Launching comparison...")
    eval_response = client.evaluation.create(
        type="compare",
        input_data_file_path=file_id,
        judge_model=JUDGE_MODEL,
        judge_model_source="serverless",
        judge_system_template=JUDGE_PROMPT,
        model_a="model_a_output",
        model_b="model_b_output"
    )

    # Step 4: Wait and get results
    print(f"⏳ Waiting (ID: {eval_response.workflow_id})...")
    while True:
        status = client.evaluation.status(eval_response.workflow_id)
        if status.status.value == "completed":
            break
        elif status.status.value == "failed":
            raise Exception("Evaluation failed")
        time.sleep(30)

    a_wins = status.results.get('A_wins', 0)
    b_wins = status.results.get('B_wins', 0)
    ties = status.results.get('Ties', 0)

    # Win rate for prompt A
    decisive_total = a_wins + b_wins
    if decisive_total > 0:
        a_win_rate = a_wins / decisive_total
        b_win_rate = b_wins / decisive_total
    else:
        a_win_rate = b_win_rate = 0.5

    print(f"✓ Results: Prompt A wins={a_wins}, Prompt B wins={b_wins}, Ties={ties}")
    print(f"✓ Prompt A win rate: {a_win_rate:.2%}")

    os.remove(temp_file)

    return a_win_rate, b_win_rate, {
        'a_wins': a_wins,
        'b_wins': b_wins,
        'ties': ties,
        'a_win_rate': a_win_rate
    }

print("✓ Comparison function defined")

✓ Comparison function defined


## 🧬 GEPA Optimization Loop

This is the main optimization loop that implements the GEPA algorithm:

1. **Generate**: Create summaries with current prompt
2. **Evaluate**: Compare against baseline using judge model
3. **Propose**: Use optimizer LLM to suggest improvements
4. **Adapt**: Accept improvements that increase win rate

The process repeats for multiple iterations, tracking the best prompt found.

In [11]:
def run_manual_gepa_improved(
        train_data: List[Dict],
        val_data: List[Dict],
        test_data: List[Dict],
        summarizer_lm: dspy.LM,
        optimizer_lm: SimpleOptimizerLM,
        max_iterations: int = 5,
        early_stopping_patience: int = 2,
        min_improvement: float = 0.02,  # 2% minimum improvement
        test_every_n_iters: int = 2  # Test on test set every N iterations
):
    """
    Improved GEPA optimization with:
    - Early stopping to prevent overfitting
    - Minimum improvement threshold
    - Periodic test set validation
    - Better tracking of all candidates
    """

    print("\n" + "=" * 80)
    print("🧬 IMPROVED GEPA OPTIMIZATION")
    print("=" * 80)
    print(f"Config: patience={early_stopping_patience}, min_improvement={min_improvement:.1%}, test_every={test_every_n_iters}")

    best_prompt = BASELINE_PROMPT
    best_val_score = 0.5  # Start at 50% (neutral)
    best_test_score = None
    
    # Track all candidates
    prompt_history = []
    no_improvement_count = 0
    
    # Track best prompt on test set (for comparison)
    best_test_prompt = BASELINE_PROMPT
    best_test_score_overall = 0.5

    for i in range(max_iterations):
        print(f"\n{'=' * 80}")
        print(f"ITERATION {i + 1}/{max_iterations}")
        print(f"{'=' * 80}")

        if i == 0:
            print("Iteration 0: Establishing baseline")
            # Optionally evaluate baseline on test set
            if test_every_n_iters == 1:
                print("\n📊 Initial test set evaluation...")
                baseline_test, _, _ = compare_two_prompts_on_batch(
                    test_data,
                    prompt_a=BASELINE_PROMPT,
                    prompt_b=BASELINE_PROMPT,
                    summarizer_lm=summarizer_lm,
                    eval_name=f"iter{i}_test_baseline"
                )
                best_test_score_overall = 0.5
                print(f"  Baseline test performance: {0.5:.2%} (neutral)")
            continue

        # Generate new candidate
        new_prompt = reflect_and_improve_prompt(
            best_prompt,
            best_val_score,
            optimizer_lm,
            i
        )

        if new_prompt == best_prompt:
            print("⚠️  No change in prompt, stopping early")
            break

        print(f"✓ Generated candidate prompt ({len(new_prompt)} chars)")

        # Evaluate on validation set
        baseline_win_rate, new_prompt_win_rate, metrics = compare_two_prompts_on_batch(
            val_data,
            prompt_a=BASELINE_PROMPT,
            prompt_b=new_prompt,
            summarizer_lm=summarizer_lm,
            eval_name=f"iter{i}_val"
        )

        new_prompt_win_rate = 1.0 - baseline_win_rate
        improvement = new_prompt_win_rate - best_val_score

        print(f"\n  📊 VALIDATION RESULTS:")
        print(f"  Baseline (original): {baseline_win_rate:.2%}")
        print(f"  New candidate: {new_prompt_win_rate:.2%}")
        print(f"  Improvement: {improvement * 100:+.2f}pp")

        # Track this candidate
        candidate_info = {
            'iteration': i,
            'prompt': new_prompt,
            'val_score': new_prompt_win_rate,
            'improvement': improvement
        }

        # Check if this is a significant improvement
        is_significant = improvement >= min_improvement
        
        if is_significant:
            print(f"  ✅ Significant improvement! (>= {min_improvement:.1%})")
            best_prompt = new_prompt
            best_val_score = new_prompt_win_rate
            no_improvement_count = 0
            candidate_info['accepted'] = True
        else:
            print(f"  ❌ Below threshold ({improvement * 100:.2f}pp < {min_improvement * 100:.2f}pp)")
            no_improvement_count += 1
            candidate_info['accepted'] = False

        # Periodic test set evaluation
        if test_every_n_iters > 0 and i % test_every_n_iters == 0:
            print(f"\n  🔍 PERIODIC TEST SET CHECK (iter {i}):")
            baseline_test, candidate_test, _ = compare_two_prompts_on_batch(
                test_data,
                prompt_a=BASELINE_PROMPT,
                prompt_b=new_prompt,
                summarizer_lm=summarizer_lm,
                eval_name=f"iter{i}_test"
            )
            
            candidate_test_win_rate = 1.0 - baseline_test
            test_improvement = candidate_test_win_rate - 0.5
            
            print(f"  Test performance: {candidate_test_win_rate:.2%} ({test_improvement * 100:+.2f}pp from neutral)")
            
            candidate_info['test_score'] = candidate_test_win_rate
            
            # Track best test performance
            if candidate_test_win_rate > best_test_score_overall:
                best_test_score_overall = candidate_test_win_rate
                best_test_prompt = new_prompt
                print(f"  🌟 New best test score! ({best_test_score_overall:.2%})")
            
            # Warning if validation and test diverge significantly
            val_test_gap = abs(new_prompt_win_rate - candidate_test_win_rate)
            if val_test_gap > 0.15:  # 15% gap
                print(f"  ⚠️  WARNING: Large val/test gap ({val_test_gap:.1%}) - possible overfitting!")

        prompt_history.append(candidate_info)

        # Early stopping check
        if no_improvement_count >= early_stopping_patience:
            print(f"\n🛑 EARLY STOPPING: No significant improvement for {early_stopping_patience} iterations")
            break

    # Final comprehensive test evaluation
    print("\n" + "=" * 80)
    print("📊 FINAL TEST EVALUATION")
    print("=" * 80)

    # Test the best validation prompt
    print("\n1️⃣ Testing best VALIDATION prompt...")
    baseline_test_win_rate, optimized_test_win_rate, _ = compare_two_prompts_on_batch(
        test_data,
        prompt_a=BASELINE_PROMPT,
        prompt_b=best_prompt,
        summarizer_lm=summarizer_lm,
        eval_name="final_test_best_val"
    )

    # If we did periodic testing, also report the best test prompt
    if test_every_n_iters > 0 and best_test_prompt != best_prompt:
        print("\n2️⃣ Testing best TEST prompt (from periodic checks)...")
        _, best_test_final, _ = compare_two_prompts_on_batch(
            test_data,
            prompt_a=BASELINE_PROMPT,
            prompt_b=best_test_prompt,
            summarizer_lm=summarizer_lm,
            eval_name="final_test_best_test"
        )

    # Print comprehensive results
    print("\n" + "=" * 80)
    print("🎉 FINAL RESULTS")
    print("=" * 80)

    print(f"\n📈 OPTIMIZATION SUMMARY:")
    print(f"  Total iterations: {len(prompt_history)}")
    print(f"  Accepted improvements: {sum(1 for p in prompt_history if p.get('accepted', False))}")
    
    print(f"\n📊 TEST SET PERFORMANCE:")
    print(f"  Baseline prompt:        {baseline_test_win_rate:.2%}")
    print(f"  Best validation prompt: {optimized_test_win_rate:.2%} ({(optimized_test_win_rate - 0.5) * 100:+.2f}pp)")
    
    if test_every_n_iters > 0 and best_test_prompt != best_prompt:
        print(f"  Best test prompt:       {best_test_final:.2%} ({(best_test_final - 0.5) * 100:+.2f}pp)")
        
        if best_test_final > optimized_test_win_rate:
            print(f"\n  ⚠️  Best test prompt outperforms best validation prompt!")
            print(f"  This suggests validation overfitting occurred.")
            final_prompt = best_test_prompt
            final_score = best_test_final
        else:
            final_prompt = best_prompt
            final_score = optimized_test_win_rate
    else:
        final_prompt = best_prompt
        final_score = optimized_test_win_rate

    # Save results
    output_dir = Path("results")
    output_dir.mkdir(exist_ok=True)

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

    with open(output_dir / f"prompts_{timestamp}.txt", 'w') as f:
        f.write("BASELINE:\n" + "=" * 80 + "\n")
        f.write(BASELINE_PROMPT)
        f.write("\n\nBEST VALIDATION PROMPT:\n" + "=" * 80 + "\n")
        f.write(best_prompt)
        
        if test_every_n_iters > 0 and best_test_prompt != best_prompt:
            f.write("\n\nBEST TEST PROMPT:\n" + "=" * 80 + "\n")
            f.write(best_test_prompt)
        
        f.write(f"\n\nRESULTS:\n" + "=" * 80 + "\n")
        f.write(f"Baseline test: {baseline_test_win_rate:.2%}\n")
        f.write(f"Best val test: {optimized_test_win_rate:.2%}\n")
        
        if test_every_n_iters > 0 and best_test_prompt != best_prompt:
            f.write(f"Best test test: {best_test_final:.2%}\n")
        
        f.write("\n\nPROMPT HISTORY:\n" + "=" * 80 + "\n")
        for p in prompt_history:
            f.write(f"\nIteration {p['iteration']}:\n")
            f.write(f"  Val score: {p['val_score']:.2%}\n")
            f.write(f"  Improvement: {p['improvement'] * 100:+.2f}pp\n")
            f.write(f"  Accepted: {p.get('accepted', False)}\n")
            if 'test_score' in p:
                f.write(f"  Test score: {p['test_score']:.2%}\n")

    print(f"\n💾 Saved to: results/prompts_{timestamp}.txt")

    return {
        'baseline_test': baseline_test_win_rate,
        'best_val_test': optimized_test_win_rate,
        'best_test_test': best_test_final if (test_every_n_iters > 0 and best_test_prompt != best_prompt) else None,
        'best_val_prompt': best_prompt,
        'best_test_prompt': best_test_prompt if (test_every_n_iters > 0 and best_test_prompt != best_prompt) else None,
        'prompt_history': prompt_history
    }

print("✓ Improved GEPA optimization function defined")

✓ GEPA optimization function defined


## 🚀 Run the Optimization

Now we'll execute the full GEPA optimization process. This will:
1. Set up the summarizer and optimizer models
2. Run multiple iterations of prompt improvement
3. Evaluate the final optimized prompt on the test set
4. Display comprehensive results

In [12]:
print("="*80)
print("🎯 GEPA SUMMARIZATION - TOGETHER AI BATCH EVAL")
print("="*80)

# Load data
train, val, test = load_and_split_data()

# Initialize models
summarizer_lm = dspy.LM(model=SUMMARIZER_MODEL, api_key=TOGETHER_API_KEY, max_tokens=500)
optimizer_lm = SimpleOptimizerLM(model=OPTIMIZER_MODEL, api_key=TOGETHER_API_KEY)

# Configure DSPy
dspy.configure(lm=summarizer_lm)

# Run improved optimization with safeguards
import time
start_time = time.time()

results = run_manual_gepa_improved(
    train_data=train,
    val_data=val,
    test_data=test,
    summarizer_lm=summarizer_lm,
    optimizer_lm=optimizer_lm,
    max_iterations=5,
    early_stopping_patience=2,  # Stop after 2 iterations without improvement
    min_improvement=0.02,  # Require at least 2% improvement
    test_every_n_iters=2  # Check test set every 2 iterations
)

end_time = time.time()
elapsed = end_time - start_time

print("\n✅ Complete!")
print(f"\n⏱️  OPTIMIZATION TIME:")

hours = int(elapsed // 3600)
minutes = int((elapsed % 3600) // 60)
seconds = int(elapsed % 60)

if hours > 0:
    print(f"  Total: {hours}h {minutes}m {seconds}s")
elif minutes > 0:
    print(f"  Total: {minutes}m {seconds}s")
else:
    print(f"  Total: {seconds}s")

🎯 GEPA SUMMARIZATION - TOGETHER AI BATCH EVAL

🧬 MANUAL GEPA OPTIMIZATION

ITERATION 1/5
Iteration 0: Establishing baseline (no comparison yet)

ITERATION 2/5

🤔 REFLECTION (Iteration 1)
✓ Generated new prompt (63 words)
✓ Generated candidate prompt (422 chars)

🔄 COMPARING PROMPTS: iter1_val
Generating summaries with Prompt A...
  Using prompt: Summarize this news article in 3-5 key points.

Write a brief summary covering:
- The main news even...


Prompt A: 100%|██████████| 300/300 [13:06<00:00,  2.62s/it]


Generating summaries with Prompt B...
  Using prompt: Summarize the news article in 4-6 sentences, focusing on accuracy and completeness.

Clearly state t...


Prompt B: 100%|██████████| 300/300 [16:16<00:00,  3.26s/it]


📤 Uploading for comparison...


Uploading file temp_compare_iter1_val_20260106_061010.jsonl: 100%|██████████| 1.60M/1.60M [00:01<00:00, 948kB/s]


🚀 Launching comparison...
⏳ Waiting (ID: eval-a3c2-1767679814)...
✓ Results: Prompt A wins=17, Prompt B wins=42, Ties=241
✓ Prompt A win rate: 28.81%

  Baseline (original): 28.81%
  New candidate: 71.19%
  🎉 New best! (+21.19pp)

ITERATION 3/5

🤔 REFLECTION (Iteration 2)
✓ Generated new prompt (61 words)
✓ Generated candidate prompt (448 chars)

🔄 COMPARING PROMPTS: iter2_val
Generating summaries with Prompt A...
  Using prompt: Summarize this news article in 3-5 key points.

Write a brief summary covering:
- The main news even...


Prompt A: 100%|██████████| 300/300 [00:46<00:00,  6.52it/s]


Generating summaries with Prompt B...
  Using prompt: Summarize the news article in 4-6 sentences, prioritizing accuracy, clarity, and concision. 

Clearl...


Prompt B: 100%|██████████| 300/300 [17:10<00:00,  3.44s/it]


📤 Uploading for comparison...


Uploading file temp_compare_iter2_val_20260106_063920.jsonl: 100%|██████████| 1.58M/1.58M [00:01<00:00, 817kB/s] 


🚀 Launching comparison...
⏳ Waiting (ID: eval-e01a-1767681565)...
✓ Results: Prompt A wins=28, Prompt B wins=53, Ties=219
✓ Prompt A win rate: 34.57%

  Baseline (original): 34.57%
  New candidate: 65.43%
  No improvement

ITERATION 4/5

🤔 REFLECTION (Iteration 3)
✓ Generated new prompt (62 words)
✓ Generated candidate prompt (426 chars)

🔄 COMPARING PROMPTS: iter3_val
Generating summaries with Prompt A...
  Using prompt: Summarize this news article in 3-5 key points.

Write a brief summary covering:
- The main news even...


Prompt A: 100%|██████████| 300/300 [00:48<00:00,  6.23it/s]


Generating summaries with Prompt B...
  Using prompt: Summarize the news article in 4-6 sentences, prioritizing accuracy and clarity.

Focus on the main n...


Prompt B: 100%|██████████| 300/300 [17:04<00:00,  3.42s/it]


📤 Uploading for comparison...


Uploading file temp_compare_iter3_val_20260106_070826.jsonl: 100%|██████████| 1.58M/1.58M [00:02<00:00, 674kB/s]


🚀 Launching comparison...
⏳ Waiting (ID: eval-2e2c-1767683311)...
✓ Results: Prompt A wins=24, Prompt B wins=32, Ties=244
✓ Prompt A win rate: 42.86%

  Baseline (original): 42.86%
  New candidate: 57.14%
  No improvement

ITERATION 5/5

🤔 REFLECTION (Iteration 4)
✓ Generated new prompt (57 words)
✓ Generated candidate prompt (406 chars)

🔄 COMPARING PROMPTS: iter4_val
Generating summaries with Prompt A...
  Using prompt: Summarize this news article in 3-5 key points.

Write a brief summary covering:
- The main news even...


Prompt A: 100%|██████████| 300/300 [00:48<00:00,  6.19it/s]


Generating summaries with Prompt B...
  Using prompt: Summarize the news article in 4-6 clear and concise sentences, prioritizing accuracy and main event ...


Prompt B: 100%|██████████| 300/300 [17:58<00:00,  3.60s/it]


📤 Uploading for comparison...


Uploading file temp_compare_iter4_val_20260106_073825.jsonl: 100%|██████████| 1.57M/1.57M [00:02<00:00, 646kB/s]


🚀 Launching comparison...
⏳ Waiting (ID: eval-a036-1767685111)...
✓ Results: Prompt A wins=30, Prompt B wins=40, Ties=230
✓ Prompt A win rate: 42.86%

  Baseline (original): 42.86%
  New candidate: 57.14%
  No improvement

📊 FINAL TEST EVALUATION

🔄 COMPARING PROMPTS: final_test
Generating summaries with Prompt A...
  Using prompt: Summarize this news article in 3-5 key points.

Write a brief summary covering:
- The main news even...


Prompt A: 100%|██████████| 300/300 [17:14<00:00,  3.45s/it]


Generating summaries with Prompt B...
  Using prompt: Summarize the news article in 4-6 sentences, focusing on accuracy and completeness.

Clearly state t...


Prompt B: 100%|██████████| 300/300 [16:59<00:00,  3.40s/it]


📤 Uploading for comparison...


Uploading file temp_compare_final_test_20260106_082252.jsonl: 100%|██████████| 1.58M/1.58M [00:01<00:00, 1.24MB/s]


🚀 Launching comparison...
⏳ Waiting (ID: eval-c191-1767687777)...
✓ Results: Prompt A wins=31, Prompt B wins=34, Ties=235
✓ Prompt A win rate: 47.69%

🎉 FINAL RESULTS

TEST SET:
  Baseline prompt:  47.69%
  Optimized prompt: 52.31%
  Improvement:      +2.31pp from neutral

💾 Saved to: results/prompts_20260106_083038.txt

✅ Complete!

⏱️  OPTIMIZATION TIME:
  Total: 2h 49m 52s


## ⚙️ Configuration Guide

You can tune these parameters based on your needs:

```python
results = run_manual_gepa_improved(
    ...
    max_iterations=5,           # Maximum optimization rounds
    early_stopping_patience=2,  # Stop after N iterations with no improvement
    min_improvement=0.02,       # Require 2% improvement to accept candidate
    test_every_n_iters=2        # Validate on test set every N iterations
)
```

### Recommended Settings

**Conservative (avoid overfitting):**
```python
max_iterations=5
early_stopping_patience=1  # Stop after just 1 bad iteration
min_improvement=0.03       # Require 3% improvement
test_every_n_iters=1       # Check test set every iteration
```

**Aggressive (explore more):**
```python
max_iterations=10
early_stopping_patience=3  # Allow 3 plateaus
min_improvement=0.01       # Accept 1% improvement
test_every_n_iters=3       # Less frequent test checks
```

**Fast (for experimentation):**
```python
max_iterations=3
early_stopping_patience=1
min_improvement=0.05
test_every_n_iters=1
```

## 📊 Understanding the Improvements

The improved GEPA implementation includes several safeguards:

### 1. **Early Stopping**
- Stops optimization after N iterations without significant improvement
- Prevents overfitting to validation set
- Default: patience=2 (stops after 2 iterations with no gains)

### 2. **Minimum Improvement Threshold**
- Requires improvements to be statistically meaningful (default: 2%)
- Prevents accepting minor fluctuations as "improvements"
- Helps avoid random walk through prompt space

### 3. **Periodic Test Set Validation**
- Evaluates on test set every N iterations (default: 2)
- Catches overfitting early
- Tracks best test performance separately from best validation

### 4. **Comprehensive Tracking**
- Records all candidate prompts and their scores
- Saves detailed history to results file
- Allows post-hoc analysis of optimization trajectory

### Why This Matters

Your original results showed classic overfitting:
- Best validation: 71.19% (Iteration 1)
- Final test: 52.31%
- **Gap: -18.88pp**

The improved version would have:
1. Detected the validation plateau after iteration 1
2. Checked test performance at iteration 2
3. Likely stopped early, preserving better generalization

## 📊 Analyzing the Results

Let's examine the optimized prompt and compare it to the baseline.

In [13]:
print("=" * 80)
print("📝 PROMPT COMPARISON")
print("=" * 80)

print("\nBASELINE PROMPT:")
print("-" * 80)
print(BASELINE_PROMPT)

print("\n\nOPTIMIZED PROMPT:")
print("-" * 80)
print(results['best_prompt'])

print("\n\nPERFORMANCE COMPARISON:")
print("-" * 80)
print(f"Baseline Win Rate:  {results['baseline_test']:.2%}")
print(f"Optimized Win Rate: {results['optimized_test']:.2%}")
print(f"Improvement:        {(results['optimized_test'] - 0.5) * 100:+.2f} percentage points from neutral")

📝 PROMPT COMPARISON

BASELINE PROMPT:
--------------------------------------------------------------------------------
Summarize this news article in 3-5 key points.

Write a brief summary covering:
- The main news event
- Key people or organizations involved
- Important details or outcomes
- Any significant context

Keep it to 3-5 sentences total.


OPTIMIZED PROMPT:
--------------------------------------------------------------------------------
Summarize the news article in 4-6 sentences, focusing on accuracy and completeness.

Clearly state the main news event and its significance. 
Include key people or organizations involved, and any notable actions or decisions they made. 
Provide essential details and outcomes, as well as relevant context that helps understand the event. 
Avoid unnecessary information and focus on the most important aspects of the story.


PERFORMANCE COMPARISON:
--------------------------------------------------------------------------------
Baseline Win Rate:

## 🔑 Key Findings

**GEPA Optimization Process:**
- Iteratively improves prompts through LLM-guided reflection
- Uses head-to-head comparisons with a judge model
- Tracks and accepts only improvements over baseline

**Benefits of This Approach:**
1. **Automated**: No manual prompt engineering required
2. **Data-driven**: Decisions based on actual performance metrics
3. **Scalable**: Can optimize for any task with appropriate data
4. **Transparent**: Clear tracking of improvements across iterations

**Next Steps:**
- Try with different datasets or domains
- Experiment with different judge criteria
- Adjust the optimizer's reflection prompt
- Increase iterations for potentially better results