# Two-Stage Optimization: BootstrapFewShot → GEPA

This notebook demonstrates a two-stage optimization approach for the ConversationOrchestrator.

## Stage 1: BootstrapFewShot
- Generates high-quality few-shot examples from training data
- Fast iteration (~15-30 min)
- Establishes baseline with good demonstrations

## Stage 2: GEPA (Genetic-Pareto)
- Takes Stage 1 output as starting point
- Uses LLM reflection to evolve prompt instructions
- Optimizes tool selection patterns
- More sophisticated, slower (~1-2 hours)

## Why Two Stages?
1. **BootstrapFewShot**: Finds good examples of correct behavior
2. **GEPA**: Refines the instructions that guide the ReAct agent
3. **Combined**: Best of both, with good examples and evolved prompts

## Models Used
- **Advanced Model**: `gemini/gemini-2.5-pro` - For reflection and optimization
- **Production Model**: `gemini/gemini-2.0-flash-lite` - For running the orchestrator

## Setup

In [None]:
import dspy
import pandas as pd
import json
from pathlib import Path
import sys
from typing import List, Tuple
import matplotlib.pyplot as plt

# Add src to path
sys.path.insert(0, str(Path.cwd().parent / 'src'))

from app.llm.modules import ConversationOrchestrator
from app.models import ConversationMessage

## Configure DSPy with Gemini Models

In [None]:
import os

# Read the API key from the environment rather than hardcoding it in the notebook
gemini_api_key = os.environ["GEMINI_API_KEY"]

# Configure production model (for running orchestrator)
production_lm = dspy.LM('gemini/gemini-2.0-flash-lite', api_key=gemini_api_key)
dspy.configure(lm=production_lm)

# Configure advanced model (for optimization/reflection)
advanced_lm = dspy.LM('gemini/gemini-2.5-pro', api_key=gemini_api_key)

print("✓ Configured models:")
print(f"  Production: gemini/gemini-2.0-flash-lite")
print(f"  Advanced: gemini/gemini-2.5-pro")

## Load Training Data

In [None]:
def load_training_data(csv_path: str) -> Tuple[List[dspy.Example], List[dspy.Example]]:
    """Load and split training data."""
    df = pd.read_csv(csv_path)
    
    examples = []
    for _, row in df.iterrows():
        # Empty CSV cells load as NaN, which is truthy, so test with pd.notna
        raw_history = row['conversation_history']
        conv_history = json.loads(raw_history) if pd.notna(raw_history) and raw_history else []
        
        example = dspy.Example(
            question=row['user_input'],
            previous_conversation=conv_history,
            page_context=row['page_context'] if pd.notna(row['page_context']) else "",
            answer=row['expected_response']
        ).with_inputs("question", "previous_conversation", "page_context")
        
        examples.append(example)
    
    # Split 80/20 train/dev
    split_idx = int(len(examples) * 0.8)
    return examples[:split_idx], examples[split_idx:]

# Load data
data_path = Path.cwd().parent / "datasets" / "conversation_react_training.csv"

if data_path.exists():
    trainset, devset = load_training_data(str(data_path))
    print(f"✓ Loaded training data:")
    print(f"  Train: {len(trainset)} examples")
    print(f"  Dev: {len(devset)} examples")
else:
    print(f"⚠ Dataset not found at {data_path}")
    print("  Run notebook 01_generate_react_training_data.ipynb first")
    trainset, devset = [], []

## Define Evaluation Metric
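
The language-consistency check in the metric below relies on a Unicode range test. As a minimal standalone sketch (the helper names `contains_cjk` and `same_language` are assumptions for illustration, not part of the notebook), the range `\u4e00` to `\u9fff` covers the CJK Unified Ideographs block, so it detects Chinese characters but would not flag Japanese kana or Korean hangul:

```python
def contains_cjk(text: str) -> bool:
    """Return True if any character falls in the CJK Unified Ideographs block."""
    return any('\u4e00' <= ch <= '\u9fff' for ch in text)


def same_language(expected: str, predicted: str) -> bool:
    """Treat two strings as language-consistent when both or neither contain CJK."""
    return contains_cjk(expected) == contains_cjk(predicted)


print(contains_cjk("找音樂會"))                        # True
print(same_language("找音樂會", "Here are concerts"))  # False
```

Because the check is binary, a mixed-language reply that contains even one Chinese character counts as Chinese.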

In [None]:
def conversation_quality_metric(example, prediction, trace=None, pred_name=None, pred_trace=None):
    """Multi-faceted metric for conversation quality.
    
    Returns:
        float: Score 0.0-1.0 for Evaluate and BootstrapFewShot.
        dspy.Prediction: Score plus textual feedback when GEPA requests
            predictor-level feedback (pred_name / pred_trace are passed).
    """
    score = 0.0
    feedback = []
    
    # Get prediction response
    pred_response = prediction.answer if hasattr(prediction, 'answer') else str(prediction)
    
    # 1. Response exists and is substantial (30%)
    if pred_response and len(pred_response.strip()) > 20:
        score += 0.3
    else:
        feedback.append("Response too short or empty")
    
    # 2. Language consistency (30%)
    expected = example.answer
    has_chinese_expected = any('\u4e00' <= char <= '\u9fff' for char in expected)
    has_chinese_pred = any('\u4e00' <= char <= '\u9fff' for char in pred_response)
    
    if has_chinese_expected == has_chinese_pred:
        score += 0.3
    else:
        lang_expected = "Chinese" if has_chinese_expected else "English"
        feedback.append(f"Language mismatch: expected {lang_expected}")
    
    # 3. Reasonable length relative to expected (20%)
    if len(pred_response) >= len(expected) * 0.3:
        score += 0.2
    else:
        feedback.append("Response significantly shorter than expected")
    
    # 4. Basic content check (20%)
    # Check if response seems relevant
    question_lower = example.question.lower()
    response_lower = pred_response.lower()
    
    relevance = False
    if "membership" in question_lower and ("membership" in response_lower or "member" in response_lower):
        relevance = True
    elif "ticket" in question_lower and "ticket" in response_lower:
        relevance = True
    elif "event" in question_lower or "concert" in question_lower:
        relevance = True
    elif "音樂" in example.question or "活動" in example.question:
        relevance = True
    else:
        relevance = len(pred_response) > 30  # At least trying to help
    
    if relevance:
        score += 0.2
    else:
        feedback.append("Response may not be relevant to query")
    
    # GEPA reads textual feedback from a Prediction with score and feedback fields;
    # Evaluate and BootstrapFewShot only need the numeric score
    if pred_name is not None or pred_trace is not None:
        return dspy.Prediction(
            score=score,
            feedback=" | ".join(feedback) if feedback else "All checks passed"
        )
    return score

# Test metric
if trainset:
    test_example = trainset[0]
    test_pred = dspy.Prediction(answer="This is a test response with adequate length.")
    test_score = conversation_quality_metric(test_example, test_pred)
    print(f"✓ Metric test score: {test_score if isinstance(test_score, float) else test_score[0]:.2f}")

## Baseline Evaluation

In [None]:
from dspy.evaluate import Evaluate

# Initialize unoptimized orchestrator
orchestrator = ConversationOrchestrator()

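# Evaluate on dev set (sample for speed)
dev_evaluator = Evaluate(
    devset=devset[:10] if len(devset) > 10 else devset,
    metric=conversation_quality_metric,
    num_threads=1,
    display_progress=True
)

def evaluator(program) -> float:
    """Evaluate a program on the dev sample and return the score as a 0-1 fraction."""
    result = dev_evaluator(program)
    # Evaluate reports a percentage (0-100); newer DSPy versions wrap it in a result object
    return float(getattr(result, "score", result)) / 100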

print("\n" + "="*80)
print("BASELINE EVALUATION")
print("="*80)
baseline_score = evaluator(orchestrator)
print(f"\n✓ Baseline Score: {baseline_score:.2%}")

## STAGE 1: BootstrapFewShot Optimization

Generate high-quality few-shot examples by running the orchestrator multiple times and collecting successful trajectories.
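
Conceptually, bootstrapping amounts to running the program on training examples and keeping the runs that the metric accepts. The following is a toy sketch of that idea in plain Python, not DSPy's implementation: `bootstrap_demos`, `toy_program`, and `toy_metric` are hypothetical names, and the real optimizer records full execution traces rather than just question/answer pairs.

```python
from typing import Callable, Dict, List


def bootstrap_demos(
    program: Callable[[str], str],
    metric: Callable[[Dict[str, str], str], bool],
    trainset: List[Dict[str, str]],
    max_demos: int,
) -> List[Dict[str, str]]:
    """Run the program on each example and keep the outputs the metric accepts."""
    demos = []
    for example in trainset:
        if len(demos) >= max_demos:
            break  # Enough demonstrations collected
        output = program(example["question"])
        if metric(example, output):
            demos.append({"question": example["question"], "answer": output})
    return demos


# Hypothetical stand-ins for the orchestrator and the metric
def toy_program(question: str) -> str:
    return question.upper() if "concert" in question else ""


def toy_metric(example: Dict[str, str], output: str) -> bool:
    return len(output) > 0


trainset = [
    {"question": "rock concert tonight"},
    {"question": "refund policy"},
    {"question": "jazz concert"},
]
demos = bootstrap_demos(toy_program, toy_metric, trainset, max_demos=8)
print(len(demos))  # 2 of the 3 runs pass the metric
```

A lenient metric lets weak runs through as demonstrations, so the acceptance rule matters as much as the demo count.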

In [None]:
from dspy.teleprompt import BootstrapFewShot

print("\n" + "="*80)
print("STAGE 1: BootstrapFewShot Optimization")
print("="*80)

# Configure BootstrapFewShot
bootstrap_optimizer = BootstrapFewShot(
    metric=conversation_quality_metric,
    max_bootstrapped_demos=8,  # Generate up to 8 examples
    max_labeled_demos=4,       # Use up to 4 labeled examples
    max_rounds=2,              # Bootstrap rounds
    max_errors=10              # Allow some failures during bootstrapping
)

print("\nOptimizer configuration:")
print(f"  max_bootstrapped_demos: 8")
print(f"  max_labeled_demos: 4")
print(f"  max_rounds: 2")
print("\nStarting optimization (this may take 15-30 minutes)...\n")

# Compile with training set
stage1_optimized = bootstrap_optimizer.compile(
    student=orchestrator,
    trainset=trainset[:30] if len(trainset) > 30 else trainset  # Use subset for speed
)

print("\n✓ Stage 1 optimization complete!")

## Evaluate Stage 1 Results

In [None]:
print("\nEvaluating Stage 1 (BootstrapFewShot) results...\n")

stage1_score = evaluator(stage1_optimized)
stage1_improvement = stage1_score - baseline_score

print(f"\n" + "="*80)
print(f"STAGE 1 RESULTS")
print("="*80)
print(f"Baseline Score:    {baseline_score:.2%}")
print(f"Stage 1 Score:     {stage1_score:.2%}")
print(f"Improvement:       {stage1_improvement:+.2%}")
print("="*80)

# Save Stage 1 checkpoint
checkpoint_path = Path.cwd().parent / "src" / "app" / "optimized" / "ConversationOrchestrator"
checkpoint_path.mkdir(parents=True, exist_ok=True)
stage1_optimized.save(str(checkpoint_path / "stage1_bootstrap.json"))
print(f"\n✓ Stage 1 checkpoint saved to: {checkpoint_path / 'stage1_bootstrap.json'}")

## Inspect Stage 1 Few-Shot Examples

Let's look at the examples that BootstrapFewShot generated.

In [None]:
print("\nGenerated Few-Shot Examples from Stage 1:")
print("="*80)

# Demos are attached to each predictor inside the module, so walk all of them
found_demos = False
for name, predictor in stage1_optimized.named_predictors():
    demos = predictor.demos
    if not demos:
        continue
    found_demos = True
    print(f"{name}: {len(demos)} demos\n")
    
    for i, demo in enumerate(demos[:3]):  # Show first 3
        answer = str(demo.get('answer', ''))
        print(f"\nDemo {i+1}:")
        print(f"  Question: {demo.get('question', '')}")
        print(f"  Answer: {answer[:150]}..." if len(answer) > 150 else f"  Answer: {answer}")
        print("-" * 80)

if not found_demos:
    print("No demos found in optimized model structure")

## STAGE 2: GEPA Optimization

Now we use GEPA to evolve the prompt instructions through reflection, starting from the Stage 1 output.
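
GEPA keeps a pool of candidate prompts rather than a single best one. The following toy sketch illustrates the Pareto-style selection idea, under the assumption that a candidate stays in the pool if it reaches the top score on at least one validation example. The function name `pareto_pool` and the score table are hypothetical, and this is not DSPy's implementation.

```python
from typing import Dict, List, Set


def pareto_pool(scores: Dict[str, List[float]]) -> Set[str]:
    """Return candidates that reach the best score on at least one example."""
    num_examples = len(next(iter(scores.values())))
    pool = set()
    for i in range(num_examples):
        best = max(per_example[i] for per_example in scores.values())
        pool.update(name for name, per_example in scores.items() if per_example[i] == best)
    return pool


# Hypothetical per-example scores for three candidate instructions
scores = {
    "seed":        [0.8, 0.3, 0.5],
    "candidate_a": [0.6, 0.9, 0.5],
    "candidate_b": [0.5, 0.2, 0.4],
}
print(sorted(pareto_pool(scores)))  # ['candidate_a', 'seed']
```

Here `candidate_b` is dropped because it is never the best on any example, while `seed` survives despite a lower average because it still wins on the first example.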

In [None]:
from dspy.teleprompt import GEPA

print("\n" + "="*80)
print("STAGE 2: GEPA Optimization")
print("="*80)

# Configure GEPA optimizer
gepa_optimizer = dspy.GEPA(
    metric=conversation_quality_metric,
    reflection_lm=advanced_lm,  # Use Gemini 2.5 Pro for reflection
    auto="medium",  # Can be: "light", "medium", "heavy"
    num_threads=4,
    track_stats=True,
    component_selector="all"  # Optimize all components together
)

print("\nOptimizer configuration:")
print(f"  reflection_lm: gemini/gemini-2.5-pro")
print(f"  auto: medium")
print(f"  num_threads: 4")
print(f"  component_selector: all")
print("\nStarting GEPA optimization (this may take 1-2 hours)...\n")

# Compile using Stage 1 output as starting point
# (the search budget comes from auto="medium" on the optimizer; compile takes no iteration count)
stage2_optimized = gepa_optimizer.compile(
    student=stage1_optimized,  # IMPORTANT: Start from Stage 1 result
    trainset=trainset[:30] if len(trainset) > 30 else trainset,
    valset=devset
)

print("\n✓ Stage 2 optimization complete!")

## Evaluate Stage 2 Results

In [None]:
print("\nEvaluating Stage 2 (GEPA) results...\n")

stage2_score = evaluator(stage2_optimized)
stage2_improvement = stage2_score - stage1_score
total_improvement = stage2_score - baseline_score

print(f"\n" + "="*80)
print(f"FINAL RESULTS - TWO-STAGE OPTIMIZATION")
print("="*80)
print(f"Baseline Score:           {baseline_score:.2%}")
print(f"Stage 1 (Bootstrap):      {stage1_score:.2%}  ({(stage1_score - baseline_score):+.2%})")
print(f"Stage 2 (GEPA):           {stage2_score:.2%}  ({stage2_improvement:+.2%})")
print(f"\nTotal Improvement:        {total_improvement:+.2%}")
print("="*80)

## Visualize Results

In [None]:
# Create comparison DataFrame
results = pd.DataFrame({
    'Stage': ['Baseline', 'BootstrapFewShot', 'GEPA'],
    'Score': [baseline_score, stage1_score, stage2_score],
    'Improvement': [0, stage1_score - baseline_score, total_improvement]
})

print("\n", results)

# Plot results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Score progression
ax1.bar(results['Stage'], results['Score'], color=['#ff6b6b', '#4ecdc4', '#45b7d1'])
ax1.set_ylabel('Score')
ax1.set_title('Two-Stage Optimization Results')
ax1.set_ylim(0, 1.0)
for i, v in enumerate(results['Score']):
    ax1.text(i, v + 0.02, f'{v:.2%}', ha='center', fontweight='bold')

# Improvement bars
ax2.bar(results['Stage'], results['Improvement'], color=['#ff6b6b', '#4ecdc4', '#45b7d1'])
ax2.set_ylabel('Improvement over Baseline')
ax2.set_title('Improvement by Stage')
for i, v in enumerate(results['Improvement']):
    if v > 0:
        ax2.text(i, v + 0.01, f'+{v:.2%}', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

## Inspect GEPA-Evolved Instructions

Let's see how GEPA evolved the prompt instructions.

In [None]:
print("\n" + "="*80)
print("GEPA-EVOLVED INSTRUCTIONS")
print("="*80)

# Instructions live on each predictor's signature, so print them per predictor
for name, predictor in stage2_optimized.named_predictors():
    print(f"\n[{name}]")
    print(predictor.signature.instructions)

print("\n" + "="*80)
print("ORIGINAL INSTRUCTIONS (for comparison)")
print("="*80)

# Same walk over the unoptimized orchestrator
for name, predictor in orchestrator.named_predictors():
    print(f"\n[{name}]")
    print(predictor.signature.instructions)

## Interactive Testing

Compare all three versions side by side.

In [None]:
def test_all_stages(query: str, page_context: str = ""):
    """Test query across all optimization stages."""
    print(f"\n{'='*80}")
    print(f"Query: {query}")
    if page_context:
        print(f"Context: {page_context}")
    print(f"{'='*80}\n")
    
    try:
        print("BASELINE:")
        baseline_response = orchestrator(query, [], page_context)
        print(baseline_response)
    except Exception as e:
        print(f"Error: {e}")
    
    print("\n" + "-"*80 + "\n")
    
    try:
        print("STAGE 1 (BootstrapFewShot):")
        stage1_response = stage1_optimized(query, [], page_context)
        print(stage1_response)
    except Exception as e:
        print(f"Error: {e}")
    
    print("\n" + "-"*80 + "\n")
    
    try:
        print("STAGE 2 (GEPA):")
        stage2_response = stage2_optimized(query, [], page_context)
        print(stage2_response)
    except Exception as e:
        print(f"Error: {e}")

# Test various query types
print("\n" + "#"*80)
print("# INTERACTIVE TESTING")
print("#"*80)

In [None]:
# Test 1: Simple event search
test_all_stages("Find rock concerts in Los Angeles")

In [None]:
# Test 2: Membership inquiry
test_all_stages("How much is premium membership?")

In [None]:
# Test 3: Chinese query
test_all_stages("找音樂會")

In [None]:
# Test 4: Multi-intent query
test_all_stages("I want jazz concerts this weekend and what's the refund policy?")

In [None]:
# Test 5: Vague query
test_all_stages("events")

## Save Final Optimized Model

In [None]:
# Save final optimized model
output_dir = Path.cwd().parent / "src" / "app" / "optimized" / "ConversationOrchestrator"
output_dir.mkdir(parents=True, exist_ok=True)

final_model_path = output_dir / "two_stage_optimized.json"
stage2_optimized.save(str(final_model_path))

# Save metadata
metadata = {
    'optimization_method': 'two_stage',
    'stage1': {
        'optimizer': 'BootstrapFewShot',
        'max_bootstrapped_demos': 8,
        'max_labeled_demos': 4,
        'max_rounds': 2,
        'score': float(stage1_score)
    },
    'stage2': {
        'optimizer': 'GEPA',
        'reflection_lm': 'gemini/gemini-2.5-pro',
        'auto': 'medium',
        'score': float(stage2_score)
    },
    'baseline_score': float(baseline_score),
    'final_score': float(stage2_score),
    'total_improvement': float(total_improvement),
    'training_examples': len(trainset),
    'dev_examples': len(devset),
    'optimization_date': pd.Timestamp.now().isoformat(),
    'models': {
        'production': 'gemini/gemini-2.0-flash-lite',
        'advanced': 'gemini/gemini-2.5-pro'
    }
}

metadata_path = output_dir / "two_stage_metadata.json"
with open(metadata_path, 'w') as f:
    json.dump(metadata, f, indent=2)

print("\n" + "="*80)
print("✓ OPTIMIZATION COMPLETE")
print("="*80)
print(f"\nFinal model saved to:")
print(f"  {final_model_path}")
print(f"\nMetadata saved to:")
print(f"  {metadata_path}")
print(f"\nFinal Score: {stage2_score:.2%}")
print(f"Total Improvement: {total_improvement:+.2%}")
print("\n" + "="*80)

## Usage in Production

To use the optimized model:

In [None]:
# Example: Load and use optimized model
from app.llm.modules import ConversationOrchestrator

# Load optimized model
optimized = ConversationOrchestrator()
optimized.load('src/app/optimized/ConversationOrchestrator/two_stage_optimized.json')

# Use in production
response = optimized(
    question="Find jazz concerts",
    previous_conversation=[],
    page_context=""
)
print(response)

## Summary

### What We Achieved

1. **Stage 1 (BootstrapFewShot)**:
   - Generated high-quality few-shot examples
   - Fast optimization (~15-30 minutes)
   - Established strong baseline

2. **Stage 2 (GEPA)**:
   - Evolved prompt instructions through LLM reflection
   - Optimized tool selection patterns
   - Combined with Stage 1 examples for best results

3. **Final Model**:
   - Optimized few-shot demonstrations
   - Evolved prompt instructions
   - Improved tool usage patterns
   - Better multilingual consistency

### Next Steps

1. **Deploy**: Use the optimized model in production
2. **Monitor**: Track performance on real user queries
3. **Iterate**: Collect edge cases and retrain periodically
4. **A/B Test**: Compare optimized vs baseline in production
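
For the A/B test, one common approach is to assign each user to a variant deterministically so that the same user always sees the same model. A minimal sketch, assuming a string user ID is available; `assign_variant` and `optimized_share` are hypothetical names, not part of the application code:

```python
import hashlib


def assign_variant(user_id: str, optimized_share: float = 0.5) -> str:
    """Deterministically map a user to 'optimized' or 'baseline'."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # Uniform value in [0, 1)
    return "optimized" if bucket < optimized_share else "baseline"


# The same user always lands in the same bucket
print(assign_variant("user-123") == assign_variant("user-123"))  # True
```

Hashing the ID avoids storing assignments, and changing `optimized_share` gradually shifts traffic toward the optimized model.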