# Train and Optimize ConversationOrchestrator

This notebook demonstrates how to train and optimize the refactored ConversationOrchestrator using DSPy.

## Training Strategy

The ConversationOrchestrator has 3 components:
1. **PreGuardrails** - Already optimized (optional: retrain with new data)
2. **ReAct Agent** - Main component to optimize with BootstrapFewShot
3. **PostGuardrails** - Output validation (optional: optimize if needed)

## Optimization Approach

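We'll use DSPy's BootstrapFewShot optimizer (a "teleprompter" in older DSPy terminology) to optimize the ReAct agent's ConversationSignature.
BootstrapFewShot runs the program on training inputs, keeps the runs that pass the metric, and attaches them as few-shot demonstrations, which can improve tool selection and response quality.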

In [5]:
import dspy
import pandas as pd
import json
from pathlib import Path
import sys
from typing import List

# Add src to path
sys.path.insert(0, str(Path.cwd().parent / 'src'))

from app.llm.modules import ConversationOrchestrator
from app.models import ConversationMessage

# Configure DSPy LLM (read the API key from the environment; never hardcode secrets)
import os

lm = dspy.LM('gemini/gemini-2.5-flash-lite', api_key=os.environ['GEMINI_API_KEY'])
dspy.configure(lm=lm)

ModuleNotFoundError: No module named 'app.llm.tools.AskClarification'

## Load Training Data

In [None]:
# Load ReAct training data
data_path = Path.cwd().parent / "datasets" / "conversation_react_training.csv"
df = pd.read_csv(data_path)

print(f"Loaded {len(df)} training examples")
print(f"\nIntent distribution:")
print(df['intent_category'].value_counts())
print(f"\nLanguage distribution:")
print(df['language'].value_counts())
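
Before building examples, it can help to confirm that the CSV exposes every column the later cells read. The sketch below is a minimal check, not part of the original pipeline: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names introduced here, and the column list is inferred from how this notebook indexes `df`.

```python
from typing import Iterable, List

# Columns read elsewhere in this notebook (inferred from usage; adjust if the CSV differs)
REQUIRED_COLUMNS = [
    "user_input",
    "conversation_history",
    "page_context",
    "expected_response",
    "intent_category",
    "language",
]

def missing_columns(columns: Iterable[str], required: Iterable[str] = REQUIRED_COLUMNS) -> List[str]:
    """Return the required column names that are absent, in their declared order."""
    present = set(columns)
    return [name for name in required if name not in present]

# Example with a plain list; in the notebook you would pass df.columns
print(missing_columns(["user_input", "expected_response", "language"]))
```

Failing early on a non-empty result avoids a `KeyError` partway through example construction.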

## Prepare Training Examples

Convert each CSV row into a DSPy Example and mark which fields are inputs.

In [7]:
def create_dspy_examples(df: pd.DataFrame) -> List[dspy.Example]:
    """Convert DataFrame to DSPy Examples."""
    examples = []
    
    for _, row in df.iterrows():
        # Parse conversation history; empty CSV cells load as NaN, which is truthy,
        # so check with pd.notna before calling json.loads
        raw_history = row['conversation_history']
        conv_history = json.loads(raw_history) if pd.notna(raw_history) and raw_history else []
        
        # Create example
        example = dspy.Example(
            question=row['user_input'],
            previous_conversation=conv_history,
            page_context=row['page_context'] if pd.notna(row['page_context']) else "",
            answer=row['expected_response']
        ).with_inputs("question", "previous_conversation", "page_context")
        
        examples.append(example)
    
    return examples

# Create examples
all_examples = create_dspy_examples(df)

# Split train/dev/test (70/15/15)
from sklearn.model_selection import train_test_split

train_dev, test_examples = train_test_split(all_examples, test_size=0.15, random_state=42)
train_examples, dev_examples = train_test_split(train_dev, test_size=0.176, random_state=42)  # 0.176 * 0.85 ≈ 0.15

print(f"\nTrain: {len(train_examples)}")
print(f"Dev: {len(dev_examples)}")
print(f"Test: {len(test_examples)}")


Train: 65
Dev: 15
Test: 15
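
The 0.176 in the second split comes from dividing the target dev share by what remains after the test split (0.15 / 0.85). The sketch below reproduces the printed sizes for 95 examples (65 + 15 + 15). It assumes each stage rounds the held-out count up, which is how scikit-learn's `train_test_split` is understood to treat a float `test_size`; `two_stage_split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math
from typing import Tuple

def two_stage_split_sizes(n: int, test_frac: float, dev_frac: float) -> Tuple[int, int, int]:
    """Return (train, dev, test) sizes for a two-stage holdout split.

    Assumption: each stage rounds its held-out count up (ceil).
    """
    n_test = math.ceil(test_frac * n)
    remaining = n - n_test
    # Fraction of the remaining pool that corresponds to dev_frac of the full dataset
    second_stage_frac = dev_frac / (1.0 - test_frac)
    n_dev = math.ceil(second_stage_frac * remaining)
    return remaining - n_dev, n_dev, n_test

print(round(0.15 / 0.85, 3))                  # second-stage fraction
print(two_stage_split_sizes(95, 0.15, 0.15))  # (train, dev, test)
```

Because both stages round up, the realized split on a small dataset is closer to 68/16/16 than an exact 70/15/15.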


## Define Evaluation Metric

In [8]:
def conversation_quality_metric(example, prediction, trace=None):
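    """Score a predicted response with cheap heuristics.
    
    Checks, in order:
    1. Response is at least 10 characters after stripping whitespace (else 0.0)
    2. Response script matches the expected response, Chinese vs. non-Chinese (else 0.3)
    3. Response is at least 20% as long as the expected response (else 0.5)
    
    Responses that pass all three score 1.0. These checks do not measure
    correctness; for full evaluation, use semantic similarity or LLM-as-judge.
    """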
    # Get prediction response
    pred_response = prediction.answer if hasattr(prediction, 'answer') else str(prediction)
    
    # Check 1: Not empty
    if not pred_response or len(pred_response.strip()) < 10:
        return 0.0
    
    # Check 2: Language match (basic heuristic)
    # If expected response has Chinese characters, prediction should too
    expected = example.answer
    has_chinese_expected = any('\u4e00' <= char <= '\u9fff' for char in expected)
    has_chinese_pred = any('\u4e00' <= char <= '\u9fff' for char in pred_response)
    
    if has_chinese_expected != has_chinese_pred:
        return 0.3  # Wrong language, but not completely wrong
    
    # Check 3: Reasonable length (at least 20% of expected)
    if len(pred_response) < len(expected) * 0.2:
        return 0.5
    
    # Basic pass - for full evaluation, use semantic similarity
    return 1.0

# Test metric
test_example = train_examples[0]
test_pred = dspy.Prediction(answer="This is a test response")
score = conversation_quality_metric(test_example, test_pred)
print(f"Test metric score: {score}")

Test metric score: 0.3
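
The metric returns graded scores (0.3, 0.5, 1.0). During bootstrapping, DSPy passes a non-None `trace` to the metric, and a common convention is to return a strict pass/fail in that case so that a partial score such as 0.3 is not treated as a successful demonstration. The sketch below shows one way to wrap a graded metric. `with_bootstrap_threshold` and `toy_metric` are hypothetical names, and the 1.0 threshold is an assumption to tune.

```python
from typing import Any, Callable, Optional, Union

def with_bootstrap_threshold(
    metric: Callable[..., float], threshold: float = 1.0
) -> Callable[..., Union[float, bool]]:
    """Wrap a graded metric so it returns a bool whenever a trace is supplied."""
    def wrapped(example: Any, prediction: Any, trace: Optional[Any] = None) -> Union[float, bool]:
        score = metric(example, prediction, trace)
        if trace is None:
            return score           # evaluation: keep the graded score
        return score >= threshold  # bootstrapping: accept only scores at or above the threshold
    return wrapped

# Stand-in metric for demonstration only (not the notebook's metric)
def toy_metric(example, prediction, trace=None):
    return 0.3 if prediction == "wrong language" else 1.0

strict = with_bootstrap_threshold(toy_metric)
print(strict(None, "wrong language"))            # graded score during evaluation
print(strict(None, "wrong language", trace=[]))  # pass/fail during bootstrapping
```

In the notebook this would be applied as `with_bootstrap_threshold(conversation_quality_metric)` when constructing the optimizer. BootstrapFewShot also accepts a `metric_threshold` argument that may serve the same purpose.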


## Baseline Performance (Before Optimization)

In [16]:
# Initialize unoptimized orchestrator
orchestrator = ConversationOrchestrator()

# Evaluate on dev set (sample 10 for speed)
from dspy.evaluate import Evaluate

evaluator = Evaluate(
    devset=dev_examples[:10],
    metric=conversation_quality_metric,
    num_threads=3,
    display_progress=True
)

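print("\n=== Baseline Performance ===")
baseline_result = evaluator(orchestrator)
# Evaluate reports scores on a 0-100 scale; newer DSPy versions wrap the value
# in a result object with a .score attribute, so unwrap it if present
baseline_score = float(getattr(baseline_result, "score", baseline_result))
print(f"Baseline Score: {baseline_score:.2f}%")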

NameError: name 'ConversationOrchestrator' is not defined

## Optimize with BootstrapFewShot

This will create few-shot examples for the ReAct agent to improve performance.

In [10]:
from dspy.teleprompt import BootstrapFewShot

# Configure optimizer
optimizer = BootstrapFewShot(
    metric=conversation_quality_metric,
    max_bootstrapped_demos=8,  # Number of few-shot examples to create
    max_labeled_demos=8,  # Use up to 8 labeled examples
    max_rounds=1  # Single round of optimization
)

print("\n=== Optimizing ConversationOrchestrator ===")
print("This may take several minutes...\n")

# Optimize on training set
optimized_orchestrator = optimizer.compile(
    orchestrator,
    trainset=train_examples[:30]  # Use subset for speed (increase for production)
)

print("\nOptimization complete!")


=== Optimizing ConversationOrchestrator ===
This may take several minutes...



NameError: name 'orchestrator' is not defined

## Evaluate Optimized Model

In [11]:
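print("\n=== Optimized Performance ===")
optimized_result = evaluator(optimized_orchestrator)
# Scores are already on a 0-100 scale, so print them as plain numbers instead of using the % format
optimized_score = float(getattr(optimized_result, "score", optimized_result))
baseline_value = float(getattr(baseline_score, "score", baseline_score))
print(f"Optimized Score: {optimized_score:.2f}%")
print(f"Improvement: {optimized_score - baseline_value:+.2f} points")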


=== Optimized Performance ===


NameError: name 'evaluator' is not defined

## Test on Held-Out Test Set

In [12]:
test_evaluator = Evaluate(
    devset=test_examples,
    metric=conversation_quality_metric,
    num_threads=1,
    display_progress=True
)

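print("\n=== Test Set Performance ===")
test_result = test_evaluator(optimized_orchestrator)
# Evaluate reports scores on a 0-100 scale; unwrap .score if a result object is returned
test_score = float(getattr(test_result, "score", test_result))
print(f"Test Score: {test_score:.2f}%")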

NameError: name 'Evaluate' is not defined

## Save Optimized Model

In [13]:
# Create output directory
output_dir = Path.cwd().parent / "src" / "app" / "optimized" / "ConversationOrchestrator"
output_dir.mkdir(parents=True, exist_ok=True)

# Save optimized model
output_path = output_dir / "optimized_model.json"
optimized_orchestrator.save(str(output_path))

# Save metadata
metadata = {
    "baseline_score": float(baseline_score),
    "optimized_score": float(optimized_score),
    "test_score": float(test_score),
    "training_examples": len(train_examples[:30]),  # optimizer.compile used only this subset
    "optimization_date": pd.Timestamp.now().isoformat(),
    "max_bootstrapped_demos": 8,
    "notes": "Optimized with BootstrapFewShot on ReAct agent"
}

with open(output_dir / "metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)

print(f"\nOptimized model saved to: {output_path}")
print(f"Metadata saved to: {output_dir / 'metadata.json'}")

NameError: name 'optimized_orchestrator' is not defined

## Interactive Testing

In [14]:
def test_orchestrator(user_input: str, page_context: str = ""):
    """Test orchestrator with a query."""
    print(f"\nUser: {user_input}")
    print(f"Context: {page_context or 'None'}")
    print("\n--- Processing ---\n")
    
    # Use the same input names as the training examples (question, previous_conversation, page_context)
    response = optimized_orchestrator(
        question=user_input,
        previous_conversation=[],
        page_context=page_context
    )
    
    print(f"Agent: {response}")
    print("\n" + "="*80)

# Test examples
test_orchestrator("Find rock concerts in Los Angeles")
test_orchestrator("How much is membership?")
test_orchestrator("找音樂會")
test_orchestrator("I want jazz shows this weekend and what's the refund policy?")  # Multi-intent


User: Find rock concerts in Los Angeles
Context: None

--- Processing ---



NameError: name 'optimized_orchestrator' is not defined

## Analysis: Tool Usage Patterns

In [15]:
# Analyze which tools are being called
print("\n=== Tool Usage Analysis ===")
print("Run this after testing several queries to see tool call patterns")
print("You can inspect dspy.settings.trace to see tool calls")

# Note: Full trace analysis would require custom logging in the orchestrator
# For production, add logging to track:
# - Which tools are called for each intent
# - Tool call success rates
# - Average response latency per tool


=== Tool Usage Analysis ===
Run this after testing several queries to see tool call patterns
You can inspect dspy.settings.trace to see tool calls
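
The logging suggested in the comments above could start from something as small as the following sketch. It covers success rates and mean latency per tool, not the per-intent breakdown. `ToolUsageTracker` is a hypothetical class, and the orchestrator would need to call `record` around each tool invocation; nothing here is wired into ConversationOrchestrator.

```python
from collections import defaultdict
from typing import Dict, List

class ToolUsageTracker:
    """Accumulate per-tool call counts, success rates, and mean latency."""

    def __init__(self) -> None:
        self._latencies: Dict[str, List[float]] = defaultdict(list)
        self._successes: Dict[str, int] = defaultdict(int)

    def record(self, tool: str, success: bool, latency_s: float) -> None:
        """Log one tool call with its outcome and wall-clock latency in seconds."""
        self._latencies[tool].append(latency_s)
        if success:
            self._successes[tool] += 1

    def summary(self) -> Dict[str, Dict[str, float]]:
        """Return calls, success rate, and mean latency for each tool seen so far."""
        report = {}
        for tool, latencies in self._latencies.items():
            calls = len(latencies)
            report[tool] = {
                "calls": calls,
                "success_rate": self._successes[tool] / calls,
                "mean_latency_s": sum(latencies) / calls,
            }
        return report

# Tool names below are placeholders, not the orchestrator's actual tools
tracker = ToolUsageTracker()
tracker.record("search_events", success=True, latency_s=0.5)
tracker.record("search_events", success=False, latency_s=1.5)
print(tracker.summary())
```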


## Next Steps

1. **Load optimized model in production**:
   ```python
   orchestrator = ConversationOrchestrator()
   orchestrator.load('path/to/optimized_model.json')
   ```

2. **Collect real user data** and retrain periodically

3. **Optimize guardrails** separately if needed:
   - PreGuardrails: Use existing dataset
   - PostGuardrails: Use generated dataset from notebook 02

4. **Advanced optimization**:
   - Use MIPROv2 for more sophisticated optimization
   - Create LLM-as-judge metric for semantic similarity
   - Add A/B testing framework

5. **Monitor in production**:
   - Track tool usage rates
   - Monitor response quality
   - Collect edge cases for retraining
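
For the A/B testing item, one simple starting point is deterministic, hash-based assignment, so a given user always sees the same variant without storing any state. This is a minimal sketch under stated assumptions: `assign_variant`, the salt string, and the variant names are all hypothetical, and a real rollout would also need exposure logging and outcome metrics.

```python
import hashlib

def assign_variant(user_id: str, optimized_share: float = 0.5, salt: str = "orchestrator-ab-v1") -> str:
    """Deterministically map a user to 'baseline' or 'optimized'."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it into [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "optimized" if bucket < optimized_share else "baseline"

# The same user ID always lands in the same bucket
print(assign_variant("user-123") == assign_variant("user-123"))
```

Changing the salt reshuffles all users, which is useful when starting a new experiment.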