# Multi-Agent Research System Evaluation

This notebook provides comprehensive testing and evaluation of the multi-agent research system using the 40-query evaluation dataset and Arize Phoenix integration.

## Overview
- **Dataset**: 40 queries across 3 complexity levels (Simple, Moderate, Complex)
- **Query Types**: Q&A (30 queries) + Deep Research (10 queries)
- **Evaluation**: Automated scoring via Arize Phoenix evaluators
- **Metrics**: Response quality, latency, token usage, citation accuracy

## 1. Setup and Dependencies

In [None]:
# Install required packages if not already installed
!pip install pandas numpy matplotlib seaborn plotly
!pip install arize-phoenix openai
# asyncio ships with Python, so it is not installed from PyPI
!pip install aiohttp tenacity
!pip install jupyter-widgets ipywidgets

In [None]:
# Core imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as plotly_px  # not "px": that alias is used for phoenix below
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# System imports
import asyncio
import time
import json
from datetime import datetime
from typing import List, Dict, Optional
import warnings
warnings.filterwarnings('ignore')

# Arize Phoenix imports
import phoenix as px
from phoenix.evals import (
    HallucinationEvaluator,
    RelevanceEvaluator,
    QAEvaluator as QACorrectnessEvaluator,  # phoenix.evals exposes this evaluator as QAEvaluator
    run_evals
)

# Project imports (adjust path as needed)
import sys
sys.path.append('..')

from evaluation.evaluation_dataset import (
    to_pandas, to_csv, to_arize_format, create_evaluation_template,
    get_queries_by_complexity, get_queries_by_type, EVALUATION_QUERIES
)

# Agent imports - Complete multi-agent system
try:
    from agents.supervisor import SupervisorAgent
    from agents.search import SearchAgent
    from agents.citation import CitationAgent
    from agents.multi_agents import MultiAgentResearchSystem, initialize_system
    AGENTS_AVAILABLE = True
    print("✅ Complete agent system imported successfully")
except ImportError as e:
    AGENTS_AVAILABLE = False
    print(f"⚠️ Agent modules not yet available: {e}")
    print("📝 This notebook will demonstrate evaluation setup and mock agent responses")

print("📚 All dependencies loaded successfully!")

## 2. Dataset Loading and Exploration

In [None]:
# Load the evaluation dataset
eval_df = to_pandas()
print(f"📊 Loaded {len(eval_df)} evaluation queries")
print(f"Columns: {list(eval_df.columns)}")
print(f"Shape: {eval_df.shape}")

# Display first few rows
eval_df.head()

In [None]:
# Dataset statistics and distribution
print("🔍 Dataset Overview:")
print(f"Total queries: {len(eval_df)}")
print(f"Unique domains: {eval_df['domain'].nunique()}")
print(f"Query types: {eval_df['query_type'].value_counts().to_dict()}")
print(f"Complexity levels: {eval_df['complexity'].value_counts().to_dict()}")
print(f"Require current info: {eval_df['requires_current_info'].sum()}/{len(eval_df)}")
print(f"Average expected sources: {eval_df['expected_sources'].mean():.1f}")
print(f"Average max time: {eval_df['max_time_seconds'].mean():.1f}s")

In [None]:
# Visualize dataset distribution
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Complexity distribution
eval_df['complexity'].value_counts().plot(kind='bar', ax=axes[0,0], color='skyblue')
axes[0,0].set_title('Distribution by Complexity Level')
axes[0,0].set_xlabel('Complexity')
axes[0,0].set_ylabel('Count')

# Query type distribution
eval_df['query_type'].value_counts().plot(kind='pie', ax=axes[0,1], autopct='%1.1f%%')
axes[0,1].set_title('Distribution by Query Type')

# Domain distribution (top 10)
eval_df['domain'].value_counts().head(10).plot(kind='barh', ax=axes[1,0], color='lightcoral')
axes[1,0].set_title('Top 10 Domains')
axes[1,0].set_xlabel('Count')

# Expected sources vs complexity
sns.boxplot(data=eval_df, x='complexity', y='expected_sources', ax=axes[1,1])
axes[1,1].set_title('Expected Sources by Complexity')
axes[1,1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

## 3. Agent Testing Framework

In [None]:
class AgentEvaluationFramework:
    """Framework for testing and evaluating multi-agent system"""
    
    def __init__(self):
        self.results = []
        self.evaluation_df = None
        self.phoenix_session = None
        self.research_system = None
        
    def setup_phoenix(self, launch_app=True):
        """Initialize Phoenix for evaluation tracking"""
        if launch_app:
            self.phoenix_session = px.launch_app()
            print("🔥 Phoenix app launched at http://localhost:6006")
        else:
            print("📊 Phoenix tracking enabled (no UI)")
    
    def initialize_research_system(self):
        """Initialize the multi-agent research system"""
        if AGENTS_AVAILABLE and self.research_system is None:
            try:
                self.research_system = initialize_system()
                print("🤖 Multi-agent research system initialized")
                return True
            except Exception as e:
                print(f"❌ Failed to initialize research system: {e}")
                return False
        elif self.research_system is not None:
            print("🤖 Research system already initialized")
            return True
        else:
            print("⚠️ Agents not available, using mock mode")
            return False
    
    async def test_single_query(self, query_data: Dict, mock_mode=False) -> Dict:
        """Test a single query against the agent system"""
        start_time = time.time()
        
        if mock_mode or not AGENTS_AVAILABLE or self.research_system is None:
            # Mock agent response for demonstration
            await asyncio.sleep(np.random.uniform(0.5, 3.0))  # Simulate processing time
            
            response = {
                'query_id': query_data['id'],
                'input': query_data['query'],
                'output': self._generate_mock_response(query_data),
                'complexity': query_data['complexity'],
                'domain': query_data['domain'],
                'query_type': query_data['query_type'],
                'expected_sources': query_data['expected_sources'],
                'response_time_ms': (time.time() - start_time) * 1000,
                'tokens_used': np.random.randint(100, 1000),
                'model_used': self._get_mock_model(query_data['complexity']),
                'sources_found': np.random.randint(1, query_data['expected_sources'] + 2),
                'citations_count': np.random.randint(1, 5),
                'timestamp': datetime.now(),
                'status': 'completed'  # keep the schema aligned with real-agent results
            }
        else:
            # Real agent testing using the multi-agent system
            try:
                result = await self.research_system.process_query(query_data['query'])
                execution_time_ms = (time.time() - start_time) * 1000
                
                # Extract data from real agent response
                agent_response = result.get('response', '')
                citations = result.get('citations', [])
                
                response = {
                    'query_id': query_data['id'],
                    'input': query_data['query'],
                    'output': agent_response,
                    'complexity': query_data['complexity'],
                    'domain': query_data['domain'],
                    'query_type': query_data['query_type'],
                    'expected_sources': query_data['expected_sources'],
                    'response_time_ms': execution_time_ms,
                    'tokens_used': result.get('total_tokens', 0),
                    'model_used': result.get('model_used', 'gpt-5'),
                    'sources_found': len(citations),
                    'citations_count': len(citations),
                    'timestamp': datetime.now(),
                    'session_id': result.get('session_id', ''),
                    'trace_id': result.get('trace_id', ''),
                    'status': result.get('status', 'completed')
                }
                
                # Add citation details if available
                if citations:
                    response['citation_details'] = [
                        {
                            'url': c.url if hasattr(c, 'url') else str(c),
                            'title': c.title if hasattr(c, 'title') else 'Unknown',
                            'credibility_score': c.credibility_score if hasattr(c, 'credibility_score') else 0.5
                        }
                        for c in citations[:5]  # Limit to top 5 citations
                    ]
                
            except Exception as e:
                print(f"❌ Real agent test failed for query {query_data['id']}: {str(e)}")
                # Record a failed result instead of raising so the batch can continue
                execution_time_ms = (time.time() - start_time) * 1000
                response = {
                    'query_id': query_data['id'],
                    'input': query_data['query'],
                    'output': f"Error processing query: {str(e)}",
                    'complexity': query_data['complexity'],
                    'domain': query_data['domain'],
                    'query_type': query_data['query_type'],
                    'expected_sources': query_data['expected_sources'],
                    'response_time_ms': execution_time_ms,
                    'tokens_used': 0,
                    'model_used': 'error',
                    'sources_found': 0,
                    'citations_count': 0,
                    'timestamp': datetime.now(),
                    'error': str(e),
                    'status': 'failed'
                }
        
        return response
    
    def _generate_mock_response(self, query_data: Dict) -> str:
        """Generate realistic mock responses based on query complexity"""
        complexity = query_data['complexity']
        query_type = query_data['query_type']
        
        if complexity == 'gpt-5-nano':  # Simple
            return f"Mock simple answer to: {query_data['query'][:50]}... [This would be a direct, factual response of 1-2 sentences]"
        elif complexity == 'gpt-5-mini':  # Moderate  
            return f"Mock moderate answer to: {query_data['query'][:50]}... [This would be a comprehensive explanation with 2-3 paragraphs covering key concepts and examples]"
        else:  # Complex
            if query_type == 'research':
                return f"Mock comprehensive research report on: {query_data['query'][:50]}... [This would be a 2-page detailed analysis with multiple sections, current data, examples, and projections]"
            else:
                return f"Mock complex analysis of: {query_data['query'][:50]}... [This would be an in-depth technical explanation with current developments, multiple perspectives, and detailed examples]"
    
    def _get_mock_model(self, complexity: str) -> str:
        """Return appropriate model based on complexity"""
        model_map = {
            'gpt-5-nano': 'gpt-5-nano',
            'gpt-5-mini': 'gpt-5-mini', 
            'gpt-5': 'gpt-5'
        }
        return model_map.get(complexity, 'gpt-5')
    
    async def run_batch_evaluation(self, queries_subset=None, mock_mode=False, max_concurrent=3):
        """Run evaluation on a batch of queries"""
        if queries_subset is None:
            queries_subset = EVALUATION_QUERIES
        
        # Initialize the research system when real agents are requested;
        # fall back to mock mode if the agents are missing or initialization fails
        if not mock_mode and not self.initialize_research_system():
            print("⚠️ Falling back to mock mode because the real system is unavailable")
            mock_mode = True
        
        print(f"🚀 Starting batch evaluation of {len(queries_subset)} queries")
        print(f"   Mode: {'Mock' if mock_mode else 'Real Agents'}, Max concurrent: {max_concurrent}")
        
        # Create semaphore to limit concurrency
        semaphore = asyncio.Semaphore(max_concurrent)
        
        async def bounded_test(query):
            async with semaphore:
                return await self.test_single_query(query.dict(), mock_mode)
        
        # Run tests concurrently with progress tracking
        tasks = [bounded_test(query) for query in queries_subset]
        self.results = []
        
        for task in asyncio.as_completed(tasks):
            result = await task
            self.results.append(result)
            if len(self.results) % 5 == 0 or len(self.results) == len(queries_subset):
                print(f"   ✅ Completed {len(self.results)}/{len(queries_subset)} queries")
        
        print(f"🎉 Batch evaluation completed! {len(self.results)} responses collected")
        
        # Show success/failure summary
        successful = len([r for r in self.results if r.get('status', 'completed') != 'failed'])
        if successful < len(self.results):
            failed = len(self.results) - successful
            print(f"   ⚠️ {successful} successful, {failed} failed")
        
        # Convert to DataFrame for analysis
        self.evaluation_df = pd.DataFrame(self.results)
        return self.evaluation_df
    
    def save_results(self, filepath: str):
        """Save evaluation results to CSV"""
        if self.evaluation_df is not None:
            self.evaluation_df.to_csv(filepath, index=False)
            print(f"💾 Results saved to {filepath}")
        else:
            print("❌ No results to save. Run evaluation first.")
    
    def get_system_stats(self):
        """Get system statistics if real system is available"""
        if self.research_system is not None:
            return self.research_system.get_system_stats()
        else:
            return {"message": "Real system not available"}

# Initialize the evaluation framework
evaluator = AgentEvaluationFramework()
print("🔧 Agent evaluation framework initialized")
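
The concurrency limit in `run_batch_evaluation` comes from wrapping each query in an `asyncio.Semaphore`. The following is a minimal, standard-library-only sketch of that pattern in isolation. `fake_query` and `run_bounded` are hypothetical names, and the short sleep is only a stand-in for agent latency.

```python
import asyncio


async def fake_query(query_id: int, delay: float) -> dict:
    """Hypothetical stand-in for test_single_query: sleep, then return a record."""
    await asyncio.sleep(delay)
    return {"query_id": query_id, "status": "completed"}


async def run_bounded(delays, max_concurrent: int = 3):
    """Run one fake query per delay with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0  # highest number of queries observed running at the same time

    async def bounded(query_id: int, delay: float) -> dict:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_query(query_id, delay)
            finally:
                in_flight -= 1

    tasks = [bounded(i, d) for i, d in enumerate(delays)]
    results = []
    # as_completed yields results in finish order, not submission order
    for finished in asyncio.as_completed(tasks):
        results.append(await finished)
    return results, peak


if __name__ == "__main__":
    results, peak = asyncio.run(run_bounded([0.01] * 10, max_concurrent=3))
    print(f"{len(results)} results, peak concurrency {peak}")
```

Inside a notebook cell an event loop is already running, so call `await run_bounded(...)` there instead of `asyncio.run(...)`.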

## 4. Quick Test with Sample Queries

In [None]:
# Test with a small subset first (one from each complexity level)
sample_queries = [
    EVALUATION_QUERIES[0],  # Simple
    EVALUATION_QUERIES[10], # Moderate 
    EVALUATION_QUERIES[20], # Complex Q&A
    EVALUATION_QUERIES[30], # Deep Research
]

print("🧪 Running quick test with 4 sample queries...")
print("   Using real agents if available, otherwise mock responses")

# Request real agents; the framework falls back to mock mode when they are unavailable
sample_results = await evaluator.run_batch_evaluation(sample_queries, mock_mode=False, max_concurrent=2)

# Display results
display_cols = ['query_id', 'complexity', 'domain', 'response_time_ms', 'tokens_used', 'sources_found', 'status']
print("\n📊 Sample Results:")
sample_results[display_cols].round(2)

## 5. Full Evaluation Suite

In [None]:
# Run full evaluation with real agents if available
print("🚀 Starting FULL evaluation of all 40 queries...")
print("⏱️ This may take 10-20 minutes with real agents (mock mode finishes in well under a minute)")

# First check if agents are available and working
if AGENTS_AVAILABLE:
    print("🤖 Attempting to use real multi-agent system...")
    full_results = await evaluator.run_batch_evaluation(EVALUATION_QUERIES, mock_mode=False, max_concurrent=3)
else:
    print("🎭 Using mock mode for demonstration...")
    full_results = await evaluator.run_batch_evaluation(EVALUATION_QUERIES, mock_mode=True, max_concurrent=5)

print(f"\n🎉 Full evaluation completed! {len(full_results)} responses generated")

# Show system statistics if real system was used
if not full_results.empty and evaluator.research_system is not None:
    print("\n🔍 Real System Statistics:")
    stats = evaluator.get_system_stats()
    if 'system_info' in stats:
        print(f"  System Version: {stats['system_info']['version']}")
        print(f"  Agents Count: {stats['system_info']['agents_count']}")
        print(f"  Success Rate: {stats['session_stats']['success_rate']:.2%}")
        print(f"  Total Sessions: {stats['session_stats']['total_sessions']}")
    else:
        print("  System stats:", stats)

In [None]:
# Save results for future analysis
evaluator.save_results('agent_evaluation_results.csv')

# Display summary statistics
print("📈 Evaluation Results Summary:")
print(f"Total queries processed: {len(full_results)}")
print(f"Average response time: {full_results['response_time_ms'].mean():.1f}ms")
print(f"Average tokens used: {full_results['tokens_used'].mean():.0f}")
print(f"Average sources found: {full_results['sources_found'].mean():.1f}")
print(f"Model distribution: {full_results['model_used'].value_counts().to_dict()}")

# Show first few complete results
full_results.head()

## 6. Arize Phoenix Quality Evaluation

In [None]:
# Setup Phoenix evaluation
evaluator.setup_phoenix(launch_app=True)

# Prepare data for Phoenix evaluators
phoenix_df = full_results.copy()

# Add reference answers for comparison (in real scenario, these would be expert-verified)
# For demo, we'll use simplified mock references
phoenix_df['reference'] = phoenix_df.apply(lambda row: f"Reference answer for {row['input'][:30]}...", axis=1)

print("🔍 Setting up Phoenix evaluators...")
print(f"Prepared {len(phoenix_df)} responses for evaluation")

In [None]:
# Run Phoenix evaluations
try:
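    # Initialize evaluators; each Phoenix LLM evaluator takes a judge model.
    # Assumptions: OPENAI_API_KEY is set, and "gpt-4o" is only an example judge model.
    from phoenix.evals import OpenAIModel
    judge_model = OpenAIModel(model="gpt-4o")
    hallucination_evaluator = HallucinationEvaluator(judge_model)
    relevance_evaluator = RelevanceEvaluator(judge_model)
    qa_evaluator = QACorrectnessEvaluator(judge_model)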
    
    print("🔄 Running Phoenix evaluations...")
    
    # Run evaluations (this requires the columns 'input', 'output', 'reference')
    eval_results = run_evals(
        dataframe=phoenix_df,
        evaluators=[hallucination_evaluator, relevance_evaluator, qa_evaluator],
        provide_explanation=True
    )
    
    print("✅ Phoenix evaluations completed!")
    
    # Combine results (a distinct loop variable keeps the dataset's eval_df from being overwritten)
    evaluator_names = ['hallucination', 'relevance', 'qa_correctness']
    for evaluator_name, result_df in zip(evaluator_names, eval_results):
        phoenix_df[f'{evaluator_name}_score'] = result_df['score']
        phoenix_df[f'{evaluator_name}_label'] = result_df['label']
        phoenix_df[f'{evaluator_name}_explanation'] = result_df['explanation']
    
    print("📊 Evaluation scores added to results DataFrame")
    
except Exception as e:
    print(f"⚠️ Phoenix evaluation failed (likely due to API setup): {e}")
    print("💡 Adding mock evaluation scores for demonstration...")
    
    # Add mock evaluation scores
    np.random.seed(42)  # For reproducible mock data
    phoenix_df['hallucination_score'] = np.random.uniform(0.7, 1.0, len(phoenix_df))
    phoenix_df['relevance_score'] = np.random.uniform(0.6, 1.0, len(phoenix_df))
    phoenix_df['qa_correctness_score'] = np.random.uniform(0.5, 0.95, len(phoenix_df))
    
    phoenix_df['hallucination_label'] = phoenix_df['hallucination_score'].apply(lambda x: 'factual' if x > 0.8 else 'hallucinated')
    phoenix_df['relevance_label'] = phoenix_df['relevance_score'].apply(lambda x: 'relevant' if x > 0.7 else 'irrelevant')
    phoenix_df['qa_correctness_label'] = phoenix_df['qa_correctness_score'].apply(lambda x: 'correct' if x > 0.7 else 'incorrect')

## 7. Results Analysis and Visualization

In [None]:
# Performance analysis by complexity
performance_by_complexity = phoenix_df.groupby('complexity').agg({
    'response_time_ms': ['mean', 'std'],
    'tokens_used': ['mean', 'std'], 
    'sources_found': ['mean', 'std'],
    'hallucination_score': 'mean',
    'relevance_score': 'mean',
    'qa_correctness_score': 'mean'
}).round(2)

print("📊 Performance by Complexity Level:")
performance_by_complexity

In [None]:
# Create comprehensive visualization dashboard
fig = make_subplots(
    rows=3, cols=2,
    subplot_titles=[
        'Response Time by Complexity',
        'Token Usage by Complexity',
        'Quality Scores by Complexity',
        'Sources Found vs Expected',
        'Quality Score Distribution',
        'Performance vs Quality Correlation'
    ],
    specs=[[{"secondary_y": False}, {"secondary_y": False}],
           [{"secondary_y": False}, {"secondary_y": False}],
           [{"secondary_y": False}, {"secondary_y": False}]]
)

# Response time by complexity
for complexity in phoenix_df['complexity'].unique():
    data = phoenix_df[phoenix_df['complexity'] == complexity]['response_time_ms']
    fig.add_trace(
        go.Box(y=data, name=complexity, showlegend=False),
        row=1, col=1
    )

# Token usage by complexity
for complexity in phoenix_df['complexity'].unique():
    data = phoenix_df[phoenix_df['complexity'] == complexity]['tokens_used']
    fig.add_trace(
        go.Box(y=data, name=complexity, showlegend=False),
        row=1, col=2
    )

# Quality scores by complexity
quality_cols = ['hallucination_score', 'relevance_score', 'qa_correctness_score']
complexity_means = phoenix_df.groupby('complexity')[quality_cols].mean()

for i, score_type in enumerate(quality_cols):
    fig.add_trace(
        go.Bar(
            x=complexity_means.index,
            y=complexity_means[score_type],
            name=score_type.replace('_score', ''),
            showlegend=True  # label all three quality metrics in the legend
        ),
        row=2, col=1
    )

# Sources found vs expected
fig.add_trace(
    go.Scatter(
        x=phoenix_df['expected_sources'],
        y=phoenix_df['sources_found'],
        mode='markers',
        marker=dict(color=phoenix_df['complexity'].astype('category').cat.codes, colorscale='viridis'),
        showlegend=False
    ),
    row=2, col=2
)

# Add diagonal line for perfect source matching
max_sources = max(phoenix_df['expected_sources'].max(), phoenix_df['sources_found'].max())
fig.add_trace(
    go.Scatter(
        x=[0, max_sources],
        y=[0, max_sources],
        mode='lines',
        line=dict(dash='dash', color='red'),
        name='Perfect Match',
        showlegend=False
    ),
    row=2, col=2
)

# Quality score distributions
for score_type in quality_cols:
    fig.add_trace(
        go.Histogram(
            x=phoenix_df[score_type],
            name=score_type.replace('_score', ''),
            opacity=0.7,
            showlegend=False
        ),
        row=3, col=1
    )

# Performance vs Quality correlation
fig.add_trace(
    go.Scatter(
        x=phoenix_df['response_time_ms'],
        y=phoenix_df['qa_correctness_score'],
        mode='markers',
        marker=dict(size=phoenix_df['tokens_used']/50, opacity=0.6),
        showlegend=False
    ),
    row=3, col=2
)

# Update layout
fig.update_layout(
    height=1200,
    title_text="Multi-Agent System Evaluation Dashboard",
    showlegend=True
)

fig.show()

In [None]:
# Quality analysis summary
print("🎯 Quality Analysis Summary:")
print("=" * 50)

print(f"\n📈 Overall Performance:")
print(f"  Average Hallucination Score: {phoenix_df['hallucination_score'].mean():.3f} (higher = better)")
print(f"  Average Relevance Score: {phoenix_df['relevance_score'].mean():.3f} (higher = better)")
print(f"  Average QA Correctness: {phoenix_df['qa_correctness_score'].mean():.3f} (higher = better)")

print(f"\n⚡ Performance Metrics:")
print(f"  Average Response Time: {phoenix_df['response_time_ms'].mean():.0f}ms")
print(f"  P95 Response Time: {phoenix_df['response_time_ms'].quantile(0.95):.0f}ms")
print(f"  Average Token Usage: {phoenix_df['tokens_used'].mean():.0f}")

print(f"\n🎯 Accuracy Metrics:")
factual_responses = (phoenix_df['hallucination_label'] == 'factual').sum()
relevant_responses = (phoenix_df['relevance_label'] == 'relevant').sum()
correct_responses = (phoenix_df['qa_correctness_label'] == 'correct').sum()

print(f"  Factual Responses: {factual_responses}/{len(phoenix_df)} ({factual_responses/len(phoenix_df)*100:.1f}%)")
print(f"  Relevant Responses: {relevant_responses}/{len(phoenix_df)} ({relevant_responses/len(phoenix_df)*100:.1f}%)")
print(f"  Correct Responses: {correct_responses}/{len(phoenix_df)} ({correct_responses/len(phoenix_df)*100:.1f}%)")

# Performance targets from CLAUDE.md
print(f"\n🎯 Target Achievement:")
p95_latency_simple = phoenix_df[phoenix_df['complexity'] == 'gpt-5-nano']['response_time_ms'].quantile(0.95)
p95_latency_complex = phoenix_df[phoenix_df['complexity'] == 'gpt-5']['response_time_ms'].quantile(0.95)

print(f"  P95 Simple Query Latency: {p95_latency_simple:.0f}ms (Target: <3000ms) {'✅' if p95_latency_simple < 3000 else '❌'}")
print(f"  P95 Complex Query Latency: {p95_latency_complex:.0f}ms (Target: <10000ms) {'✅' if p95_latency_complex < 10000 else '❌'}")
print(f"  Overall QA Accuracy: {phoenix_df['qa_correctness_score'].mean()*100:.1f}% (Target: >90%) {'✅' if phoenix_df['qa_correctness_score'].mean() > 0.9 else '❌'}")
print(f"  Factual Accuracy: {factual_responses/len(phoenix_df)*100:.1f}% (Target: >95%) {'✅' if factual_responses/len(phoenix_df) > 0.95 else '❌'}")

## 8. Export Results for Further Analysis

In [None]:
# Save comprehensive results
final_results_path = 'comprehensive_agent_evaluation.csv'
phoenix_df.to_csv(final_results_path, index=False)
print(f"💾 Comprehensive results saved to {final_results_path}")

# Create summary report
summary_stats = {
    'total_queries': len(phoenix_df),
    'avg_response_time_ms': phoenix_df['response_time_ms'].mean(),
    'p95_response_time_ms': phoenix_df['response_time_ms'].quantile(0.95),
    'avg_tokens_used': phoenix_df['tokens_used'].mean(),
    'avg_hallucination_score': phoenix_df['hallucination_score'].mean(),
    'avg_relevance_score': phoenix_df['relevance_score'].mean(),
    'avg_qa_correctness_score': phoenix_df['qa_correctness_score'].mean(),
    'factual_response_rate': factual_responses/len(phoenix_df),
    'relevant_response_rate': relevant_responses/len(phoenix_df),
    'correct_response_rate': correct_responses/len(phoenix_df),
    'evaluation_timestamp': datetime.now().isoformat()
}

with open('evaluation_summary.json', 'w') as f:
    json.dump(summary_stats, f, indent=2)
    
print("📋 Summary report saved to evaluation_summary.json")
print("\n✅ Evaluation complete! Files saved:")
print(f"  - {final_results_path} (detailed results)")
print(f"  - evaluation_summary.json (summary metrics)")
print(f"  - agent_evaluation_results.csv (raw agent responses)")

## 9. Next Steps

This notebook provides a comprehensive framework for testing and evaluating the multi-agent research system. Here's what you can do next:

### 🔄 Continuous Evaluation
- **Run this notebook regularly** as you develop and refine the agent system
- **Compare results over time** to track improvements
- **A/B test different agent configurations** using the same evaluation dataset
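
One way to compare results over time is to diff two of the `evaluation_summary.json` files written in section 8. The sketch below assumes both summaries use the keys from that section. `compare_summaries` is a hypothetical helper, and the sample numbers are made up for illustration.

```python
from typing import Dict

# Keys match those written to evaluation_summary.json in section 8.
HIGHER_IS_BETTER = {
    "avg_hallucination_score",  # the notebook reports this as higher = better
    "avg_relevance_score",
    "avg_qa_correctness_score",
    "factual_response_rate",
    "relevant_response_rate",
    "correct_response_rate",
}
LOWER_IS_BETTER = {"avg_response_time_ms", "p95_response_time_ms", "avg_tokens_used"}


def compare_summaries(baseline: dict, current: dict) -> Dict[str, dict]:
    """Return baseline, current, delta and an improved flag for each shared metric."""
    report = {}
    for key in sorted(HIGHER_IS_BETTER | LOWER_IS_BETTER):
        if key not in baseline or key not in current:
            continue  # metric missing from one of the runs
        delta = current[key] - baseline[key]
        improved = delta > 0 if key in HIGHER_IS_BETTER else delta < 0
        report[key] = {
            "baseline": baseline[key],
            "current": current[key],
            "delta": delta,
            "improved": improved,
        }
    return report


if __name__ == "__main__":
    # Made-up numbers; in practice, json.load two saved summary files instead.
    baseline = {"avg_response_time_ms": 2400.0, "avg_qa_correctness_score": 0.72}
    current = {"avg_response_time_ms": 2100.0, "avg_qa_correctness_score": 0.70}
    for name, row in compare_summaries(baseline, current).items():
        verdict = "better" if row["improved"] else "not better"
        print(f"{name}: {row['baseline']} -> {row['current']} ({verdict})")
```

Non-metric keys such as `total_queries` and `evaluation_timestamp` are skipped on purpose, because neither direction counts as an improvement.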

### 📊 Analysis Extensions  
- **Add custom evaluation metrics** specific to your use case
- **Implement ground truth comparison** for more accurate quality assessment
- **Add cost analysis** to track token usage and API costs
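
A cost analysis can start from the `model_used` and `tokens_used` fields that each result record already carries. The sketch below is one possible approach, not a pricing reference. `estimate_costs` is a hypothetical helper, and the rates in `PRICE_PER_1K_TOKENS` are placeholders that must be replaced with your provider's actual prices.

```python
from collections import defaultdict

# PLACEHOLDER rates in USD per 1,000 tokens. These are not real prices.
PRICE_PER_1K_TOKENS = {
    "gpt-5-nano": 0.001,
    "gpt-5-mini": 0.01,
    "gpt-5": 0.1,
}


def estimate_costs(results, prices=PRICE_PER_1K_TOKENS) -> dict:
    """Sum the estimated cost per model from records with model_used and tokens_used."""
    totals = defaultdict(float)
    for record in results:
        rate = prices.get(record["model_used"])
        if rate is None:
            # Unknown model, e.g. the 'error' rows written for failed queries
            continue
        totals[record["model_used"]] += record["tokens_used"] / 1000 * rate
    return dict(totals)


if __name__ == "__main__":
    # Made-up records shaped like the entries in evaluator.results
    sample = [
        {"model_used": "gpt-5", "tokens_used": 2000},
        {"model_used": "gpt-5-nano", "tokens_used": 500},
        {"model_used": "error", "tokens_used": 0},
    ]
    for model, cost in estimate_costs(sample).items():
        print(f"{model}: ${cost:.4f}")
```

A single blended rate per model ignores the usual price difference between input and output tokens. A more accurate estimate would need the agent system to report both counts separately.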

### 🚀 Integration
- **Connect to the real agent system** by passing `mock_mode=False` to `run_batch_evaluation`
- **Set up automated evaluation pipeline** for CI/CD integration  
- **Create alerting** for performance regressions
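
For a CI gate or a simple alert, the targets printed in section 7 can be checked programmatically. The sketch below assumes a flat metrics dictionary. The two latency key names are hypothetical and mirror the notebook's `p95_latency_simple` and `p95_latency_complex` variables, and `find_violations` is an illustrative helper rather than part of the framework.

```python
# (metric name, kind, threshold). Thresholds mirror the targets printed in section 7.
TARGETS = [
    ("p95_latency_simple_ms", "max", 3000),
    ("p95_latency_complex_ms", "max", 10000),
    ("avg_qa_correctness_score", "min", 0.90),
    ("factual_response_rate", "min", 0.95),
]


def find_violations(metrics: dict) -> list:
    """Return one message per target that is missing or not met."""
    violations = []
    for name, kind, threshold in TARGETS:
        value = metrics.get(name)
        if value is None:
            violations.append(f"{name}: missing")
        elif kind == "max" and not value < threshold:
            # "not <" also flags NaN, which fails every comparison
            violations.append(f"{name}: {value} is not below {threshold}")
        elif kind == "min" and not value > threshold:
            violations.append(f"{name}: {value} is not above {threshold}")
    return violations


def gate(metrics: dict) -> int:
    """Print violations and return a process exit code (0 = all targets met)."""
    violations = find_violations(metrics)
    for message in violations:
        print(f"REGRESSION {message}")
    return 1 if violations else 0
```

In a CI script, passing the return value of `gate(...)` to `sys.exit` fails the job whenever a target is missed.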

### 📈 Phoenix Integration
- **Upload results to Phoenix** for persistent monitoring
- **Set up real-time evaluation** for production monitoring
- **Create custom evaluators** for domain-specific quality metrics
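
A domain-specific metric does not have to be an LLM judge. As a starting point, the sketch below scores how many of the expected sources a response actually cited, using the `sources_found` and `expected_sources` fields from the results. It is a plain Python function, not a Phoenix evaluator class. `source_coverage` and its `threshold` default are assumptions for illustration.

```python
def source_coverage(record, threshold: float = 0.8) -> dict:
    """Score the fraction of expected sources that were found, capped at 1.0."""
    expected = record["expected_sources"]
    found = record["sources_found"]
    if expected <= 0:
        # Nothing was expected, so any response counts as full coverage
        score = 1.0
    else:
        score = min(found / expected, 1.0)
    label = "sufficient" if score >= threshold else "insufficient"
    # Same score/label shape as the evaluation columns used in this notebook
    return {"score": score, "label": label}


if __name__ == "__main__":
    # Made-up record for illustration
    print(source_coverage({"expected_sources": 4, "sources_found": 2}))
```

Because a pandas row supports the same key lookups, the function should also work with `phoenix_df.apply(lambda row: source_coverage(row)["score"], axis=1)` to add a `source_coverage_score` column.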

The evaluation framework is now ready to support the development and optimization of your multi-agent research system!