# Agent Comparison Analysis

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/youdotcom-oss/web-search-agent-evals/blob/main/notebooks/comparison.ipynb)

Visualize comparison results from weighted and statistical analysis of web search agent evaluations.

## What This Analyzes

This notebook visualizes pre-computed comparison metrics from:
- **Weighted Strategy**: Balances quality (70%), latency (20%), reliability (10%)
- **Statistical Strategy**: Bootstrap sampling with significance testing (p<0.05)
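
As a rough illustration of the weighted strategy, the sketch below combines three components using the 70/20/10 split listed above. It assumes each component has already been normalized to the range [0, 1]; the function name `weighted_score` and the normalization are hypothetical and may not match how the comparison tool actually computes its score.

```python
# Weights taken from the strategy description; everything else is an assumption.
WEIGHTS = {"quality": 0.7, "latency": 0.2, "reliability": 0.1}


def weighted_score(quality: float, latency_score: float, reliability: float) -> float:
    """Combine three components, each assumed to be pre-normalized to [0, 1].

    `latency_score` is assumed to be "higher is better" (e.g. 1.0 = fastest).
    """
    components = {
        "quality": quality,
        "latency": latency_score,
        "reliability": reliability,
    }
    return sum(WEIGHTS[name] * value for name, value in components.items())


# A config with good quality, middling latency, and perfect reliability
example = weighted_score(quality=0.8, latency_score=0.5, reliability=1.0)
print(f"Weighted score: {example:.2f}")
```

Because quality carries 70% of the weight, a small quality gap can outweigh a large latency gap under this scheme.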

**Agents Compared**: Claude Code, Gemini, Droid, Codex  
**Search Tools**: builtin, You.com MCP  
**Configurations**: 8 total (4 agents × 2 tools)

## Quick Navigation

1. [Setup & Data Loading](#setup)
2. [Overall Rankings](#rankings)
3. [Quality vs Performance](#quality-perf)
4. [Head-to-Head Matrix](#head-to-head)
5. [Statistical Significance](#significance)
6. [Search Provider Comparison](#provider)
7. [Pass Rate Analysis](#pass-rates)

In [None]:
# Cell 1: Colab Setup (auto-detects environment)
import os
from pathlib import Path

# Detect if running in Colab
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    print("🔧 Running in Google Colab - cloning repository...")
    
    # Clone repository if not already present
    repo_dir = Path('/content/web-search-agent-evals')
    if not repo_dir.exists():
        !git clone https://github.com/youdotcom-oss/web-search-agent-evals.git /content/web-search-agent-evals
        print("✓ Repository cloned")
    else:
        print("✓ Repository already exists")
        # Pull latest changes
        %cd /content/web-search-agent-evals
        !git pull origin main
    
    # Change to repo directory
    %cd /content/web-search-agent-evals
    print(f"✓ Working directory: {Path.cwd()}")
else:
    print("✓ Running locally")

In [None]:
# Cell 2: Dependencies & Configuration
import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from pathlib import Path

# Configure plotting
sns.set_style('whitegrid')
plt.rcParams['figure.dpi'] = 100
plt.rcParams['figure.figsize'] = (12, 6)

# Find project root
PROJECT_ROOT = Path.cwd()
if PROJECT_ROOT.name == 'notebooks':
    PROJECT_ROOT = PROJECT_ROOT.parent

DATA_DIR = PROJECT_ROOT / 'data'
print(f"📁 Project root: {PROJECT_ROOT}")
print(f"📊 Data directory: {DATA_DIR}")

# Verify data directory exists
if not DATA_DIR.exists():
    raise FileNotFoundError(f"Data directory not found: {DATA_DIR}")

In [None]:
# Cell 3: Configuration - Choose Dataset Mode

# =====================================
# USER CONFIGURATION
# =====================================
MODE = 'test'  # Options: 'test' or 'full'
RUN_DATE = None  # For full mode: '2026-01-24' or None for latest
# =====================================

print(f"📊 MODE: {MODE.upper()}")
print("="*70)

if MODE == 'test':
    comp_dir = DATA_DIR / 'comparisons' / 'test-runs'
    print("Loading test mode comparisons (5 prompts, rapid iteration)")
elif MODE == 'full':
    if RUN_DATE:
        comp_dir = DATA_DIR / 'comparisons' / 'runs' / RUN_DATE
        print(f"Loading full run: {RUN_DATE}")
    else:
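        # Find most recent run
        runs_dir = DATA_DIR / 'comparisons' / 'runs'
        # Date-named directories (YYYY-MM-DD) sort chronologically as strings
        run_dates = sorted(d.name for d in runs_dir.iterdir() if d.is_dir()) if runs_dir.exists() else []
        if not run_dates:
            # Covers both a missing runs directory and one with no run subdirectories
            raise FileNotFoundError(f"No full runs found in {runs_dir}")
        latest_date = run_dates[-1]
        comp_dir = runs_dir / latest_date
        print(f"Loading latest full run: {latest_date}")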
else:
    raise ValueError(f"Invalid MODE: {MODE}. Must be 'test' or 'full'")

print(f"Comparison directory: {comp_dir}")
print("="*70)

<a id='setup'></a>
## Load Comparison Data

In [None]:
# Cell 4: Load Weighted Comparison
weighted_file = comp_dir / 'all-weighted.json'

if not weighted_file.exists():
    raise FileNotFoundError(f"Weighted comparison not found: {weighted_file}\n"
                            f"Run: bun run compare --mode {MODE}")

with open(weighted_file) as f:
    weighted = json.load(f)

print("✓ Loaded weighted comparison")
print(f"  Configurations: {len(weighted['quality'])}")
print(f"  Strategy: {weighted['meta']['strategy']}")
print(f"  Timestamp: {weighted['meta']['timestamp']}")

In [None]:
# Cell 5: Load Statistical Comparison
statistical_file = comp_dir / 'all-statistical.json'

if statistical_file.exists():
    with open(statistical_file) as f:
        statistical = json.load(f)
    print("✓ Loaded statistical comparison")
    print(f"  Bootstrap iterations: {statistical['meta'].get('bootstrapIterations', 'N/A')}")
    HAS_STATISTICAL = True
else:
    print("⚠️  No statistical comparison found")
    print(f"   Run: bun run compare --mode {MODE} --strategy statistical")
    statistical = None
    HAS_STATISTICAL = False

In [None]:
# Cell 6: Prepare DataFrames

# Extract rankings from weighted comparison
rankings = []
for config, metrics in weighted['quality'].items():
    rankings.append({
        'config': config,
        'avgScore': metrics['avgScore'],
        'passRate': metrics['passRate'],
        'passCount': metrics['passCount'],
        'failCount': metrics['failCount'],
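        # Assumes names end with the provider ('<agent>-<provider>'); splitting on the
        # last hyphen keeps hyphenated agent names intact
        'agent': config.rsplit('-', 1)[0] if '-' in config else config,
        'provider': config.rsplit('-', 1)[1] if '-' in config else 'unknown'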
    })

rankings_df = pd.DataFrame(rankings).sort_values('avgScore', ascending=False)
rankings_df['rank'] = range(1, len(rankings_df) + 1)

# Extract performance metrics
perf_data = []
for config, metrics in weighted['performance'].items():
    perf_data.append({
        'config': config,
        'p50_latency': metrics['latency']['p50'],
        'p90_latency': metrics['latency']['p90'],
        'p99_latency': metrics['latency']['p99']
    })

perf_df = pd.DataFrame(perf_data)

# Merge quality and performance
full_df = rankings_df.merge(perf_df, on='config')

print("✓ Prepared analysis dataframes")
print(f"  {len(rankings_df)} configurations analyzed")

<a id='rankings'></a>
## Overall Rankings

In [None]:
# Cell 7: Rankings Bar Chart
fig, ax = plt.subplots(figsize=(12, 6))

# Color code: top 3 green, rest blue
colors = ['#2ecc71' if i < 3 else '#3498db' for i in range(len(rankings_df))]
bars = ax.barh(rankings_df['config'], rankings_df['avgScore'] * 100, color=colors)

ax.set_xlabel('Average Score (%)', fontsize=12)
ax.set_ylabel('Configuration', fontsize=12)
ax.set_title(f'Agent Rankings by Quality Score ({MODE.upper()} mode)', fontsize=14, fontweight='bold')
ax.axvline(x=65, color='red', linestyle='--', alpha=0.5, label='Pass Threshold (65%)')
ax.legend()

# Add score labels
for bar, score in zip(bars, rankings_df['avgScore']):
    ax.text(score * 100 + 1, bar.get_y() + bar.get_height()/2,
            f"{score*100:.1f}%", va='center', fontsize=10)

plt.tight_layout()
plt.show()

# Print top 3
print("\n🏆 TOP 3 CONFIGURATIONS")
print("="*70)
for _, row in rankings_df.head(3).iterrows():
    print(f"#{int(row['rank'])} {row['config']}")
    print(f"   Score: {row['avgScore']:.1%} | Pass Rate: {row['passRate']:.1%} ({row['passCount']}/{row['passCount']+row['failCount']})\n")

<a id='quality-perf'></a>
## Quality vs Performance

In [None]:
# Cell 8: Quality vs Latency Scatter Plot
fig, ax = plt.subplots(figsize=(12, 8))

# Create scatter plot with provider-based colors
providers = full_df['provider'].unique()
colors_map = {'builtin': '#3498db', 'you': '#e74c3c'}

for provider in providers:
    df_subset = full_df[full_df['provider'] == provider]
    ax.scatter(df_subset['p50_latency'], df_subset['avgScore'] * 100,
               s=150, alpha=0.6, label=provider,
               color=colors_map.get(provider, '#95a5a6'))

# Add labels for each point
for _, row in full_df.iterrows():
    ax.annotate(row['agent'], 
                (row['p50_latency'], row['avgScore'] * 100),
                textcoords="offset points", xytext=(0,10), ha='center',
                fontsize=9)

ax.set_xlabel('Median Latency (ms)', fontsize=12)
ax.set_ylabel('Quality Score (%)', fontsize=12)
ax.set_title('Quality vs Performance Tradeoff', fontsize=14, fontweight='bold')
ax.axhline(y=65, color='red', linestyle='--', alpha=0.3, label='Pass Threshold')
ax.legend(title='Search Provider', fontsize=10)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Ideal: Top-left quadrant (high quality, low latency)")

<a id='head-to-head'></a>
## Head-to-Head Comparison
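
The matrix in this section is built from pairwise win/loss/tie records, with a tie counted as half a win. A minimal sketch of that calculation follows; the helper name `win_rate` is hypothetical, and returning NaN for a pair with no recorded matches is a choice made for this sketch.

```python
import math


def win_rate(wins: int, losses: int, ties: int) -> float:
    """Win rate for one side of a pairing, counting each tie as half a win."""
    total = wins + losses + ties
    if total == 0:
        return math.nan  # No matches recorded for this pair
    return (wins + 0.5 * ties) / total


# Config A beat config B on 3 prompts, lost on 1, and tied on 1
a_vs_b = win_rate(wins=3, losses=1, ties=1)
# The same record seen from B's side
b_vs_a = win_rate(wins=1, losses=3, ties=1)
print(f"A vs B: {a_vs_b:.2f}, B vs A: {b_vs_a:.2f}")
```

Counting ties as half a win keeps the two sides of a pairing complementary: the two rates always sum to 1.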

In [None]:
# Cell 9: Head-to-Head Win Rate Matrix
pairwise = weighted['headToHead']['pairwise']
configs = rankings_df['config'].tolist()

# Build win rate matrix; NaN marks pairs with no recorded result so they
# render as blank cells instead of a misleading 0.00
n = len(configs)
win_matrix = np.full((n, n), np.nan)

for i, config_a in enumerate(configs):
    for j, config_b in enumerate(configs):
        if i == j:
            win_matrix[i, j] = 0.5  # Diagonal: a config ties with itself
        else:
            key = f"{config_a} vs {config_b}"
            if key in pairwise:
                record = pairwise[key]
                total = record['wins'] + record['losses'] + record['ties']
                # Ties count as half a win
                if total > 0:
                    win_matrix[i, j] = (record['wins'] + 0.5 * record['ties']) / total

# Plot heatmap
fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(win_matrix, annot=True, fmt='.2f', cmap='RdYlGn', center=0.5,
            xticklabels=configs, yticklabels=configs, ax=ax,
            cbar_kws={'label': 'Win Rate'}, vmin=0, vmax=1)

ax.set_title('Head-to-Head Win Rate Matrix\n(Row vs Column)', fontsize=14, fontweight='bold')
ax.set_xlabel('Opponent', fontsize=12)
ax.set_ylabel('Agent', fontsize=12)

plt.tight_layout()
plt.show()

print("\n💡 How to read: Each cell shows win rate of row agent vs column agent")
print("   Green (>0.5) = Row agent wins more often")
print("   Red (<0.5) = Column agent wins more often")

<a id='significance'></a>
## Statistical Significance
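
The confidence intervals plotted in this section are pre-computed by the comparison tool. For intuition, here is a minimal sketch of a percentile bootstrap for the mean score. The function name `bootstrap_ci`, its defaults, and the sample scores are illustrative assumptions, and the tool's actual resampling procedure may differ.

```python
import numpy as np


def bootstrap_ci(scores, n_iterations=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Resample with replacement: one row per bootstrap iteration
    samples = rng.choice(scores, size=(n_iterations, scores.size), replace=True)
    means = samples.mean(axis=1)
    # The middle (1 - alpha) share of resampled means forms the interval
    lower, upper = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)


# Made-up per-prompt scores for a single configuration
scores = [0.4, 0.6, 0.7, 0.9, 0.5, 0.8]
lower, upper = bootstrap_ci(scores)
print(f"Mean: {np.mean(scores):.2f}, 95% CI: [{lower:.2f}, {upper:.2f}]")
```

With only a handful of prompts per configuration, as in test mode, the interval will be wide, which is why small score differences between configurations should be read with caution.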

In [None]:
# Cell 10: Confidence Intervals (if statistical data available)
if HAS_STATISTICAL:
    ci_data = []
    for config, metrics in statistical['quality'].items():
        ci = metrics['confidenceInterval']
        ci_data.append({
            'config': config,
            'mean': metrics['avgScore'],
            'ci_lower': ci['lower'],
            'ci_upper': ci['upper'],
            'ci_width': ci['upper'] - ci['lower']
        })
    
    ci_df = pd.DataFrame(ci_data).sort_values('mean', ascending=False)
    
    # Plot confidence intervals
    fig, ax = plt.subplots(figsize=(12, 8))
    
    y_pos = range(len(ci_df))
    ax.errorbar(ci_df['mean'] * 100, y_pos, 
                xerr=[(ci_df['mean'] - ci_df['ci_lower']) * 100, 
                      (ci_df['ci_upper'] - ci_df['mean']) * 100],
                fmt='o', capsize=5, capthick=2, markersize=8)
    
    ax.set_yticks(y_pos)
    ax.set_yticklabels(ci_df['config'])
    ax.set_xlabel('Score (%) with 95% Confidence Interval', fontsize=12)
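    # Report the iteration count recorded in the comparison file instead of hardcoding it
    n_boot = statistical['meta'].get('bootstrapIterations', 'N/A')
    ax.set_title(f'Statistical Confidence Intervals\n(Bootstrap with {n_boot} iterations)',
                 fontsize=14, fontweight='bold')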
    ax.axvline(x=65, color='red', linestyle='--', alpha=0.3, label='Pass Threshold')
    ax.legend()
    ax.grid(True, alpha=0.3, axis='x')
    
    plt.tight_layout()
    plt.show()
    
    print("\n💡 Overlapping intervals = difference may not be statistically significant")
    print("   Narrow intervals = more reliable estimate")
else:
    print("⚠️  Statistical analysis not available")
    print(f"   Run: bun run compare --mode {MODE} --strategy statistical")

<a id='provider'></a>
## Search Provider Comparison

In [None]:
# Cell 11: Provider Comparison by Agent
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Group by agent and provider
agent_provider = full_df.groupby(['agent', 'provider'])['avgScore'].mean().unstack()

# Score comparison
agent_provider_pct = agent_provider * 100
agent_provider_pct.plot(kind='bar', ax=ax1, color=['#3498db', '#e74c3c'])
ax1.set_title('Quality Score by Search Provider', fontsize=14, fontweight='bold')
ax1.set_ylabel('Score (%)', fontsize=12)
ax1.set_xlabel('Agent', fontsize=12)
ax1.axhline(y=65, color='red', linestyle='--', alpha=0.3, label='Pass Threshold')
ax1.legend(title='Provider')
ax1.tick_params(axis='x', rotation=45)

# Latency comparison
latency_provider = full_df.groupby(['agent', 'provider'])['p50_latency'].mean().unstack()
latency_provider.plot(kind='bar', ax=ax2, color=['#3498db', '#e74c3c'])
ax2.set_title('Median Latency by Search Provider', fontsize=14, fontweight='bold')
ax2.set_ylabel('Latency (ms)', fontsize=12)
ax2.set_xlabel('Agent', fontsize=12)
ax2.legend(title='Provider')
ax2.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

# Print provider winner per agent
print("\n🏅 PROVIDER WINNER PER AGENT (Quality)")
print("="*70)
for agent in agent_provider.index:
    winner = agent_provider.loc[agent].idxmax()
    # Gap between the best and worst provider, in percentage points
    score_diff = (agent_provider.loc[agent, winner] - agent_provider.loc[agent].min()) * 100
    print(f"{agent}: {winner} (+{score_diff:.1f} pp)")

<a id='pass-rates'></a>
## Pass Rate Analysis
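
The pass rates below come from the pre-computed comparison file. As a sketch of the underlying idea, a prompt passes when its score reaches the 65% threshold, and the pass rate is the share of passing prompts. The helper `pass_rate` and the 0-to-1 score scale are assumptions made for illustration.

```python
def pass_rate(scores, threshold=0.65):
    """Fraction of scores at or above the pass threshold (scores assumed in [0, 1])."""
    if not scores:
        return 0.0  # No prompts evaluated
    passed = sum(1 for score in scores if score >= threshold)
    return passed / len(scores)


# Made-up per-prompt scores: three of the four reach the threshold
example_scores = [0.9, 0.65, 0.4, 0.7]
print(f"Pass rate: {pass_rate(example_scores):.1%}")
```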

In [None]:
# Cell 12: Pass Rates
fig, ax = plt.subplots(figsize=(12, 6))

# Color code by pass rate
colors = ['#2ecc71' if rate >= 0.5 else '#f39c12' if rate >= 0.3 else '#e74c3c'
          for rate in rankings_df['passRate']]

bars = ax.barh(rankings_df['config'], rankings_df['passRate'] * 100, color=colors)

ax.set_xlabel('Pass Rate (%)', fontsize=12)
ax.set_ylabel('Configuration', fontsize=12)
ax.set_title('Pass Rates (Score ≥ 65%)', fontsize=14, fontweight='bold')

# Add percentage labels
for bar, rate, count in zip(bars, rankings_df['passRate'], rankings_df['passCount']):
    ax.text(rate * 100 + 1, bar.get_y() + bar.get_height()/2,
            f"{rate*100:.1f}% ({int(count)})", va='center', fontsize=10)

plt.tight_layout()
plt.show()

# Summary statistics (rankings_df is sorted by avgScore, so pick best/worst by passRate explicitly)
best = rankings_df.loc[rankings_df['passRate'].idxmax()]
worst = rankings_df.loc[rankings_df['passRate'].idxmin()]
print("\n📊 PASS RATE SUMMARY")
print("="*70)
print(f"Best: {best['config']} ({best['passRate']:.1%})")
print(f"Worst: {worst['config']} ({worst['passRate']:.1%})")
print(f"Average: {rankings_df['passRate'].mean():.1%}")
print(f"Median: {rankings_df['passRate'].median():.1%}")

## Summary

This notebook visualized comparison results. For deeper analysis:
- **Raw trajectories**: Load individual JSONL files from `data/results/`
- **Trials analysis**: See `trials.ipynb` for pass@k reliability metrics
- **Custom comparisons**: Use `bun run compare` with different flags
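
For the raw trajectories, a minimal loader sketch is shown below. It assumes only that each `.jsonl` file holds one JSON object per line; the helper names `load_jsonl` and `load_results` are hypothetical, and the record fields and directory layout under `data/results/` are not specified here, so inspect a file before relying on any particular key.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def load_results(results_dir):
    """Load every .jsonl file found under `results_dir`, keyed by file path."""
    results_dir = Path(results_dir)
    if not results_dir.exists():
        return {}  # Nothing to load, e.g. when run outside the repository
    return {path: load_jsonl(path) for path in sorted(results_dir.rglob("*.jsonl"))}


# Report how many records each trajectory file contains
for path, records in load_results("data/results").items():
    print(f"{path}: {len(records)} records")
```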