# Agentic Web Search Playoffs: Evaluation Results
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/youdotcom-oss/agentic-web-search-playoffs/blob/main/notebooks/summary.ipynb)

**Comparing 4 agents × 2 search tools = 8 configurations across 1,254 web search tasks**

## Quick Navigation

**📊 Key Findings**
1. [Executive Summary](#Executive-Summary) - Top performers and key observations
2. [Overall Rankings](#Overall-Rankings) - Quality rankings across configurations

**🔬 Deep Analysis**
3. [Score Distribution](#Score-Distribution-Analysis) - Understanding grading patterns
4. [MCP Tool Usage](#MCP-Tool-Usage-Analysis) - MCP integration verification
5. [Statistical Significance](#Statistical-Significance-Testing) - Bootstrap confidence

**📈 Performance Deep-Dive**
6. [Head-to-Head Matrix](#Head-to-Head-Comparison-Matrix) - Pairwise win/loss
7. [Pass Rates](#Pass-Rates-by-Configuration) - Success rates by agent/tool
8. [Latency Analysis](#Latency-Distribution) - Response time distributions
9. [Error Rates](#Tool-Error-Rates) - Tool failure analysis

**📅 Historical Context**
10. [Trends Over Time](#Historical-Trends) - Performance evolution

---

**Methodology:** Hybrid grading (deterministic 60% + LLM 40%), pass threshold ≥70%  
**Data:** Raw trajectories and comparisons in `/data/results/runs/` for reproducibility
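
As a rough illustration of the weighting above, the sketch below blends a deterministic score and an LLM score into one hybrid score and applies the pass threshold. The names `hybrid_score` and `passes` are hypothetical, and the sketch assumes both component scores are normalized to the 0-1 range; the actual grader may combine them differently.

```python
DETERMINISTIC_WEIGHT = 0.6  # weights taken from the methodology note
LLM_WEIGHT = 0.4
PASS_THRESHOLD = 0.7


def hybrid_score(deterministic: float, llm: float) -> float:
    """Weighted blend of two component scores, each assumed to be in [0, 1]."""
    return DETERMINISTIC_WEIGHT * deterministic + LLM_WEIGHT * llm


def passes(score: float) -> bool:
    """A result passes when its hybrid score reaches the threshold."""
    return score >= PASS_THRESHOLD


# A perfect deterministic check paired with a middling LLM grade
example = hybrid_score(deterministic=1.0, llm=0.5)
print(f"{example:.2f}", passes(example))
```

Because the deterministic part carries 60% of the weight, a result that fails every deterministic check tops out at 0.4 and cannot pass under this weighting, whatever the LLM grade.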

## Environment Setup

**⚠️ Repository Visibility:** This repository is currently INTERNAL. To use this notebook in Google Colab:
- **Option 1:** Make the repository PUBLIC at [repository settings](https://github.com/youdotcom-oss/agentic-web-search-playoffs/settings)
- **Option 2:** Use GitHub authentication (see error message below if clone fails)

**For local Jupyter users:** This cell will automatically find the project root.

In [None]:
import sys
import os
from pathlib import Path

# Detect environment
IN_COLAB = 'google.colab' in sys.modules
REPO_URL = 'https://github.com/youdotcom-oss/agentic-web-search-playoffs.git'
REPO_DIR = 'agentic-web-search-playoffs'

if IN_COLAB:
    print("🔵 Google Colab detected")
    
    # Clone repo if not already cloned
    if not Path(REPO_DIR).exists():
        print(f"📥 Cloning {REPO_URL}...")
        result = !git clone {REPO_URL} 2>&1
        
        # Check if clone succeeded
        if Path(REPO_DIR).exists():
            print("✓ Repository cloned")
        else:
            print("\n❌ Failed to clone repository")
            print("\n⚠️  This repository is INTERNAL and requires authentication.")
            print("\nOptions:")
            print("1. Make the repository PUBLIC at: https://github.com/youdotcom-oss/agentic-web-search-playoffs/settings")
            print("2. OR manually upload the data/ folder to Colab")
            print("3. OR use GitHub token authentication:")
            print("   !git clone https://YOUR_TOKEN@github.com/youdotcom-oss/agentic-web-search-playoffs.git")
            raise FileNotFoundError(f"Could not clone {REPO_URL}")
    else:
        print(f"✓ Repository already exists at {REPO_DIR}")
    
    # Change to repo directory
    os.chdir(REPO_DIR)
    print(f"📁 Working directory: {Path.cwd()}")
    
    # Install dependencies
    print("\n📦 Installing Python dependencies...")
    !pip install -q pandas matplotlib seaborn numpy
    print("✓ Dependencies installed")
else:
    print("💻 Local Jupyter detected")
    
    # Find project root (contains data/results/latest.json)
    current_dir = Path.cwd()
    project_root = None
    
    # If in notebooks/ directory, go up one level
    if current_dir.name == 'notebooks':
        project_root = current_dir.parent
    else:
        # Search for project root
        for parent in [current_dir] + list(current_dir.parents):
            if (parent / 'data' / 'results' / 'latest.json').exists():
                project_root = parent
                break
    
    if project_root:
        os.chdir(project_root)
        print(f"✓ Project root: {Path.cwd()}")
    else:
        print(f"⚠️  Could not find project root with data/results/latest.json")
        print(f"   Current directory: {Path.cwd()}")

print(f"\n✓ Ready to load data from: {Path.cwd() / 'data' / 'results' / 'latest.json'}")

## Data Loading

In [None]:
import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Configure plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.dpi'] = 100

# Read latest run metadata
with open('data/results/latest.json') as f:
    latest = json.load(f)

print(f"📊 Loading data from run: {latest['date']}")
print(f"   Prompts: {latest['promptCount']:,}")
print(f"   Commit: {latest['commit']}")

In [None]:
# Read agents and search providers from manifest
with open('data/results/MANIFEST.jsonl') as f:
    lines = f.readlines()
    latest_entry = json.loads(lines[-1])
    agents = latest_entry['agents']
    search_providers = latest_entry['searchProviders']

print(f"Agents: {', '.join(agents)}")
print(f"Search providers: {', '.join(search_providers)}")

# Load raw trajectory results
results = {}
for agent in agents:
    results[agent] = {}
    for provider in search_providers:
        path = f"data/results/{latest['path']}/{agent}/{provider}.jsonl"
        try:
            with open(path) as f:
                results[agent][provider] = [json.loads(line) for line in f]
        except FileNotFoundError:
            print(f"⚠️  Missing: {path}")
            results[agent][provider] = []

# Flatten to DataFrame
rows = []
for agent, providers_dict in results.items():
    for provider, result_list in providers_dict.items():
        for r in result_list:
            rows.append({
                'agent': agent,
                'search_provider': provider,
                'id': r['id'],
                'score': r.get('score', 0),
                'pass': r.get('score', 0) >= 0.7,
                'latency_ms': r.get('timing', {}).get('total', 0),
                'tool_errors': r.get('toolErrors', False),
            })

df = pd.DataFrame(rows)
print(f"\n✓ Loaded {len(df):,} trajectory results")

## Executive Summary

High-level findings and key takeaways from this evaluation run.
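
The summary cell reads a `rankings_df` table with `rank`, `run`, `score`, and `passRate` columns. Below is a minimal sketch of how rows of that shape could be derived from the `quality` block of a comparison file (the `avgScore`/`passRate` structure read in the Historical Trends cell). It assumes ranks follow descending average score. `build_rankings` and the sample run names and numbers are hypothetical.

```python
def build_rankings(quality: dict) -> list[dict]:
    """Rank runs by average score (descending), starting at rank 1."""
    ordered = sorted(
        quality.items(), key=lambda kv: kv[1]["avgScore"], reverse=True
    )
    return [
        {"rank": i, "run": run, "score": m["avgScore"], "passRate": m["passRate"]}
        for i, (run, m) in enumerate(ordered, start=1)
    ]


# Made-up numbers, purely to show the shape of the input and output
sample_quality = {
    "agent-a-builtin": {"avgScore": 0.61, "passRate": 0.30},
    "agent-a-you": {"avgScore": 0.66, "passRate": 0.41},
    "agent-b-builtin": {"avgScore": 0.58, "passRate": 0.22},
}
rankings = build_rankings(sample_quality)
for row in rankings:
    print(row)
```

Under these assumptions, `pd.DataFrame(build_rankings(comp['quality']))` would give a table with the columns the cells in this section expect.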

In [None]:
if rankings_df is not None:
    print(f"📊 EVALUATION RUN: {latest['date']}")
    print(f"{'='*70}")
    print(f"Prompts Evaluated: {latest['promptCount']:,}")
    print(f"Configurations: {len(rankings_df)} (4 agents × 2 search tools)")
    print(f"Data Commit: {latest['commit']}")
    print(f"\n{'='*70}")
    print("TOP 3 CONFIGURATIONS (by weighted average score)")
    print(f"{'='*70}\n")
    
    for _, row in rankings_df.head(3).iterrows():
        score_pct = row['score'] * 100
        pass_pct = row['passRate'] * 100
        print(f"#{row['rank']} {row['run']}")
        print(f"    Avg Score: {score_pct:.2f}% | Pass Rate: {pass_pct:.2f}%\n")
    
    print(f"{'='*70}")
    print("KEY OBSERVATIONS")
    print(f"{'='*70}\n")
    
    # Score distribution
    score_min, score_max = rankings_df['score'].min() * 100, rankings_df['score'].max() * 100
    score_spread = score_max - score_min
    print(f"📊 Score Distribution")
    print(f"   Range: {score_min:.2f}% - {score_max:.2f}% (spread: {score_spread:.2f}pp)")
    if score_spread < 5:
        print(f"   ⚠️  Narrow spread suggests score clustering - may need grader calibration")
    
    # Pass rate variance
    pass_min, pass_max = rankings_df['passRate'].min() * 100, rankings_df['passRate'].max() * 100
    print(f"\n✓ Pass Rates")
    print(f"   Range: {pass_min:.2f}% - {pass_max:.2f}%")
    if pass_max > 10 * pass_min:
        print(f"   📌 Wide variance indicates distinct performance tiers")
    
    # Tool reliability
    error_by_config = df.groupby(['agent', 'search_provider'])['tool_errors'].mean() * 100
    print(f"\n🔧 Tool Reliability")
    print(f"   Error Rate Range: {error_by_config.min():.1f}% - {error_by_config.max():.1f}%")
    if error_by_config.max() > 20:
        worst = error_by_config.idxmax()
        print(f"   ⚠️  {worst[0]}-{worst[1]}: {error_by_config.max():.1f}% error rate (reliability issue)")
else:
    print("⚠️  Comparison data not available. Run: bun scripts/compare.ts --mode full")

## Overall Rankings

Quality rankings by average score (deterministic + LLM hybrid grading).

In [None]:
if rankings_df is not None:
    fig, ax = plt.subplots(figsize=(10, 6))
    colors = ['#2ecc71' if i < 3 else '#3498db' for i in range(len(rankings_df))]
    bars = ax.barh(rankings_df['run'], rankings_df['score'] * 100, color=colors)
    
    ax.set_xlabel('Average Score (%)', fontsize=12)
    ax.set_ylabel('Configuration', fontsize=12)
    ax.set_title('Agent + Search Provider Rankings (Higher is Better)', fontsize=14, fontweight='bold')
    ax.axvline(x=70, color='red', linestyle='--', alpha=0.5, label='Pass Threshold (70%)')
    ax.legend()
    
    for i, (bar, score) in enumerate(zip(bars, rankings_df['score'])):
        ax.text(score * 100 + 0.5, bar.get_y() + bar.get_height()/2, 
                f"{score*100:.2f}%", va='center', fontsize=10)
    
    plt.tight_layout()
    plt.show()
    
    print("\nFull Rankings Table:")
    display(rankings_df[['rank', 'run', 'score', 'passRate']].style.format({'score': '{:.4f}', 'passRate': '{:.4f}'}))
else:
    print("⚠️ No comparison data available")

## Score Distribution Analysis

Understanding how scores are distributed helps identify:
- **Bimodal patterns**: Agent either succeeds or fails completely
- **Clustering**: Grader may need calibration if scores bunch up
- **Outliers**: Unusually high/low performance on specific prompts

In [None]:
# Create agent-provider label
df['config'] = df['agent'] + '-' + df['search_provider']

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Violin plot: Score distribution by configuration
sns.violinplot(data=df, y='config', x='score', ax=axes[0], inner='quartile')
axes[0].axvline(x=0.7, color='red', linestyle='--', alpha=0.5, label='Pass Threshold')
axes[0].set_xlabel('Score', fontsize=12)
axes[0].set_ylabel('Configuration', fontsize=12)
axes[0].set_title('Score Distribution by Configuration\n(wider = more variance)', fontsize=13, fontweight='bold')
axes[0].legend()

# Box plot: Quartile view
sns.boxplot(data=df, y='config', x='score', ax=axes[1])
axes[1].axvline(x=0.7, color='red', linestyle='--', alpha=0.5, label='Pass Threshold')
axes[1].set_xlabel('Score', fontsize=12)
axes[1].set_ylabel('')
axes[1].set_title('Quartile View\n(box = IQR, whiskers = 1.5×IQR)', fontsize=13, fontweight='bold')
axes[1].legend()

plt.tight_layout()
plt.show()

# Statistical summary
print("\n📊 SCORE DISTRIBUTION INSIGHTS")
print("="*70)
for config in df['config'].unique():
    config_scores = df[df['config'] == config]['score']
    q1, median, q3 = config_scores.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    skew = config_scores.skew()
    
    print(f"\n{config}:")
    print(f"  Q1: {q1:.3f} | Median: {median:.3f} | Q3: {q3:.3f} | IQR: {iqr:.3f}")
    print(f"  Skew: {skew:.2f}", end='')
    if abs(skew) < 0.5:
        print(" (symmetric)")
    elif skew > 0:
        print(" (right-skewed: more low scores)")
    else:
        print(" (left-skewed: more high scores)")
    
    # Detect clustering
    if iqr < 0.15:
        print(f"  ⚠️ Narrow IQR suggests score clustering - 50% of results within {iqr:.3f} range")

## MCP Tool Usage Analysis

Verify that configurations using You.com MCP actually called the MCP tool (not builtin).  
**Why this matters**: If MCP configs fall back to builtin search, results aren't valid comparisons.
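
The check boils down to labeling each tool call by name and then labeling the whole trajectory. The sketch below mirrors the substring rules used in the cell that follows. `classify_tool`, `classify_trajectory`, and the example tool names are hypothetical, and real tool names may not match these patterns.

```python
MCP_MARKERS = ("mcp", "ydc", "you.com")  # substrings treated as MCP tool names


def classify_tool(name: str) -> str:
    """Label a tool call as 'mcp', 'builtin', or 'other' by substring match."""
    lowered = name.lower()
    if any(marker in lowered for marker in MCP_MARKERS):
        return "mcp"
    if "search" in lowered:  # also covers names such as 'websearch'
        return "builtin"
    return "other"


def classify_trajectory(trajectory: list[dict]) -> str:
    """'mcp' if any call is MCP, else 'builtin' if any search call, else 'no_search'."""
    labels = {
        classify_tool(step.get("name", ""))
        for step in trajectory
        if step.get("type") == "tool_call"
    }
    if "mcp" in labels:
        return "mcp"
    if "builtin" in labels:
        return "builtin"
    return "no_search"


# Hypothetical trajectory that mixes a builtin call and an MCP call
example = [
    {"type": "message", "content": "planning"},
    {"type": "tool_call", "name": "WebSearch"},
    {"type": "tool_call", "name": "mcp__ydc__search"},
]
print(classify_trajectory(example))
```

One consequence of this rule is that a trajectory using both tools is counted as MCP, so a partial fallback to builtin search would not show up in the `mcp_rate` column.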

In [None]:
# Check trajectory metadata for MCP usage
mcp_usage = []
for agent in agents:
    for provider in search_providers:
        if provider == 'builtin':
            continue  # Skip builtin configs
        
        result_list = results.get(agent, {}).get(provider, [])
        if not result_list:
            continue
        
        mcp_calls = 0
        builtin_calls = 0
        no_search_calls = 0
        
        for r in result_list:
            trajectory = r.get('trajectory', [])
            has_mcp = False
            has_builtin = False
            
            for step in trajectory:
                if step.get('type') == 'tool_call':
                    tool_name = step.get('name', '')
                    # Check for MCP tool patterns
                    if 'mcp' in tool_name.lower() or 'ydc' in tool_name.lower() or 'you.com' in tool_name.lower():
                        has_mcp = True
                    # Check for builtin patterns
                    elif 'websearch' in tool_name.lower() or 'search' in tool_name.lower():
                        has_builtin = True
            
            if has_mcp:
                mcp_calls += 1
            elif has_builtin:
                builtin_calls += 1
            else:
                no_search_calls += 1
        
        total = len(result_list)
        mcp_usage.append({
            'config': f"{agent}-{provider}",
            'expected_mcp': provider,
            'total_prompts': total,
            'mcp_calls': mcp_calls,
            'builtin_calls': builtin_calls,
            'no_search': no_search_calls,
            'mcp_rate': mcp_calls / total if total > 0 else 0,
        })

if mcp_usage:
    mcp_df = pd.DataFrame(mcp_usage)
    
    print("🔍 MCP TOOL USAGE VERIFICATION")
    print("="*70)
    print("Expected: Configurations with 'you' provider should use You.com MCP tool\n")
    
    display(mcp_df[['config', 'total_prompts', 'mcp_calls', 'builtin_calls', 'no_search', 'mcp_rate']]
            .style.format({'mcp_rate': '{:.1%}'}))
    
    # Detect issues
    print("\n📋 FINDINGS:")
    for _, row in mcp_df.iterrows():
        if row['mcp_rate'] < 0.5:
            print(f"⚠️ {row['config']}: Only {row['mcp_rate']:.1%} used MCP (expected: you provider)")
            print(f"   → {row['builtin_calls']} fell back to builtin, {row['no_search']} had no search")
        elif row['mcp_rate'] > 0.9:
            print(f"✅ {row['config']}: {row['mcp_rate']:.1%} MCP usage (good)")
        else:
            print(f"⚠️ {row['config']}: {row['mcp_rate']:.1%} MCP usage (mixed - investigate)")
else:
    print("ℹ️ No MCP configurations found in this run")

## Statistical Significance Testing

Bootstrap sampling (1000 iterations) to determine if performance differences are statistically significant (p < 0.05).  
**Key**: Differences must be significant to claim one agent/tool is truly better.
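
For readers unfamiliar with the method, here is a minimal sketch of a paired bootstrap over per-prompt scores using only the standard library. It is one common recipe, not necessarily what `scripts/compare.ts` implements. The function name `paired_bootstrap`, the percentile interval, and the two-sided p-value construction are all assumptions, and the scores are synthetic.

```python
import random


def paired_bootstrap(scores_a, scores_b, iterations=1000, seed=0):
    """Resample per-prompt score differences (A - B) with replacement.

    Returns the observed mean difference, a 95% percentile interval,
    and a two-sided bootstrap p-value for "no difference".
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired by prompt")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    boot_means = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(iterations)
    )
    lower = boot_means[int(0.025 * iterations)]
    upper = boot_means[int(0.975 * iterations) - 1]
    # Share of resampled means on each side of zero, doubled for two sides
    at_or_below = sum(m <= 0 for m in boot_means) / iterations
    at_or_above = sum(m >= 0 for m in boot_means) / iterations
    p_value = min(1.0, 2 * min(at_or_below, at_or_above))
    return sum(diffs) / n, (lower, upper), p_value


# Synthetic scores: configuration A is about 0.10 better on average
rng = random.Random(42)
scores_a = [min(1.0, max(0.0, rng.gauss(0.70, 0.10))) for _ in range(200)]
scores_b = [min(1.0, max(0.0, rng.gauss(0.60, 0.10))) for _ in range(200)]

diff, (lo, hi), p = paired_bootstrap(scores_a, scores_b)
print(f"mean diff={diff:.3f}  95% CI=({lo:.3f}, {hi:.3f})  p={p:.3f}")
```

Pairing by prompt matters here: every configuration answers the same prompts, so resampling the differences removes prompt difficulty as a source of noise.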

In [None]:
if stat_comparison is not None:
    print("📊 STATISTICAL SIGNIFICANCE RESULTS")
    print("="*70)
    print("Bootstrap Method: 1000 iterations | Significance Level: p < 0.05\n")
    
    # Extract significant pairs from statistical comparison
    significant_pairs = []
    
    if 'headToHead' in stat_comparison and 'pairwise' in stat_comparison['headToHead']:
        for pair in stat_comparison['headToHead']['pairwise']:
            # Statistical comparison includes p-values and confidence intervals
            if 'pValue' in pair and pair.get('pValue', 1.0) < 0.05:
                significant_pairs.append({
                    'runA': pair['runA'],
                    'runB': pair['runB'],
                    'pValue': pair['pValue'],
                    'significant': True,
                    'aWins': pair.get('aWins', 0),
                    'bWins': pair.get('bWins', 0),
                })
    
    if significant_pairs:
        sig_df = pd.DataFrame(significant_pairs)
        print(f"Found {len(sig_df)} statistically significant pairwise differences:\n")
        
        for _, row in sig_df.iterrows():
            winner = row['runA'] if row['aWins'] > row['bWins'] else row['runB']
            loser = row['runB'] if row['aWins'] > row['bWins'] else row['runA']
            print(f"✅ {winner} > {loser}")
            print(f"   p-value: {row['pValue']:.4f} (wins: {max(row['aWins'], row['bWins'])} vs {min(row['aWins'], row['bWins'])})\n")
    else:
        print("⚠️ No statistically significant differences found (p < 0.05)")
        print("   This suggests:")
        print("   - Performance differences may be due to chance")
        print("   - Need larger sample size or more diverse test set")
        print("   - Grader may need calibration to differentiate quality\n")
    
    # Show confidence intervals if available
    if 'quality' in stat_comparison:
        print("\n95% CONFIDENCE INTERVALS (Bootstrap)")
        print("="*70)
        ci_data = []
        for run, metrics in stat_comparison['quality'].items():
            if 'confidenceInterval' in metrics:
                ci = metrics['confidenceInterval']
                ci_data.append({
                    'run': run,
                    'avgScore': metrics['avgScore'],
                    'ci_lower': ci['lower'],
                    'ci_upper': ci['upper'],
                    'ci_width': ci['upper'] - ci['lower'],
                })
        
        if ci_data:
            ci_df = pd.DataFrame(ci_data).sort_values('avgScore', ascending=False)
            display(ci_df.style.format({
                'avgScore': '{:.4f}',
                'ci_lower': '{:.4f}',
                'ci_upper': '{:.4f}',
                'ci_width': '{:.4f}'
            }))
            
            print("\n📊 Narrower CI width = more consistent performance")
else:
    print("ℹ️ Statistical comparison not available.")
    print("   Run: bun scripts/compare.ts --mode full --strategy statistical")

## Head-to-Head Comparison Matrix

Pairwise win/loss comparison across all configurations.
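
To make "pairwise win/loss" concrete, the sketch below counts per-prompt wins for every pair of configurations. It is illustrative only: `head_to_head`, the `aWinRate` field, and the sample scores are hypothetical, and the `confidence` value plotted in the cell below may be defined differently in the comparison data.

```python
from itertools import combinations


def head_to_head(scores_by_config: dict[str, dict[str, float]]) -> list[dict]:
    """Count per-prompt wins for every pair of configurations.

    scores_by_config maps config name -> {prompt id -> score}. Only prompts
    present in both configs are compared; equal scores count as ties.
    """
    results = []
    for a, b in combinations(sorted(scores_by_config), 2):
        sa, sb = scores_by_config[a], scores_by_config[b]
        shared = sa.keys() & sb.keys()
        a_wins = sum(sa[p] > sb[p] for p in shared)
        b_wins = sum(sb[p] > sa[p] for p in shared)
        decided = a_wins + b_wins
        results.append({
            "runA": a,
            "runB": b,
            "aWins": a_wins,
            "bWins": b_wins,
            "ties": len(shared) - decided,
            # Share of decided prompts won by A; 0.5 when nothing was decided
            "aWinRate": a_wins / decided if decided else 0.5,
        })
    return results


# Made-up per-prompt scores for two configurations
sample = {
    "agent-a-builtin": {"p1": 0.9, "p2": 0.4, "p3": 0.7, "p4": 0.9},
    "agent-a-you": {"p1": 0.8, "p2": 0.6, "p3": 0.7, "p4": 0.5},
}
pairs = head_to_head(sample)
print(pairs[0])
```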

In [None]:
if h2h_enhanced is not None:
    # Build matrix
    configs = sorted(df['config'].unique())
    matrix = pd.DataFrame(0, index=configs, columns=configs)
    
    for item in h2h_enhanced:
        winner, loser = item['winner'], item['loser']
        if winner in matrix.index and loser in matrix.columns:
            matrix.loc[winner, loser] = item['confidence']
    
    fig, ax = plt.subplots(figsize=(12, 10))
    sns.heatmap(matrix, annot=True, fmt='.2f', cmap='RdYlGn', center=0.5, 
                vmin=0, vmax=1, ax=ax, cbar_kws={'label': 'Win Confidence'})
    ax.set_title('Head-to-Head Win Matrix (row > column)', fontsize=14, fontweight='bold')
    ax.set_xlabel('Loser', fontsize=12)
    ax.set_ylabel('Winner', fontsize=12)
    plt.tight_layout()
    plt.show()
    
    print("\nInterpretation: Each cell shows win confidence (0.5 = tied, 1.0 = always wins)")
else:
    print("⚠️ No head-to-head data available")

## Pass Rates by Configuration

Percentage of prompts with score ≥ 70% (pass threshold).

In [None]:
pass_rates = df.groupby('config')['pass'].mean().sort_values(ascending=True) * 100

fig, ax = plt.subplots(figsize=(10, 6))
colors = ['#e74c3c' if rate < 10 else '#f39c12' if rate < 20 else '#2ecc71' for rate in pass_rates]
bars = ax.barh(pass_rates.index, pass_rates.values, color=colors)

ax.set_xlabel('Pass Rate (%)', fontsize=12)
ax.set_ylabel('Configuration', fontsize=12)
ax.set_title('Pass Rates (Score ≥ 70%)', fontsize=14, fontweight='bold')

for bar, rate in zip(bars, pass_rates.values):
    ax.text(rate + 0.5, bar.get_y() + bar.get_height()/2, 
            f"{rate:.2f}%", va='center', fontsize=10)

plt.tight_layout()
plt.show()

print("\nPass Rate Summary:")
print(pass_rates.to_string())

## Latency Distribution

Response time distributions by configuration (median and p90).
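
As a small worked example of the two statistics, the sketch below computes the median and p90 for one configuration with the standard library. `latency_summary` and the sample latencies are hypothetical. The `inclusive` method is chosen because it uses the same linear interpolation as the default of pandas' `quantile()`.

```python
from statistics import median, quantiles


def latency_summary(latencies_ms: list[float]) -> dict:
    """Median and 90th percentile of a list of latencies in milliseconds."""
    # quantiles(n=10) returns the 9 cut points between deciles; the last is p90
    p90 = quantiles(latencies_ms, n=10, method="inclusive")[-1]
    return {"median": median(latencies_ms), "p90": p90}


# Made-up latencies with one very slow response
sample_latencies = [1200, 1500, 900, 30000, 1100, 1300, 1000, 1400, 1600, 1250, 1350]
summary = latency_summary(sample_latencies)
print(summary)
```

In this sample the single 30-second response pulls the mean up to roughly 3,900 ms, while the median stays at 1,300 ms and p90 at 1,600 ms, which is why the section reports median and p90 and not the mean.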

In [None]:
# Named aggregation avoids depending on pandas' auto-generated '<lambda_0>' column name
latency_stats = df.groupby('config')['latency_ms'].agg(
    median='median',
    p90=lambda x: x.quantile(0.9),
)
latency_stats = latency_stats.sort_values('median')

fig, ax = plt.subplots(figsize=(10, 6))
x = range(len(latency_stats))
ax.barh(x, latency_stats['median'], label='Median', alpha=0.7)
ax.barh(x, latency_stats['p90'], label='P90', alpha=0.5)

ax.set_yticks(x)
ax.set_yticklabels(latency_stats.index)
ax.set_xlabel('Latency (ms)', fontsize=12)
ax.set_ylabel('Configuration', fontsize=12)
ax.set_title('Latency Distribution (Lower is Better)', fontsize=14, fontweight='bold')
ax.legend()

plt.tight_layout()
plt.show()

print("\nLatency Summary (ms):")
print(latency_stats.to_string())

## Tool Error Rates

Percentage of prompts where tool calls failed.

In [None]:
error_rates = df.groupby('config')['tool_errors'].mean().sort_values(ascending=False) * 100

fig, ax = plt.subplots(figsize=(10, 6))
colors = ['#e74c3c' if rate > 20 else '#f39c12' if rate > 10 else '#2ecc71' for rate in error_rates]
bars = ax.barh(error_rates.index, error_rates.values, color=colors)

ax.set_xlabel('Error Rate (%)', fontsize=12)
ax.set_ylabel('Configuration', fontsize=12)
ax.set_title('Tool Error Rates (Lower is Better)', fontsize=14, fontweight='bold')

for bar, rate in zip(bars, error_rates.values):
    ax.text(rate + 0.5, bar.get_y() + bar.get_height()/2, 
            f"{rate:.2f}%", va='center', fontsize=10)

plt.tight_layout()
plt.show()

print("\nError Rate Summary:")
print(error_rates.to_string())

## Historical Trends

Performance evolution across evaluation runs (if multiple runs available).

In [None]:
# Read all runs from MANIFEST
manifest_path = Path('data/results/MANIFEST.jsonl')
if manifest_path.exists():
    with open(manifest_path) as f:
        runs = [json.loads(line) for line in f]
    
    if len(runs) > 1:
        print(f"📅 Found {len(runs)} evaluation runs\n")
        
        # Load comparison data for each run
        historical_data = []
        for run_meta in runs:
            if run_meta.get('dataset') != 'full':
                continue  # Skip test runs
            
            comp_path = Path(f"data/comparisons/{run_meta['path']}/all-weighted.json")
            if comp_path.exists():
                with open(comp_path) as f:
                    comp = json.load(f)
                
                for run_name, metrics in comp['quality'].items():
                    historical_data.append({
                        'date': run_meta['date'],
                        'config': run_name,
                        'avgScore': metrics['avgScore'],
                        'passRate': metrics['passRate'],
                    })
        
        if historical_data:
            hist_df = pd.DataFrame(historical_data)
            
            fig, axes = plt.subplots(1, 2, figsize=(16, 6))
            
            # Average score trends
            for config in hist_df['config'].unique():
                config_data = hist_df[hist_df['config'] == config].sort_values('date')
                axes[0].plot(config_data['date'], config_data['avgScore'] * 100, marker='o', label=config)
            
            axes[0].set_xlabel('Date', fontsize=12)
            axes[0].set_ylabel('Average Score (%)', fontsize=12)
            axes[0].set_title('Score Trends Over Time', fontsize=13, fontweight='bold')
            axes[0].legend(bbox_to_anchor=(1.05, 1), loc='upper left')
            axes[0].grid(True, alpha=0.3)
            axes[0].tick_params(axis='x', rotation=45)
            
            # Pass rate trends
            for config in hist_df['config'].unique():
                config_data = hist_df[hist_df['config'] == config].sort_values('date')
                axes[1].plot(config_data['date'], config_data['passRate'] * 100, marker='o', label=config)
            
            axes[1].set_xlabel('Date', fontsize=12)
            axes[1].set_ylabel('Pass Rate (%)', fontsize=12)
            axes[1].set_title('Pass Rate Trends Over Time', fontsize=13, fontweight='bold')
            axes[1].legend(bbox_to_anchor=(1.05, 1), loc='upper left')
            axes[1].grid(True, alpha=0.3)
            axes[1].tick_params(axis='x', rotation=45)
            
            plt.tight_layout()
            plt.show()
        else:
            print("ℹ️ No historical full run data available for comparison")
    else:
        print("ℹ️ Only one evaluation run available - no trends to show yet")
else:
    print("ℹ️ No MANIFEST.jsonl found")