# Pass@k Trials Analysis

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/youdotcom-oss/web-search-agent-evals/blob/main/notebooks/trials.ipynb)

Deep dive into agent reliability through pass@k metrics from multi-trial evaluations.

## What This Analyzes

This notebook analyzes **trials data** where each prompt is run multiple times (k trials) to measure:
- **pass@k (Capability)**: Can the agent solve the task at least once in k trials?
- **pass^k (Reliability)**: Does it succeed on all k trials?
- **Flakiness**: How inconsistent are the results across trials (the gap between pass@k and pass^k)?

**Typical Configuration**:
- **Capability Mode**: k=10 trials (can it solve this?)
- **Default Mode**: k=5 trials (balanced)
- **Regression Mode**: k=3 trials (faster checks)
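
The choice of k changes both metrics. The sketch below applies the formulas this notebook prints later (pass@k = 1 - (1 - p)^k and pass^k = p^k), assuming independent trials with a fixed per-trial pass rate `p`. The helper names `pass_at_k` and `pass_hat_k` and the value `p = 0.8` are illustrative assumptions, not part of the evaluation harness.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent trials."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed."""
    return p ** k


# Illustrative per-trial pass rate (an assumed value, not a measured result).
p = 0.8
for mode, k in [("regression", 3), ("default", 5), ("capability", 10)]:
    print(f"{mode:<10} k={k:<2}  pass@k={pass_at_k(p, k):.3f}  pass^k={pass_hat_k(p, k):.3f}")
```

With the same per-trial pass rate, a larger k pushes pass@k toward 1 and pass^k toward 0, so values measured in different modes are not directly comparable.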

## Quick Navigation

1. [Setup & Data Loading](#setup)
2. [Pass Rate Distribution](#pass-rates)
3. [Capability vs Reliability](#capability-reliability)
4. [Flakiness Analysis](#flakiness)
5. [Prompt Difficulty](#difficulty)
6. [Per-Prompt Heatmap](#heatmap)

In [None]:
# Cell 1: Colab Setup (auto-detects environment)
import os
from pathlib import Path

# Detect if running in Colab
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    print("🔧 Running in Google Colab - cloning repository...")
    
    # Clone repository if not already present
    repo_dir = Path('/content/web-search-agent-evals')
    if not repo_dir.exists():
        !git clone https://github.com/youdotcom-oss/web-search-agent-evals.git /content/web-search-agent-evals
        print("✓ Repository cloned")
    else:
        print("✓ Repository already exists")
        # Pull latest changes
        %cd /content/web-search-agent-evals
        !git pull origin main
    
    # Change to repo directory
    %cd /content/web-search-agent-evals
    print(f"✓ Working directory: {Path.cwd()}")
else:
    print("✓ Running locally")

In [None]:
# Cell 2: Dependencies & Configuration
import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from pathlib import Path

# Configure plotting
sns.set_style('whitegrid')
plt.rcParams['figure.dpi'] = 100
plt.rcParams['figure.figsize'] = (12, 6)

# Find project root
PROJECT_ROOT = Path.cwd()
if PROJECT_ROOT.name == 'notebooks':
    PROJECT_ROOT = PROJECT_ROOT.parent

DATA_DIR = PROJECT_ROOT / 'data'
print(f"📁 Project root: {PROJECT_ROOT}")
print(f"📊 Data directory: {DATA_DIR}")

# Verify data directory exists
if not DATA_DIR.exists():
    raise FileNotFoundError(f"Data directory not found: {DATA_DIR}")

In [None]:
# Cell 3: Configuration - Choose Trials Dataset

# =====================================
# USER CONFIGURATION
# =====================================
AGENT = 'droid'          # Options: 'claude-code', 'gemini', 'droid', 'codex'
PROVIDER = 'builtin'     # Options: 'builtin', 'you' (or other MCP server keys)
TRIAL_TYPE = 'default'   # Options: 'default', 'capability', 'regression'
RUN_DATE = None          # None for latest, or '2026-01-29' for specific date
# =====================================

print(f"📊 TRIALS ANALYSIS: {AGENT.upper()} - {PROVIDER.upper()}")
print("="*70)

trials_dir = DATA_DIR / 'results' / 'trials'

# List dated trial directories once (names starting with a digit, e.g. 2026-01-29)
available_dates = sorted(d.name for d in trials_dir.iterdir() if d.is_dir() and d.name[0].isdigit())

# Get latest date if not specified
if RUN_DATE is None:
    if not available_dates:
        raise FileNotFoundError("No dated trials found")
    RUN_DATE = available_dates[-1]
    print(f"Using latest trials run: {RUN_DATE}")
else:
    print(f"Using specified run date: {RUN_DATE}")

# Show available dates
print("\nAvailable trial dates:")
for date in available_dates[-5:]:  # Show last 5
    print(f"  - {date}")

# Build trials file path (same nested structure as runs)
suffix = '' if TRIAL_TYPE == 'default' else f'-{TRIAL_TYPE}'
trials_file = trials_dir / RUN_DATE / AGENT / f"{PROVIDER}{suffix}.jsonl"

if not trials_file.exists():
    raise FileNotFoundError(
        f"Trials file not found: {trials_file}\n"
        f"Run: bun run trials -- --agent {AGENT} --search-provider {PROVIDER}"
    )

print(f"\nLoading: {trials_file}")
print("="*70)

<a id='setup'></a>
## Load Trials Data

In [None]:
# Cell 4: Load and Parse Trials Data
with open(trials_file) as f:
    trials = [json.loads(line) for line in f]

df = pd.DataFrame(trials)

print(f"✓ Loaded {len(df)} prompts with trial data")
print(f"\nColumns: {list(df.columns)}")
print(f"\nSample record:")
print(json.dumps(trials[0], indent=2))

# Extract key metrics
df['k'] = df['trials'].apply(len)
df['prompt_short'] = df['id'].str[:40] + '...'

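print("\n📈 DATASET SUMMARY")
print("="*70)
print(f"Prompts: {len(df)}")
k_values = sorted(df['k'].unique())  # usually a single value; list all if k varies by prompt
print(f"Trials per prompt (k): {', '.join(str(k) for k in k_values)}")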
print(f"Total evaluations: {df['k'].sum()}")

<a id='pass-rates'></a>
## Pass Rate Distribution

In [None]:
# Cell 5: Pass Rate Distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Histogram of pass rates
ax1.hist(df['passRate'], bins=20, color='#3498db', alpha=0.7, edgecolor='black')
ax1.axvline(df['passRate'].mean(), color='red', linestyle='--', linewidth=2, label='Mean')
ax1.axvline(df['passRate'].median(), color='green', linestyle='--', linewidth=2, label='Median')
ax1.set_xlabel('Pass Rate', fontsize=12)
ax1.set_ylabel('Number of Prompts', fontsize=12)
ax1.set_title('Distribution of Pass Rates Across Prompts', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Pass@k vs Pass^k scatter
ax2.scatter(df['passAtK'], df['passExpK'], s=100, alpha=0.6, color='#3498db')
ax2.plot([0, 1], [0, 1], 'r--', alpha=0.3, label='Perfect Reliability')
ax2.set_xlabel('pass@k (Capability)', fontsize=12)
ax2.set_ylabel('pass^k (Reliability)', fontsize=12)
ax2.set_title('Capability vs Reliability', fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 PASS RATE STATISTICS")
print("="*70)
print(f"Mean Pass Rate: {df['passRate'].mean():.1%}")
print(f"Median Pass Rate: {df['passRate'].median():.1%}")
print(f"Std Dev: {df['passRate'].std():.1%}")
print(f"\nAlways Pass (100%): {(df['passRate'] == 1.0).sum()} prompts")
print(f"Sometimes Pass (0-100%): {((df['passRate'] > 0) & (df['passRate'] < 1.0)).sum()} prompts")
print(f"Never Pass (0%): {(df['passRate'] == 0.0).sum()} prompts")

<a id='capability-reliability'></a>
## Capability vs Reliability Frontier

In [None]:
# Cell 6: Capability vs Reliability Analysis
fig, ax = plt.subplots(figsize=(12, 8))

# Color by pass rate
scatter = ax.scatter(df['passAtK'], df['passExpK'], 
                     c=df['passRate'], cmap='RdYlGn', 
                     s=150, alpha=0.7, edgecolors='black', linewidth=1)

# Add diagonal line (perfect reliability)
ax.plot([0, 1], [0, 1], 'k--', alpha=0.3, linewidth=2, label='Perfect Reliability')

# Add quadrant lines
ax.axhline(0.5, color='gray', linestyle=':', alpha=0.3)
ax.axvline(0.5, color='gray', linestyle=':', alpha=0.3)

# Annotate quadrants
ax.text(0.75, 0.75, 'High Capability\nHigh Reliability', 
        ha='center', va='center', fontsize=10, alpha=0.5, 
        bbox=dict(boxstyle='round', facecolor='green', alpha=0.1))
ax.text(0.25, 0.25, 'Low Capability\nLow Reliability', 
        ha='center', va='center', fontsize=10, alpha=0.5,
        bbox=dict(boxstyle='round', facecolor='red', alpha=0.1))
ax.text(0.75, 0.25, 'High Capability\nLow Reliability (Flaky)', 
        ha='center', va='center', fontsize=10, alpha=0.5,
        bbox=dict(boxstyle='round', facecolor='orange', alpha=0.1))

ax.set_xlabel('pass@k (Probability of success in k attempts)', fontsize=12)
ax.set_ylabel('pass^k (Probability of k consecutive successes)', fontsize=12)
ax.set_title(f'Capability vs Reliability Frontier\n{AGENT.upper()} - {PROVIDER.upper()} ({len(df)} prompts)', 
             fontsize=14, fontweight='bold')
ax.set_xlim(-0.05, 1.05)
ax.set_ylim(-0.05, 1.05)
ax.legend()
ax.grid(True, alpha=0.3)

# Add colorbar
cbar = plt.colorbar(scatter, ax=ax)
cbar.set_label('Pass Rate', fontsize=12)

plt.tight_layout()
plt.show()

print("\n💡 INTERPRETATION")
print("="*70)
print("pass@k = 1 - (1 - p)^k  (Can solve with k attempts)")
print("pass^k = p^k            (Solves k times in a row)")
print("\nIdeal: Top-right (high both) = Capable AND reliable")
print("Concern: Bottom-right (high pass@k, low pass^k) = Flaky (inconsistent)")

<a id='flakiness'></a>
## Flakiness Analysis

In [None]:
# Cell 7: Identify Flaky Prompts
# Flakiness score: high passAtK but low passExpK = inconsistent
df['flakiness'] = df['passAtK'] - df['passExpK']

# Show top 10 flakiest prompts
flaky = df.nlargest(10, 'flakiness')[['id', 'passRate', 'passAtK', 'passExpK', 'flakiness']]

print("🔥 TOP 10 FLAKIEST PROMPTS")
print("="*70)
print(flaky.to_string(index=False))

# Plot flakiness distribution
fig, ax = plt.subplots(figsize=(12, 6))
ax.hist(df['flakiness'], bins=20, color='#e74c3c', alpha=0.7, edgecolor='black')
ax.axvline(df['flakiness'].mean(), color='blue', linestyle='--', linewidth=2, label='Mean')
ax.set_xlabel('Flakiness Score (pass@k - pass^k)', fontsize=12)
ax.set_ylabel('Number of Prompts', fontsize=12)
ax.set_title('Flakiness Distribution', fontsize=14, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 High flakiness = Agent can solve but not consistently")
print("   Consider: Retries, prompt refinement, or different agent")

<a id='difficulty'></a>
## Prompt Difficulty Analysis

In [None]:
# Cell 8: Hardest and Easiest Prompts
easiest = df.nlargest(10, 'passRate')[['id', 'passRate', 'passAtK', 'passExpK']]
hardest = df.nsmallest(10, 'passRate')[['id', 'passRate', 'passAtK', 'passExpK']]

print("✅ TOP 10 EASIEST PROMPTS (Highest Pass Rate)")
print("="*70)
print(easiest.to_string(index=False))

print("\n❌ TOP 10 HARDEST PROMPTS (Lowest Pass Rate)")
print("="*70)
print(hardest.to_string(index=False))

# Plot pass rate ranking
df_sorted = df.sort_values('passRate', ascending=False).reset_index(drop=True)
df_sorted['rank'] = range(1, len(df_sorted) + 1)

fig, ax = plt.subplots(figsize=(14, 6))
ax.plot(df_sorted['rank'], df_sorted['passRate'], marker='o', linestyle='-', alpha=0.6)
ax.axhline(0.5, color='red', linestyle='--', alpha=0.3, label='50% Threshold')
ax.fill_between(df_sorted['rank'], 0, df_sorted['passRate'], alpha=0.2)
ax.set_xlabel('Prompt (Sorted by Pass Rate)', fontsize=12)
ax.set_ylabel('Pass Rate', fontsize=12)
ax.set_title('Pass Rate by Prompt Difficulty', fontsize=14, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

<a id='heatmap'></a>
## Per-Prompt Trial Heatmap

In [None]:
# Cell 9: Trial Results Heatmap
# Build matrix of trial results (rows = prompts, columns = trials)
trial_matrix = []
prompt_ids = []

for _, row in df.iterrows():
    # Extract pass/fail for each trial (1 = score >= 0.65, 0 = fail; a missing score counts as a fail)
    trial_results = [1 if t.get('score', 0) >= 0.65 else 0 for t in row['trials']]
    trial_matrix.append(trial_results)
    prompt_ids.append(row['prompt_short'])

trial_matrix = np.array(trial_matrix)

# Plot heatmap
fig, ax = plt.subplots(figsize=(14, max(10, len(df) * 0.3)))
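# Pin vmin/vmax so 0 is always red and 1 is always green, even when every trial has the same outcome
sns.heatmap(trial_matrix, cmap=['#e74c3c', '#2ecc71'], vmin=0, vmax=1, cbar=False,
            yticklabels=prompt_ids, xticklabels=range(1, trial_matrix.shape[1] + 1),
            ax=ax, linewidths=0.5, linecolor='white')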

ax.set_xlabel('Trial Number', fontsize=12)
ax.set_ylabel('Prompt', fontsize=10)
ax.set_title(f'Trial Results Heatmap\n{AGENT.upper()} - {PROVIDER.upper()} (Green = Pass, Red = Fail)', 
             fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print("\n💡 Each row = one prompt, each column = one trial")
print("   Consistent green rows = reliable prompts")
print("   Mixed rows = flaky prompts")
print("   Consistent red rows = unsolvable prompts (for this agent)")

## Summary & Recommendations

### Key Metrics
- **pass@k**: Probability of at least one success in k trials (capability)
- **pass^k**: Probability of k consecutive successes (reliability)
- **Flakiness**: pass@k - pass^k (inconsistency measure)
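
To make these definitions concrete, the following sketch estimates all three metrics from one prompt's list of pass/fail outcomes. It assumes the formulas printed in the interpretation cell and uses the empirical pass rate as `p`; the helper name `summarize_trials` is hypothetical, and the evaluation harness may compute its stored `passAtK` and `passExpK` values differently.

```python
from typing import Sequence


def summarize_trials(outcomes: Sequence[bool]) -> dict:
    """Estimate pass rate, pass@k, pass^k, and flakiness from k pass/fail outcomes."""
    k = len(outcomes)
    if k == 0:
        raise ValueError("need at least one trial")
    p = sum(outcomes) / k          # empirical per-trial pass rate
    pass_at_k = 1 - (1 - p) ** k   # at least one success in k trials
    pass_exp_k = p ** k            # all k trials succeed
    return {
        "k": k,
        "passRate": p,
        "passAtK": pass_at_k,
        "passExpK": pass_exp_k,
        "flakiness": pass_at_k - pass_exp_k,
    }


# Made-up outcomes for a single prompt: 3 passes out of 5 trials.
print(summarize_trials([True, False, True, True, False]))
```

Under these formulas, flakiness is zero when a prompt always passes or always fails, and it is largest for prompts that pass about half the time.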

### Use Cases
1. **Production Deployment**: Choose agents with high pass^k (reliability)
2. **Prompt Engineering**: Focus on flaky prompts (can work but inconsistent)
3. **Agent Selection**: Compare reliability across agents for specific task types
4. **Regression Testing**: Track if reliability drops over time
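
For the regression-testing use case, one possible check is to compare per-prompt pass^k between two runs and flag large drops. This is a minimal sketch: the function `reliability_regressions`, the `tolerance` parameter, and the example values are all hypothetical, and each input is assumed to be a mapping from prompt `id` to `passExpK` built from two trials files.

```python
def reliability_regressions(baseline: dict, current: dict, tolerance: float = 0.1) -> list:
    """Return (prompt_id, baseline pass^k, current pass^k) for prompts
    whose pass^k dropped by more than `tolerance`, largest drop first."""
    drops = []
    for prompt_id, old in baseline.items():
        new = current.get(prompt_id)
        # Skip prompts missing from the current run; only flag real drops.
        if new is not None and old - new > tolerance:
            drops.append((prompt_id, old, new))
    return sorted(drops, key=lambda d: d[1] - d[2], reverse=True)


# Made-up values for illustration only.
baseline = {"prompt-a": 0.9, "prompt-b": 0.6}
current = {"prompt-a": 0.5, "prompt-b": 0.55}
for prompt_id, old, new in reliability_regressions(baseline, current):
    print(f"{prompt_id}: pass^k {old:.2f} -> {new:.2f}")
```

Because pass^k depends on k, such a comparison is only meaningful when both runs use the same trial type.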

### Related Analysis
- **Comparison Analysis**: See `comparison.ipynb` for quality rankings
- **Custom Analysis**: Load raw JSONL files for trajectory inspection
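
As a starting point for custom analysis, the sketch below flattens a trials JSONL file into one dictionary per trial using only the standard library. It assumes the record layout used in this notebook (an `id` field and a `trials` list of per-trial objects); the helper name `iter_trials` and the added keys `prompt_id` and `trial_index` are hypothetical, and any other per-trial fields are passed through unchanged.

```python
import json
from pathlib import Path


def iter_trials(jsonl_path):
    """Yield one flat dict per trial from a trials JSONL file."""
    with Path(jsonl_path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            for index, trial in enumerate(record.get("trials", []), start=1):
                # Keep every per-trial field and tag it with its prompt and position.
                yield {**trial, "prompt_id": record.get("id"), "trial_index": index}
```

In the notebook, `pd.DataFrame(iter_trials(trials_file))` would give a per-trial table that can be filtered or grouped by `prompt_id`.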