fix(harbor): mean aggregation counts scored attempts even when harbor recorded an exception by shehabyasser-scale · Pull Request #21 · scaleapi/vero

shehabyasser-scale · 2026-07-03T12:24:46Z

Stacked on #18.

The bug

Harbor records agent timeouts and non-zero agent exits as exception_info but still runs the verifier, so such attempts carry a real, measured reward. In our 2026-06-30 GAIA ground-truth run, all 11 excepted trials out of 165 (6.7%) had verifier rewards of 0.0 (8 AgentTimeoutError + 3 NonZeroAgentExitCodeError).

The mean-of-k filter excluded any attempt with exception_info, so the score estimated P(pass | attempt finished cleanly) instead of per-attempt pass probability. That estimator is non-monotone: attempts [1.0, timeout, timeout] scored 1.0 while [timeout, timeout, timeout] scored 0.0 (both verified by executing the code). Worse, it systematically forgives candidates whose prompts make the agent slower, which is exactly the failure shape we observed in the weak-model regression trial (exp4: optimized 0.2 vs baseline 0.3). AgentTimeoutError is in harbor's default retry-exclude list, so these dirty-but-scored finals are systematic, not incidental.
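The bias described above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not the code from runner.py: attempts are modeled as hypothetical (reward, had_exception) tuples, and both function names are invented for illustration.

```python
from statistics import mean

# Each attempt is a hypothetical (reward, had_exception) pair.
def clean_only_mean(attempts):
    """Pre-fix rule: average only the attempts without an exception."""
    kept = [reward for reward, had_exception in attempts if not had_exception]
    # The real runner falls back to a best-trial path here; not modeled.
    return mean(kept) if kept else None

def per_attempt_mean(attempts):
    """Average every scored attempt, whether or not it raised."""
    return mean([reward for reward, _ in attempts])

one_pass_two_timeouts = [(1.0, False), (0.0, True), (0.0, True)]
one_pass_two_clean_fails = [(1.0, False), (0.0, False), (0.0, False)]

# The clean-only estimator rewards timing out over failing cleanly.
biased_timeouts = clean_only_mean(one_pass_two_timeouts)        # 1.0
biased_clean = clean_only_mean(one_pass_two_clean_fails)        # one pass in three
fair_timeouts = per_attempt_mean(one_pass_two_timeouts)         # one pass in three
fair_clean = per_attempt_mean(one_pass_two_clean_fails)         # one pass in three
```

Under the clean-only rule, the candidate whose failures are timeouts outranks the candidate whose failures finish cleanly, even though both passed once in three attempts.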

The fix

Every attempt with verifier rewards counts toward the mean, dirty or clean; only attempts that died before scoring (no rewards at all) are excluded. Adds n_clean to the sample metrics so dirtiness stays visible alongside n_scored/n_attempts.
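The rule can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual runner.py code: the function name aggregate_mean, the "reward" key inside rewards, the "exception_type" field, and the choice to count n_clean over scored attempts only are all hypothetical. The source confirms only that trials expose exception_info and verifier_result.rewards.

```python
from statistics import mean

REWARD_KEY = "reward"  # hypothetical key name inside verifier_result.rewards

def aggregate_mean(attempts):
    """Mean over every attempt that has verifier rewards, dirty or clean."""
    scored_trials = [
        t for t in attempts
        if (t.get("verifier_result") or {}).get("rewards")
    ]
    if not scored_trials:
        # No attempt was scored; the caller would use its fallback path.
        return None
    scored = [t["verifier_result"]["rewards"][REWARD_KEY] for t in scored_trials]
    # Assumption: n_clean counts scored attempts with no recorded exception.
    n_clean = sum(1 for t in scored_trials if not t.get("exception_info"))
    score = mean(scored)
    return {
        "score": score,
        "metrics": {
            "reward_mean": score,
            "n_attempts": len(attempts),
            "n_scored": len(scored),
            "n_clean": n_clean,
        },
    }

attempts = [
    # Clean pass.
    {"verifier_result": {"rewards": {"reward": 1.0}}},
    # Timed out, but the verifier still ran: counts toward the mean.
    {"verifier_result": {"rewards": {"reward": 0.0}},
     "exception_info": {"exception_type": "AgentTimeoutError"}},
    # Died before scoring: excluded from the mean.
    {"verifier_result": None,
     "exception_info": {"exception_type": "NonZeroAgentExitCodeError"}},
]
result = aggregate_mean(attempts)
```

Here result["score"] is 0.5 with n_attempts=3, n_scored=2, n_clean=1, so a reader can see at a glance that one of the two scored attempts was dirty.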

Tests

test_mean_counts_scored_exception_attempts: the live-GAIA shape, [1.0 clean, 0.0 timeout, 0.0 timeout] must score 1/3 (fails with 1.0 on the base branch).
test_mean_over_all_dirty_attempts: every attempt timed out but was still scored, so the mean path still applies (previously this case fell through to the best-trial fallback).
Renamed test_mean_excludes_exception_attempts to test_mean_excludes_attempts_without_rewards to match the real rule; its assertions hold under both semantics.

🤖 Generated with Claude Code

Greptile Summary

This PR fixes a statistical bias in the mean aggregation mode for Harbor multi-attempt scoring: previously, attempts with exception_info (agent timeout, non-zero exit) were excluded even when Harbor had still run the verifier and produced a scored 0.0 reward, causing the mean to estimate P(pass | attempt finished cleanly) rather than the true per-attempt pass probability.

Core fix in runner.py: The not t.get("exception_info") gate is removed from the scored-trials filter; only attempts with no rewards at all are excluded from the mean. The new n_clean metric is added alongside n_scored/n_attempts so dirty-but-scored attempts remain visible.
Three new / updated tests cover the live-GAIA failure shape ([1.0, 0.0-timeout, 0.0-timeout] → 1/3, not 1.0), the all-dirty mean path, and the correct exclusion of attempts with no rewards; the renamed test reflects the actual rule.

Confidence Score: 5/5

Safe to merge: the change is a targeted one-line filter removal plus a new observability metric, and the failure mode it fixes (systematically forgiving slow-agent candidates) is well-documented and regression-tested.

The fix is minimal and correct: it removes the exception_info gate from the scored-trials filter and adds n_clean for observability. Two new tests and one renamed test directly cover the GAIA regression shape, the all-dirty mean path, and the unchanged no-reward exclusion, leaving no untested branch in the changed logic path.

No files require special attention.

Important Files Changed

Filename	Overview
vero/src/vero/harbor/runner.py	Removes the exception_info exclusion filter from the scored-trials list; adds n_clean metric. Logic is correct and well-commented.
vero/src/vero/harbor/config.py	Docstring updated to accurately describe the new mean semantics (dirty-but-scored attempts count; only no-reward attempts are excluded).
vero/tests/test_harbor_runner.py	Adds two new regression tests and renames an existing one; assertions correctly cover the GAIA failure shape, all-dirty mean path, and no-reward exclusion.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[attempts list] --> B{attempts non-empty?}
    B -- No --> F[single best-trial path]
    B -- Yes --> C["scored_trials = attempts where\nverifier_result.rewards is truthy"]
    C --> D{scored_trials non-empty?}
    D -- No --> F
    D -- Yes --> E["scored = [_extract_reward(t) for t in scored_trials]\nn_clean = count(t for t if not t.exception_info)"]
    E --> G["SampleResult(\n  score = mean(scored),\n  metrics = {reward_mean, n_attempts, n_scored, n_clean}\n)"]

    style G fill:#d4edda,stroke:#28a745
    style F fill:#fff3cd,stroke:#ffc107



… recorded an exception

Harbor records agent timeouts / non-zero exits as exception_info but still runs the verifier, so such attempts carry a real measured 0.0 (in the 2026-06-30 GAIA run, all 11 excepted trials out of 165 had rewards). The mean filter excluded them, making the score estimate P(pass | attempt finished cleanly): non-monotone (one pass plus two timeouts scored 1.0) and systematically forgiving of candidates that make the agent slower, which is exactly the weak-model regression shape.

Now every attempt with rewards counts toward the mean, dirty or clean; only attempts that died before scoring are excluded. Adds n_clean to the sample metrics so dirtiness stays visible.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

shehabyasser-scale mentioned this pull request Jul 3, 2026

fix(harbor): reject invalid aggregate_attempts values at construction #22

Open

shehabyasser-scale force-pushed the harbor-3-mean-attempts branch from 1f8eb19 to 221c933 Compare July 3, 2026 13:01

shehabyasser-scale force-pushed the harbor-3-mean-dirty-attempts branch from 0df4854 to 4698439 Compare July 3, 2026 13:01

shehabyasser-scale wants to merge 1 commit into harbor-3-mean-attempts from harbor-3-mean-dirty-attempts
