
fix(harbor): mean aggregation counts scored attempts even when harbor recorded an exception #21

Open
shehabyasser-scale wants to merge 1 commit into harbor-3-mean-attempts from harbor-3-mean-dirty-attempts

Conversation

shehabyasser-scale (Collaborator) commented Jul 3, 2026
Stacked on #18.

The bug

Harbor records agent timeouts and non-zero agent exits as exception_info but still runs the verifier, so such attempts carry a real, measured reward. In our 2026-06-30 GAIA ground-truth run, all 11 excepted trials out of 165 (6.7%) had verifier rewards of 0.0 (8 AgentTimeoutError + 3 NonZeroAgentExitCodeError).

The mean-of-k filter excluded any attempt with exception_info, so the score estimated P(pass | attempt finished cleanly) instead of per-attempt pass probability. That estimator is non-monotone: attempts [1.0, timeout, timeout] scored 1.0 while [timeout, timeout, timeout] scored 0.0 (both verified by executing the code). Worse, it systematically forgives candidates whose prompts make the agent slower, which is exactly the failure shape we observed in the weak-model regression trial (exp4: optimized 0.2 vs baseline 0.3). AgentTimeoutError is in harbor's default retry-exclude list, so these dirty-but-scored finals are systematic, not incidental.
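The non-monotone behavior can be reproduced with a minimal sketch of the old rule. The attempt shape used here (a dict with `exception_info` and `verifier_result["rewards"]["reward"]`), the `exception_type` key, and the helper names `make_attempt` and `old_mean` are assumptions for illustration, not the actual Harbor schema or runner code.

```python
from statistics import mean


def make_attempt(reward, exception=None):
    """Build a hypothetical attempt record (assumed shape, not Harbor's real schema)."""
    return {
        "exception_info": {"exception_type": exception} if exception else None,
        "verifier_result": {"rewards": {"reward": reward}},
    }


def old_mean(attempts):
    """Pre-fix rule: average only attempts that have rewards AND no exception_info."""
    clean = [
        a["verifier_result"]["rewards"]["reward"]
        for a in attempts
        if a["verifier_result"]["rewards"] and not a.get("exception_info")
    ]
    # None stands in for "no mean available"; the PR says the runner then
    # falls back to the best-trial path.
    return mean(clean) if clean else None


one_pass = [
    make_attempt(1.0),
    make_attempt(0.0, "AgentTimeoutError"),
    make_attempt(0.0, "AgentTimeoutError"),
]
no_pass = [make_attempt(0.0, "AgentTimeoutError") for _ in range(3)]

print(old_mean(one_pass))  # 1.0: the two scored timeouts are silently dropped
print(old_mean(no_pass))   # None: nothing left to average
```

Under this sketch a single clean pass dominates the score no matter how many scored timeouts accompany it, which is the forgiveness toward slow candidates described above.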

The fix

Every attempt with verifier rewards counts toward the mean, dirty or clean; only attempts that died before scoring (no rewards at all) are excluded. Adds n_clean to the sample metrics so dirtiness stays visible alongside n_scored/n_attempts.
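A rough sketch of the new rule, under stated assumptions: the dict shape, the `reward` key, and the names `_rewards` and `aggregate_mean` are hypothetical, and `n_clean` is counted over scored attempts here, which the PR text does not spell out.

```python
from statistics import mean


def _rewards(attempt):
    """Return the rewards mapping, or None if the attempt died before scoring."""
    return (attempt.get("verifier_result") or {}).get("rewards")


def aggregate_mean(attempts):
    """Post-fix rule: every attempt with rewards counts, dirty or clean."""
    scored = [a for a in attempts if _rewards(a)]
    if not scored:
        return None  # stands in for the best-trial fallback path
    values = [_rewards(a)["reward"] for a in scored]
    return {
        "score": mean(values),
        "n_attempts": len(attempts),
        "n_scored": len(scored),
        # Assumption: n_clean counts scored attempts with no exception_info.
        "n_clean": sum(1 for a in scored if not a.get("exception_info")),
    }


timeout = {"exception_type": "AgentTimeoutError"}
attempts = [
    {"verifier_result": {"rewards": {"reward": 1.0}}, "exception_info": None},
    {"verifier_result": {"rewards": {"reward": 0.0}}, "exception_info": timeout},
    {"verifier_result": {"rewards": {"reward": 0.0}}, "exception_info": timeout},
    {"verifier_result": None, "exception_info": timeout},  # died before scoring
]

result = aggregate_mean(attempts)
print(result)
```

In this example the score is 1/3 over three scored attempts, the unscored attempt is excluded from the mean but still shows up in `n_attempts`, and `n_clean` of 1 makes the two dirty-but-scored attempts visible.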

Tests

  • test_mean_counts_scored_exception_attempts: the live-GAIA shape, [1.0 clean, 0.0 timeout, 0.0 timeout] must score 1/3 (fails with 1.0 on the base branch).
  • test_mean_over_all_dirty_attempts: when every attempt timed out but was still scored, the mean path still applies (previously this case fell through to the best-trial fallback).
  • Renamed test_mean_excludes_exception_attempts to test_mean_excludes_attempts_without_rewards to match the real rule; its assertions hold under both semantics.

🤖 Generated with Claude Code

Greptile Summary

This PR fixes a statistical bias in the mean aggregation mode for Harbor multi-attempt scoring: previously, attempts with exception_info (agent timeout, non-zero exit) were excluded even when Harbor had still run the verifier and produced a scored 0.0 reward, causing the mean to estimate P(pass | attempt finished cleanly) rather than the true per-attempt pass probability.

  • Core fix in runner.py: The not t.get("exception_info") gate is removed from the scored-trials filter; only attempts with no rewards at all are excluded from the mean. The new n_clean metric is added alongside n_scored/n_attempts so dirty-but-scored attempts remain visible.
  • Three new / updated tests cover the live-GAIA failure shape ([1.0, 0.0-timeout, 0.0-timeout] → 1/3, not 1.0), the all-dirty mean path, and the correct exclusion of attempts with no rewards; the renamed test reflects the actual rule.

Confidence Score: 5/5

Safe to merge: the change is a targeted one-line filter removal with a new observability metric, and the failure mode it fixes (systematically forgiving slow-agent candidates) is well-documented and regression-tested.

The fix is minimal and correct: it removes the exception_info gate from the scored-trials filter and adds n_clean for observability. The three tests (two new, one renamed) directly cover the GAIA regression shape, the all-dirty mean path, and the unchanged no-reward exclusion, leaving no untested branch in the changed logic path.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| vero/src/vero/harbor/runner.py | Removes the exception_info exclusion filter from the scored-trials list; adds n_clean metric. Logic is correct and well-commented. |
| vero/src/vero/harbor/config.py | Docstring updated to accurately describe the new mean semantics (dirty-but-scored attempts count; only no-reward attempts are excluded). |
| vero/tests/test_harbor_runner.py | Adds two new regression tests and renames an existing one; assertions correctly cover the GAIA failure shape, all-dirty mean path, and no-reward exclusion. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[attempts list] --> B{attempts non-empty?}
    B -- No --> F[single best-trial path]
    B -- Yes --> C["scored_trials = attempts where\nverifier_result.rewards is truthy"]
    C --> D{scored_trials non-empty?}
    D -- No --> F
    D -- Yes --> E["scored = [_extract_reward(t) for t in scored_trials]\nn_clean = count(t for t if not t.exception_info)"]
    E --> G["SampleResult(\n  score = mean(scored),\n  metrics = {reward_mean, n_attempts, n_scored, n_clean}\n)"]

    style G fill:#d4edda,stroke:#28a745
    style F fill:#fff3cd,stroke:#ffc107

Reviews (2): Last reviewed commit: "fix(harbor): mean aggregation counts sco..."

… recorded an exception

Harbor records agent timeouts / non-zero exits as exception_info but still
runs the verifier, so such attempts carry a real measured 0.0 (in the
2026-06-30 GAIA run, all 11 excepted trials out of 165 had rewards). The
mean filter excluded them, making the score estimate
P(pass | attempt finished cleanly): non-monotone (one pass plus two
timeouts scored 1.0) and systematically forgiving of candidates that make
the agent slower, which is exactly the weak-model regression shape.

Now every attempt with rewards counts toward the mean, dirty or clean;
only attempts that died before scoring are excluded. Adds n_clean to the
sample metrics so dirtiness stays visible.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
shehabyasser-scale force-pushed the harbor-3-mean-dirty-attempts branch from 0df4854 to 4698439 on July 3, 2026 13:01