feat(harbor): mean-of-k attempt aggregation for de-noised eval scores #18

Open
shehabyasser-scale wants to merge 1 commit into harbor-3-collate-mismatch-guard from harbor-3-mean-attempts

shehabyasser-scale (Collaborator) commented Jul 3, 2026

Stacked on #16. This is the fix implied by the optimization campaign's central finding. Across 5 valid GAIA trials (strong and weak inner models, prompt-only and full-code surfaces, aggregate and per-task feedback), the effect size of an edit (~1 task) sat below single-rollout eval noise (~2 tasks in 12), so optimizers hill-climbed on noise. One trial's optimizer re-measured its own leading commit and watched the score fall from 0.667 to 0.5 on identical code.
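
As a rough sanity check on the noise figure, the sketch below assumes (this is not stated in the PR) that each of the 12 tasks passes independently with the same probability p. Under that assumption the number of tasks solved is binomial, and its standard deviation can be computed directly. The helper name is hypothetical.

```python
import math


def rollout_noise_in_tasks(n_tasks: int, p: float, k: int = 1) -> float:
    """Std. dev., in tasks, of the solved count averaged over k rollouts.

    Assumes independent Bernoulli(p) tasks and independent rollouts.
    Hypothetical helper for illustration; not part of the PR.
    """
    return math.sqrt(n_tasks * p * (1.0 - p) / k)


single = rollout_noise_in_tasks(12, 0.5)         # sqrt(3), about 1.73 tasks
averaged = rollout_noise_in_tasks(12, 0.5, k=4)  # half of that, about 0.87 tasks
print(f"k=1: {single:.2f} tasks, k=4: {averaged:.2f} tasks")
```

With p = 0.5 the single-rollout figure lands near the "~2 tasks in 12" quoted above, which is larger than a one-task edit effect. Averaging four rollouts brings the noise below one task under the same assumptions.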

With n_attempts > 1, the existing dedup keeps the best trial per task, which inflates the score toward pass@k, the opposite of de-noising. This PR adds HarborConfig.aggregate_attempts with two values:

  • best (default): existing behavior, unchanged
  • mean: average reward over all clean scored attempts (a verified 0.0 counts; an exception does not). Noise shrinks ~1/sqrt(k), and the score estimates the task's pass probability.

In mean mode, a task falls back to the best trial when no attempt scored clean. 3 new tests; all 17 tests pass.
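
To illustrate why best-of-k drifts toward pass@k while the mean does not, here is a minimal sketch. It assumes binary rewards and independent attempts with a fixed per-attempt pass probability p; both assumptions and both function names are illustrative, not taken from the PR.

```python
def expected_best_of_k(p: float, k: int) -> float:
    """Expected task score when the best of k attempts is kept.

    With binary rewards, the best attempt passes if any attempt passes,
    which is the pass@k probability.
    """
    return 1.0 - (1.0 - p) ** k


def expected_mean_of_k(p: float, k: int) -> float:
    """Expected task score when the k attempt rewards are averaged.

    The sample mean is an unbiased estimate of p for any k.
    """
    return p


for k in (1, 2, 4):
    print(k, expected_best_of_k(0.5, k), expected_mean_of_k(0.5, k))
```

For p = 0.5 the best-of-k expectation climbs from 0.5 at k = 1 to 0.9375 at k = 4, while the mean stays at 0.5. More attempts therefore raise a best-of-k score without any change to the code under test.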

🤖 Generated with Claude Code

Greptile Summary

Adds aggregate_attempts="mean" to HarborConfig and implements it in HarborRunner._collate / _sample_result. When enabled, a new _trial_groups scan collects all per-task trial dicts, and the scorer averages the reward over every clean (no-exception, has-rewards) attempt, so eval noise scales as ~1/√k. When no attempt scored clean, it falls back to the existing single-best-trial path.
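
The scoring rule described above might look roughly like the following sketch. The trial dict shape (a "reward" float and an "exception_info" field) and the function names are assumptions made for illustration, and "best" is assumed to mean the highest clean reward. The actual logic lives in HarborRunner._sample_result and may differ.

```python
from typing import Any, Optional

# Assumed trial shape: {"reward": float | None, "exception_info": Any}
Trial = dict[str, Any]


def clean_rewards(trials: list[Trial]) -> list[float]:
    """Rewards from attempts that raised no exception and produced a score.

    A reward of 0.0 is kept: it is a verified failure, not a missing value.
    """
    return [
        t["reward"]
        for t in trials
        if not t.get("exception_info") and t.get("reward") is not None
    ]


def task_score(trials: list[Trial], aggregate_attempts: str = "best") -> Optional[float]:
    """Score one task; None means nothing scored clean and the caller
    should fall back to its existing best-trial path."""
    rewards = clean_rewards(trials)
    if not rewards:
        return None
    if aggregate_attempts == "mean":
        return sum(rewards) / len(rewards)
    return max(rewards)


attempts = [
    {"reward": 1.0, "exception_info": None},
    {"reward": 0.0, "exception_info": None},
    {"reward": None, "exception_info": {"type": "TimeoutError"}},
]
print(task_score(attempts, "mean"), task_score(attempts, "best"))
```

The comparison against None rather than a truthiness check is the detail that lets a verified 0.0 count toward the mean while the errored attempt is excluded.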

  • config.py: one new field (aggregate_attempts: str = "best") with inline documentation explaining the de-noising rationale.
  • runner.py: _trial_groups method mirrors _load_trials but retains all trials per task (not just best-ranked); mean branch in _sample_result filters on not exception_info and has rewards, then averages; fallback to best-trial is preserved.
  • test_harbor_runner.py: three new tests for mean averaging, exception exclusion, and unchanged default-best path; 17 tests total.
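
The difference between keeping only the best-ranked trial and retaining every trial per task can be sketched as below. The "task" key, the ranking rule (clean trials first, then higher reward), and all function names are hypothetical; they are not taken from _load_trials or _trial_groups.

```python
from collections import defaultdict
from typing import Any

# Assumed trial shape: {"task": str, "reward": float | None, "exception_info": Any}
Trial = dict[str, Any]


def _rank(trial: Trial) -> tuple[bool, float]:
    """Assumed ranking: clean trials beat errored ones, then higher reward wins."""
    clean = not trial.get("exception_info") and trial.get("reward") is not None
    return (clean, trial["reward"] if clean else float("-inf"))


def best_per_task(trials: list[Trial]) -> dict[str, Trial]:
    """Dedup that keeps a single best-ranked trial per task."""
    best: dict[str, Trial] = {}
    for trial in trials:
        task = trial["task"]
        if task not in best or _rank(trial) > _rank(best[task]):
            best[task] = trial
    return best


def group_by_task(trials: list[Trial]) -> dict[str, list[Trial]]:
    """Grouping that retains every trial per task, as mean mode requires."""
    groups: dict[str, list[Trial]] = defaultdict(list)
    for trial in trials:
        groups[trial["task"]].append(trial)
    return dict(groups)


trials = [
    {"task": "a", "reward": 0.0, "exception_info": None},
    {"task": "a", "reward": 1.0, "exception_info": None},
    {"task": "b", "reward": None, "exception_info": {"type": "TimeoutError"}},
]
print(best_per_task(trials)["a"]["reward"], len(group_by_task(trials)["a"]))
```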

Confidence Score: 5/5

Safe to merge — the mean aggregation path is well-isolated behind the new config field, the default behavior is unchanged, and the score calculation is correct.

The default aggregate_attempts='best' path is untouched and all existing tests still pass. The new mean path correctly filters out exception-carrying trials and falls back to best-trial when nothing scored clean.

No files require special attention; runner.py carries the core logic change and is worth a quick re-read.

Important Files Changed

| Filename | Overview |
| --- | --- |
| vero/src/vero/harbor/config.py | Adds aggregate_attempts: str = "best" field to HarborConfig; plain str with no enum constraint (silent-misconfiguration concern already flagged in a previous thread). |
| vero/src/vero/harbor/runner.py | Adds _trial_groups and mean aggregation branch in _sample_result; score logic is correct, though n_attempts metric counts all trial JSON files including retry artifacts rather than config.n_attempts. |
| vero/tests/test_harbor_runner.py | Three new tests covering mean averaging, exception exclusion, and default-best path; does not cover the all-exceptions fallback path. |
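
The silent-misconfiguration concern noted for config.py is the kind of issue that is commonly handled by validating the field at construction time. The sketch below shows one such approach on a stand-in dataclass. HarborConfigSketch is hypothetical and is not the real HarborConfig, whose base class and other fields are not shown in this PR page.

```python
from dataclasses import dataclass

_VALID_AGGREGATIONS = ("best", "mean")


@dataclass
class HarborConfigSketch:
    """Hypothetical stand-in for HarborConfig, for illustration only."""

    n_attempts: int = 1
    aggregate_attempts: str = "best"

    def __post_init__(self) -> None:
        # Reject typos such as "avg" instead of silently using another mode.
        if self.aggregate_attempts not in _VALID_AGGREGATIONS:
            raise ValueError(
                f"aggregate_attempts must be one of {_VALID_AGGREGATIONS}, "
                f"got {self.aggregate_attempts!r}"
            )


config = HarborConfigSketch(n_attempts=4, aggregate_attempts="mean")
print(config)
```

With a plain str and no check, a misspelled value would presumably fall through to whichever branch the runner treats as the default, which is the failure mode the review comment points at.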

Reviews (2): Last reviewed commit: "feat(harbor): mean-of-k attempt aggregat..."

Comment thread vero/src/vero/harbor/config.py
The central finding of a 6-experiment optimization campaign: on GAIA,
single-rollout eval noise (~2 tasks in 12) exceeds the effect size of a
prompt/code edit (~1 task), so optimizers hill-climb on noise, and the
existing best-of-k trial dedup INFLATES scores toward pass@k rather than
reducing variance.

Adds HarborConfig.aggregate_attempts = 'best' (default; existing
behavior unchanged) | 'mean'. In mean mode, a task's score is the
average reward over all clean scored attempts (a verified 0.0 is a
valid measurement; an exception is not), shrinking noise ~1/sqrt(k) and
estimating pass probability. Falls back to the best trial when nothing
scored clean, so error surfacing is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>