feat(harbor): mean-of-k attempt aggregation for de-noised eval scores by shehabyasser-scale · Pull Request #18 · scaleapi/vero

shehabyasser-scale · 2026-07-03T10:21:18Z

Stacked on #16. This is the fix implied by the optimization campaign's central finding. Across 5 valid GAIA trials (strong and weak inner models, prompt-only and full-code surfaces, aggregate and per-task feedback), the effect size of an edit (~1 task) sat below single-rollout eval noise (~2 tasks in 12), so optimizers hill-climbed on noise. One trial's optimizer re-measured its own leading commit and watched the score fall from 0.667 to 0.5 on identical code.
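As a rough sanity check on the "~2 tasks in 12" figure, the sketch below computes the binomial standard deviation of a single-rollout score. It assumes 12 independent tasks that each pass with probability 0.5; that pass rate and the independence are illustrative assumptions, not measurements from the campaign, and `score_std_in_tasks` is a hypothetical helper name.

```python
import math


def score_std_in_tasks(n_tasks: int, p: float) -> float:
    """Std dev of the number of tasks passed in one rollout.

    Assumes each task is an independent Bernoulli(p) draw (illustrative only).
    """
    return math.sqrt(n_tasks * p * (1.0 - p))


noise = score_std_in_tasks(n_tasks=12, p=0.5)
print(f"single-rollout noise: about {noise:.2f} tasks out of 12")
```

Under these assumptions the noise is about 1.73 tasks, the same order as the reported ~2. The observed drop from 0.667 to 0.5 is 8/12 to 6/12, a 2-task swing, which is consistent with noise of that size.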

With n_attempts > 1, the existing dedup keeps the best trial per task, which inflates the score toward pass@k, the opposite of de-noising. This PR adds HarborConfig.aggregate_attempts with two modes:
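To illustrate why keeping the best of k attempts inflates the score, the sketch below compares the expected per-task score under both aggregations. It assumes binary rewards and independent attempts with a fixed pass probability p; the function names are hypothetical and not part of the PR.

```python
def expected_best_of_k(p: float, k: int) -> float:
    """Expected score when a task counts as passed if any of k attempts pass (pass@k)."""
    return 1.0 - (1.0 - p) ** k


def expected_mean_of_k(p: float, k: int) -> float:
    """Expected score when averaging k attempts: the mean is unbiased for p."""
    return p


p = 0.5
for k in (1, 2, 3, 5):
    print(f"k={k}  best={expected_best_of_k(p, k):.3f}  mean={expected_mean_of_k(p, k):.3f}")
```

With p = 0.5 and k = 3, best-of-k reports 0.875 in expectation while the mean stays at 0.5, so raising k under best-of-k moves the score upward instead of making it more stable.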

best (default): existing behavior, unchanged
mean: average reward over all clean scored attempts (a verified 0.0 counts; an exception does not). Noise shrinks ~1/sqrt(k) and the score estimates pass probability.

Falls back to the best trial when nothing scored clean. 3 new tests; 17 pass.
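The two modes and the fallback can be sketched as follows. This is a minimal illustration, not the code in runner.py: the trial dict shape (`reward`, `exception_info`), the function name `aggregate_task`, and the choice of "highest reward" as the best trial are all assumptions made for the example.

```python
from statistics import fmean
from typing import Any, Optional

Trial = dict[str, Any]  # hypothetical shape: {"reward": float | None, "exception_info": str | None}


def aggregate_task(trials: list[Trial], mode: str = "best") -> Optional[float]:
    """Aggregate one task's attempts into a single score."""
    # A clean attempt has a reward and no recorded exception.
    clean = [
        t for t in trials
        if t.get("exception_info") is None and t.get("reward") is not None
    ]
    if mode == "mean" and clean:
        # A verified 0.0 is a valid measurement and is included in the average.
        return fmean(t["reward"] for t in clean)
    # "best" mode, or "mean" with no clean attempt: fall back to the best trial.
    # Assumption: "best" means highest reward among trials that have one.
    scored = [t["reward"] for t in trials if t.get("reward") is not None]
    return max(scored) if scored else None


attempts = [
    {"reward": 1.0, "exception_info": None},
    {"reward": 0.0, "exception_info": None},
    {"reward": None, "exception_info": "TimeoutError"},
]
print(aggregate_task(attempts, "mean"))  # averages the two clean attempts
print(aggregate_task(attempts, "best"))  # keeps the highest-scoring attempt
```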

🤖 Generated with Claude Code

Greptile Summary

Adds aggregate_attempts="mean" to HarborConfig and implements it in HarborRunner._collate / _sample_result. When enabled, a new _trial_groups scan collects all per-task trial dicts and the scorer averages the reward over every clean (no-exception, has-rewards) attempt, shrinking eval noise by ~1/√k; falls back to the existing single-best-trial path when no attempt scored clean.

config.py: one new field (aggregate_attempts: str = "best") with inline documentation explaining the de-noising rationale.
runner.py: _trial_groups method mirrors _load_trials but retains all trials per task (not just best-ranked); mean branch in _sample_result filters on not exception_info and has rewards, then averages; fallback to best-trial is preserved.
test_harbor_runner.py: three new tests for mean averaging, exception exclusion, and unchanged default-best path; 17 tests total.
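The grouping step described for _trial_groups (keep every trial per task instead of only the best-ranked one) might look like the sketch below. The `task_name` key and the function name `group_trials_by_task` are hypothetical; the real method reads trial data from disk in a layout this page does not show.

```python
from collections import defaultdict
from typing import Any


def group_trials_by_task(trials: list[dict[str, Any]]) -> dict[str, list[dict[str, Any]]]:
    """Collect all trials for each task, preserving input order within a task."""
    groups: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for trial in trials:
        # Assumption: each trial dict carries its task identifier under "task_name".
        groups[trial["task_name"]].append(trial)
    return dict(groups)


trials = [
    {"task_name": "gaia-001", "reward": 1.0},
    {"task_name": "gaia-002", "reward": 0.0},
    {"task_name": "gaia-001", "reward": 0.0},
]
grouped = group_trials_by_task(trials)
print({name: len(items) for name, items in grouped.items()})
```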

Confidence Score: 5/5

Safe to merge — the mean aggregation path is well-isolated behind the new config field, the default behavior is unchanged, and the score calculation is correct.

The default aggregate_attempts='best' path is untouched and all existing tests still pass. The new mean path correctly filters out exception-carrying trials and falls back to best-trial when nothing scored clean.

No files require special attention; runner.py carries the core logic change and is worth a quick re-read.

Important Files Changed

| Filename | Overview |
| --- | --- |
| vero/src/vero/harbor/config.py | Adds `aggregate_attempts: str = "best"` field to HarborConfig; plain str with no enum constraint (silent-misconfiguration concern already flagged in a previous thread). |
| vero/src/vero/harbor/runner.py | Adds `_trial_groups` and mean aggregation branch in `_sample_result`; score logic is correct, though `n_attempts` metric counts all trial JSON files including retry artifacts rather than `config.n_attempts`. |
| vero/tests/test_harbor_runner.py | Three new tests covering mean averaging, exception exclusion, and default-best path; does not cover the all-exceptions fallback path. |
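One way to address the silent-misconfiguration concern would be to reject unknown values when the config is constructed. The sketch below assumes a plain dataclass; how HarborConfig is actually defined is not shown on this page, so `ExampleConfig` is a hypothetical stand-in rather than the real class.

```python
from dataclasses import dataclass

ALLOWED_AGGREGATIONS = ("best", "mean")


@dataclass
class ExampleConfig:
    """Hypothetical stand-in for HarborConfig with a validated field."""

    aggregate_attempts: str = "best"

    def __post_init__(self) -> None:
        # Fail loudly on typos such as "avg" instead of silently using one mode.
        if self.aggregate_attempts not in ALLOWED_AGGREGATIONS:
            raise ValueError(
                f"aggregate_attempts must be one of {ALLOWED_AGGREGATIONS}, "
                f"got {self.aggregate_attempts!r}"
            )


print(ExampleConfig(aggregate_attempts="mean"))
```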

Reviews (2): Last reviewed commit: "feat(harbor): mean-of-k attempt aggregat..."

The central finding of a 6-experiment optimization campaign: on GAIA, single-rollout eval noise (~2 tasks in 12) exceeds the effect size of a prompt/code edit (~1 task), so optimizers hill-climb on noise, and the existing best-of-k trial dedup INFLATES scores toward pass@k rather than reducing variance.

Adds HarborConfig.aggregate_attempts = 'best' (default; existing behavior unchanged) | 'mean'. In mean mode, a task's score is the average reward over all clean scored attempts (a verified 0.0 is a valid measurement; an exception is not), shrinking noise ~1/sqrt(k) and estimating pass probability. Falls back to the best trial when nothing scored clean, so error surfacing is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
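The ~1/sqrt(k) claim follows from the standard error of a mean of independent draws. The sketch below evaluates it for binary rewards, assuming independent attempts with pass probability p = 0.5 (an illustrative value); `mean_of_k_std` is a hypothetical helper name.

```python
import math


def mean_of_k_std(p: float, k: int) -> float:
    """Std dev of one task's mean-of-k score for independent Bernoulli(p) attempts."""
    return math.sqrt(p * (1.0 - p) / k)


baseline = mean_of_k_std(0.5, 1)
for k in (1, 2, 4, 8):
    std = mean_of_k_std(0.5, k)
    print(f"k={k}  per-task std={std:.3f}  relative to k=1: {std / baseline:.3f}")
```

Under these assumptions, k = 4 halves the per-task noise, and each further halving needs four times as many attempts, so the cost of de-noising grows quadratically with the precision gained.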

greptile-apps Bot reviewed Jul 3, 2026


Comment thread vero/src/vero/harbor/config.py

shehabyasser-scale mentioned this pull request Jul 3, 2026

fix(harbor): mean aggregation counts scored attempts even when harbor recorded an exception #21

Open

shehabyasser-scale force-pushed the harbor-3-mean-attempts branch from 1f8eb19 to 221c933 Compare July 3, 2026 13:01

shehabyasser-scale wants to merge 1 commit into harbor-3-collate-mismatch-guard from harbor-3-mean-attempts
