feat(harbor): mean-of-k attempt aggregation for de-noised eval scores #18
Open
shehabyasser-scale wants to merge 1 commit into
The central finding of a 6-experiment optimization campaign: on GAIA, single-rollout eval noise (~2 tasks in 12) exceeds the effect size of a prompt/code edit (~1 task), so optimizers hill-climb on noise, and the existing best-of-k trial dedup INFLATES scores toward pass@k rather than reducing variance.

Adds HarborConfig.aggregate_attempts = 'best' (default; existing behavior unchanged) | 'mean'. In mean mode, a task's score is the average reward over all clean scored attempts (a verified 0.0 is a valid measurement; an exception is not), shrinking noise ~1/sqrt(k) and estimating pass probability. Falls back to the best trial when nothing scored clean, so error surfacing is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
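To make the "inflates toward pass@k" point concrete, here is a minimal sketch of what each aggregation estimates. It assumes binary rewards and independent attempts with a fixed per-task pass probability `p`; those assumptions and the function names are illustrative and are not taken from the PR's code.

```python
import math


def best_of_k_expected(p: float, k: int) -> float:
    """Expected best-of-k score for a binary-reward task: pass@k = 1 - (1 - p)^k."""
    return 1.0 - (1.0 - p) ** k


def mean_of_k_expected(p: float, k: int) -> float:
    """Expected mean-of-k score: equals p for any k (unbiased for pass probability)."""
    return p


def mean_of_k_std(p: float, k: int) -> float:
    """Std dev of the mean of k independent Bernoulli(p) attempts: sqrt(p(1-p)/k)."""
    return math.sqrt(p * (1.0 - p) / k)


if __name__ == "__main__":
    p = 0.5  # assumed per-task pass probability, for illustration only
    for k in (1, 2, 4, 8):
        print(
            f"k={k}: best-of-k expects {best_of_k_expected(p, k):.3f}, "
            f"mean-of-k expects {mean_of_k_expected(p, k):.3f} "
            f"(std {mean_of_k_std(p, k):.3f})"
        )
```

Under these assumptions, the best-of-k expectation climbs toward 1.0 as k grows while the mean stays at `p` and only its spread shrinks, which is the distinction the commit message draws.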
1f8eb19 to 221c933
Stacked on #16. This is the fix implied by the optimization campaign's central finding: across 5 valid GAIA trials (strong/weak inner models, prompt-only and full-code surfaces, aggregate and per-task feedback), the effect size of an edit (~1 task) sat below single-rollout eval noise (~2 tasks in 12), so optimizers hill-climbed on noise; one trial's optimizer re-measured its own leading commit and watched 0.667 fall to 0.5 on identical code.
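A rough back-of-the-envelope check of the noise figure, as a sketch only: it models each of the 12 tasks as an independent pass/fail draw with the same probability `p`, which is an assumption of this example and not something the campaign measured. The 0.667 and 0.5 scores are the ones quoted above; everything else is illustrative.

```python
import math

N_TASKS = 12  # task count quoted in the description


def score_std_in_tasks(p: float, n_tasks: int = N_TASKS, k: int = 1) -> float:
    """Std dev of the total score, in task units, when each task is a mean of k attempts."""
    return math.sqrt(n_tasks * p * (1.0 - p) / k)


# The re-measurement quoted above, converted from score fractions to task counts.
swing_tasks = round(0.667 * N_TASKS) - round(0.5 * N_TASKS)

if __name__ == "__main__":
    print(f"observed swing on identical code: {swing_tasks} tasks")
    for k in (1, 2, 4):
        print(f"k={k}: modeled std = {score_std_in_tasks(0.5, k=k):.2f} tasks")
```

With `p = 0.5` the modeled single-rollout spread is about 1.7 tasks, in line with the ~2-task noise reported, and it drops below the ~1-task edit effect only once several attempts per task are averaged.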
With `n_attempts > 1`, the existing dedup keeps the best trial per task, which inflates toward pass@k, the opposite of de-noising. New `HarborConfig.aggregate_attempts`:

- `best` (default): existing behavior, unchanged
- `mean`: average reward over all clean scored attempts (a verified 0.0 counts; an exception does not); noise shrinks ~1/sqrt(k) and the score estimates pass probability

Falls back to the best trial when nothing scored clean. 3 new tests; 17 pass.
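The two modes can be sketched as follows. This is not the code in `runner.py`: the trial dict shape (`reward`, `exception_info` keys), the `is_clean` and `aggregate` helpers, and the use of `max` as a stand-in for the existing best-trial ranking are all assumptions made for illustration.

```python
from statistics import mean
from typing import Any, Optional

# Hypothetical trial shape: {"reward": float | None, "exception_info": str | None}
Trial = dict[str, Any]


def is_clean(trial: Trial) -> bool:
    """A clean scored attempt has no exception and carries a reward (0.0 is valid)."""
    return not trial.get("exception_info") and trial.get("reward") is not None


def aggregate(trials: list[Trial], mode: str = "best") -> Optional[float]:
    """Return one score per task, or None when no attempt scored clean.

    None signals the caller to fall back to its best-trial path, so the
    error from a failed attempt is still surfaced.
    """
    rewards = [float(t["reward"]) for t in trials if is_clean(t)]
    if not rewards:
        return None
    if mode == "mean":
        return mean(rewards)
    if mode == "best":
        return max(rewards)  # stand-in for the existing best-trial dedup
    raise ValueError(f"unknown aggregation mode: {mode!r}")
```

For example, attempts scoring 1.0, 0.0, and one timeout would give 0.5 in `mean` mode and 1.0 in `best` mode; the timeout is excluded from the average, while the verified 0.0 is kept.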
🤖 Generated with Claude Code
Greptile Summary
Adds `aggregate_attempts="mean"` to `HarborConfig` and implements it in `HarborRunner._collate` / `_sample_result`. When enabled, a new `_trial_groups` scan collects all per-task trial dicts and the scorer averages the reward over every clean (no-exception, has-rewards) attempt, shrinking eval noise by ~1/√k; falls back to the existing single-best-trial path when no attempt scored clean.

- `config.py`: one new field (`aggregate_attempts: str = "best"`) with inline documentation explaining the de-noising rationale.
- `runner.py`: `_trial_groups` method mirrors `_load_trials` but retains all trials per task (not just best-ranked); mean branch in `_sample_result` filters on `not exception_info and has rewards`, then averages; fallback to best-trial is preserved.
- `test_harbor_runner.py`: three new tests for mean averaging, exception exclusion, and unchanged default-best path; 17 tests total.

Confidence Score: 5/5
Safe to merge — the mean aggregation path is well-isolated behind the new config field, the default behavior is unchanged, and the score calculation is correct.
The default aggregate_attempts='best' path is untouched and all existing tests still pass. The new mean path correctly filters out exception-carrying trials and falls back to best-trial when nothing scored clean.
No files require special attention; runner.py carries the core logic change and is worth a quick re-read.
Important Files Changed
- `config.py`: Adds `aggregate_attempts: str = "best"` field to `HarborConfig`; plain `str` with no enum constraint (silent-misconfiguration concern already flagged in a previous thread).
- `runner.py`: Adds `_trial_groups` and mean aggregation branch in `_sample_result`; score logic is correct, though the `n_attempts` metric counts all trial JSON files, including retry artifacts, rather than `config.n_attempts`.

Reviews (2): Last reviewed commit: "feat(harbor): mean-of-k attempt aggregat..."
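One possible way to address the silent-misconfiguration concern is to validate the field at construction time. The sketch below uses a hypothetical `HarborConfigSketch` dataclass; only the `aggregate_attempts` and `n_attempts` field names and the `"best"` default come from this PR, and the real `HarborConfig` may be defined differently.

```python
from dataclasses import dataclass

# The two modes named in the PR description.
_ALLOWED_AGGREGATIONS = ("best", "mean")


@dataclass
class HarborConfigSketch:
    """Hypothetical stand-in for HarborConfig, reduced to the fields discussed here."""

    n_attempts: int = 1
    aggregate_attempts: str = "best"

    def __post_init__(self) -> None:
        # Reject typos such as "avg" up front; otherwise an unrecognized value
        # could silently behave like the default.
        if self.aggregate_attempts not in _ALLOWED_AGGREGATIONS:
            raise ValueError(
                f"aggregate_attempts must be one of {_ALLOWED_AGGREGATIONS}, "
                f"got {self.aggregate_attempts!r}"
            )
```

With this check, `HarborConfigSketch(aggregate_attempts="avg")` raises a `ValueError` at construction, so the mistake surfaces before any eval run starts.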