fix(harbor): reject invalid aggregate_attempts values at construction #22
Open
shehabyasser-scale wants to merge 1 commit into
Only the exact string 'mean' switches collation to mean-of-k; any other
value silently fell through to best-of-k, so a config typo ('Mean',
'avg') would run an experiment with inflated pass@k scores while looking
de-noised. HarborConfig now raises at construction, which surfaces the
mistake at sidecar startup instead of after the trial's budget is spent.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Stacked on #21.
Only the exact string `mean` switches collation to mean-of-k; any other value silently fell through to best-of-k. A config typo (`Mean`, `avg`) would therefore run an experiment with inflated pass@k scores while looking de-noised, and nothing would ever error. This is the same silent-config-failure class that invalidated a full TB2 trial in our campaign (bare task names, #16/#17), so it fails loudly now.

`HarborConfig.__post_init__` raises `ValueError` for values outside `{best, mean}`. The only dynamic construction site is `HarborConfig(**config.harbor)` at sidecar startup (`serve.py`), so a bad baked config now kills the sidecar at boot with a clear message instead of after the trial's budget is spent.

Tests: invalid value raises; valid values round-trip.
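For readers without the diff open, the guard described above could look roughly like the following. This is a minimal sketch, not the actual `config.py`: the real `HarborConfig` has other fields that are omitted here, and the `load_harbor_config` helper and the `_VALID_AGGREGATE_ATTEMPTS` constant are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass

# Assumption: the allowed set is exactly {"best", "mean"}, as stated in the PR.
_VALID_AGGREGATE_ATTEMPTS = ("best", "mean")


@dataclass
class HarborConfig:
    """Illustrative stand-in; the real class carries more fields."""

    aggregate_attempts: str = "best"

    def __post_init__(self) -> None:
        # Runs right after the generated __init__, so every construction
        # path is validated, including HarborConfig(**some_dict).
        if self.aggregate_attempts not in _VALID_AGGREGATE_ATTEMPTS:
            raise ValueError(
                "aggregate_attempts must be 'best' or 'mean', "
                f"got {self.aggregate_attempts!r}"
            )


def load_harbor_config(raw: dict) -> HarborConfig:
    """Hypothetical helper mirroring HarborConfig(**config.harbor) at startup."""
    return HarborConfig(**raw)


if __name__ == "__main__":
    print(load_harbor_config({"aggregate_attempts": "mean"}))
    try:
        load_harbor_config({"aggregate_attempts": "Mean"})
    except ValueError as exc:
        print(f"boot would fail: {exc}")
```

The comparison is deliberately case-sensitive, so `Mean` is rejected rather than normalized; that matches the PR's intent of surfacing typos instead of guessing what was meant.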
🤖 Generated with Claude Code
Greptile Summary
This PR adds a `__post_init__` guard to `HarborConfig` that raises `ValueError` for any `aggregate_attempts` value outside the allowed set `{"best", "mean"}`, closing a silent-failure path where a typo would cause an experiment to run best-of-k scoring while appearing de-noised.

- `config.py`: `__post_init__` validates `aggregate_attempts` at dataclass construction time with a clear error message, so a bad baked config kills the sidecar at boot rather than silently producing inflated scores.
- `test_harbor_runner.py`: New `TestAggregateAttemptsValidation` class covers invalid value rejection and round-trip acceptance of both valid values.

Confidence Score: 5/5
Safe to merge — the change is a targeted input guard on a single dataclass field with no side effects on existing valid configurations.
The default value 'best' passes the new guard, so all existing callers are unaffected. The only dynamic construction site (HarborConfig(**config.harbor) in serve.py) will now fail at sidecar boot with a clear message if a baked config contains an invalid value, which is exactly the intended behavior. The validation logic, error message, and tests are all correct and complete.
No files require special attention.
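The two test cases mentioned in the summary (invalid value raises, valid values round-trip) might be shaped like this. It is a sketch under assumptions: the class name `TestAggregateAttemptsValidation` comes from the summary, but the method names, the use of `unittest`, and the inline stand-in for `HarborConfig` are illustrative and not taken from `test_harbor_runner.py`.

```python
import unittest
from dataclasses import dataclass


@dataclass
class HarborConfig:
    """Stand-in so the example runs on its own; not the real config class."""

    aggregate_attempts: str = "best"

    def __post_init__(self) -> None:
        if self.aggregate_attempts not in ("best", "mean"):
            raise ValueError(
                "aggregate_attempts must be 'best' or 'mean', "
                f"got {self.aggregate_attempts!r}"
            )


class TestAggregateAttemptsValidation(unittest.TestCase):
    def test_invalid_value_raises(self):
        # Typos and near-misses must fail at construction time.
        for bad in ("Mean", "avg", ""):
            with self.subTest(value=bad):
                with self.assertRaises(ValueError):
                    HarborConfig(aggregate_attempts=bad)

    def test_valid_values_round_trip(self):
        # Both allowed values are stored unchanged.
        for good in ("best", "mean"):
            with self.subTest(value=good):
                cfg = HarborConfig(aggregate_attempts=good)
                self.assertEqual(cfg.aggregate_attempts, good)


if __name__ == "__main__":
    unittest.main()
```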
Important Files Changed
Flowchart
    %%{init: {'theme': 'neutral'}}%%
    flowchart TD
        A["HarborConfig(**config.harbor)"] --> B["__post_init__"]
        B --> C{aggregate_attempts\nin 'best', 'mean'?}
        C -- Yes --> D["Config object ready\n(sidecar boot continues)"]
        C -- No --> E["ValueError raised\n'aggregate_attempts must be best or mean, got ...'"]
        E --> F["Sidecar exits at boot\n(loud failure, clear message)"]
        D --> G["runner uses aggregate_attempts\nto select scoring mode"]
        G --> H{"== 'mean'?"}
        H -- Yes --> I["De-noised mean-of-k scoring"]
        H -- No --> J["Best-of-k scoring"]

Reviews (2): Last reviewed commit: "fix(harbor): reject invalid aggregate_at..."
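To make the runner branch at the bottom of the flowchart concrete, here is a small sketch of why the old fall-through inflated scores. The `collate` function and the per-attempt score list are hypothetical; the real runner's collation code is not shown in this PR, and only the `== 'mean'` check is taken from the description.

```python
from statistics import mean


def collate(scores: list[float], aggregate_attempts: str) -> float:
    """Hypothetical collation step: only the exact string 'mean' averages."""
    if aggregate_attempts == "mean":
        return mean(scores)
    # Before the guard, ANY other string (including typos) landed here.
    return max(scores)


if __name__ == "__main__":
    attempts = [1.0, 0.0, 0.0, 0.0]  # one pass out of k=4 attempts
    print(collate(attempts, "mean"))  # 0.25
    print(collate(attempts, "best"))  # 1.0
    print(collate(attempts, "Mean"))  # 1.0, the typo silently scored as best-of-k
```

With validation at construction, the third call can no longer be reached with a typo'd mode, because the config carrying `"Mean"` is rejected before the runner starts.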