fix(harbor): refuse to score all-zero when nested trials match no task names by shehabyasser-scale · Pull Request #16 · scaleapi/vero

shehabyasser-scale · 2026-07-03T10:15:57Z

Stacked on #9. Found live: a TB2 optimization trial scored 0.0 on every sample, baseline-passing anchors included, because the vero dataset stored bare task names (break-filter-js-from-html) while harbor records the canonical form (terminal-bench/break-filter-js-from-html). The nested runs executed fine; _collate matched nothing, silently wrote an all-zero experiment, burned budget, and produced a spurious reward. GAIA never hit this only because its dataset names happen to carry the gaia/ prefix.
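To make the failure mode concrete, here is a minimal sketch of the keying mismatch. The trial payload and the fall-back-to-0.0 scoring are assumptions for illustration only; the real `_collate` records missing tasks differently, but the lookup miss is the same.

```python
# Hypothetical trial map, keyed by harbor's canonical "<org>/<name>" form.
trials = {"terminal-bench/break-filter-js-from-html": {"reward": 1.0}}

# The vero dataset stored the bare task name instead.
requested = ["break-filter-js-from-html"]

# A plain dict lookup misses, so the sample silently falls back to 0.0
# even though the nested run passed (assumed fallback, for illustration).
scores = {
    name: (trials.get(name) or {}).get("reward", 0.0)
    for name in requested
}
```

The resulting score map looks exactly like a run where the agent failed the task, which is why the mismatch went unnoticed.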

Silent all-zeros from a keying mismatch are indistinguishable from "the agent failed everything", which corrupts optimization results invisibly. _collate now raises when tasks were just run and either no trials were produced at all, or trials exist but none match (message shows both name forms). Partial matches and resume-with-nothing-pending keep existing behavior.
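The guard described above could be sketched roughly as follows. This is not the code in `runner.py`: `check_ran_matches` is a hypothetical standalone helper, and the message wording is an assumption; only the three behaviors (skip when nothing ran, raise on no trials, raise on zero matches) come from the description.

```python
def check_ran_matches(trials: dict[str, object], ran: list[str]) -> None:
    """Raise if freshly run tasks yielded nothing usable (hypothetical helper)."""
    if not ran:
        # Resume with nothing pending: the guard is skipped entirely.
        return
    if not trials:
        raise RuntimeError(
            f"nested run produced no trial results for {len(ran)} task(s)"
        )
    if not any(name in trials for name in ran):
        # Show one name of each form so the keying mismatch is obvious.
        raise RuntimeError(
            "no trial matches the requested task names: "
            f"requested e.g. {ran[0]!r}, trials recorded e.g. {next(iter(trials))!r}"
        )
    # At least one match: partial matches fall through to existing behavior.
```

Running the check before the per-sample loop means nothing is written to disk when the keying is completely wrong.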

4 new tests; 14 pass. Companion PR (build-time name validation) coming separately.

🤖 Generated with Claude Code

Greptile Summary

This PR adds a guard in _collate to raise RuntimeError when a nested harbor run either produced no trial results or produced results whose task names match none of the requested ones — preventing silent all-zero scoring from a key-naming mismatch (bare task names vs. canonical org/name form).

- runner.py: _collate gains an optional ran: list[str] parameter; when non-empty, two new pre-loop checks raise before writing any results to disk if trials are missing or if none of the ran task names appear in the trial map. produce_sample_results passes ran=[t for _, t in pending] so only freshly-executed tasks trigger the guard.
- test_harbor_runner.py: Four new tests in TestCollateMismatchGuard cover: full mismatch raises, no-trials raises, partial match proceeds, and resume-with-nothing-ran skips the guard.
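As a rough illustration of how `ran` is derived, the sketch below filters already-completed samples and keeps only the task names that were just executed. The helper name `pending_task_names` and the shape of the `is_done` callback are assumptions; only the `[t for _, t in pending]` expression comes from the summary above.

```python
from typing import Callable


def pending_task_names(
    pairs: list[tuple[str, str]],
    is_done: Callable[[str], bool],
) -> list[str]:
    """Return task names for samples that still need to run (hypothetical helper)."""
    # Keep only (sample_id, task_name) pairs that are not already complete.
    pending = [(sample_id, task) for sample_id, task in pairs if not is_done(sample_id)]
    # These are the freshly executed tasks that should trigger the guard.
    return [t for _, t in pending]
```

On a resume where every sample is already done, the list is empty, so the guard never fires.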

Confidence Score: 5/5

Safe to merge — the guard is additive and only fires when a run produced results that match none of the requested task names, a case that previously silently corrupted optimization scores.

The change is narrowly scoped: a pre-loop check in _collate that raises before writing any results when the task-name keying is completely wrong or the run produced nothing. All four branches of the guard are exercised by the new tests, and the existing test suite continues to pass unchanged.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| vero/src/vero/harbor/runner.py | Adds `ran` guard to `_collate` that raises before any results are written when all task names are mismatched or no trials exist; logic is correct and handles all edge cases (empty ran, partial match, resume). |
| vero/tests/test_harbor_runner.py | Four new tests cover the guard's four branches (full mismatch, no-trials, partial match, resume-noop); all use proper monkeypatching of VERO_HOME_DIR or _is_done to avoid real filesystem reads. |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant PS as produce_sample_results
    participant RH as _run_harbor
    participant C as _collate
    participant LT as _load_trials

    PS->>PS: build pending (filter _is_done)
    alt pending non-empty
        PS->>RH: run harbor CLI with pending task names
    end
    PS->>C: "_collate(jobs_dir, pairs, params, ran=pending_names)"
    C->>LT: load trials from jobs_dir
    LT-->>C: "trials dict {task_name: result}"
    alt ran is non-empty (new guard)
        alt no trials at all
            C-->>PS: RuntimeError(no trial results)
        else trials exist but none match ran
            C-->>PS: RuntimeError(none match requested task names)
        end
    end
    loop each (sample_id, task_name) in pairs
        C->>C: _is_done? skip if resume
        C->>C: _sample_result(trials.get(task_name))
        C->>C: save_sample_result
    end

Reviews (2): Last reviewed commit: "test(harbor): isolate VERO_HOME_DIR in t..."

…k names

Found live: a TB2 optimization trial recorded 0.0 for every sample, baseline-passing anchors included, because the dataset stored bare task names while harbor records the canonical '<org>/<name>' form. The nested runs executed fine; _collate simply matched nothing and wrote an all-zero experiment that was indistinguishable from total agent failure. It burned the trial's budget and produced a spurious reward before anyone noticed.

_collate now raises when tasks were just run and either (a) the nested job produced no trial results at all, or (b) trials exist but none match the requested names (a keying mismatch, with both name forms in the message). Partial matches keep the existing behavior (missing tasks are recorded as error samples), and resume calls with nothing pending skip the guard.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

shehabyasser-scale mentioned this pull request Jul 3, 2026

feat(harbor): validate partition task names against task_source at build time #17

Open

greptile-apps Bot reviewed Jul 3, 2026

Comment thread vero/tests/test_harbor_runner.py

This was referenced Jul 3, 2026

feat(harbor): mean-of-k attempt aggregation for de-noised eval scores #18

Open

fix(harbor): reject invalid aggregate_attempts values at construction #22

Open

fix(harbor): salvage completed trials when the nested run times out #23

Open

test(harbor): isolate VERO_HOME_DIR in the partial-match collate test

bc453ee

Greptile on #16: _is_done fell back to the real home directory; a leftover matching session there would silently skip samples.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

shehabyasser-scale wants to merge 2 commits into harbor-3-compiler-fixes from harbor-3-collate-mismatch-guard
