Skip to content

fix(harbor): refuse to score all-zero when nested trials match no task names#16

Open
shehabyasser-scale wants to merge 2 commits into
harbor-3-compiler-fixesfrom
harbor-3-collate-mismatch-guard
Open

fix(harbor): refuse to score all-zero when nested trials match no task names#16
shehabyasser-scale wants to merge 2 commits into
harbor-3-compiler-fixesfrom
harbor-3-collate-mismatch-guard

Conversation

@shehabyasser-scale

@shehabyasser-scale shehabyasser-scale commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator

Stacked on #9. Found live: a TB2 optimization trial scored 0.0 on every sample, baseline-passing anchors included, because the vero dataset stored bare task names (break-filter-js-from-html) while harbor records the canonical form (terminal-bench/break-filter-js-from-html). The nested runs executed fine; _collate matched nothing, silently wrote an all-zero experiment, burned budget, and produced a spurious reward. GAIA never hit this only because its dataset names happen to carry the gaia/ prefix.

Silent all-zeros from a keying mismatch are indistinguishable from "the agent failed everything", which corrupts optimization results invisibly. _collate now raises when tasks were just run and either no trials were produced at all, or trials exist but none match (message shows both name forms). Partial matches and resume-with-nothing-pending keep existing behavior.

4 new tests; 14 pass. Companion PR (build-time name validation) coming separately.

🤖 Generated with Claude Code

Greptile Summary

This PR adds a guard in _collate to raise RuntimeError when a nested harbor run either produced no trial results or produced results whose task names match none of the requested ones — preventing silent all-zero scoring from a key-naming mismatch (bare task names vs. canonical org/name form).

  • runner.py: _collate gains an optional ran: list[str] parameter; when non-empty, two new pre-loop checks raise before writing any results to disk if trials are missing or if none of the ran task names appear in the trial map. produce_sample_results passes ran=[t for _, t in pending] so only freshly-executed tasks trigger the guard.
  • test_harbor_runner.py: Four new tests in TestCollateMismatchGuard cover: full mismatch raises, no-trials raises, partial match proceeds, and resume-with-nothing-ran skips the guard.

Confidence Score: 5/5

Safe to merge — the guard is additive and only fires when a run produced results that match none of the requested task names, a case that previously silently corrupted optimization scores.

The change is narrowly scoped: a pre-loop check in _collate that raises before writing any results when the task-name keying is completely wrong or the run produced nothing. All four branches of the guard are exercised by the new tests, and the existing test suite continues to pass unchanged.

No files require special attention.

Important Files Changed

Filename Overview
vero/src/vero/harbor/runner.py Adds ran guard to _collate that raises before any results are written when all task names are mismatched or no trials exist; logic is correct and handles all edge cases (empty ran, partial match, resume).
vero/tests/test_harbor_runner.py Four new tests cover the guard's four branches (full mismatch, no-trials, partial match, resume-noop); all use proper monkeypatching of VERO_HOME_DIR or _is_done to avoid real filesystem reads.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant PS as produce_sample_results
    participant RH as _run_harbor
    participant C as _collate
    participant LT as _load_trials

    PS->>PS: build pending (filter _is_done)
    alt pending non-empty
        PS->>RH: run harbor CLI with pending task names
    end
    PS->>C: "_collate(jobs_dir, pairs, params, ran=pending_names)"
    C->>LT: load trials from jobs_dir
    LT-->>C: "trials dict {task_name: result}"
    alt ran is non-empty (new guard)
        alt no trials at all
            C-->>PS: RuntimeError(no trial results)
        else trials exist but none match ran
            C-->>PS: RuntimeError(none match requested task names)
        end
    end
    loop each (sample_id, task_name) in pairs
        C->>C: _is_done? skip if resume
        C->>C: _sample_result(trials.get(task_name))
        C->>C: save_sample_result
    end
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant PS as produce_sample_results
    participant RH as _run_harbor
    participant C as _collate
    participant LT as _load_trials

    PS->>PS: build pending (filter _is_done)
    alt pending non-empty
        PS->>RH: run harbor CLI with pending task names
    end
    PS->>C: "_collate(jobs_dir, pairs, params, ran=pending_names)"
    C->>LT: load trials from jobs_dir
    LT-->>C: "trials dict {task_name: result}"
    alt ran is non-empty (new guard)
        alt no trials at all
            C-->>PS: RuntimeError(no trial results)
        else trials exist but none match ran
            C-->>PS: RuntimeError(none match requested task names)
        end
    end
    loop each (sample_id, task_name) in pairs
        C->>C: _is_done? skip if resume
        C->>C: _sample_result(trials.get(task_name))
        C->>C: save_sample_result
    end
Loading

Reviews (2): Last reviewed commit: "test(harbor): isolate VERO_HOME_DIR in t..." | Re-trigger Greptile

…k names

Found live: a TB2 optimization trial recorded 0.0 for every sample,
baseline-passing anchors included, because the dataset stored bare task
names while harbor records the canonical '<org>/<name>' form. The nested
runs executed fine; _collate simply matched nothing and wrote an all-zero
experiment that was indistinguishable from total agent failure. It burned
the trial's budget and produced a spurious reward before anyone noticed.

_collate now raises when tasks were just run and either (a) the nested
job produced no trial results at all, or (b) trials exist but none match
the requested names (a keying mismatch, with both name forms in the
message). Partial matches keep the existing behavior (missing tasks are
recorded as error samples), and resume calls with nothing pending skip
the guard.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread vero/tests/test_harbor_runner.py
Greptile on #16: _is_done fell back to the real home directory; a
leftover matching session there would silently skip samples.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant