feat(eval): corpus #19 — 3 v0.2 tracked-refs tasks (DRAFT spec)#57

Merged: w1ne merged 2 commits into develop from feat/corpus-v0.2-tracked-refs on May 3, 2026

w1ne (Owner) commented on May 3, 2026

Summary

First slice of workstream #19 (corpus expansion) per the v0.2-to-v1.0 gap-closure roadmap. Adds 3 v0.2 tracked-refs corpus tasks under eval/tasks/, each exercising a natural-form case that requires v0.2 capability (canonical face refs surviving transforms or unambiguous booleans):

  • fillet-translated-box: box(...).translate(...).fillet(r, { face: 'top' })
  • subtract-then-fillet-rim: box(...).subtract(cyl).fillet(r, { face: 'top' }) (the canonical "subtract.fillet" case from the gap-closure roadmap)
  • chamfer-rotated-wedge: box(...).rotate([1,0,0], θ).chamfer(d, { face: 'top' })
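
To illustrate why the transform cases above need tracked refs, here is a toy model of a face label that survives a translate or rotate. It is a sketch under stated assumptions, not the kernelcad implementation: `ToySolid`, `faceByRef`, and `faceByMaxZ` are hypothetical names, and the boolean (subtract) case is not modeled.

```typescript
// Toy model only: NOT the kernelcad API. Every name here is hypothetical.
type Vec3 = [number, number, number];

interface Face {
  ref: string;   // canonical label assigned when the solid is created
  normal: Vec3;  // outward unit normal in world space
}

class ToySolid {
  readonly faces: Face[];

  constructor(faces: Face[]) {
    this.faces = faces;
  }

  static box(): ToySolid {
    return new ToySolid([
      { ref: "top", normal: [0, 0, 1] },
      { ref: "bottom", normal: [0, 0, -1] },
      { ref: "front", normal: [0, -1, 0] },
      { ref: "back", normal: [0, 1, 0] },
      { ref: "left", normal: [-1, 0, 0] },
      { ref: "right", normal: [1, 0, 0] },
    ]);
  }

  // Translation changes positions, not normals, so faces carry over as-is.
  translate(_offset: Vec3): ToySolid {
    return new ToySolid(this.faces.map((f) => ({ ...f })));
  }

  // Rotation about the X axis by theta radians: refs are kept, normals rotate.
  rotateX(theta: number): ToySolid {
    const c = Math.cos(theta);
    const s = Math.sin(theta);
    return new ToySolid(
      this.faces.map((f) => {
        const [x, y, z] = f.normal;
        return { ref: f.ref, normal: [x, c * y - s * z, s * y + c * z] as Vec3 };
      }),
    );
  }

  // Tracked lookup: follow the label regardless of current orientation.
  faceByRef(ref: string): Face | undefined {
    return this.faces.find((f) => f.ref === ref);
  }

  // Purely geometric lookup: whichever face currently points most toward +Z.
  faceByMaxZ(): Face {
    return this.faces.reduce((best, f) => (f.normal[2] > best.normal[2] ? f : best));
  }
}

const tilted = ToySolid.box().rotateX((2 * Math.PI) / 3);
console.log(tilted.faceByRef("top")?.ref); // prints "top"
console.log(tilted.faceByMaxZ().ref);      // prints "back"
```

After a 120° rotation about X, the geometric lookup picks the face that was originally `back`, while the label lookup still returns the face created as `top`. That divergence is the kind of ambiguity a `{ face: 'top' }` selector has to resolve after a transform.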

Each task ships with prompt.md, solution-expert.kcad.ts, and harness.ts. New eval/corpus-v0.2.test.ts mirrors the seed-task sanity pattern (eval/runner.test.ts) and asserts each expert solution scores 100% via runTask + MockAgentClient.
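
The sanity pattern described above can be sketched roughly as follows. The real `runTask` and `MockAgentClient` signatures are not shown in this PR, so `runTaskSketch`, `ReplayClient`, `Task`, and the rule "gate passes when every check passes" are all assumptions made for illustration. Only the result field names `gate_pass` and `score` come from the test plan.

```typescript
// Hypothetical stand-ins: not the repository's runTask / MockAgentClient.
interface AgentClient {
  complete(prompt: string): Promise<string>;
}

// Replays a fixed answer, the way a mock client could return the expert solution.
class ReplayClient implements AgentClient {
  private readonly answer: string;

  constructor(answer: string) {
    this.answer = answer;
  }

  async complete(_prompt: string): Promise<string> {
    return this.answer;
  }
}

interface TaskResult {
  gate_pass: boolean;
  score: number; // fraction of checks passed, 0..1
}

interface Task {
  name: string;
  prompt: string;
  // Stand-in for harness.ts: each check inspects the submitted source text.
  checks: Array<(source: string) => boolean>;
}

async function runTaskSketch(task: Task, client: AgentClient): Promise<TaskResult> {
  const source = await client.complete(task.prompt);
  const passed = task.checks.filter((check) => check(source)).length;
  const score = task.checks.length === 0 ? 0 : passed / task.checks.length;
  // Assumption: the gate is simply "all checks pass".
  return { gate_pass: score === 1, score };
}

// Usage: replay the natural-form expression quoted in the summary.
const filletTask: Task = {
  name: "fillet-translated-box",
  prompt: "Fillet the top face of a translated box.",
  checks: [
    (src) => src.includes(".translate("),
    (src) => src.includes(".fillet("),
    (src) => src.includes("face: 'top'"),
  ],
};

const expert = "box(...).translate(...).fillet(r, { face: 'top' })";
runTaskSketch(filletTask, new ReplayClient(expert)).then((result) => {
  console.log(result); // prints { gate_pass: true, score: 1 }
});
```

A real harness would evaluate geometry rather than match substrings. The string checks here only keep the sketch self-contained.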

Spec + plan are marked DRAFT v1 in their bodies (docs/superpowers/specs/2026-05-03-corpus-v0.2-tracked-refs-design.md, docs/superpowers/plans/2026-05-03-corpus-v0.2-tracked-refs.md). The brainstorm step was skipped — the gap-closure roadmap text was specific enough that the spec is a focused operationalization rather than open exploration. Open questions for controller review are listed in the spec's "Open questions" section (3 vs 5 tasks, difficulty band assignments, tilt-correctness rubric tightness, naming).

Why this slice

The dispatch order doc (PR #56) calls for Wave 1 = #19 + #21 + #22 in parallel. This is the first concrete #19 deliverable: the v0.2 module's 3-task slice (range "3–5" per the gap-closure roadmap §Corpus design). Three tasks rather than five keeps the slice reviewable; the missing 0–2 are punted to a follow-up pending controller feedback on the rubric shape.

Test Plan

  • npm run typecheck — clean
  • npm run build:cli — clean
  • npm test — 903 pass, 25 skipped, 0 failed (was 870 before; +3 corpus tests + a few transitive expansions)
  • Each expert solution evaluates clean via kernelcad evaluate --json
  • Each expert solution scores gate_pass=true, score=1 via runTask+MockAgentClient
  • Branch rebased on develop tip (3a309be)

Scope notes

Included: 3 task folders + 1 vitest sanity check + CHANGELOG [Unreleased] entry. package-lock.json updated to reflect the v0.2.0 version bump (the lockfile was stale at 0.1.0 in the worktree until npm install ran).
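
A stale lockfile like the one mentioned above can be caught with a small check. This is a minimal sketch, not something this PR adds: `lockfileVersionMismatches` is a hypothetical helper, and it assumes an npm lockfile that records the project version both at the top level and under `packages[""]`.

```typescript
// Minimal shapes for the fields read below; real files carry many more keys.
interface PackageJson {
  version: string;
}

interface PackageLock {
  version?: string;
  packages?: Record<string, { version?: string }>;
}

// Returns one message per mismatch; an empty list means the lockfile agrees.
function lockfileVersionMismatches(pkg: PackageJson, lock: PackageLock): string[] {
  const problems: string[] = [];
  if (lock.version !== undefined && lock.version !== pkg.version) {
    problems.push(`lock.version is ${lock.version}, expected ${pkg.version}`);
  }
  // The entry keyed by the empty string describes the root project.
  const root = lock.packages?.[""];
  if (root?.version !== undefined && root.version !== pkg.version) {
    problems.push(`lock.packages[""].version is ${root.version}, expected ${pkg.version}`);
  }
  return problems;
}

// Usage with the versions named in the scope note.
const stale = lockfileVersionMismatches(
  { version: "0.2.0" },
  { version: "0.1.0", packages: { "": { version: "0.1.0" } } },
);
console.log(stale.length); // prints 2
```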

Out of scope:

  • Real-API agent runs (corpus-author-time concern; lives at G1 pre-flight per the roadmap).
  • Difficulty-band wiring beyond per-task self-classification (no harness behavior depends on bands yet).
  • Tasks for v0.3+ modules (those modules haven't shipped).

Companion PR

Depends on PR #56 landing (the gap-closure scaffolding) for the corpus-design spec it references. If #56 is held for review, this PR can stand alone — it doesn't import any of #56's files at runtime.

w1ne force-pushed the feat/corpus-v0.2-tracked-refs branch from c3e2f7c to e0ef382 on May 3, 2026 09:33
w1ne enabled auto-merge (squash) on May 3, 2026 09:33
w1ne merged commit c19f938 into develop on May 3, 2026
2 checks passed