feat(eval): corpus #19 — 3 v0.2 tracked-refs tasks (DRAFT spec) by w1ne · Pull Request #57 · w1ne/kernelCAD-web

w1ne · 2026-05-03T01:33:54Z

Summary

First slice of workstream #19 (corpus expansion) per the v0.2-to-v1.0 gap-closure roadmap. Adds 3 v0.2 tracked-refs corpus tasks under eval/tasks/, each exercising a natural-form case that requires v0.2 capability (canonical face refs surviving transforms or unambiguous booleans):

fillet-translated-box — box(...).translate(...).fillet(r, { face: 'top' })
subtract-then-fillet-rim — box(...).subtract(cyl).fillet(r, { face: 'top' }) (the canonical "subtract.fillet" case from the gap-closure roadmap)
chamfer-rotated-wedge — box(...).rotate([1,0,0], θ).chamfer(d, { face: 'top' })
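
To illustrate what "canonical face refs surviving transforms" means in the three tasks above, here is a minimal toy sketch. It is not the kernelCAD API: `ToySolid`, `toyBox`, `rotateX`, and `face` are hypothetical names, the box dimensions are made up, and only the top and bottom faces are modeled. The idea it shows is that a face label assigned at creation travels with the face through a translate or rotate, rather than being re-resolved from the geometry afterwards.

```typescript
// Toy model only. None of these names are kernelCAD's; they are hypothetical.
type Vec3 = [number, number, number];

interface Face {
  ref: string;    // canonical label assigned when the solid is created
  centroid: Vec3;
  normal: Vec3;
}

const add = (a: Vec3, b: Vec3): Vec3 => [a[0] + b[0], a[1] + b[1], a[2] + b[2]];

class ToySolid {
  constructor(readonly faces: ReadonlyArray<Face>) {}

  // Translation shifts centroids; refs and normals are carried over unchanged.
  translate(d: Vec3): ToySolid {
    return new ToySolid(
      this.faces.map((f) => ({ ...f, centroid: add(f.centroid, d) })),
    );
  }

  // Rotation about the X axis by theta radians; each ref follows its face.
  rotateX(theta: number): ToySolid {
    const c = Math.cos(theta);
    const s = Math.sin(theta);
    const rot = ([x, y, z]: Vec3): Vec3 => [x, c * y - s * z, s * y + c * z];
    return new ToySolid(
      this.faces.map((f) => ({ ...f, centroid: rot(f.centroid), normal: rot(f.normal) })),
    );
  }

  // Look a face up by its tracked label, not by its current orientation.
  face(ref: string): Face {
    const hit = this.faces.find((f) => f.ref === ref);
    if (!hit) throw new Error(`no face with ref "${ref}"`);
    return hit;
  }
}

// Box centered at the origin; only two faces are modeled, for brevity.
function toyBox(w: number, d: number, h: number): ToySolid {
  return new ToySolid([
    { ref: "top", centroid: [0, 0, h / 2], normal: [0, 0, 1] },
    { ref: "bottom", centroid: [0, 0, -h / 2], normal: [0, 0, -1] },
  ]);
}

const moved = toyBox(10, 10, 4).translate([5, 0, 0]);
console.log(moved.face("top").centroid); // [5, 0, 2]

// After a quarter turn the "top" face no longer points up, but the ref still
// resolves to the same face, which is what a follow-up chamfer would need.
const tilted = toyBox(10, 10, 4).rotateX(Math.PI / 2);
console.log(tilted.face("top").normal);
```

In this toy model, resolving `'top'` after the rotation returns the face that was on top when the box was created, which is the behavior the chamfer-rotated-wedge task appears to rely on.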

Each task ships with prompt.md, solution-expert.kcad.ts, and harness.ts. New eval/corpus-v0.2.test.ts mirrors the seed-task sanity pattern (eval/runner.test.ts) and asserts each expert solution scores 100% via runTask + MockAgentClient.
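
The sanity pattern described above (feed the expert solution through a mock agent and expect a perfect score) can be sketched roughly as follows. The real `runTask` and `MockAgentClient` signatures are not shown in this PR description, so the versions below are hypothetical stand-ins defined locally; the `Task` and `Check` shapes and the string-based checks are likewise assumptions made for illustration. Only the result field names `gate_pass` and `score` come from the Test Plan.

```typescript
// Hypothetical stand-ins: the repo's runTask / MockAgentClient may differ.
interface AgentClient {
  complete(prompt: string): string; // kept synchronous for brevity
}

// A mock agent that always answers with a canned solution.
class MockAgentClient implements AgentClient {
  constructor(private readonly canned: string) {}
  complete(_prompt: string): string {
    return this.canned;
  }
}

interface Check {
  name: string;
  pass: (source: string) => boolean;
}

interface Task {
  id: string;
  prompt: string;
  gate: Check;     // must pass before any scoring happens
  checks: Check[]; // each passing check contributes equally to the score
}

interface TaskResult {
  gate_pass: boolean;
  score: number; // fraction of checks passed, in [0, 1]
}

function runTask(task: Task, client: AgentClient): TaskResult {
  const source = client.complete(task.prompt);
  const gate_pass = task.gate.pass(source);
  if (!gate_pass) return { gate_pass, score: 0 };
  const passed = task.checks.filter((c) => c.pass(source)).length;
  const score = task.checks.length === 0 ? 1 : passed / task.checks.length;
  return { gate_pass, score };
}

// Illustrative task; the checks are simple string probes, not a real harness.
const filletTranslatedBox: Task = {
  id: "fillet-translated-box",
  prompt: "Fillet the top face of a translated box.",
  gate: { name: "non-empty", pass: (s) => s.trim().length > 0 },
  checks: [
    { name: "uses translate", pass: (s) => s.includes(".translate(") },
    { name: "fillets by face ref", pass: (s) => s.includes("face: 'top'") },
  ],
};

// The expert solution shape as written in the PR summary, elisions included.
const expert = "box(...).translate(...).fillet(r, { face: 'top' })";
const result = runTask(filletTranslatedBox, new MockAgentClient(expert));
console.log(result); // { gate_pass: true, score: 1 }
```

A test in this style guards the corpus itself: if an expert solution stops scoring 1, either the task's checks or the underlying capability has regressed.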

Spec + plan are marked DRAFT v1 in their bodies (docs/superpowers/specs/2026-05-03-corpus-v0.2-tracked-refs-design.md, docs/superpowers/plans/2026-05-03-corpus-v0.2-tracked-refs.md). The brainstorm step was skipped — the gap-closure roadmap text was specific enough that the spec is a focused operationalization rather than open exploration. Open questions for controller review are listed in the spec's "Open questions" section (3 vs 5 tasks, difficulty band assignments, tilt-correctness rubric tightness, naming).

Why this slice

The dispatch order doc (PR #56) calls for Wave 1 = #19 + #21 + #22 in parallel. This is the first concrete #19 deliverable: the v0.2 module's 3-task slice (range "3–5" per the gap-closure roadmap §Corpus design). Three tasks rather than five keeps the slice reviewable; the missing 0–2 are punted to a follow-up pending controller feedback on the rubric shape.

Test Plan

npm run typecheck — clean
npm run build:cli — clean
npm test — 903 pass, 25 skipped, 0 failed (was 870 before; +3 corpus tests + a few transitive expansions)
Each expert solution evaluates clean via kernelcad evaluate --json
Each expert solution scores gate_pass=true, score=1 via runTask+MockAgentClient
Branch rebased on develop tip (3a309be)

Scope notes

Included: 3 task folders + 1 vitest sanity check + CHANGELOG [Unreleased] entry. package-lock.json updated to reflect the v0.2.0 version bump (the lockfile was stale at 0.1.0 in the worktree until npm install ran).

Out of scope:

Real-API agent runs (corpus-author-time concern; lives at G1 pre-flight per the roadmap).
Difficulty-band wiring beyond per-task self-classification (no harness behavior depends on bands yet).
Tasks for v0.3+ modules (those modules haven't shipped).

Companion PR

References the corpus-design spec added in PR #56 (the gap-closure scaffolding). The dependency is documentation-only: if #56 is held for review, this PR can stand alone, since it doesn't import any of #56's files at runtime.

w1ne added 2 commits May 3, 2026 11:33

docs: corpus expansion #19 v0.2 tracked-refs spec + plan (DRAFT)

95a5fa7

feat(eval): 3 v0.2 tracked-refs corpus tasks

e0ef382

w1ne force-pushed the feat/corpus-v0.2-tracked-refs branch from c3e2f7c to e0ef382 Compare May 3, 2026 09:33

w1ne enabled auto-merge (squash) May 3, 2026 09:33

w1ne merged commit c19f938 into develop May 3, 2026
2 checks passed
