feat: v1 finalize Track C — debt sweep (7 ACs)#73
Merged
Closed
theagenticguy pushed a commit that referenced this pull request on May 10, 2026
## Track C — debt sweep (7 ACs)
Closes the debt-sweep leg of v1-finalize per
`.erpaval/specs/006-v1-finalize/`.
Spec: [spec.md§Track C](.erpaval/specs/006-v1-finalize/spec.md).
ADR:
[0014-scip-references-and-embedder-fingerprint](docs/adr/0014-scip-references-and-embedder-fingerprint.md).
PR-split: A → C → B → D ordering — this is leg 3 of 4.
### What changes
- **AC-C-1 parse-cache LRU eviction** — `evictIfOverCap(cacheDir,
capBytes)` lists shards, sorts mtime-asc, deletes oldest until ≤0.9×cap.
Wired into `writeCacheEntry` post-write; gated on
`CODEHUB_PARSE_CACHE_MAX_BYTES` (default 1 GiB; 0 disables). Eviction
errors swallowed silently — cache failure is never fatal. JSDoc at
`content-cache.ts:133` updated to point at the new helper. +12 tests.
- **AC-C-2 stringArrayField round-trip symmetry** — `stringArrayOrNull`
writer + 3 readers (`duckdb-adapter.ts:setStringArrayField`,
`graphdb-adapter.ts:setStringArrayFieldGd`,
`analyze.ts:stringArrayField`) now preserve `[]` distinct from
`undefined`. The columns are already typed arrays (DuckDB `TEXT[]`,
GraphDb `STRING[]`); the fix removes the `length > 0` coalescing. New
`medium-with-empty-keywords` parity fixture + a difference assertion
that proves `graphHash({keywords: []}) ≠ graphHash({})`. DuckDB binder
needed an explicit `LIST(VARCHAR)` type-hint to bind empty arrays —
caught and fixed alongside.
- **AC-C-3 SageMaker rebuild-on-switch refusal** — `embedder_model_id
TEXT` column added to `store_meta` (DuckDB) + StoreMeta NODE TABLE
(GraphDb) via append-only DDL + `ALTER TABLE IF NOT EXISTS` migration.
`Store.{getMeta,setMeta}` round-trip the field.
`assertEmbedderCompatible` lives in
`@opencodehub/embedder/fingerprint.ts`; CLI `runQuery` exits 2 with the
frozen remediation hint, MCP `runQuery` returns a new
`EMBEDDER_MISMATCH` envelope, both honor `--force-backend-mismatch` /
`force_backend_mismatch`. graphHash invariant unaffected (store_meta is
not part of the hash). +5 fingerprint tests.
- **AC-C-4 openDefaultEmbedder factory consolidation** — new
`packages/embedder/src/factory.ts` exports `openDefaultEmbedder({
allowOnnxFallback?: boolean })`. Replaces the duplicated 6-line block at
`packages/cli/src/commands/query.ts:122-127` and
`packages/mcp/src/tools/query.ts:453-458`. Ingestion's fuller variant
(offline flag + ONNX variant + pool + canary) intentionally diverges
with a one-line comment pointing at the factory. +4 tests covering
HTTP-priority + ONNX-fallback + EmbedderNotSetupError + ONNX-failure
branches.
- **AC-C-5 SCIP REFERENCES + TYPE_OF emission** — `TYPE_OF` appended at
position 25 of `RelationType` union, `RELATION_TYPES` array,
`ALL_RELATION_TYPES` (DuckDB), and `RELATION_KINDS` (GraphDb) per the
append-only rule. `deriveEdges` widens to emit `REFERENCES` for non-call
non-DEF non-IMPORT occurrences whose enclosing scope is function-like.
New `emitRelations` sibling in `scip-index.ts` consumes
`derived.relations` and writes IMPLEMENTS + TYPE_OF graph edges via the
same `symbolDef`-resolved caller→callee join shape (`+1` boundary
translation per `scip-0-indexed-vs-graph-1-indexed.md`). Existing
`incremental-determinism.test.ts` is self-consistent (asserts cross-run
hash stability, not against a frozen golden) so no fixture file regen is
needed; the first SCIP re-index after merge produces the documented
one-time content delta. Large parity fixture auto-extends from 24 → 25
edge kinds via `getAllRelationTypes()`.
- **AC-C-6 four READMEs** —
`packages/{cli,mcp,ingestion,scanners}/README.md` (62-80 lines each)
following the `packages/policy/README.md` template (Surface / table /
Design). Root README cross-links updated. scanners README cites the
20-scanner P1+P2 breakdown post-Track-B.
- **AC-C-7 .gitmodules debt closed as stale** — file was removed when
`packages/gym` moved to `opencodehub-testbed` (commit 378f79f).
`.erpaval/debt.md` updated to status `CLOSED-STALE`.
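The eviction policy described for AC-C-1 can be sketched as follows. This is a minimal illustration, not the code in `content-cache.ts`: the `Shard` type, the `selectEvictions` helper, and the assumption that shards sit as plain files directly under `cacheDir` are all hypothetical. Only the function name `evictIfOverCap`, the oldest-first ordering, the 0.9×cap target, the "0 disables" rule, and the swallowed errors come from the description above.

```typescript
import { readdir, stat, unlink } from "node:fs/promises";
import { join } from "node:path";

// Hypothetical shape for one cache shard on disk.
interface Shard {
  path: string;
  size: number;
  mtimeMs: number;
}

// Pure selection step: pick oldest shards until the remaining total
// is at or below 0.9 * capBytes. Returns the paths to delete.
function selectEvictions(shards: readonly Shard[], capBytes: number): string[] {
  if (capBytes <= 0) return []; // a cap of 0 disables eviction
  let total = shards.reduce((sum, shard) => sum + shard.size, 0);
  if (total <= capBytes) return []; // under the cap: nothing to do
  const target = capBytes * 0.9;
  const victims: string[] = [];
  for (const shard of [...shards].sort((a, b) => a.mtimeMs - b.mtimeMs)) {
    if (total <= target) break;
    victims.push(shard.path);
    total -= shard.size;
  }
  return victims;
}

// I/O wrapper. Assumes a flat directory of shard files (an assumption,
// the real layout may be nested).
async function evictIfOverCap(cacheDir: string, capBytes: number): Promise<void> {
  try {
    const shards: Shard[] = [];
    for (const name of await readdir(cacheDir)) {
      const path = join(cacheDir, name);
      const info = await stat(path);
      if (info.isFile()) {
        shards.push({ path, size: info.size, mtimeMs: info.mtimeMs });
      }
    }
    for (const path of selectEvictions(shards, capBytes)) {
      await unlink(path);
    }
  } catch {
    // Swallowed on purpose: a cache failure is never fatal.
  }
}
```

Keeping the selection step pure makes the 0.9×cap arithmetic testable without touching the filesystem; evicting below the cap rather than exactly to it avoids re-triggering eviction on the very next write.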
### AC summary (Track C — 7 of 7)
| AC | What |
|---|---|
| C-1 | parse-cache LRU eviction (env-gated, default 1 GiB) |
| C-2 | stringArrayField round-trip symmetry ([] vs absent) |
| C-3 | embedder fingerprint refusal + EMBEDDER_MISMATCH envelope |
| C-4 | openDefaultEmbedder factory consolidation |
| C-5 | SCIP REFERENCES + TYPE_OF (position 25, append-only); emitRelations |
| C-6 | 4 package READMEs + root README cross-links |
| C-7 | .gitmodules debt closed as stale |
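The refusal summarised in row C-3 might look roughly like the sketch below. The signature of `assertEmbedderCompatible`, the `EmbedderMismatchError` class, and the `exitCodeFor` helper are assumptions made for illustration; the PR only states that the check exists, that the CLI exits 2, that MCP returns an `EMBEDDER_MISMATCH` envelope, and that a force flag bypasses the check. Treating a store with no recorded `embedder_model_id` as compatible is also an assumption.

```typescript
// Hypothetical error type carrying the EMBEDDER_MISMATCH code.
class EmbedderMismatchError extends Error {
  readonly code = "EMBEDDER_MISMATCH";
  readonly stored: string;
  readonly active: string;

  constructor(stored: string, active: string) {
    super(`store was embedded with "${stored}", active embedder is "${active}"`);
    this.name = "EmbedderMismatchError";
    this.stored = stored;
    this.active = active;
  }
}

// Throws when the store's recorded model id differs from the active one,
// unless the caller explicitly forces the mismatch.
function assertEmbedderCompatible(
  storedModelId: string | undefined,
  activeModelId: string,
  options: { force?: boolean } = {},
): void {
  // Assumption: a store written before the column existed has no id and passes.
  if (storedModelId === undefined) return;
  if (storedModelId === activeModelId || options.force === true) return;
  throw new EmbedderMismatchError(storedModelId, activeModelId);
}

// CLI-style caller: maps the refusal to exit code 2.
function exitCodeFor(
  storedModelId: string | undefined,
  activeModelId: string,
  force: boolean,
): number {
  try {
    assertEmbedderCompatible(storedModelId, activeModelId, { force });
    return 0;
  } catch (err) {
    if (err instanceof EmbedderMismatchError) return 2;
    throw err;
  }
}
```

An MCP-side caller would catch the same error and return its `code` in a response envelope instead of exiting.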
### Validation
- **`mise run check` exits 0.**
- `pnpm -r exec tsc --noEmit` clean.
- `bash scripts/check-banned-strings.sh` PASS.
- 244/244 storage tests + 1 skip (lbug binding absent on dev box).
- 80/80 embedder tests (was 71; +9 new).
- 607/607 ingestion tests (parse-cache eviction +12 new).
- 58/58 scip-ingest tests; 73/73 core-types; 235/235 cli.
- graphHash byte-identity holds: cross-adapter parity green for
`medium-with-empty-keywords` and the 25-edge-kind sweep on the DuckDB
leg; GraphDb leg skip-clean as expected without `@ladybugdb/core`
binding on dev box.
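The `[]` versus absent distinction that the `medium-with-empty-keywords` fixture guards can be illustrated with a small round-trip sketch. The function bodies and `sketchHash` are assumptions, not the project's `graphHash` or adapter code; only the names `stringArrayOrNull` and `stringArrayField` and the removal of the `length > 0` coalescing come from the PR.

```typescript
import { createHash } from "node:crypto";

// Writer: absent becomes NULL, an empty array stays an empty array.
// (The pre-fix behaviour coalesced `[]` to NULL via a `length > 0` check.)
function stringArrayOrNull(value: readonly string[] | undefined): string[] | null {
  return value === undefined ? null : [...value];
}

// Reader: the mirror image, so NULL becomes absent and `[]` survives.
function stringArrayField(column: string[] | null): string[] | undefined {
  return column === null ? undefined : column;
}

// Simulated store round trip for a node with an optional `keywords` field.
function roundTrip(node: { keywords?: string[] }): { keywords?: string[] } {
  const read = stringArrayField(stringArrayOrNull(node.keywords));
  return read === undefined ? {} : { keywords: read };
}

// Stand-in for a content hash (hypothetical; not the real graphHash).
function sketchHash(node: { keywords?: string[] }): string {
  const canonical = JSON.stringify(
    node.keywords === undefined ? {} : { keywords: node.keywords },
  );
  return createHash("sha256").update(canonical).digest("hex");
}
```

With symmetric writer and reader, `sketchHash(roundTrip({ keywords: [] }))` equals `sketchHash({ keywords: [] })` and differs from `sketchHash({})`, which is the property the difference assertion checks.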
### graphHash content delta (one-time)
Per ADR 0014 + spec W-A-2: the first SCIP re-index after this PR merges
produces additional REFERENCES + IMPLEMENTS + TYPE_OF edges. Expected,
documented as a v1.0 minor bump (schema-shape preserved via append-only;
only content changes). Existing OCH stores need `codehub analyze
--force` to pick up the new edges.
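The append-only rule that keeps the schema shape stable can be sketched as a prefix check. The lists here are deliberately shortened and hypothetical (the real `RELATION_TYPES` has 25 members), and `isAppendOnly` and `ordinalOf` are illustrative helpers that do not appear in the PR.

```typescript
// Shortened, hypothetical lists; the real array has 25 members.
const PREVIOUS_RELATION_TYPES = ["IMPLEMENTS", "REFERENCES"] as const;
const RELATION_TYPES = [...PREVIOUS_RELATION_TYPES, "TYPE_OF"] as const;

type RelationType = (typeof RELATION_TYPES)[number];

// A new list is append-only if the old list is an exact prefix of it,
// so every existing member keeps its position.
function isAppendOnly(previous: readonly string[], next: readonly string[]): boolean {
  return next.length >= previous.length && previous.every((name, i) => next[i] === name);
}

// 1-based position of a relation type in this sketch.
function ordinalOf(type: RelationType): number {
  return RELATION_TYPES.indexOf(type) + 1;
}
```

Because existing positions never move, stores written before the change stay readable; only a re-index adds rows that use the new member.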
### Compound lesson extracted
`.erpaval/solutions/best-practices/no-spec-coordinate-leakage-into-source.md`
— ERPAVal `AC-*` / `M-*` / `W-*` / `CL-*` prefixes belong in commits, PR
bodies, and ADR `## References` sections, NOT in JSDoc, inline comments,
CLI flag help, MCP tool descriptions, or test names. The leakage
compounds because LLM clients pick up the vocabulary and start citing it
back. Sweep `rg -n "AC-[A-Z]-[0-9]" packages/` before every PR-open.
Track A's already-merged `AC-A-*` leakage is flagged for a separate
cleanup PR (out of scope for this Track C diff to keep the review
focused).
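The `rg` sweep above could also run as a unit-testable check. The sketch below applies the same `AC-[A-Z]-[0-9]` pattern to a source string; `findSpecCoordinates` and the `Leak` type are hypothetical names, and wiring the function to real files under `packages/` is left out.

```typescript
// One flagged line: 1-based line number plus the trimmed text.
interface Leak {
  line: number;
  text: string;
}

// Same pattern as the rg sweep; no `g` flag, so `test` stays stateless.
const SPEC_COORDINATE = /AC-[A-Z]-[0-9]/;

// Returns every line of `source` that mentions a spec coordinate.
function findSpecCoordinates(source: string): Leak[] {
  return source
    .split("\n")
    .flatMap((text, index) =>
      SPEC_COORDINATE.test(text) ? [{ line: index + 1, text: text.trim() }] : [],
    );
}
```

As written, the pattern covers only the `AC-*` prefix; the `M-*`, `W-*`, and `CL-*` prefixes named above would need their own patterns.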
### Out of scope (queued for follow-on PRs)
- Track D — dogfood polish (semgrep.yml, osv.yml split,
och-self-scan.yml, code-pack release asset, lefthook polish, mise
och:self-* tasks)
- chore(repo): scrub Track-A `AC-A-*` spec coordinates from production
source (mechanical sweep, separate session)
🤖 Squashed via
[bonk-ai](https://github.com/theagenticguy/ai-gateway/blob/main/scripts/bot-push.py).
Co-authored-by: bonk-ai[bot] <269762587+bonk-ai[bot]@users.noreply.github.com>