feat: v1 finalize Track C — debt sweep (7 ACs)#73

Merged: theagenticguy merged 1 commit into main from feat/v1-finalize-track-c on May 9, 2026.

bonk-ai (bot) commented on May 9, 2026

Track C — debt sweep (7 ACs)

Closes the debt-sweep leg of v1-finalize per .erpaval/specs/006-v1-finalize/.
Spec: spec.md§Track C.
ADR: 0014-scip-references-and-embedder-fingerprint.
PR-split: A → C → B → D ordering — this is leg 3 of 4.

What changes

  • AC-C-1 parse-cache LRU eviction: `evictIfOverCap(cacheDir, capBytes)` lists shards, sorts them mtime-ascending, and deletes the oldest until the total is ≤0.9×cap. Wired into `writeCacheEntry` post-write; gated on `CODEHUB_PARSE_CACHE_MAX_BYTES` (default 1 GiB; 0 disables). Eviction errors are swallowed silently, since a cache failure is never fatal. JSDoc at `content-cache.ts:133` updated to point at the new helper. +12 tests.
  • AC-C-2 stringArrayField round-trip symmetry: the `stringArrayOrNull` writer + 3 readers (`duckdb-adapter.ts:setStringArrayField`, `graphdb-adapter.ts:setStringArrayFieldGd`, `analyze.ts:stringArrayField`) now preserve `[]` distinct from `undefined`. The columns are already typed arrays (DuckDB `TEXT[]`, GraphDb `STRING[]`); the fix removes the `length > 0` coalescing. New `medium-with-empty-keywords` parity fixture + a difference assertion that proves `graphHash({keywords: []}) ≠ graphHash({})`. The DuckDB binder needed an explicit `LIST(VARCHAR)` type-hint to bind empty arrays; this was caught and fixed in the same change.
  • AC-C-3 SageMaker rebuild-on-switch refusal: `embedder_model_id TEXT` column added to `store_meta` (DuckDB) + StoreMeta NODE TABLE (GraphDb) via append-only DDL + `ALTER TABLE IF NOT EXISTS` migration. `Store.{getMeta,setMeta}` round-trip the field. `assertEmbedderCompatible` lives in `@opencodehub/embedder/fingerprint.ts`; cli `runQuery` exits 2 with the frozen remediation hint, MCP `runQuery` returns a new `EMBEDDER_MISMATCH` envelope, and both honor `--force-backend-mismatch` / `force_backend_mismatch`. The graphHash invariant is unaffected (store_meta is not part of the hash). +5 fingerprint tests.
  • AC-C-4 openDefaultEmbedder factory consolidation: new `packages/embedder/src/factory.ts` exports `openDefaultEmbedder({ allowOnnxFallback?: boolean })`. It replaces the duplicated 6-line block at `packages/cli/src/commands/query.ts:122-127` and `packages/mcp/src/tools/query.ts:453-458`. Ingestion's fuller variant (offline flag + ONNX variant + pool + canary) intentionally diverges, with a one-line comment pointing at the factory. +4 tests covering the HTTP-priority, ONNX-fallback, EmbedderNotSetupError, and ONNX-failure branches.
  • AC-C-5 SCIP REFERENCES + TYPE_OF emission: `TYPE_OF` appended at position 25 of the `RelationType` union, the `RELATION_TYPES` array, `ALL_RELATION_TYPES` (DuckDB), and `RELATION_KINDS` (GraphDb) per the append-only rule. `deriveEdges` widens to emit `REFERENCES` for non-call, non-DEF, non-IMPORT occurrences whose enclosing scope is function-like. A new `emitRelations` sibling in `scip-index.ts` consumes `derived.relations` and writes IMPLEMENTS + TYPE_OF graph edges via the same `symbolDef`-resolved caller→callee join shape (`+1` boundary translation per `scip-0-indexed-vs-graph-1-indexed.md`). The existing `incremental-determinism.test.ts` is self-consistent (it asserts cross-run hash stability, not equality with a frozen golden), so no fixture file regen is needed; the first SCIP re-index after merge produces the documented one-time content delta. The large parity fixture auto-extends from 24 → 25 edge kinds via `getAllRelationTypes()`.
  • AC-C-6 four READMEs: `packages/{cli,mcp,ingestion,scanners}/README.md` (62-80 lines each) following the `packages/policy/README.md` template (Surface / table / Design). Root README cross-links updated. The scanners README cites the 20-scanner P1+P2 breakdown post-Track-B.
  • AC-C-7 .gitmodules debt closed as stale: the file was removed when `packages/gym` moved to `opencodehub-testbed` (commit 378f79f). `.erpaval/debt.md` updated to status `CLOSED-STALE`.
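The eviction rule described for AC-C-1 can be sketched as a pure selection step. This is a minimal illustration under stated assumptions, not the code in this PR: the `Shard` shape and the name `selectEvictions` are hypothetical, and the real `evictIfOverCap(cacheDir, capBytes)` operates on the cache directory itself. The sketch also assumes that eviction only starts once the total exceeds the cap.

```typescript
// Hypothetical shard descriptor; the real helper would read these from disk.
interface Shard {
  path: string;
  sizeBytes: number;
  mtimeMs: number;
}

// Returns the shards to delete, oldest first, so that the remaining total
// is at or below 90% of the cap. A cap of 0 disables eviction.
function selectEvictions(shards: Shard[], capBytes: number): Shard[] {
  if (capBytes <= 0) return [];
  let total = shards.reduce((sum, s) => sum + s.sizeBytes, 0);
  if (total <= capBytes) return []; // assumption: nothing to do while under the cap

  const target = capBytes * 0.9;
  const oldestFirst = shards.slice().sort((a, b) => a.mtimeMs - b.mtimeMs);
  const evicted: Shard[] = [];
  for (const shard of oldestFirst) {
    if (total <= target) break;
    evicted.push(shard);
    total -= shard.sizeBytes;
  }
  return evicted;
}

const demo: Shard[] = [
  { path: "a", sizeBytes: 40, mtimeMs: 3 },
  { path: "b", sizeBytes: 40, mtimeMs: 1 },
  { path: "c", sizeBytes: 40, mtimeMs: 2 },
];
// 120 bytes against a cap of 100: only the oldest shard ("b") has to go.
console.log(selectEvictions(demo, 100).map((s) => s.path));
```

Evicting down to 0.9×cap rather than to the cap itself leaves headroom, so the next few writes do not each trigger another directory scan.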

AC summary (Track C — 7 of 7)

| AC | What |
|---|---|
| C-1 | parse-cache LRU eviction (env-gated, default 1 GiB) |
| C-2 | stringArrayField round-trip symmetry ([] vs absent) |
| C-3 | embedder fingerprint refusal + EMBEDDER_MISMATCH envelope |
| C-4 | openDefaultEmbedder factory consolidation |
| C-5 | SCIP REFERENCES + TYPE_OF (position 25, append-only); emitRelations |
| C-6 | 4 package READMEs + root README cross-links |
| C-7 | .gitmodules debt closed as stale |
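The C-3 refusal can be pictured as a small guard that runs before a query. The sketch below is illustrative only: `checkEmbedderCompatible`, `EmbedderMismatchError`, and the options shape are hypothetical names, and the actual signature of `assertEmbedderCompatible` is not shown in this PR description. It assumes that a store with no recorded `embedder_model_id` is accepted and that the force flag bypasses the check.

```typescript
// Hypothetical error type carrying the EMBEDDER_MISMATCH code named in the PR.
class EmbedderMismatchError extends Error {
  readonly code = "EMBEDDER_MISMATCH";
  readonly stored: string;
  readonly active: string;

  constructor(stored: string, active: string) {
    super(`store was embedded with "${stored}" but the active embedder is "${active}"`);
    this.name = "EmbedderMismatchError";
    this.stored = stored;
    this.active = active;
  }
}

interface CompatOptions {
  forceBackendMismatch?: boolean;
}

// Throws when the model id recorded in the store differs from the active one.
function checkEmbedderCompatible(
  stored: string | undefined,
  active: string,
  opts: CompatOptions = {},
): void {
  if (stored === undefined) return; // assumption: stores without a fingerprint pass
  if (stored === active) return;
  if (opts.forceBackendMismatch) return; // explicit opt-out, mirroring the force flag
  throw new EmbedderMismatchError(stored, active);
}

try {
  checkEmbedderCompatible("model-a", "model-b");
} catch (err) {
  console.log((err as Error).message);
}
```

A CLI caller could map the thrown error to exit code 2, while an MCP tool could turn the same error into a structured envelope; both paths described in the PR share one check.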

Validation

  • mise run check exits 0.
  • pnpm -r exec tsc --noEmit clean.
  • bash scripts/check-banned-strings.sh PASS.
  • 244/244 storage tests + 1 skip (lbug binding absent on dev box).
  • 80/80 embedder tests (was 71; +9 new).
  • 607/607 ingestion tests (parse-cache eviction +12 new).
  • 58/58 scip-ingest tests; 73/73 core-types; 235/235 cli.
  • graphHash byte-identity holds: cross-adapter parity green for medium-with-empty-keywords and the 25-edge-kind sweep on the DuckDb leg; GraphDb leg skip-clean as expected without @ladybugdb/core binding on dev box.
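Why the `medium-with-empty-keywords` fixture matters can be shown with a toy round trip. This is a sketch under assumptions: the function names are hypothetical, `null` stands in for whatever the store writes for an absent field, and `JSON.stringify` stands in for the real `graphHash`.

```typescript
type NodeProps = { keywords?: string[] };

// Behaviour before the fix, as described in the PR: empty arrays were
// coalesced away by a `length > 0` check on write.
function writeCoalescing(props: NodeProps): string[] | null {
  return props.keywords !== undefined && props.keywords.length > 0 ? props.keywords : null;
}

// Behaviour after the fix: only a missing field becomes null; [] survives.
function writePreserving(props: NodeProps): string[] | null {
  return props.keywords === undefined ? null : props.keywords;
}

// Reader shared by both variants.
function read(stored: string[] | null): NodeProps {
  return stored === null ? {} : { keywords: stored };
}

// Stand-in for a content hash: a canonical string of the value.
function fingerprint(props: NodeProps): string {
  return JSON.stringify(props);
}

const empty: NodeProps = { keywords: [] };
const absent: NodeProps = {};

// With coalescing, both inputs read back as {} and collide.
console.log(fingerprint(read(writeCoalescing(empty))) === fingerprint(read(writeCoalescing(absent))));
// With the symmetric writer, the two inputs stay distinct.
console.log(fingerprint(read(writePreserving(empty))) === fingerprint(read(writePreserving(absent))));
```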

graphHash content delta (one-time)

Per ADR 0014 + spec W-A-2: the first SCIP re-index after this PR merges produces additional REFERENCES + IMPLEMENTS + TYPE_OF edges. This is expected and is documented as a v1.0 minor bump: the schema shape is preserved via append-only, and only content changes. Existing OCH stores need codehub analyze --force to pick up the new edges.
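The append-only rule that keeps the schema shape stable can be expressed as a prefix check. The sketch is hypothetical: `isAppendOnly` is not a function from this PR, and the short list below is an illustrative subset in an assumed order, not the real 24-entry relation list.

```typescript
// True when `next` keeps every existing entry at its original position
// and only adds new entries at the end.
function isAppendOnly(previous: readonly string[], next: readonly string[]): boolean {
  if (next.length < previous.length) return false;
  return previous.every((kind, index) => next[index] === kind);
}

// Illustrative subset only; the order here is an assumption.
const before = ["REFERENCES", "IMPLEMENTS"];
const appended = [...before, "TYPE_OF"];
const inserted = ["TYPE_OF", ...before];

console.log(isAppendOnly(before, appended)); // existing positions preserved
console.log(isAppendOnly(before, inserted)); // existing positions shifted
```

Because existing positions never move, data written under the old list still decodes the same way; only newly emitted edges use the new kind.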

Compound lesson extracted

.erpaval/solutions/best-practices/no-spec-coordinate-leakage-into-source.md — ERPAVal AC-* / M-* / W-* / CL-* prefixes belong in commits, PR bodies, and ADR ## References sections, NOT in JSDoc, inline comments, CLI flag help, MCP tool descriptions, or test names. The leakage compounds because LLM clients pick up the vocabulary and start citing it back. Sweep rg -n "AC-[A-Z]-[0-9]" packages/ before every PR-open. Track A's already-merged AC-A-* leakage is flagged for a separate cleanup PR (out of scope for this Track C diff to keep the review focused).
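The pre-PR sweep can also be written as a small scanner, for example inside a test or lint step. This is a sketch, not tooling from the repository: `findSpecCoordinates` is a hypothetical name, and the pattern assumes that all four prefixes share the `AC-C-1` / `W-A-2` shape (prefix, track letter, number), which the lesson does not spell out for `M-*` and `CL-*`.

```typescript
// Assumption: every prefix follows the PREFIX-LETTER-NUMBER shape.
const SPEC_COORDINATE = /\b(?:AC|M|W|CL)-[A-Z]-[0-9]+\b/g;

interface Hit {
  line: number; // 1-indexed line number
  match: string;
}

// Scans source text and reports every spec coordinate with its line number.
function findSpecCoordinates(source: string): Hit[] {
  const hits: Hit[] = [];
  source.split("\n").forEach((text, index) => {
    const matches = text.match(SPEC_COORDINATE);
    if (matches === null) return;
    matches.forEach((match) => hits.push({ line: index + 1, match }));
  });
  return hits;
}

const sample = ["// AC-C-1: evict oldest shards", "const cap = 1;", "/* see W-A-2 */"].join("\n");
console.log(findSpecCoordinates(sample));
```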

Out of scope (queued for follow-on PRs)

  • Track D — dogfood polish (semgrep.yml, osv.yml split, och-self-scan.yml, code-pack release asset, lefthook polish, mise och:self-* tasks)
  • chore(repo): scrub Track-A AC-A-* spec coordinates from production source (mechanical sweep, separate session)

🤖 Squashed via bonk-ai.

@theagenticguy theagenticguy merged commit 077440c into main May 9, 2026
17 checks passed
@theagenticguy theagenticguy deleted the feat/v1-finalize-track-c branch May 9, 2026 21:18
@github-actions github-actions Bot mentioned this pull request May 9, 2026
theagenticguy pushed a commit that referenced this pull request May 10, 2026
🤖 Squashed via
[bonk-ai](https://github.com/theagenticguy/ai-gateway/blob/main/scripts/bot-push.py).

Co-authored-by: bonk-ai[bot] <269762587+bonk-ai[bot]@users.noreply.github.com>
@github-actions github-actions Bot mentioned this pull request May 11, 2026