2 changes: 2 additions & 0 deletions .erpaval/INDEX.md
Original file line number Diff line number Diff line change
@@ -44,6 +44,8 @@ development sessions. Solutions are reusable; specs are per-feature.
- [Smoke-testing a workspace cli requires packing every publishable workspace dep](solutions/best-practices/workspace-tarball-pack-all-publishables.md) — `npm install -g <cli.tgz>` falls back to registry for un-packed transitive workspace deps, dragging in the previously-published versions and masking install-graph regressions. Pack everything publishable, every time.
- [GitHub Actions top-level permissions cap every job](solutions/conventions/workflow-call-permissions-ceiling.md) — workflow's top-level `permissions:` is a ceiling; per-job blocks can narrow but not grant. `id-token: write` declared at job level silently no-ops if missing from top level. Diagnostic: read the "GITHUB_TOKEN Permissions" group in the failing job's log.

- [lbug COPY FROM (subquery) bulk-load pattern](solutions/conventions/lbug-copy-from-subquery-bulk-load.md) — type-safe bulk inserts via COPY subquery; sentinel row, STRING[] never null + sentinel STRING[] never empty (LIST(ANY) trap), maxDBSize cap (8 TiB default exhausts VA), readOnly cannot run CREATE_FTS_INDEX, src/dst not from/to, eid not id alias.

## Specs

- [001-scip-replaces-lsp](specs/001-scip-replaces-lsp/spec.md) — rip-and-replace LSP with SCIP for TS/Py/Go/Rust/Java. Task map: [tasks.md](specs/001-scip-replaces-lsp/tasks.md).
102 changes: 102 additions & 0 deletions .erpaval/solutions/conventions/lbug-copy-from-subquery-bulk-load.md
@@ -0,0 +1,102 @@
---
name: lbug-copy-from-subquery-bulk-load
description: lbug v0.16.1 COPY FROM (subquery) pattern for type-safe bulk node+edge inserts — avoids INT64/DOUBLE confusion, sentinel pattern, and from/to keyword collision
metadata:
type: project
---

## Pattern: `COPY <Table> FROM (UNWIND $rows AS r ... RETURN ...)`

lbug v0.16.1's `UNWIND` + `CREATE/MERGE` path infers struct-field types per row
from JS values (`Number.isInteger(v) → INT64`, else `DOUBLE`). Any edge whose
`confidence` is `1.0` is therefore inferred as INT64, lands as an INT64 bit
pattern in a DOUBLE column, and round-trips as garbage.

**Fix**: `COPY <Table> FROM (UNWIND $rows AS r WITH r WHERE r.id <> '<SENTINEL>' RETURN ...)`.
COPY reads column types from the DDL; CAST in the RETURN clause converts string-encoded
numerics to the correct type; per-row inference never runs.

### Node inserts

```cypher
COPY CodeNode FROM (UNWIND $rows AS r
WITH r WHERE r.id <> '__SENTINEL__'
RETURN r.id, r.kind, ..., CAST(r.start_line AS INT32), ..., CAST(r.cohesion AS DOUBLE), ...)
```

Row encoding rules:
- INT32 columns: pass value as `String(Math.trunc(v))` or `null`
- DOUBLE columns: pass value as `String(v)` or `null`
- BOOL columns: pass `true`/`false`/`null`
- STRING[] columns: **never pass `null`** — pass `[]` for absent arrays, or lbug infers
LIST(ANY) and throws "Trying to create a vector with ANY type"
- STRING columns: pass string value or `null`
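
Under these rules, a row encoder might look like the sketch below. The column names (`start_line`, `cohesion`, `exported`, `keywords`) are illustrative, not the real CodeNode schema; only the per-column-type encoding follows the rules above.

```typescript
// Illustrative node-row encoder for the encoding rules above.
interface RawNode {
  id: string;
  startLine?: number | null;
  cohesion?: number | null;
  exported?: boolean | null;
  keywords?: string[] | null;
}

function encodeNodeRow(n: RawNode) {
  return {
    id: n.id,
    // INT32 column: string-encoded integer, or null when absent
    start_line: n.startLine == null ? null : String(Math.trunc(n.startLine)),
    // DOUBLE column: string-encoded numeric, or null when absent
    cohesion: n.cohesion == null ? null : String(n.cohesion),
    // BOOL column: pass the boolean (or null) straight through
    exported: n.exported ?? null,
    // STRING[] column: never null; empty array stands in for "absent"
    keywords: n.keywords ?? [],
  };
}
```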

Sentinel row requirement: prepend a row with `id = SENTINEL_ID` and concrete typed values
for every column (strings for numerics, `false` for bools, `[]` for string arrays, `""` for
strings). The `WITH r WHERE r.id <> SENTINEL_ID` filters it before any storage write.
Purpose: seeds struct-field type inference for lbug's binder so all-null batches don't fail.
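
Continuing the hypothetical columns from the encoder sketch, prepending the sentinel could look like:

```typescript
// Prepend a sentinel row carrying a concrete typed value in every column,
// so lbug's binder sees a non-null sample even for an all-null batch.
// Column names are hypothetical.
const SENTINEL_ID = "__SENTINEL__";

function withSentinel(dataRows: Record<string, unknown>[]) {
  const sentinel = {
    id: SENTINEL_ID,
    start_line: "0",        // string-encoded INT32 seed
    cohesion: "0",          // string-encoded DOUBLE seed
    exported: false,        // BOOL seed
    keywords: ["__seed__"], // STRING[] seed; non-empty on purpose
  };
  return [sentinel, ...dataRows];
}
```

The `WITH r WHERE r.id <> '__SENTINEL__'` clause in the COPY subquery drops this row before anything reaches storage.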

### Edge inserts

```cypher
COPY DEFINES(id, confidence, reason, step) FROM (UNWIND $rows AS r
WITH r WHERE r.eid <> '__EDGE_SENTINEL__'
RETURN r.src, r.dst, r.eid, CAST(r.confidence AS DOUBLE), r.reason, CAST(r.step AS INT32))
```

Critical rules:
- Use `src`/`dst` (not `from`/`to`) as struct field names — `from` and `to` are Cypher
reserved keywords; lbug silently misinterprets `r.from`/`r.to` in a RETURN clause
- Use `eid` (not `id`) for the edge id field in the row struct — lbug matches column name
`id` against the CodeNode primary key and tries to do a node lookup instead of treating
it as a rel property. The `r.eid` alias maps to the `id` rel property via the positional column list
- The `COPY E(id, confidence, reason, step)` column list is required to bind positional
RETURN columns to rel properties correctly
- Sentinel's `src`/`dst` can be empty strings `""` — filtered by `WHERE r.eid <> SENTINEL`
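
An edge-row encoder following these naming rules might look like this; the `RawEdge` input shape is an assumption, but the `src`/`dst`/`eid` renames are the point.

```typescript
// Edge-row encoder applying the keyword-collision rules above.
interface RawEdge {
  from: string;
  to: string;
  id: string;
  confidence: number;
  reason?: string | null;
  step: number;
}

function encodeEdgeRow(e: RawEdge) {
  return {
    src: e.from,                      // not `from`: reserved Cypher keyword
    dst: e.to,                        // not `to`: reserved Cypher keyword
    eid: e.id,                        // not `id`: collides with the node PK
    confidence: String(e.confidence), // CAST(... AS DOUBLE) in the RETURN
    reason: e.reason ?? null,
    step: String(e.step),             // CAST(... AS INT32) in the RETURN
  };
}
```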

### Compatibility notes

- `COPY FROM (subquery)` requires the subquery to have a RETURN clause
- `WITH r WHERE` inside UNWIND is valid Cypher inside a COPY subquery
- `IGNORE_ERRORS=true` silently drops rows where FROM/TO node lookup fails — avoid using
it as a crutch; fix the sentinel instead so the sentinel itself is filtered before lookup
- The mmap "Buffer manager exception: Mmap for size 8796093022208 failed" is virtual
address-space exhaustion. lbug's default `maxDBSize` is `1 << 43` = 8 TiB per
`Database`; on 64-bit Linux a process has ~128 TiB of user VA, so ~16 concurrent DBs
exhaust the address space (kernel `MAP_FAILED`). Fix: pass an explicit
`maxDBSize` (5th `Database` ctor arg, MUST be a power of 2) — 16 GiB is plenty
for OCH-scale graphs and drops virtual reserve 512×. Also pass `bufferManagerSize`
(2nd arg) — native default is `min(systemMem, maxDBSize) * 0.8`, often >50 GiB;
cap at 2 GiB for headroom across concurrent test pools without surfacing the
sibling error "Buffer manager exception: Unable to allocate memory! The buffer
pool is full and no memory could be freed!" Cite: kuzudb/kuzu#1826.
- `Database.close()` is what triggers `~VMRegion` → `munmap()`; relying on JS GC
alone leaks the mapping between tests. Always `await db.close()` before opening
the next instance pointing at a different path.
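
Assembling the constructor args described above could be sketched as follows. The notes state `bufferManagerSize` is the 2nd arg and `maxDBSize` the 5th; the middle two positions (compression, readOnly) are assumptions here, so verify against the installed `@ladybugdb/core` typings before copying.

```typescript
const GIB = 1024 ** 3;

// Build positional ctor args that cap the virtual reserve and buffer pool.
function lbugCtorArgs(dbPath: string) {
  const bufferManagerSize = 2 * GIB; // cap below the min(RAM, maxDBSize)*0.8 default
  const maxDBSize = 16 * GIB;        // shrinks the 8 TiB virtual reserve 512x
  if (!Number.isInteger(Math.log2(maxDBSize))) {
    throw new Error("maxDBSize must be a power of 2");
  }
  return [dbPath, bufferManagerSize, true, false, maxDBSize] as const;
}
// usage sketch: const db = new Database(...lbugCtorArgs(path));
// ...and always `await db.close()` so the mapping is munmap'd.
```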

### "Trying to create a vector with ANY type" — sentinel STRING[] must be non-empty

The lesson above says STRING[] columns must never be `null` and should be `[]` for
absent arrays. That handles per-row binding, but the **sentinel row's** STRING[]
fields must additionally be **non-empty** (e.g. `["__sentinel__"]`) — lbug's
struct-field type inference looks at the FIRST row's array contents to fix the
LIST element type. An empty-array sentinel (`[]`) yields `LIST(ANY)`, and any
later data row with a string then throws "Trying to create a vector with ANY type".
The seed value never reaches storage because `WITH r WHERE r.id <> SENTINEL`
filters the row before COPY. Reproduces with a 3-column table and 3 rows:
sentinel `kw=[]`, n1 `kw=[]`, n2 `kw=["a","b"]` → fails. Switching the
sentinel to `kw=["__seed__"]` fixes it.
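
The repro as row payloads, assuming a hypothetical 3-column table `T(id STRING, name STRING, kw STRING[])`:

```typescript
// Failing shape: the first (sentinel) row's empty array pins kw to LIST(ANY).
const failing = [
  { id: "__SENTINEL__", name: "", kw: [] as string[] }, // seeds LIST(ANY)
  { id: "n1", name: "a", kw: [] as string[] },
  { id: "n2", name: "b", kw: ["a", "b"] },              // throws on this row
];

// Fixed shape: a non-empty seed pins kw to LIST(STRING).
const fixed = [
  { id: "__SENTINEL__", name: "", kw: ["__seed__"] },
  { id: "n1", name: "a", kw: [] as string[] },
  { id: "n2", name: "b", kw: ["a", "b"] },
];
```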

### Read-only opens cannot run `CALL CREATE_FTS_INDEX` / `CALL CREATE_VECTOR_INDEX`

lbug rejects writes against a Database opened with `readOnly=true`, including the
`CALL CREATE_FTS_INDEX(...)` and `CALL CREATE_VECTOR_INDEX(...)` admin procedures
that adapters use to ensure search-side indexes exist. This surfaces as "Cannot
execute write operations in a read-only database!" the moment a reader calls
`search()` or `vectorSearch()`. Fix: build both indexes at the end of `bulkLoad`
(when the connection is read-write) and have the lazy-ensure helpers no-op in
readOnly mode. Readers query the existing index; the index already exists
because every write path runs through bulkLoad.

**Why:** [[scip-replaces-lsp]] — same pattern of lbug's binding layer having non-obvious
per-row type inference that requires careful workarounds.
32 changes: 19 additions & 13 deletions CLAUDE.md
@@ -80,19 +80,25 @@ This repo ships a Claude Code plugin at `plugins/opencodehub/` — it
provides a `code-analyst` subagent and 10 skills. Install via
`codehub init` (writes `.mcp.json` + links the plugin).

## Storage backend — graph-default

`CODEHUB_STORE` is unset by default. OpenCodeHub probes
`@ladybugdb/core` and uses the graph-database backend when the binding
is available; otherwise it falls back to DuckDB with a one-shot stderr
advisory (gated on TTY or `OCH_VERBOSE=1`). Set `CODEHUB_STORE=duck` to
force the legacy layout (single DuckDB file backs both graph + temporal
views) or `CODEHUB_STORE=lbug` to require the graph-database backend.

When both `graph.duckdb` and `graph.lbug` exist as siblings in the same
`<repo>/.codehub/`, the newer-mtime file wins. See ADR 0013
(`docs/adr/0013-m7-default-flip-and-abstraction.md`) for the rationale
and the AGE/Memgraph/Neo4j/Neptune community-adapter escape hatch.
## Storage backend — lbug graph + DuckDB temporal

The graph tier is always `@ladybugdb/core` (`graph.lbug`); the temporal
tier — cochanges, structured symbol summaries, and the
`codehub query --sql` escape hatch — is always DuckDB
(`temporal.duckdb`). Both files live under `<repo>/.codehub/`. There is
no env-var, no probe, no fallback; if the lbug binding fails to load,
`open()` throws `GraphDbBindingError` and the operation aborts. See
ADR 0016 (`docs/adr/0016-duckdb-graph-rip.md`) for the rationale and the
AGE/Memgraph/Neo4j/Neptune community-adapter contract that survives the
rip-out (the segregated `IGraphStore` / `ITemporalStore` interfaces stay
exactly because community-fork adapters are a deliberate escape hatch).

`IGraphStore` lives only on `GraphDbStore`; `DuckDbStore` implements
`ITemporalStore` only. Embeddings live in `graph.lbug` and stream into a
per-call DuckDB temp table at pack time so the byte-identical Parquet
sidecar still works (see `packages/pack/src/embeddings-sidecar.ts`).
Future temporal swap (e.g. SQLite-WASM) only needs a new `ITemporalStore`
implementor — no graph-tier change.

## Parse runtime — WASM-only, vendored grammars

7 changes: 6 additions & 1 deletion docs/adr/0011-graph-db-backend.md
@@ -1,6 +1,11 @@
# ADR 0011 — Graph-DB backend (LadybugDB phase-1)

- Status: **Accepted** — 2026-05-05 (Proposed) → flipped on the M3 merge.
- Status: **Partially superseded** by [ADR 0016](./0016-duckdb-graph-rip.md)
on 2026-05-16. The "DuckDB-default plus LadybugDB opt-in" framing is
obsolete; lbug is the unconditional graph backend after the rip. The
LadybugDB integration shape and `IGraphStore` design introduced here
are unchanged.
- Was: **Accepted** on 2026-05-05 and flipped on the M3 merge.
- Authors: Laith Al-Saadoon + Claude.
- Branch: `feat/v1-m3-m4`.
- Supersedes nothing. Interacts with ADR 0001 (DuckDB backend stays the
8 changes: 7 additions & 1 deletion docs/adr/0013-m7-default-flip-and-abstraction.md
@@ -5,7 +5,13 @@
> in-tree because they were authored in parallel branches and accepted
> on the same release. The next ADR uses 0014.

- Status: **Accepted** — 2026-05-09 (Proposed) → flipped on the
- Status: **Superseded** by [ADR 0016](./0016-duckdb-graph-rip.md)
on 2026-05-16. The auto-probe, dual-artifact arbitration, and
`CODEHUB_STORE` resolver introduced here are gone. lbug is the only
graph backend; DuckDB serves the temporal tier. The
IGraphStore/ITemporalStore segregation survives because community
adapters (AGE, Memgraph, Neo4j, Neptune) target it.
- Was: **Accepted** on 2026-05-09 and flipped on the
`feat/v1-finalize-track-a` merge (PR #71).
- Authors: Laith Al-Saadoon + Claude.
- Branch: `feat/v1-finalize-track-a`.
146 changes: 146 additions & 0 deletions docs/adr/0016-duckdb-graph-rip.md
@@ -0,0 +1,146 @@
# ADR 0016 — Rip out the DuckDB graph backend; lbug-only graph, DuckDB temporal-only

- Status: **Accepted** — 2026-05-16.
- Authors: Laith Al-Saadoon + Claude.
- Branch: `feat/duckdb-graph-rip`.
- Supersedes: [ADR 0013 — M7 default flip and storage abstraction](./0013-m7-default-flip-and-abstraction.md)
in its entirety; partially supersedes [ADR 0011 — graph-db backend](./0011-graph-db-backend.md)
(the "DuckDB-as-graph default" passages).

## Context

ADR 0011 introduced `@ladybugdb/core` (lbug) as a second `IGraphStore`
backend behind a `CODEHUB_STORE` env-var selector. ADR 0013 flipped the
default to graph-default with auto-probe-and-fallback semantics: when
`CODEHUB_STORE` was unset, the resolver imported `@ladybugdb/core` and
preferred lbug on success, otherwise fell back to DuckDB-as-graph. A
dual-artifact detector picked the newer-mtime file when both
`graph.duckdb` and `graph.lbug` existed in `<repo>/.codehub/`. The
DuckDB graph adapter therefore lived as a permanent fallback path,
maintained alongside the lbug adapter.

Two things changed at the start of the 2026-05 dogfood cycle:

1. **lbug bulk-load became feature-complete.** A separate session landed
the `COPY <Table> FROM (UNWIND $rows ...)` pattern that DuckDB
already had — type-safe ingestion of nodes and edges through lbug's
bulk path. After that, every `IGraphStore` surface — bulk-load, all
15 typed finders, BM25 search, HNSW vector search, traversals,
embeddings — runs on lbug; the v1.0 conformance suite passes against
lbug; the cross-adapter parity tests existed only to keep DuckDB
honest.
2. **The dual-write code carried real cost.** ~1900 LOC of graph-tier
code in `duckdb-adapter.ts`, the ~3500-LOC parity test suite, the
resolver/probe/dual-artifact apparatus, the env-var, the docs that
tried to keep `codehub-graph` as a backend axis. Every `analyze`
path took two branches and every architectural claim ("storage is
pluggable") had to defend a backend that nobody set explicitly.

The user's framing was "rip out the DuckDB fallback for graph store …
keep the generic / abstractions but I don't want all this code for
duckdb unless it's the temporal/genuine tabular type stuff. and in fact
maybe even that should be sqlite wasm or something."

## Decision

**`IGraphStore` lives only on `GraphDbStore`; `DuckDbStore` implements
`ITemporalStore` only.** The interface segregation introduced in
session-33f24f (see
[`solutions/architecture-patterns/igraphstore-itemporalstore-segregation.md`](../../.erpaval/solutions/architecture-patterns/igraphstore-itemporalstore-segregation.md))
was anticipating exactly this split — community AGE / Memgraph / Neo4j /
Neptune adapters target `IGraphStore` only and pair with the
DuckDB-backed `ITemporalStore`. After this rip-out, that's also the
in-tree shape: lbug owns `IGraphStore`, DuckDB owns `ITemporalStore`,
and the in-tree adapters stop demonstrating the
structural-typing-via-`implements both` case.

Concrete shape after the rip:

- `openStore({path})` always returns `{graph: GraphDbStore, temporal:
DuckDbStore, graphFile, temporalFile, close}`. No `backend` field on
the result envelope; no `backend?` option on the input.
- The graph artifact is `<dir>/graph.lbug`. The temporal artifact is
`<dir>/temporal.duckdb`. `paths.describeArtifacts()` takes no arguments
and returns `{graphFile: "graph.lbug", temporalFile: "temporal.duckdb"}`.
- `resolveDbPath` is renamed `resolveGraphPath` and returns the lbug
filename.
- `CODEHUB_STORE` is gone. The env var is no longer consulted anywhere
in storage. The resolver, the dynamic-import probe of `@ladybugdb/core`,
the dual-artifact mtime arbitration, the `_lbugFallbackWarned` /
`_dualArtifactWarned` advisory state, and the
`_resetStoreResolverCache` test escape hatch are all deleted.
- The MCP `sql` tool's `cypher` field becomes unconditionally available;
it routes to `store.graph.execCypher(...)`.
- Embeddings live in `graph.lbug`. The pack embeddings sidecar streams
rows from `store.graph.listEmbeddings()` into a per-call DuckDB
`CREATE TEMP TABLE embeddings_export` on `temporal.duckdb`, then runs
the existing deterministic
`COPY (... ORDER BY ...) TO '<path>' (FORMAT PARQUET, COMPRESSION ZSTD)`,
then drops the temp table. The byte-identity contract for
`embeddings.parquet` is preserved.
- The conformance suite at `packages/storage/src/test-utils/conformance.ts`
(`assertIGraphStoreConformance(name, factory)`) and the parity-harness
rebuilder at `packages/storage/src/test-utils/parity-harness.ts` stay.
They are the v1.0 contract for community adapters; deleting them
would contradict the segregation ADR's promise.
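
A minimal sketch of the renamed path helpers, assuming module-local implementations that match the bullets above:

```typescript
import { join } from "node:path";

// Post-rip artifact names; describeArtifacts() takes no arguments.
function describeArtifacts() {
  return { graphFile: "graph.lbug", temporalFile: "temporal.duckdb" } as const;
}

// resolveDbPath renamed to resolveGraphPath; returns the lbug artifact
// path under <repo>/.codehub/.
function resolveGraphPath(repoPath: string): string {
  return join(repoPath, ".codehub", describeArtifacts().graphFile);
}
```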

## Backwards compatibility

None. Existing `<repo>/.codehub/graph.duckdb` files are no longer read.
Users re-run `codehub analyze` to write `graph.lbug` from scratch.
There is no stale-artifact warning, no legacy alias, no kill-switch.
This is a single-user dogfood repo today; the cost of a hard cutover is
"re-run analyze once."

## Operational impact

- **Platform reach narrows to lbug's 5 prebuilt targets**:
`darwin-arm64`, `darwin-x64`, `linux-arm64`, `linux-x64`, `win32-x64`.
Alpine/musl and 32-bit Linux ARM users need a source build via
`cmake-js`. `codehub doctor` now hard-fails on missing binding (was
warn-and-continue in the auto-probe era).
- **lbug's 8 TiB virtual mmap per Database can exhaust the 47-bit user
virtual address space on 64-bit Linux** when many test pools open
concurrently — surfaces as `Buffer manager exception: Mmap for size
8796093022208 failed`. Wave 1 chased this in detail and confirmed
with a probe: lbug's `maxDBSize` defaults to `1 << 43` and the
request is reserved at `Database` construction (not lazy).
`bufferManagerSize` defaults to `min(systemMem, maxDBSize) * 0.8`.
The pool now passes both as explicit constructor args
(16 GiB `maxDbBytes`, 2 GiB `bufferManagerBytes`) so concurrent
Databases do not exhaust VA. See
[`solutions/conventions/lbug-copy-from-subquery-bulk-load.md`](../../.erpaval/solutions/conventions/lbug-copy-from-subquery-bulk-load.md)
for the citations (kuzudb/kuzu#1826, the upstream Database constructor,
`BufferPoolConstants::DEFAULT_VM_REGION_MAX_SIZE`).
- **Sentinel STRING[] columns must be non-empty in lbug bulk-load.**
An empty-array sentinel (`[]`) makes lbug's struct-field type
inference resolve to `LIST(ANY)`, and any later data row with a
string then throws "Trying to create a vector with ANY type". The
fix is `["__sentinel__"]`; the seed value is filtered before COPY by
the existing `WITH r WHERE r.id <> SENTINEL`.
- **lbug rejects writes against a Database opened with `readOnly=true`,
including `CALL CREATE_FTS_INDEX(...)` and `CALL CREATE_VECTOR_INDEX(...)`.**
These are now built at the end of `bulkLoad` (write phase). The
`ensureFtsIndex` / `ensureVectorIndex` lazy helpers no-op in
readOnly mode; readers query the existing index built by the most
recent write.

## Future work

The user's "maybe sqlite-wasm for temporal" comment is captured here as
forward-work: replacing `DuckDbStore` with a JS-only `ITemporalStore`
implementor (e.g. `sql.js`, `wa-sqlite`) would drop the last native
binding from the temporal tier and let OCH ship as pure-JS at the
distributed-boundary. The interface contract — `exec(sql, params)`,
`bulkLoadCochanges`, `lookupCochangesForFile`, `bulkLoadSymbolSummaries`,
`exportEmbeddingsToParquet` — is small enough to port; only the
deterministic Parquet writer would need investigation (sql.js does not
ship a `COPY ... TO PARQUET` analog out of the box). Not in scope for
this ADR.

## Numbers

Net diff for this rip: ~5,800 deletions, ~150 insertions. Workspace
test count after: 1931 passing, 0 failing, 2 skipped (one
platform-gated lbug vector probe + one platform-gated embedder probe).
Storage package: 148/0/1 over three consecutive runs — no flake.
2 changes: 1 addition & 1 deletion packages/analysis/src/test-utils.ts
@@ -150,7 +150,7 @@ function sortEdges(edges: readonly FakeEdge[]): FakeEdge[] {
* code under test.
*/
export class FakeStore implements IGraphStore {
readonly dialect: GraphDialect = "none";
readonly dialect: GraphDialect = "cypher";
readonly nodes: FakeNode[] = [];
readonly edges: FakeEdge[] = [];

11 changes: 6 additions & 5 deletions packages/cli/src/commands/analyze-carry-forward.test.ts
@@ -35,7 +35,7 @@ import {
KnowledgeGraph,
type NodeId,
} from "@opencodehub/core-types";
import { DuckDbStore, resolveDbPath, resolveRepoMetaDir } from "@opencodehub/storage";
import { openStore, resolveGraphPath, resolveRepoMetaDir } from "@opencodehub/storage";
import { loadPreviousGraph } from "./analyze.js";

/**
@@ -164,11 +164,12 @@ async function seedPriorIndex(repoPath: string): Promise<{
});

await mkdir(resolveRepoMetaDir(repoPath), { recursive: true });
const store = new DuckDbStore(resolveDbPath(repoPath));
const store = await openStore({ path: resolveGraphPath(repoPath) });
try {
await store.open();
await store.createSchema();
await store.bulkLoad(graph);
await store.graph.open();
await store.temporal.open();
await store.graph.createSchema();
await store.graph.bulkLoad(graph);
} finally {
await store.close();
}