fix(storage): ingest-sarif upsert no longer clobbers a finding's host node into a placeholder#167
Merged
Conversation
… node into a placeholder
A function/method/class that contains a scanner finding silently vanished
from the graph, replaced by a <placeholder> Route node. Cross-module
context/impact then treated the host as an unreferenced island — the
field-report Issue 1 symptom (get_bedrock_client orphaned, blast-radius
broken across module boundaries).
Root cause (found by execution, not inspection — two prior hypotheses, a
SCIP alias mismatch and an lbug COPY string-coercion drop, were both
refuted by repro):
codehub analyze does a replace-mode bulkLoad of all real nodes (the host
Function persists fine). Then scan + ingest-sarif does a SECOND bulkLoad
in UPSERT mode with a graph holding ONLY Finding nodes + FOUND_IN edges.
A finding inside get_bedrock_client adds Finding -> Function:...get_bedrock_client.
That Function id is absent from the upsert batch, so
synthesizePlaceholderNodes minted a kind:Route placeholder for it, and
upsert mergeNodes (DETACH DELETE + re-insert) DESTROYED the real Function.
Confirmed on disk: id Function:...get_bedrock_client stored kind=Route,
name=<the id>, lines=null. The decorator and src-layout were red herrings;
the discriminator is 'host of a scanner finding', so this corrupted every
indexed repo, not just this one.
Fix: synthesizePlaceholderNodes takes an alreadyPersisted id set. In upsert
mode #bulkLoadOnce computes it via filterExistingNodeIds(edgeEndpointIds(edges))
(reuses listNodes({ids})) and skips synthesis for endpoints that already
exist in the store. Genuine orphans (unresolved FETCHES) still get a
placeholder so the edge COPY's PK constraint holds.
Verified end-to-end: after re-analyze, context get_bedrock_client shows 11
inbound (incl mcp_server.nova_web_grounding) + 5 outbound; impact tiers it
CRITICAL crossing the module boundary. Two regression tests in
graphdb-roundtrip.test.ts: (1) upsert edge to existing node keeps it a
Function, (2) genuine orphan endpoint still synthesizes. Storage 161/161
pass (1 pre-existing vector skip), tsc + biome clean.
4 tasks
theagenticguy
added a commit
that referenced
this pull request
May 29, 2026
…-module edges) (#170) ## Summary Two bugs in Python import handling left cross-module `CALLS`/`IMPORTS` edges unresolved, so a multi-file Python package read as disconnected islands in the graph. Both are part of the field-report's **Issue 1** (Bugs A + B) and are the foundation cross-module `context`/`impact` depend on. ### Bug A — multi-line parenthesized imports silently dropped `extractPyImports` was line-based. A multi-line `from m import (\n a,\n b,\n)` matched the from-regex on the **first line only** with `rest = "("` → 0 names → the whole import discarded. black/ruff wrap every long import list this way, so most real modules lost their imports entirely. **Fix:** `joinLogicalLines()` collapses physical lines into logical lines across an open paren (depth count) or a trailing backslash, before the per-line regex runs. ### Bug B — src-layout dotted absolute imports stub as `<external>` `resolveImportTarget` only handled `./`, `../`, `/` specifiers. A dotted **absolute** import (`pkg.client` → `src/pkg/client.py`) never resolved, so it was emitted as a `CodeElement:<external>` stub instead of linking the real file. **Fix:** `resolveDottedAbsoluteImport()` converts dots→slashes and probes the module at the repo root and under detected src-layout roots (`discoverSourceRoots`), gated to `namespace` import-semantics languages (Python). A dotted specifier that resolves to no in-repo file is still treated as third-party (external stub) — unchanged. ## Test plan - [x] **End-to-end on `ngs-research-agent`:** zero `<external>` stubs for `ngs_research_agent.client`; real file→file IMPORTS edges (`mcp_server.py → client.py`, `test_mcp_server.py → client.py`) now land. - [x] **Bug A regression** (`python.test.ts`): multi-line parenthesized + backslash-continued + single-line-parens import extraction. - [x] **Bug B regression** (`parse.test.ts`): a src-layout dotted import produces a file IMPORTS edge and **no** `<external>` stub. - [x] `@opencodehub/ingestion` 604/604; `tsc` + `biome` clean. ## Relationship to PR #167 #167 fixed the `ingest-sarif` node-clobber that hid the host Function node. This PR fixes the *import resolution* so cross-module edges between real nodes actually bind. Together they restore trustworthy cross-module blast-radius for Python.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Data-corruption fix. Any function, method, or class that contains a scanner finding was silently deleted from the code graph and replaced by a
<placeholder>Route node. Cross-modulecontext/impactthen treated the host symbol as an unreferenced island.This is the root cause of the field report's Issue 1 (headline):
get_bedrock_clientshowed up orphaned into<external>/<unresolved>nodes, and blast-radius didn't cross module boundaries.Root cause (found by execution, not inspection)
Two earlier hypotheses were refuted by repro before this one was confirmed:
encodeNodeColpasses all strings through identically. A 2-node and a full-584-node bulkLoad both persisted the node fine.The actual mechanism:
codehub analyzedoes a replace-modebulkLoadof all real nodes. The hostFunctionpersists correctly here (verified).scan+ingest-sarifdoes a secondbulkLoad(findingGraph, { mode: "upsert" })wherefindingGraphholds only Finding nodes + FOUND_IN edges.client.py:165sits insideget_bedrock_client, so ingest addsFinding → Function:…get_bedrock_client. That Function id is not in the upsert batch.synthesizePlaceholderNodesminted akind:Routeplaceholder for the missing endpoint, and upsertmergeNodes(DETACH DELETE+ re-insert) destroyed the real Function.Confirmed on disk: id
Function:…get_bedrock_clientstoredkind=Route,name=<the id>,lines=null— the placeholder shape. The@cachedecorator andsrc/layout were red herrings; the discriminator is "host of a scanner finding", so this corrupted every indexed repo.Fix
synthesizePlaceholderNodesnow takes an optionalalreadyPersistedid set. In upsert mode,#bulkLoadOncecomputes it viafilterExistingNodeIds(edgeEndpointIds(edges))(a thin wrapper over the existinglistNodes({ ids })finder) and skips synthesis for any edge endpoint that already exists in the store. Genuine orphans (unresolvedFETCHESplaceholders) still get synthesized so the edge COPY's PK constraint holds. Replace mode is unchanged (store was just truncated →undefined).Test plan
context get_bedrock_clientresolves the Function with 11 inbound (incl. cross-modulemcp_server.py:nova_web_grounding) + 5 outbound;impacttiers it CRITICAL, 124 reachable, crossing the module boundary. Before: orphaned, island.graphdb-roundtrip.test.ts: (1) upsert with aFOUND_INedge to an existing node keeps it aFunction(not clobbered into a Route placeholder); (2) a genuinely-orphanFETCHESendpoint still synthesizes a placeholder.@opencodehub/storage: 161/161 pass (1 pre-existing vector-backend skip).tsc --noEmitworkspace clean;biome checkclean.Follow-ups (separate PRs, from the same field report)
Issues 2–6 (vulture
.venvexclude, MCP-only→CLI wrappers,sqldocs,statusretrieval mode, doctorbandit[sarif]probe) plus the confirmed Python import-parser bugs (multi-line parenthesized imports dropped; src-layout dotted-import resolution) are queued.