chunkshop 0.8.0
Code intelligence goes wide and precise. The codeparse layer grows from 5 to
10 tree-sitter languages (adds Rust, C, C++, C#, Ruby); cross-file edges get
import-aware resolution (ambiguous name matches narrow to the module the
caller actually imports); and the code graph is hardened — an orphan-edge
bug is fixed and locked down by corpus-scale invariants validated against real
codebases (chunkshop's own Rust tree + Postgres 16.3, 250k call sites, zero
orphans). Also lands typed edge_kind, provenance tagging, and scope_chain
display metadata on the code_edges/symbol path.
Added
- code_relationships: import-aware narrowing of ambiguous cross-file edges. When a callee/base name matches symbols in more than one file, the resolver previously fanned out one CALLS/INHERITS/IMPLEMENTS edge per candidate (resolution='ambiguous_name'). It now consults the caller file's imports (already parsed but previously discarded) and, when exactly one candidate's module is imported, emits a single precise edge tagged resolution='import_resolved' at the unique-match confidence band. The narrowing is conservative and language-agnostic: it matches a candidate file's stem against the caller's import tokens (works for Python from a import x, Rust use crate::a::x, C #include "a.h", etc.); zero or multiple import-supported candidates keep the existing fan-out, so no edge is ever dropped. provenance stays 'heuristic' (an import-narrowed edge is a stronger heuristic, not AST/SCIP truth); consumers that want to rank it higher key on evidence.resolution. This is the Python-path read of #42; SCIP/stack-graphs resolution remains a Rust follow-up.
- codeparse: tree-sitter extractors for Rust, C, C++, C#, and Ruby. The [code] extra now ships tree-sitter-rust, tree-sitter-c, tree-sitter-cpp, tree-sitter-c-sharp, and tree-sitter-ruby alongside the existing Python/Java/Go/TS/JS grammars, taking symbol_aware chunking + code_relationships cross-file edges from 5 to 10 real-parser languages. Rust groups methods under their impl/trait type (struct/enum → class, trait → interface); C emits functions + structs; C++ adds inline + out-of-line (Class::method) methods and namespaces; C# maps class/interface/method_declaration and resolves calls via invocation_expression; Ruby maps def/class/module with best-effort call-node call detection. Each extractor attributes calls to the outermost emitted symbol (no orphan edges) and parses lazily; the base install is unaffected, and regex_fallback.py remains the safety net when the [code] extra is absent. A parametrized invariant test enforces no-orphan-callers + in-bounds spans across all five.
- code_relationships extractor: typed edge_kind column on code_edges (CS-2). The PG code_edges table now carries a typed, codegraph-aligned edge_kind column (12-value CHECK constraint: contains, calls, imports, exports, extends, implements, references, type_of, returns, instantiates, overrides, decorates) alongside the existing uppercase edge_type column. Today's three emission paths (CALLS, INHERITS, IMPLEMENTS) map to calls, extends, implements; the other nine values are valid against the constraint but unfilled until CS-1 ports the 20-language extractor stack. chunkshop.extractors.code_relationships exposes EdgeKind (Literal), EDGE_KINDS (tuple), and edge_type_to_kind() as the source-of-truth for the ontology.
- chunkshop impact-of --edge-kind <kind> filter. New CLI option validated against the 12-value EdgeKind set; ANDs into the recursive-CTE WHERE alongside the existing --edge-type. --edge-kind is None by default, so pre-CS-2 invocations are byte-identical.
- codeparse: Go / TypeScript / JavaScript now parse via tree-sitter (#40). The [code] extra ships tree-sitter-go, tree-sitter-typescript, and tree-sitter-javascript alongside the existing Python + Java grammars. The lossy per-language regex extractors for these three are replaced with real tree-sitter walks matching the python.py/java.py pattern. Concretely: Go methods now resolve their receiver type as parent_name (func (c *Calculator) Add → parent_name='Calculator', symbol_type='method') instead of landing as parentless functions; TS/JS methods carry real multi-line line_start/line_end spans instead of single-line collapses; struct/interface types are typed as class/interface. regex_fallback.py stays as the safety net: environments without the [code] extra (or where a grammar raises) fall through to regex transparently, and ParseResult.parser reports "regex" vs "tree-sitter" accordingly.
- symbol_aware chunker: scope_chain display metadata (#41). Every symbol chunk now carries metadata.scope_chain, a human-readable enclosing-scope path (svc > UserService > get_user) derived from the same inputs as fqn via the new chunkshop.codeparse.build_scope_chain. fqn stays the machine-readable graph join key; scope_chain is the UI/search-result display string. Additive: no existing metadata key changes.
- code_relationships extractor: provenance + provenance_metadata columns on code_edges (CS-5). The PG code_edges table now carries provenance tagging: a typed provenance text NOT NULL DEFAULT 'ast' column (3-value CHECK: 'ast' | 'scip' | 'heuristic') plus a provenance_metadata jsonb NOT NULL DEFAULT '{}' column for per-edge per-channel context (e.g., {synthesizedBy: 'react-render', componentName: 'App'} once CS-3 synthesizers land). Every edge from today's AST extractor is hardcoded to provenance='ast' with empty metadata. Foundation for CS-3: without provenance, an AST-truth edge and a heuristic-guess edge are indistinguishable, and per codegraph's CLAUDE.md "partial coverage is WORSE than none" if you can't tell which is which. chunkshop.extractors.code_relationships exposes Provenance (Literal) and PROVENANCES (tuple).
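The import-aware narrowing in the first entry above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not chunkshop's resolver: the Edge dataclass, narrow_ambiguous, import_tokens, and the regex tokenizer are hypothetical names introduced here. Only the resolution values 'ambiguous_name' and 'import_resolved' and the idea of matching a candidate file's stem against import tokens come from the notes.

```python
import re
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Edge:
    # Hypothetical edge record; the real code_edges row has more columns.
    caller_file: str
    callee_file: str
    edge_type: str
    resolution: str


def import_tokens(import_lines):
    """Split raw import statements into identifier-like tokens."""
    tokens = set()
    for line in import_lines:
        tokens.update(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", line))
    return tokens


def narrow_ambiguous(caller_file, candidate_files, import_lines, edge_type="CALLS"):
    """Resolve a name that matched symbols in several candidate files.

    Exactly one import-supported candidate yields one precise edge.
    Zero or several keep the full fan-out, so no edge is dropped.
    """
    tokens = import_tokens(import_lines)
    supported = [f for f in candidate_files if PurePosixPath(f).stem in tokens]
    if len(supported) == 1:
        return [Edge(caller_file, supported[0], edge_type, "import_resolved")]
    return [Edge(caller_file, f, edge_type, "ambiguous_name") for f in candidate_files]


candidates = ["pkg/a.py", "pkg/b.py"]
narrowed = narrow_ambiguous("app/main.py", candidates, ["from a import x"])
fan_out = narrow_ambiguous("app/main.py", candidates, ["import os"])
```

Here narrowed holds a single import_resolved edge to pkg/a.py, while fan_out keeps one ambiguous_name edge per candidate. Plain token matching can over-match (a file whose stem equals a keyword such as import would always look supported), which is one reason such an edge is still only a heuristic.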
Notes
- edge_type is unchanged: same column name, same uppercase values, same primary-key membership, same write semantics. Existing readers (chunkshop impact-of --edge-type, pg-raggraph consumers, pg-raggraph/tests/integration/test_chunkshop_bridge.py) continue working untouched.
- Cross-backend extension (MariaDB / SQLite / ClickHouse) is a separate follow-up brief blocked by a backend-agnostic code_edges DDL refactor; see skill-output/mission-brief/Mission-Brief-cs2-cross-backend.md.
- Rust parity is a separate follow-up brief; see skill-output/mission-brief/Mission-Brief-cs2-rust-parity.md.
- CS-5 is strictly additive on top of CS-2: edge_type, edge_kind, and the code_edges PRIMARY KEY are byte-identical to the post-CS-2 state.
- Cross-backend extension (MariaDB / SQLite / ClickHouse) is a separate follow-up brief; see skill-output/mission-brief/Mission-Brief-cs5-cross-backend.md. Should be bundled with Mission-Brief-cs2-cross-backend.md since they share the backend-agnostic DDL-seam refactor.
- Rust parity is a separate follow-up brief; see skill-output/mission-brief/Mission-Brief-cs5-rust-parity.md. Blocked on Mission-Brief-cs2-rust-parity.md (which creates the rust/chunkshop/src/extractors/ directory CS-5's Rust port lives in).
- No CLI surface in this PR: a chunkshop impact-of --provenance <kind> filter is YAGNI until CS-3 produces non-AST edges to filter against.
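How the unchanged uppercase edge_type values relate to the typed edge_kind values can be sketched as below. This is a standalone re-creation for illustration, not an import of chunkshop.extractors.code_relationships. The names EdgeKind, EDGE_KINDS, and edge_type_to_kind mirror those listed under Added, but their real signatures are not shown in these notes, so the lookup table and the ValueError on unknown input are assumptions.

```python
from typing import Literal, get_args

# The 12 values permitted by the edge_kind CHECK constraint.
EdgeKind = Literal[
    "contains", "calls", "imports", "exports", "extends", "implements",
    "references", "type_of", "returns", "instantiates", "overrides", "decorates",
]
EDGE_KINDS = get_args(EdgeKind)

# The three emission paths that exist today; the other nine kinds are unfilled.
_EDGE_TYPE_TO_KIND = {
    "CALLS": "calls",
    "INHERITS": "extends",
    "IMPLEMENTS": "implements",
}


def edge_type_to_kind(edge_type: str) -> str:
    """Map an uppercase edge_type to its edge_kind (assumed to raise on unknown input)."""
    try:
        return _EDGE_TYPE_TO_KIND[edge_type]
    except KeyError:
        raise ValueError(f"no edge_kind mapping for edge_type {edge_type!r}") from None


kinds_in_use = sorted(set(_EDGE_TYPE_TO_KIND.values()))
```

Both columns are written side by side, so a reader that filters on edge_type alone never has to know the mapping exists.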
Changed
- code_relationships: name-heuristic cross-file edges now tagged provenance='heuristic' (#42 SC-004). Previously every finalize() edge was hardcoded provenance='ast'. Now only AST-direct intra-file edges (evidence.resolution == 'intra_file') keep 'ast'; cross-file edges resolved by unique-/ambiguous-name matching (CALLS, INHERITS, IMPLEMENTS) are tagged 'heuristic'. This separates name-heuristic edges from a future Rust stack-graphs resolver ('scip') sharing the same code_edges table. Schema unchanged: the provenance CHECK already permitted 'heuristic'. The _emit chokepoint param is typed Provenance (not str).
- A/B emission contract: §4.6 verdict qualified by new §4.6.1. The "NAIVE WINS" verdict (PR #45) tested 2 of 3 retrieval modes defined in §4.2 (naive_vector + graph_leg-as-primary). The hybrid mode (vector-first then graph-expansion; chunkshop's intended production shape, per §4.2 "optional but recommended") was not run. The new §4.6.1 documents this gap, explains why graph-as-primary's failure profile (NER fallback to whitespace tokens skipped 7/12 questions by construction) doesn't extrapolate to hybrid, and puts §3.8's "freeze edge-tier work / deprioritize RM-C / reconsider facts/cooccur" directive ON HOLD pending a hybrid-mode re-run. Tracking issue filed against pg-raggraph.
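The provenance rule from the first entry in this section reduces to a single branch on evidence.resolution. The sketch below assumes a hypothetical helper, provenance_for, and is not the real _emit code path. Only the three provenance values and the resolution strings 'intra_file', 'import_resolved', and 'ambiguous_name' appear in these notes.

```python
from typing import Literal

# The three values permitted by the provenance CHECK constraint.
Provenance = Literal["ast", "scip", "heuristic"]


def provenance_for(resolution: str) -> Provenance:
    """AST-direct intra-file edges keep 'ast'; name-matched cross-file edges are 'heuristic'."""
    return "ast" if resolution == "intra_file" else "heuristic"


tagged = {r: provenance_for(r) for r in ("intra_file", "import_resolved", "ambiguous_name")}
```

Nothing in this rule emits 'scip'; per the notes, that value is kept for a future Rust stack-graphs resolver writing to the same table.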
Fixed
- codeparse: calls inside nested functions now attribute to the outermost emitted symbol (Risk 1). Previously _enclosing_function returned the innermost function, so a call inside a nested function produced a CALLS edge whose caller_node_id referenced a symbol that was never emitted (an orphan edge source). Fixed for Python and the ECMAScript family (TypeScript + JavaScript, which share the walker). Go/Java were already structurally safe (no nested function declarations).
- codeparse: Python symbol spans now include decorator lines (Risk 2). A decorated def/class previously began at the def/class line, dropping @decorator lines from the symbol's original_content and start_line metadata. The span now starts at the first decorator (the decorated_definition node).
- Added a corpus-scale invariant test (no orphan callers, in-bounds spans, no parse crashes, deterministic node_ids) over chunkshop's own source tree, plus realistic per-language fixtures exercising nesting + decorators. This is the regression net that hardens the extractor pattern before it is replicated across new languages (sub-project A).
- Real-code validation for the new extractors: test_rust_corpus.py parses chunkshop's own rust/ tree (~120 files, ~1.3k symbols, ~9.4k calls) in CI; env-gated test_c_corpus.py validated against Postgres 16.3 src/ (1,269 .c files, 25,431 symbols, 250,000 call sites): 0 orphans, 0 out-of-bounds, 0 crashes, 0 regex fallback. Go/Java closure/lambda orphan-safety and Rust use-based import narrowing also get dedicated tests.
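The two fixes and the invariants that guard them can be illustrated with Python's standard-library ast module. This is a simplified sketch that deliberately swaps tree-sitter for ast so it runs without the [code] extra. The names extract and span_start and the SOURCE fixture are hypothetical, and only top-level definitions are treated as emitted symbols.

```python
import ast

SOURCE = '''\
import functools

@functools.cache
def outer(n):
    def inner(k):
        return helper(k)
    return inner(n)

def helper(k):
    return k + 1
'''


def span_start(node):
    """First line of the symbol, including any decorator lines (Risk 2)."""
    decorators = getattr(node, "decorator_list", [])
    return min([node.lineno] + [d.lineno for d in decorators])


def extract(source):
    """Return ({symbol: (start, end)}, [(caller, callee), ...])."""
    symbols, calls = {}, []
    for node in ast.parse(source).body:  # top-level definitions only
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            symbols[node.name] = (span_start(node), node.end_lineno)
            # Walk the whole subtree, nested functions included, and attribute
            # every call to the outermost emitted symbol (Risk 1).
            for sub in ast.walk(node):
                if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                    calls.append((node.name, sub.func.id))
    return symbols, calls


symbols, calls = extract(SOURCE)
line_count = len(SOURCE.splitlines())

# Invariants: every caller is an emitted symbol, and every span is in bounds.
orphans = [caller for caller, _ in calls if caller not in symbols]
out_of_bounds = [
    name for name, (start, end) in symbols.items()
    if not 1 <= start <= end <= line_count
]
```

The call to helper sits inside inner, which is never emitted as a symbol, yet it is recorded with outer as the caller, so orphans stays empty. The span of outer begins at the @functools.cache line, not at the def line.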