chunkshop 0.8.0

github-actions released this 30 May 15:43 · 44 commits to main since this release

Code intelligence goes wide and precise. The codeparse layer grows from 5 to
10 tree-sitter languages (adds Rust, C, C++, C#, Ruby); cross-file edges get
import-aware resolution (ambiguous name matches narrow to the module the
caller actually imports); and the code graph is hardened — an orphan-edge
bug is fixed and locked down by corpus-scale invariants validated against real
codebases (chunkshop's own Rust tree + Postgres 16.3, 250k call sites, zero
orphans). Also lands typed edge_kind, provenance tagging, and scope_chain
display metadata on the code_edges/symbol path.

Added

  • code_relationships: import-aware narrowing of ambiguous cross-file edges. When a callee/base name matches symbols in more than one file, the resolver previously fanned out one CALLS/INHERITS/IMPLEMENTS edge per candidate (resolution='ambiguous_name'). It now consults the caller file's imports (already parsed but previously discarded) and, when exactly one candidate's module is imported, emits a single precise edge tagged resolution='import_resolved' at the unique-match confidence band. The narrowing is conservative and language-agnostic — it matches a candidate file's stem against the caller's import tokens (works for Python from a import x, Rust use crate::a::x, C #include "a.h", etc.); zero or multiple import-supported candidates keep the existing fan-out, so no edge is ever dropped. provenance stays 'heuristic' (an import-narrowed edge is a stronger heuristic, not AST/SCIP truth) — consumers that want to rank it higher key on evidence.resolution. This is the Python-path read of #42; SCIP/stack-graphs resolution remains a Rust follow-up.
  • codeparse: tree-sitter extractors for Rust, C, C++, C#, and Ruby. The [code] extra now ships tree-sitter-rust, tree-sitter-c, tree-sitter-cpp, tree-sitter-c-sharp, and tree-sitter-ruby alongside the existing Python/Java/Go/TS/JS grammars, taking symbol_aware chunking + code_relationships cross-file edges from 5 to 10 real-parser languages. Rust groups methods under their impl/trait type (struct/enum → class, trait → interface); C emits functions + structs; C++ adds inline + out-of-line (Class::method) methods and namespaces; C# maps class/interface/method_declaration and resolves calls via invocation_expression; Ruby maps def/class/module with best-effort call-node call detection. Each extractor attributes calls to the outermost emitted symbol (no orphan edges) and parses lazily; the base install is unaffected, and regex_fallback.py remains the safety net when the [code] extra is absent. A parametrized invariant test enforces no-orphan-callers + in-bounds spans across all five new extractors.
  • code_relationships extractor: typed edge_kind column on code_edges (CS-2). The PG code_edges table now carries a typed, codegraph-aligned edge_kind column (12-value CHECK constraint: contains, calls, imports, exports, extends, implements, references, type_of, returns, instantiates, overrides, decorates) alongside the existing uppercase edge_type column. Today's three emission paths (CALLS, INHERITS, IMPLEMENTS) map to calls, extends, implements; the other nine values are valid against the constraint but unfilled until CS-1 ports the 20-language extractor stack. chunkshop.extractors.code_relationships exposes EdgeKind (Literal), EDGE_KINDS (tuple), and edge_type_to_kind() as the source-of-truth for the ontology.
  • chunkshop impact-of --edge-kind <kind> filter. New CLI option validated against the 12-value EdgeKind set; ANDs into the recursive-CTE WHERE alongside the existing --edge-type. --edge-kind is None by default — pre-CS-2 invocations are byte-identical.
  • codeparse: Go / TypeScript / JavaScript now parse via tree-sitter (#40). The [code] extra ships tree-sitter-go, tree-sitter-typescript, and tree-sitter-javascript alongside the existing Python + Java grammars. The lossy per-language regex extractors for these three are replaced with real tree-sitter walks matching the python.py / java.py pattern. Concretely: Go methods now resolve their receiver type as parent_name (func (c *Calculator) Add → parent_name='Calculator', symbol_type='method') instead of landing as parentless functions; TS/JS methods carry real multi-line line_start/line_end spans instead of single-line collapses; struct/interface types are typed as class/interface. regex_fallback.py stays as the safety net: environments without the [code] extra (or where a grammar raises) fall through to regex transparently, and ParseResult.parser reports "regex" vs "tree-sitter" accordingly.
  • symbol_aware chunker: scope_chain display metadata (#41). Every symbol chunk now carries metadata.scope_chain — a human-readable enclosing-scope path (svc > UserService > get_user) derived from the same inputs as fqn via the new chunkshop.codeparse.build_scope_chain. fqn stays the machine-readable graph join key; scope_chain is the UI/search-result display string. Additive — no existing metadata key changes.
  • code_relationships extractor: provenance + provenance_metadata columns on code_edges (CS-5). The PG code_edges table now carries provenance tagging: a typed provenance text NOT NULL DEFAULT 'ast' column (3-value CHECK: 'ast' | 'scip' | 'heuristic') plus a provenance_metadata jsonb NOT NULL DEFAULT '{}' column for per-edge, per-channel context (e.g., {synthesizedBy: 'react-render', componentName: 'App'} once CS-3 synthesizers land). Every edge from today's AST extractor is hardcoded to provenance='ast' with empty metadata. This is the foundation for CS-3: without provenance, an AST-truth edge and a heuristic-guess edge are indistinguishable, and, as codegraph's CLAUDE.md puts it, "partial coverage is WORSE than none" if you can't tell which is which. chunkshop.extractors.code_relationships exposes Provenance (Literal) and PROVENANCES (tuple).
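
The import-aware narrowing rule in the first Added entry can be illustrated with a minimal sketch. This is not chunkshop's resolver: the function names import_tokens and narrow_candidates, and the tokenization by identifier regex, are assumptions made for illustration. Only the rule itself (match each candidate file's stem against the caller's import tokens, narrow only when exactly one candidate matches, otherwise keep the fan-out) comes from the entry above.

```python
import re
from pathlib import PurePosixPath


def import_tokens(import_lines):
    """Split raw import statements into lowercase identifier tokens.

    Hypothetical helper: the real resolver's tokenization is not documented
    in these notes, so a plain identifier regex is assumed here.
    """
    tokens = set()
    for line in import_lines:
        tokens.update(t.lower() for t in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", line))
    return tokens


def narrow_candidates(candidate_files, caller_import_lines):
    """Return (files_to_link, resolution) for an ambiguous callee name.

    Exactly one import-supported candidate -> a single precise edge.
    Zero or several supported candidates   -> keep the full fan-out,
    so no edge is ever dropped.
    """
    tokens = import_tokens(caller_import_lines)
    supported = [
        f for f in candidate_files
        if PurePosixPath(f).stem.lower() in tokens
    ]
    if len(supported) == 1:
        return supported, "import_resolved"
    return list(candidate_files), "ambiguous_name"


if __name__ == "__main__":
    print(narrow_candidates(["pkg/a.py", "pkg/b.py"], ["from a import x"]))
    print(narrow_candidates(["src/a.rs", "src/b.rs"], ["use crate::a::x;"]))
    print(narrow_candidates(["a.c", "b.c"], ['#include "a.h"']))
    print(narrow_candidates(["pkg/a.py", "pkg/b.py"], ["import os"]))
```

Because the match is on file stems only, a sketch like this stays language-agnostic, at the cost of occasional false matches (for example a candidate file whose stem equals a keyword token); that trade-off is consistent with the entry's note that such edges remain 'heuristic'.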

Notes

  • edge_type is unchanged: same column name, same uppercase values, same primary-key membership, same write semantics. Existing readers (chunkshop impact-of --edge-type, pg-raggraph consumers, pg-raggraph/tests/integration/test_chunkshop_bridge.py) continue working untouched.
  • Cross-backend extension (MariaDB / SQLite / ClickHouse) is a separate follow-up brief blocked by a backend-agnostic code_edges DDL refactor — see skill-output/mission-brief/Mission-Brief-cs2-cross-backend.md.
  • Rust parity is a separate follow-up brief — see skill-output/mission-brief/Mission-Brief-cs2-rust-parity.md.
  • CS-5 is strictly additive on top of CS-2 — edge_type, edge_kind, and the code_edges PRIMARY KEY are byte-identical to the post-CS-2 state.
  • Cross-backend extension (MariaDB / SQLite / ClickHouse) is a separate follow-up brief — see skill-output/mission-brief/Mission-Brief-cs5-cross-backend.md. Should be bundled with Mission-Brief-cs2-cross-backend.md since they share the backend-agnostic DDL-seam refactor.
  • Rust parity is a separate follow-up brief — see skill-output/mission-brief/Mission-Brief-cs5-rust-parity.md. Blocked on Mission-Brief-cs2-rust-parity.md (which creates the rust/chunkshop/src/extractors/ directory CS-5's Rust port lives in).
  • No CLI surface in this PR — chunkshop impact-of --provenance <kind> filter is YAGNI until CS-3 produces non-AST edges to filter against.

Changed

  • code_relationships: name-heuristic cross-file edges now tagged provenance='heuristic' (#42 SC-004). Previously every finalize() edge was hardcoded provenance='ast'. Now only AST-direct intra-file edges (evidence.resolution == 'intra_file') keep 'ast'; cross-file edges resolved by unique-/ambiguous-name matching (CALLS, INHERITS, IMPLEMENTS) are tagged 'heuristic'. This separates name-heuristic edges from a future Rust stack-graphs resolver ('scip') sharing the same code_edges table. Schema unchanged — the provenance CHECK already permitted 'heuristic'. The _emit chokepoint param is typed Provenance (not str).
  • A/B emission contract: §4.6 verdict qualified by new §4.6.1. The "NAIVE WINS" verdict (PR #45) tested 2 of 3 retrieval modes defined in §4.2 (naive_vector + graph_leg-as-primary). The hybrid mode (vector-first then graph-expansion — chunkshop's intended production shape, per §4.2 "optional but recommended") was not run. The new §4.6.1 documents this gap, explains why graph-as-primary's failure profile (NER fallback to whitespace tokens skipped 7/12 questions by construction) doesn't extrapolate to hybrid, and puts §3.8's "freeze edge-tier work / deprioritize RM-C / reconsider facts/cooccur" directive ON HOLD pending a hybrid-mode re-run. Tracking issue filed against pg-raggraph.
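
The provenance rule from the first Changed entry (intra-file AST edges keep 'ast', name-matched cross-file edges become 'heuristic') could be expressed as a small function. This is a sketch under assumptions: provenance_for and ProvenanceSketch are hypothetical names, and the only resolution values used are the three that appear in these notes.

```python
from typing import Literal

# Mirrors the 3-value CHECK on the provenance column.
ProvenanceSketch = Literal["ast", "scip", "heuristic"]


def provenance_for(resolution: str) -> ProvenanceSketch:
    """Pick a provenance tag from an edge's evidence.resolution value.

    Only AST-direct intra-file edges are treated as 'ast'. Every
    name-matched cross-file edge, including an import-narrowed one,
    is 'heuristic'. 'scip' is reserved for a future resolver and is
    never produced by this sketch.
    """
    if resolution == "intra_file":
        return "ast"
    return "heuristic"


if __name__ == "__main__":
    for resolution in ("intra_file", "ambiguous_name", "import_resolved"):
        print(resolution, "->", provenance_for(resolution))
```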

Fixed

  • codeparse: calls inside nested functions now attribute to the outermost emitted symbol (Risk 1). Previously _enclosing_function returned the innermost function, so a call inside a nested function produced a CALLS edge whose caller_node_id referenced a symbol that was never emitted (an orphan edge source). Fixed for Python and the ECMAScript family (TypeScript + JavaScript, which share the walker). Go/Java were already structurally safe (no nested function declarations).
  • codeparse: Python symbol spans now include decorator lines (Risk 2). A decorated def/class previously began at the def/class line, dropping @decorator lines from the symbol's original_content and start_line metadata. The span now starts at the first decorator (the decorated_definition node).
  • Added a corpus-scale invariant test (no orphan callers, in-bounds spans, no parse crashes, deterministic node_ids) over chunkshop's own source tree, plus realistic per-language fixtures exercising nesting + decorators. This is the regression net that hardens the extractor pattern before it is replicated across new languages (sub-project A).
  • Real-code validation for the new extractors: test_rust_corpus.py parses chunkshop's own rust/ tree (~120 files, ~1.3k symbols, ~9.4k calls) in CI; env-gated test_c_corpus.py validated against Postgres 16.3 src/ (1,269 .c files, 25,431 symbols, 250,000 call sites) — 0 orphans, 0 out-of-bounds, 0 crashes, 0 regex fallback. Go/Java closure/lambda orphan-safety and Rust use-based import narrowing also get dedicated tests.
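
The two Python fixes above (outermost-symbol call attribution and decorator-inclusive spans) can be illustrated with the standard-library ast module. This swaps in ast for tree-sitter purely to keep the example dependency-free; the function names are hypothetical, and only top-level functions are treated as emitted symbols, which is a simplification of the real extractor.

```python
import ast


def calls_by_outermost_symbol(source: str) -> dict[str, list[str]]:
    """Attribute every simple-name call to its outermost enclosing function.

    Calls made inside nested functions are credited to the top-level
    function, so no caller refers to a symbol that was never emitted.
    """
    result: dict[str, list[str]] = {}
    for node in ast.parse(source).body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        names = []
        # Walk the body only, so decorator and default-argument calls
        # are not counted as calls made by the function itself.
        for stmt in node.body:
            for sub in ast.walk(stmt):
                if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                    names.append(sub.func.id)
        result[node.name] = sorted(names)
    return result


def span_start(node) -> int:
    """First line of a symbol's span, including any decorator lines."""
    decorator_lines = [d.lineno for d in getattr(node, "decorator_list", [])]
    return min([node.lineno, *decorator_lines])


NESTED = """
def outer():
    def inner():
        helper()
    inner()
"""

DECORATED = "@deco\ndef f():\n    pass\n"

if __name__ == "__main__":
    print(calls_by_outermost_symbol(NESTED))
    print(span_start(ast.parse(DECORATED).body[0]))
```

A no-orphan invariant then reduces to checking that every caller key is one of the emitted symbol names, which is the kind of property the corpus-scale tests assert.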