Skip to content

chunkshop 0.6.1 — Rust parity + connector fixes

Choose a tag to compare

@TheYonk TheYonk released this 26 May 13:36
· 177 commits to main since this release

chunkshop 0.6.1 is a cumulative release. It rolls up everything since the last published release (0.4.2): the 0.5.0 search surface, the 0.6.0 connector platform + code understanding, and the 0.6.1 Rust parity + connector papercuts. Both packages move in lockstep — the chunkshop wheel (PyPI) and the chunkshop-rs crate (crates.io) are published at this version together.


0.6.1 — Rust parity + connector papercuts

No Python-package API changes; this closes the v0.6.0 Python-vs-Rust behavioural gap and fixes two GitHub-connector papercuts.

Rust — SP-1 sync primitives + RawStore parity (RM-B). SyncMode / IncrementalSource / PrunableSource / StaleCursorError / Document.fingerprint, the pg_table tuple cursor (boundary-row safety), the s3 ETag IncrementalSource, the http depth-crawl + ETag/Last-Modified cursor + robots.txt, and the RawStore primitive (filesystem + S3). 40 new Rust tests; a Python↔Rust parity test asserts identical chunk output.

Connectors — GitHub fixes.

  • Auto-detect default branch (#27). Omit branch to resolve the repo's default_branch via GET /repos; a wrong pinned branch falls back to the default and retries instead of 404-ing. Opt out with branch_strict: true.
  • Clone-based walk (#28). Set clone: true to git clone --depth 1 the branch once and walk the tree locally instead of one /contents API call per file — a single fetch regardless of file count. Bounded by max_clone_mb (default 200); falls back to the REST walk if git is unavailable.

Security. The 3 Dependabot advisories tracked in #11 (urllib3 high+high, idna medium) were remediated in 0.6.0 (urllib3 2.7.0, idna 3.16); zero open alerts.


0.6.0 — Connector platform + code understanding

Adds the SP-1 plugin foundation (IncrementalSource / PrunableSource / SyncMode Protocols + entry-point registry with per-plugin import isolation + RawStore primitive + OAuth interfaces), four verified-tier connectors via the new in-monorepo chunkshop-connectors plugin package (blob, rss, github, gdrive — the last two with real-OAuth e2e demos), and 23 experimental-tier stubs registered behind the same seam.

Code understanding. Two new code-aware chunkers — code_aware (stdlib-AST for Python) and symbol_aware (tree-sitter for Python/Java + regex fallback for Go/TS/JS) — plus two code-aware extractors: code_relationships (cross-file edge resolution + write_edges materialization) and code_summary (lede/callable/first-N-sentences backends). New CLI search --by-symbol filter and chunkshop impact-of subcommand with recursive-CTE N-hop traversal; the runner auto-fires extractor.finalize() to materialize the code_edges table.

File rich-parsing (SP-3). PDF / DOCX / PPTX / XLSX / HTML behind opt-in extras, plus depth-bounded URL crawl with ETag + Last-Modified incremental cursors.

Test coverage: core suite 530 → 701+, connectors suite 7 → 140 (verified-tier behavioral + experimental smoke + chunker×extractor orthogonality matrix + attribution audit + e2e). Real-world 5-KB integration demo (3 GitHub repos + cross-cutting MD + 5 arxiv PDFs + 4 LLM-MD + ClickHouse → 5 hybrid-searchable KBs in ~14 minutes).

Verified benefit: 98.2% token reduction vs grep+load with higher precision@5 across 10 realistic engineering queries (docs/benchmarks/grep-vs-hybrid-2026-05-25.md).

Full per-commit changelog: docs/CHANGES-2026-05-25.md. Agent reference (self-contained doc for LLM consumers): docs/AGENT_REFERENCE.md.


0.5.0 — Search lands

Adds lede v0.4 hint-biased extraction, a hybrid retrieval surface across all four backends, and a configurable "Fast-mode" RAG path that summarizes retrieved chunks before they reach an LLM (~90% input-token savings on real corpora). All Python; the Rust crate version was bumped in lockstep but unchanged.

lede v0.4 hint-biased extraction. hints / hint_focus / hint_mode flow through summary_embed to bias extractive summaries toward query terms (no-hint output byte-identical, golden-tested). Per-document hints via CallableSummarizer's *_from_meta kwargs. New lede_top_terms extractor (ranked salient terms with real composite scores). Hint expansion (expand_hints + expand: block) via the optional [lede-spacy] extra.

Hybrid search surface (chunkshop.search). semantic_search / keyword_search / hybrid_search on Postgres, SQLite, MariaDB (full ranked FTS) and ClickHouse (degraded token-filter), with RRF + weighted fusion, candidate over-fetch, and a metadata/source/tags where filter. summarize_hits collapses the top-K retrieved chunks into one heading-aware, query-biased Fast-mode summary.

Search product (configurable). target.fts: {enabled, language} opt-in FTS index (built at ingest / validated on append) across all four backends — default off, absent fts ⇒ ingest byte-identical. New chunkshop search CLI (--query, --k, --return chunks|summary+chunks|summary, --legs, --where KEY=VALUE, --json) and a typed SearchResult + search() API. Guides: docs/hybrid-search.md, docs/fast-mode-rag-benchmarks.md.


Both packages publish from a single tag push. See docs/RELEASING.md for verification steps.