chunkshop 0.6.1 — Rust parity + connector fixes
chunkshop 0.6.1 is a cumulative release. It rolls up everything since the last published release (0.4.2): the 0.5.0 search surface, the 0.6.0 connector platform + code understanding, and the 0.6.1 Rust parity + connector papercuts. Both packages move in lockstep — the chunkshop wheel (PyPI) and the chunkshop-rs crate (crates.io) are published at this version together.
0.6.1 — Rust parity + connector papercuts
No Python-package API changes; this closes the v0.6.0 Python-vs-Rust behavioural gap and fixes two GitHub-connector papercuts.
Rust — SP-1 sync primitives + RawStore parity (RM-B). SyncMode / IncrementalSource / PrunableSource / StaleCursorError / Document.fingerprint, the pg_table tuple cursor (boundary-row safety), the s3 ETag IncrementalSource, the http depth-crawl + ETag/Last-Modified cursor + robots.txt, and the RawStore primitive (filesystem + S3). 40 new Rust tests; a Python↔Rust parity test asserts identical chunk output.
Connectors — GitHub fixes.
- Auto-detect default branch (#27). Omit
branchto resolve the repo'sdefault_branchviaGET /repos; a wrong pinned branch falls back to the default and retries instead of 404-ing. Opt out withbranch_strict: true. - Clone-based walk (#28). Set
clone: truetogit clone --depth 1the branch once and walk the tree locally instead of one/contentsAPI call per file — a single fetch regardless of file count. Bounded bymax_clone_mb(default 200); falls back to the REST walk ifgitis unavailable.
Security. The 3 Dependabot advisories tracked in #11 (urllib3 high+high, idna medium) were remediated in 0.6.0 (urllib3 2.7.0, idna 3.16); zero open alerts.
0.6.0 — Connector platform + code understanding
Adds the SP-1 plugin foundation (IncrementalSource / PrunableSource / SyncMode Protocols + entry-point registry with per-plugin import isolation + RawStore primitive + OAuth interfaces), four verified-tier connectors via the new in-monorepo chunkshop-connectors plugin package (blob, rss, github, gdrive — the last two with real-OAuth e2e demos), and 23 experimental-tier stubs registered behind the same seam.
Code understanding. Two new code-aware chunkers — code_aware (stdlib-AST for Python) and symbol_aware (tree-sitter for Python/Java + regex fallback for Go/TS/JS) — plus two code-aware extractors: code_relationships (cross-file edge resolution + write_edges materialization) and code_summary (lede/callable/first-N-sentences backends). New CLI search --by-symbol filter and chunkshop impact-of subcommand with recursive-CTE N-hop traversal; the runner auto-fires extractor.finalize() to materialize the code_edges table.
File rich-parsing (SP-3). PDF / DOCX / PPTX / XLSX / HTML behind opt-in extras, plus depth-bounded URL crawl with ETag + Last-Modified incremental cursors.
Test coverage: core suite 530 → 701+, connectors suite 7 → 140 (verified-tier behavioral + experimental smoke + chunker×extractor orthogonality matrix + attribution audit + e2e). Real-world 5-KB integration demo (3 GitHub repos + cross-cutting MD + 5 arxiv PDFs + 4 LLM-MD + ClickHouse → 5 hybrid-searchable KBs in ~14 minutes).
Verified benefit: 98.2% token reduction vs grep+load with higher precision@5 across 10 realistic engineering queries (docs/benchmarks/grep-vs-hybrid-2026-05-25.md).
Full per-commit changelog: docs/CHANGES-2026-05-25.md. Agent reference (self-contained doc for LLM consumers): docs/AGENT_REFERENCE.md.
0.5.0 — Search lands
Adds lede v0.4 hint-biased extraction, a hybrid retrieval surface across all four backends, and a configurable "Fast-mode" RAG path that summarizes retrieved chunks before they reach an LLM (~90% input-token savings on real corpora). All Python; the Rust crate version was bumped in lockstep but unchanged.
lede v0.4 hint-biased extraction. hints / hint_focus / hint_mode flow through summary_embed to bias extractive summaries toward query terms (no-hint output byte-identical, golden-tested). Per-document hints via CallableSummarizer's *_from_meta kwargs. New lede_top_terms extractor (ranked salient terms with real composite scores). Hint expansion (expand_hints + expand: block) via the optional [lede-spacy] extra.
Hybrid search surface (chunkshop.search). semantic_search / keyword_search / hybrid_search on Postgres, SQLite, MariaDB (full ranked FTS) and ClickHouse (degraded token-filter), with RRF + weighted fusion, candidate over-fetch, and a metadata/source/tags where filter. summarize_hits collapses the top-K retrieved chunks into one heading-aware, query-biased Fast-mode summary.
Search product (configurable). target.fts: {enabled, language} opt-in FTS index (built at ingest / validated on append) across all four backends — default off, absent fts ⇒ ingest byte-identical. New chunkshop search CLI (--query, --k, --return chunks|summary+chunks|summary, --legs, --where KEY=VALUE, --json) and a typed SearchResult + search() API. Guides: docs/hybrid-search.md, docs/fast-mode-rag-benchmarks.md.
Both packages publish from a single tag push. See docs/RELEASING.md for verification steps.