Skip to content

chunkshop 0.7.0 — fact extractors + fact-search + caveman reducer

Choose a tag to compare

@TheYonk TheYonk released this 28 May 12:20
· 150 commits to main since this release

chunkshop 0.7.0 is a cumulative release covering everything since the last published GitHub release (0.6.1): the 0.6.2 gdrive file_ids papercut and the 0.7.0 agent-memory fact-extraction batteries-included drop. Both packages move in lockstep — the chunkshop wheel (PyPI) and the chunkshop-rs crate (crates.io) are published at this version together.


0.7.0 — Fact extractors + fact-search + caveman reducer

Agent-memory fact extraction goes batteries-included, plus a read-time reducer, a fact query command, and a security fix in the github connector. The chunkshop wheel and chunkshop-rs crate are bumped in lockstep; the Rust crate has no behavioural change this cycle (all code changes are Python-side).

Bundled fact extractors (mem0-style). Previously the ConsolidationChunker could only produce facts via the passthrough default (none), a bring-your-own CallableConsolidator, or the bundled extractive consolidator (propositions with null subject/predicate/object). Two first-class, CPU-only extractors now ship:

  • consolidator: { mode: lede } — salient-sentence propositions with a rank-decay confidence (needs the [lede] extra).
  • consolidator: { mode: lede_spacy } — dependency-parsed subject/predicate/object triples (verb-lemma predicate; 1.0 full-SVO / 0.6 partial confidence; captures direct, prepositional, and copular objects). Needs the [lede-spacy] extra + a spaCy model.

Both are hybrid with the existing CallableConsolidator escape hatch and support a confidence_floor (drop low-confidence facts before embedding). An optional summarizer slot fills the episode summary independently of the fact extractor.

caveman reducer. A new dependency-free fluff/stopword reducer on the summarizer contract (chunkshop.summarizers.caveman), swappable anywhere lede is. Exposed at read time via chunkshop search --compress (off by default), which strips low-information tokens from the produced summary.

fact-search command + fact-aware search. chunkshop fact-search queries kind='fact' rows and returns each fact with its originating chunk → doc breadcrumb (and optional lede summary), with a --confidence-floor. Normal chunkshop search now excludes facts by default (--include-facts to opt back in), via a new metadata_not WHERE predicate (IS DISTINCT FROM, a no-op for non-memory tables).

cooccurrence extractor (Tier-1 graph edges). A spaCy-free extractor that pairs rake keyphrases (nodes) co-occurring within lede-salient sentences into weak, undirected co_occurs candidates, emitted as metadata['cooccur'] ({a, b, weight}, word-boundary matched) for a downstream graph consumer to materialize.

Security — github connector scrubs inlined PATs (#31). The verified GitHub connector inlined the PAT into the HTTPS clone URL; on a git failure the token leaked through CalledProcessError's argv/stderr into logs and exception trackers. Inlined credentials are now scrubbed from argv + captured stdout/stderr before the error is raised.


0.6.2 — Connectors papercut

No core API changes; the chunkshop wheel and chunkshop-rs crate are bumped in lockstep with no behavioural change to either.

Connectors — gdrive explicit file_ids selection mode. The verified Google Drive connector gains a second selection mode for single-file / multi-select ingest (e.g. the rows a UI file-picker selected), alongside the existing folder_id/query folder walk:

  • file_ids: [<id>, ...] — ingest exactly the given Drive file IDs. Each is fetched directly via files.get (no folder walk, no /changes feed). Mutually exclusive with folder_id/query.
  • Modified-time delta sync. Cursor is a {file_id: modifiedTime} map; on re-sync only files whose modifiedTime advanced are re-emitted, and unchanged files retain their prior entry via the IncrementalSource merge contract.
  • reprocess: true — force re-emit of every selected file regardless of modifiedTime, so the sink overwrites even unchanged documents.

Both packages publish from a single tag push. See docs/RELEASING.md for verification steps.