chunkshop 0.9.0
Two correctness/perf fixes. The minor bump is for the search-pool default flip
(#64): read pooling is now on by default, so search callers that previously opened
a fresh connection per call now reuse warm connections — a transparent, opt-out
behavioral change worth a version signal. Also fixes symbol_aware silently
dropping symbols for path-less documents (#69). Python-only this release; the Rust
crate is a lockstep version bump with no functional change.
Fixed
- symbol_aware no longer mislabels path-less code as unsupported_language
  (#69). The chunker resolved language only from a file extension in
  doc.metadata.path / source_path or a path-shaped doc.id. Callers passing a
  synthetic id / stele:// URI with no path (e.g. pg-raggraph / bento) got
  fallback_reason="unsupported_language" and zero symbols for ordinary
  sources: ~45% of .py and ~90% of .ts / .tsx files in a real repo. Language
  is now resolved in layers, most- to least-explicit:
  - a new SymbolAwareChunker.language config override (validated against the
    known language tags),
  - a doc.metadata['language'] hint (exact tag or an extension alias like
    "tsx" / ".ts" / "py"),
  - a broadened set of path-like metadata keys (file_path, filename, uri,
    url, rel_path, … in addition to path / source_path),
  - a path-shaped doc.id (unchanged), then
  - a conservative content heuristic covering all ten supported languages
    (python, java, go, typescript, javascript, rust, c, cpp, csharp, ruby).

  The content heuristic scores language-distinctive markers and returns a result
  only on a clear, unambiguous winner; prose and near-ties return nothing and
  fall back as before, so symbol_aware_fallback now fires only for genuinely
  unsupported input. A wrong guess can never do worse than the prior
  unknown-language fallback.
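
The layered lookup described above can be sketched roughly as follows. This is an illustration only, not chunkshop's actual code: `resolve_language`, the `EXT` alias table, the `MARKERS` table, and the score thresholds are all hypothetical, the marker table covers just three of the ten languages, and only the metadata keys named in the note are included.

```python
from pathlib import PurePosixPath
from typing import Mapping, Optional

# The ten supported tags listed in the release note.
KNOWN = {"python", "java", "go", "typescript", "javascript",
         "rust", "c", "cpp", "csharp", "ruby"}
# Hypothetical extension aliases; the real mapping may differ.
EXT = {"py": "python", "ts": "typescript", "tsx": "typescript",
       "js": "javascript", "rs": "rust", "go": "go", "java": "java",
       "rb": "ruby", "c": "c", "cpp": "cpp", "cs": "csharp"}
# Only the keys the note names explicitly.
PATH_KEYS = ("path", "source_path", "file_path", "filename",
             "uri", "url", "rel_path")
# Hypothetical markers, shown for three languages to keep the sketch short.
MARKERS = {
    "python": ("def ", "import ", "self.", "elif "),
    "go": ("func ", "package ", ":= "),
    "rust": ("let mut ", "impl ", "pub fn "),
}


def _from_tag(tag: str) -> Optional[str]:
    """Accept an exact tag ("python") or an extension alias ("tsx", ".ts")."""
    t = tag.strip().lower().lstrip(".")
    return t if t in KNOWN else EXT.get(t)


def _from_path(value: str) -> Optional[str]:
    """Map a path-shaped string to a language via its file extension."""
    suffix = PurePosixPath(value.split("?", 1)[0]).suffix
    return EXT.get(suffix.lstrip(".").lower())


def _from_content(text: str) -> Optional[str]:
    """Return a language only when one score clearly dominates (assumed rule)."""
    scores = {lang: sum(text.count(m) for m in marks)
              for lang, marks in MARKERS.items()}
    ranked = sorted(scores.values(), reverse=True)
    if ranked[0] >= 3 and ranked[0] >= 2 * ranked[1]:
        return max(scores, key=scores.get)
    return None  # prose or a near-tie: caller falls back as before


def resolve_language(doc_id: str, metadata: Mapping[str, object], text: str,
                     override: Optional[str] = None) -> Optional[str]:
    if override is not None:          # 1. config override, assumed pre-validated
        return override
    hint = metadata.get("language")   # 2. explicit metadata hint
    if isinstance(hint, str) and (lang := _from_tag(hint)):
        return lang
    for key in PATH_KEYS:             # 3. path-like metadata keys
        value = metadata.get(key)
        if isinstance(value, str) and (lang := _from_path(value)):
            return lang
    # 4. path-shaped id, then 5. content heuristic
    return _from_path(doc_id) or _from_content(text)
```

In this sketch a caller with a `stele://` id can attach `{"language": "tsx"}` or a `uri` ending in `.tsx` and skip the heuristic entirely; a `None` result corresponds to the unsupported-language fallback.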
Changed
- Read-connection pool is on by default (#64). The opt-in
  CHUNKSHOP_SEARCH_POOL pool, the biggest single search win (hybrid median
  30.8 ms → 10.5 ms, −66%, byte-identical ranking), now ships on and is
  transparent. Set CHUNKSHOP_SEARCH_POOL=0 (also false / no / off) to restore
  the historical per-call connect. Made safe with three guards:
  - Retry-once on a broken connection. A reused idle connection that turns
    out dead (server restart / idle timeout) is discarded and the query retried
    once on a fresh connection, so a restart self-heals instead of surfacing an
    OperationalError. A fresh connection that fails is a real error and is not
    retried. Validated against real Postgres by terminating a pooled backend.
  - Fork reset. An os.register_at_fork child handler drops inherited
    connections (psycopg sockets do not survive os.fork). The subprocess
    orchestrator spawns via exec and never inherits the pool.
  - Max-idle-age recycle. A connection idle past 300 s is recycled on acquire
    rather than handed out stale.
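
A minimal sketch of how the opt-out flag and the three guards could fit together, assuming a generic `connect` callable rather than chunkshop's real pool. `ReadPool`, `pool_enabled`, and the `broken_errors` parameter are hypothetical names; with psycopg one would presumably pass `psycopg.OperationalError` as the broken-connection error.

```python
import os
import time
from collections import deque
from typing import Any, Callable, Mapping, Tuple, Type

MAX_IDLE_SECONDS = 300.0  # idle limit quoted in the release note


def pool_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """On by default; 0 / false / no / off opt out."""
    value = env.get("CHUNKSHOP_SEARCH_POOL", "1").strip().lower()
    return value not in {"0", "false", "no", "off"}


class ReadPool:
    """Hypothetical pool illustrating the three guards."""

    def __init__(self, connect: Callable[[], Any],
                 broken_errors: Tuple[Type[BaseException], ...] = (ConnectionError,),
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._connect = connect
        self._broken = broken_errors
        self._clock = clock
        self._idle: deque = deque()  # (connection, time it went idle)
        # Guard 2: the forked child forgets connections inherited from the parent.
        if hasattr(os, "register_at_fork"):
            os.register_at_fork(after_in_child=self._idle.clear)

    def _acquire(self) -> Tuple[Any, bool]:
        """Return (connection, reused_flag)."""
        while self._idle:
            conn, since = self._idle.pop()
            if self._clock() - since <= MAX_IDLE_SECONDS:
                return conn, True
            conn.close()  # Guard 3: idle too long, recycle instead of reusing
        return self._connect(), False

    def run(self, query: Callable[[Any], Any]) -> Any:
        conn, reused = self._acquire()
        try:
            result = query(conn)
        except self._broken:
            conn.close()
            if not reused:
                raise  # a fresh connection that fails is a real error
            # Guard 1: a reused connection turned out dead, retry exactly once.
            conn = self._connect()
            result = query(conn)
        self._idle.append((conn, self._clock()))
        return result
```

The sketch omits locking, a pool size cap, and cleanup after non-connection errors, all of which a production pool would need.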
Added
- SymbolAwareChunker.language: force one language for every document in a
  cell, bypassing per-document detection. Must be a known codeparse language tag;
  rejected at config-load otherwise.
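
As an illustration of what config-load validation might look like, here is a small sketch using a dataclass. `SymbolAwareChunkerConfig` is a hypothetical stand-in, not the real config class, and the tag set is simply the ten languages listed under Fixed; whether the real check also accepts extension aliases is not stated, so this sketch accepts exact tags only.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed to match the ten supported languages named in this release note.
KNOWN_LANGUAGES = frozenset({
    "python", "java", "go", "typescript", "javascript",
    "rust", "c", "cpp", "csharp", "ruby",
})


@dataclass(frozen=True)
class SymbolAwareChunkerConfig:  # hypothetical stand-in for the real config
    language: Optional[str] = None  # None keeps per-document detection

    def __post_init__(self) -> None:
        # Reject unknown tags when the config object is built, not at chunk time.
        if self.language is not None and self.language not in KNOWN_LANGUAGES:
            raise ValueError(
                f"unknown language {self.language!r}; "
                f"expected one of {sorted(KNOWN_LANGUAGES)}"
            )
```

Failing at construction means a typo in the config surfaces before any documents are chunked, rather than as a silent fallback later.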