Fixes a regression introduced by 0.9.0's path-less language detection (#69):
generated and minified files that 0.8.3 skipped were being symbol-parsed,
flooding downstream consumers with chunks and OOM-ing them (pg-raggraph#79,
bento). Python-only; the Rust crate is a lockstep version bump with no
functional change.
Fixed
- `symbol_aware` no longer over-parses generated / minified files (#71).
0.9.0's content heuristic started classifying machine-emitted files, which
0.8.3 skipped, as code. A 143 KB `generated.ts` (3,000 trivial functions) became
3,000 symbol chunks; consumers that embed every chunk OOM'd. Two complementary
guards:
  - Content-detection guard (path-less only).
    `detect_language_from_content` now returns `None` for files that look
    machine-emitted: those with an `@generated` / `sourceMappingURL` marker or
    a minified (very long, >2000-char) line. Such files fall back to
    `sentence_aware` (bounded) instead of being symbol-parsed. Explicit signals
    (`cfg.language`, `metadata['language']`, a real path) bypass the guard, so
    a caller can still force such a file through.
  - Per-file symbol-chunk cap. When a document would emit more than
    `max_symbols_per_file` symbol chunks, the chunker logs a warning and falls
    back to `sentence_aware` with `fallback_reason="too_many_symbols"`. This
    catches pathological generated files regardless of how the language was
    resolved.

  Under defaults, the 2,500-function generated `.ts` now yields 62 bounded
  chunks (was 2,500) and a 46 KB minified one-liner yields 23; normal code is
  untouched.
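
The content-detection guard described above can be pictured roughly as follows. This is a minimal sketch, not the library's implementation: `looks_machine_emitted`, `guarded_detect`, the `heuristic` callable, and the two constants are hypothetical names chosen for illustration; only the marker strings and the 2000-character threshold come from the notes above.

```python
from typing import Callable, Optional

# Values taken from the release notes; the constant names are hypothetical.
MINIFIED_LINE_LENGTH = 2000
GENERATED_MARKERS = ("@generated", "sourceMappingURL")


def looks_machine_emitted(text: str) -> bool:
    """True if the text has a generated-file marker or a very long line."""
    if any(marker in text for marker in GENERATED_MARKERS):
        return True
    # A single line longer than the threshold is treated as minified output.
    return any(len(line) > MINIFIED_LINE_LENGTH for line in text.splitlines())


def guarded_detect(
    text: str,
    heuristic: Callable[[str], Optional[str]],
    explicit_language: Optional[str] = None,
) -> Optional[str]:
    """Path-less detection with the guard applied before the heuristic.

    `heuristic` stands in for whatever content-based detector is in use.
    """
    # An explicit signal bypasses the guard, so callers can still force
    # a generated file through symbol parsing.
    if explicit_language is not None:
        return explicit_language
    # Returning None here lets the caller fall back to a bounded strategy.
    if looks_machine_emitted(text):
        return None
    return heuristic(text)
```

Returning `None` rather than raising keeps the guard transparent to callers that already handle the "language unknown" case.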
Added
`SymbolAwareChunker.max_symbols_per_file` (default `2000`, `null` to
disable, must be `>= 1`): caps symbol chunks per document. The default
catches generated files while leaving even very large hand-written sources
alone (real code rarely exceeds a few hundred top-level symbols).
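
The cap's behaviour can be sketched in a few lines. This is an illustration under stated assumptions rather than the chunker's actual code: `chunk_with_cap` and `sentence_fallback` are hypothetical names, and `None` is assumed to be the Python-side spelling of the `null` value that disables the cap.

```python
import logging
from typing import Callable, List, Optional, Tuple

logger = logging.getLogger(__name__)


def chunk_with_cap(
    symbol_chunks: List[str],
    sentence_fallback: Callable[[], List[str]],
    max_symbols_per_file: Optional[int] = 2000,
) -> Tuple[List[str], Optional[str]]:
    """Return (chunks, fallback_reason); the reason is None when no fallback occurs."""
    # Mirror the documented constraint: the cap is either disabled or >= 1.
    if max_symbols_per_file is not None and max_symbols_per_file < 1:
        raise ValueError("max_symbols_per_file must be >= 1 or None")

    # "More than" the cap triggers the fallback; exactly at the cap does not.
    if max_symbols_per_file is not None and len(symbol_chunks) > max_symbols_per_file:
        logger.warning(
            "document would emit %d symbol chunks (cap %d); using sentence fallback",
            len(symbol_chunks),
            max_symbols_per_file,
        )
        return sentence_fallback(), "too_many_symbols"

    return symbol_chunks, None
```

Because the check runs on the emitted chunk count, it applies no matter how the language was resolved, which is what lets it catch generated files that arrive with an explicit language or a real path.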