feat: AST-aware chunking for code files via tree-sitter #449
Merged
Add opt-in AST-aware chunk boundary detection for code files using
web-tree-sitter. When enabled with `--chunk-strategy auto`, code files
(.ts, .tsx, .js, .jsx, .py, .go, .rs) are chunked at function, class,
and import boundaries instead of arbitrary text positions. Default
behavior (`regex`) is unchanged — no surprises on upgrade.
In testing on QMD's own codebase, AST mode split 42% fewer function
bodies across chunk boundaries compared to regex-only chunking.
Usage:
qmd embed --chunk-strategy auto
qmd query "search terms" --chunk-strategy auto
What's included:
- Language detection from file extension with support for TypeScript,
JavaScript (including arrow functions and function expressions),
Python, Go, and Rust
- Per-language tree-sitter queries with scored break points aligned to
the existing markdown scale (class=100, function=90, type=80, import=60)
- AST break points merged with regex break points — highest score wins
at each position, so embedded markdown (comments, docstrings) still
benefits from regex patterns
- Refactored chunking core: chunkDocumentWithBreakPoints() extracted,
mergeBreakPoints() added, async chunkDocumentAsync() wrapper for AST
- ChunkStrategy type ("auto" | "regex") threaded through
generateEmbeddings(), hybridQuery(), structuredSearch(), CLI, and SDK
- getASTStatus() health check wired into `qmd status`
- Parse failures log a warning and fall back to regex — never crash
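A minimal sketch of the "highest score wins at each position" merge described above. The `BreakPoint` shape, the function name suffix, and the regex score of 20 are assumptions made for illustration; the real `mergeBreakPoints()` in this PR may differ.

```typescript
// Hypothetical break point shape; the PR's actual type may carry more fields.
interface BreakPoint {
  pos: number;   // character offset in the document
  score: number; // higher means a better place to cut
  type: string;  // e.g. "class", "function", "import", "blank"
}

// Merge regex and AST break points. When both lists contain an entry at
// the same position, keep the one with the higher score.
function mergeBreakPointsSketch(regex: BreakPoint[], ast: BreakPoint[]): BreakPoint[] {
  const byPos = new Map<number, BreakPoint>();
  for (const bp of [...regex, ...ast]) {
    const existing = byPos.get(bp.pos);
    if (!existing || bp.score > existing.score) byPos.set(bp.pos, bp);
  }
  // Return the survivors ordered by position.
  return [...byPos.values()].sort((a, b) => a.pos - b.pos);
}

const merged = mergeBreakPointsSketch(
  // Regex break points (the score of 20 is an illustrative value).
  [{ pos: 0, score: 20, type: "blank" }, { pos: 40, score: 20, type: "blank" }],
  // AST break points, using the scores quoted in this commit message.
  [{ pos: 0, score: 60, type: "import" }, { pos: 120, score: 90, type: "function" }],
);
console.log(merged.map((bp) => `${bp.pos}:${bp.type}`).join(" "));
// prints: 0:import 40:blank 120:function
```

Because regex-only positions survive the merge, markdown embedded in comments and docstrings keeps its break points.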
Hardening:
- Grammar packages are optionalDependencies with pinned versions to
prevent ABI breaks from semver drift
- web-tree-sitter is a direct dependency (pinned)
- Errors are logged (not silently swallowed) for debuggability
- Tested on both Node.js and Bun (Bun is actually faster)
Testing:
- 26 unit tests (test/ast.test.ts) — all 4 languages, error handling
- 7 integration tests (test/store.test.ts) — merge, equivalence, bypass
- Standalone test-ast-chunking.mjs with 63 synthetic tests and a
real-collection performance scanner (npx tsx test-ast-chunking.mjs ~/code)
- Validated end-to-end with qmd embed + qmd query on QMD's own codebase
- Zero markdown regressions across all test paths
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
63c7cc7 to 244ddf5
Contributor
Author
A second phase can follow if this proves useful in production: enriching search results with actual function names, class names, and signatures from the AST. It's built out but waiting in the wings; there's an extractSymbols() stub in this PR for that further work if it comes.
Owner
Great contribution: the AST-aware chunking is exactly what QMD needed for code search. Clean implementation with solid fallback behavior. Thanks!
jamesrisberg added a commit to jamesrisberg/qmd that referenced this pull request on Mar 29, 2026
Builds on AST-aware chunking (tobi#449) to extract symbol metadata
(functions, classes, interfaces, types, enums) during indexing and
surface them through search results.
Key changes:
- Single-pass symbol extraction via tree-sitter during embedding
- Symbols stored in content_vectors and read at query time (no re-parse)
- Sequential enrichment to avoid unbounded WASM memory pressure
- Symbol-enriched embeddings for improved semantic retrieval
- CLI, JSON, MCP, and REST endpoints all return symbols
- Store interface/proxy updated for symbols parameter
- WASM parser cleanup via try/finally (no leak on exception)
- DB migration only swallows "duplicate column" errors
- Standalone test script replaced with proper vitest suite
Tests: 60 passing (50 unit + 10 integration)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This was referenced Mar 29, 2026
Summary
Code files (.ts, .tsx, .js, .jsx, .py, .go, .rs) can now be chunked at function, class, and import boundaries using tree-sitter AST parsing, instead of relying solely on regex-based markdown break points. Opt in with `--chunk-strategy auto`. Default behavior (`regex`) is unchanged. Existing collections just need `qmd embed --chunk-strategy auto` to benefit from AST-aware chunking.
Why: QMD's regex chunker was designed for markdown. When indexing code, it splits at blank lines and paragraph boundaries, which regularly cuts function bodies in half. AST-aware chunking detects actual language structure and prefers splitting there instead. In testing on QMD's own codebase, AST mode split 42% fewer function bodies than regex.
How
- `web-tree-sitter` (pure WASM, no native compilation) parses code files. Grammar packages for TypeScript, Python, Go, and Rust are included as optional dependencies with pinned versions.
- `src/ast.ts` handles language detection from file extension, lazy parser/grammar initialization (cached after first load), and per-language tree-sitter queries that emit scored break points.
- AST break points are merged with regex break points, so the `findBestCutoff()` decay algorithm works unchanged.
- Arrow functions and function expressions (`const handler = () => {}`) are detected as chunk boundaries.
- `qmd status` reports which tree-sitter grammars are available.
Default behavior is `regex` (unchanged). Users opt into AST chunking with:
qmd embed --chunk-strategy auto
qmd query "search terms" --chunk-strategy auto
Score table (aligned with existing markdown scale):
- class: 100
- function: 90
- type: 80
- import: 60
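As a rough illustration of the extension-based language detection, the score scale, and the regex fallback described in this section, here is a sketch. Apart from `ChunkStrategy` and the four scores, every name in it is hypothetical: the extension-to-grammar mapping is a guess, and `parseAst` is an injected stand-in for the tree-sitter query step, not the code in `src/ast.ts`.

```typescript
type ChunkStrategy = "auto" | "regex";
// Hypothetical language labels; the grammar used per extension is assumed.
type Language = "typescript" | "tsx" | "javascript" | "python" | "go" | "rust";

interface BreakPoint { pos: number; score: number; type: string }

// Scores as quoted in this PR.
const BREAK_SCORES = { class: 100, function: 90, type: 80, import: 60 } as const;

const EXTENSION_TO_LANGUAGE = new Map<string, Language>([
  [".ts", "typescript"], [".tsx", "tsx"],
  [".js", "javascript"], [".jsx", "javascript"],
  [".py", "python"], [".go", "go"], [".rs", "rust"],
]);

// Return the language for a path, or null when the extension is not a
// supported code file (for example markdown).
function detectLanguage(filePath: string): Language | null {
  const dot = filePath.lastIndexOf(".");
  if (dot < 0) return null;
  return EXTENSION_TO_LANGUAGE.get(filePath.slice(dot).toLowerCase()) ?? null;
}

// Collect AST break points, or an empty list when the strategy is "regex",
// the file is not a supported code file, or parsing throws. An empty list
// means the caller ends up with regex break points only.
async function astBreakPoints(
  filePath: string,
  source: string,
  strategy: ChunkStrategy,
  parseAst: (lang: Language, source: string) => Promise<BreakPoint[]>,
): Promise<BreakPoint[]> {
  if (strategy === "regex") return [];
  const lang = detectLanguage(filePath);
  if (lang === null) return [];
  try {
    return await parseAst(lang, source);
  } catch (err) {
    // Log the failure instead of swallowing it, then fall back to regex.
    console.warn(`AST parse failed for ${filePath}, falling back to regex:`, err);
    return [];
  }
}

console.log(detectLanguage("src/store.ts")); // prints: typescript
console.log(detectLanguage("README.md"));    // prints: null
```

Injecting `parseAst` keeps the sketch runnable without any grammar installed, which mirrors the optional-dependency setup: a missing grammar would surface as a thrown error and take the same fallback path.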
Testing
- 26 unit tests in `test/ast.test.ts`: all 4 languages, error handling, edge cases
- 7 integration tests in `test/store.test.ts`: merge, equivalence, strategy bypass
- Standalone `test-ast-chunking.mjs` with 63 synthetic tests + real-collection performance scanner
- Validated end-to-end with `qmd embed --chunk-strategy auto` + `qmd query` on QMD's own codebase (71 files, 567 chunks, no errors)
- Zero markdown regressions (`.md` files produce identical chunks)
Dependency Size Note
The grammar packages add ~72 MB to `node_modules`, but only ~5 MB of `.wasm` files are actually used at runtime. The rest is native prebuilds and grammar source that npm downloads but QMD never touches. This is a consequence of how the grammar packages are structured.
If install size is a concern, an alternative approach is to bundle just the 5 MB of `.wasm` files directly in the QMD repo (`assets/grammars/`) and drop the grammar packages entirely. This would require resolving grammars from `import.meta.url` instead of `require.resolve()`. Happy to implement that if preferred.
For context: QMD's existing `node_modules` is ~212 MB (mostly `node-llama-cpp`), and users download ~2 GB of GGUF models on first run.
Changes
- `package.json`: `web-tree-sitter` (dep) + 4 grammar packages (optionalDeps, pinned)
- `src/ast.ts`
- `src/store.ts`
- `src/cli/qmd.ts`: `--chunk-strategy` flag, AST status in `qmd status`
- `src/index.ts`: `chunkStrategy` in SDK
- `test/ast.test.ts`
- `test/store.test.ts`
- `test-ast-chunking.mjs`
- `CHANGELOG.md`: `[Unreleased]`
- `CLAUDE.md`
- `README.md`: `--chunk-strategy`, SDK option
This contribution was developed with AI assistance (Claude Code, Codex 5.4, and Grok 4.2: the whole gang).
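For the bundling alternative raised in the Dependency Size Note, resolving a grammar relative to the module's own URL could look roughly like this. It is a sketch, not part of this PR: the function name, the `tree-sitter-<lang>.wasm` file naming, and the assumption that the calling module lives one directory below the repo root are all hypothetical.

```typescript
// Resolve a bundled grammar file relative to the calling module's URL.
// Hypothetical layout: <repo>/src/ast.ts and <repo>/assets/grammars/*.wasm.
function bundledGrammarUrl(lang: string, moduleUrl: string): URL {
  return new URL(`../assets/grammars/tree-sitter-${lang}.wasm`, moduleUrl);
}

// In an ES module the second argument would be import.meta.url; a fixed
// file URL is used here so the example runs anywhere.
const url = bundledGrammarUrl("python", "file:///opt/qmd/src/ast.ts");
console.log(url.href);
// prints: file:///opt/qmd/assets/grammars/tree-sitter-python.wasm
```

Where a filesystem path is needed rather than a URL, `fileURLToPath` from `node:url` converts the result.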