feat: AST-aware chunking for code files via tree-sitter by jamesrisberg · Pull Request #449 · tobi/qmd

jamesrisberg · 2026-03-21T18:23:25Z

Summary

Code files (.ts, .tsx, .js, .jsx, .py, .go, .rs) can now be chunked at function, class, and import boundaries using tree-sitter AST parsing, instead of relying solely on regex-based markdown break points. Opt in with --chunk-strategy auto. Default behavior (regex) is unchanged. Existing collections just need qmd embed --chunk-strategy auto to benefit from AST-aware chunking.

Why: QMD's regex chunker was designed for markdown. When indexing code, it splits at blank lines and paragraph boundaries, which regularly cuts function bodies in half. AST-aware chunking detects actual language structure and prefers splitting there instead. In testing on QMD's own codebase, AST mode split 42% fewer function bodies than regex.

How

web-tree-sitter (pure WASM, no native compilation) parses code files. Grammar packages for TypeScript, Python, Go, and Rust are included as optional dependencies with pinned versions.
src/ast.ts handles language detection from file extension, lazy parser/grammar initialization (cached after first load), and per-language tree-sitter queries that emit scored break points.
AST break points are merged with the existing regex break points — highest score wins at each position. The existing findBestCutoff() decay algorithm works unchanged.
Arrow functions and function expressions (const handler = () => {}) are detected as chunk boundaries.
Parse failures or missing grammars log a warning and fall back to regex-only. Never crashes.
qmd status reports which tree-sitter grammars are available.
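
The merge and fallback behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/store.ts: the `BreakPoint` shape, the `breakPointsWithFallback` helper, and the signature given to `mergeBreakPoints` are assumptions made for the example.

```typescript
// Hypothetical shape; the real break point type in QMD may differ.
interface BreakPoint {
  pos: number; // character offset in the document
  score: number; // higher means a better place to cut
  source: "regex" | "ast";
}

// Merge two lists, keeping the highest-scoring break point at each position.
function mergeBreakPoints(regex: BreakPoint[], ast: BreakPoint[]): BreakPoint[] {
  const best = new Map<number, BreakPoint>();
  for (const bp of regex.concat(ast)) {
    const current = best.get(bp.pos);
    if (current === undefined || bp.score > current.score) {
      best.set(bp.pos, bp);
    }
  }
  // Return the survivors in document order.
  return Array.from(best.values()).sort((a, b) => a.pos - b.pos);
}

// Hypothetical helper: if AST parsing throws, warn and use regex points only.
function breakPointsWithFallback(
  regex: BreakPoint[],
  parseAst: () => BreakPoint[],
  warn: (msg: string) => void = (msg) => console.warn(msg),
): BreakPoint[] {
  try {
    return mergeBreakPoints(regex, parseAst());
  } catch (err) {
    warn(`AST parse failed, using regex break points only: ${String(err)}`);
    return regex;
  }
}
```

Because the merged list has the same shape as the regex-only list, a downstream cutoff step can consume it without knowing which source produced each point.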

Default behavior is regex (unchanged). Users opt into AST chunking with:

qmd embed --chunk-strategy auto
qmd query "search terms" --chunk-strategy auto
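
As a rough sketch of how the strategy flag and the file extension could combine to decide whether AST chunking applies: the extension list and the `"auto" | "regex"` values come from this PR, while `detectLanguage`, `usesAst`, and the language names in the map are hypothetical names chosen for illustration.

```typescript
type ChunkStrategy = "auto" | "regex";
type AstLanguage = "typescript" | "tsx" | "javascript" | "python" | "go" | "rust";

// Extensions listed in the PR; the language names they map to are assumptions.
const EXTENSION_LANGUAGE = new Map<string, AstLanguage>([
  [".ts", "typescript"],
  [".tsx", "tsx"],
  [".js", "javascript"],
  [".jsx", "javascript"],
  [".py", "python"],
  [".go", "go"],
  [".rs", "rust"],
]);

// Return the language for a path, or null when the extension is unsupported.
function detectLanguage(filePath: string): AstLanguage | null {
  const dot = filePath.lastIndexOf(".");
  if (dot < 0) return null;
  const ext = filePath.slice(dot).toLowerCase();
  return EXTENSION_LANGUAGE.get(ext) ?? null;
}

// AST chunking applies only when the user opted in AND the file is supported.
function usesAst(filePath: string, strategy: ChunkStrategy): boolean {
  return strategy === "auto" && detectLanguage(filePath) !== null;
}
```

Under this sketch, a markdown file takes the regex path under either strategy, which matches the claim that markdown chunking is unaffected.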

Score table (aligned with existing markdown scale):

AST Node	Score
Class / interface / struct / impl / trait	100
Function / method / arrow function	90
Type alias / enum	80
Import / use declaration	60
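
The score table above could be expressed as a lookup along these lines. The category names, the `AstNode` shape, and `toScoredPositions` are hypothetical; only the four score values are taken from the table.

```typescript
type AstNodeCategory = "class" | "function" | "type" | "import";

// Scores from the table above.
const AST_SCORES: Record<AstNodeCategory, number> = {
  class: 100, // class / interface / struct / impl / trait
  function: 90, // function / method / arrow function
  type: 80, // type alias / enum
  import: 60, // import / use declaration
};

// Hypothetical, simplified view of a matched syntax node.
interface AstNode {
  category: AstNodeCategory;
  startIndex: number; // character offset where the node begins
}

interface ScoredPosition {
  pos: number;
  score: number;
}

// Turn matched nodes into scored candidate cut positions.
function toScoredPositions(nodes: AstNode[]): ScoredPosition[] {
  return nodes.map((n) => ({ pos: n.startIndex, score: AST_SCORES[n.category] }));
}
```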

Testing

26 unit tests in test/ast.test.ts — all 4 languages, error handling, edge cases
7 integration tests in test/store.test.ts — merge, equivalence, strategy bypass
Standalone test-ast-chunking.mjs with 63 synthetic tests + real-collection performance scanner
Tested on both Node.js and Bun — all tests pass on both runtimes
Validated end-to-end: qmd embed --chunk-strategy auto + qmd query on QMD's own codebase (71 files, 567 chunks, no errors)
0 markdown regressions (all .md files produce identical chunks)

Dependency Size Note

The grammar packages add ~72 MB to node_modules, but only ~5 MB of .wasm files are actually used at runtime. The rest is native prebuilds and grammar source that npm downloads but QMD never touches. This is a consequence of how the grammar packages are structured.

If install size is a concern, an alternative approach is to bundle just the 5 MB of .wasm files directly in the QMD repo (assets/grammars/) and drop the grammar packages entirely. This would require resolving grammars from import.meta.url instead of require.resolve(). Happy to implement that if preferred.
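
For illustration, resolving a bundled grammar relative to the module URL might look roughly like this. The `grammarUrl` helper and the `tree-sitter-<language>.wasm` file naming are assumptions; only the assets/grammars/ directory and the use of import.meta.url come from the proposal above.

```typescript
// Hypothetical helper: build the URL of a bundled grammar file relative to
// the module that loads it. The file naming scheme is an assumption.
function grammarUrl(language: string, baseUrl: string): URL {
  return new URL(`../assets/grammars/tree-sitter-${language}.wasm`, baseUrl);
}

// Inside an ES module the base would be import.meta.url, for example:
//   grammarUrl("python", import.meta.url)
// A fixed file URL is used here so the example does not depend on module format.
const exampleUrl = grammarUrl("python", "file:///opt/qmd/src/ast.ts");
console.log(exampleUrl.pathname); // /opt/qmd/assets/grammars/tree-sitter-python.wasm
```

Taking the base URL as a parameter keeps the path logic testable without relying on where the module happens to be installed.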

For context: QMD's existing node_modules is ~212 MB (mostly node-llama-cpp), and users download ~2 GB of GGUF models on first run.

Changes

File	Change
`package.json`	Add `web-tree-sitter` (dep) + 4 grammar packages (optionalDeps, pinned)
`src/ast.ts`	New — language detection, AST parsing, health check, symbol stub
`src/store.ts`	Refactor chunking core, add async AST wrapper, thread options
`src/cli/qmd.ts`	Add `--chunk-strategy` flag, AST status in `qmd status`
`src/index.ts`	Expose `chunkStrategy` in SDK
`test/ast.test.ts`	New — 26 AST unit tests
`test/store.test.ts`	7 new integration tests
`test-ast-chunking.mjs`	New — standalone benchmark + real-collection scanner
`CHANGELOG.md`	Entry under `[Unreleased]`
`CLAUDE.md`	Architecture note
`README.md`	Document AST chunking, `--chunk-strategy`, SDK option

This contribution was developed with AI assistance (Claude Code, Codex 5.4, and Grok 4.2: the whole gang).

Add opt-in AST-aware chunk boundary detection for code files using web-tree-sitter. When enabled with `--chunk-strategy auto`, code files (.ts, .tsx, .js, .jsx, .py, .go, .rs) are chunked at function, class, and import boundaries instead of arbitrary text positions. Default behavior (`regex`) is unchanged, so there are no surprises on upgrade.

In testing on QMD's own codebase, AST mode split 42% fewer function bodies across chunk boundaries compared to regex-only chunking.

Usage:

  qmd embed --chunk-strategy auto
  qmd query "search terms" --chunk-strategy auto

What's included:
- Language detection from file extension with support for TypeScript, JavaScript (including arrow functions and function expressions), Python, Go, and Rust
- Per-language tree-sitter queries with scored break points aligned to the existing markdown scale (class=100, function=90, type=80, import=60)
- AST break points merged with regex break points: highest score wins at each position, so embedded markdown (comments, docstrings) still benefits from regex patterns
- Refactored chunking core: chunkDocumentWithBreakPoints() extracted, mergeBreakPoints() added, async chunkDocumentAsync() wrapper for AST
- ChunkStrategy type ("auto" | "regex") threaded through generateEmbeddings(), hybridQuery(), structuredSearch(), CLI, and SDK
- getASTStatus() health check wired into `qmd status`
- Parse failures log a warning and fall back to regex; they never crash

Hardening:
- Grammar packages are optionalDependencies with pinned versions to prevent ABI breaks from semver drift
- web-tree-sitter is a direct dependency (pinned)
- Errors are logged (not silently swallowed) for debuggability
- Tested on both Node.js and Bun (Bun is actually faster)

Testing:
- 26 unit tests (test/ast.test.ts): all 4 languages, error handling
- 7 integration tests (test/store.test.ts): merge, equivalence, bypass
- Standalone test-ast-chunking.mjs with 63 synthetic tests and a real-collection performance scanner (npx tsx test-ast-chunking.mjs ~/code)
- Validated end-to-end with qmd embed + qmd query on QMD's own codebase
- Zero markdown regressions across all test paths

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jamesrisberg · 2026-03-22T05:34:57Z

A second phase can follow if this proves useful in production: enriching search results with actual function names, class names, and signatures from the AST. That work is built out but waiting in the wings; there's an extractSymbols() stub in this PR for it if it comes.

tobi · 2026-03-29T00:01:11Z

Great contribution — the AST-aware chunking is exactly what QMD needed for code search. Clean implementation with solid fallback behavior. Thanks!

Builds on AST-aware chunking (tobi#449) to extract symbol metadata (functions, classes, interfaces, types, enums) during indexing and surface them through search results.

Key changes:
- Single-pass symbol extraction via tree-sitter during embedding
- Symbols stored in content_vectors and read at query time (no re-parse)
- Sequential enrichment to avoid unbounded WASM memory pressure
- Symbol-enriched embeddings for improved semantic retrieval
- CLI, JSON, MCP, and REST endpoints all return symbols
- Store interface/proxy updated for symbols parameter
- WASM parser cleanup via try/finally (no leak on exception)
- DB migration only swallows "duplicate column" errors
- Standalone test script replaced with proper vitest suite

Tests: 60 passing (50 unit + 10 integration)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jamesrisberg force-pushed the feat/ast-aware-chunking branch from 63c7cc7 to 244ddf5 Compare March 22, 2026 05:22

jamesrisberg marked this pull request as ready for review March 22, 2026 05:26

tobi merged commit 1fb2e28 into tobi:main Mar 29, 2026

This was referenced Mar 29, 2026

feat: symbol extraction and search enrichment (Phase 2) #484

Closed

chore: migrate AST chunking tests to vitest #485

Open

tobi merged 1 commit into tobi:main from jamesrisberg:feat/ast-aware-chunking
