
feat: AST-aware chunking for code files via tree-sitter #449

Merged
tobi merged 1 commit into tobi:main from jamesrisberg:feat/ast-aware-chunking
Mar 29, 2026

Contributor

jamesrisberg commented Mar 21, 2026

Summary

Code files (.ts, .tsx, .js, .jsx, .py, .go, .rs) can now be chunked at function, class, and import boundaries using tree-sitter AST parsing, instead of relying solely on regex-based markdown break points. Opt in with --chunk-strategy auto. Default behavior (regex) is unchanged. Existing collections just need qmd embed --chunk-strategy auto to benefit from AST-aware chunking.
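For illustration, extension-based language detection could look roughly like the sketch below. The function and type names are hypothetical (the actual API of src/ast.ts is not shown here), and the mapping of .js/.jsx to a separate "javascript" label is an assumption.

```typescript
// Hypothetical names: this is a sketch, not the actual src/ast.ts API.
type SupportedLanguage = "typescript" | "javascript" | "python" | "go" | "rust";

// Extensions taken from the PR summary; the language labels are assumptions.
const EXTENSION_TO_LANGUAGE = new Map<string, SupportedLanguage>([
  [".ts", "typescript"],
  [".tsx", "typescript"],
  [".js", "javascript"],
  [".jsx", "javascript"],
  [".py", "python"],
  [".go", "go"],
  [".rs", "rust"],
]);

// Returns the language for a path, or null when the file should stay on
// regex-only chunking (markdown, unknown extensions, no extension).
function detectLanguage(filePath: string): SupportedLanguage | null {
  const dot = filePath.lastIndexOf(".");
  if (dot === -1) return null;
  const ext = filePath.slice(dot).toLowerCase();
  return EXTENSION_TO_LANGUAGE.get(ext) ?? null;
}
```

Returning null rather than throwing keeps non-code files such as .md on the existing regex path.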

Why: QMD's regex chunker was designed for markdown. When indexing code, it splits at blank lines and paragraph boundaries, which regularly cuts function bodies in half. AST-aware chunking detects actual language structure and prefers splitting there instead. In testing on QMD's own codebase, AST mode split 42% fewer function bodies than regex.

How

  • web-tree-sitter (pure WASM, no native compilation) parses code files. Grammar packages for TypeScript, Python, Go, and Rust are included as optional dependencies with pinned versions.
  • src/ast.ts handles language detection from file extension, lazy parser/grammar initialization (cached after first load), and per-language tree-sitter queries that emit scored break points.
  • AST break points are merged with the existing regex break points — highest score wins at each position. The existing findBestCutoff() decay algorithm works unchanged.
  • Arrow functions and function expressions (const handler = () => {}) are detected as chunk boundaries.
  • Parse failures or missing grammars log a warning and fall back to regex-only. Never crashes.
  • qmd status reports which tree-sitter grammars are available.
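The "highest score wins at each position" merge can be sketched as follows. This is a minimal illustration, not the code from src/store.ts: the BreakPoint shape, the function name mergeBreakPointsSketch, and the example regex score of 20 are all assumptions.

```typescript
// Hypothetical shape; the real break point type in src/store.ts may differ.
interface BreakPoint {
  pos: number;   // character offset of a candidate split
  score: number; // higher means a better place to cut
  type: string;  // e.g. "blank-line", "function", "class"
}

// Merge two break point lists. When both lists propose the same position,
// keep whichever candidate has the higher score.
function mergeBreakPointsSketch(
  regexPoints: BreakPoint[],
  astPoints: BreakPoint[],
): BreakPoint[] {
  const bestByPos = new Map<number, BreakPoint>();
  for (const bp of [...regexPoints, ...astPoints]) {
    const existing = bestByPos.get(bp.pos);
    if (existing === undefined || bp.score > existing.score) {
      bestByPos.set(bp.pos, bp);
    }
  }
  // Downstream cutoff selection is assumed to want positions in order.
  return [...bestByPos.values()].sort((a, b) => a.pos - b.pos);
}

// Example with made-up scores: the AST "function" point replaces the
// lower-scored regex point at offset 40.
const mergedExample = mergeBreakPointsSketch(
  [{ pos: 0, score: 20, type: "blank-line" }, { pos: 40, score: 20, type: "blank-line" }],
  [{ pos: 40, score: 90, type: "function" }, { pos: 120, score: 100, type: "class" }],
);
```

Because the merged list has the same shape as the regex-only list, a cutoff routine such as findBestCutoff() can consume it without changes, which matches the claim above.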

Default behavior is regex (unchanged). Users opt into AST chunking with:

qmd embed --chunk-strategy auto
qmd query "search terms" --chunk-strategy auto

Score table (aligned with existing markdown scale):

| AST Node | Score |
| --- | --- |
| Class / interface / struct / impl / trait | 100 |
| Function / method / arrow function | 90 |
| Type alias / enum | 80 |
| Import / use declaration | 60 |
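As a rough illustration of how these scores could be applied, the sketch below maps a handful of node type names from the TypeScript grammar onto the four score tiers. The PR itself uses per-language tree-sitter queries; a plain lookup table is a simplification, and the category names and helper are hypothetical.

```typescript
// Score tiers from the table above; the category names are hypothetical.
type NodeCategory = "container" | "function" | "type" | "import";

const CATEGORY_SCORE: Record<NodeCategory, number> = {
  container: 100,
  function: 90,
  type: 80,
  import: 60,
};

// Illustrative subset of TypeScript grammar node types (not exhaustive).
const TS_NODE_CATEGORY = new Map<string, NodeCategory>([
  ["class_declaration", "container"],
  ["interface_declaration", "container"],
  ["function_declaration", "function"],
  ["method_definition", "function"],
  ["arrow_function", "function"],
  ["type_alias_declaration", "type"],
  ["enum_declaration", "type"],
  ["import_statement", "import"],
]);

// Returns the break point score for a node type, or null when the node is
// not treated as a chunk boundary.
function scoreForNodeType(nodeType: string): number | null {
  const category = TS_NODE_CATEGORY.get(nodeType);
  return category === undefined ? null : CATEGORY_SCORE[category];
}
```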

Testing

  • 26 unit tests in test/ast.test.ts — all 4 languages, error handling, edge cases
  • 7 integration tests in test/store.test.ts — merge, equivalence, strategy bypass
  • Standalone test-ast-chunking.mjs with 63 synthetic tests + real-collection performance scanner
  • Tested on both Node.js and Bun — all tests pass on both runtimes
  • Validated end-to-end: qmd embed --chunk-strategy auto + qmd query on QMD's own codebase (71 files, 567 chunks, no errors)
  • 0 markdown regressions (all .md files produce identical chunks)

Dependency Size Note

The grammar packages add ~72 MB to node_modules, but only ~5 MB of .wasm files are actually used at runtime. The rest is native prebuilds and grammar source that npm downloads but QMD never touches. This is a consequence of how the grammar packages are structured.

If install size is a concern, an alternative approach is to bundle just the 5 MB of .wasm files directly in the QMD repo (assets/grammars/) and drop the grammar packages entirely. This would require resolving grammars from import.meta.url instead of require.resolve(). Happy to implement that if preferred.
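A minimal sketch of what the import.meta.url approach might look like, assuming the .wasm files sit in assets/grammars/ one directory above the compiled module and follow a tree-sitter-<name>.wasm naming scheme. Both the layout and the helper name are assumptions for illustration.

```typescript
// Resolve a bundled grammar relative to the calling module's URL.
// The "../assets/grammars/" layout and file naming are assumptions.
function grammarWasmUrl(moduleUrl: string, grammarName: string): URL {
  return new URL(`../assets/grammars/tree-sitter-${grammarName}.wasm`, moduleUrl);
}

// In an ES module this would be called with import.meta.url, e.g.
//   const url = grammarWasmUrl(import.meta.url, "python");
// A fixed URL is used here so the example runs outside a module context.
const exampleGrammarUrl = grammarWasmUrl("file:///opt/qmd/dist/ast.js", "python");
```

Unlike require.resolve(), this does not depend on the grammar packages being installed in node_modules, which is the point of the alternative.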

For context: QMD's existing node_modules is ~212 MB (mostly node-llama-cpp), and users download ~2 GB of GGUF models on first run.

Changes

| File | Change |
| --- | --- |
| package.json | Add web-tree-sitter (dep) + 4 grammar packages (optionalDeps, pinned) |
| src/ast.ts | New: language detection, AST parsing, health check, symbol stub |
| src/store.ts | Refactor chunking core, add async AST wrapper, thread options |
| src/cli/qmd.ts | Add --chunk-strategy flag, AST status in qmd status |
| src/index.ts | Expose chunkStrategy in SDK |
| test/ast.test.ts | New: 26 AST unit tests |
| test/store.test.ts | 7 new integration tests |
| test-ast-chunking.mjs | New: standalone benchmark + real-collection scanner |
| CHANGELOG.md | Entry under [Unreleased] |
| CLAUDE.md | Architecture note |
| README.md | Document AST chunking, --chunk-strategy, SDK option |

This contribution was developed with AI assistance (Claude Code, Codex 5.4, and Grok 4.2: the whole gang).

Add opt-in AST-aware chunk boundary detection for code files using
web-tree-sitter. When enabled with `--chunk-strategy auto`, code files
(.ts, .tsx, .js, .jsx, .py, .go, .rs) are chunked at function, class,
and import boundaries instead of arbitrary text positions. Default
behavior (`regex`) is unchanged — no surprises on upgrade.

In testing on QMD's own codebase, AST mode split 42% fewer function
bodies across chunk boundaries compared to regex-only chunking.

Usage:
  qmd embed --chunk-strategy auto
  qmd query "search terms" --chunk-strategy auto

What's included:
- Language detection from file extension with support for TypeScript,
  JavaScript (including arrow functions and function expressions),
  Python, Go, and Rust
- Per-language tree-sitter queries with scored break points aligned to
  the existing markdown scale (class=100, function=90, type=80, import=60)
- AST break points merged with regex break points — highest score wins
  at each position, so embedded markdown (comments, docstrings) still
  benefits from regex patterns
- Refactored chunking core: chunkDocumentWithBreakPoints() extracted,
  mergeBreakPoints() added, async chunkDocumentAsync() wrapper for AST
- ChunkStrategy type ("auto" | "regex") threaded through
  generateEmbeddings(), hybridQuery(), structuredSearch(), CLI, and SDK
- getASTStatus() health check wired into `qmd status`
- Parse failures log a warning and fall back to regex — never crash
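The log-and-fall-back behavior could be sketched like this. It is an illustration under assumptions: the wrapper name, the BreakPoint shape, and the injected finder are hypothetical, and only the ChunkStrategy values ("auto" | "regex") come from the text above.

```typescript
// Hypothetical shapes; the real types in src/store.ts and src/ast.ts may differ.
interface BreakPoint {
  pos: number;
  score: number;
  type: string;
}

type ChunkStrategy = "auto" | "regex";
type AstFinder = (content: string, filePath: string) => Promise<BreakPoint[]>;

// Returns AST break points, or an empty list when AST chunking is disabled
// or fails. An empty list means the caller proceeds with regex points only.
async function astBreakPointsOrEmpty(
  content: string,
  filePath: string,
  strategy: ChunkStrategy,
  findAstBreakPoints: AstFinder,
  warn: (message: string) => void = console.warn,
): Promise<BreakPoint[]> {
  if (strategy === "regex") return [];
  try {
    return await findAstBreakPoints(content, filePath);
  } catch (err) {
    // Log instead of swallowing silently, then degrade gracefully.
    const reason = err instanceof Error ? err.message : String(err);
    warn(`AST chunking unavailable for ${filePath}, using regex only: ${reason}`);
    return [];
  }
}
```

Injecting the finder and the warn function keeps the wrapper testable without loading any grammar.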

Hardening:
- Grammar packages are optionalDependencies with pinned versions to
  prevent ABI breaks from semver drift
- web-tree-sitter is a direct dependency (pinned)
- Errors are logged (not silently swallowed) for debuggability
- Tested on both Node.js and Bun (Bun is actually faster)

Testing:
- 26 unit tests (test/ast.test.ts) — all 4 languages, error handling
- 7 integration tests (test/store.test.ts) — merge, equivalence, bypass
- Standalone test-ast-chunking.mjs with 63 synthetic tests and a
  real-collection performance scanner (npx tsx test-ast-chunking.mjs ~/code)
- Validated end-to-end with qmd embed + qmd query on QMD's own codebase
- Zero markdown regressions across all test paths

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jamesrisberg force-pushed the feat/ast-aware-chunking branch from 63c7cc7 to 244ddf5 on March 22, 2026 05:22
jamesrisberg marked this pull request as ready for review on March 22, 2026 05:26
@jamesrisberg
Contributor Author

A second phase can follow if this proves useful in production: enriching search results with actual function names, class names, and signatures from the AST. That work is built out but waiting in the wings; this PR includes an extractSymbols() stub for it.

tobi merged commit 1fb2e28 into tobi:main on Mar 29, 2026
Owner

tobi commented Mar 29, 2026

Great contribution — the AST-aware chunking is exactly what QMD needed for code search. Clean implementation with solid fallback behavior. Thanks!

jamesrisberg added a commit to jamesrisberg/qmd that referenced this pull request Mar 29, 2026
Builds on AST-aware chunking (tobi#449) to extract symbol metadata
(functions, classes, interfaces, types, enums) during indexing
and surface them through search results.

Key changes:
- Single-pass symbol extraction via tree-sitter during embedding
- Symbols stored in content_vectors and read at query time (no re-parse)
- Sequential enrichment to avoid unbounded WASM memory pressure
- Symbol-enriched embeddings for improved semantic retrieval
- CLI, JSON, MCP, and REST endpoints all return symbols
- Store interface/proxy updated for symbols parameter
- WASM parser cleanup via try/finally (no leak on exception)
- DB migration only swallows "duplicate column" errors
- Standalone test script replaced with proper vitest suite
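
The sequential-enrichment and try/finally points can be pictured with the sketch below. It is not the code from the referenced commit: DisposableParser stands in for any WASM-backed parser that must be freed explicitly, and recording an empty symbol list on failure is an assumption.

```typescript
// Hypothetical interface for a WASM-backed parser with manual cleanup.
interface DisposableParser {
  extractSymbols(source: string): string[];
  delete(): void;
}

interface SourceFile {
  path: string;
  source: string;
}

// Process files one at a time so at most one parser is alive at any moment,
// and free each parser in finally so an exception cannot leak it.
async function extractSymbolsSequentially(
  files: SourceFile[],
  createParser: () => Promise<DisposableParser>,
): Promise<Map<string, string[]>> {
  const symbolsByPath = new Map<string, string[]>();
  for (const file of files) {
    const parser = await createParser();
    try {
      symbolsByPath.set(file.path, parser.extractSymbols(file.source));
    } catch {
      // Assumption: a file that fails to parse simply gets no symbols.
      symbolsByPath.set(file.path, []);
    } finally {
      parser.delete();
    }
  }
  return symbolsByPath;
}
```

A Promise.all over the files would be faster but would allocate one parser per file at once, which is the unbounded memory pressure the commit message describes avoiding.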

Tests: 60 passing (50 unit + 10 integration)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>