feat(chunking): list-aware break point scanner by galligan · Pull Request #540 · tobi/qmd

galligan · 2026-04-08T13:57:42Z

Pseudo-stack

Cross-fork PRs can't use GitHub's stacked-PR mechanism, so all four PRs in this series target main. The logical stack:

main
└── #538  fix: code fence pairing               (foundation)
    └── #539  refactor: rename to ProtectedRegion
        ├── #540  feat: list-aware chunking        ← you are here (parallel with #541)
        └── #541  feat: XML tag regions             (parallel with #540)

This PR: #540. Stacked on #539 which is stacked on #538. The diff below shows commits from both of those PRs in addition to the list scanner — they'll collapse once the lower PRs land and I rebase. Parallel to #541 (XML tag regions); they branch off #539 independently and don't depend on each other.

Full context: qmd chunker improvements — four-PR series overview

Summary

Replaces the two naive list patterns in BREAK_PATTERNS with a stack-based scanner that tracks nested list frames and emits depth-weighted break points plus a list-end transition signal. Long lists now split cleanly at item boundaries instead of mid-item.

The problem

The existing list detection was two flat regex patterns:

[/\n[-*]\s/g, 5, 'list'],        // unordered
[/\n\d+\.\s/g, 5, 'numlist'],    // ordered

Two consequences:

Score 5 is so low it almost always loses. Any nearby blank line (20), hr (60), heading (50–100), or codeblock boundary (80) outranks it. On long prose-heavy lists, the chunker happily splits mid-item because a `\n\n` scored higher than the list-item boundary.

Zero structural awareness. The regex doesn't know about indentation, so nested sublists are invisible: a `  - sub-item` inside a `- top-item` list gets no break point at all. The ordered `1)` form isn't detected either. End-of-list transitions (the single most valuable break point in a document with lists) get no special treatment.

Net effect: lists were the worst-handled structure in the chunker.
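To make the two consequences concrete, here is a small sketch that runs the two old patterns over a sample containing a nested item and a `1)` item. The patterns are the ones quoted above; the helper `oldListMatches` and the sample text are illustrative only and are not part of the PR.

```typescript
// The two old patterns, as [regex, score, type] tuples.
const OLD_LIST_PATTERNS: [RegExp, number, string][] = [
  [/\n[-*]\s/g, 5, "list"],
  [/\n\d+\.\s/g, 5, "numlist"],
];

// Hypothetical helper (not from the PR): collect every match position.
function oldListMatches(text: string): { pos: number; type: string }[] {
  const hits: { pos: number; type: string }[] = [];
  for (const [pattern, , type] of OLD_LIST_PATTERNS) {
    pattern.lastIndex = 0; // global regexes keep state between exec() calls
    let m: RegExpExecArray | null;
    while ((m = pattern.exec(text)) !== null) {
      hits.push({ pos: m.index, type });
    }
  }
  return hits.sort((a, b) => a.pos - b.pos);
}

const sample = "intro\n- top\n  - nested\n1) paren\n2. dot\n";
const hits = oldListMatches(sample);
// Only "- top" and "2. dot" are found; the nested item and "1) paren" are missed.
console.log(hits);
```

Both surviving matches carry score 5, so any blank line near them would still win the cutoff.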

The fix

findListBreakPoints(text: string): BreakPoint[] — a line-by-line state machine that maintains a stack of ListFrame { indent, contentCol } entries and emits:

Break point	Score	Meaning
`list-end`	75	Transition from list back to non-list flow. Ranks just below a codeblock boundary (80) and above hr (60); list endings are excellent split points.
`list-item-0`	70	Top-level list item start. Matches h4; items at the top of a list are as good a break as a level-4 heading.
`list-item-1`	45	Second-level (first sublist) item. Above blank (20), below h6 (50).
`list-item-2`	25	Third-level and deeper. Just above blank — used only when nothing better is nearby.

Apart from `list-item-0`, which deliberately ties h4 at 70, these scores sit in gaps in the existing score table so comparisons stay unambiguous.
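The state machine described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the `BreakPoint` fields, the name `findListBreakPointsSketch`, and the choice to skip an item on the very first line (which has no preceding `\n`) are assumptions. The PR text does not say how `contentCol` is used, so the sketch records it without consulting it.

```typescript
// Assumed shapes: the PR names BreakPoint and ListFrame { indent, contentCol },
// but the BreakPoint fields below are a guess.
interface BreakPoint { pos: number; score: number; type: string }
interface ListFrame { indent: number; contentCol: number }

// Space-indented "-", "*", "1." or "1)" followed by one space.
const ITEM_RE = /^( *)(?:[-*]|\d+[.)]) /;
const DEPTH_SCORES = [70, 45, 25]; // depth 0, 1, 2+

function findListBreakPointsSketch(text: string): BreakPoint[] {
  const points: BreakPoint[] = [];
  const stack: ListFrame[] = [];
  let lineStart = 0;

  while (lineStart < text.length) {
    let lineEnd = text.indexOf("\n", lineStart);
    if (lineEnd === -1) lineEnd = text.length;
    const line = text.slice(lineStart, lineEnd);
    const item = ITEM_RE.exec(line);

    if (item) {
      const indent = item[1].length;
      // Dedent: drop frames that sit deeper than this marker.
      while (stack.length > 0 && stack[stack.length - 1].indent > indent) {
        stack.pop();
      }
      // Deeper than the current frame (or no frame yet): open a new list level.
      if (stack.length === 0 || indent > stack[stack.length - 1].indent) {
        stack.push({ indent, contentCol: item[0].length });
      }
      const depth = Math.min(stack.length - 1, 2);
      // pos is the "\n" before the line; a first-line item has none, so skip it.
      if (lineStart > 0) {
        points.push({
          pos: lineStart - 1,
          score: DEPTH_SCORES[depth],
          type: "list-item-" + depth,
        });
      }
    } else if (stack.length > 0 && line.trim() !== "" && line[0] !== " ") {
      // Column-0 non-list line: the list is over. Blank and indented lines
      // fall through and leave the stack untouched.
      points.push({ pos: lineStart - 1, score: 75, type: "list-end" });
      stack.length = 0;
    }
    lineStart = lineEnd + 1;
  }

  // A list still open at end of input ends at text.length.
  if (stack.length > 0) {
    points.push({ pos: text.length, score: 75, type: "list-end" });
  }
  return points;
}

const listSample = "intro\n- a\n  - b\n    - c\n- d\nafter\n";
const listPointsDemo = findListBreakPointsSketch(listSample);
console.log(listPointsDemo.map((p) => p.pos + ":" + p.type + ":" + p.score));
```

On this sample the sketch emits item break points at depths 0, 1, 2, and 0 again, followed by a `list-end` at the newline before `after`.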

What it handles

Unordered markers `-` and `*` (matches existing behavior; `+` intentionally not supported)
Ordered markers `1.` and `1)` (the parenthesis form was previously missed entirely)
Mixed marker characters at the same indent are treated as one list. CommonMark says `- foo` → `* bar` at the same indent ends one list and starts another; we don't follow that rule because it would insert a high-priority break where the user visually sees no break at all. For chunking, "same indent = same list" produces better results.
Nested sublists with proper depth tracking. `- foo\n  - bar\n    - baz` correctly produces depth 0/1/2 break points.
Mixed nesting (unordered top with ordered sublist, or vice versa) works with the same stack logic.
Blank lines inside items don't terminate the list — state is preserved until a non-indented non-list line appears.
Column-0 non-list lines terminate the list and emit list-end at the preceding newline.
List at end of document emits list-end at text.length.

What it deliberately doesn't handle

Each of these was evaluated against a strict "don't degrade existing behavior" bar and deferred:

Loose vs tight list distinction — affects rendering, not chunking.
Lazy continuation — a column-0 non-list line that CommonMark folds back into the preceding item. Treated as list-end. The wrong answer is a slightly degraded chunk, not a broken one.
4-space indented code blocks inside items — ambiguous with sublist continuation. Modern docs use fenced code.
Tab indentation: neither the old regex nor the new one handles `\t- item`. The only tab pattern the old regex did match was `-\t` (dash followed by a literal tab as separator), which is a typing pattern nobody uses in practice. Space-separated markers only.
Marker-type transitions at same indent — see "mixed marker characters" above; intentional deviation from CommonMark.

A block comment on the scanner documents these limitations so the next person doesn't chase them.
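The marker rules from the two sections above (which markers count, and which are deliberately ignored) can be condensed into a single pattern. The sketch below is one way to express them; `classifyListLine`, `MarkerInfo`, and the regex itself are hypothetical and may differ from what the scanner actually uses.

```typescript
// Hypothetical result shape for illustration only.
interface MarkerInfo { indent: number; marker: string; ordered: boolean }

// Leading spaces (no tabs), then "-", "*", "N." or "N)", then a single space.
const LIST_MARKER = /^( *)([-*]|\d+[.)]) /;

function classifyListLine(line: string): MarkerInfo | null {
  const m = LIST_MARKER.exec(line);
  if (m === null) return null;
  return { indent: m[1].length, marker: m[2], ordered: /\d/.test(m[2]) };
}

// Accepted: "-", "*", "1.", "1)" and space-indented nesting.
// Rejected: "+", tab indentation, and tab as the marker separator.
const markerSamples = ["- a", "* a", "1. a", "1) a", "  - nested", "+ a", "\t- a", "-\ta"];
for (const s of markerSamples) {
  console.log(JSON.stringify(s), classifyListLine(s));
}
```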

Integration

chunkDocument and chunkDocumentAsync now merge findListBreakPoints output with scanBreakPoints before passing to chunkDocumentWithBreakPoints. mergeBreakPoints already handles the "higher score wins at same position" case, so the merge is trivial:

const regexPoints = scanBreakPoints(content);
const listPoints = findListBreakPoints(content);
const breakPoints = mergeBreakPoints(regexPoints, listPoints);

In the async path, AST points continue to layer on top via a second mergeBreakPoints call when chunkStrategy === 'auto'.
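For readers unfamiliar with the merge step, "higher score wins at same position" can be pictured with the stand-in below. It is not the repository's `mergeBreakPoints`; the `BreakPoint` fields and the name `mergeBreakPointsSketch` are assumptions made for the example.

```typescript
// Assumed BreakPoint shape; the real type may carry different fields.
interface BreakPoint { pos: number; score: number; type: string }

// Stand-in for mergeBreakPoints: keep one break point per position,
// preferring the higher score, and return them in position order.
function mergeBreakPointsSketch(a: BreakPoint[], b: BreakPoint[]): BreakPoint[] {
  const byPos = new Map<number, BreakPoint>();
  for (const bp of a.concat(b)) {
    const existing = byPos.get(bp.pos);
    if (existing === undefined || bp.score > existing.score) {
      byPos.set(bp.pos, bp);
    }
  }
  return Array.from(byPos.values()).sort((x, y) => x.pos - y.pos);
}

// A blank-line point and a list-item point collide at pos 5; the list item wins.
const regexDemo: BreakPoint[] = [
  { pos: 5, score: 20, type: "blank" },
  { pos: 12, score: 1, type: "newline" },
];
const listDemo: BreakPoint[] = [
  { pos: 5, score: 70, type: "list-item-0" },
  { pos: 20, score: 75, type: "list-end" },
];
const mergedDemo = mergeBreakPointsSketch(regexDemo, listDemo);
console.log(mergedDemo);
```

Because the merge only compares scores at identical positions, adding a new scanner is a matter of producing one more array of break points.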

Removed

The two old patterns are deleted from BREAK_PATTERNS:

[/\n[-*]\s/g, 5, 'list'],
[/\n\d+\.\s/g, 5, 'numlist'],

An existing test in the scanBreakPoints block that asserted list detection was updated to assert non-detection (with a pointer to findListBreakPoints).

Regression analysis

The only pattern that used to score and no longer does is `-\t` (dash followed by a literal tab character as the marker separator). This is not a typing pattern that occurs in real markdown: every editor inserts spaces, and every style guide specifies spaces. The "tab-indented list" concern is a non-issue because the old regex never matched tab-indented items in the first place (`\t- foo` was invisible to it).

Everything else the old code detected, the new code detects and scores higher. Patterns the old code missed (nested sublists, the `1)` form, list-end transitions) are now handled properly.

Tests

16 new tests in test/store.test.ts under describe("findListBreakPoints", ...):

Empty input → no break points
Pure prose → no break points
Single unordered list → item + list-end break points, correct scores
Single ordered list with 1.
Single ordered list with 1)
Mixed marker characters at same indent → one list
Nested unordered list → depth 0 and depth 1 scores
Three-deep nesting → depth 0, 1, 2 scores
Mixed nesting (unordered top + ordered sublist)
List followed by prose → list-end at correct position
List at end of document → list-end at text end
Single blank line between items → does not terminate
Blank then non-list prose → terminates list
+ markers → not detected (decision 4)
Position convention → pos is the `\n` before the line, matching scanBreakPoints
Integration test: chunkDocument on a 200-item list → splits land on item boundaries

Test plan

npx vitest run test/store.test.ts passes (219/219, was 203 + 16 new)
npx vitest run test/ast-chunking.test.ts passes (12/12)
npx tsc -p tsconfig.build.json --noEmit clean
CI green

Code fence detection only matched exactly ``` and toggled open/close on every match, so fences opened with 4+ backticks were never recognized, tilde fences were ignored, and a stray ``` inside a longer fence could prematurely close it. Chunks could then split inside code blocks. findCodeFences now follows CommonMark pairing: the closing fence must use the same character as the opener, be at least as long, and carry no info string. Tilde fences are recognized. Only column-0 fences are detected; indented fences are not.
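The pairing rule in this commit message can be sketched as below. This is an illustration under assumptions, not the PR's `findCodeFences`: the `FenceRegion` shape, the name `findFencesSketch`, and the treatment of an unclosed fence (extended to end of text, as CommonMark does) are guesses.

```typescript
// Assumed region shape: character offsets of the opening line start and closing line end.
interface FenceRegion { start: number; end: number }

// Column-0 run of 3+ backticks or 3+ tildes, then the rest of the line.
const FENCE_RE = /^(`{3,}|~{3,})(.*)$/;

function findFencesSketch(text: string): FenceRegion[] {
  const regions: FenceRegion[] = [];
  let open: { char: string; len: number; start: number } | null = null;
  let lineStart = 0;

  while (lineStart < text.length) {
    let lineEnd = text.indexOf("\n", lineStart);
    if (lineEnd === -1) lineEnd = text.length;
    const m = FENCE_RE.exec(text.slice(lineStart, lineEnd));

    if (m !== null) {
      const run = m[1];
      const rest = m[2];
      if (open === null) {
        // CommonMark: a backtick fence's info string may not contain backticks.
        const badInfo = run[0] === "`" && rest.indexOf("`") !== -1;
        if (!badInfo) open = { char: run[0], len: run.length, start: lineStart };
      } else if (run[0] === open.char && run.length >= open.len && rest.trim() === "") {
        // Same character, at least as long, no info string: this closes the fence.
        regions.push({ start: open.start, end: lineEnd });
        open = null;
      }
      // Any other fence-like line inside an open fence is just content.
    }
    lineStart = lineEnd + 1;
  }

  // Assumption: an unclosed fence protects everything to the end of the text.
  if (open !== null) regions.push({ start: open.start, end: text.length });
  return regions;
}

// A 4-tilde fence containing a stray 3-tilde pair stays one region.
const fenceSample = "~~~~md\n~~~\ninner\n~~~\n~~~~\nafter\n";
const fenceRegions = findFencesSketch(fenceSample);
console.log(fenceRegions);
```

With the old toggle-on-every-match approach, the inner 3-character fences would have opened and closed regions of their own; here they are treated as content of the longer fence.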

Pure rename, no behavior change. CodeFenceRegion becomes ProtectedRegion with an optional `kind` tag (set to 'fence' by findCodeFences). This opens the seam for future passes to contribute other kinds of protected regions without changing the chunker's core contract.

Renames:
- interface CodeFenceRegion -> ProtectedRegion (adds optional kind)
- isInsideCodeFence -> isInsideProtectedRegion
- findBestCutoff param: codeFences -> protectedRegions
- chunkDocumentWithBreakPoints param: codeFences -> protectedRegions

findCodeFences keeps its name as one producer of protected regions. No external callers: the symbols are not re-exported from src/index.ts, so the rename is contained.
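As a rough picture of the renamed contract, the sketch below shows a region type with an optional `kind` and a containment check that ignores it. The `start`/`end` fields, the exclusive boundary semantics, and the `Sketch` suffix are assumptions; only the names ProtectedRegion, `kind`, and isInsideProtectedRegion come from the commit message.

```typescript
// Assumed fields: start/end offsets plus the optional kind tag from the rename.
interface ProtectedRegion { start: number; end: number; kind?: string }

// Stand-in for isInsideProtectedRegion: true when pos falls strictly inside
// any region, regardless of which pass produced it.
function isInsideProtectedRegionSketch(pos: number, regions: ProtectedRegion[]): boolean {
  return regions.some((r) => pos > r.start && pos < r.end);
}

const demoRegions: ProtectedRegion[] = [
  { start: 10, end: 20, kind: "fence" }, // as findCodeFences would tag it
  { start: 40, end: 55 },                // kind is optional
];
console.log(isInsideProtectedRegionSketch(15, demoRegions)); // inside the fence
console.log(isInsideProtectedRegionSketch(25, demoRegions)); // between regions
```

Since the check never inspects `kind`, a later pass can add regions of another kind without touching the cutoff logic.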

Mirrors the fix applied in 66e70c0 ("fix(test): reset _productionMode in getDefaultDbPath test"). The createStore-throws test in store.test.ts has the same isolation issue as the parallel test in store.helpers.unit.test.ts: bun runs all test files in a single process so _productionMode state leaks between files. If a previous test file sets production mode, this test fails because getDefaultDbPath returns a real path instead of throwing. Adds the same _resetProductionModeForTesting() call right before the expectation. Test passes deterministically regardless of file ordering. Surfaced when stacked feature branches above this PR shifted bun's test file ordering enough to trigger the latent failure.

Replaces the two naive list patterns in BREAK_PATTERNS with a stack-based scanner that tracks nested list frames and emits depth-weighted break points plus a list-end transition break point.

Old behavior:
[/\n[-*]\s/g, 5, 'list']
[/\n\d+\.\s/g, 5, 'numlist']

Both scored every list-item start at 5, so the break point almost always lost to nearby heading/blank/codeblock scores and chunks landed mid-item on long lists. Nested sublists and the ordered `1)` form were not detected at all.

New scanner (findListBreakPoints):
- depth 0 item (top-level): score 70
- depth 1 item (first sublist): score 45
- depth 2+ item (deeper): score 25
- list-end (list -> non-list transition): score 75

Scope:
- Unordered markers: `-`, `*` (matches previous behavior; `+` not supported, since agents and modern docs don't use it)
- Ordered markers: `1.` and `1)` (new: `1)` was never detected)
- Mixed marker characters at the same indent are treated as one list (simpler than CommonMark's split rule, better for chunking)
- Nested sublists with proper depth tracking (new)
- Blank lines inside items don't terminate the list
- Column-0 non-list lines terminate the list and emit list-end

Deliberately deferred:
- Loose vs tight list distinction (rendering concern, no chunking impact)
- Lazy continuation (column-0 line that CommonMark folds back into the preceding item)
- 4-space indented code blocks inside items (ambiguous with continuation; defer)
- Tab-as-marker-separator (`-\t`); not a regression since neither old nor new matches tab indentation

Integration: chunkDocument and chunkDocumentAsync now merge findListBreakPoints output with scanBreakPoints before passing to chunkDocumentWithBreakPoints. mergeBreakPoints already handles "higher score wins at same position." AST points continue to layer on top in the async path.

16 new tests in test/store.test.ts covering empty input, prose, unordered/ordered/mixed lists, three-deep nesting, mixed marker nesting, list-end at prose and EOF, blank-line continuation, `+` rejection, position convention, and an end-to-end integration test through chunkDocument confirming long lists split at item boundaries.

galligan added 2 commits April 8, 2026 09:12

This was referenced Apr 8, 2026

feat(chunking): XML tag break point scanner #541

Draft

refactor(chunking): rename CodeFenceRegion to ProtectedRegion #539

Draft

galligan added 2 commits April 8, 2026 15:14

galligan force-pushed the feat/chunking-list-aware branch from d692006 to df78b4a April 8, 2026 19:15

galligan wants to merge 4 commits into tobi:main from galligan:feat/chunking-list-aware
