feat(chunking): list-aware break point scanner#540
Draft
Code fence detection only matched exactly ``` and toggled open/close on every match, so fences opened with 4+ backticks were never recognized, tilde fences were ignored, and a stray ``` inside a longer fence could prematurely close it. Chunks could then split inside code blocks. findCodeFences now follows CommonMark pairing: the closing fence must use the same character as the opener, be at least as long, and carry no info string. Tilde fences are recognized. Only column-0 fences are detected; indented fences are not.
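The pairing rule described above can be sketched as follows. This is a minimal illustration, not the PR's `findCodeFences`: the name `findFenceRegions`, the `FenceRegion` shape, and the choice to let an unclosed fence run to the end of the text (CommonMark's behavior) are assumptions made for the example.

```typescript
// Hypothetical region shape; the real interface in the PR may differ.
interface FenceRegion {
  start: number; // offset of the first character of the opening fence line
  end: number;   // offset just past the last character of the closing fence line
}

// Sketch of CommonMark-style pairing for column-0 fences only.
function findFenceRegions(text: string): FenceRegion[] {
  const regions: FenceRegion[] = [];
  let open: { char: string; len: number; start: number } | null = null;
  let offset = 0;

  for (const line of text.split("\n")) {
    const m = /^(`{3,}|~{3,})(.*)$/.exec(line);
    if (m !== null) {
      const run = m[1] ?? "";
      const rest = m[2] ?? "";
      const char = run.charAt(0);
      if (open === null) {
        // A backtick fence's info string may not itself contain backticks.
        if (!(char === "`" && rest.includes("`"))) {
          open = { char, len: run.length, start: offset };
        }
      } else if (char === open.char && run.length >= open.len && rest.trim() === "") {
        // Closer: same character, at least as long, no info string.
        regions.push({ start: open.start, end: offset + line.length });
        open = null;
      }
    }
    offset += line.length + 1; // +1 for the "\n" removed by split
  }

  // Assumption: an unclosed fence protects everything up to the end of the text.
  if (open !== null) regions.push({ start: open.start, end: text.length });
  return regions;
}
```

With this rule, a three-backtick line inside a four-backtick fence is treated as content, because it is shorter than the opener.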
Pure rename, no behavior change. CodeFenceRegion becomes ProtectedRegion with an optional `kind` tag (set to 'fence' by findCodeFences). This opens the seam for future passes to contribute other kinds of protected regions without changing the chunker's core contract.

Renames:
- interface CodeFenceRegion -> ProtectedRegion (adds optional kind)
- isInsideCodeFence -> isInsideProtectedRegion
- findBestCutoff param: codeFences -> protectedRegions
- chunkDocumentWithBreakPoints param: codeFences -> protectedRegions

findCodeFences keeps its name as one producer of protected regions. No external callers: the symbols are not re-exported from src/index.ts, so the rename is contained.
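A rough sketch of what the renamed contract might look like. Only the optional `kind` tag and the 'fence' value come from the commit message; the `start`/`end` field names, the exclusive boundary check, and the 'xml' kind are assumptions for illustration.

```typescript
// Assumed shape: start/end are illustrative field names, kind is the new optional tag.
interface ProtectedRegion {
  start: number;
  end: number;
  kind?: string; // e.g. 'fence'; other producers could supply other kinds
}

// Sketch of the renamed check: true when pos falls strictly inside any region.
function isInsideProtectedRegion(pos: number, regions: ProtectedRegion[]): boolean {
  return regions.some((r) => pos > r.start && pos < r.end);
}

// Regions from different producers can share one list ('xml' is a hypothetical kind).
const exampleRegions: ProtectedRegion[] = [
  { start: 10, end: 20, kind: "fence" },
  { start: 30, end: 40, kind: "xml" },
];
console.log(isInsideProtectedRegion(15, exampleRegions)); // true
console.log(isInsideProtectedRegion(25, exampleRegions)); // false
```

Because callers only ask whether a position is protected, a new producer can be added without touching the cutoff logic.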
This was referenced Apr 8, 2026
Mirrors the fix applied in 66e70c0 ("fix(test): reset _productionMode in getDefaultDbPath test"). The createStore-throws test in store.test.ts has the same isolation issue as the parallel test in store.helpers.unit.test.ts: bun runs all test files in a single process so _productionMode state leaks between files. If a previous test file sets production mode, this test fails because getDefaultDbPath returns a real path instead of throwing. Adds the same _resetProductionModeForTesting() call right before the expectation. Test passes deterministically regardless of file ordering. Surfaced when stacked feature branches above this PR shifted bun's test file ordering enough to trigger the latent failure.
Replaces the two naive list patterns in BREAK_PATTERNS with a
stack-based scanner that tracks nested list frames and emits
depth-weighted break points plus a list-end transition break point.
Old behavior:
[/\n[-*]\s/g, 5, 'list']
[/\n\d+\.\s/g, 5, 'numlist']
Both scored every list-item start at 5, so the break point almost
always lost to nearby heading/blank/codeblock scores and chunks
landed mid-item on long lists. Nested sublists and the ordered `1)`
form were not detected at all.
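The gaps in the old patterns are easy to reproduce with the two regexes quoted above. The helper `matchPositions` and the sample text are made up for this illustration.

```typescript
// Collect the index of every match of a pattern in text.
function matchPositions(pattern: RegExp, text: string): number[] {
  const re = new RegExp(pattern.source, "g"); // fresh copy so lastIndex starts at 0
  const positions: number[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) positions.push(m.index);
  return positions;
}

const sample = "intro\n- top\n  - nested\n1) paren\n2. dot\n";

// Old unordered pattern: only the column-0 "- top" matches; the nested item is missed.
const unorderedHits = matchPositions(/\n[-*]\s/g, sample);
// Old ordered pattern: "2. dot" matches; "1) paren" is missed.
const orderedHits = matchPositions(/\n\d+\.\s/g, sample);

console.log(unorderedHits); // [5]
console.log(orderedHits);   // [31]
```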
New scanner (findListBreakPoints):
- depth 0 item (top-level): score 70
- depth 1 item (first sublist): score 45
- depth 2+ item (deeper): score 25
- list-end (list -> non-list transition): score 75
Scope:
- Unordered markers: `-`, `*` (matches previous behavior; `+` not
supported — agents and modern docs don't use it)
- Ordered markers: `1.` and `1)` (new: `1)` was never detected)
- Mixed marker characters at the same indent are treated as one
list (simpler than CommonMark's split rule, better for chunking)
- Nested sublists with proper depth tracking (new)
- Blank lines inside items don't terminate the list
- Column-0 non-list lines terminate the list and emit list-end
Deliberately deferred:
- Loose vs tight list distinction (rendering concern, no chunking
impact)
- Lazy continuation (column-0 line that CommonMark folds back into
the preceding item)
- 4-space indented code blocks inside items (ambiguous with
continuation; defer)
- Tab-as-marker-separator (`-\t`); not a regression since neither
old nor new matches tab indentation
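The tab point can be checked directly against the old unordered pattern. The two sample strings are invented for this check.

```typescript
// Old unordered pattern, without the g flag so test() carries no state between calls.
const oldUnordered = /\n[-*]\s/;

const tabSeparated = "para\n-\titem"; // dash, then a tab as the marker separator
const tabIndented = "para\n\t- item"; // tab indentation before the marker

// \s matches a tab, so the tab-separated form was detected by the old pattern.
console.log(oldUnordered.test(tabSeparated)); // true
// The marker must directly follow the newline, so tab-indented items never matched.
console.log(oldUnordered.test(tabIndented)); // false
```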
Integration: chunkDocument and chunkDocumentAsync now merge
findListBreakPoints output with scanBreakPoints before passing to
chunkDocumentWithBreakPoints. mergeBreakPoints already handles
"higher score wins at same position." AST points continue to layer
on top in the async path.
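As an illustration of the "higher score wins at same position" rule, here is a minimal sketch. It is not the repository's `mergeBreakPoints`; the name `mergeByPosition`, the `BreakPoint` field names, and the non-list types and scores in the usage are assumptions.

```typescript
// Assumed shape, inferred from the [regex, score, type] pattern tuples.
interface BreakPoint {
  pos: number;
  score: number;
  type: string;
}

// Hypothetical merge: keep the highest-scoring point at each position.
function mergeByPosition(a: BreakPoint[], b: BreakPoint[]): BreakPoint[] {
  // concat copies, so the inputs are not mutated by the sort.
  const all = a.concat(b).sort((x, y) => x.pos - y.pos || y.score - x.score);
  const merged: BreakPoint[] = [];
  for (const bp of all) {
    const last = merged[merged.length - 1];
    // Sorted by score descending within a position, so the first one seen wins.
    if (last === undefined || last.pos !== bp.pos) merged.push(bp);
  }
  return merged;
}

// Illustrative inputs; 'blank' and 'h2' and their scores are made up.
const regexPoints: BreakPoint[] = [
  { pos: 10, score: 20, type: "blank" },
  { pos: 30, score: 80, type: "h2" },
];
const listPoints: BreakPoint[] = [
  { pos: 10, score: 70, type: "list-item-0" },
  { pos: 30, score: 45, type: "list-item-1" },
  { pos: 50, score: 75, type: "list-end" },
];
console.log(mergeByPosition(regexPoints, listPoints).map((p) => p.type));
// [ 'list-item-0', 'h2', 'list-end' ]
```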
16 new tests in test/store.test.ts covering empty input, prose,
unordered/ordered/mixed lists, three-deep nesting, mixed marker
nesting, list-end at prose and EOF, blank-line continuation, `+`
rejection, position convention, and an end-to-end integration test
through chunkDocument confirming long lists split at item boundaries.
d692006 to df78b4a
Pseudo-stack
Cross-fork PRs can't use GitHub's stacked-PR mechanism, so all four PRs in this series target `main`. The logical stack:

This PR: #540. Stacked on #539, which is stacked on #538. The diff below shows commits from both of those PRs in addition to the list scanner; they'll collapse once the lower PRs land and I rebase. Parallel to #541 (XML tag regions): #540 and #541 branch off #539 independently and don't depend on each other.
Summary
Replaces the two naive list patterns in `BREAK_PATTERNS` with a stack-based scanner that tracks nested list frames and emits depth-weighted break points plus a list-end transition signal. Long lists now split cleanly at item boundaries instead of mid-item.

The problem
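The existing list detection was two flat regex patterns: `[/\n[-*]\s/g, 5, 'list']` and `[/\n\d+\.\s/g, 5, 'numlist']`.

Two consequences:

- Every list-item start scored 5, so it almost always lost to a nearby heading, blank line, or code block that scored higher than the list-item boundary, and chunks landed mid-item on long lists.
- Nesting was invisible: an indented `- sub-item` inside a `- top-item` list gets no break point at all. The ordered `1)` form isn't detected either. End-of-list transitions (the single most valuable break point in a document with lists) get no special treatment.

Net effect: lists were the worst-handled structure in the chunker.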
The fix
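`findListBreakPoints(text: string): BreakPoint[]` is a line-by-line state machine that maintains a stack of `ListFrame { indent, contentCol }` entries and emits:

- `list-end` (list to non-list transition): score 75
- `list-item-0` (top-level item): score 70
- `list-item-1` (first sublist item): score 45
- `list-item-2` (depth 2 and deeper): score 25

These scores sit in gaps in the existing score table so comparisons stay unambiguous.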
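To make the frame stack concrete, here is a simplified sketch of such a scanner. It is not the PR's `findListBreakPoints`: the name `sketchListBreakPoints`, the `BreakPoint` field names, the marker regex, skipping an item on the very first line (it has no preceding newline), and treating any indented non-list line as a continuation are all assumptions.

```typescript
interface BreakPoint { pos: number; score: number; type: string; } // assumed shape
interface ListFrame { indent: number; contentCol: number; }

const ITEM_SCORES = [70, 45, 25]; // depth 0, depth 1, depth 2+
const LIST_END_SCORE = 75;
// Space-indented "-", "*", "1." or "1)" followed by a single space.
const MARKER = /^( *)([-*]|\d+[.)]) /;

function sketchListBreakPoints(text: string): BreakPoint[] {
  const points: BreakPoint[] = [];
  const stack: ListFrame[] = [];
  let offset = 0;

  for (const line of text.split("\n")) {
    const lineStart = offset;
    offset += line.length + 1;
    if (line.trim() === "") continue; // blank lines never end a list

    const m = MARKER.exec(line);
    if (m !== null) {
      const indent = (m[1] ?? "").length;
      // Drop frames that are indented deeper than this item.
      while (stack.length > 0 && stack[stack.length - 1].indent > indent) stack.pop();
      const top = stack[stack.length - 1];
      // Indented to the parent's content column (or no parent): new nesting level.
      if (top === undefined || indent >= top.contentCol) {
        stack.push({ indent, contentCol: m[0].length });
      }
      const depth = Math.min(stack.length - 1, 2);
      if (lineStart > 0) {
        // pos is the newline before the item line.
        points.push({ pos: lineStart - 1, score: ITEM_SCORES[depth], type: `list-item-${depth}` });
      }
    } else if (stack.length > 0 && /^\S/.test(line)) {
      // Column-0 non-list line: the list ends at the preceding newline.
      points.push({ pos: lineStart - 1, score: LIST_END_SCORE, type: "list-end" });
      stack.length = 0;
    }
  }

  // Still inside a list at end of input.
  if (stack.length > 0) points.push({ pos: text.length, score: LIST_END_SCORE, type: "list-end" });
  return points;
}

const demo = sketchListBreakPoints("Intro\n- a\n  - b\n    - c\n- d\nAfter\n");
console.log(demo.map((p) => `${p.type}@${p.pos}`).join(" "));
// list-item-0@5 list-item-1@9 list-item-2@15 list-item-0@23 list-end@27
```

Markers at the same indent land in the same frame regardless of character, which is how "same indent = same list" falls out of the stack without extra bookkeeping.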
What it handles
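- Unordered markers: `-` and `*` (matches existing behavior; `+` intentionally not supported)
- Ordered markers: `1.` and `1)` (the parenthesis form was previously missed entirely)
- Mixed markers at the same indent are treated as one list. CommonMark says `- foo` → `* bar` at the same indent ends one list and starts another; we don't follow that rule because it would insert a high-priority break where the user visually sees no break at all. For chunking, "same indent = same list" produces better results.
- Nested sublists, with depth tracking.
- Blank lines inside items, which don't terminate the list.
- Column-0 non-list lines, which terminate the list and emit `list-end` at the preceding newline.
- End of input while still inside a list, which emits `list-end` at `text.length`.

What it deliberately doesn't handle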
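Each of these was evaluated against a strict "don't degrade existing behavior" bar and deferred:

- Loose vs tight list distinction: a rendering concern with no chunking impact.
- Lazy continuation: a column-0 line that CommonMark folds back into the preceding item.
- 4-space indented code blocks inside items: ambiguous with continuation.
- Tab handling: neither the old nor the new code matches tab-indented items. The only tab pattern the old regex did match was `-\t` (dash followed by literal tab as separator), which is a typing pattern nobody uses in practice. Space-separated markers only.

A block comment on the scanner documents these limitations so the next person doesn't chase them.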
Integration
`chunkDocument` and `chunkDocumentAsync` now merge `findListBreakPoints` output with `scanBreakPoints` before passing to `chunkDocumentWithBreakPoints`. `mergeBreakPoints` already handles the "higher score wins at same position" case, so the merge is trivial.

In the async path, AST points continue to layer on top via a second `mergeBreakPoints` call when `chunkStrategy === 'auto'`.

Removed
The two old patterns, `[/\n[-*]\s/g, 5, 'list']` and `[/\n\d+\.\s/g, 5, 'numlist']`, are deleted from `BREAK_PATTERNS`.

An existing test in the `scanBreakPoints` block that asserted list detection was updated to assert non-detection (with a pointer to `findListBreakPoints`).

Regression analysis
The only pattern that used to score and no longer does is `-\t` (dash followed by a literal tab character as the marker separator). This is not a typing pattern that occurs in real markdown: editors insert a space after the marker, and style guides specify one. The "tab-indented list" concern is a non-issue because the old regex never matched tab-indented items in the first place (a tab-indented `- foo` was invisible to it).

Everything else the old code detected, the new code detects and scores higher. Patterns the old code missed (nested sublists, the `1)` form, list-end transitions) are now handled properly.

Tests
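16 new tests in `test/store.test.ts` under `describe("findListBreakPoints", ...)`, covering:

- empty input and plain prose
- unordered, ordered (`1.` and `1)`), and mixed lists
- three-deep nesting and mixed-marker nesting
- list-end at prose and at EOF
- blank-line continuation
- `+` markers → not detected (decision 4)
- position convention: `pos` is the `\n` before the line, matching `scanBreakPoints`
- end-to-end: `chunkDocument` on a 200-item list → splits land on item boundaries

Test plan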
- `npx vitest run test/store.test.ts` passes (219/219, was 203 + 16 new)
- `npx vitest run test/ast-chunking.test.ts` passes (12/12)
- `npx tsc -p tsconfig.build.json --noEmit` clean