fix(memory/chunker): split oversized single lines at word boundaries#1891
Conversation
📝 WalkthroughWalkthroughsplit_on_lines now enforces max_chars for long single lines by splitting them at whitespace word boundaries; unit tests check both multi-chunk splitting for long whitespace-separated lines and preservation of single oversize tokens. ChangesWord-Boundary Line Splitting
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/memory/chunker.rs`:
- Around line 417-428: Run rustfmt (cargo fmt --all) to fix the formatting
mismatch in the test block around chunk_markdown and the assert! call; reformat
the multi-line assert! invocation in the test that builds text = "word
".repeat(5000) and calls chunk_markdown(&text, 50) so it conforms to rustfmt
style (or run the formatter to automatically adjust the assert! argument layout
and indentation).
- Around line 435-441: The code in chunker.rs around the test using long_word,
chunk_markdown, and the chunks iterator chain is not formatted according to
rustfmt; run rustfmt by executing `cargo fmt --all` (or `cargo fmt`) to reformat
the `.iter().map().collect().join()` chain and the surrounding test, then re-run
tests/CI to confirm the formatting warning is resolved.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 38566159-9799-48f8-a43d-4b94a550b30b
📒 Files selected for processing (1)
src/openhuman/memory/chunker.rs
Reformat assert! and iterator chains in chunker tests per cargo fmt, addressing CodeRabbit review comments on PR tinyhumansai#1891.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/memory/chunker.rs`:
- Around line 216-247: The new line-splitting flow in chunker.rs (the loop over
text.lines() using variables current, chunks, max_chars) lacks required
diagnostics; add tracing/log debug statements with the stable prefix
"[memory::chunker]" at each control path: when you flush current because
current.len() + line.len() + 1 > max_chars, when a whole line fits and you
append it (include current.len(), line.len(), chunks.len()), when a line is
oversized and you enter the word-splitting branch, when you append a word to an
empty or non-empty current, and when you flush due to a word causing overflow;
use size/counter fields (e.g., current_len, line_len, word_len, chunk_count) and
the crate's log/tracing macros (trace!/debug!) rather than logging raw text.
- Around line 437-450: The current test oversize_single_word_emitted_intact()
only checks that the long_word appears after joining chunks, which allows it to
be split across chunk boundaries; update the assertion to require that at least
one element of chunks (from chunk_markdown(&text, 50)) contains the entire
long_word as a contiguous substring (e.g., chunk.content.contains(&long_word))
or equals it when trimmed, and optionally add an assertion that no chunk
contains only a strict prefix or suffix of long_word to ensure the token was not
split across chunks; reference the oversize_single_word_emitted_intact test, the
long_word variable, the text variable and the chunks vector when making these
checks.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0c6e6887-fb9b-4db6-a7f4-004ff79a666c
📒 Files selected for processing (1)
src/openhuman/memory/chunker.rs
| fn oversize_single_word_emitted_intact() { | ||
| // A single 500-char "word" with no whitespace must not be split mid-word | ||
| let long_word = "a".repeat(500); | ||
| let text = format!("hello {} world", long_word); | ||
| let chunks = chunk_markdown(&text, 50); // max_chars=200 | ||
| let all_content: String = chunks | ||
| .iter() | ||
| .map(|c| c.content.clone()) | ||
| .collect::<Vec<_>>() | ||
| .join(""); | ||
| assert!( | ||
| all_content.contains(&long_word), | ||
| "oversize word was corrupted by splitting" | ||
| ); |
There was a problem hiding this comment.
Strengthen the oversize-word test to assert “intact in one chunk,” not just “present after join.”
join("") can still contain long_word even if it was split into adjacent pieces, so this test doesn’t fully protect the no-mid-word-split contract. Assert that at least one chunk contains the full token as a contiguous standalone segment (or equals it after trim), and optionally assert no chunk contains a strict prefix/suffix split pattern.
Proposed test hardening
let chunks = chunk_markdown(&text, 50); // max_chars=200
- let all_content: String = chunks
- .iter()
- .map(|c| c.content.clone())
- .collect::<Vec<_>>()
- .join("");
- assert!(
- all_content.contains(&long_word),
- "oversize word was corrupted by splitting"
- );
+ assert!(
+ chunks.iter().any(|c| c.content.trim() == long_word),
+ "oversize word should appear intact in a single chunk"
+ );🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/openhuman/memory/chunker.rs` around lines 437 - 450, The current test
oversize_single_word_emitted_intact() only checks that the long_word appears
after joining chunks, which allows it to be split across chunk boundaries;
update the assertion to require that at least one element of chunks (from
chunk_markdown(&text, 50)) contains the entire long_word as a contiguous
substring (e.g., chunk.content.contains(&long_word)) or equals it when trimmed,
and optionally add an assertion that no chunk contains only a strict prefix or
suffix of long_word to ensure the token was not split across chunks; reference
the oversize_single_word_emitted_intact test, the long_word variable, the text
variable and the chunks vector when making these checks.
# Conflicts: # src/openhuman/memory/chunker.rs
Summary
Fixed
split_on_linesto split oversized single lines at word boundaries instead of emitting them as one giant chunk.Problem
A single line longer than
max_chars(e.g. a 25,000-char paragraph with no newlines) was emitted as one chunk, ignoring themax_tokensbudget entirely. This caused inflated embedding costs and poor retrieval quality.Changes
max_chars, it is now split at word boundaries (split_whitespace)max_chars(URLs, identifiers) are preserved intact — no mid-word splitsvery_long_single_line_no_newlinestest withvery_long_single_line_splits_at_word_boundariesthat asserts chunks actually fit the budgetoversize_single_word_emitted_intacttest for the single-word survival caseTesting
All 23 chunker unit tests pass:
cargo test --lib -- memory::chunkerCloses #1882
Summary by CodeRabbit