Skip to content

fix(memory/chunker): split oversized single lines at word boundaries#1891

Merged
senamakel merged 4 commits into
tinyhumansai:mainfrom
iuyup:fix/chunker-split-oversize-lines
May 17, 2026
Merged

fix(memory/chunker): split oversized single lines at word boundaries#1891
senamakel merged 4 commits into
tinyhumansai:mainfrom
iuyup:fix/chunker-split-oversize-lines

Conversation

@iuyup
Copy link
Copy Markdown
Contributor

@iuyup iuyup commented May 16, 2026

Summary

Fixed split_on_lines to split oversized single lines at word boundaries instead of emitting them as one giant chunk.

Problem

A single line longer than max_chars (e.g. a 25,000-char paragraph with no newlines) was emitted as one chunk, ignoring the max_tokens budget entirely. This caused inflated embedding costs and poor retrieval quality.

Changes

  • When a line exceeds max_chars, it is now split at word boundaries (split_whitespace)
  • Single words longer than max_chars (URLs, identifiers) are preserved intact — no mid-word splits
  • Replaced very_long_single_line_no_newlines test with very_long_single_line_splits_at_word_boundaries that asserts chunks actually fit the budget
  • Added oversize_single_word_emitted_intact test for the single-word survival case

Testing

All 23 chunker unit tests pass: cargo test --lib -- memory::chunker

Closes #1882

Summary by CodeRabbit

  • Bug Fixes
    • Improved text chunking so very long lines are split at whitespace word boundaries, avoiding mid-word fragmentation and keeping chunks within size bounds.
    • Ensures extremely long uninterrupted tokens are preserved intact when unavoidable.
    • Updated tests to validate splitting behavior and reassembly integrity.

Review Change Stack

@iuyup iuyup requested a review from a team May 16, 2026 04:12
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 16, 2026

📝 Walkthrough

Walkthrough

split_on_lines now enforces max_chars for long single lines by splitting them at whitespace word boundaries; unit tests check both multi-chunk splitting for long whitespace-separated lines and preservation of single oversize tokens.

Changes

Word-Boundary Line Splitting

Layer / File(s) Summary
Word-boundary splitting implementation and validation
src/openhuman/memory/chunker.rs
split_on_lines now flushes the current chunk when adding a line would exceed max_chars, and if the line itself is too long it is split into multiple chunks at whitespace word boundaries preserving whole words and trailing-newline behavior. Unit tests were updated to assert that long whitespace-separated lines produce multiple budget-respecting chunks and that single oversize uninterrupted tokens are emitted intact.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

rust-core

Poem

🐰 A rabbit stitched lines, not tearing the words,
Long sentences clipped where soft space affords,
Chunks now behave and stay within range,
One giant word stands proud, untouched by the change,
Hoppity-hop—peace for the tokenizer herds!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix(memory/chunker): split oversized single lines at word boundaries' clearly and accurately describes the primary change—handling overlong lines by splitting at word boundaries instead of emitting them as single oversized chunks.
Linked Issues check ✅ Passed The pull request fully addresses all coding requirements from issue #1882: splits overlong lines at word boundaries, preserves oversized words intact, updates tests to verify chunk size compliance, and introduces a new test for oversize-word survival.
Out of Scope Changes check ✅ Passed All changes are directly scoped to the fix: updating split_on_lines logic, replacing and adding relevant unit tests, with no unrelated modifications to exported APIs or other chunking strategies.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot added the rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure. label May 16, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/memory/chunker.rs`:
- Around line 417-428: Run rustfmt (cargo fmt --all) to fix the formatting
mismatch in the test block around chunk_markdown and the assert! call; reformat
the multi-line assert! invocation in the test that builds text = "word
".repeat(5000) and calls chunk_markdown(&text, 50) so it conforms to rustfmt
style (or run the formatter to automatically adjust the assert! argument layout
and indentation).
- Around line 435-441: The code in chunker.rs around the test using long_word,
chunk_markdown, and the chunks iterator chain is not formatted according to
rustfmt; run rustfmt by executing `cargo fmt --all` (or `cargo fmt`) to reformat
the `.iter().map().collect().join()` chain and the surrounding test, then re-run
tests/CI to confirm the formatting warning is resolved.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 38566159-9799-48f8-a43d-4b94a550b30b

📥 Commits

Reviewing files that changed from the base of the PR and between 258837e and ab3a8df.

📒 Files selected for processing (1)
  • src/openhuman/memory/chunker.rs

Comment thread src/openhuman/memory/chunker.rs Outdated
Comment thread src/openhuman/memory/chunker.rs Outdated
@senamakel senamakel self-assigned this May 17, 2026
Reformat assert! and iterator chains in chunker tests per cargo fmt,
addressing CodeRabbit review comments on PR tinyhumansai#1891.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/memory/chunker.rs`:
- Around line 216-247: The new line-splitting flow in chunker.rs (the loop over
text.lines() using variables current, chunks, max_chars) lacks required
diagnostics; add tracing/log debug statements with the stable prefix
"[memory::chunker]" at each control path: when you flush current because
current.len() + line.len() + 1 > max_chars, when a whole line fits and you
append it (include current.len(), line.len(), chunks.len()), when a line is
oversized and you enter the word-splitting branch, when you append a word to an
empty or non-empty current, and when you flush due to a word causing overflow;
use size/counter fields (e.g., current_len, line_len, word_len, chunk_count) and
the crate's log/tracing macros (trace!/debug!) rather than logging raw text.
- Around line 437-450: The current test oversize_single_word_emitted_intact()
only checks that the long_word appears after joining chunks, which allows it to
be split across chunk boundaries; update the assertion to require that at least
one element of chunks (from chunk_markdown(&text, 50)) contains the entire
long_word as a contiguous substring (e.g., chunk.content.contains(&long_word))
or equals it when trimmed, and optionally add an assertion that no chunk
contains only a strict prefix or suffix of long_word to ensure the token was not
split across chunks; reference the oversize_single_word_emitted_intact test, the
long_word variable, the text variable and the chunks vector when making these
checks.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0c6e6887-fb9b-4db6-a7f4-004ff79a666c

📥 Commits

Reviewing files that changed from the base of the PR and between ab3a8df and 57341d2.

📒 Files selected for processing (1)
  • src/openhuman/memory/chunker.rs

Comment thread src/openhuman/memory/chunker.rs
Comment thread src/openhuman/memory/chunker.rs Outdated
Comment on lines +437 to +450
fn oversize_single_word_emitted_intact() {
// A single 500-char "word" with no whitespace must not be split mid-word
let long_word = "a".repeat(500);
let text = format!("hello {} world", long_word);
let chunks = chunk_markdown(&text, 50); // max_chars=200
let all_content: String = chunks
.iter()
.map(|c| c.content.clone())
.collect::<Vec<_>>()
.join("");
assert!(
all_content.contains(&long_word),
"oversize word was corrupted by splitting"
);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Strengthen the oversize-word test to assert “intact in one chunk,” not just “present after join.”

join("") can still contain long_word even if it was split into adjacent pieces, so this test doesn’t fully protect the no-mid-word-split contract. Assert that at least one chunk contains the full token as a contiguous standalone segment (or equals it after trim), and optionally assert no chunk contains a strict prefix/suffix split pattern.

Proposed test hardening
     let chunks = chunk_markdown(&text, 50); // max_chars=200
-    let all_content: String = chunks
-        .iter()
-        .map(|c| c.content.clone())
-        .collect::<Vec<_>>()
-        .join("");
-    assert!(
-        all_content.contains(&long_word),
-        "oversize word was corrupted by splitting"
-    );
+    assert!(
+        chunks.iter().any(|c| c.content.trim() == long_word),
+        "oversize word should appear intact in a single chunk"
+    );
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/memory/chunker.rs` around lines 437 - 450, The current test
oversize_single_word_emitted_intact() only checks that the long_word appears
after joining chunks, which allows it to be split across chunk boundaries;
update the assertion to require that at least one element of chunks (from
chunk_markdown(&text, 50)) contains the entire long_word as a contiguous
substring (e.g., chunk.content.contains(&long_word)) or equals it when trimmed,
and optionally add an assertion that no chunk contains only a strict prefix or
suffix of long_word to ensure the token was not split across chunks; reference
the oversize_single_word_emitted_intact test, the long_word variable, the text
variable and the chunks vector when making these checks.

# Conflicts:
#	src/openhuman/memory/chunker.rs
@senamakel senamakel merged commit c23b14f into tinyhumansai:main May 17, 2026
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat (memory/chunker): split_on_lines never splits within a single line; oversize paragraphs produce one giant chunk

2 participants