fix(memory/chunker): split oversized single lines at word boundaries by iuyup · Pull Request #1891 · tinyhumansai/openhuman

iuyup · 2026-05-16T04:12:41Z

Summary

Fixed split_on_lines to split oversized single lines at word boundaries instead of emitting them as one giant chunk.

Problem

A single line longer than max_chars (e.g. a 25,000-char paragraph with no newlines) was emitted as one chunk, ignoring the max_tokens budget entirely. This caused inflated embedding costs and poor retrieval quality.

Changes

- When a line exceeds max_chars, it is now split at word boundaries (split_whitespace)
- Single words longer than max_chars (URLs, identifiers) are preserved intact; no mid-word splits
- Replaced very_long_single_line_no_newlines test with very_long_single_line_splits_at_word_boundaries that asserts chunks actually fit the budget
- Added oversize_single_word_emitted_intact test for the single-word survival case
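
For readers following along, the word-boundary rule described in the list above can be sketched roughly as follows. This is an illustrative approximation, not the code merged in this PR: the helper name `split_line_at_words` is hypothetical, length is measured in bytes via `len()`, and words are rejoined with a single space, so runs of whitespace are normalized.

```rust
/// Hypothetical helper (not the PR's actual code): split one oversized line
/// into pieces of at most `max_chars` bytes, breaking only at whitespace.
/// A single word longer than `max_chars` is emitted intact as its own piece.
fn split_line_at_words(line: &str, max_chars: usize) -> Vec<String> {
    let mut pieces = Vec::new();
    let mut current = String::new();

    for word in line.split_whitespace() {
        // The +1 accounts for the joining space when `current` is non-empty.
        let needed = if current.is_empty() {
            word.len()
        } else {
            current.len() + 1 + word.len()
        };
        if needed > max_chars && !current.is_empty() {
            // Adding this word would overflow the budget: flush first.
            pieces.push(std::mem::take(&mut current));
        }
        if !current.is_empty() {
            current.push(' ');
        }
        current.push_str(word);
    }
    if !current.is_empty() {
        pieces.push(current);
    }
    pieces
}
```

With a budget of 200, an input such as `"hello <500 x 'a'> world"` would come back from this sketch as three pieces, with the 500-character word alone in the middle one. That piece exceeds the budget by design, since the alternative is a mid-word split.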

Testing

All 23 chunker unit tests pass: cargo test --lib -- memory::chunker

Closes #1882

Summary by CodeRabbit

Bug Fixes
- Improved text chunking so very long lines are split at whitespace word boundaries, avoiding mid-word fragmentation and keeping chunks within size bounds.
- Ensures extremely long uninterrupted tokens are preserved intact when unavoidable.
- Updated tests to validate splitting behavior and reassembly integrity.

coderabbitai · 2026-05-16T04:12:54Z

📝 Walkthrough

split_on_lines now enforces max_chars for long single lines by splitting them at whitespace word boundaries; unit tests check both multi-chunk splitting for long whitespace-separated lines and preservation of single oversize tokens.

Changes

Word-Boundary Line Splitting

Layer / File(s)	Summary
Word-boundary splitting implementation and validation `src/openhuman/memory/chunker.rs`	`split_on_lines` now flushes the current chunk when adding a line would exceed `max_chars`, and if the line itself is too long it is split into multiple chunks at whitespace word boundaries preserving whole words and trailing-newline behavior. Unit tests were updated to assert that long whitespace-separated lines produce multiple budget-respecting chunks and that single oversize uninterrupted tokens are emitted intact.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

rust-core

Poem

🐰 A rabbit stitched lines, not tearing the words,
Long sentences clipped where soft space affords,
Chunks now behave and stay within range,
One giant word stands proud, untouched by the change,
Hoppity-hop—peace for the tokenizer herds!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix(memory/chunker): split oversized single lines at word boundaries' clearly and accurately describes the primary change—handling overlong lines by splitting at word boundaries instead of emitting them as single oversized chunks.
Linked Issues check	✅ Passed	The pull request fully addresses all coding requirements from issue `#1882`: splits overlong lines at word boundaries, preserves oversized words intact, updates tests to verify chunk size compliance, and introduces a new test for oversize-word survival.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to the fix: updating split_on_lines logic, replacing and adding relevant unit tests, with no unrelated modifications to exported APIs or other chunking strategies.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/memory/chunker.rs`:
- Around line 417-428: Run rustfmt (cargo fmt --all) to fix the formatting
mismatch in the test block around chunk_markdown and the assert! call; reformat
the multi-line assert! invocation in the test that builds text = "word
".repeat(5000) and calls chunk_markdown(&text, 50) so it conforms to rustfmt
style (or run the formatter to automatically adjust the assert! argument layout
and indentation).
- Around line 435-441: The code in chunker.rs around the test using long_word,
chunk_markdown, and the chunks iterator chain is not formatted according to
rustfmt; run rustfmt by executing `cargo fmt --all` (or `cargo fmt`) to reformat
the `.iter().map().collect().join()` chain and the surrounding test, then re-run
tests/CI to confirm the formatting warning is resolved.

📥 Commits

Reviewing files that changed from the base of the PR and between 258837e and ab3a8df.

📒 Files selected for processing (1)

src/openhuman/memory/chunker.rs

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/memory/chunker.rs`:
- Around line 216-247: The new line-splitting flow in chunker.rs (the loop over
text.lines() using variables current, chunks, max_chars) lacks required
diagnostics; add tracing/log debug statements with the stable prefix
"[memory::chunker]" at each control path: when you flush current because
current.len() + line.len() + 1 > max_chars, when a whole line fits and you
append it (include current.len(), line.len(), chunks.len()), when a line is
oversized and you enter the word-splitting branch, when you append a word to an
empty or non-empty current, and when you flush due to a word causing overflow;
use size/counter fields (e.g., current_len, line_len, word_len, chunk_count) and
the crate's log/tracing macros (trace!/debug!) rather than logging raw text.
- Around line 437-450: The current test oversize_single_word_emitted_intact()
only checks that the long_word appears after joining chunks, which allows it to
be split across chunk boundaries; update the assertion to require that at least
one element of chunks (from chunk_markdown(&text, 50)) contains the entire
long_word as a contiguous substring (e.g., chunk.content.contains(&long_word))
or equals it when trimmed, and optionally add an assertion that no chunk
contains only a strict prefix or suffix of long_word to ensure the token was not
split across chunks; reference the oversize_single_word_emitted_intact test, the
long_word variable, the text variable and the chunks vector when making these
checks.
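
As a rough illustration of the first comment above, the sketch below logs only sizes and counters at each control path, under the "[memory::chunker]" prefix. It rests on several assumptions: `pack_lines` is a hypothetical stand-in for the real `split_on_lines`, the event names are invented for this example, and the `debug!` macro is a local substitute that prints to stderr so the block compiles without the `log` or `tracing` crates.

```rust
// Stand-in for the `debug!` macro from the `log` or `tracing` crate, so the
// sketch compiles without external dependencies. Assumption: the real crate
// would import the macro instead of defining it.
macro_rules! debug {
    ($($arg:tt)*) => { eprintln!($($arg)*) };
}

/// Hypothetical packer loosely following the flow described in the review.
/// Only sizes and counters are logged, never the raw text.
fn pack_lines(text: &str, max_chars: usize) -> Vec<String> {
    let mut chunks: Vec<String> = Vec::new();
    let mut current = String::new();

    for line in text.lines() {
        if line.len() <= max_chars {
            if !current.is_empty() && current.len() + 1 + line.len() > max_chars {
                debug!(
                    "[memory::chunker] flush_before_line current_len={} line_len={} chunk_count={}",
                    current.len(),
                    line.len(),
                    chunks.len()
                );
                chunks.push(std::mem::take(&mut current));
            }
            if !current.is_empty() {
                current.push('\n');
            }
            current.push_str(line);
            debug!(
                "[memory::chunker] append_line current_len={} line_len={} chunk_count={}",
                current.len(),
                line.len(),
                chunks.len()
            );
            continue;
        }

        // Oversized line: flush pending content, then pack word by word.
        debug!(
            "[memory::chunker] oversize_line line_len={} max_chars={}",
            line.len(),
            max_chars
        );
        if !current.is_empty() {
            chunks.push(std::mem::take(&mut current));
        }
        for word in line.split_whitespace() {
            if !current.is_empty() && current.len() + 1 + word.len() > max_chars {
                debug!(
                    "[memory::chunker] flush_on_word current_len={} word_len={} chunk_count={}",
                    current.len(),
                    word.len(),
                    chunks.len()
                );
                chunks.push(std::mem::take(&mut current));
            }
            if !current.is_empty() {
                current.push(' ');
            }
            current.push_str(word);
            debug!(
                "[memory::chunker] append_word current_len={} word_len={}",
                current.len(),
                word.len()
            );
        }
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

Per-word logging can be noisy on large inputs, so in a real codebase the `append_word` event would probably sit at trace level rather than debug.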

📥 Commits

Reviewing files that changed from the base of the PR and between ab3a8df and 57341d2.

📒 Files selected for processing (1)

src/openhuman/memory/chunker.rs

coderabbitai · 2026-05-17T03:06:35Z

+    fn oversize_single_word_emitted_intact() {
+        // A single 500-char "word" with no whitespace must not be split mid-word
+        let long_word = "a".repeat(500);
+        let text = format!("hello {} world", long_word);
+        let chunks = chunk_markdown(&text, 50); // max_chars=200
+        let all_content: String = chunks
+            .iter()
+            .map(|c| c.content.clone())
+            .collect::<Vec<_>>()
+            .join("");
+        assert!(
+            all_content.contains(&long_word),
+            "oversize word was corrupted by splitting"
+        );


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Strengthen the oversize-word test to assert “intact in one chunk,” not just “present after join.”

join("") can still contain long_word even if it was split into adjacent pieces, so this test doesn’t fully protect the no-mid-word-split contract. Assert that at least one chunk contains the full token as a contiguous standalone segment (or equals it after trim), and optionally assert no chunk contains a strict prefix/suffix split pattern.

Proposed test hardening

         let chunks = chunk_markdown(&text, 50); // max_chars=200
-        let all_content: String = chunks
-            .iter()
-            .map(|c| c.content.clone())
-            .collect::<Vec<_>>()
-            .join("");
-        assert!(
-            all_content.contains(&long_word),
-            "oversize word was corrupted by splitting"
-        );
+        assert!(
+            chunks.iter().any(|c| c.content.trim() == long_word),
+            "oversize word should appear intact in a single chunk"
+        );
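
To make the difference between the two assertions concrete, here is a small standalone sketch. It assumes chunks are plain `String` values rather than the chunk struct with a `content` field used in the real tests, and both helper names are hypothetical.

```rust
/// Weak check: passes if the word appears in the concatenation of all
/// chunks, even when the word straddles a chunk boundary.
fn present_after_join(chunks: &[String], word: &str) -> bool {
    chunks.concat().contains(word)
}

/// Strong check: at least one chunk, once trimmed, is exactly the word.
fn intact_in_one_chunk(chunks: &[String], word: &str) -> bool {
    chunks.iter().any(|c| c.trim() == word)
}
```

If a 500-character word were cut into pieces of 200, 200, and 100 characters, the weak check would still pass because the pieces are adjacent after concatenation, while the strong check would fail as intended.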

fix(memory/chunker): split oversized single lines at word boundaries (c…

ab3a8df

…loses tinyhumansai#1882)

iuyup requested a review from a team May 16, 2026 04:12

coderabbitai Bot added the rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure. label May 16, 2026

coderabbitai Bot requested changes May 16, 2026

Merge branch 'main' into pr/1891

25d0726

senamakel self-assigned this May 17, 2026

chore(pr-fix): apply rustfmt to chunker tests

57341d2

Reformat assert! and iterator chains in chunker tests per cargo fmt, addressing CodeRabbit review comments on PR tinyhumansai#1891.

coderabbitai Bot requested changes May 17, 2026

Merge remote-tracking branch 'upstream/main' into pr/1891

bd546b9

# Conflicts: # src/openhuman/memory/chunker.rs

senamakel merged commit c23b14f into tinyhumansai:main May 17, 2026
21 checks passed

senamakel merged 4 commits into tinyhumansai:main from iuyup:fix/chunker-split-oversize-lines
