Fix UTF-8 body_preview slicing in memory ingest by honor2030 · Pull Request #1620 · tinyhumansai/openhuman

honor2030 · 2026-05-13T10:40:12Z

Summary

- Avoid panics when body_preview starts inside a multi-byte UTF-8 character
- Centralize preview truncation in markdown_body_preview with a 2048-byte cap
- Add regression coverage for the UTF-8 boundary case

Fixes #1595.

Verification

- cargo test -p openhuman markdown_body_preview_respects_utf8_boundary_and_byte_cap --lib -- --nocapture
- cargo test -p openhuman ingest_document_handles_utf8_at_body_preview_boundary --lib -- --nocapture
- cargo test -p openhuman openhuman::memory::tree::ingest::tests --lib
- cargo check -p openhuman --lib
- cargo fmt --all --check
- pre-push hook: format:check, lint, compile, rust:check, lint:commands-tokens

Notes: cargo check / pre-push report existing warnings only; no new failures.

Summary by CodeRabbit

Bug Fixes
- Email and document previews now enforce a strict byte-length cap and never cut multibyte UTF-8 characters, preventing corrupted or malformed preview text.
Tests
- Added unit tests to verify preview truncation respects the byte cap and UTF-8 boundaries and that ingestion succeeds when previews would otherwise split multibyte characters.

coderabbitai · 2026-05-13T10:40:24Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1a432a25-fb40-4e0b-b1a4-50f5333528f5

📥 Commits

Reviewing files that changed from the base of the PR and between eca19a0 and c2de041.

📒 Files selected for processing (1)

src/openhuman/memory/tree/ingest.rs

🚧 Files skipped from review as they are similar to previous changes (1)

src/openhuman/memory/tree/ingest.rs

📝 Walkthrough

Adds a 2048-byte preview cap and markdown_body_preview that truncates by bytes while aligning to UTF-8 character boundaries, integrates it into persist for Email and Document sources, and adds tests validating boundary/cap behavior and ingestion success.

Changes

Memory Ingest UTF-8 Safety

Layer / File(s)	Summary
Body preview constant and safe truncation helper `src/openhuman/memory/tree/ingest.rs`	Adds `BODY_PREVIEW_MAX_BYTES` (2048) and `markdown_body_preview(md: &str)` which truncates markdown to the byte cap and adjusts the start to a valid UTF-8 character boundary (ceil-style).
Persist flow integration and tests `src/openhuman/memory/tree/ingest.rs`	`persist` now uses `markdown_body_preview` for `SourceKind::Email` and `SourceKind::Document` (keeps `Chat` as `None`). Unit tests verify the preview never exceeds the byte cap and that ingestion succeeds when a multi-byte UTF-8 char would otherwise be split.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

tinyhumansai/openhuman#1681: Similar changes to src/openhuman/memory/tree/ingest.rs replacing unsafe byte slicing with a UTF-8-boundary-safe truncation helper.

Poem

🐰 I nibble bytes with careful art,
I cut where UTF-8 won't part,
Two thousand forty-eight I keep,
No panic claws disturb your sleep,
A gentle preview, neat and smart.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: fixing UTF-8 body_preview slicing to prevent panics when slicing at multi-byte character boundaries.
Linked Issues check	✅ Passed	The PR fully addresses all objectives from issue `#1595`: eliminates panics via ceil_char_boundary, centralizes truncation logic, adds regression tests, and restores stability.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to fixing the UTF-8 boundary issue in body_preview truncation; no unrelated modifications detected.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

graycyrus

Review — Fix UTF-8 body_preview slicing in memory ingest

Walkthrough

This PR fixes a real panic: persist in the memory ingest pipeline sliced canonical.markdown at a fixed byte offset without checking whether that offset landed on a UTF-8 character boundary. For documents and emails whose canonical markdown exceeded 2048 bytes, any multi-byte character straddling the slice point (e.g. a zero-width non-joiner, a CJK codepoint, or an emoji) would cause a panic. The fix extracts the logic into a markdown_body_preview helper that uses str::ceil_char_boundary to advance past any partial character before slicing, which keeps the result within the byte cap. Two regression tests were added.
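The behaviour described in this walkthrough can be sketched roughly as follows. This is a minimal illustration, not the merged code: the function body is reconstructed from the review comments, and it uses a manual `is_char_boundary` loop in place of `str::ceil_char_boundary` so that it also compiles on toolchains that predate that method (an assumption made here for portability).

```rust
/// Byte cap for the preview; the value 2048 is taken from the PR description.
const BODY_PREVIEW_MAX_BYTES: usize = 2048;

/// Returns the trailing at-most `BODY_PREVIEW_MAX_BYTES` bytes of `md`,
/// aligned to a UTF-8 character boundary.
fn markdown_body_preview(md: &str) -> String {
    let len = md.len();
    if len <= BODY_PREVIEW_MAX_BYTES {
        return md.to_string();
    }
    // Earliest start index that keeps the tail within the cap.
    let mut start = len - BODY_PREVIEW_MAX_BYTES;
    // Assumed stand-in for `str::ceil_char_boundary`: move forward until the
    // index sits on a character boundary. `len` itself is always a boundary,
    // so the loop terminates.
    while !md.is_char_boundary(start) {
        start += 1;
    }
    md[start..].to_string()
}
```

Rounding the start index up can only shorten the tail, so the result never exceeds the cap; rounding down would include the whole straddling character and could exceed it by up to three bytes.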

Changes

File	Summary
`src/openhuman/memory/tree/ingest.rs`	Extracts `markdown_body_preview` helper; replaces unsafe byte-slice with `ceil_char_boundary`; adds `BODY_PREVIEW_MAX_BYTES` constant; adds 2 regression tests

Verified / looks good

- ceil_char_boundary (stable since Rust 1.93.0) is the correct choice for a start index — rounding up keeps the result ≤ 2048 bytes
- Unit test fixture correctly triggers the boundary condition (offset 18 inside 3-byte \u{200c})
- BODY_PREVIEW_MAX_BYTES is module-private, matching the helper's scope
- Email/Document match arms are the only callers; Chat correctly excluded
- No external API surface changed
- Pre-existing tracing::debug! on DocumentCanonicalized is untouched
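To make the "offset 18 inside 3-byte \u{200c}" condition concrete, here is a hedged sketch of a fixture with that property. The exact layout of the real test fixture is not shown in this thread, so the 17-byte ASCII prefix below is an assumption, and `boundary_fixture` and `checked_tail` are hypothetical names. The non-panicking `str::get` is used to show that byte 18 is not a valid slice start.

```rust
/// Assumed preview cap, matching the 2048 bytes quoted in the PR.
const CAP: usize = 2048;

/// Builds a string whose tail-preview start (`len - CAP`) is byte 18,
/// which falls in the middle of a 3-byte U+200C (zero-width non-joiner).
fn boundary_fixture() -> String {
    let mut s = "a".repeat(17); // bytes 0..17
    s.push('\u{200c}'); // bytes 17..20
    s.push_str(&"b".repeat(CAP - 2)); // bytes 20..(CAP + 18)
    s
}

/// Checked slice from `start`; returns `None` instead of panicking when
/// `start` is not on a character boundary.
fn checked_tail(s: &str, start: usize) -> Option<&str> {
    s.get(start..)
}
```

With this layout, `checked_tail(&s, 18)` yields `None`, while starting at the next boundary (byte 20) yields a tail of `CAP - 2` bytes, which is within the cap.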

graycyrus · 2026-05-13T13:59:37Z

+            !body.is_char_boundary(preview_start),
+            "test fixture must put the preview boundary inside a multi-byte character"
+        );
+


[major] Integration test bakes in a canonicalizer implementation detail

body.len() + 1 assumes document::canonicalise adds exactly 1 byte of overhead (a trailing \n). If the canonicalizer ever adjusts formatting, the + 1 silently becomes wrong and the assert! guard either passes for the wrong reason or breaks without making the regression obvious.

The unit test already proves the helper in isolation against the exact boundary condition. This integration test's job is just to show ingest_document survives — consider removing the byte-arithmetic precondition entirely:

// The unit test verifies exact boundary arithmetic.
// Here we just confirm the full pipeline doesn't panic on this input.

If a guard is desired, measure after canonicalization, not before.
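If a guard is kept, measuring after canonicalization could look roughly like the sketch below. `canonicalise` here is a hypothetical stand-in that appends a trailing newline (the real `document::canonicalise` is not shown in this thread and may behave differently), and `preview_start_splits_char` is a made-up helper name.

```rust
/// Assumed preview cap, matching the 2048 bytes quoted in the PR.
const BODY_PREVIEW_MAX_BYTES: usize = 2048;

/// Hypothetical stand-in for the project's canonicalizer.
/// Assumption: it normalizes trailing whitespace to a single newline.
fn canonicalise(body: &str) -> String {
    format!("{}\n", body.trim_end())
}

/// True when the tail-preview start index, computed on the canonical text,
/// falls inside a multi-byte character.
fn preview_start_splits_char(canonical: &str) -> bool {
    canonical.len() > BODY_PREVIEW_MAX_BYTES
        && !canonical.is_char_boundary(canonical.len() - BODY_PREVIEW_MAX_BYTES)
}
```

Because the guard inspects the canonical output directly, it no longer depends on how many bytes the canonicalizer adds; if the formatting changes, the precondition fails visibly instead of silently testing a different offset.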

graycyrus · 2026-05-13T13:59:37Z


+fn markdown_body_preview(md: &str) -> String {
+    let len = md.len();
+    if len <= BODY_PREVIEW_MAX_BYTES {


[minor] Consistency note — the codebase has crate::openhuman::util::floor_char_boundary used at ~9 other call sites. This PR uses the stdlib str::ceil_char_boundary directly, which is semantically correct (ceil for a start index, floor for an end index) and works on the pinned Rust 1.93.0 toolchain.

Not a bug, but for discoverability a one-line comment would help a future reader:

// ceil_char_boundary (stable since Rust 1.93) advances the index
// to the next char boundary, keeping the slice <= BODY_PREVIEW_MAX_BYTES bytes.
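The "ceil for a start index, floor for an end index" rule can be illustrated with a small sketch. The helper names below are hypothetical and written with `is_char_boundary` loops; they are not the project's `util::floor_char_boundary` or the stdlib methods, only assumed equivalents for illustration.

```rust
/// Largest boundary <= idx (assumed equivalent of a floor-style helper).
fn floor_boundary(s: &str, mut idx: usize) -> usize {
    idx = idx.min(s.len());
    while !s.is_char_boundary(idx) {
        idx -= 1; // index 0 is always a boundary, so this terminates
    }
    idx
}

/// Smallest boundary >= idx (assumed equivalent of a ceil-style helper).
fn ceil_boundary(s: &str, mut idx: usize) -> usize {
    idx = idx.min(s.len());
    while !s.is_char_boundary(idx) {
        idx += 1; // s.len() is always a boundary, so this terminates
    }
    idx
}

/// Leading at-most `max_bytes` bytes: the END index is rounded down.
fn head(s: &str, max_bytes: usize) -> &str {
    &s[..floor_boundary(s, max_bytes)]
}

/// Trailing at-most `max_bytes` bytes: the START index is rounded up.
fn tail(s: &str, max_bytes: usize) -> &str {
    &s[ceil_boundary(s, s.len().saturating_sub(max_bytes))..]
}
```

In both cases the rounding direction shrinks the slice, so the byte cap holds; swapping the directions would grow the slice past the cap whenever the raw index lands inside a character.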

graycyrus · 2026-05-13T13:59:37Z

@@ -361,6 +358,16 @@ async fn persist(
    })


[nitpick] Missing doc comment — every other function in this file (public and private) has at least a one-liner. Suggestion:

/// Returns the trailing at-most `BODY_PREVIEW_MAX_BYTES` bytes of `md`,
/// aligned to a UTF-8 character boundary.

This would also close the docstring coverage gap CodeRabbit flagged (66.67% -> 80%+).

senamakel

pf above

@graycyrus

Resolves merge conflict in src/openhuman/memory/tree/ingest.rs between PR's markdown_body_preview (ceil_char_boundary + BODY_PREVIEW_MAX_BYTES) and main's build_body_preview (floor_char_boundary).

Resolution:
- kept PR's markdown_body_preview as the single implementation
- deleted build_body_preview and its three tests
- added doc comment explaining ceil vs floor choice
- removed brittle body.len()+1 assertion from the integration test (addresses @graycyrus major review comment)
- applied cargo fmt to match project style

honor2030 requested a review from a team May 13, 2026 10:40

coderabbitai Bot previously approved these changes May 13, 2026

fix memory body preview utf8 boundary

eca19a0

honor2030 force-pushed the fix/body-preview-utf8-boundary branch from deee155 to eca19a0 Compare May 13, 2026 12:44

graycyrus reviewed May 13, 2026

senamakel requested changes May 13, 2026

senamakel self-assigned this May 14, 2026

senamakel dismissed coderabbitai[bot]’s stale review via c2de041 May 14, 2026 02:44

coderabbitai Bot approved these changes May 14, 2026

senamakel merged commit 4870bc2 into tinyhumansai:main May 14, 2026
24 checks passed

This was referenced May 14, 2026

fix(memory): add fallback model chain for unavailable GMI models #1704

Merged

fix(memory): guard against char-boundary panics in ingest persist path #2102

Open

senamakel merged 2 commits into tinyhumansai:main from honor2030:fix/body-preview-utf8-boundary
