Fix UTF-8 body_preview slicing in memory ingest #1620

Merged: senamakel merged 2 commits into tinyhumansai:main from honor2030:fix/body-preview-utf8-boundary on May 14, 2026

honor2030 (Contributor) commented May 13, 2026

Summary

  • avoid panics when body_preview starts inside a multi-byte UTF-8 character
  • centralize preview truncation in markdown_body_preview with a 2048-byte cap
  • add regression coverage for the UTF-8 boundary case

Fixes #1595.
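
The panic described above follows from Rust's rule that a `&str` range must begin and end on a UTF-8 character boundary. As a minimal sketch of that rule (the helper name `tail_from` is hypothetical and does not appear in the PR), the non-panicking `str::get` makes the failure visible as `None`:

```rust
/// Returns the suffix of `s` starting at byte `start`, or `None` when
/// `start` is out of range or falls inside a multi-byte UTF-8 character.
/// (`tail_from` is an illustrative name, not part of the PR.)
fn tail_from(s: &str, start: usize) -> Option<&str> {
    // `str::get` applies the same boundary check as `&s[start..]`,
    // but reports a violation as `None` instead of panicking.
    s.get(start..)
}
```

For `"ab\u{200c}cd"`, the zero-width non-joiner occupies bytes 2..5, so `tail_from(s, 3)` returns `None`, whereas the indexing form `&s[3..]` would panic at runtime.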

Verification

  • cargo test -p openhuman markdown_body_preview_respects_utf8_boundary_and_byte_cap --lib -- --nocapture
  • cargo test -p openhuman ingest_document_handles_utf8_at_body_preview_boundary --lib -- --nocapture
  • cargo test -p openhuman openhuman::memory::tree::ingest::tests --lib
  • cargo check -p openhuman --lib
  • cargo fmt --all --check
  • pre-push hook: format:check, lint, compile, rust:check, lint:commands-tokens

Notes: cargo check / pre-push report existing warnings only; no new failures.

Summary by CodeRabbit

  • Bug Fixes
    • Email and document previews now enforce a strict byte-length cap and never cut multibyte UTF-8 characters, preventing corrupted or malformed preview text.
  • Tests
    • Added unit tests to verify preview truncation respects the byte cap and UTF-8 boundaries and that ingestion succeeds when previews would otherwise split multibyte characters.

honor2030 requested a review from a team May 13, 2026 10:40

coderabbitai Bot (Contributor) commented May 13, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1a432a25-fb40-4e0b-b1a4-50f5333528f5

📥 Commits

Reviewing files that changed from the base of the PR and between eca19a0 and c2de041.

📒 Files selected for processing (1)
  • src/openhuman/memory/tree/ingest.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/openhuman/memory/tree/ingest.rs

📝 Walkthrough

Adds a 2048-byte preview cap and markdown_body_preview that truncates by bytes while aligning to UTF-8 character boundaries, integrates it into persist for Email and Document sources, and adds tests validating boundary/cap behavior and ingestion success.

Changes

Memory Ingest UTF-8 Safety

| Layer / File(s) | Summary |
| --- | --- |
| Body preview constant and safe truncation helper (src/openhuman/memory/tree/ingest.rs) | Adds BODY_PREVIEW_MAX_BYTES (2048) and markdown_body_preview(md: &str) which truncates markdown to the byte cap and adjusts the start to a valid UTF-8 character boundary (ceil-style). |
| Persist flow integration and tests (src/openhuman/memory/tree/ingest.rs) | persist now uses markdown_body_preview for SourceKind::Email and SourceKind::Document (keeps Chat as None). Unit tests verify the preview never exceeds the byte cap and that ingestion succeeds when a multi-byte UTF-8 char would otherwise be split. |
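
The helper summarized above could look roughly like the following. This is a sketch reconstructed from the walkthrough, not the merged code: it assumes the preview keeps the trailing bytes of the markdown, and it uses a hand-written loop over `str::is_char_boundary` (the hypothetical `ceil_boundary`) in place of the stdlib `str::ceil_char_boundary` so that it does not depend on a particular toolchain version.

```rust
/// Maximum preview size in bytes (value taken from the PR summary).
const BODY_PREVIEW_MAX_BYTES: usize = 2048;

/// Smallest char boundary >= `index`, clamped to `s.len()`.
/// (Hypothetical stand-in for `str::ceil_char_boundary`.)
fn ceil_boundary(s: &str, index: usize) -> usize {
    let mut i = index.min(s.len());
    // `s.len()` is always a boundary, so this loop terminates.
    while !s.is_char_boundary(i) {
        i += 1;
    }
    i
}

/// Returns at most the trailing `BODY_PREVIEW_MAX_BYTES` bytes of `md`,
/// starting on a UTF-8 character boundary.
fn markdown_body_preview(md: &str) -> String {
    let len = md.len();
    if len <= BODY_PREVIEW_MAX_BYTES {
        return md.to_string();
    }
    // Rounding the start index up can only shorten the slice,
    // so the result never exceeds the cap.
    let start = ceil_boundary(md, len - BODY_PREVIEW_MAX_BYTES);
    md[start..].to_string()
}
```

Inputs at or under the cap are returned unchanged; longer inputs lose at most three extra bytes when the naive start lands inside a multi-byte character.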

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • tinyhumansai/openhuman#1681: Similar changes to src/openhuman/memory/tree/ingest.rs replacing unsafe byte slicing with a UTF-8-boundary-safe truncation helper.

Poem

🐰 I nibble bytes with careful art,
I cut where UTF-8 won't part,
Two thousand forty-eight I keep,
No panic claws disturb your sleep,
A gentle preview, neat and smart.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: fixing UTF-8 body_preview slicing to prevent panics when slicing at multi-byte character boundaries. |
| Linked Issues check | ✅ Passed | The PR fully addresses all objectives from issue #1595: eliminates panics via ceil_char_boundary, centralizes truncation logic, adds regression tests, and restores stability. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to fixing the UTF-8 boundary issue in body_preview truncation; no unrelated modifications detected. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

coderabbitai Bot previously approved these changes May 13, 2026
honor2030 force-pushed the fix/body-preview-utf8-boundary branch from deee155 to eca19a0 May 13, 2026 12:44

graycyrus (Contributor) left a comment

Review — Fix UTF-8 body_preview slicing in memory ingest

Walkthrough

This PR fixes a real panic: persist in the memory ingest pipeline was slicing canonical.markdown at a fixed byte offset without checking whether it landed on a UTF-8 character boundary. For documents/emails whose canonical markdown exceeded 2048 bytes, any multi-byte character at the slice point (e.g. zero-width non-joiner, CJK codepoint, emoji) would panic. The fix extracts the logic into a markdown_body_preview helper using str::ceil_char_boundary to advance past any partial character before slicing, keeping the result within the byte cap. Two regression tests added.

Changes

| File | Summary |
| --- | --- |
| src/openhuman/memory/tree/ingest.rs | Extracts markdown_body_preview helper; replaces unsafe byte-slice with ceil_char_boundary; adds BODY_PREVIEW_MAX_BYTES constant; adds 2 regression tests |

Verified / looks good

  • ceil_char_boundary (stable since Rust 1.93.0) is the correct choice for a start index — rounding up keeps the result ≤ 2048 bytes
  • Unit test fixture correctly triggers the boundary condition (offset 18 inside 3-byte \u{200c})
  • BODY_PREVIEW_MAX_BYTES is module-private, matching the helper's scope
  • Email/Document match arms are the only callers; Chat correctly excluded
  • No external API surface changed
  • Pre-existing tracing::debug! on DocumentCanonicalized is untouched
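
To make the "offset 18 inside 3-byte \u{200c}" condition concrete, here is one way such a fixture could be built. The PR's actual fixture is not shown in this conversation, so the layout below (17 ASCII bytes, one U+200C, then padding) and the names `boundary_fixture` and `next_boundary` are assumptions for illustration only.

```rust
/// Preview cap in bytes (value taken from the PR summary).
const BODY_PREVIEW_MAX_BYTES: usize = 2048;

/// Builds a string whose naive preview start (`len - cap`) is byte 18,
/// which lies inside the 3-byte U+200C occupying bytes 17..20.
/// (Hypothetical fixture; the PR's real one may differ.)
fn boundary_fixture() -> String {
    let mut s = "a".repeat(17);
    s.push('\u{200c}');
    // Pad so that total length is 18 + BODY_PREVIEW_MAX_BYTES.
    s.push_str(&"b".repeat(BODY_PREVIEW_MAX_BYTES - 2));
    s
}

/// Smallest char boundary >= `index`, or `s.len()` if `index` is past the end.
fn next_boundary(s: &str, index: usize) -> usize {
    (index..=s.len())
        .find(|&i| s.is_char_boundary(i))
        .unwrap_or(s.len())
}
```

With this layout the string is 2066 bytes long, the naive start is 18, the next boundary is 20, and the resulting preview is 2046 bytes, which stays under the cap as the first bullet claims.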

!body.is_char_boundary(preview_start),
"test fixture must put the preview boundary inside a multi-byte character"
);

[major] Integration test bakes in a canonicalizer implementation detail

body.len() + 1 assumes document::canonicalise adds exactly 1 byte of overhead (a trailing \n). If the canonicalizer ever adjusts formatting, the + 1 silently becomes wrong and the assert! guard either passes for the wrong reason or breaks without making the regression obvious.

The unit test already proves the helper in isolation against the exact boundary condition. This integration test's job is just to show ingest_document survives — consider removing the byte-arithmetic precondition entirely:

// The unit test verifies exact boundary arithmetic.
// Here we just confirm the full pipeline doesn't panic on this input.

If a guard is desired, measure after canonicalization, not before.
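
A guard that measures after canonicalization might be sketched as below. `canonicalise_stub` is a hypothetical stand-in for `document::canonicalise`, whose real behavior is not shown here; the point is only that the precondition is computed from the canonical output, so it keeps holding if the canonicalizer's overhead changes.

```rust
/// Preview cap in bytes (value taken from the PR summary).
const BODY_PREVIEW_MAX_BYTES: usize = 2048;

/// Hypothetical stand-in for the real canonicalizer: trims trailing
/// whitespace and appends a single newline.
fn canonicalise_stub(body: &str) -> String {
    format!("{}\n", body.trim_end())
}

/// True when the naive tail-preview start (`len - cap`) of `markdown`
/// falls inside a multi-byte UTF-8 character.
fn splits_multibyte_char(markdown: &str) -> bool {
    markdown
        .len()
        .checked_sub(BODY_PREVIEW_MAX_BYTES)
        // Inputs shorter than the cap have no naive start to check.
        .is_some_and(|start| !markdown.is_char_boundary(start))
}
```

A test would then assert `splits_multibyte_char(&canonicalise_stub(&body))` rather than doing arithmetic on `body.len()`, which avoids hard-coding the one-byte overhead.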


fn markdown_body_preview(md: &str) -> String {
let len = md.len();
if len <= BODY_PREVIEW_MAX_BYTES {

[minor] Consistency note — the codebase has crate::openhuman::util::floor_char_boundary used at ~9 other call sites. This PR uses the stdlib str::ceil_char_boundary directly, which is semantically correct (ceil for a start index, floor for an end index) and works on the pinned Rust 1.93.0 toolchain.

Not a bug, but for discoverability a one-line comment would help a future reader:

// ceil_char_boundary (stable since Rust 1.93) advances the index
// to the next char boundary, keeping the slice <= BODY_PREVIEW_MAX_BYTES bytes.
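
The ceil-for-start, floor-for-end distinction can be illustrated with a pair of small helpers. These are sketches written against `str::is_char_boundary`; the names `floor_boundary`, `ceil_boundary`, `head`, and `tail` are hypothetical and are not the crate's `util::floor_char_boundary` or the stdlib methods.

```rust
/// Largest char boundary <= `index`, clamped to `s.len()`.
fn floor_boundary(s: &str, index: usize) -> usize {
    let mut i = index.min(s.len());
    // Index 0 is always a boundary, so this never underflows.
    while !s.is_char_boundary(i) {
        i -= 1;
    }
    i
}

/// Smallest char boundary >= `index`, clamped to `s.len()`.
fn ceil_boundary(s: &str, index: usize) -> usize {
    let mut i = index.min(s.len());
    // `s.len()` is always a boundary, so this terminates.
    while !s.is_char_boundary(i) {
        i += 1;
    }
    i
}

/// First at-most-`cap` bytes: the END index is rounded DOWN.
fn head(s: &str, cap: usize) -> &str {
    &s[..floor_boundary(s, cap)]
}

/// Last at-most-`cap` bytes: the START index is rounded UP.
fn tail(s: &str, cap: usize) -> &str {
    &s[ceil_boundary(s, s.len().saturating_sub(cap))..]
}
```

Rounding in the other direction would break the cap: for `"a€b"` with a cap of 2, flooring the tail's start index from 3 to 1 would yield `"€b"`, which is 4 bytes.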

@@ -361,6 +358,16 @@ async fn persist(
})

[nitpick] Missing doc comment — every other function in this file (public and private) has at least a one-liner. Suggestion:

/// Returns the trailing at-most `BODY_PREVIEW_MAX_BYTES` bytes of `md`,
/// aligned to a UTF-8 character boundary.

This would also close the docstring coverage gap CodeRabbit flagged (66.67% -> 80%+).

senamakel (Member) left a comment

pf above

senamakel self-assigned this May 14, 2026
Resolves merge conflict in src/openhuman/memory/tree/ingest.rs between
PR's markdown_body_preview (ceil_char_boundary + BODY_PREVIEW_MAX_BYTES)
and main's build_body_preview (floor_char_boundary).

Resolution: kept PR's markdown_body_preview as the single implementation;
deleted build_body_preview and its three tests; added doc comment explaining
ceil vs floor choice; removed brittle body.len()+1 assertion from the
integration test (addresses @graycyrus major review comment); applied
cargo fmt to match project style.
senamakel merged commit 4870bc2 into tinyhumansai:main May 14, 2026
24 checks passed

Development

Successfully merging this pull request may close these issues.

Panic in memory ingestion: byte index slicing on multi-byte UTF-8 char boundary

3 participants