
fix(memory): use floor_char_boundary in body_preview slice #1681

Merged
senamakel merged 2 commits into tinyhumansai:main from Sathvik-1007:fix/memory-tree-body-preview-panic
May 13, 2026

Conversation


@Sathvik-1007 Sathvik-1007 commented May 13, 2026

Summary

  • Fix fatal panic in memory tree ingest when body_preview slice lands inside a multibyte UTF-8 codepoint.
  • Extract build_body_preview helper using existing floor_char_boundary util.
  • Add 3 unit tests covering short string, long ASCII, and multibyte-at-boundary cases.

Problem

  • src/openhuman/memory/tree/ingest.rs:172 slices canonical markdown with raw byte indexing: md[len - 2048..].
  • When len - 2048 lands inside a multibyte UTF-8 char (e.g. U+200C ZWNJ, 3 bytes), Rust panics with "byte index N is not a char boundary".
  • The panic is fatal: it kills the tokio blocking thread in the middle of a Gmail sync. Sentry: OPENHUMAN-TAURI-CP (4 occurrences).
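
The failure mode above can be reproduced outside the ingest code. The following is a minimal sketch, not the project's code: `try_tail` is a hypothetical name, and it uses the standard `str::get`, which returns `None` in exactly the case where the indexing form `&s[start..]` would panic.

```rust
/// Hypothetical helper (not in the PR): returns the trailing `n` bytes of `s`,
/// or `None` when `n` exceeds the length or the cut point falls inside a
/// multibyte UTF-8 sequence. Indexing with `&s[s.len() - n..]` would panic
/// in that second case with "byte index N is not a char boundary".
fn try_tail(s: &str, n: usize) -> Option<&str> {
    let start = s.len().checked_sub(n)?;
    // `str::get` checks both bounds and char boundaries without panicking.
    s.get(start..)
}
```

With the input `"ab\u{200C}cd"` (7 bytes, the ZWNJ occupying bytes 2 to 4), a cut at byte 3 or 4 yields `None`, while a cut at byte 2 or 5 succeeds.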

Solution

  • Use crate::openhuman::util::floor_char_boundary(md, len - 2048) to round the cut-point down to the nearest char boundary before slicing.
  • Extracted into a build_body_preview helper for independent testability.
  • Same pattern already used in 10+ other callsites (web_fetch, inject, routing, etc.).
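
The approach in the list above can be sketched in a self-contained form. The body of the project's `crate::openhuman::util::floor_char_boundary` is not shown in this PR, so the version below is an assumed stand-in built on the standard `str::is_char_boundary`; the `PREVIEW_BYTES` constant is also a name introduced here for illustration.

```rust
/// Assumed stand-in for the project's util (its real body is not shown in
/// this PR): the largest char boundary that is <= `index`.
fn floor_char_boundary(s: &str, index: usize) -> usize {
    if index >= s.len() {
        return s.len();
    }
    let mut i = index;
    // Index 0 is always a boundary, so the loop terminates; a UTF-8
    // sequence is at most 4 bytes, so it runs at most 3 times.
    while !s.is_char_boundary(i) {
        i -= 1;
    }
    i
}

/// Illustrative constant; the PR hard-codes 2048.
const PREVIEW_BYTES: usize = 2048;

/// Trailing preview of roughly `PREVIEW_BYTES` bytes, cut on a char boundary.
fn build_body_preview(md: &str) -> String {
    let len = md.len();
    if len <= PREVIEW_BYTES {
        return md.to_string();
    }
    // Rounding the start down can only lengthen the preview, by up to 3 bytes.
    let start = floor_char_boundary(md, len - PREVIEW_BYTES);
    md[start..].to_string()
}
```

Rounding down rather than up keeps the character that straddles the cut point, at the cost of a preview that may be a few bytes longer than 2048.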

Submission Checklist

  • Tests added or updated (happy path + at least one failure / edge case)
  • Diff coverage >= 80% — 3 tests cover all new/changed lines
  • Coverage matrix updated — N/A: bugfix-only, no new feature rows
  • All affected feature IDs from the matrix are listed — N/A: bugfix
  • No new external network dependencies introduced
  • Manual smoke checklist updated — N/A: internal memory pipeline
  • Linked issue closed via Closes #NNN in the Related section

Impact

  • Desktop only (Tauri sidecar). No web/mobile/CLI impact.
  • Fixes fatal crash during Gmail sync for emails containing multibyte chars near the 2KB boundary.
  • No performance regression: floor_char_boundary steps back at most 3 bytes (a UTF-8 sequence is at most 4 bytes long), so the added cost is O(1).
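
The constant bound on the backward scan can be checked directly. This is a small sketch with hypothetical helper names (`back_steps`, `max_back_steps`) that counts how far a cut has to move to reach a boundary; it does not use the project's util.

```rust
/// Hypothetical helper: number of bytes a cut at `index` must move back
/// to land on a char boundary of `s`.
fn back_steps(s: &str, index: usize) -> usize {
    let mut i = index.min(s.len());
    let mut steps = 0;
    while !s.is_char_boundary(i) {
        i -= 1;
        steps += 1;
    }
    steps
}

/// Worst case over every possible cut position in `s`.
fn max_back_steps(s: &str) -> usize {
    (0..=s.len()).map(|i| back_steps(s, i)).max().unwrap_or(0)
}
```

For a string mixing 1-, 2-, 3-, and 4-byte characters the worst case is 3 steps, reached when the cut lands on the last byte of a 4-byte character; for pure ASCII it is 0.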

Related


AI Authored PR Metadata (required for Codex/Linear PRs)

  • N/A (human PR)

Summary by CodeRabbit

  • Bug Fixes

    • Preview generation now safely truncates text at UTF-8 character boundaries, preventing corruption for email and document previews.
  • Tests

    • Added tests covering short inputs, exact-size truncation, and multibyte boundary cases to ensure stable preview behavior.

Review Change Stack

Panics with 'byte index N is not a char boundary' when multibyte
UTF-8 (e.g. U+200C ZWNJ) straddles the len-2048 cut point during
Gmail sync ingest. Extract build_body_preview helper using existing
floor_char_boundary util to round down safely.

Closes tinyhumansai#1654
@Sathvik-1007 Sathvik-1007 requested a review from a team May 13, 2026 20:18

coderabbitai Bot commented May 13, 2026

Walkthrough

This PR adds a UTF-8-safe helper, build_body_preview, that returns the full markdown when ≤2048 bytes or otherwise truncates to the trailing ~2048 bytes at a UTF-8 char boundary. persist now uses this helper for SourceKind::Email and SourceKind::Document. Unit tests cover short, ASCII-long, and multibyte-boundary cases.
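
How `persist` might select a preview per source kind can be sketched as follows. Only `SourceKind::Email` and `SourceKind::Document` appear in this PR; the `Chat` variant, the `tail_preview` helper, and the `body_preview_for` function are assumptions made for illustration, and the real `persist` code is not shown here.

```rust
/// Illustrative enum: `Email` and `Document` are named in the PR,
/// `Chat` is a hypothetical placeholder for the remaining kinds.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SourceKind {
    Email,
    Document,
    Chat,
}

/// Hypothetical compact form of the preview cut: take the last `max_bytes`
/// bytes, moving the start down to the nearest char boundary.
fn tail_preview(md: &str, max_bytes: usize) -> String {
    let cut = md.len().saturating_sub(max_bytes);
    let start = (0..=cut)
        .rev()
        .find(|&i| md.is_char_boundary(i))
        .unwrap_or(0);
    md[start..].to_string()
}

/// Only email and document sources carry a body preview.
fn body_preview_for(kind: SourceKind, markdown: &str) -> Option<String> {
    match kind {
        SourceKind::Email | SourceKind::Document => Some(tail_preview(markdown, 2048)),
        _ => None,
    }
}
```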

Changes

UTF-8-safe body preview generation

Layer / File(s) | Summary
Body preview truncation with UTF-8 boundary safety (src/openhuman/memory/tree/ingest.rs) | Adds build_body_preview(md: &str) which returns the input when ≤2048 bytes or truncates to trailing ~2048 bytes using floor_char_boundary. Replaces the previous unsafe byte-slice in persist for SourceKind::Email and SourceKind::Document. Adds unit tests for short strings, long ASCII truncation, and a multibyte boundary regression case.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Suggested reviewers

  • senamakel

Poem

🐇
A cut at two kilobytes made the parser frown,
Zero-width mischief tried to split the town.
I floor the boundary, stitch the text anew,
Now previews skip the panic and stay true.
Hop, patch, and bytes — all safe for you!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title clearly and specifically describes the main change: using floor_char_boundary for safe UTF-8 slicing in body_preview.
Linked Issues check | ✅ Passed | The PR implements all coding requirements from #1654: adds build_body_preview helper using floor_char_boundary, updates persist to compute body_preview for Email/Document sources, and includes comprehensive unit tests covering short strings, ASCII truncation, and the multibyte boundary regression case.
Out of Scope Changes check | ✅ Passed | All changes are directly scoped to fixing the panic in body_preview computation; no unrelated modifications are present in the changeset.
Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
src/openhuman/memory/tree/ingest.rs (1)

555-561: 💤 Low value

Consider tightening the assertion for pure ASCII.

For a pure ASCII string, every byte is a char boundary, so floor_char_boundary will return exactly len - 2048, producing a preview of exactly 2048 bytes. The current assertion allows up to 2051 bytes, which is appropriate for multibyte characters but overly permissive for this ASCII test case.

More precise assertion
     #[test]
     fn body_preview_long_ascii_truncates_to_trailing_bytes() {
         let long = "A".repeat(4096);
         let preview = super::build_body_preview(&long);
-        assert!(preview.len() >= 2048);
-        assert!(preview.len() <= 2048 + 3); // at most 3 extra bytes from boundary rounding
+        assert_eq!(preview.len(), 2048); // ASCII has no multibyte rounding
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/memory/tree/ingest.rs` around lines 555 - 561, The ASCII test
allows extra bytes unnecessarily; update the test
body_preview_long_ascii_truncates_to_trailing_bytes to assert that the preview
length equals exactly 2048 for a pure ASCII input by calling
super::build_body_preview(&long) and checking preview.len() == 2048 (instead of
the current range allowing up to 2051), referencing the test name and
build_body_preview to locate and change the assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/memory/tree/ingest.rs`:
- Around line 175-180: The match arm assigning body_preview has formatting
differences causing CI failures; reformat the block around body_preview,
source_kind_for_store, and the match arms (SourceKind::Email |
SourceKind::Document => Some(build_body_preview(&canonical.markdown)), _ =>
None) to match rustfmt style and run `cargo fmt --all` (or apply rustfmt) so the
file src/openhuman/memory/tree/ingest.rs is formatted correctly and the CI
formatting check passes.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1e428b9f-16b2-43cf-a038-3117170e633d

📥 Commits

Reviewing files that changed from the base of the PR and between 2b64ea8 and 8f376ca.

📒 Files selected for processing (1)
  • src/openhuman/memory/tree/ingest.rs

Comment thread src/openhuman/memory/tree/ingest.rs

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/memory/tree/ingest.rs`:
- Around line 153-161: The function build_body_preview currently truncates
UTF-8-safe previews but lacks diagnostics; add debug-level tracing/logging
inside build_body_preview to emit grep-friendly messages indicating whether we
took the pass-through path or truncation path (use stable prefixes like
"openhuman:body_preview:pass" and "openhuman:body_preview:trunc"), include
numeric metadata only (input byte length md.len(), cut offset start, and
resulting preview byte length) and avoid logging the preview content itself; use
the project's logging crate (log::debug or tracing::debug) so these messages
appear in development traces and are easy to grep.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f9e456f1-7277-4beb-86ea-496262ecdf4f

📥 Commits

Reviewing files that changed from the base of the PR and between 8f376ca and e0bf3ea.

📒 Files selected for processing (1)
  • src/openhuman/memory/tree/ingest.rs

Comment on lines +153 to +161
/// Build a trailing body preview (last ~2048 bytes), safe for multibyte UTF-8.
fn build_body_preview(md: &str) -> String {
    let len = md.len();
    if len <= 2048 {
        return md.to_string();
    }
    let start = crate::openhuman::util::floor_char_boundary(md, len - 2048);
    md[start..].to_string()
}

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add diagnostics for the new UTF-8 preview truncation path.

This helper adds new branching behavior but has no trace/debug logging. Add grep-friendly diagnostics (pass-through vs truncation, input bytes, cut offset, preview bytes) without logging content.

Proposed patch
 fn build_body_preview(md: &str) -> String {
     let len = md.len();
     if len <= 2048 {
+        tracing::trace!(
+            "[memory_tree::ingest] body_preview passthrough input_bytes={}",
+            len
+        );
         return md.to_string();
     }
     let start = crate::openhuman::util::floor_char_boundary(md, len - 2048);
-    md[start..].to_string()
+    let preview = md[start..].to_string();
+    tracing::trace!(
+        "[memory_tree::ingest] body_preview truncated input_bytes={} start={} preview_bytes={}",
+        len,
+        start,
+        preview.len()
+    );
+    preview
 }

As per coding guidelines, src/**/*.rs: “All new/changed behavior in Rust core must include verbose diagnostics logging with stable grep-friendly prefixes…” and “use log / tracing at debug or trace level for development-oriented diagnostics on new/changed flows.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/memory/tree/ingest.rs` around lines 153 - 161, The function
build_body_preview currently truncates UTF-8-safe previews but lacks
diagnostics; add debug-level tracing/logging inside build_body_preview to emit
grep-friendly messages indicating whether we took the pass-through path or
truncation path (use stable prefixes like "openhuman:body_preview:pass" and
"openhuman:body_preview:trunc"), include numeric metadata only (input byte
length md.len(), cut offset start, and resulting preview byte length) and avoid
logging the preview content itself; use the project's logging crate (log::debug
or tracing::debug) so these messages appear in development traces and are easy
to grep.

@senamakel senamakel merged commit 022ce30 into tinyhumansai:main May 13, 2026
24 checks passed

Development

Successfully merging this pull request may close these issues.

Memory tree ingest panics on multibyte char at body_preview cut point

2 participants