fix(memory): use floor_char_boundary in body_preview slice by Sathvik-1007 · Pull Request #1681 · tinyhumansai/openhuman

Sathvik-1007 · 2026-05-13T20:18:27Z

Summary

Fix fatal panic in memory tree ingest when body_preview slice lands inside a multibyte UTF-8 codepoint.
Extract build_body_preview helper using existing floor_char_boundary util.
Add 3 unit tests covering short string, long ASCII, and multibyte-at-boundary cases.

Problem

src/openhuman/memory/tree/ingest.rs:172 slices canonical markdown with raw byte indexing: md[len - 2048..].
When len - 2048 lands inside a multibyte UTF-8 char (e.g. U+200C ZWNJ, 3 bytes), Rust panics with byte index N is not a char boundary.
Fatal panic kills the tokio blocking thread mid Gmail sync. Sentry: OPENHUMAN-TAURI-CP (4 occurrences).
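The failure mode can be reproduced without touching the ingest pipeline. The sketch below is illustrative only: `cut_would_panic` and `body_with_zwnj_at_cut` are hypothetical names, not functions from the repository, and the body layout is chosen by hand so that `len - 2048` falls on the third byte of a ZWNJ.

```rust
/// Reports whether slicing `md[cut..]` would panic because `cut` falls
/// inside a multibyte codepoint (or past the end of the string).
/// Hypothetical helper, not part of the openhuman codebase.
fn cut_would_panic(md: &str, cut: usize) -> bool {
    !md.is_char_boundary(cut)
}

/// Builds a made-up body whose `len - 2048` cut point lands inside a
/// U+200C ZWNJ, which UTF-8 encodes as the 3 bytes E2 80 8C.
fn body_with_zwnj_at_cut() -> String {
    // 10 ASCII bytes + 3-byte ZWNJ + 2047 ASCII bytes = 2060 bytes total,
    // so len - 2048 = 12, the last byte of the ZWNJ (it occupies bytes 10..13).
    format!("{}\u{200C}{}", "a".repeat(10), "b".repeat(2047))
}
```

Probing with `str::is_char_boundary` or `str::get` avoids the panic itself, which makes the condition easy to assert on in a unit test.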

Solution

Use crate::openhuman::util::floor_char_boundary(md, len - 2048) to round the cut-point down to the nearest char boundary before slicing.
Extracted into a build_body_preview helper for independent testability.
Same pattern already used in 10+ other callsites (web_fetch, inject, routing, etc.).
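A minimal sketch of the approach follows. The body of `crate::openhuman::util::floor_char_boundary` is not shown in this PR, so the `floor_char_boundary` below is an assumed stand-in with the behaviour the description implies (round an index down to the nearest char boundary); `PREVIEW_BYTES` is likewise a name introduced here for the literal 2048. `build_body_preview` mirrors the helper added in the diff.

```rust
/// Preview size in bytes (the PR uses the literal 2048; the constant name
/// is an assumption made for this sketch).
const PREVIEW_BYTES: usize = 2048;

/// Stand-in for the project's util: rounds `index` down to the nearest
/// UTF-8 char boundary in `s`. Assumed behaviour, not the real source.
fn floor_char_boundary(s: &str, index: usize) -> usize {
    if index >= s.len() {
        return s.len();
    }
    let mut i = index;
    // Index 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(i) {
        i -= 1;
    }
    i
}

/// Returns the whole input when it fits, otherwise the trailing
/// ~2048 bytes starting at a char boundary.
fn build_body_preview(md: &str) -> String {
    let len = md.len();
    if len <= PREVIEW_BYTES {
        return md.to_string();
    }
    let start = floor_char_boundary(md, len - PREVIEW_BYTES);
    md[start..].to_string()
}
```

Because the cut point is rounded down, the preview can be slightly longer than 2048 bytes when the boundary falls inside a multibyte character, and is exactly 2048 bytes for pure ASCII input.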

Submission Checklist

Tests added or updated (happy path + at least one failure / edge case)
Diff coverage >= 80% — 3 tests cover all new/changed lines
Coverage matrix updated — N/A: bugfix-only, no new feature rows
All affected feature IDs from the matrix are listed — N/A: bugfix
No new external network dependencies introduced
Manual smoke checklist updated — N/A: internal memory pipeline
Linked issue closed via Closes #NNN in the Related section

Impact

Desktop only (Tauri sidecar). No web/mobile/CLI impact.
Fixes fatal crash during Gmail sync for emails containing multibyte chars near the 2KB boundary.
No performance regression: floor_char_boundary scans back at most 3 bytes, so the added cost is O(1).
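The constant bound follows from UTF-8 itself: a codepoint occupies at most 4 bytes, so any byte offset is at most 3 bytes past the start of its codepoint. The sketch below checks this by brute force; `backward_step` and `max_backward_step` are hypothetical helpers written for illustration and do not exist in the repository.

```rust
/// Number of bytes a cut at `cut` must move backwards to reach a char
/// boundary. Expects `cut <= s.len()`. Hypothetical helper.
fn backward_step(s: &str, cut: usize) -> usize {
    // Walk cut, cut-1, ... and count how many offsets are skipped before
    // the first boundary; offset 0 is always a boundary.
    (0..=cut)
        .rev()
        .position(|i| s.is_char_boundary(i))
        .unwrap_or(0)
}

/// Largest backward step over every possible cut position in `s`.
fn max_backward_step(s: &str) -> usize {
    (0..=s.len())
        .map(|cut| backward_step(s, cut))
        .max()
        .unwrap_or(0)
}
```

For a 1-, 2-, 3-, and 4-byte character the largest step is 0, 1, 2, and 3 bytes respectively, which is the worst case the rounding ever pays.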

AI Authored PR Metadata (required for Codex/Linear PRs)

N/A (human PR)

Summary by CodeRabbit

Bug Fixes
- Preview generation now safely truncates text at UTF-8 character boundaries, preventing corruption for email and document previews.
Tests
- Added tests covering short inputs, exact-size truncation, and multibyte boundary cases to ensure stable preview behavior.

Panics with 'byte index N is not a char boundary' when multibyte UTF-8 (e.g. U+200C ZWNJ) straddles the len-2048 cut point during Gmail sync ingest. Extract build_body_preview helper using existing floor_char_boundary util to round down safely. Closes tinyhumansai#1654

coderabbitai · 2026-05-13T20:18:42Z

📝 Walkthrough

Walkthrough

This PR adds a UTF-8-safe helper, build_body_preview, that returns the full markdown when ≤2048 bytes or otherwise truncates to the trailing ~2048 bytes at a UTF-8 char boundary. persist now uses this helper for SourceKind::Email and SourceKind::Document. Unit tests cover short, ASCII-long, and multibyte-boundary cases.

Changes

UTF-8-safe body preview generation

Layer / File(s)	Summary
Body preview truncation with UTF-8 boundary safety `src/openhuman/memory/tree/ingest.rs`	Adds `build_body_preview(md: &str)` which returns the input when ≤2048 bytes or truncates to trailing ~2048 bytes using `floor_char_boundary`. Replaces the previous unsafe byte-slice in `persist` for `SourceKind::Email` and `SourceKind::Document`. Adds unit tests for short strings, long ASCII truncation, and a multibyte boundary regression case.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Panic in memory ingestion: byte index slicing on multi-byte UTF-8 char boundary #1595 — Similar UTF-8 multi-byte slicing panic fix: introduces a UTF-8-safe build_body_preview and uses it in persist, matching this PR's changes.

Suggested reviewers

senamakel

Poem

🐇
A cut at two kilobytes made the parser frown,
Zero-width mischief tried to split the town.
I floor the boundary, stitch the text anew,
Now previews skip the panic and stay true.
Hop, patch, and bytes — all safe for you!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the main change: using floor_char_boundary for safe UTF-8 slicing in body_preview.
Linked Issues check	✅ Passed	The PR implements all coding requirements from `#1654`: adds build_body_preview helper using floor_char_boundary, updates persist to compute body_preview for Email/Document sources, and includes comprehensive unit tests covering short strings, ASCII truncation, and the multibyte boundary regression case.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to fixing the panic in body_preview computation; no unrelated modifications are present in the changeset.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

src/openhuman/memory/tree/ingest.rs (1)

555-561: 💤 Low value

Consider tightening the assertion for pure ASCII.

For a pure ASCII string, every byte is a char boundary, so floor_char_boundary will return exactly len - 2048, producing a preview of exactly 2048 bytes. The current assertion allows up to 2051 bytes, which is appropriate for multibyte characters but overly permissive for this ASCII test case.

More precise assertion

     #[test]
     fn body_preview_long_ascii_truncates_to_trailing_bytes() {
         let long = "A".repeat(4096);
         let preview = super::build_body_preview(&long);
-        assert!(preview.len() >= 2048);
-        assert!(preview.len() <= 2048 + 3); // at most 3 extra bytes from boundary rounding
+        assert_eq!(preview.len(), 2048); // ASCII has no multibyte rounding
     }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/memory/tree/ingest.rs` around lines 555 - 561, The ASCII test
allows extra bytes unnecessarily; update the test
body_preview_long_ascii_truncates_to_trailing_bytes to assert that the preview
length equals exactly 2048 for a pure ASCII input by calling
super::build_body_preview(&long) and checking preview.len() == 2048 (instead of
the current range allowing up to 2051), referencing the test name and
build_body_preview to locate and change the assertion.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/memory/tree/ingest.rs`:
- Around line 175-180: The match arm assigning body_preview has formatting
differences causing CI failures; reformat the block around body_preview,
source_kind_for_store, and the match arms (SourceKind::Email |
SourceKind::Document => Some(build_body_preview(&canonical.markdown)), _ =>
None) to match rustfmt style and run `cargo fmt --all` (or apply rustfmt) so the
file src/openhuman/memory/tree/ingest.rs is formatted correctly and the CI
formatting check passes.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1e428b9f-16b2-43cf-a038-3117170e633d

📥 Commits

Reviewing files that changed from the base of the PR and between 2b64ea8 and 8f376ca.

📒 Files selected for processing (1)

src/openhuman/memory/tree/ingest.rs

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/memory/tree/ingest.rs`:
- Around line 153-161: The function build_body_preview currently truncates
UTF-8-safe previews but lacks diagnostics; add debug-level tracing/logging
inside build_body_preview to emit grep-friendly messages indicating whether we
took the pass-through path or truncation path (use stable prefixes like
"openhuman:body_preview:pass" and "openhuman:body_preview:trunc"), include
numeric metadata only (input byte length md.len(), cut offset start, and
resulting preview byte length) and avoid logging the preview content itself; use
the project's logging crate (log::debug or tracing::debug) so these messages
appear in development traces and are easy to grep.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f9e456f1-7277-4beb-86ea-496262ecdf4f

📥 Commits

Reviewing files that changed from the base of the PR and between 8f376ca and e0bf3ea.

📒 Files selected for processing (1)

src/openhuman/memory/tree/ingest.rs

coderabbitai · 2026-05-13T20:36:50Z

+/// Build a trailing body preview (last ~2048 bytes), safe for multibyte UTF-8.
+fn build_body_preview(md: &str) -> String {
+    let len = md.len();
+    if len <= 2048 {
+        return md.to_string();
+    }
+    let start = crate::openhuman::util::floor_char_boundary(md, len - 2048);
+    md[start..].to_string()
+}


🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add diagnostics for the new UTF-8 preview truncation path.

This helper adds new branching behavior but has no trace/debug logging. Add grep-friendly diagnostics (pass-through vs truncation, input bytes, cut offset, preview bytes) without logging content.

Proposed patch

 fn build_body_preview(md: &str) -> String {
     let len = md.len();
     if len <= 2048 {
+        tracing::trace!(
+            "[memory_tree::ingest] body_preview passthrough input_bytes={}",
+            len
+        );
         return md.to_string();
     }
     let start = crate::openhuman::util::floor_char_boundary(md, len - 2048);
-    md[start..].to_string()
+    let preview = md[start..].to_string();
+    tracing::trace!(
+        "[memory_tree::ingest] body_preview truncated input_bytes={} start={} preview_bytes={}",
+        len,
+        start,
+        preview.len()
+    );
+    preview
 }

As per coding guidelines, src/**/*.rs: “All new/changed behavior in Rust core must include verbose diagnostics logging with stable grep-friendly prefixes…” and “use log / tracing at debug or trace level for development-oriented diagnostics on new/changed flows.”


Sathvik-1007 requested a review from a team May 13, 2026 20:18

coderabbitai Bot requested changes May 13, 2026

style: rustfmt match arm + exact ASCII assert

e0bf3ea

coderabbitai Bot requested changes May 13, 2026

senamakel merged commit 022ce30 into tinyhumansai:main May 13, 2026
24 checks passed

This was referenced May 14, 2026

Fix UTF-8 body_preview slicing in memory ingest #1620

Merged

fix(memory): add fallback model chain for unavailable GMI models #1704

Merged

senamakel merged 2 commits into tinyhumansai:main from Sathvik-1007:fix/memory-tree-body-preview-panic
