fix(memory): use floor_char_boundary in body_preview slice#1681
Conversation
Panics with 'byte index N is not a char boundary' when multibyte UTF-8 (e.g. U+200C ZWNJ) straddles the len-2048 cut point during Gmail sync ingest. Extract build_body_preview helper using existing floor_char_boundary util to round down safely. Closes tinyhumansai#1654
📝 WalkthroughWalkthroughThis PR adds a UTF-8-safe helper, ChangesUTF-8-safe body preview generation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related issues
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/openhuman/memory/tree/ingest.rs (1)
555-561: 💤 Low valueConsider tightening the assertion for pure ASCII.
For a pure ASCII string, every byte is a char boundary, so
floor_char_boundarywill return exactlylen - 2048, producing a preview of exactly 2048 bytes. The current assertion allows up to 2051 bytes, which is appropriate for multibyte characters but overly permissive for this ASCII test case.More precise assertion
#[test] fn body_preview_long_ascii_truncates_to_trailing_bytes() { let long = "A".repeat(4096); let preview = super::build_body_preview(&long); - assert!(preview.len() >= 2048); - assert!(preview.len() <= 2048 + 3); // at most 3 extra bytes from boundary rounding + assert_eq!(preview.len(), 2048); // ASCII has no multibyte rounding }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/memory/tree/ingest.rs` around lines 555 - 561, The ASCII test allows extra bytes unnecessarily; update the test body_preview_long_ascii_truncates_to_trailing_bytes to assert that the preview length equals exactly 2048 for a pure ASCII input by calling super::build_body_preview(&long) and checking preview.len() == 2048 (instead of the current range allowing up to 2051), referencing the test name and build_body_preview to locate and change the assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/memory/tree/ingest.rs`:
- Around line 175-180: The match arm assigning body_preview has formatting
differences causing CI failures; reformat the block around body_preview,
source_kind_for_store, and the match arms (SourceKind::Email |
SourceKind::Document => Some(build_body_preview(&canonical.markdown)), _ =>
None) to match rustfmt style and run `cargo fmt --all` (or apply rustfmt) so the
file src/openhuman/memory/tree/ingest.rs is formatted correctly and the CI
formatting check passes.
---
Nitpick comments:
In `@src/openhuman/memory/tree/ingest.rs`:
- Around line 555-561: The ASCII test allows extra bytes unnecessarily; update
the test body_preview_long_ascii_truncates_to_trailing_bytes to assert that the
preview length equals exactly 2048 for a pure ASCII input by calling
super::build_body_preview(&long) and checking preview.len() == 2048 (instead of
the current range allowing up to 2051), referencing the test name and
build_body_preview to locate and change the assertion.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1e428b9f-16b2-43cf-a038-3117170e633d
📒 Files selected for processing (1)
src/openhuman/memory/tree/ingest.rs
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/memory/tree/ingest.rs`:
- Around line 153-161: The function build_body_preview currently truncates
UTF-8-safe previews but lacks diagnostics; add debug-level tracing/logging
inside build_body_preview to emit grep-friendly messages indicating whether we
took the pass-through path or truncation path (use stable prefixes like
"openhuman:body_preview:pass" and "openhuman:body_preview:trunc"), include
numeric metadata only (input byte length md.len(), cut offset start, and
resulting preview byte length) and avoid logging the preview content itself; use
the project's logging crate (log::debug or tracing::debug) so these messages
appear in development traces and are easy to grep.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f9e456f1-7277-4beb-86ea-496262ecdf4f
📒 Files selected for processing (1)
src/openhuman/memory/tree/ingest.rs
| /// Build a trailing body preview (last ~2048 bytes), safe for multibyte UTF-8. | ||
| fn build_body_preview(md: &str) -> String { | ||
| let len = md.len(); | ||
| if len <= 2048 { | ||
| return md.to_string(); | ||
| } | ||
| let start = crate::openhuman::util::floor_char_boundary(md, len - 2048); | ||
| md[start..].to_string() | ||
| } |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Add diagnostics for the new UTF-8 preview truncation path.
This helper adds new branching behavior but has no trace/debug logging. Add grep-friendly diagnostics (pass-through vs truncation, input bytes, cut offset, preview bytes) without logging content.
Proposed patch
fn build_body_preview(md: &str) -> String {
let len = md.len();
if len <= 2048 {
+ tracing::trace!(
+ "[memory_tree::ingest] body_preview passthrough input_bytes={}",
+ len
+ );
return md.to_string();
}
let start = crate::openhuman::util::floor_char_boundary(md, len - 2048);
- md[start..].to_string()
+ let preview = md[start..].to_string();
+ tracing::trace!(
+ "[memory_tree::ingest] body_preview truncated input_bytes={} start={} preview_bytes={}",
+ len,
+ start,
+ preview.len()
+ );
+ preview
}As per coding guidelines, src/**/*.rs: “All new/changed behavior in Rust core must include verbose diagnostics logging with stable grep-friendly prefixes…” and “use log / tracing at debug or trace level for development-oriented diagnostics on new/changed flows.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/openhuman/memory/tree/ingest.rs` around lines 153 - 161, The function
build_body_preview currently truncates UTF-8-safe previews but lacks
diagnostics; add debug-level tracing/logging inside build_body_preview to emit
grep-friendly messages indicating whether we took the pass-through path or
truncation path (use stable prefixes like "openhuman:body_preview:pass" and
"openhuman:body_preview:trunc"), include numeric metadata only (input byte
length md.len(), cut offset start, and resulting preview byte length) and avoid
logging the preview content itself; use the project's logging crate (log::debug
or tracing::debug) so these messages appear in development traces and are easy
to grep.
Summary
body_previewslice lands inside a multibyte UTF-8 codepoint.build_body_previewhelper using existingfloor_char_boundaryutil.Problem
src/openhuman/memory/tree/ingest.rs:172slices canonical markdown with raw byte indexing:md[len - 2048..].len - 2048lands inside a multibyte UTF-8 char (e.g. U+200C ZWNJ, 3 bytes), Rust panics withbyte index N is not a char boundary.Solution
crate::openhuman::util::floor_char_boundary(md, len - 2048)to round the cut-point down to the nearest char boundary before slicing.build_body_previewhelper for independent testability.Submission Checklist
Closes #NNNin the Related sectionImpact
floor_char_boundaryis O(1) to O(3) byte scan.Related
AI Authored PR Metadata (required for Codex/Linear PRs)
Summary by CodeRabbit
Bug Fixes
Tests