Fix UTF-8 body_preview slicing in memory ingest#1620
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds a 2048-byte preview cap and ChangesMemory Ingest UTF-8 Safety
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
deee155 to
eca19a0
Compare
graycyrus
left a comment
There was a problem hiding this comment.
Review — Fix UTF-8 body_preview slicing in memory ingest
Walkthrough
This PR fixes a real panic: persist in the memory ingest pipeline was slicing canonical.markdown at a fixed byte offset without checking whether it landed on a UTF-8 character boundary. For documents/emails whose canonical markdown exceeded 2048 bytes, any multi-byte character at the slice point (e.g. zero-width non-joiner, CJK codepoint, emoji) would panic. The fix extracts the logic into a markdown_body_preview helper using str::ceil_char_boundary to advance past any partial character before slicing, keeping the result within the byte cap. Two regression tests added.
Changes
| File | Summary |
|---|---|
src/openhuman/memory/tree/ingest.rs |
Extracts markdown_body_preview helper; replaces unsafe byte-slice with ceil_char_boundary; adds BODY_PREVIEW_MAX_BYTES constant; adds 2 regression tests |
Verified / looks good
ceil_char_boundary(stable since Rust 1.93.0) is the correct choice for a start index — rounding up keeps the result ≤ 2048 bytes- Unit test fixture correctly triggers the boundary condition (offset 18 inside 3-byte
\u{200c}) BODY_PREVIEW_MAX_BYTESis module-private, matching the helper's scope- Email/Document match arms are the only callers; Chat correctly excluded
- No external API surface changed
- Pre-existing
tracing::debug!onDocumentCanonicalizedis untouched
| !body.is_char_boundary(preview_start), | ||
| "test fixture must put the preview boundary inside a multi-byte character" | ||
| ); | ||
|
|
There was a problem hiding this comment.
[major] Integration test bakes in a canonicalizer implementation detail
body.len() + 1 assumes document::canonicalise adds exactly 1 byte of overhead (a trailing \n). If the canonicalizer ever adjusts formatting, the + 1 silently becomes wrong and the assert! guard either passes for the wrong reason or breaks without making the regression obvious.
The unit test already proves the helper in isolation against the exact boundary condition. This integration test's job is just to show ingest_document survives — consider removing the byte-arithmetic precondition entirely:
// The unit test verifies exact boundary arithmetic.
// Here we just confirm the full pipeline doesn't panic on this input.If a guard is desired, measure after canonicalization, not before.
|
|
||
| fn markdown_body_preview(md: &str) -> String { | ||
| let len = md.len(); | ||
| if len <= BODY_PREVIEW_MAX_BYTES { |
There was a problem hiding this comment.
[minor] Consistency note — the codebase has crate::openhuman::util::floor_char_boundary used at ~9 other call sites. This PR uses the stdlib str::ceil_char_boundary directly, which is semantically correct (ceil for a start index, floor for an end index) and works on the pinned Rust 1.93.0 toolchain.
Not a bug, but for discoverability a one-line comment would help a future reader:
// ceil_char_boundary (stable since Rust 1.93) advances the index
// to the next char boundary, keeping the slice <= BODY_PREVIEW_MAX_BYTES bytes.| @@ -361,6 +358,16 @@ async fn persist( | |||
| }) | |||
There was a problem hiding this comment.
[nitpick] Missing doc comment — every other function in this file (public and private) has at least a one-liner. Suggestion:
/// Returns the trailing at-most `BODY_PREVIEW_MAX_BYTES` bytes of `md`,
/// aligned to a UTF-8 character boundary.
This would also close the docstring coverage gap CodeRabbit flagged (66.67% -> 80%+).
Resolves merge conflict in src/openhuman/memory/tree/ingest.rs between PR's markdown_body_preview (ceil_char_boundary + BODY_PREVIEW_MAX_BYTES) and main's build_body_preview (floor_char_boundary). Resolution: kept PR's markdown_body_preview as the single implementation; deleted build_body_preview and its three tests; added doc comment explaining ceil vs floor choice; removed brittle body.len()+1 assertion from the integration test (addresses @graycyrus major review comment); applied cargo fmt to match project style.
Summary
body_previewstarts inside a multi-byte UTF-8 charactermarkdown_body_previewwith a 2048-byte capFixes #1595.
Verification
cargo test -p openhuman markdown_body_preview_respects_utf8_boundary_and_byte_cap --lib -- --nocapturecargo test -p openhuman ingest_document_handles_utf8_at_body_preview_boundary --lib -- --nocapturecargo test -p openhuman openhuman::memory::tree::ingest::tests --libcargo check -p openhuman --libcargo fmt --all --checkformat:check,lint,compile,rust:check,lint:commands-tokensNotes:
cargo check/ pre-push report existing warnings only; no new failures.Summary by CodeRabbit