fix(streaming): fix SSE buffer corruption and add CJK sentence splitting#1794
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughAdds CJK fullwidth terminator splitting (。!?) to split_sentences and a unit test; simplifies SSE parsing by draining complete lines from the buffer with a single ChangesStreaming & Segmentation Infrastructure
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Remove erroneous post-drain buffer slice in sse_bytes_to_chunks that corrupted multi-line SSE buffers (dead code path that would panic on multi-byte content boundaries). Add CJK fullwidth sentence terminators (。!?) to the presentation layer's split_sentences so Chinese responses get proper multi-bubble segmentation instead of always falling through to single-bubble delivery. Refs tinyhumansai#1675
49ef079 to
589e3c6
Compare
graycyrus
left a comment
There was a problem hiding this comment.
Walkthrough
Clean, well-scoped PR that fixes a real SSE buffer corruption bug in compatible_stream.rs and adds CJK sentence splitting in presentation.rs. The buffer fix is clearly correct — buffer.drain(..=pos) already mutates the buffer in place, so the old buffer = buffer[pos + 1..] line was double-advancing and corrupting multi-line SSE payloads. The CJK splitting follows the existing Latin-terminator pattern nicely. Only minor gaps below.
Change Summary
| File | Change type | Description |
|---|---|---|
src/openhuman/providers/compatible_stream.rs |
Bug fix | Remove redundant buffer slice after drain(..=pos) that corrupted multi-line SSE payloads |
src/openhuman/channels/providers/presentation.rs |
Enhancement | Add CJK fullwidth sentence terminators (。!?) as split points in split_sentences |
src/openhuman/channels/providers/presentation_tests.rs |
Test | Add happy-path test for CJK sentence splitting |
Per-file Analysis
compatible_stream.rs
The fix is correct and minimal. After buffer.drain(..=pos), bytes 0..=pos (including the \n) are removed from buffer and collected into line. The old second line buffer = buffer[pos + 1..].to_string() was indexing into the already-drained buffer — effectively skipping pos + 1 characters from the remaining data. On single-line SSE payloads this was harmless (buffer was empty after drain), but on multi-line payloads (common with CJK content) it caused data loss or panics at byte boundaries. Good catch.
presentation.rs
The CJK block mirrors the Latin terminator block structurally. The i + 1 < chars.len() guard means a CJK terminator at the very end of the string falls through to the post-loop cleanup rather than being split here — this is correct and consistent with how the Latin block's last-sentence handling works. The terminators chosen (U+3002, U+FF01, U+FF1F) are the standard CJK fullwidth equivalents.
presentation_tests.rs
The happy-path test is good and verifies all three CJK terminators in one pass. See inline comment for suggested edge cases.
Summary
sse_bytes_to_chunksthat corrupted multi-line SSE bufferssplit_sentencesso Chinese responses get proper multi-bubble segmentationProblem
compatible_stream.rs:56had a redundantbuffer = buffer[pos + 1..].to_string()afterbuffer.drain(..=pos)— the drain already removes the processed bytes, so the second slice corrupts the buffer on multi-line SSE payloads. For CJK content where SSE chunks may contain multiple lines, this causes data loss or panics at byte boundaries.split_sentencesinpresentation.rsonly recognized Latin sentence terminators (.!?+ space + uppercase). Chinese text using。!?was never split into sentences, always falling through to single-bubble delivery. This prevented proper segmentation of Chinese responses.Solution
buffer = buffer[pos + 1..].to_string()line —drain(..=pos)already leaves the buffer with only the unprocessed remainder.\u3002,\uFF01,\uFF1F) as split points insplit_sentences. Chinese text now segments into multiple bubbles like English text does.Submission Checklist
## Related— N/A: bug fix onlyCloses #NNNin the## RelatedsectionImpact
Related
Summary by CodeRabbit
New Features
Performance
Tests