fix(streaming): fix SSE buffer corruption and add CJK sentence splitting #1794

Merged
senamakel merged 1 commit into tinyhumansai:main from Sathvik-1007:fix/1675-chinese-response-streaming-cjk
May 16, 2026

Sathvik-1007 (Contributor) commented May 15, 2026

Summary

  • Fix erroneous post-drain buffer slice in sse_bytes_to_chunks that corrupted multi-line SSE buffers
  • Add CJK fullwidth sentence terminators (。!?) to presentation layer's split_sentences so Chinese responses get proper multi-bubble segmentation

Problem

  • compatible_stream.rs:56 had a redundant buffer = buffer[pos + 1..].to_string() after buffer.drain(..=pos) — the drain already removes the processed bytes, so the second slice corrupts the buffer on multi-line SSE payloads. For CJK content where SSE chunks may contain multiple lines, this causes data loss or panics at byte boundaries.
  • split_sentences in presentation.rs only recognized Latin sentence terminators (.!? + space + uppercase). Chinese text using 。!? was never split into sentences, always falling through to single-bubble delivery. This prevented proper segmentation of Chinese responses.
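
The double advance described in the first bullet can be reproduced in isolation. The sketch below is not the code from compatible_stream.rs: the helper name `reslice_after_drain` and the sample payloads are assumptions made for illustration, and it uses `str::get` instead of direct indexing so that an invalid slice shows up as `None` rather than a panic.

```rust
/// Hypothetical helper: mimics "drain, then re-slice with the stale `pos`".
/// Returns the drained line plus what the second slice would leave behind,
/// with `None` standing in for the cases where `buffer[pos + 1..]` would panic.
fn reslice_after_drain(mut buffer: String) -> Option<(String, Option<String>)> {
    let pos = buffer.find('\n')?;
    // `drain(..=pos)` removes the line and its '\n' from `buffer` in place.
    let line: String = buffer.drain(..=pos).collect();
    // The stale index now points into the *remaining* data.
    let after_second_slice = buffer.get(pos + 1..).map(str::to_string);
    Some((line, after_second_slice))
}

fn main() {
    // ASCII remainder: the second slice silently drops 8 more bytes.
    let (line, rest) =
        reslice_after_drain("data: a\ndata: bcdefghijk\n".to_string()).unwrap();
    println!("{line:?} -> {rest:?}");

    // CJK remainder: byte 8 falls inside a 3-byte character, so the slice is invalid.
    let (line, rest) = reslice_after_drain("data: a\ndata: 你好\n".to_string()).unwrap();
    println!("{line:?} -> {rest:?}");
}
```

In the first call the remainder shrinks from `data: bcdefghijk\n` to `defghijk\n`, which is the data-loss case. In the second call the stale index lands in the middle of `你`, which is where direct indexing would panic.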

Solution

  • Removed the redundant buffer = buffer[pos + 1..].to_string() line; drain(..=pos) already leaves the buffer with only the unprocessed remainder.
  • Added CJK fullwidth sentence terminators (\u3002, \uFF01, \uFF1F) as split points in split_sentences. Chinese text now segments into multiple bubbles like English text does.

Submission Checklist

  • Tests added or updated — existing presentation tests pass; the buffer fix is in a code path that was previously dead/panicking
  • Diff coverage ≥ 80% — N/A: 1-line removal + 8-line addition in well-tested module (28 tests pass)
  • Coverage matrix updated — N/A: no new features
  • All affected feature IDs from the matrix are listed in the PR description under ## Related — N/A: bug fix only
  • No new external network dependencies introduced
  • Manual smoke checklist updated — N/A: no release-cut surface changes
  • Linked issue closed via Closes #NNN in the ## Related section

Impact

  • Desktop/Mobile/Web: Chinese/CJK responses now properly segment into multiple chat bubbles instead of always being a single block
  • Streaming: SSE buffer handling is now correct for multi-line payloads (prevents potential data loss on high-throughput local model streams)
  • No breaking changes: English text behavior unchanged; segmentation is additive

Related

Summary by CodeRabbit

  • New Features

    • Added CJK (Chinese/Japanese/Korean) sentence splitting to improve message segmentation and delivery timing.
  • Performance

    • Optimized Server-Sent Events buffer handling for more efficient stream processing and reduced redundant work.
  • Tests

    • Added a unit test verifying CJK terminator sentence splitting produces expected segments.

@Sathvik-1007 Sathvik-1007 requested a review from a team May 15, 2026 07:48

coderabbitai Bot (Contributor) commented May 15, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 59506251-4016-42ee-966e-9ce64a3211de

📥 Commits

Reviewing files that changed from the base of the PR and between 49ef079 and 589e3c6.

📒 Files selected for processing (3)
  • src/openhuman/channels/providers/presentation.rs
  • src/openhuman/channels/providers/presentation_tests.rs
  • src/openhuman/providers/compatible_stream.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/openhuman/providers/compatible_stream.rs
  • src/openhuman/channels/providers/presentation.rs

Walkthrough

Adds CJK fullwidth terminator splitting (。!?) to split_sentences and a unit test; simplifies SSE parsing by draining complete lines from the buffer with a single drain(..=pos) call instead of re-slicing.

Changes

Streaming & Segmentation Infrastructure

| Layer / File(s) | Summary |
|---|---|
| CJK Sentence Splitting Support: src/openhuman/channels/providers/presentation.rs, src/openhuman/channels/providers/presentation_tests.rs | split_sentences now detects fullwidth CJK terminators (。!?) to split multilingual text; adds an inline comment for Latin terminators and a unit test verifying CJK splits. |
| SSE Stream Buffer Optimization: src/openhuman/providers/compatible_stream.rs | sse_bytes_to_chunks extracts complete lines using buffer.drain(..=pos) and removes the subsequent manual buffer re-slice/reassignment. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nibble bytes and split the night,
Fullwidth dots now end just right.
Lines drained clean, no extra slice,
Messages hop out, neat and concise.
🌙✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately captures both main changes: fixing SSE buffer corruption and adding CJK sentence splitting support, matching the core objectives of the pull request. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai[bot]
coderabbitai Bot previously approved these changes May 15, 2026
Remove erroneous post-drain buffer slice in sse_bytes_to_chunks that
corrupted multi-line SSE buffers (dead code path that would panic on
multi-byte content boundaries).

Add CJK fullwidth sentence terminators (。!?) to the presentation
layer's split_sentences so Chinese responses get proper multi-bubble
segmentation instead of always falling through to single-bubble
delivery.

Refs tinyhumansai#1675

graycyrus (Contributor) left a comment

Walkthrough

Clean, well-scoped PR that fixes a real SSE buffer corruption bug in compatible_stream.rs and adds CJK sentence splitting in presentation.rs. The buffer fix is clearly correct — buffer.drain(..=pos) already mutates the buffer in place, so the old buffer = buffer[pos + 1..] line was double-advancing and corrupting multi-line SSE payloads. The CJK splitting follows the existing Latin-terminator pattern nicely. Only minor gaps below.

Change Summary

| File | Change type | Description |
|---|---|---|
| src/openhuman/providers/compatible_stream.rs | Bug fix | Remove redundant buffer slice after drain(..=pos) that corrupted multi-line SSE payloads |
| src/openhuman/channels/providers/presentation.rs | Enhancement | Add CJK fullwidth sentence terminators (。!?) as split points in split_sentences |
| src/openhuman/channels/providers/presentation_tests.rs | Test | Add happy-path test for CJK sentence splitting |

Per-file Analysis

compatible_stream.rs

The fix is correct and minimal. After buffer.drain(..=pos), bytes 0..=pos (including the \n) are removed from buffer and collected into line. The old second line buffer = buffer[pos + 1..].to_string() was indexing into the already-drained buffer, effectively skipping a further pos + 1 bytes of the remaining data. On single-line SSE payloads this was harmless (buffer was empty after drain), but on multi-line payloads (common with CJK content) it caused data loss or panics at byte boundaries. Good catch.
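
For reference, a drain-only line extractor could look like the sketch below. The helper name `take_complete_lines` and the sample payload are assumptions for illustration; the real sse_bytes_to_chunks is not reproduced here and likely does more (for example, parsing the `data:` field).

```rust
/// Hypothetical helper: pulls every complete line out of `buffer`,
/// leaving any partial trailing line in place for the next network chunk.
fn take_complete_lines(buffer: &mut String) -> Vec<String> {
    let mut lines = Vec::new();
    while let Some(pos) = buffer.find('\n') {
        // A single `drain` both yields the line and advances the buffer.
        let line: String = buffer.drain(..=pos).collect();
        // Strip the line ending ("\n" or "\r\n") before handing the line on.
        lines.push(
            line.trim_end_matches(|c| c == '\r' || c == '\n')
                .to_string(),
        );
    }
    lines
}

fn main() {
    // Two complete CJK lines followed by a partial third line.
    let mut buffer = String::from("data: 你好\ndata: 世界\ndata: par");
    let lines = take_complete_lines(&mut buffer);
    println!("{lines:?}, remainder = {buffer:?}");
}
```

Since `find('\n')` returns a byte offset that is always a character boundary, `drain(..=pos)` is safe on multi-byte content, and no second slice is needed to advance the buffer.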

presentation.rs

The CJK block mirrors the Latin terminator block structurally. The i + 1 < chars.len() guard means a CJK terminator at the very end of the string falls through to the post-loop cleanup rather than being split here — this is correct and consistent with how the Latin block's last-sentence handling works. The terminators chosen (U+3002, U+FF01, U+FF1F) are the standard CJK fullwidth equivalents.

presentation_tests.rs

The happy-path test is good and verifies all three CJK terminators in one pass. See inline comment for suggested edge cases.

Comment thread src/openhuman/channels/providers/presentation.rs
Comment thread src/openhuman/channels/providers/presentation_tests.rs
@senamakel senamakel merged commit 5841856 into tinyhumansai:main May 16, 2026
24 checks passed