fix(streaming): fix SSE buffer corruption and add CJK sentence splitting by Sathvik-1007 · Pull Request #1794 · tinyhumansai/openhuman

Sathvik-1007 · 2026-05-15T07:48:45Z

Summary

Fix erroneous post-drain buffer slice in `sse_bytes_to_chunks` that corrupted multi-line SSE buffers
Add CJK fullwidth sentence terminators (。！？) to the presentation layer's `split_sentences` so Chinese responses get proper multi-bubble segmentation

Problem

`compatible_stream.rs:56` had a redundant `buffer = buffer[pos + 1..].to_string()` after `buffer.drain(..=pos)`. The drain already removes the processed bytes, so the second slice corrupts the buffer on multi-line SSE payloads. For CJK content, where SSE chunks may contain multiple lines, this causes data loss or panics at byte boundaries.
`split_sentences` in `presentation.rs` only recognized Latin sentence terminators (`.`, `!`, or `?` followed by a space and an uppercase letter). Chinese text using 。！？ was never split into sentences and always fell through to single-bubble delivery, which prevented proper segmentation of Chinese responses.

Solution

Removed the dead `buffer = buffer[pos + 1..].to_string()` line; `drain(..=pos)` already leaves the buffer with only the unprocessed remainder.
Added CJK fullwidth sentence terminators (`\u3002`, `\uFF01`, `\uFF1F`) as split points in `split_sentences`. Chinese text now segments into multiple bubbles like English text does.
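
To illustrate why `drain(..=pos)` needs no follow-up slice, here is a minimal sketch of line extraction from an SSE text buffer. The helper name `take_line` is hypothetical and the code is not taken from `sse_bytes_to_chunks`; it only assumes the buffer is a `String` and that lines end in `\n`, as the description above suggests.

```rust
/// Hypothetical helper, not the project's `sse_bytes_to_chunks`.
/// Removes one complete line (through its '\n') from the front of `buffer`
/// and returns it without the line ending.
fn take_line(buffer: &mut String) -> Option<String> {
    // Byte index of the first newline; `None` means no complete line yet.
    let pos = buffer.find('\n')?;
    // `drain(..=pos)` removes bytes 0..=pos in place. '\n' is a single byte
    // in UTF-8, so the range always ends on a character boundary, even when
    // the line contains multi-byte CJK text.
    let line: String = buffer.drain(..=pos).collect();
    // No further slicing is needed: `buffer` now holds only the remainder.
    Some(line.trim_end_matches(|c: char| c == '\r' || c == '\n').to_string())
}
```

Calling `take_line` in a loop until it returns `None` yields each complete line in order and leaves any trailing partial line in the buffer for the next chunk.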

Submission Checklist

Tests added or updated — existing presentation tests pass; the buffer fix is in a code path that was previously dead/panicking
Diff coverage ≥ 80% — N/A: 1-line removal + 8-line addition in well-tested module (28 tests pass)
Coverage matrix updated — N/A: no new features
All affected feature IDs from the matrix are listed in the PR description under ## Related — N/A: bug fix only
No new external network dependencies introduced
Manual smoke checklist updated — N/A: no release-cut surface changes
Linked issue closed via Closes #NNN in the ## Related section

Impact

Desktop/Mobile/Web: Chinese/CJK responses now properly segment into multiple chat bubbles instead of always being a single block
Streaming: SSE buffer handling is now correct for multi-line payloads (prevents potential data loss on high-throughput local model streams)
No breaking changes: English text behavior unchanged; segmentation is additive

Related

Refs #1675 (AI repeats every Chinese response twice when asked to answer in Chinese)
Note: The full duplication reported in #1675 may also involve model-level repetition (a known issue with some local Ollama models such as qwen2.5). These fixes address the infrastructure-level causes, SSE buffer corruption and missing CJK segmentation, that could contribute to or mask the issue.

Summary by CodeRabbit

New Features
- Added CJK (Chinese/Japanese/Korean) sentence splitting to improve message segmentation and delivery timing.
Performance
- Optimized Server-Sent Events buffer handling for more efficient stream processing and reduced redundant work.
Tests
- Added a unit test verifying CJK terminator sentence splitting produces expected segments.

coderabbitai · 2026-05-15T07:49:00Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info


📥 Commits

Reviewing files that changed from the base of the PR and between 49ef079 and 589e3c6.

📒 Files selected for processing (3)

src/openhuman/channels/providers/presentation.rs
src/openhuman/channels/providers/presentation_tests.rs
src/openhuman/providers/compatible_stream.rs

🚧 Files skipped from review as they are similar to previous changes (2)

src/openhuman/providers/compatible_stream.rs
src/openhuman/channels/providers/presentation.rs

📝 Walkthrough

Walkthrough

Adds CJK fullwidth terminator splitting (。！？) to `split_sentences` and a unit test; simplifies SSE parsing by draining complete lines from the buffer with a single `drain(..=pos)` call instead of re-slicing.

Changes

Streaming & Segmentation Infrastructure

| Layer / File(s) | Summary |
| --- | --- |
| **CJK Sentence Splitting Support** `src/openhuman/channels/providers/presentation.rs`, `src/openhuman/channels/providers/presentation_tests.rs` | `split_sentences` now detects fullwidth CJK terminators (。！？) to split multilingual text; adds an inline comment for Latin terminators and a unit test verifying CJK splits. |
| **SSE Stream Buffer Optimization** `src/openhuman/providers/compatible_stream.rs` | `sse_bytes_to_chunks` extracts complete lines using `buffer.drain(..=pos)` and removes the subsequent manual buffer re-slice/reassignment. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nibble bytes and split the night,
Fullwidth dots now end just right.
Lines drained clean, no extra slice,
Messages hop out, neat and concise.
🌙✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately captures both main changes: fixing SSE buffer corruption and adding CJK sentence splitting support, matching the core objectives of the pull request. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Remove erroneous post-drain buffer slice in sse_bytes_to_chunks that corrupted multi-line SSE buffers (dead code path that would panic on multi-byte content boundaries). Add CJK fullwidth sentence terminators (。！？) to the presentation layer's split_sentences so Chinese responses get proper multi-bubble segmentation instead of always falling through to single-bubble delivery. Refs tinyhumansai#1675

graycyrus

Walkthrough

Clean, well-scoped PR that fixes a real SSE buffer corruption bug in compatible_stream.rs and adds CJK sentence splitting in presentation.rs. The buffer fix is clearly correct — buffer.drain(..=pos) already mutates the buffer in place, so the old buffer = buffer[pos + 1..] line was double-advancing and corrupting multi-line SSE payloads. The CJK splitting follows the existing Latin-terminator pattern nicely. Only minor gaps below.

Change Summary

| File | Change type | Description |
| --- | --- | --- |
| `src/openhuman/providers/compatible_stream.rs` | Bug fix | Remove redundant buffer slice after `drain(..=pos)` that corrupted multi-line SSE payloads |
| `src/openhuman/channels/providers/presentation.rs` | Enhancement | Add CJK fullwidth sentence terminators (。！？) as split points in `split_sentences` |
| `src/openhuman/channels/providers/presentation_tests.rs` | Test | Add happy-path test for CJK sentence splitting |

Per-file Analysis

`compatible_stream.rs`

The fix is correct and minimal. After `buffer.drain(..=pos)`, bytes `0..=pos` (including the `\n`) are removed from `buffer` and collected into `line`. The old second line, `buffer = buffer[pos + 1..].to_string()`, was indexing into the already-drained buffer, effectively skipping `pos + 1` bytes of the remaining data. On single-line SSE payloads this was harmless (buffer was empty after drain), but on multi-line payloads (common with CJK content) it caused data loss or panics at byte boundaries. Good catch.
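
The double-advance can be reproduced in isolation. The sketch below is a reconstruction from the two statements quoted in this PR, not the original source; `drain_then_reslice` is a hypothetical name, and it uses `str::get` instead of direct indexing so that cases where the old slice would have panicked show up as `None`.

```rust
/// Hypothetical reconstruction of the removed pattern, for illustration only.
/// Returns the drained line and what the old re-slice would have left behind,
/// with `None` standing in for a slice that would have panicked.
fn drain_then_reslice(input: &str) -> Option<(String, Option<String>)> {
    let mut buffer = input.to_string();
    let pos = buffer.find('\n')?;
    // Step 1 (kept by the fix): remove the line from the front of the buffer.
    let line: String = buffer.drain(..=pos).collect();
    // Step 2 (removed by the fix): slice again with the stale `pos`.
    // `get` returns None when `pos + 1` is out of range or falls inside a
    // multi-byte character, where `buffer[pos + 1..]` would panic.
    let resliced = buffer.get(pos + 1..).map(str::to_string);
    Some((line, resliced))
}
```

With the input `"id: 1\ndata: hello\n"`, the remainder after the re-slice is `"hello\n"`, so the `data: ` field name is lost. With `"a\n你好\n"`, index 2 lands inside the three-byte encoding of 你, so the result is `None`.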

`presentation.rs`

The CJK block mirrors the Latin terminator block structurally. The `i + 1 < chars.len()` guard means a CJK terminator at the very end of the string falls through to the post-loop cleanup rather than being split here. This is correct and consistent with how the Latin block's last-sentence handling works. The terminators chosen (U+3002, U+FF01, U+FF1F) are the standard CJK fullwidth equivalents.
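
A rough sketch of the splitting logic described here may help when reading the diff. It is not the project's `split_sentences`: the function name `split_sentences_sketch` and the exact Latin rule (terminator, then a space, then an uppercase letter) are assumptions based on the PR description, and only the `i + 1 < chars.len()` guard and the three terminators come from the review itself.

```rust
/// Hypothetical sketch, not the project's `split_sentences`.
fn split_sentences_sketch(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut sentences = Vec::new();
    let mut current = String::new();

    for (i, &c) in chars.iter().enumerate() {
        current.push(c);

        // Latin rule (assumed): '.', '!' or '?' followed by a space and an
        // uppercase letter.
        let latin_split = matches!(c, '.' | '!' | '?')
            && i + 2 < chars.len()
            && chars[i + 1] == ' '
            && chars[i + 2].is_uppercase();

        // CJK rule: a fullwidth terminator splits on its own, but only when
        // more text follows; a trailing terminator is left to the cleanup
        // after the loop.
        let cjk_split = matches!(c, '\u{3002}' | '\u{FF01}' | '\u{FF1F}')
            && i + 1 < chars.len();

        if latin_split || cjk_split {
            let sentence = current.trim();
            if !sentence.is_empty() {
                sentences.push(sentence.to_string());
            }
            current.clear();
        }
    }

    // Post-loop cleanup: whatever remains is the last sentence.
    let tail = current.trim();
    if !tail.is_empty() {
        sentences.push(tail.to_string());
    }
    sentences
}
```

Under these assumptions, `"你好。今天天气怎么样？很好！"` yields three segments, with the final `很好！` produced by the post-loop cleanup rather than by the CJK branch.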

`presentation_tests.rs`

The happy-path test is good and verifies all three CJK terminators in one pass. See inline comment for suggested edge cases.

Sathvik-1007 requested a review from a team May 15, 2026 07:48

coderabbitai Bot previously approved these changes May 15, 2026


Sathvik-1007 dismissed coderabbitai[bot]’s stale review via 589e3c6 May 15, 2026 13:58

Sathvik-1007 force-pushed the fix/1675-chinese-response-streaming-cjk branch from 49ef079 to 589e3c6 Compare May 15, 2026 13:58

coderabbitai Bot approved these changes May 15, 2026


graycyrus reviewed May 15, 2026


Comment thread src/openhuman/channels/providers/presentation.rs

Comment thread src/openhuman/channels/providers/presentation_tests.rs

senamakel merged commit 5841856 into tinyhumansai:main May 16, 2026
24 checks passed
