UnicodeDecodeError in stream.posts() on multi-byte UTF-8 chunks at TCP boundaries #3

@appvoraFR

Description

When using xdk.Client.stream.posts(...) for the Filtered Stream, the call crashes with UnicodeDecodeError whenever a TCP chunk boundary cuts a multi-byte UTF-8 character (emoji, accents, CJK) in the middle.

This is a classic streaming UTF-8 issue: chunk.decode('utf-8') is called per chunk without using an IncrementalDecoder, so any byte sequence split across two chunks raises the error.
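The failure mode can be shown without any network code. The sketch below is not taken from xdk; it only assumes, as described above, that each chunk is decoded on its own. The chunk boundary is chosen by hand so that it falls inside a two-byte character, and the helper names are illustrative.

```python
import codecs

data = "café ☕".encode("utf-8")   # 'é' is 2 bytes (0xc3 0xa9), '☕' is 3 bytes
chunks = [data[:4], data[4:]]     # boundary falls between 0xc3 and 0xa9


def decode_per_chunk(chunks):
    """Decode every chunk independently (the pattern that crashes)."""
    return "".join(c.decode("utf-8") for c in chunks)


def decode_incrementally(chunks):
    """Decode with one stateful decoder that carries partial sequences over."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    text = "".join(decoder.decode(c) for c in chunks)
    # final=True raises if the stream ends in the middle of a character.
    return text + decoder.decode(b"", final=True)


try:
    decode_per_chunk(chunks)
    per_chunk_failed = False
except UnicodeDecodeError:
    per_chunk_failed = True

print(per_chunk_failed)
print(decode_incrementally(chunks))
```

The incremental decoder holds back the dangling 0xc3 byte and emits 'é' only once the second chunk supplies 0xa9.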

Reproduction

Hard to reproduce on demand, since it depends on TCP chunking and on a tweet containing a multi-byte character right at the chunk boundary. In production it happens regularly: ~44 crashes/day for a stream watching ~1500 tweets/day in French, where emojis are very common.

Code that triggers it:

from xdk import Client
from xdk.streaming import StreamConfig

client = Client(bearer_token=BEARER)
cfg = StreamConfig(max_retries=999999, initial_backoff=1.0)
for resp in client.stream.posts(tweet_fields=[...], stream_config=cfg):
    ...  # process resp

Error logs

[DISCONNECT] UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 1023: unexpected end of data
Non-retryable error: Unexpected error: 'utf-8' codec can't decode byte 0xe2 in position 1023: unexpected end of data

Position is frequently 1023 (suggesting a 1024-byte read buffer is being decoded as-is), but I also see positions 572, 754, 63-65.
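The position-1023 case can be reproduced offline. This is a sketch under an assumption: the 1024-byte read size is inferred from the logs above, not confirmed in the xdk source. The payload is synthetic, with a three-byte character placed so that its first byte is the last byte of the first read.

```python
import io

READ_SIZE = 1024  # assumed buffer size, inferred from the error position

# 1023 ASCII bytes followed by an em dash (0xe2 0x80 0x94).
payload = b"a" * (READ_SIZE - 1) + "\u2014".encode("utf-8")
stream = io.BytesIO(payload)

first_chunk = stream.read(READ_SIZE)  # ends with a lone 0xe2 lead byte
try:
    first_chunk.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```

Other positions (572, 754, 63-65) would be consistent with reads that return fewer bytes than the buffer size, which is normal for socket reads.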

Affected bytes (all are UTF-8 lead bytes): 0xe2 (start of the em dash, the ellipsis, and many symbol emoji), 0xc3 (accented Latin), 0xef (BOM, or the U+FE0F variation selector that follows many emoji), 0xec (CJK), 0xf0 (4-byte emoji).
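Every byte in that list is the first byte of a multi-byte sequence, which is what one would expect if the chunk ends before the continuation bytes arrive. A small helper (hypothetical, for illustration only) that maps a lead byte to the sequence length defined by UTF-8:

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return the total length of a UTF-8 sequence given its first byte."""
    if lead_byte < 0x80:
        return 1                      # ASCII
    if 0xC2 <= lead_byte <= 0xDF:
        return 2                      # e.g. accented Latin
    if 0xE0 <= lead_byte <= 0xEF:
        return 3                      # e.g. punctuation, symbols, CJK
    if 0xF0 <= lead_byte <= 0xF4:
        return 4                      # e.g. most emoji
    raise ValueError(f"0x{lead_byte:02x} is not a valid UTF-8 lead byte")


for b in (0xC3, 0xE2, 0xEC, 0xEF, 0xF0):
    print(f"0x{b:02x}: {utf8_sequence_length(b)}-byte sequence")
```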

Versions tested

  • xdk 0.9.0 (2026-02-28): crash
  • xdk 0.8.1 (2026-02-12): same crash

Both versions have the same streaming bug.

Suggested fix

Use an incremental decoder, codecs.getincrementaldecoder('utf-8')(), or a higher-level pattern such as io.TextIOWrapper or httpx.Response.iter_lines(), instead of decoding each chunk independently.

Pseudo-fix:

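import codecs
import json

def iter_json_lines(raw_byte_stream):
    # One decoder for the whole stream, so its state survives across chunks.
    decoder = codecs.getincrementaldecoder('utf-8')()
    buf = ""
    for chunk in raw_byte_stream:
        buf += decoder.decode(chunk)  # handles split multi-byte chars
        while "\n" in buf:
            line, buf = buf.split("\n", 1)
            if line.strip():  # skip blank lines such as keep-alive newlines
                yield json.loads(line)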
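
An alternative to the incremental decoder, sketched here as a different technique and not as what xdk or httpx does: buffer raw bytes, split on the newline byte, and decode only complete lines. The byte 0x0a never appears inside a multi-byte UTF-8 sequence, so a complete line always contains whole characters. The payload and the chunked() helper are made up for the check, which replays the same data at every possible chunk size.

```python
import json


def iter_json_lines(raw_byte_stream):
    """Yield parsed JSON objects from an iterable of byte chunks."""
    buf = b""
    for chunk in raw_byte_stream:
        buf += chunk
        # Split at the byte level; decode only once a full line is available.
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            if line.strip():  # skip blank lines such as keep-alive newlines
                yield json.loads(line.decode("utf-8"))


# Hypothetical payload: newline-delimited JSON records with multi-byte text.
records = [{"text": "déjà vu 🎉"}, {"text": "日本語 ok"}]
payload = "".join(
    json.dumps(r, ensure_ascii=False) + "\r\n" for r in records
).encode("utf-8")


def chunked(data, size):
    """Cut data into consecutive pieces of at most `size` bytes."""
    return (data[i:i + size] for i in range(0, len(data), size))


# Every chunk size, including 1 byte, must produce the same records.
results_ok = all(
    list(iter_json_lines(chunked(payload, size))) == records
    for size in range(1, len(payload) + 1)
)
print(results_ok)
```

The same all-chunk-sizes loop would also work as a regression test for a decoder-based fix.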

Workaround currently in use

Bypass xdk for streaming and use httpx directly: open the response with httpx.Client.stream("GET", url) and iterate over its iter_lines(). This works because httpx uses an IncrementalDecoder internally for iter_lines/iter_text. xdk is still useful for the REST API.

Environment

  • Python 3.13
  • Debian (systemd service)
  • X API v2 Filtered Stream endpoint /2/tweets/search/stream
