SER breaks high unicode values. #153

@python-processing-unit

According to the specification section 4.1.2:

The character \ MUST begin an escape sequence in a STR literal.

  • \uHHHH = exactly four hexadecimal digits (0-9|A-F|a-f). Produces code point U+HHHH.

  • \UHHHHHHHH = exactly eight hexadecimal digits (0-9|A-F|a-f). Produces code point U+HHHHHHHH.

The SER builtin serializes strings to JSON. However, jb_append_json_string in builtins.c walks the UTF-8 string byte by byte and escapes every byte >= 0x7F as \u00xx, using the raw byte value rather than the decoded code point. A multi-byte UTF-8 sequence therefore yields one escape per byte: U+00E9 (é), encoded as the bytes 0xC3 0xA9, is emitted as \u00c3\u00a9 instead of a single escape for the code point. The test ser-strings.pre expects SER("\u00E9") to produce \u00E9 (or the lowercase variant \u00e9), so its assertion fails.
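One possible direction for a fix, as a minimal sketch only: decode each UTF-8 sequence to a code point first, then emit a single escape for it. The helper names below (utf8_decode, json_escape_cp, escape_non_ascii) are hypothetical and are not taken from builtins.c. The sketch also assumes that code points above U+FFFF should be written as a UTF-16 surrogate pair, since a JSON \u escape carries exactly four hex digits. Escaping of quotes, backslashes, and control characters is deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Decode one UTF-8 sequence from s (n bytes available) into *cp.
 * Returns the number of bytes consumed, or 0 if the input is malformed
 * (bad lead byte, truncated, overlong, surrogate, or above U+10FFFF). */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    if (n == 0) return 0;
    unsigned char b = s[0];
    size_t len;
    uint32_t c, min;
    if (b < 0x80) { *cp = b; return 1; }
    else if ((b & 0xE0) == 0xC0) { len = 2; c = b & 0x1F; min = 0x80; }
    else if ((b & 0xF0) == 0xE0) { len = 3; c = b & 0x0F; min = 0x800; }
    else if ((b & 0xF8) == 0xF0) { len = 4; c = b & 0x07; min = 0x10000; }
    else return 0;
    if (n < len) return 0;
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;
    *cp = c;
    return len;
}

/* Write the JSON escape for cp into out; at most 12 characters plus NUL.
 * Code points above U+FFFF become a surrogate pair. */
static int json_escape_cp(uint32_t cp, char *out, size_t cap)
{
    if (cp >= 0x10000) {
        uint32_t v = cp - 0x10000;
        return snprintf(out, cap, "\\u%04x\\u%04x",
                        (unsigned)(0xD800 + (v >> 10)),
                        (unsigned)(0xDC00 + (v & 0x3FF)));
    }
    return snprintf(out, cap, "\\u%04x", (unsigned)cp);
}

/* Copy ASCII through unchanged and escape every other code point once.
 * Returns the output length, or -1 on malformed input or a full buffer. */
static int escape_non_ascii(const char *in, size_t n, char *out, size_t cap)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t i = 0, o = 0;
    while (i < n) {
        uint32_t cp;
        size_t used = utf8_decode(s + i, n - i, &cp);
        if (used == 0) return -1;
        if (cp < 0x80) {
            if (o + 1 >= cap) return -1;
            out[o++] = (char)cp;
        } else {
            if (cap - o < 13) return -1;       /* 12 chars + NUL */
            o += (size_t)json_escape_cp(cp, out + o, cap - o);
        }
        i += used;
    }
    if (o >= cap) return -1;
    out[o] = '\0';
    return (int)o;
}
```

Under these assumptions, the bytes 0xC3 0xA9 come out as the single escape \u00e9, which is the lowercase variant the test accepts. Whether malformed input should be rejected, as here, or replaced with U+FFFD is a design choice the issue does not settle.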

Metadata

Labels

bug: Something isn't working
patch: Requires a patch version change.
