Adding support for wobbly strings to On Demand and lossy transcoding from escaped (with replacement) #1947

Merged on Mar 1, 2023 (5 commits)

@lemire (Member) commented on Jan 28, 2023

Some users receive strings that cannot be decoded into valid UTF-8 because of a bad sequence of escaped code points. A reasonable solution is to fall back on WTF-8. This is allowed by RFC...

When all the strings represented in a JSON text are composed entirely of Unicode characters [UNICODE] (however escaped), then that JSON text is interoperable in the sense that all software implementations that parse it will agree on the contents of names and of string values in objects and arrays. However, the ABNF in this specification allows member names and string values to contain bit sequences that cannot encode Unicode characters; for example, "\uDEAD" (a single unpaired UTF-16 surrogate). Instances of this have been observed, for example, when a library truncates a UTF-16 string without checking whether the truncation split a surrogate pair. The behavior of software that receives JSON texts containing such values is unpredictable; for example, implementations might return different values for the length of a string value or even suffer fatal runtime exceptions.
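To make the truncation scenario in the quoted passage concrete, here is a minimal standalone sketch. The helper names (`has_unpaired_surrogate`, `truncate_units`) are hypothetical and are not part of simdjson; the input is assumed to be the UTF-16 code units that a run of `\uXXXX` escapes would produce.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative helpers only; none of these names come from simdjson.
inline bool is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Returns true if the code units contain a surrogate that is not part of a
// well-formed (high, low) pair.
bool has_unpaired_surrogate(const std::vector<uint16_t> &units) {
  for (std::size_t i = 0; i < units.size(); i++) {
    if (is_high_surrogate(units[i])) {
      if (i + 1 < units.size() && is_low_surrogate(units[i + 1])) {
        i++; // skip the low half of a valid pair
      } else {
        return true; // high surrogate with no low surrogate after it
      }
    } else if (is_low_surrogate(units[i])) {
      return true; // low surrogate with no high surrogate before it
    }
  }
  return false;
}

// Naive truncation to the first n code units, with no surrogate check.
// This is the kind of bug the RFC passage describes.
std::vector<uint16_t> truncate_units(const std::vector<uint16_t> &units,
                                     std::size_t n) {
  assert(n <= units.size());
  return std::vector<uint16_t>(units.begin(),
                               units.begin() + static_cast<std::ptrdiff_t>(n));
}
```

For instance, "A" followed by U+1F600 is the three code units 0x0041, 0xD83D, 0xDE00. Truncating that to two units leaves a lone high surrogate, which is exactly the kind of string that cannot be turned into valid UTF-8.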

Note that the original input must still be valid UTF-8, and we will continue to validate UTF-8 by default. Users who want WTF-8 output will need to call a dedicated function, get_wobbly_string().
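As a rough illustration of what falling back on WTF-8 means for escaped code points, the sketch below transcodes UTF-16 code units (as produced by `\uXXXX` escapes) into bytes. It is a standalone sketch under stated assumptions, not the code in this PR: `append_generalized_utf8` and `escapes_to_wtf8` are hypothetical names. Well-formed surrogate pairs become ordinary four-byte UTF-8, while a lone surrogate is written with the usual three-byte layout instead of being rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Appends one code point using the UTF-8 bit layout. Surrogates
// (U+D800..U+DFFF) are deliberately NOT rejected: that is what makes the
// output WTF-8 rather than strict UTF-8.
void append_generalized_utf8(std::string &out, uint32_t cp) {
  assert(cp <= 0x10FFFF);
  if (cp < 0x80) {
    out.push_back(static_cast<char>(cp));
  } else if (cp < 0x800) {
    out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else if (cp < 0x10000) {
    out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else {
    out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  }
}

// Hypothetical helper: transcodes the code units of \uXXXX escapes to WTF-8.
std::string escapes_to_wtf8(const std::vector<uint16_t> &units) {
  std::string out;
  for (std::size_t i = 0; i < units.size(); i++) {
    uint32_t cp = units[i];
    bool high = cp >= 0xD800 && cp <= 0xDBFF;
    if (high && i + 1 < units.size() && units[i + 1] >= 0xDC00 &&
        units[i + 1] <= 0xDFFF) {
      // Well-formed pair: combine into one supplementary code point.
      cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i + 1] - 0xDC00);
      i++;
    }
    // Lone surrogates fall through and are encoded as three bytes.
    append_generalized_utf8(out, cp);
  }
  return out;
}
```

Under this sketch, the RFC's `"\uDEAD"` example becomes the three bytes ED BA AD. A strict UTF-8 validator rejects those bytes, which is why the wobbly result has to be requested explicitly rather than returned by default.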

We are also adding a Boolean parameter to our get_string() methods so that, when it is set, replacement characters are inserted instead of reporting an error (lossy decoding).
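The difference between strict and lossy decoding can be sketched at the code point level as follows. This is an illustration only: `decode_escapes` and its `allow_replacement` flag are hypothetical stand-ins for the Boolean parameter described above, and the sketch assumes the replacement character is U+FFFD.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper. Decodes UTF-16 code units taken from \uXXXX escapes
// into code points. On an unpaired surrogate it returns false, unless
// allow_replacement is true, in which case U+FFFD is substituted.
bool decode_escapes(const std::vector<uint16_t> &units, bool allow_replacement,
                    std::u32string &out) {
  out.clear();
  for (std::size_t i = 0; i < units.size(); i++) {
    char32_t cp = units[i];
    bool high = cp >= 0xD800 && cp <= 0xDBFF;
    bool low = cp >= 0xDC00 && cp <= 0xDFFF;
    if (high && i + 1 < units.size() && units[i + 1] >= 0xDC00 &&
        units[i + 1] <= 0xDFFF) {
      // Well-formed pair: combine into one supplementary code point.
      cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i + 1] - 0xDC00);
      assert(cp >= 0x10000 && cp <= 0x10FFFF);
      i++;
    } else if (high || low) {
      if (!allow_replacement) {
        return false; // strict mode: report an error
      }
      cp = 0xFFFD; // lossy mode: substitute the replacement character
    }
    out.push_back(cp);
  }
  return true;
}
```

With the flag off, the units for `"A\uDEADB"` are rejected. With it on, they decode to "A", U+FFFD, "B", so the caller always gets valid Unicode at the cost of losing the original surrogate.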

The DOM kernel will not be affected.

Fixes #1944

@lemire added the enhancement label on Jan 28, 2023
@lemire changed the title from "Adding support for wobbly strings to On Demand" to "Adding support for wobbly strings to On Demand and lossy transcoding from escaped (with replacement)" on Jan 29, 2023
@lemire merged commit 37e87f6 into master on Mar 1, 2023
@lemire deleted the dlemire/adding_support_for_wtf8 branch on March 1, 2023 03:56
Successfully merging this pull request may close these issues.

Support Wobbly