Fix handling of surrogates on decoding #550
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
This implements surrogate handling on decoding as it is in the standard library. Lone escaped surrogates and any raw surrogates in the input result in surrogates in the output, and escaped surrogate pairs get decoded into non-BMP characters. Note that raw surrogate pairs get treated differently on platforms/compilers with 16-bit
wchar_t
, e.g. Microsoft Windows.Before this, well-formed JSON using surrogates only in encoded surrogate pairs was decoded correctly, but anything else containing surrogates would produce unexpected results or errors. This change makes ujson's decoding compatible with the standard library's.
Unfortunately, platforms with a 16-bit
wchar_t
cannot handle raw surrogate pairs correctly becausePyUnicode_FromWideChar
decodes those. That issue was always present (but untested) and is left unfixed here. I'm not sure it is possible to handle it correctly without completely changing the decoding approach, e.g. producing UTF-8 and usingPyUnicode_FromStringAndSize
(which, by the way, is what orjson does, though it rejects lone surrogates).