Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent handling of surrogate pair characters #447

Closed
TheShiftedBit opened this issue Dec 17, 2020 · 1 comment · Fixed by #530
Closed

Inconsistent handling of surrogate pair characters #447

TheShiftedBit opened this issue Dec 17, 2020 · 1 comment · Fixed by #530

Comments

@TheShiftedBit
Copy link

Ultrajson does not correctly handle surrogate pair characters. Per RFC 7159, unmatched surrogate pair characters are permitted in JSON:

the ABNF in this specification allows member names and
string values to contain bit sequences that cannot encode Unicode
characters; for example, "\uDEAD" (a single unpaired UTF-16
surrogate).

This behavior also disagrees with the built-in json module, which can handle strings containing surrogates correctly.

Ultrajson will correctly parse strings containing surrogate pair characters, if they're written in an escaped form, but it will not parse unescaped surrogate pair characters, and it will never dump Python strings containing surrogate pairs to JSON. This means ujson fails to round-trip certain inputs; see the example below.

What did you do?

Attempted to .dumps() a string containing a surrogate pair character.

What did you expect to happen?

It is encoded into a string.

What actually happened?

An exception is raised:

UnicodeEncodeError: 'utf-8' codec can't encode character '\udddd' in position 0: surrogates not allowed

What versions are you using?

  • OS: Debian 5.7.17-1rodete4 (2020-10-01) x86_64 GNU/Linux
  • Python: Python 3.8.6 (tags/v3.8.6:db455296be, Nov 22 2020, 18:14:31)
  • UltraJSON: 4.0.1 (latest)

Please include code that reproduces the issue.

import ujson
x = ujson.loads('"\\udead"')  # Works, produces a surrogate pair character
y = ujson.dumps(x)  # Raises an exception

Solving this problem is a bit annoying in native code. In Python, the solution is simply to add "surrogatepass" as a second parameter to str.encode(). That functionality isn't exposed in native code, however. The solution I used in Atheris (which found this bug) was to first try encoding the string using the normal method, and if that doesn't work, fall back to encoding it by dynamically making a call to the str.encode() function in Python. Since you only have to take the slow route if the fast route fails, this is not likely to meaningfully affect performance.

@hugovk
Copy link
Member

hugovk commented Dec 17, 2020

Thanks for the report! The solution sounds reasonable, would you like to put together a PR?

JustAnotherArchivist added a commit to JustAnotherArchivist/ultrajson that referenced this issue Apr 17, 2022
This allows surrogates anywhere in the input, compatible with the json module from the standard library.

This also refactors two interfaces:
- The PyUnicode to char* conversion is moved into its own function, separated from the JSONTypeContext handling, so it can be reused for other things in the future.
- Converting the char* output to a Python string with surrogates intact requires the string length for PyUnicode_Decode (or any of its alternatives). While strlen could be used, the length is already known inside the encoder, so the encoder function now also takes an extra size_t pointer argument to return that. This also permits output that contains NUL bytes (even though that would be invalid JSON), e.g. if an object's __json__ method return value were to contain them.

Fixes ultrajson#156
Fixes ultrajson#447
Supersedes ultrajson#284
JustAnotherArchivist added a commit to JustAnotherArchivist/ultrajson that referenced this issue Apr 17, 2022
This allows surrogates anywhere in the input, compatible with the json module from the standard library.

This also refactors two interfaces:
- The `PyUnicode` to `char*` conversion is moved into its own function, separated from the `JSONTypeContext` handling, so it can be reused for other things in the future (e.g. indentation and separators) which don't have a type context.
- Converting the `char*` output to a Python string with surrogates intact requires the string length for `PyUnicode_Decode` & Co. While `strlen` could be used, the length is already known inside the encoder, so the encoder function now also takes an extra `size_t` pointer argument to return that and no longer NUL-terminates the string. This also permits output that contains NUL bytes (even though that would be invalid JSON), e.g. if an object's `__json__` method return value were to contain them.

Fixes ultrajson#156
Fixes ultrajson#447
Supersedes ultrajson#284
JustAnotherArchivist added a commit to JustAnotherArchivist/ultrajson that referenced this issue Apr 17, 2022
This allows surrogates anywhere in the input, compatible with the json module from the standard library.

This also refactors two interfaces:
- The `PyUnicode` to `char*` conversion is moved into its own function, separated from the `JSONTypeContext` handling, so it can be reused for other things in the future (e.g. indentation and separators) which don't have a type context.
- Converting the `char*` output to a Python string with surrogates intact requires the string length for `PyUnicode_Decode` & Co. While `strlen` could be used, the length is already known inside the encoder, so the encoder function now also takes an extra `size_t` pointer argument to return that and no longer NUL-terminates the string. This also permits output that contains NUL bytes (even though that would be invalid JSON), e.g. if an object's `__json__` method return value were to contain them.

Fixes ultrajson#156
Fixes ultrajson#447
Supersedes ultrajson#284
JustAnotherArchivist added a commit to JustAnotherArchivist/ultrajson that referenced this issue May 30, 2022
This allows surrogates anywhere in the input, compatible with the json module from the standard library.

This also refactors two interfaces:
- The `PyUnicode` to `char*` conversion is moved into its own function, separated from the `JSONTypeContext` handling, so it can be reused for other things in the future (e.g. indentation and separators) which don't have a type context.
- Converting the `char*` output to a Python string with surrogates intact requires the string length for `PyUnicode_Decode` & Co. While `strlen` could be used, the length is already known inside the encoder, so the encoder function now also takes an extra `size_t` pointer argument to return that and no longer NUL-terminates the string. This also permits output that contains NUL bytes (even though that would be invalid JSON), e.g. if an object's `__json__` method return value were to contain them.

Fixes ultrajson#156
Fixes ultrajson#447
Fixes ultrajson#537
Supersedes ultrajson#284
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants