UltraJson doesn't behave the same way as json.JSONEncoder for unicode chars #156

prosanes · 2014-11-27T17:47:51Z

As stated in issue #155:

I'm not sure that ujson.encode('\ud83d\ude80') should give any error.
The Python standard json library ("simplejson") doesn't:

json.JSONEncoder().encode('\ud83d\ude80')
'"\ud83d\ude80"'

When using ujson:

Python 3.3.2 (default, Sep 16 2013, 16:19:35)
[GCC 4.4.6 20120305 (Red Hat 4.4.6-4)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ujson
>>> ujson.encode('\ud83d\ude80')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
>>> ujson.__version__
'1.34'
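
The difference can be reproduced with the standard library alone. The following is a minimal sketch, assuming CPython 3 and using only the `json` module (ujson is not imported, and the variable names are illustrative). It shows why strict UTF-8 encoding fails on this string while the stdlib encoder succeeds.

```python
import json

# Two separate surrogate code points, not the single character U+1F680.
s = '\ud83d\ude80'

# Strict UTF-8 refuses surrogates; this is the error reported for ujson above.
try:
    s.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The stdlib encoder never needs UTF-8 here: with the default
# ensure_ascii=True it writes each code point as a \uXXXX escape.
encoded = json.dumps(s)

# The 'surrogatepass' error handler encodes the surrogates anyway.
raw = s.encode('utf-8', 'surrogatepass')

print(strict_ok, encoded, raw)
```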

tonicava · 2014-11-28T12:09:55Z

Right, and that's a problem...

Jahaja · 2014-12-06T13:23:43Z

We're looking into it.

vetal4444 · 2015-03-26T08:44:36Z

I have this issue too:

In [1]: import json, ujson

In [2]: s = '"\ud8df\u4b61"'

In [3]: json.loads(s)
Out[3]: u'\ud8df\u4b61'

In [4]: ujson.loads(s)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-1b4d181e3a09> in <module>()
----> 1 ujson.loads(s)

ValueError: Unpaired high surrogate when decoding 'string'

Python 2.7.9

mittonk · 2016-10-06T23:11:58Z

Still reproducible on Python 3.5.1 and ujson 1.3.5.

This is a situation where we have a Python unicode string which doesn't consist entirely of genuine Unicode characters -- some of the codepoints in the string are surrogate codepoints, which occur in a UTF-16 encoding of a string and were also repurposed in PEP 383 for losslessly encoding arbitrary mostly-UTF-8 bytestrings (like Unix filenames) in Python strings.

Currently, on Python 3, we cause a UnicodeEncodeError if we try to encode such a string as JSON.

It's not 100% obvious what the right thing to do here is -- this situation seems like it must reflect a bug somewhere else in the program or its environment. But

 * one way we can get such a string is by loading a JSON document (perhaps an invalid JSON document? anyway, we load it without error):

       >>> ujson.dumps(ujson.loads('"\\udcff"'))
       Traceback (most recent call last):
         File "<stdin>", line 1, in <module>
       UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 0: surrogates not allowed

 * we already pass these strings through without complaint on Python 2;

 * as the included test shows, passing these through matches the behavior of the stdlib's `json` module.

So it seems best to pass them through.

Fixes ultrajson#156.
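
Both routes to such a string mentioned above can be shown with the standard library. This is a hedged sketch assuming CPython 3; it uses the stdlib `json` module rather than ujson, and the variable names are illustrative.

```python
import json

# PEP 383: the 'surrogateescape' handler maps the undecodable byte 0xFF
# to the lone surrogate U+DCFF instead of raising.
name = b'\xff'.decode('utf-8', 'surrogateescape')

# The same handler restores the original byte on the way out.
restored = name.encode('utf-8', 'surrogateescape')

# The stdlib decoder accepts an escaped lone surrogate without error...
loaded = json.loads('"\\udcff"')

# ...and the stdlib encoder writes it back out as an escape.
dumped = json.dumps(loaded)

print(repr(name), restored, repr(loaded), dumped)
```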

rafaelxy · 2018-03-15T19:22:46Z

Still happens on Python 2.7.12 and ujson==1.35

hartwork · 2020-02-25T17:26:11Z

The examples presented in this ticket pass strings with invalid characters (isolated surrogates) to ujson.loads. While its handling differs from that of the standard library json, it prevents invalid characters from entering your application. encode/django-rest-framework#7026 is an example of why letting them through would be a problem.

So my vote for this ticket: Not a bug but a feature.
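
An application that decodes with the stdlib `json` module but wants the stricter behavior described here could check decoded strings itself. The sketch below is only an illustration: `has_surrogates` is a hypothetical helper, not part of ujson or the standard library.

```python
import json


def has_surrogates(text: str) -> bool:
    """Return True if text contains any code point in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


# The stdlib decoder accepts the unpaired high surrogate from the report above.
decoded = json.loads('"\\ud8df\\u4b61"')

# The application can then decide to reject the value.
flagged = has_surrogates(decoded)

print(flagged)
```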

This allows surrogates anywhere in the input, compatible with the json module from the standard library.

This also refactors two interfaces:

- The `PyUnicode` to `char*` conversion is moved into its own function, separated from the `JSONTypeContext` handling, so it can be reused for other things in the future (e.g. indentation and separators) which don't have a type context.
- Converting the `char*` output to a Python string with surrogates intact requires the string length for `PyUnicode_Decode` & Co. While `strlen` could be used, the length is already known inside the encoder, so the encoder function now also takes an extra `size_t` pointer argument to return that and no longer NUL-terminates the string. This also permits output that contains NUL bytes (even though that would be invalid JSON), e.g. if an object's `__json__` method return value were to contain them.

Fixes ultrajson#156
Fixes ultrajson#447
Fixes ultrajson#537
Supersedes ultrajson#284
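
The point about carrying an explicit length can be illustrated from Python with `ctypes`. This is an analogy under stated assumptions, not the ujson C code: the buffer contents are made up, and the use of the `surrogatepass` error handler is an assumption about how surrogates could be kept intact when decoding.

```python
import ctypes

payload = b'"a\x00b"'  # 5 bytes with an embedded NUL
buf = ctypes.create_string_buffer(payload)

# strlen-style read: stops at the first NUL byte and loses the rest.
truncated = ctypes.string_at(buf)

# Length-aware read: returns all 5 bytes, NUL included.
full = ctypes.string_at(buf, len(payload))

# Decoding UTF-8 bytes with 'surrogatepass' keeps a surrogate code point
# intact; strict decoding would raise on these bytes.
text = b'"\xed\xb3\xbf"'.decode('utf-8', 'surrogatepass')

print(truncated, full, repr(text))
```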

gnprice mentioned this issue Aug 29, 2017

ujson causes UnicodeEncodeError in email mirror zulip/zulip#6332

Closed

gnprice mentioned this issue Aug 29, 2017

Fix handling of surrogate pseudocharacters under Python 3. #284

Closed

gnprice mentioned this issue Sep 13, 2017

Evaluate switching to RapidJSON (from ujson) zulip/zulip#6507

Closed

vihang mentioned this issue Jan 26, 2018

Fix handling of surrogate pseudocharacters under Python 3. finaxar/ultrajson#3

Merged

JustAnotherArchivist mentioned this issue Apr 11, 2022

Allow str and None values for indent #518

Open

JustAnotherArchivist mentioned this issue Apr 17, 2022

Fix handling of surrogates on encoding #530

Merged

hugovk closed this as completed in #530 Jun 1, 2022
