Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ujson causes UnicodeEncodeError in email mirror #6332

Closed
gnprice opened this issue Aug 29, 2017 · 4 comments
Closed

ujson causes UnicodeEncodeError in email mirror #6332

gnprice opened this issue Aug 29, 2017 · 4 comments

Comments

@gnprice
Copy link
Member

gnprice commented Aug 29, 2017

We've gotten a few exceptions like this recently on zulipchat.com:

  File "./zerver/views/email_mirror.py", line 22, in email_mirror_message
    result = mirror_email_message(ujson.loads(request.POST['data']))
  File "./zerver/lib/email_mirror.py", line 382, in mirror_email_message
    lambda x: None
  File "./zerver/lib/queue.py", line 306, in queue_json_publish
    get_queue_client().json_publish(queue_name, event)
  File "./zerver/lib/queue.py", line 115, in json_publish
    self.publish(queue_name, ujson.dumps(body))
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcef' in position 6226: surrogates not allowed

The request data in this example looks like

- POST: {'secret': ['**********'], 'data': ['{"msg_text": "Received: from [...]\\udcef\\udcbf\\udcbd[...]", "recipient": "***"}']}

The core of the issue appears to be that ujson will decode a JSON document that looks like this, but will then fail if you ask it to encode the result:

In [39]: ujson.loads('"\\udcef\\udcbf\\udcbd"')
Out[39]: '\udcef\udcbf\udcbd'

In [40]: ujson.dumps(ujson.loads('"\\udcef\\udcbf\\udcbd"'))
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-40-6b0b01834cc7> in <module>()
----> 1 ujson.dumps(ujson.loads('"\\udcef\\udcbf\\udcbd"'))

UnicodeEncodeError: 'utf-8' codec can't encode character '\udcef' in position 0: surrogates not allowed

(The result of loads there is a string of three characters, each of which is not a real character but rather a surrogate value. The surrogates were originally carved out for use in UTF-16, to encode Unicode in 16-bit elements, but in Python they're now used for losslessly encoding mostly-UTF-8 bytestrings, like Unix filenames, in Python text strings. This particular sequence of three surrogates doesn't seem to fit either of those uses, so it's a mystery where it came from. If you take the low 8 bits from each of these surrogates as a sequence of bytes, you do get the UTF-8 encoding of U+FFFD REPLACEMENT CHARACTER -- so there are probably multiple layers of wrongness involved here.)

Compare json from the stdlib, which handles the round-trip with no trouble:

In [41]: json.dumps(ujson.loads('"\\udcef\\udcbf\\udcbd"'))
Out[41]: '"\\udcef\\udcbf\\udcbd"'

The ujson tracker has had this issue open since 2014.

The number of segfaults, crashes, and memory corruption issues in the ujson open issues suggest this may not be a super reliable or vigorously-maintained library. The benchmarks in its README are impressive, but if we can we may be better off using the stdlib's json (aka simplejson).

@gnprice
Copy link
Member Author

gnprice commented Aug 29, 2017

On Python 2, ujson handles this fine:

In [1]: import ujson

In [2]: ujson.loads('"\\udcef\\udcbf\\udcbd"')
Out[2]: u'\udcef\udcbf\udcbd'

In [3]: ujson.dumps(ujson.loads('"\\udcef\\udcbf\\udcbd"'))
Out[3]: '"\\udcef\\udcbf\\udcbd"'

So that's why we haven't seen this until recently. Note the decoded object is a unicode -- that's probably a hint that ujson relies on here.

@timabbott
Copy link
Sponsor Member

I think it's worth doing some benchmarking with Python 3.4. With Python 2, using ujson had a huge performance benefit for our get_messages code path, and we'd standardized on it for that reason. But Python 3 I believe has involved a lot into making the built-in JSON reasonable since ujson was written, and it might not be worth it to use ujson elsewhere in the codebase where there are (apparently) encoding-related issues. We could certainly change non-perf-sensitive code paths like this one.

@gnprice
Copy link
Member Author

gnprice commented Aug 29, 2017

I nerdsniped myself on this and sent a PR upstream: ultrajson/ultrajson#284

So we should be able to just use that fix.

I think it's worth doing some benchmarking with Python 3.4.

Their README has results for both CPython 2.7.6 and CPython 3.4.3. It still shows them far ahead for some benchmarks.

@zulipbot
Copy link
Member

Hello @zulip/server-notifications members, this issue was labeled with the area: notifications label, so you may want to check it out!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants