ujson causes UnicodeEncodeError in email mirror #6332

gnprice · 2017-08-29T00:06:22Z

We've gotten a few exceptions like this recently on zulipchat.com:

  File "./zerver/views/email_mirror.py", line 22, in email_mirror_message
    result = mirror_email_message(ujson.loads(request.POST['data']))
  File "./zerver/lib/email_mirror.py", line 382, in mirror_email_message
    lambda x: None
  File "./zerver/lib/queue.py", line 306, in queue_json_publish
    get_queue_client().json_publish(queue_name, event)
  File "./zerver/lib/queue.py", line 115, in json_publish
    self.publish(queue_name, ujson.dumps(body))
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcef' in position 6226: surrogates not allowed

The request data in this example looks like

- POST: {'secret': ['**********'], 'data': ['{"msg_text": "Received: from [...]\\udcef\\udcbf\\udcbd[...]", "recipient": "***"}']}

The core of the issue appears to be that ujson will decode a JSON document that looks like this, but will then fail if you ask it to encode the result:

In [39]: ujson.loads('"\\udcef\\udcbf\\udcbd"')
Out[39]: '\udcef\udcbf\udcbd'

In [40]: ujson.dumps(ujson.loads('"\\udcef\\udcbf\\udcbd"'))
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-40-6b0b01834cc7> in <module>()
----> 1 ujson.dumps(ujson.loads('"\\udcef\\udcbf\\udcbd"'))

UnicodeEncodeError: 'utf-8' codec can't encode character '\udcef' in position 0: surrogates not allowed

(The result of loads there is a string of three characters, each of which is not a real character but rather a surrogate value. The surrogates were originally carved out for use in UTF-16, to encode Unicode in 16-bit elements, but in Python they're now used for losslessly encoding mostly-UTF-8 bytestrings, like Unix filenames, in Python text strings. This particular sequence of three surrogates doesn't seem to fit either of those uses, so it's a mystery where it came from. If you take the low 8 bits from each of these surrogates as a sequence of bytes, you do get the UTF-8 encoding of U+FFFD REPLACEMENT CHARACTER -- so there are probably multiple layers of wrongness involved here.)

Compare json from the stdlib, which handles the round-trip with no trouble:

In [41]: json.dumps(ujson.loads('"\\udcef\\udcbf\\udcbd"'))
Out[41]: '"\\udcef\\udcbf\\udcbd"'

The ujson tracker has had this issue open since 2014.

The number of segfaults, crashes, and memory corruption issues in the ujson open issues suggest this may not be a super reliable or vigorously-maintained library. The benchmarks in its README are impressive, but if we can we may be better off using the stdlib's json (aka simplejson).

The text was updated successfully, but these errors were encountered:

gnprice · 2017-08-29T00:10:46Z

On Python 2, ujson handles this fine:

In [1]: import ujson

In [2]: ujson.loads('"\\udcef\\udcbf\\udcbd"')
Out[2]: u'\udcef\udcbf\udcbd'

In [3]: ujson.dumps(ujson.loads('"\\udcef\\udcbf\\udcbd"'))
Out[3]: '"\\udcef\\udcbf\\udcbd"'

So that's why we haven't seen this until recently. Note the decoded object is a unicode -- that's probably a hint that ujson relies on here.

timabbott · 2017-08-29T00:27:04Z

I think it's worth doing some benchmarking with Python 3.4. With Python 2, using ujson had a huge performance benefit for our get_messages code path, and we'd standardized on it for that reason. But Python 3 I believe has involved a lot into making the built-in JSON reasonable since ujson was written, and it might not be worth it to use ujson elsewhere in the codebase where there are (apparently) encoding-related issues. We could certainly change non-perf-sensitive code paths like this one.

gnprice · 2017-08-29T02:12:22Z

I nerdsniped myself on this and sent a PR upstream: ultrajson/ultrajson#284

So we should be able to just use that fix.

I think it's worth doing some benchmarking with Python 3.4.

Their README has results for both CPython 2.7.6 and CPython 3.4.3. It still shows them far ahead for some benchmarks.

zulipbot · 2017-08-29T20:48:08Z

Hello @zulip/server-notifications members, this issue was labeled with the area: notifications label, so you may want to check it out!

timabbott added area: notifications (other) bug labels Aug 29, 2017

gnprice closed this as completed in b0d34b0 Aug 29, 2017

gnprice mentioned this issue Sep 13, 2017

Evaluate switching to RapidJSON (from ujson) #6507

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ujson causes UnicodeEncodeError in email mirror #6332

ujson causes UnicodeEncodeError in email mirror #6332

gnprice commented Aug 29, 2017

gnprice commented Aug 29, 2017

timabbott commented Aug 29, 2017

gnprice commented Aug 29, 2017

zulipbot commented Aug 29, 2017

ujson causes UnicodeEncodeError in email mirror #6332

ujson causes UnicodeEncodeError in email mirror #6332

Comments

gnprice commented Aug 29, 2017

gnprice commented Aug 29, 2017

timabbott commented Aug 29, 2017

gnprice commented Aug 29, 2017

zulipbot commented Aug 29, 2017