Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

should not serialize bytes #264

Closed
glasserc opened this issue May 25, 2017 · 1 comment · Fixed by #266
Closed

should not serialize bytes #264

glasserc opened this issue May 25, 2017 · 1 comment · Fixed by #266

Comments

@glasserc
Copy link

glasserc commented May 25, 2017

We recently hit a bug (Kinto/kinto#1224) where one code path round-tripped a bcrypt-hashed password (a bytes object) through ujson, and another code path didn't. The one that went through ujson converted everything to str, whereas the other one left it as bytes.

It's my opinion that bytes should not be a serializable type. There is no equivalent to bytes in JSON, but ujson encodes bytes as a JSON string, which is for code points, not bytes. This means that there are bytes values which are not representable in JSON. ujson tries its best, decoding bytes values as UTF8 and failing if that isn't possible:

>>> import ujson
>>> ujson.dumps({"hi": b'\x30'})
'{"hi":"0"}'
>>> ujson.dumps({"hi": b'\xff'})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 7: invalid start byte

This behavior made sense in the days of Python 2, where str objects were often used to encode text (see #74), but I think that if it's going to come out as strings, it shouldn't be allowed in as bytes.

The built-in json module refuses to encode bytes, either as a value or as an Object key:

>>> import json
>>> json.dumps({"hi": b'\x30'})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.5/json/__init__.py", line 230, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib64/python3.5/json/encoder.py", line 198, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib64/python3.5/json/encoder.py", line 256, in iterencode
    return _iterencode(o, 0)
  File "/usr/lib64/python3.5/json/encoder.py", line 179, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: b'0' is not JSON serializable
>>> json.dumps({b"hi": b'\x30'})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.5/json/__init__.py", line 230, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib64/python3.5/json/encoder.py", line 198, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib64/python3.5/json/encoder.py", line 256, in iterencode
    return _iterencode(o, 0)
TypeError: keys must be a string
elelay added a commit to elelay/ultrajson that referenced this issue Jun 11, 2017
raise TypeError when encountering bytes in ujson.dumps() to prevent
unexpected Unicode exceptions in production.
Fixes ultrajson#264
@elelay
Copy link
Contributor

elelay commented Jun 11, 2017

PR #266 leaves the default behaviour unchanged but adds an option to raise on bytes.
This way developer can choose what behaviour they want.
This shouldn't affect performance (only adds a pointer dereference to check if reject_bytes is active when bytes are encountered).
The UNLIKELY macro could be added if desired.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants