accept bytes in json.loads() #55185
Comments
json.loads() accepts strings but errors on bytes objects. Documentation and API indicate that both should work. Review of the json/__init__.py code shows that the loads() function's 'encoding' argument is ignored and no decoding takes place before the object is passed to JSONDecoder.decode(). Tested on Python 3.1.2 and Python 3.2rc1; fails on both. Example:

```
#!/usr/local/bin/python3.2
import json

print(json.loads('123'))
# 123

print(json.loads(b'123'))
# /Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/json/decoder.py:325:
# TypeError: can't use a string pattern on a bytes-like object

print(json.loads(b'123', encoding='utf-8'))
# /Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/json/decoder.py:325:
# TypeError: can't use a string pattern on a bytes-like object
```

Patch attached. |
Hmm. According to bpo-4136, all bytes support was supposed to have been removed. |
Indeed, the documentation (and function docstring) needs fixing instead. It's a pity we didn't remove the useless encoding argument. |
Georg: Is it still time to deprecate the encoding parameter in 3.2? |
I've committed a doc fix in r88137. |
Doc fix works for me. |
Works for me, py2.7 on snow leopard. |
anthony: this is a Python 3-only problem. |
Now it's too late for 3.2, should this be done for 3.3? |
If you’re talking about deprecating the obsolete encoding argument (maybe it’s time for a new bug report), +1. |
I'll just mention that the elimination of bytes handling is a bit unfortunate, since this idiom, which works in Python 2, no longer works:

```
fp = urlopen(url)
json_data = json.load(fp)
```

/me sad |
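In Python 3 the broken idiom above can be approximated by wrapping the binary stream in a text layer so json.load() receives str, not bytes. This is only a sketch under that assumption; io.BytesIO stands in here for the binary file object urlopen() returns:

```python
import io
import json

# io.BytesIO simulates the binary response object from urlopen().
fp = io.BytesIO(b'{"id": 1}')

# TextIOWrapper decodes the byte stream on the fly, so json.load()
# sees text. 'utf-8' is the JSON default encoding per RFC 4627.
json_data = json.load(io.TextIOWrapper(fp, encoding='utf-8'))
print(json_data)
```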
What if the returned JSON uses a charset other than utf-8 ? |
I know this does not fix anything at the core, but it would allow you to use json.loads() with Python 3.2 (maybe 3.1?). Decode the bytes first:

```
raw_data = raw_data.decode('utf-8')  # or any other encoding
json.loads(raw_data)
```
|
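To handle the non-UTF-8 case raised above, the decode step can honour whatever charset the server declared. The helper name below is hypothetical, not part of any library:

```python
import json

def decode_json_bytes(raw, charset=None):
    # Hypothetical helper: decode raw bytes using the charset the server
    # declared (e.g. taken from the Content-Type header), falling back to
    # UTF-8, which RFC 4627 names as the JSON default encoding.
    return json.loads(raw.decode(charset or 'utf-8'))
```

With urllib, the declared charset can be read via `resp.headers.get_content_charset()` and passed in as the second argument.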
According to RFC 4627: "JSON text SHALL be encoded in Unicode. The default encoding is UTF-8." RFC 4627 also offers a way to autodetect other Unicode encodings. |
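The autodetection mentioned here (RFC 4627, section 3) relies on the first two characters of a conforming JSON text being ASCII, so the pattern of zero bytes in the first four octets reveals the Unicode encoding. A minimal sketch of that heuristic (not the stdlib implementation):

```python
def detect_json_encoding(data: bytes) -> str:
    # Zero-byte patterns from RFC 4627 section 3:
    #   00 00 00 xx -> UTF-32BE    xx 00 00 00 -> UTF-32LE
    #   00 xx 00 xx -> UTF-16BE    xx 00 xx 00 -> UTF-16LE
    # Anything else (or fewer than 4 bytes, e.g. the non-standard b'1')
    # falls back to UTF-8, the default.
    if len(data) >= 4:
        if data[0] == 0 and data[1] == 0:
            return 'utf-32-be'
        if data[0] == 0:
            return 'utf-16-be'
        if data[1] == data[2] == data[3] == 0:
            return 'utf-32-le'
        if data[1] == 0:
            return 'utf-16-le'
    return 'utf-8'
```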
Well, adding support for bytes objects using the spec from RFC 4627 (or at least with utf-8 as a default) may be an enhancement for 3.3. |
Things are a little more complicated. '123' is not valid JSON according to RFC 4627 (the top-level element can only be an object or an array). This means that the autodetection algorithm will not always work for such non-standard data. If we can parse binary data, then there must be a way to generate binary data in at least one of the Unicode encodings. By the way, the documentation should give a link to RFC 4627 and explain that the current implementation differs from it. |
The autodetection algorithm needn't examine all of the first 4 bytes. If the 2 |
I mean a string that starts with '\u0000'. b'"\x00...'. |
On Thursday, 26 April 2012 at 15:48 +0000, Serhiy Storchaka wrote:
According to the RFC, that should be escaped: "All Unicode characters may be placed within the quotation marks except for the characters that must be escaped." And indeed:

```
>>> json.loads('"\u0000"')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/antoine/opt/lib/python3.2/json/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/home/antoine/opt/lib/python3.2/json/decoder.py", line 351, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/antoine/opt/lib/python3.2/json/decoder.py", line 367, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Invalid control character at: line 1 column 1 (char 1)
>>> json.loads('"\\u0000"')
'\x00'
```
|
According to the current implementation this is acceptable:

```
>>> json.loads('"\u0000"', strict=False)
'\x00'
```
|
Then perhaps auto-detection can be restricted to strict mode? Non-strict mode would always use utf-8. |
Related to this question is a question about errors. How do we inform the user if an error occurred while decoding with the detected encoding: leave the UnicodeDecodeError, or convert it to ValueError? If there is a syntax error in the JSON, the exception will refer to a position in the decoded string; should we translate it to the position in the original binary string? |
bpo-19837 is the complementary problem on the serialisation side: users migrating from Python 2 are accustomed to being able to use the json module directly as a wire protocol module, but the strict Python 3 interpretation as a text transform means that isn't possible; you have to apply the text encoding step separately.

What appears to have happened is that the way JSON is used in practice has diverged from JSON as a formal spec.

Formal spec (this is what the Py3k JSON module implements, and Py2 implements with ensure_ascii=False): JSON is a Unicode text transform, which may optionally be serialised as UTF-8, UTF-16 or UTF-32.

Practice (what the Py2 JSON module implements with ensure_ascii=True, and what is covered in RFC 4627): JSON is a UTF-8 encoded wire protocol.

So now we're left with the options: |
I'm currently leaning towards the "jsonb" module option, and deprecating the "encoding" argument in the pure text version. It's not pretty, but I think it's better than the alternatives. |
Bike-shedding: instead of jsonb, make it json.bytes. Else, it may get confused with other protocols, such as "JSONP" or "BSON". |
json.bytes would also work for me. It wouldn't need to replicate the full
If people want UTF-16 and UTF-32 *en*coding (which seem to be rarely used
|
This seems to be an issue (bug?) for Python 3.3. When calling json.loads() with a byte array, this is the error:

```
json.loads(response.data, 'latin-1')
TypeError: can't use a string pattern on a bytes-like object
```

When I decode the byte array to a string:

```
json.loads(response.data.decode(), 'latin-1')
TypeError: bytes or integer address expected instead of str instance
```
|
bpo-17909 (auto-detecting JSON encoding) looks like it has a patch which would probably satisfy this issue |
As Martin noted, Serhiy has implemented the autodetection option for json.loads in bpo-17909 so closing this one as out of date - UTF-8, UTF-16 and UTF-32 encoded JSON data will be deserialised automatically in 3.6, while other text encodings aren't officially supported by the JSON RFCs. |
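For reference, on Python 3.6 and later the original failing example now works, since json.loads() accepts bytes and bytearray directly:

```python
import json

# Since Python 3.6 (bpo-17909), json.loads() accepts bytes/bytearray and
# auto-detects UTF-8, UTF-16 and UTF-32 using the RFC 4627 heuristic
# (including BOM handling).
print(json.loads(b'123'))                     # the originally failing call -> 123
print(json.loads('[1, 2]'.encode('utf-16')))  # auto-detected UTF-16 -> [1, 2]
```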