accept bytes in json.loads() #55185

hhas · 2011-01-21T19:01:48Z

BPO	10976
Nosy	@loewis, @warsaw, @birkenfeld, @ncoghlan, @kousu, @ezio-melotti, @merwok, @bitdancer, @vadmium, @serhiy-storchaka, @jleedev
Superseder	bpo-17909: Autodetecting JSON encoding
Files	json.diff

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2016-09-10.10:21:47.110>
created_at = <Date 2011-01-21.19:01:47.736>
labels = ['type-feature', 'library']
title = 'accept bytes in json.loads()'
updated_at = <Date 2016-09-10.10:21:47.108>
user = 'https://bugs.python.org/hhas'

bugs.python.org fields:

activity = <Date 2016-09-10.10:21:47.108>
actor = 'ncoghlan'
assignee = 'none'
closed = True
closed_date = <Date 2016-09-10.10:21:47.110>
closer = 'ncoghlan'
components = ['Library (Lib)']
creation = <Date 2011-01-21.19:01:47.736>
creator = 'hhas'
dependencies = []
files = ['20481']
hgrepos = []
issue_num = 10976
keywords = ['patch']
message_count = 28.0
messages = ['126772', '126782', '126785', '126786', '126788', '126831', '126986', '126997', '133645', '133672', '145343', '145345', '159359', '159360', '159364', '159366', '159368', '159388', '159391', '159395', '159454', '159469', '204810', '204937', '204959', '215529', '229973', '275615']
nosy_count = 17.0
nosy_names = ['loewis', 'barry', 'georg.brandl', 'ncoghlan', 'hhas', 'kousu', 'ezio.melotti', 'eric.araujo', 'r.david.murray', 'cvrebert', 'docs@python', 'antlong', 'martin.panter', 'serhiy.storchaka', 'Balthazar.Rouberol', 'jleedev', 'Hanxue.Lee']
pr_nums = []
priority = 'normal'
resolution = 'out of date'
stage = 'needs patch'
status = 'closed'
superseder = '17909'
type = 'enhancement'
url = 'https://bugs.python.org/issue10976'
versions = ['Python 3.6']

hhas · 2011-01-21T19:01:48Z

json.loads() accepts strings but errors on bytes objects. Documentation and API indicate that both should work. Review of json/__init__.py code shows that the loads() function's 'encoding' arg is ignored and no decoding takes place before the object is passed to JSONDecoder.decode().

Tested on Python 3.1.2 and Python 3.2rc1; fails on both.

Example:

#################################################

#!/usr/local/bin/python3.2

import json

print(json.loads('123'))
# 123

print(json.loads(b'123'))
# /Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/json/decoder.py:325:  
#   TypeError: can't use a string pattern on a bytes-like object

print(json.loads(b'123', encoding='utf-8'))
# /Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/json/decoder.py:325:  
#   TypeError: can't use a string pattern on a bytes-like object

#################################################

Patch attached.

bitdancer · 2011-01-21T20:35:33Z

Hmm. According to bpo-4136, all bytes support was supposed to have been removed.

pitrou · 2011-01-21T20:46:49Z

Indeed, the documentation (and function docstring) needs fixing instead. It's a pity we didn't remove the useless encoding parameter.

merwok · 2011-01-21T20:54:35Z

Georg: Is it still time to deprecate the encoding parameter in 3.2?

pitrou · 2011-01-21T21:38:06Z

I've committed a doc fix in r88137.

hhas · 2011-01-22T12:28:33Z

Doc fix works for me.

antlong · 2011-01-25T03:38:50Z

Works for me, py2.7 on snow leopard.

bitdancer · 2011-01-25T11:42:31Z

Anthony: this is a Python 3-only problem.

ezio-melotti · 2011-04-13T07:23:28Z

Now it's too late for 3.2, should this be done for 3.3?

merwok · 2011-04-13T15:40:46Z

If you’re talking about deprecating the obsolete encoding argument (maybe it’s time for a new bug report), +1.

warsaw · 2011-10-11T13:44:48Z

I'll just mention that the elimination of bytes handling is a bit unfortunate, since this idiom which works in Python 2 no longer works:

fp = urlopen(url)
json_data = json.load(fp)

/me sad

pitrou · 2011-10-11T13:51:37Z

> I'll just mention that the elimination of bytes handling is a bit
> unfortunate, since this idiom which works in Python 2 no longer works:
>
> fp = urlopen(url)
> json_data = json.load(fp)

What if the returned JSON uses a charset other than utf-8 ?
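
For what it's worth, the charset question can be answered from the response itself when the server declares one. A minimal sketch, assuming a binary file-like object such as the one `urlopen()` returns; `load_json_binary` is a hypothetical helper name, and falling back to UTF-8 is an assumption based on the RFC 4627 default:

```python
import io
import json


def load_json_binary(fp, charset=None):
    """Read bytes from fp, decode them, and parse the result as JSON."""
    if charset is None:
        # urlopen() responses expose the declared charset through their headers;
        # plain binary files have no headers attribute, so charset stays None.
        headers = getattr(fp, "headers", None)
        if headers is not None:
            charset = headers.get_content_charset()
    # Assumption: UTF-8 when nothing was declared.
    return json.loads(fp.read().decode(charset or "utf-8"))


# Stand-in for a network response so the example runs offline.
fake_response = io.BytesIO('{"name": "café"}'.encode("latin-1"))
result = load_json_binary(fake_response, charset="latin-1")
```

With a real response, `load_json_binary(urlopen(url))` would pick up the charset from the Content-Type header if one is present.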

BalthazarRouberol · 2012-04-26T08:20:57Z

I know this does not fix anything at the core, but it would allow you to use json.loads() with python 3.2 (maybe 3.1?):

Replace
json.loads(raw_data)

by

raw_data = raw_data.decode('utf-8') # or whichever encoding the data actually uses
json.loads(raw_data)

serhiy-storchaka · 2012-04-26T08:34:32Z

> What if the returned JSON uses a charset other than utf-8 ?

According to RFC 4627: "JSON text SHALL be encoded in Unicode. The default encoding is UTF-8." RFC 4627 also offers a way to autodetect other Unicode encodings.

pitrou · 2012-04-26T13:03:55Z

Well, adding support for bytes objects using the spec from RFC 4627 (or at least with utf-8 as a default) may be an enhancement for 3.3.

serhiy-storchaka · 2012-04-26T14:07:46Z

Things are a little more complicated. '123' is not a valid JSON according to RFC 4627 (the top-level element can only be an object or an array). This means that the autodetection algorithm will not always work for such non-standard data.

If we can parse binary data, then there must be a way to generate binary data in at least one of the Unicode encodings.

By the way, the documentation should give a link to RFC 4627 and explain how the current implementation differs from it.

pitrou · 2012-04-26T14:21:40Z

> Things are a little more complicated. '123' is not a valid JSON
> according to RFC 4627 (the top-level element can only be an object or
> an array). This means that the autodetection algorithm will not always
> work for such non-standard data.

The autodetection algorithm needn't examine all 4 first bytes. If the 2
first bytes are non-zero, you have UTF-8 data. Otherwise, the JSON text
will be at least 4 bytes long (since it's either UTF-16 or UTF-32).
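
The detection described here can be sketched roughly as follows. This is an illustration of the null-byte table in RFC 4627 section 3 combined with the two-byte shortcut above, not the code that eventually landed; `detect_json_encoding` is a hypothetical name, byte order marks are not handled, and treating inputs shorter than four bytes as UTF-16 is an assumption made for non-standard top-level values such as '1':

```python
import json


def detect_json_encoding(data):
    """Guess the Unicode encoding of JSON bytes from the nulls in the first 4 bytes."""
    head = data[:4]
    # Shortcut: two non-zero leading bytes can only be UTF-8.
    if len(head) < 2 or (head[0] != 0 and head[1] != 0):
        return "utf-8"
    # Assumption: a 2- or 3-byte input containing a null is a lone UTF-16 unit.
    if len(head) < 4:
        return "utf-16-be" if head[0] == 0 else "utf-16-le"
    # Null patterns from RFC 4627 section 3 (True marks a zero byte).
    patterns = {
        (True, True, True, False): "utf-32-be",
        (True, False, True, False): "utf-16-be",
        (False, True, True, True): "utf-32-le",
        (False, True, False, True): "utf-16-le",
    }
    # Unknown patterns fall back to UTF-8 and are left for the parser to reject.
    return patterns.get(tuple(byte == 0 for byte in head), "utf-8")


encoded = "[1]".encode("utf-16-le")
decoded = json.loads(encoded.decode(detect_json_encoding(encoded)))
```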

serhiy-storchaka · 2012-04-26T15:48:23Z

I mean a string that starts with '\u0000'. b'"\x00...'.

pitrou · 2012-04-26T16:12:44Z

On Thursday, 26 April 2012 at 15:48 +0000, Serhiy Storchaka wrote:

> I mean a string that starts with '\u0000'. b'"\x00...'.

According to the RFC, that should be escaped:

All Unicode characters may be placed within the
quotation marks except for the characters that must be escaped:
quotation mark, reverse solidus, and the control characters (U+0000
through U+001F).

And indeed:

>>> json.loads('"\u0000"')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/antoine/opt/lib/python3.2/json/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/home/antoine/opt/lib/python3.2/json/decoder.py", line 351, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/antoine/opt/lib/python3.2/json/decoder.py", line 367, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Invalid control character at: line 1 column 1 (char 1)
>>> json.loads('"\\u0000"')
'\x00'

serhiy-storchaka · 2012-04-26T16:21:35Z

According to current implementation this is acceptable.

>>> json.loads('"\u0000"', strict=False)
'\x00'
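
To make the concern concrete, here is a sketch of how a raw NUL inside a string, which non-strict mode accepts, can make UTF-8 data look like UTF-16 to a null-byte heuristic. The payload is made up for illustration:

```python
import json

# UTF-8 bytes for a JSON string holding raw NUL characters: " NUL a NUL "
payload = b'"\x00a\x00"'

# Non-strict mode accepts the raw control characters once decoded as UTF-8.
value = json.loads(payload.decode("utf-8"), strict=False)

# The null pattern of the first four bytes (xx 00 xx 00) matches the
# UTF-16LE row of the RFC 4627 detection table, not the UTF-8 row.
null_pattern = [byte == 0 for byte in payload[:4]]

# Trusting that guess fails: five bytes cannot be whole UTF-16 code units.
try:
    payload.decode("utf-16-le")
    utf16_decodes = True
except UnicodeDecodeError:
    utf16_decodes = False
```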

pitrou · 2012-04-27T14:06:13Z

> According to current implementation this is acceptable.

Then perhaps auto-detection can be restricted to strict mode? Non-strict mode would always use utf-8.
Or we can just skip auto-detection altogether (I don't think many people produce utf-16 or utf-32 JSON; that would be a waste of bandwidth for no obvious benefit).

serhiy-storchaka · 2012-04-27T15:28:06Z

Related to this is the question of errors. How should the user be informed if an error occurs while decoding with the detected encoding? Should UnicodeDecodeError be left as-is or converted to ValueError? And if there is a syntax error in the JSON, the exception will refer to a position in the decoded string; should we translate it to a position in the original binary string?
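
One possible answer, sketched under assumptions: UnicodeDecodeError is already a subclass of ValueError, so leaving it alone still lets callers catch a single exception type. For syntax errors, a character position can be mapped back to a byte offset by re-encoding the text that precedes it. `loads_bytes` is a hypothetical helper, and `json.JSONDecodeError` with its `pos` attribute exists only in Python 3.5 and later:

```python
import json


def loads_bytes(data, encoding="utf-8"):
    """Parse JSON bytes, reporting syntax errors as byte offsets."""
    # May raise UnicodeDecodeError, which is itself a ValueError.
    text = data.decode(encoding)
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        # Re-encode the prefix to turn the character index into a byte offset.
        # Assumption: encoding names a BOM-free codec (e.g. "utf-16-le", not
        # "utf-16"), otherwise the encoder would add a BOM and skew the count.
        byte_pos = len(text[:exc.pos].encode(encoding))
        raise ValueError(f"{exc.msg} at byte offset {byte_pos}") from exc
```

Whether the extra re-encoding is worth it is a design choice; it only runs on the error path, so the cost stays out of the normal case.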

ncoghlan · 2013-11-30T14:06:01Z

bpo-19837 is the complementary problem on the serialisation side - users migrating from Python 2 are accustomed to being able to use the json module directly as a wire protocol module, but the strict Python 3 interpretation as a text transform means that isn't possible - you have to apply the text encoding step separately.

What appears to have happened is that the way JSON is used in practice has diverged from JSON as a formal spec.

Formal spec (this is what the Py3k JSON module implements, and Py2 implements with ensure_ascii=False): JSON is a Unicode text transform, which may optionally be serialised as UTF-8, UTF-16 or UTF-32.

Practice (what the Py2 JSON module implements with ensure_ascii=True, and what is covered in RFC 4627): JSON is a UTF-8 encoded wire protocol

So now we're left with the options:

- try to tweak the existing json APIs to handle both the str<->str and str<->bytes use cases (ugly)
- add new APIs within the existing json module
- add a new "jsonb" module, which dumps to UTF-8 encoded bytes, and reads from UTF-8, UTF-16 or UTF-32 encoded bytes in accordance with RFC 4627 (but being more tolerant in terms of what is allowed at the top level)

I'm currently leaning towards the "jsonb" module option, and deprecating the "encoding" argument in the pure text version. It's not pretty, but I think it's better than the alternatives.

loewis · 2013-12-01T15:39:43Z

Bike-shedding: instead of jsonb, make it json.bytes. Else, it may get confused with other protocols, such as "JSONP" or "BSON".

ncoghlan · 2013-12-01T20:57:47Z

json.bytes would also work for me. It wouldn't need to replicate the full
main module API, just combine the text transform with UTF-8 encoding and
decoding (as well as autodetected UTF-16 and UTF-32 decoding) for the main
4 functions (dump[s], load[s]).

If people want UTF-16 and UTF-32 *en*coding (which seem to be rarely used
in combination with JSON), then they can invoke the text transform version
directly, and then do a separate encoding step.
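
As a rough illustration of the shape such a bytes-oriented layer might take: the names `dumps_bytes` and `loads_bytes` are hypothetical, no json.bytes module exists, and this sketch decodes UTF-8 only, leaving out the UTF-16/UTF-32 autodetection for brevity:

```python
import json


def dumps_bytes(obj, **kwargs):
    """Serialise obj with the text transform, then encode the result as UTF-8."""
    return json.dumps(obj, **kwargs).encode("utf-8")


def loads_bytes(data, **kwargs):
    """Decode UTF-8 bytes, then apply the text transform."""
    return json.loads(data.decode("utf-8"), **kwargs)


wire = dumps_bytes({"greeting": "héllo"}, ensure_ascii=False)
roundtrip = loads_bytes(wire)
```

Anyone needing UTF-16 or UTF-32 output would still call `json.dumps()` and encode the string separately, as described above.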

HanxueLee · 2014-04-04T15:23:44Z

This seems to be an issue (bug?) for Python 3.3. When calling json.loads() with a byte array, this is the error:

json.loads(response.data, 'latin-1')

TypeError: can't use a string pattern on a bytes-like object

When I decode the byte array to string

json.loads(response.data.decode(), 'latin-1')

I get this error

TypeError: bytes or integer address expected instead of str instance

vadmium · 2014-10-25T01:10:47Z

bpo-17909 (auto-detecting JSON encoding) looks like it has a patch which would probably satisfy this issue

ncoghlan · 2016-09-10T10:21:47Z

As Martin noted, Serhiy has implemented the autodetection option for json.loads in bpo-17909 so closing this one as out of date - UTF-8, UTF-16 and UTF-32 encoded JSON data will be deserialised automatically in 3.6, while other text encodings aren't officially supported by the JSON RFCs.
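
A short check of the behaviour described above, assuming Python 3.6 or later; the sample document is made up for illustration:

```python
import json

document = '{"a": 1}'

# bytes input is accepted directly; the encoding is detected from the data.
utf8_value = json.loads(document.encode("utf-8"))
utf16_value = json.loads(document.encode("utf-16-le"))
utf32_value = json.loads(document.encode("utf-32-be"))
```

Data in any other encoding, such as Latin-1, still has to be decoded to str by the caller before it is passed to json.loads().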

hhas mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Jan 21, 2011

pitrou added docs Documentation in the Doc dir and removed stdlib Python modules in the Lib dir labels Jan 21, 2011

pitrou assigned docspython Jan 21, 2011

pitrou added stdlib Python modules in the Lib dir and removed docs Documentation in the Doc dir labels Apr 26, 2012

pitrou unassigned docspython Apr 26, 2012

pitrou added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Apr 26, 2012

merwok changed the title ~~json.loads() throws TypeError on bytes object~~ json.loads() raises TypeError on bytes object Apr 26, 2012

vstinner changed the title ~~json.loads() raises TypeError on bytes object~~ accept bytes in json.loads() Aug 17, 2016

ncoghlan closed this as completed Sep 10, 2016

ezio-melotti transferred this issue from another repository Apr 10, 2022
