accept bytes in json.loads() #55185

Closed
hhas mannequin opened this issue Jan 21, 2011 · 28 comments
Labels
stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)

Comments

@hhas
Mannequin

hhas mannequin commented Jan 21, 2011

BPO 10976
Nosy @loewis, @warsaw, @birkenfeld, @ncoghlan, @kousu, @ezio-melotti, @merwok, @bitdancer, @vadmium, @serhiy-storchaka, @jleedev
Superseder: bpo-17909 (Autodetecting JSON encoding)
Files: json.diff
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2016-09-10.10:21:47.110>
    created_at = <Date 2011-01-21.19:01:47.736>
    labels = ['type-feature', 'library']
    title = 'accept bytes in json.loads()'
    updated_at = <Date 2016-09-10.10:21:47.108>
    user = 'https://bugs.python.org/hhas'

    bugs.python.org fields:

    activity = <Date 2016-09-10.10:21:47.108>
    actor = 'ncoghlan'
    assignee = 'none'
    closed = True
    closed_date = <Date 2016-09-10.10:21:47.110>
    closer = 'ncoghlan'
    components = ['Library (Lib)']
    creation = <Date 2011-01-21.19:01:47.736>
    creator = 'hhas'
    dependencies = []
    files = ['20481']
    hgrepos = []
    issue_num = 10976
    keywords = ['patch']
    message_count = 28.0
    messages = ['126772', '126782', '126785', '126786', '126788', '126831', '126986', '126997', '133645', '133672', '145343', '145345', '159359', '159360', '159364', '159366', '159368', '159388', '159391', '159395', '159454', '159469', '204810', '204937', '204959', '215529', '229973', '275615']
    nosy_count = 17.0
    nosy_names = ['loewis', 'barry', 'georg.brandl', 'ncoghlan', 'hhas', 'kousu', 'ezio.melotti', 'eric.araujo', 'r.david.murray', 'cvrebert', 'docs@python', 'antlong', 'martin.panter', 'serhiy.storchaka', 'Balthazar.Rouberol', 'jleedev', 'Hanxue.Lee']
    pr_nums = []
    priority = 'normal'
    resolution = 'out of date'
    stage = 'needs patch'
    status = 'closed'
    superseder = '17909'
    type = 'enhancement'
    url = 'https://bugs.python.org/issue10976'
    versions = ['Python 3.6']

    @hhas
    Mannequin Author

    hhas mannequin commented Jan 21, 2011

    json.loads() accepts strings but errors on bytes objects. Documentation and API indicate that both should work. Review of the json/__init__.py code shows that the loads() function's 'encoding' arg is ignored and no decoding takes place before the object is passed to JSONDecoder.decode().

    Tested on Python 3.1.2 and Python 3.2rc1; fails on both.

    Example:

    #################################################

    #!/usr/local/bin/python3.2

    import json
    
    print(json.loads('123'))
    # 123
    
    print(json.loads(b'123'))
    # /Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/json/decoder.py:325:  
    #   TypeError: can't use a string pattern on a bytes-like object
    
    print(json.loads(b'123', encoding='utf-8'))
    # /Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/json/decoder.py:325:  
    #   TypeError: can't use a string pattern on a bytes-like object

    #################################################

    Patch attached.

    @hhas hhas mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Jan 21, 2011
    @bitdancer
    Member

    Hmm. According to bpo-4136, all bytes support was supposed to have been removed.

    @pitrou
    Member

    pitrou commented Jan 21, 2011

    Indeed, the documentation (and function docstring) needs fixing instead. It's a pity we didn't remove the useless encoding parameter.

    @pitrou pitrou added docs Documentation in the Doc dir and removed stdlib Python modules in the Lib dir labels Jan 21, 2011
    @merwok
    Member

    merwok commented Jan 21, 2011

    Georg: Is it still time to deprecate the encoding parameter in 3.2?

    @pitrou
    Member

    pitrou commented Jan 21, 2011

    I've committed a doc fix in r88137.

    @hhas
    Mannequin Author

    hhas mannequin commented Jan 22, 2011

    Doc fix works for me.

    @antlong
    Mannequin

    antlong mannequin commented Jan 25, 2011

    Works for me, py2.7 on snow leopard.

    @bitdancer
    Member

    anthony: this is a Python 3-only problem.

    @ezio-melotti
    Member

    Now it's too late for 3.2; should this be done for 3.3?

    @merwok
    Member

    merwok commented Apr 13, 2011

    If you’re talking about deprecating the obsolete encoding argument (maybe it’s time for a new bug report), +1.

    @warsaw
    Member

    warsaw commented Oct 11, 2011

    I'll just mention that the elimination of bytes handling is a bit unfortunate, since this idiom which works in Python 2 no longer works:

    fp = urlopen(url)
    json_data = json.load(fp)

    /me sad
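
One way to keep the Python 2 idiom working is to wrap the binary stream in a text decoder before handing it to json.load(). This is a minimal sketch that assumes the response body is UTF-8. The helper name `load_json_from_binary` is hypothetical, and `io.BytesIO` stands in for the object returned by `urlopen()`.

```python
import codecs
import io
import json


def load_json_from_binary(fp, encoding="utf-8"):
    """Decode a binary file-like object on the fly and parse it as JSON."""
    # codecs.getreader() returns a StreamReader class; instantiating it
    # wraps fp so that read() yields str instead of bytes.
    text_stream = codecs.getreader(encoding)(fp)
    return json.load(text_stream)


# io.BytesIO stands in for an HTTP response here (assumption: UTF-8 body).
fake_response = io.BytesIO('{"name": "café"}'.encode("utf-8"))
result = load_json_from_binary(fake_response)
print(result)
```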

    @pitrou
    Member

    pitrou commented Oct 11, 2011

    > I'll just mention that the elimination of bytes handling is a bit
    > unfortunate, since this idiom which works in Python 2 no longer works:
    >
    > fp = urlopen(url)
    > json_data = json.load(fp)

    What if the returned JSON uses a charset other than utf-8?
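
For HTTP responses, one possible answer is to take the charset from the Content-Type header and fall back to UTF-8 when none is given. A minimal sketch, with `decode_json_body` as a hypothetical helper and a hand-built `email.message.Message` standing in for real response headers:

```python
import json
from email.message import Message


def decode_json_body(body, headers, default="utf-8"):
    """Decode body using the header's charset, then parse it as JSON."""
    # get_content_charset() returns None when no charset parameter is present.
    charset = headers.get_content_charset() or default
    return json.loads(body.decode(charset))


# Hypothetical headers announcing a non-UTF-8 charset.
headers = Message()
headers["Content-Type"] = "application/json; charset=iso-8859-1"
latin1_result = decode_json_body('"café"'.encode("iso-8859-1"), headers)

# No Content-Type at all: the helper falls back to UTF-8.
utf8_result = decode_json_body('"café"'.encode("utf-8"), Message())
```

With `urllib.request.urlopen()`, the response's `headers` attribute is a subclass of `Message`, so `response.headers.get_content_charset()` should behave the same way.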

    @BalthazarRouberol
    Mannequin

    BalthazarRouberol mannequin commented Apr 26, 2012

    I know this does not fix anything at the core, but it would allow you to use json.loads() with python 3.2 (maybe 3.1?):

    Replace
    json.loads(raw_data)

    by

    raw_data = raw_data.decode('utf-8')  # or whichever encoding the data actually uses
    json.loads(raw_data)

    @serhiy-storchaka
    Member

    > What if the returned JSON uses a charset other than utf-8?

    According to RFC 4627: "JSON text SHALL be encoded in Unicode. The default encoding is UTF-8." RFC 4627 also offers a way to autodetect other Unicode encodings.

    @pitrou
    Member

    pitrou commented Apr 26, 2012

    Well, adding support for bytes objects using the spec from RFC 4627 (or at least with utf-8 as a default) may be an enhancement for 3.3.

    @pitrou pitrou added stdlib Python modules in the Lib dir and removed docs Documentation in the Doc dir labels Apr 26, 2012
    @pitrou pitrou added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Apr 26, 2012
    @serhiy-storchaka
    Member

    Things are a little more complicated. '123' is not valid JSON according to RFC 4627 (the top-level element can only be an object or an array). This means that the autodetection algorithm will not always work for such non-standard data.

    If we can parse binary data, then there must be a way to generate binary data in at least one of the Unicode encodings.

    By the way, the documentation should give a link to RFC 4627 and explain how the current implementation differs from it.

    @pitrou
    Member

    pitrou commented Apr 26, 2012

    > Things are a little more complicated. '123' is not a valid JSON
    > according to RFC 4627 (the top-level element can only be an object or
    > an array). This means that the autodetection algorithm will not always
    > work for such non-standard data.

    The autodetection algorithm needn't examine all of the first 4 bytes. If the first 2
    bytes are non-zero, you have UTF-8 data. Otherwise, the JSON text
    will be at least 4 bytes long (since it's either UTF-16 or UTF-32).
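
The shortcut described here can be sketched as follows. `sniff_json_encoding` is a hypothetical name; the function assumes the text starts with an ASCII character (as RFC 4627 JSON does) and ignores byte order marks.

```python
import json


def sniff_json_encoding(data):
    """Guess the Unicode encoding of JSON bytes from their NUL-byte pattern."""
    # Two leading non-zero bytes can only mean UTF-8.
    if len(data) < 2 or (data[0] and data[1]):
        return "utf-8"
    head = data[:4]
    if len(head) == 4:
        if head[:3] == b"\x00\x00\x00":
            return "utf-32-be"  # 00 00 00 xx
        if head[1:] == b"\x00\x00\x00":
            return "utf-32-le"  # xx 00 00 00
    # Remaining patterns: 00 xx ... is big-endian, xx 00 ... is little-endian.
    return "utf-16-be" if data[0] == 0 else "utf-16-le"


for enc in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    raw = '{"a": 1}'.encode(enc)
    print(enc, sniff_json_encoding(raw), json.loads(raw.decode(sniff_json_encoding(raw))))
```

UTF-8 input whose second byte is a NUL would be misclassified as UTF-16 by this sketch; that can only happen with text the RFC does not allow.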

    @merwok merwok changed the title from "json.loads() throws TypeError on bytes object" to "json.loads() raises TypeError on bytes object" Apr 26, 2012
    @serhiy-storchaka
    Member

    I mean a string that starts with '\u0000'. b'"\x00...'.

    @pitrou
    Member

    pitrou commented Apr 26, 2012

    On Thursday 26 April 2012 at 15:48 +0000, Serhiy Storchaka wrote:

    > I mean a string that starts with '\u0000'. b'"\x00...'.

    According to the RFC, that should be escaped:

    All Unicode characters may be placed within the
    quotation marks except for the characters that must be escaped:
    quotation mark, reverse solidus, and the control characters (U+0000
    through U+001F).

    And indeed:

    >>> json.loads('"\u0000"')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/antoine/opt/lib/python3.2/json/__init__.py", line 307, in loads
        return _default_decoder.decode(s)
      File "/home/antoine/opt/lib/python3.2/json/decoder.py", line 351, in decode
        obj, end = self.raw_decode(s, idx=_w(s, 0).end())
      File "/home/antoine/opt/lib/python3.2/json/decoder.py", line 367, in raw_decode
        obj, end = self.scan_once(s, idx)
    ValueError: Invalid control character at: line 1 column 1 (char 1)
    >>> json.loads('"\\u0000"')
    '\x00'

    @serhiy-storchaka
    Member

    According to current implementation this is acceptable.

    >>> json.loads('"\u0000"', strict=False)
    '\x00'

    @pitrou
    Member

    pitrou commented Apr 27, 2012

    > According to current implementation this is acceptable.

    Then perhaps auto-detection can be restricted to strict mode? Non-strict mode would always use utf-8.
    Or we can just skip auto-detection altogether (I don't think many people produce utf-16 or utf-32 JSON; that would be a waste of bandwidth for no obvious benefit).

    @serhiy-storchaka
    Member

    A related question concerns errors. How should the user be informed if an error occurs while decoding with the detected encoding: leave the UnicodeDecodeError as is, or convert it to ValueError? And if there is a syntax error in the JSON, the exception will refer to a position in the decoded string; should we translate it to a position in the original binary string?
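
Some context for both questions. In Python 3, `UnicodeDecodeError` is already a subclass of `ValueError`, so a single `except ValueError` covers both failure modes. A character position can also be mapped back to a byte offset by re-encoding the text before it. The sketch below assumes UTF-8 input and Python 3.5 or later (for `json.JSONDecodeError`); `parse_or_report` is a hypothetical helper.

```python
import json


def parse_or_report(data, encoding="utf-8"):
    """Return (value, None) on success or (None, description) on failure."""
    try:
        text = data.decode(encoding)
        return json.loads(text), None
    except UnicodeDecodeError as exc:
        # exc.start is already a byte offset into the original data.
        return None, "bad %s at byte %d" % (encoding, exc.start)
    except json.JSONDecodeError as exc:
        # exc.pos counts characters; re-encode the prefix to count bytes.
        byte_offset = len(text[: exc.pos].encode(encoding))
        return None, "bad JSON at byte %d" % byte_offset


print(issubclass(UnicodeDecodeError, ValueError))
print(parse_or_report(b"\xff"))
print(parse_or_report('["é", x]'.encode("utf-8")))
```

Re-encoding the prefix is only reliable for codecs that do not prepend a byte order mark, which is one reason this sketch sticks to UTF-8.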

    @ncoghlan
    Contributor

    bpo-19837 is the complementary problem on the serialisation side - users migrating from Python 2 are accustomed to being able to use the json module directly as a wire protocol module, but the strict Python 3 interpretation as a text transform means that isn't possible - you have to apply the text encoding step separately.

    What appears to have happened is that the way JSON is used in practice has diverged from JSON as a formal spec.

    Formal spec (this is what the Py3k JSON module implements, and Py2 implements with ensure_ascii=False): JSON is a Unicode text transform, which may optionally be serialised as UTF-8, UTF-16 or UTF-32.

    Practice (what the Py2 JSON module implements with ensure_ascii=True, and what is covered in RFC 4627): JSON is a UTF-8 encoded wire protocol.

    So now we're left with the options:

    • try to tweak the existing json APIs to handle both the str<->str and str<->bytes use cases (ugly)
    • add new APIs within the existing json module
    • add a new "jsonb" module, which dumps to UTF-8 encoded bytes, and reads from UTF-8, UTF-16 or UTF-32 encoded bytes in accordance with RFC 4627 (but being more tolerant in terms of what is allowed at the top level)

    I'm currently leaning towards the "jsonb" module option, and deprecating the "encoding" argument in the pure text version. It's not pretty, but I think it's better than the alternatives.

    @loewis
    Mannequin

    loewis mannequin commented Dec 1, 2013

    Bike-shedding: instead of jsonb, make it json.bytes. Else, it may get confused with other protocols, such as "JSONP" or "BSON".

    @ncoghlan
    Contributor

    ncoghlan commented Dec 1, 2013

    json.bytes would also work for me. It wouldn't need to replicate the full
    main module API, just combine the text transform with UTF-8 encoding and
    decoding (as well as autodetected UTF-16 and UTF-32 decoding) for the main
    4 functions (dump[s], load[s]).

    If people want UTF-16 and UTF-32 *en*coding (which seem to be rarely used
    in combination with JSON), then they can invoke the text transform version
    directly, and then do a separate encoding step.
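
A rough sketch of the two string-level functions such a bytes-oriented API might offer. The names `dumps_bytes` and `loads_bytes` are hypothetical, and this version only decodes UTF-8; the UTF-16 and UTF-32 autodetection is left out for brevity.

```python
import json


def dumps_bytes(obj, **kwargs):
    """Serialise obj to UTF-8 encoded JSON bytes."""
    # ensure_ascii=False emits non-ASCII characters directly instead of
    # \uXXXX escapes, so the UTF-8 encoding step does the real work.
    kwargs.setdefault("ensure_ascii", False)
    return json.dumps(obj, **kwargs).encode("utf-8")


def loads_bytes(data, **kwargs):
    """Parse UTF-8 encoded JSON bytes (no encoding autodetection here)."""
    return json.loads(data.decode("utf-8"), **kwargs)


payload = dumps_bytes({"city": "Zürich"})
round_trip = loads_bytes(payload)
print(payload, round_trip)
```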

    @HanxueLee
    Mannequin

    HanxueLee mannequin commented Apr 4, 2014

    This seems to be an issue (bug?) for Python 3.3. When calling json.loads() with a byte array, this is the error:

    json.loads(response.data, 'latin-1')

    TypeError: can't use a string pattern on a bytes-like object

    When I decode the byte array to string

    json.loads(response.data.decode(), 'latin-1')

    I get this error

    TypeError: bytes or integer address expected instead of str instance

    @vadmium
    Member

    vadmium commented Oct 25, 2014

    bpo-17909 (auto-detecting JSON encoding) looks like it has a patch which would probably satisfy this issue

    @vstinner vstinner changed the title from "json.loads() raises TypeError on bytes object" to "accept bytes in json.loads()" Aug 17, 2016
    @ncoghlan
    Contributor

    As Martin noted, Serhiy has implemented the autodetection option for json.loads in bpo-17909 so closing this one as out of date - UTF-8, UTF-16 and UTF-32 encoded JSON data will be deserialised automatically in 3.6, while other text encodings aren't officially supported by the JSON RFCs.
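
A quick check of that behaviour, assuming Python 3.6 or later:

```python
import json

# bytes input is decoded automatically (UTF-8, UTF-16 or UTF-32).
from_utf8 = json.loads(b'{"answer": 42}')
from_utf16 = json.loads('{"answer": 42}'.encode("utf-16"))
from_utf32 = json.loads('{"answer": 42}'.encode("utf-32-le"))

# Other encodings are not detected and still need an explicit decode step.
try:
    json.loads('"café"'.encode("latin-1"))
    latin1_rejected = False
except UnicodeDecodeError:
    latin1_rejected = True

from_latin1 = json.loads('"café"'.encode("latin-1").decode("latin-1"))
print(from_utf8, from_utf16, from_utf32, latin1_rejected, from_latin1)
```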

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022

    8 participants