json.dumps not parsable by json.loads (on Linux only) #55698

Closed
BrianMerrell mannequin opened this issue Mar 13, 2011 · 21 comments
Assignees
Labels
extension-modules (C modules in the Modules dir), OS-windows, stdlib (Python modules in the Lib dir), topic-unicode, type-bug (An unexpected behavior, bug, or error)

Comments

@BrianMerrell
Mannequin

BrianMerrell mannequin commented Mar 13, 2011

BPO 11489
Nosy @rhettinger, @etrepum, @abalkin, @pitrou, @vstinner, @ezio-melotti, @akheron, @serhiy-storchaka
Files
  • issue11489.diff: Failing test (2.7)
  • json_decode_lone_surrogates_2.patch: Patch for 3.4
  • json_decode_lone_surrogates_2-2.7.patch: Patch for 2.7
  • json_decode_lone_surrogates_3-3.4.patch
  • test_json_surrogates.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2013-12-01.15:42:22.563>
    created_at = <Date 2011-03-13.23:17:19.745>
    labels = ['extension-modules', 'type-bug', 'library', 'expert-unicode', 'OS-windows']
    title = 'json.dumps not parsable by json.loads (on Linux only)'
    updated_at = <Date 2013-12-01.15:42:22.562>
    user = 'https://bugs.python.org/BrianMerrell'

    bugs.python.org fields:

    activity = <Date 2013-12-01.15:42:22.562>
    actor = 'serhiy.storchaka'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2013-12-01.15:42:22.563>
    closer = 'serhiy.storchaka'
    components = ['Extension Modules', 'Library (Lib)', 'Unicode', 'Windows']
    creation = <Date 2011-03-13.23:17:19.745>
    creator = 'Brian.Merrell'
    dependencies = []
    files = ['27369', '30234', '30235', '32718', '32922']
    hgrepos = []
    issue_num = 11489
    keywords = ['patch']
    message_count = 21.0
    messages = ['130779', '130846', '130862', '130889', '130891', '133662', '144646', '169263', '169283', '171684', '174484', '188867', '189055', '200071', '203470', '204516', '204882', '204883', '204909', '204927', '204936']
    nosy_count = 14.0
    nosy_names = ['rhettinger', 'bob.ippolito', 'belopolsky', 'pitrou', 'vstinner', 'ezio.melotti', 'Arfrever', 'merrellb', 'python-dev', 'Brian.Merrell', 'petri.lehtinen', 'tchrist', 'serhiy.storchaka', 'taras.prokopenko']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue11489'
    versions = ['Python 2.7', 'Python 3.3', 'Python 3.4']

    The following works on Win7x64 Python 2.6.5 and breaks on Ubuntu 10.04x64-2.6.5. This raises three issues:

    1. Shouldn't anything generated by json.dumps be parsed by json.loads?
    2. It appears this is an invalid unicode character. Shouldn't this be caught by decode("utf8")?
    3. Why does Windows raise no issue with this and Linux does?

    import json
    unicode_bytes = '\xed\xa8\x80'
    unicode_string = unicode_bytes.decode("utf8")
    json_encoded = json.dumps({"my_key": unicode_string})
    json.loads(json_encoded)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.6/json/__init__.py", line 307, in loads
        return _default_decoder.decode(s)
      File "/usr/lib/python2.6/json/decoder.py", line 319, in decode
        obj, end = self.raw_decode(s, idx=_w(s, 0).end())
      File "/usr/lib/python2.6/json/decoder.py", line 336, in raw_decode
        obj, end = self._scanner.iterscan(s, **kw).next()
      File "/usr/lib/python2.6/json/scanner.py", line 55, in iterscan
        rval, next_pos = action(m, context)
      File "/usr/lib/python2.6/json/decoder.py", line 183, in JSONObject
        value, end = iterscan(s, idx=end, context=context).next()
      File "/usr/lib/python2.6/json/scanner.py", line 55, in iterscan
        rval, next_pos = action(m, context)
      File "/usr/lib/python2.6/json/decoder.py", line 155, in JSONString
        return scanstring(match.string, match.end(), encoding, strict)
    ValueError: Invalid \uXXXX escape: line 1 column 14 (char 14)

    @BrianMerrell BrianMerrell mannequin added stdlib Python modules in the Lib dir OS-windows topic-unicode type-bug An unexpected behavior, bug, or error labels Mar 13, 2011
    @abalkin
    Member

    abalkin commented Mar 14, 2011

    > It appears this is an invalid unicode character.
    > Shouldn't this be caught by decode("utf8")

    It should and it is in Python 3.x:

    >>> b'\xed\xa8\x80'.decode("utf8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid continuation byte

    Python 2.7 behavior seems to be a bug.

    >>> '\xed\xa8\x80'.decode("utf8")
    u'\uda00'

    Note also the following difference:

    In 3.x:

    >>> b'\xed\xa8\x80'.decode("utf8", 'replace')
    '��'

    In 2.7:

    >>> '\xed\xa8\x80'.decode("utf8", 'replace')
    u'\uda00'

    I am not sure this should be fixed in 2.x. Lone surrogates seem to round-trip just fine in 2.x and there is likely to be existing code that relies on this.

    > Shouldn't anything generated by json.dumps be parsed by json.loads?

    This on the other hand should probably be fixed by either rejecting lone surrogates in json.dumps or accepting them in json.loads or both. The last alternative would be consistent with the common wisdom of being conservative in what you produce but liberal in what you accept.
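For illustration, a minimal sketch of the decoding difference described above, assuming a current Python 3 interpreter (the variable names are only for this example). The strict UTF-8 codec rejects the bytes, while the `surrogatepass` error handler reproduces the permissive result shown for 2.7:

```python
raw = b'\xed\xa8\x80'  # UTF-8-style byte sequence for the lone surrogate U+DA00

# The strict codec (the default) refuses to decode a surrogate.
try:
    raw.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# 'surrogatepass' opts in to the permissive behaviour.
permissive = raw.decode('utf-8', 'surrogatepass')

print(strict_ok)          # False
print(ascii(permissive))  # '\uda00'
```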

    @BrianMerrell
    Mannequin Author

    BrianMerrell mannequin commented Mar 14, 2011

    > I am not sure this should be fixed in 2.x. Lone surrogates seem to round-trip just fine in 2.x and there likely to be existing code that relies on this.

    I generally agree, but am then at a loss as to how to detect and deal with lone surrogates (e.g. "ignore", "replace", etc.) in 2.x when interacting with services/libraries (such as Python's own json.loads) that take a stricter view.

    > Shouldn't anything generated by json.dumps be parsed by json.loads?

    > This on the other hand should probably be fixed by either rejecting lone surrogates in json.dumps or accepting them in json.loads or both. The last alternative would be consistent with the common wisdom of being conservative in what you produce but liberal in what you accept.

    We seem to be in the worst of both worlds right now as I've generated and stored a lot of json that can not be read back in. Could the JSON library simply leverage Python's Unicode interpreter instead of performing its own validation? We could pass it "ignore", "replace", etc. Regardless, I think we certainly need to remove the strict JSON loads() validation especially when it isn't enforced by dumps().

    @rhettinger
    Contributor

    > We seem to be in the worst of both worlds right now
    > as I've generated and stored a lot of json that can
    > not be read back in

    This is unfortunate. The dumps() should have never worked in the first place.

    I don't think that loads() should be changed to accommodate the dumps() error though. JSON is UTF-8 by definition and it is a useful feature that invalid UTF-8 won't load.

    To fix the data you've already created (data that other compliant JSON readers wouldn't be able to parse), I think you need to reprocess those files to make them valid:

    bs.decode('utf-8', errors='ignore').encode('utf-8')

    Then we need to fix dumps so that it doesn't silently create invalid JSON.

    > This on the other hand should probably be
    > fixed by either rejecting lone surrogates
    > in json.dumps or accepting them in json.loads or both.

    Rejection is the right way to go. For the most part,
    it is never helpful to create invalid JSON files that
    other readers can't and shouldn't read.
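A minimal sketch of what rejecting lone surrogates at dump time could look like, assuming Python 3. `dumps_strict` is a hypothetical helper written for this example; it is not part of the json module and not the fix that was eventually committed for this issue:

```python
import json

def dumps_strict(obj):
    """Hypothetical helper: serialise obj, refusing lone surrogates."""
    # With ensure_ascii=False, surrogates are emitted unescaped...
    text = json.dumps(obj, ensure_ascii=False)
    # ...so encoding to UTF-8 raises UnicodeEncodeError if one is present.
    text.encode('utf-8')
    return text

print(dumps_strict({"my_key": "ok"}))  # {"my_key": "ok"}

try:
    dumps_strict({"my_key": "\uda00"})
    rejected = False
except UnicodeEncodeError:
    rejected = True
print(rejected)  # True
```

Because the check runs on the serialised text, it covers keys as well as values.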

    @merrellb
    Mannequin

    merrellb mannequin commented Mar 14, 2011

    On Mon, Mar 14, 2011 at 4:09 PM, Raymond Hettinger
    <report@bugs.python.org> wrote:

    > Raymond Hettinger <rhettinger@users.sourceforge.net> added the comment:
    >
    > > We seem to be in the worst of both worlds right now
    > > as I've generated and stored a lot of json that can
    > > not be read back in
    >
    > This is unfortunate. The dumps() should have never worked in the first
    > place.
    >
    > I don't think that loads() should be changed to accommodate the dumps()
    > error though. JSON is UTF-8 by definition and it is a useful feature that
    > invalid UTF-8 won't load.

    I may be wrong, but it appeared that json actually encoded the data as the
    string "\uda00" (i.e. six bytes), which is slightly different from the
    UTF-8 encoding of the JSON itself. I am not sure whether this is relevant,
    but it seems less severe than actually invalid UTF-8 in the bytes.

    Unfortunately, I don't believe the suggested decode/encode round trip does
    anything on Python 2.x, as only Python 3.x encode/decode flags this as invalid.

    ----------
    nosy: +rhettinger
    priority: normal -> high


    Python tracker <report@bugs.python.org>
    <http://bugs.python.org/issue11489>


    @vstinner
    Member

    print(repr(json.loads(json.dumps({u"my_key": u'\uda00'}))['my_key'])):

    • displays u'\uda00' in Python 2.7, 3.2 and 3.3
    • raises a ValueError('Invalid \uXXXX escape: ...') on loads() in Python 2.6
    • raises a ValueError('Unpaired high surrogate: ...') on loads() in Python 3.1

    json version changed in Python 2.7: see the issue bpo-4136.

    See also this important change in simplejson:
    http://code.google.com/p/simplejson/source/detail?r=113

    We only fix security bugs in Python 2.6, not bugs. I don't think that this issue is a security bug in Python 2.6.

    We might change Python 3.1 behaviour.
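The round trip above can be restated for Python 3 as the following sketch. It assumes an interpreter that already includes the decoder change made for this issue (the variable names are illustrative):

```python
import json

original = {"my_key": "\uda00"}   # value is a lone high surrogate
encoded = json.dumps(original)    # ensure_ascii=True by default
decoded = json.loads(encoded)

print(encoded)                    # {"my_key": "\uda00"}
print(decoded == original)        # True
```

The encoded document contains only ASCII characters: the surrogate is written as a six-character `\uXXXX` escape.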

    @ezio-melotti
    Member

    RFC 4627 doesn't say much about lone surrogates:
    A string is a sequence of zero or more Unicode characters [UNICODE].
    [...]

    All Unicode characters may be placed within the
    quotation marks except for the characters that must be escaped:
    quotation mark, reverse solidus, and the control characters (U+0000
    through U+001F).

    Any character may be escaped. If the character is in the Basic
    Multilingual Plane (U+0000 through U+FFFF), then it may be
    represented as a six-character sequence: a reverse solidus, followed
    by the lowercase letter u, followed by four hexadecimal digits that
    encode the character's code point. The hexadecimal letters A though
    F can be upper or lowercase. So, for example, a string containing
    only a single reverse solidus character may be represented as
    "\u005C".
    [...]

    To escape an extended character that is not in the Basic Multilingual
    Plane, the character is represented as a twelve-character sequence,
    encoding the UTF-16 surrogate pair. So, for example, a string
    containing only the G clef character (U+1D11E) may be represented as
    "\uD834\uDD1E".

    Raymond> JSON is UTF-8 by definition and it is a useful feature that invalid UTF-8 won't load.

    Even if the input strings are not encodable in UTF-8 because they contain lone surrogates, they can still be converted to an \uXXXX escape, and the resulting JSON document will be valid UTF-8.
    AFAIK json always uses \uXXXX, so it doesn't produce invalid UTF-8 documents.

    While decoding, both json.loads('"\xed\xa0\x80"') and json.loads('"\ud800"') result in u'\ud800', but the first is not a valid UTF-8 document because it contains an invalid UTF-8 byte sequence that represents a lone surrogate, whereas the second one contains only ASCII bytes and is therefore valid.
    Python 2.7 should probably reject '"\xed\xa0\x80"', but since its UTF-8 codec is somewhat permissive already, I'm not sure it makes much sense changing the behavior now. Python 3 doesn't have this problem because it works only with unicode strings, so you can't pass invalid UTF-8 byte sequences.

    OTOH the Unicode standard says that lone surrogates shouldn't be passed around, so we might decide to replace them with the replacement char U+FFFD, raise an error, or even provide a way to decide what should be done with them (something like the errors argument of codecs).
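A sketch of two points from this comment, assuming Python 3: the default `json.dumps` output stays ASCII even for a lone surrogate, and one of the options mentioned (replacing surrogates with U+FFFD) can be applied before serialising. `replace_surrogates` is a hypothetical helper for this example, not a json module API:

```python
import json
import re

# Matches any code point in the surrogate range U+D800..U+DFFF.
_SURROGATES = re.compile('[\ud800-\udfff]')

def replace_surrogates(text):
    """Hypothetical helper: swap surrogate code points for U+FFFD."""
    return _SURROGATES.sub('\ufffd', text)

doc = json.dumps("a\ud800b")
is_ascii = all(ord(ch) < 128 for ch in doc)
cleaned = replace_surrogates("a\ud800b")

print(doc)             # "a\ud800b"
print(is_ascii)        # True
print(ascii(cleaned))  # 'a\ufffdb'
```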

    @rhettinger rhettinger self-assigned this Oct 9, 2011
    @akheron
    Member

    akheron commented Aug 28, 2012

    Bear in mind that Douglas Crockford thinks a JSON document is valid even if it contains unpaired surrogates:

    http://tech.groups.yahoo.com/group/json/message/1603
    http://tech.groups.yahoo.com/group/json/message/1583
    

    It's Unicode that considers unpaired surrogates invalid, not UTF-8 by itself.

    @serhiy-storchaka
    Member

    > It's Unicode that considers unpaired surrogates invalid, not UTF-8 by itself.

    It's UTF-8 too. See RFC 3629:

    The definition of UTF-8 prohibits encoding character numbers between
    U+D800 and U+DFFF, which are reserved for use with the UTF-16
    encoding form (as surrogate pairs) and do not directly represent
    characters. When encoding in UTF-8 from UTF-16 data, it is necessary
    to first decode the UTF-16 data to obtain character numbers, which
    are then encoded in UTF-8 as described above.
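The rule quoted from RFC 3629 can be observed in Python 3 with the following sketch (names are illustrative; passing `surrogatepass` to the UTF-16 codec assumes Python 3.4 or later):

```python
lone = '\ud800'

# The UTF-8 codec refuses to encode a surrogate code point.
try:
    lone.encode('utf-8')
    encodable = True
except UnicodeEncodeError:
    encodable = False

# A UTF-16 surrogate pair must first be decoded to its code point...
pair = '\ud834\udd1e'
combined = pair.encode('utf-16', 'surrogatepass').decode('utf-16')

# ...which is then encoded in UTF-8 as a single four-byte sequence.
clef = '\U0001d11e'
print(encodable)             # False
print(combined == clef)      # True
print(clef.encode('utf-8'))  # b'\xf0\x9d\x84\x9e'
```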

    @ezio-melotti
    Member

    Attached failing test.

    @serhiy-storchaka
    Member

    About the patch: I think "with" is unnecessary here. A one-line self.assertRaises(UnicodeEncodeError, self.dumps, ch) looks better to me.

    @serhiy-storchaka
    Member

    I forgot about this issue and opened a new one, bpo-17906. There is a patch for it. Simplejson has accepted it in simplejson/simplejson#62.

    RFC 4627 does not make exceptions for the range 0xD800-0xDFFF (unescaped = %x20-21 / %x23-5B / %x5D-10FFFF), and the decoder must accept lone surrogates, both escaped and unescaped. Non-BMP characters may be represented as an escaped surrogate pair, so an escaped surrogate pair may be decoded as a non-BMP character, while an unescaped surrogate pair shouldn't be.
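The three decoding cases described here can be sketched as follows, assuming a Python 3 interpreter that includes this change (the variable names are illustrative):

```python
import json

# An escaped surrogate pair is combined into one non-BMP character.
escaped_pair = json.loads('"\\ud834\\udd1e"')

# An escaped lone surrogate is accepted and kept as-is.
escaped_lone = json.loads('"\\ud834"')

# An unescaped pair (two separate code points in the str) is not combined.
unescaped_pair = json.loads('"\ud834\udd1e"')

print(len(escaped_pair), len(escaped_lone), len(unescaped_pair))  # 1 1 2
```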

    @serhiy-storchaka
    Member

    Here are updated patches from bpo-17906. They update the tests, fix a bug reported by Bob Ippolito in msg188857, and fix an inconsistency noted by Ezio Melotti on Rietveld (the Python implementation now raises the same exception as the C implementation on an illegal hexadecimal escape).

    @serhiy-storchaka serhiy-storchaka added the extension-modules C modules in the Modules dir label May 12, 2013
    @rhettinger rhettinger removed their assignment May 21, 2013
    @tarasprokopenko
    Mannequin

    tarasprokopenko mannequin commented Oct 16, 2013

    You should use the ensure_ascii=False option to json.dumps, i.e.:

    >>> import json
    >>> unicode_bytes = '\xed\xa8\x80'
    >>> unicode_string = unicode_bytes.decode("utf8")
    >>> json_encoded = json.dumps(unicode_string, ensure_ascii=False)
    >>> json.loads(json_encoded), unicode_string
    (u'\uda00', u'\uda00')
    >>> cmp(json.loads(json_encoded), unicode_string)
    0
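A Python 3 counterpart of this workaround might look like the sketch below (an assumption-laden restatement, since the snippet above is Python 2; there is no `cmp` or `str.decode` in Python 3):

```python
import json

value = "\uda00"  # lone surrogate
encoded = json.dumps(value, ensure_ascii=False)

print(len(encoded))                  # 3: quote, surrogate, quote
print(json.loads(encoded) == value)  # True
```

With `ensure_ascii=False` the surrogate is written unescaped, so the resulting text cannot be encoded as strict UTF-8 when written out.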

    @serhiy-storchaka
    Member

    If there are no objections, I'll commit this patch soon.

    @serhiy-storchaka serhiy-storchaka self-assigned this Nov 20, 2013
    @python-dev
    Mannequin

    python-dev mannequin commented Nov 26, 2013

    New changeset c85305a54e6d by Serhiy Storchaka in branch '2.7':
    Issue bpo-11489: JSON decoder now accepts lone surrogates.
    http://hg.python.org/cpython/rev/c85305a54e6d

    New changeset 8abbdbe86c01 by Serhiy Storchaka in branch '3.3':
    Issue bpo-11489: JSON decoder now accepts lone surrogates.
    http://hg.python.org/cpython/rev/8abbdbe86c01

    New changeset 5f7326ed850f by Serhiy Storchaka in branch 'default':
    Issue bpo-11489: JSON decoder now accepts lone surrogates.
    http://hg.python.org/cpython/rev/5f7326ed850f

    @Arfrever
    Mannequin

    Arfrever mannequin commented Dec 1, 2013

    New tests fail on 2.7 branch, at least with Python configured with --enable-unicode=ucs4 (which is default in Gentoo):

    ======================================================================
    FAIL: test_surrogates (json.tests.test_scanstring.TestCScanstring)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/var/tmp/portage/dev-lang/python-2.7.7_pre20131201/work/python-2.7.7_pre20131201/Lib/json/tests/test_scanstring.py", line 107, in test_surrogates
        assertScan(u'"z\\ud834\udd20x12345"', u'z\ud834\udd20x12345')
      File "/var/tmp/portage/dev-lang/python-2.7.7_pre20131201/work/python-2.7.7_pre20131201/Lib/json/tests/test_scanstring.py", line 97, in assertScan
        (expect, len(given)))
    AssertionError: Tuples differ: (u'z\ud834\udd20x12345', 16) != (u'z\U0001d120x12345', 16)

    First differing element 0:
    z\ud834\udd20x12345
    z\U0001d120x12345

    - (u'z\ud834\udd20x12345', 16)
    + (u'z\U0001d120x12345', 16)

    ======================================================================
    FAIL: test_surrogates (json.tests.test_scanstring.TestPyScanstring)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/var/tmp/portage/dev-lang/python-2.7.7_pre20131201/work/python-2.7.7_pre20131201/Lib/json/tests/test_scanstring.py", line 107, in test_surrogates
        assertScan(u'"z\\ud834\udd20x12345"', u'z\ud834\udd20x12345')
      File "/var/tmp/portage/dev-lang/python-2.7.7_pre20131201/work/python-2.7.7_pre20131201/Lib/json/tests/test_scanstring.py", line 97, in assertScan
        (expect, len(given)))
    AssertionError: Tuples differ: (u'z\ud834\udd20x12345', 16) != (u'z\U0001d120x12345', 16)

    First differing element 0:
    z\ud834\udd20x12345
    z\U0001d120x12345

    - (u'z\ud834\udd20x12345', 16)
    + (u'z\U0001d120x12345', 16)

    ----------------------------------------------------------------------

    @Arfrever Arfrever mannequin reopened this Dec 1, 2013
    @Arfrever
    Mannequin

    Arfrever mannequin commented Dec 1, 2013

    ... when code is loaded from .pyc files (i.e. when make test runs tests the second time).

    @serhiy-storchaka
    Member

    Thank you Arfrever. Does this patch fix the test?

    @Arfrever
    Mannequin

    Arfrever mannequin commented Dec 1, 2013

    test_json_surrogates.patch fixes these tests.

    @python-dev
    Mannequin

    python-dev mannequin commented Dec 1, 2013

    New changeset 02d186e3af09 by Serhiy Storchaka in branch '2.7':
    Fixed JSON tests on wide build when ran from *.pyc files (issue bpo-11489).
    http://hg.python.org/cpython/rev/02d186e3af09

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022