json.dumps not parsable by json.loads (on Linux only) #55698
The following works on Win7x64 Python 2.6.5 and breaks on Ubuntu 10.04x64-2.6.5. This raises three issues:
import json
unicode_bytes = '\xed\xa8\x80'
unicode_string = unicode_bytes.decode("utf8")
json_encoded = json.dumps({"my_key": unicode_string})
json.loads(json_encoded)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/json/__init__.py", line 307, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.6/json/decoder.py", line 319, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.6/json/decoder.py", line 336, in raw_decode
obj, end = self._scanner.iterscan(s, **kw).next()
File "/usr/lib/python2.6/json/scanner.py", line 55, in iterscan
rval, next_pos = action(m, context)
File "/usr/lib/python2.6/json/decoder.py", line 183, in JSONObject
value, end = iterscan(s, idx=end, context=context).next()
File "/usr/lib/python2.6/json/scanner.py", line 55, in iterscan
rval, next_pos = action(m, context)
File "/usr/lib/python2.6/json/decoder.py", line 155, in JSONString
return scanstring(match.string, match.end(), encoding, strict)
ValueError: Invalid \uXXXX escape: line 1 column 14 (char 14) |
Decoding these bytes should fail, and in Python 3.x it does:
>>> b'\xed\xa8\x80'.decode("utf8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid continuation byte
Python 2.7 behavior seems to be a bug.
>>> '\xed\xa8\x80'.decode("utf8")
u'\uda00'
Note also the following difference:
In 3.x:
>>> b'\xed\xa8\x80'.decode("utf8", 'replace')
'��'
In 2.7:
>>> '\xed\xa8\x80'.decode("utf8", 'replace')
u'\uda00'
I am not sure this should be fixed in 2.x. Lone surrogates seem to round-trip just fine in 2.x and there is likely to be existing code that relies on this.
This on the other hand should probably be fixed by either rejecting lone surrogates in json.dumps or accepting them in json.loads or both. The last alternative would be consistent with the common wisdom of being conservative in what you produce but liberal in what you accept. |
I generally agree but am then at a loss as to how to detect and deal with lone surrogates (e.g. "ignore", "replace", etc.) in 2.x when interacting with services/libraries (such as Python's own json.loads) that take a stricter view.
We seem to be in the worst of both worlds right now as I've generated and stored a lot of json that can not be read back in. Could the JSON library simply leverage Python's Unicode interpreter instead of performing its own validation? We could pass it "ignore", "replace", etc. Regardless, I think we certainly need to remove the strict JSON loads() validation especially when it isn't enforced by dumps(). |
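As a rough illustration of the "detect and deal with" question above, here is a minimal sketch. It assumes Python 3, where any code point in the range U+D800 to U+DFFF inside a str is by construction an unpaired surrogate. The helper names has_lone_surrogate and replace_lone_surrogates are hypothetical and not part of the json module.

```python
import re

# Assumption: Python 3 str, where a well-formed non-BMP character is a single
# code point, so every code point in this range is an unpaired surrogate.
_SURROGATE = re.compile(r"[\ud800-\udfff]")


def has_lone_surrogate(text):
    """Return True if text contains at least one surrogate code point."""
    return _SURROGATE.search(text) is not None


def replace_lone_surrogates(text, replacement="\ufffd"):
    """Swap every surrogate code point for the replacement (default U+FFFD)."""
    return _SURROGATE.sub(replacement, text)


suspect = "a\uda00b"
cleaned = replace_lone_surrogates(suspect)
```

Passing an empty replacement string gives the "ignore" behaviour instead of "replace".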
This is unfortunate. The dumps() should have never worked in the first place. I don't think that loads() should be changed to accommodate the dumps() error though. JSON is UTF-8 by definition and it is a useful feature that invalid UTF-8 won't load. To fix the data you've already created (data that other compliant JSON readers wouldn't be able to parse), I think you need to reprocess those files to make them valid:
bs.decode('utf-8', errors='ignore').encode('utf-8')
Then we need to fix dumps so that it doesn't silently create invalid JSON.
Rejection is the right way to go. For the most part, |
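The reprocessing step suggested above can be sketched as follows. This assumes Python 3, whose strict UTF-8 decoder rejects surrogate byte sequences; on Python 2.x the decoder accepts them (as shown earlier in the thread), so the same expression would leave the data unchanged. The function name drop_invalid_utf8 and the sample document are made up for illustration.

```python
import json


def drop_invalid_utf8(raw):
    """Decode leniently, discarding undecodable bytes, then re-encode."""
    return raw.decode("utf-8", errors="ignore").encode("utf-8")


# Hypothetical stored document containing the surrogate bytes from the report.
broken = b'{"my_key": "ab\xed\xa8\x80cd"}'
cleaned = drop_invalid_utf8(broken)
loaded = json.loads(cleaned.decode("utf-8"))
```

Note that this applies to files holding raw surrogate bytes; a document that stores the surrogate as an ASCII \uXXXX escape passes through unchanged.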
On Mon, Mar 14, 2011 at 4:09 PM, Raymond Hettinger
I may be wrong but it appeared that json actually encoded the data as the Unfortunately I don't believe this does anything on python 2.x as only
|
print(repr(json.loads(json.dumps({u"my_key": u'\uda00'}))['my_key'])):
json version changed in Python 2.7: see the issue bpo-4136. See also this important change in simplejson: We only fix security bugs in Python 2.6, not bugs. I don't think that this issue is a security bug in Python 2.6. We might change Python 3.1 behaviour. |
RFC 4627 doesn't say much about lone surrogates:
All Unicode characters may be placed within the
Any character may be escaped. If the character is in the Basic
To escape an extended character that is not in the Basic Multilingual
Raymond> JSON is UTF-8 by definition and it is a useful feature that invalid UTF-8 won't load.
Even if the input strings are not encodable in UTF-8 because they contain lone surrogates, they can still be converted to an \uXXXX escape, and the resulting JSON document will be valid UTF-8.
While decoding, both json.loads('"\xed\xa0\x80"') and json.loads('"\ud800"') result in u'\ud800', but the first is not a valid UTF-8 document because it contains an invalid UTF-8 byte sequence that represents a lone surrogate, whereas the second one contains only ASCII bytes and is therefore valid.
OTOH the Unicode standard says that lone surrogates shouldn't be passed around, so we might decide to replace them with the replacement char U+FFFD, raise an error, or even provide a way to decide what should be done with them (something like the errors argument of codecs). |
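The point about \uXXXX escapes can be illustrated with a small sketch. It assumes the behaviour of the json module in current Python 3 releases, which may differ from the 2.6/2.7 versions discussed in this thread.

```python
import json

lone = "\ud800"                      # a lone high surrogate
doc = json.dumps(lone)               # ensure_ascii=True is the default
utf8_bytes = doc.encode("utf-8")     # succeeds: the document is pure ASCII
restored = json.loads(utf8_bytes.decode("utf-8"))

print(doc)                           # "\ud800"
```

The serialized document contains only ASCII characters, so it is a valid UTF-8 byte stream even though the value it describes is not encodable in UTF-8.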
Bear in mind that Douglas Crockford thinks a JSON document is valid even if it contains unpaired surrogates:
It's Unicode that considers unpaired surrogates invalid, not UTF-8 by itself. |
It's UTF-8 too. See RFC 3629: The definition of UTF-8 prohibits encoding character numbers between |
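A short sketch of how this restriction shows up in practice, assuming Python 3's codecs (Python 2.x is more lenient, as shown earlier in the thread). The "surrogatepass" error handler is the standard opt-out for code that has to carry such data through.

```python
raw = b"\xed\xa0\x80"   # UTF-8-style byte sequence for U+D800

try:
    raw.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False   # the strict decoder rejects surrogate bytes

try:
    "\ud800".encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False   # the strict encoder rejects surrogate code points

# Explicit opt-in: let surrogates through in both directions.
passed_through = raw.decode("utf-8", errors="surrogatepass")
re_encoded = passed_through.encode("utf-8", errors="surrogatepass")
```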
Attached failing test. |
About the patch: I think "with" is unnecessary here. The one-line self.assertRaises(UnicodeEncodeError, self.dumps, ch) looks better to me. |
I forgot about this issue and opened a new one, bpo-17906. There is a patch for it. Simplejson has accepted it in simplejson/simplejson#62. RFC 4627 does not make exceptions for the range 0xD800-0xDFFF (unescaped = %x20-21 / %x23-5B / %x5D-10FFFF), and the decoder must accept lone surrogates, both escaped and unescaped. Non-BMP characters may be represented as an escaped surrogate pair, so an escaped surrogate pair may be decoded as a non-BMP character, while an unescaped surrogate pair shouldn't. |
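The decoding rules described above can be checked with a small sketch. It assumes the json module of a current Python 3 release, where this behaviour is in place; the variable names are illustrative only.

```python
import json

# Both halves escaped: decoded as one non-BMP character.
escaped_pair = json.loads('"\\ud834\\udd20"')

# A lone escaped surrogate is accepted and kept as is.
lone_escaped = json.loads('"\\ud834"')

# Escaped high surrogate followed by a raw low surrogate: the two halves
# are not combined into a single character.
mixed = json.loads('"\\ud834\udd20"')
```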
You should use the ensure_ascii=False option to json.dumps, i.e.:
import json
unicode_bytes = '\xed\xa8\x80'
unicode_string = unicode_bytes.decode("utf8")
json_encoded = json.dumps(unicode_string, ensure_ascii=False)
json.loads(json_encoded),unicode_string |
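For comparison, a rough Python 3 version of the same workaround. It assumes a current Python 3 json module and starts from the str value directly, because the Python 3 decoder refuses the original bytes.

```python
import json

value = "\uda00"                                  # lone surrogate from the report
doc = json.dumps(value, ensure_ascii=False)       # surrogate is emitted unescaped
round_tripped = json.loads(doc)

try:
    doc.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False                             # cannot be written out as UTF-8
```

The round trip works in memory, but the resulting document cannot be encoded as strict UTF-8, so this only sidesteps the problem for data that never leaves the process.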
If there are no objections I'll commit this patch soon. |
New changeset c85305a54e6d by Serhiy Storchaka in branch '2.7':
New changeset 8abbdbe86c01 by Serhiy Storchaka in branch '3.3':
New changeset 5f7326ed850f by Serhiy Storchaka in branch 'default': |
New tests fail on 2.7 branch, at least with Python configured with --enable-unicode=ucs4 (which is default in Gentoo):
======================================================================
Traceback (most recent call last):
File "/var/tmp/portage/dev-lang/python-2.7.7_pre20131201/work/python-2.7.7_pre20131201/Lib/json/tests/test_scanstring.py", line 107, in test_surrogates
assertScan(u'"z\\ud834\udd20x12345"', u'z\ud834\udd20x12345')
File "/var/tmp/portage/dev-lang/python-2.7.7_pre20131201/work/python-2.7.7_pre20131201/Lib/json/tests/test_scanstring.py", line 97, in assertScan
(expect, len(given)))
AssertionError: Tuples differ: (u'z\ud834\udd20x12345', 16) != (u'z\U0001d120x12345', 16)
First differing element 0:
======================================================================
Traceback (most recent call last):
File "/var/tmp/portage/dev-lang/python-2.7.7_pre20131201/work/python-2.7.7_pre20131201/Lib/json/tests/test_scanstring.py", line 107, in test_surrogates
assertScan(u'"z\\ud834\udd20x12345"', u'z\ud834\udd20x12345')
File "/var/tmp/portage/dev-lang/python-2.7.7_pre20131201/work/python-2.7.7_pre20131201/Lib/json/tests/test_scanstring.py", line 97, in assertScan
(expect, len(given)))
AssertionError: Tuples differ: (u'z\ud834\udd20x12345', 16) != (u'z\U0001d120x12345', 16)
First differing element 0:
---------------------------------------------------------------------- |
... when code is loaded from .pyc files (i.e. when |
Thank you Arfrever. Does this patch fix the test? |
test_json_surrogates.patch fixes these tests. |
New changeset 02d186e3af09 by Serhiy Storchaka in branch '2.7': |