Autodetecting JSON encoding #62109
Comments
RFC 4627 specifies a method for determining the encoding (one of UTF-8, UTF-16(BE|LE) or UTF-32(BE|LE)) of encoded JSON text. The proposed preliminary patch (it doesn't include the documentation yet) allows the load() and loads() functions to accept bytes data when it is encoded with a standard Unicode encoding. Data with a BOM is also accepted (this isn't specified in RFC 4627, but is widely used). There is only one case where the method can misfire: the serialized string "\x00..." encoded in UTF-16LE may be erroneously detected as UTF-32LE. This case violates two rules of RFC 4627: a string was serialized instead of an object or an array, and the control character U+0000 was not escaped. Standard-conforming encoded JSON is always detected correctly. This patch requires the "surrogatepass" error handler for utf-16/32 (see bpo-12892 and bpo-13916). |
All dependencies for this issue are resolved now. Here is an updated patch, synchronized with tip. |
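To make the detection rule concrete, here is a minimal sketch of the approach described in RFC 4627, section 3: since the first two characters of a conforming JSON text are ASCII, the pattern of null bytes in the first four octets identifies the encoding. The helper name `sniff_json_encoding` is hypothetical and this is not the code from the patch; the BOM handling and the UTF-8 fallback for inputs shorter than four bytes are simplifying assumptions.

```python
import codecs


def sniff_json_encoding(data: bytes) -> str:
    """Guess the codec of JSON bytes (hypothetical helper, not the patch)."""
    # BOM checks come first. UTF-32 must be tested before UTF-16 because
    # BOM_UTF32_LE begins with the same two bytes as BOM_UTF16_LE.
    if data.startswith((codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE)):
        return "utf-32"
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"

    # RFC 4627, section 3: inspect which of the first four octets are null.
    # Assumption: inputs shorter than four bytes are treated as UTF-8.
    if len(data) >= 4:
        nulls = tuple(byte == 0 for byte in data[:4])
        if nulls == (True, True, True, False):
            return "utf-32-be"   # 00 00 00 xx
        if nulls == (False, True, True, True):
            return "utf-32-le"   # xx 00 00 00
        if nulls == (True, False, True, False):
            return "utf-16-be"   # 00 xx 00 xx
        if nulls == (False, True, False, True):
            return "utf-16-le"   # xx 00 xx 00
    return "utf-8"
```

With this sketch, the misfire mentioned above can be reproduced: the text `"` followed by an unescaped U+0000, encoded as UTF-16LE, starts with the bytes `22 00 00 00`, which matches the UTF-32LE pattern.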
You'll need to also update the "Character Encodings" subsection of the json docs. |
Both json standard (ECMA-404) [1] and the new json rfc 7159 [2] do not mention [1] http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf From the rfc:
Implementations MUST NOT add a byte order mark to the beginning of a |
I agree that the state of encoding detection in the new RFC seems unclear, given that the old RFC prefaced the part about the encoding detection with:
But in the new RFC:
Thus,
There seems to have been a thread about encoding detection in the RFC 7159 working group, but I don't have the time to read through it all:
It eventually leads to a dedicated sub-thread:
|
If you adjusted the detect_encoding() logic according to Pete Cordell’s table at the bottom of <http://www.ietf.org/mail-archive/web/json/current/msg01959.html>, it might work for standalone strings. However since the RFC encourages UTF-8 for best interoperability, I wonder if any of this autodetection is necessary. It might be simpler to just assume UTF-8, or use the “utf-8-sig” codec. Or are there real cases where detecting UTF-16 or -32 would be useful? |
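For comparison, the simpler alternative suggested here could look roughly like the following sketch. The function name `loads_utf8_bytes` is hypothetical; the only assumption is that the input is UTF-8, with or without a BOM, which the standard "utf-8-sig" codec handles on decoding.

```python
import json


def loads_utf8_bytes(data: bytes):
    """Parse JSON bytes assuming UTF-8 (hypothetical helper)."""
    # "utf-8-sig" strips a leading BOM if one is present and otherwise
    # behaves like plain UTF-8 when decoding.
    return json.loads(data.decode("utf-8-sig"))
```

The trade-off is that UTF-16 and UTF-32 input would fail to decode or parse rather than being detected.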
Hi Serhiy, I have reviewed your patch, it seems to be ok. |
Having hit the json.loads() problem recently when porting a project to Python 3, I'm keen to see this land for 3.6. Accordingly, assigning to myself to review and merge Serhiy's patch - if it proves necessary, we can tweak the details of the encoding detection during beta. |
New changeset e9e1bf9ec2ac by Nick Coghlan in branch 'default': |
Thanks for tackling this Serhiy! I removed bpo-13916 from the dependency list, as while that's a reasonable suggestion, I don't think this fix is conditional on that change. |
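For readers who want to check the resulting behaviour, here is a small sketch that assumes Python 3.6 or later, where json.loads() accepts bytes encoded as UTF-8, UTF-16 or UTF-32. The sample document and the list of codecs are made up for illustration.

```python
import json

# Hypothetical sample document; json.dumps() escapes non-ASCII by default.
doc = {"name": "caf\u00e9", "ids": [1, 2, 3]}
text = json.dumps(doc)

# Encode the same text several ways and parse the bytes directly.
# "utf-16" writes a BOM; the explicit -le/-be variants do not.
results = {
    codec: json.loads(text.encode(codec))
    for codec in ("utf-8", "utf-16", "utf-16-le", "utf-32-be")
}
```

Each entry in `results` should compare equal to `doc`; on earlier Python 3 versions the same calls raise TypeError because bytes input is rejected.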