Add encoding aliases from the (HTML5) Encoding Standard #69602
The codecs registry (as of 3.4) is unaware of two of the canonical encoding names from <https://encoding.spec.whatwg.org/#names-and-labels>: "windows-874" and "x-mac-cyrillic". For interoperability's sake, please make these aliases for "cp874" and "mac_cyrillic" respectively. (For full interop, *every* name and label in that list should be understood by str.encode(), but the canonical names are most useful. Lack of support for iso-8859-i is already reported as https://bugs.python.org/issue18624 . I have not tested the full set of non-canonical labels.)
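For anyone who wants to check what their own interpreter does with these labels, here is a minimal sketch. The helper name `resolve` is hypothetical (it is not part of the `codecs` module); it only wraps `codecs.lookup()`, which raises `LookupError` for unknown names. The result depends on the Python version, so no output is shown.

```python
import codecs

def resolve(label):
    """Return the codec name Python maps `label` to, or None if it is unknown."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None

# The two WHATWG names from the report, next to the names Python already uses.
for label in ("windows-874", "cp874", "x-mac-cyrillic", "mac_cyrillic"):
    print(f"{label!r:20} -> {resolve(label)!r}")
```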
Adding those aliases sounds good to me. I think it would be good to add some tests first (possibly as a separate issue/PR), even though I'm not sure what the best way to test the aliases would be. Checking that the list is complete and correct should be done against the HTML5/Unicode specs, but automating that would require downloading and parsing the specs, which is probably not worth doing. We can also check that all the aliases are accepted by str.encode/decode and that each alias gives the same result as the name it maps to. However, if str.encode/decode use the aliases dict, that test is nothing more than a sanity check and won't detect, for example, typos in the alias names or wrongly assigned aliases.
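The sanity check described above could look roughly like the following sketch. `alias_matches` is a hypothetical helper name, and the sample string is arbitrary. As noted, a check like this only confirms that the registry is self-consistent; it cannot tell whether an alias was assigned to the wrong codec in the first place.

```python
import codecs

def alias_matches(alias, canonical, sample="abc"):
    """True if both names resolve to the same codec and encode `sample` identically."""
    try:
        alias_info = codecs.lookup(alias)
        canonical_info = codecs.lookup(canonical)
    except LookupError:
        # One of the names is not known to the codecs registry.
        return False
    same_codec = alias_info.name == canonical_info.name
    same_bytes = sample.encode(alias) == sample.encode(canonical)
    return same_codec and same_bytes
```

For example, `alias_matches("latin-1", "iso8859-1")` holds on current Python versions, while an unknown label simply yields `False`.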
Ezio, I have issued a simple PR that adds just the two aliases cited in the issue's initial message. I would like to implement tests but, as I wrote in the PR's message, I'm not really sure how to proceed with that. bpo-18624 is closely related to this issue, and it contains a reference to a test_codecs.py file that I did not find. If you could give me a few pointers on how to proceed, I'll be glad to improve my PR, add tests and even add all the other aliases that are missing.
Please note that we can only add aliases if the encodings are indeed the same. Given that WhatWG has made changes to several standard encodings, this is especially important, since our codecs are mostly based on what the Unicode consortium defines as these encodings. Tests for aliases can be minimal: just verify that the codecs subsystem detects them and results in the correct codec being used. There's no need to download any WhatWG specs for this.
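To illustrate what "indeed the same" means for a single-byte encoding, here is a rough sketch that compares two of Python's own codecs byte by byte. `single_byte_differences` is a hypothetical helper, and it is only meaningful for single-byte codecs. It compares two Python codecs with each other, not a Python codec with the WHATWG index tables; that comparison would need the WHATWG data as a separate input.

```python
def single_byte_differences(enc_a, enc_b):
    """List (byte, char_a, char_b) for every byte the two codecs decode differently."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        # Undefined bytes become U+FFFD so they can still be compared.
        char_a = raw.decode(enc_a, errors="replace")
        char_b = raw.decode(enc_b, errors="replace")
        if char_a != char_b:
            diffs.append((value, char_a, char_b))
    return diffs

# latin-1 and cp1252 share most byte values but differ in the 0x80-0x9F range,
# so one could not safely be registered as an alias of the other.
for value, char_a, char_b in single_byte_differences("latin-1", "cp1252"):
    print(f"0x{value:02X}: {char_a!r} vs {char_b!r}")
```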
I know this is old, but I ran across it today as part of a discussion about encoding detection in a wrapper for Firefox’s chardetng library (which detects based on WHATWG encoding definitions). I wrote a little script to compare Unicode consortium and WHATWG definitions for that. Feel free to use or fork in case it’s useful here: https://gist.github.com/Mr0grog/70ec66c2ed0e7ee9a5d50406534dad46 Anyway, it turns out that even some encodings that already use the same names between Python and WHATWG differ in their definition (e.g. |