Add encoding aliases from the (HTML5) Encoding Standard #69602

zwol · 2015-10-15T18:13:08Z

BPO	25416
Nosy	@malemburg, @loewis, @ezio-melotti, @fbidu
PRs	bpo-25416: add aliases for cp874 and mac_cyrillic encodings #10237

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2015-10-15.18:13:07.708>
labels = ['type-feature', '3.8', 'expert-unicode']
title = 'Add encoding aliases from the (HTML5) Encoding Standard'
updated_at = <Date 2018-11-02.08:39:30.411>
user = 'https://bugs.python.org/zwol'

bugs.python.org fields:

activity = <Date 2018-11-02.08:39:30.411>
actor = 'lemburg'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Unicode']
creation = <Date 2015-10-15.18:13:07.708>
creator = 'zwol'
dependencies = []
files = []
hgrepos = []
issue_num = 25416
keywords = ['patch']
message_count = 4.0
messages = ['253061', '328990', '329093', '329115']
nosy_count = 5.0
nosy_names = ['lemburg', 'loewis', 'ezio.melotti', 'zwol', 'fbidu']
pr_nums = ['10237']
priority = 'normal'
resolution = None
stage = 'test needed'
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue25416'
versions = ['Python 3.8']

zwol · 2015-10-15T18:13:07Z

The codecs registry (as of 3.4) is unaware of two of the canonical encoding names from <https://encoding.spec.whatwg.org/#names-and-labels>: "windows-874" and "x-mac-cyrillic". For interoperability's sake, please make these aliases for "cp874" and "mac_cyrillic" respectively.

(For full interop, *every* name and label in that list should be understood by str.encode(), but the canonical names are most useful. Lack of support for iso-8859-i is already reported as https://bugs.python.org/issue18624 . I have not tested the full set of non-canonical labels.)
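
Until such aliases exist in the standard library, an application can add them itself. The following is a minimal sketch, assuming the two WHATWG names really are equivalent to Python's "cp874" and "mac_cyrillic" codecs (which this issue has not established). The names `WHATWG_ALIASES` and `whatwg_alias_search` are hypothetical; only `codecs.register` and `codecs.lookup` are real APIs.

```python
import codecs

# Hypothetical mapping from WHATWG names to existing Python codec names.
# Keys use underscores because newer Python versions pass normalized
# names (hyphens converted to underscores) to search functions.
WHATWG_ALIASES = {
    "windows_874": "cp874",
    "x_mac_cyrillic": "mac_cyrillic",
}


def whatwg_alias_search(name):
    """Codec search function: resolve a WHATWG name to an existing codec."""
    # Normalize here as well, so older Python versions that pass
    # hyphenated names are handled the same way.
    key = name.lower().replace("-", "_").replace(" ", "_")
    target = WHATWG_ALIASES.get(key)
    if target is None:
        return None  # Let other registered search functions try.
    return codecs.lookup(target)


# Search functions are consulted in registration order, so the built-in
# one runs first and this only handles names it does not recognize.
codecs.register(whatwg_alias_search)
```

After registration, `b"\xa1".decode("windows-874")` should behave the same as `b"\xa1".decode("cp874")`. This only affects the running process; it does not change `encodings.aliases`.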

ezio-melotti · 2018-10-31T12:25:17Z

Adding those aliases sounds good to me. I think it would be good to add some tests first (possibly as a separate issue/pr), even though I'm not sure what would be the best way to test the aliases.

Testing whether the list is complete and correct should be done against the HTML5/Unicode specs, but automating that would require downloading and parsing the specs, which is probably not worth doing.

We can also check that every alias is accepted by str.encode/decode and that all aliases of the same encoding give the same result. However, if str.encode/decode use the aliases dict, such a test is nothing more than a sanity check and won't detect, e.g., typos in the alias names or wrongly assigned aliases.

fbidu · 2018-11-02T01:17:53Z

Ezio, I have opened a simple PR that adds just the two aliases cited in the issue's initial message. I would like to implement tests but, as I wrote in the PR's message, I'm not really sure how to proceed with that. bpo-18624 is closely related to this issue, and it references a test_codecs.py file that I could not find.

If you could give me a few pointers on how to proceed, I'll be glad to improve my PR, add tests and even add all the other aliases that are missing.

malemburg · 2018-11-02T08:39:30Z

Please note that we can only add aliases if the encodings are indeed the same. Given that WhatWG has made changes to several standard encodings, this is especially important, since our codecs are mostly based on what the Unicode consortium defines as these encodings.

Tests for aliases can be minimal: just verify that the codecs subsystem detects them and results in the correct codec being used. There's no need to download any WhatWG specs for this.
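
A minimal sketch of such a check, assuming that comparing the `name` attribute of the `CodecInfo` objects returned by `codecs.lookup` is enough to tell whether two names reach the same codec. The helper `resolves_to` is hypothetical and is not part of any existing test suite.

```python
import codecs


def resolves_to(alias, canonical):
    """Return True if both names are found and resolve to the same codec."""
    try:
        return codecs.lookup(alias).name == codecs.lookup(canonical).name
    except LookupError:
        # Either name is unknown to the codecs subsystem.
        return False


if __name__ == "__main__":
    # Aliases that already exist can be checked today; the names proposed
    # in this issue would be checked the same way once they are added.
    print(resolves_to("latin1", "latin_1"))
    print(resolves_to("windows-874", "cp874"))
```

A test case in the codecs test suite could assert `resolves_to(alias, canonical)` for each alias being added, without consulting any external specification.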

Mr0grog · 2023-10-17T21:28:44Z

> Please note that we can only add aliases if the encodings are indeed the same. Given that WhatWG has made changes to several standard encodings, this is especially important, since our codecs are mostly based on what the Unicode consortium defines as these encodings.

I know this is old, but I ran across it today as part of a discussion about encoding detection in a wrapper for Firefox’s chardetng library (which detects based on WHATWG encoding definitions). I wrote a little script to compare Unicode consortium and WHATWG definitions for that. Feel free to use or fork in case it’s useful here: https://gist.github.com/Mr0grog/70ec66c2ed0e7ee9a5d50406534dad46

Anyway, it turns out that even some encodings that already use the same names between Python and WHATWG differ in their definition (e.g. KOI8-U). My naive first reaction to that is to say that Python should just map all the relevant names — they are meant to be conceptually the same thing, and you can’t get around the fact that different systems may interpret the same encodings differently, and are already doing so. Not handling some aliases probably isn’t actually saving anyone from this problem (and there’s really no “right” answer, since it is hard to know whether a given piece of KOI8-U (or windows-874, or whatever) content was created with an encoder that was spec-compliant or non-compliant or buggy or whatever).
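
To illustrate the kind of comparison involved, here is a small sketch that lists the byte values at which two single-byte codecs decode differently. It is not the script from the gist, and it compares two of Python's own codecs only; comparing against WHATWG would additionally require loading the WHATWG index tables, which is not shown. The helper name `differing_bytes` is hypothetical.

```python
def differing_bytes(codec_a, codec_b):
    """List (byte value, char under codec_a, char under codec_b) mismatches.

    Only meaningful for single-byte encodings. Bytes that a codec leaves
    undefined decode to U+FFFD because of errors="replace".
    """
    diffs = []
    for value in range(256):
        raw = bytes([value])
        char_a = raw.decode(codec_a, errors="replace")
        char_b = raw.decode(codec_b, errors="replace")
        if char_a != char_b:
            diffs.append((value, char_a, char_b))
    return diffs


if __name__ == "__main__":
    # KOI8-R and KOI8-U are closely related but not identical.
    for value, char_a, char_b in differing_bytes("koi8_r", "koi8_u"):
        print(f"0x{value:02X}: {char_a!r} vs {char_b!r}")
```

An empty result for two names would support treating them as aliases; a non-empty one shows exactly which bytes would be interpreted differently.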

zwol mannequin added the type-feature (A feature request or enhancement) label Oct 15, 2015

serhiy-storchaka added the topic-unicode label Oct 15, 2015

ezio-melotti added the 3.8 (only security fixes) label Oct 31, 2018

ezio-melotti transferred this issue from another repository Apr 10, 2022
