Add encoding aliases from the (HTML5) Encoding Standard #69602

Open
zwol mannequin opened this issue Oct 15, 2015 · 5 comments
Labels
3.8 (only security fixes), topic-unicode, type-feature (a feature request or enhancement)

Comments

@zwol
Mannequin

zwol mannequin commented Oct 15, 2015

BPO 25416
Nosy @malemburg, @loewis, @ezio-melotti, @fbidu
PRs
  • bpo-25416: add aliases for cp874 and mac_cyrillic encodings #10237
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2015-10-15.18:13:07.708>
    labels = ['type-feature', '3.8', 'expert-unicode']
    title = 'Add encoding aliases from the (HTML5) Encoding Standard'
    updated_at = <Date 2018-11-02.08:39:30.411>
    user = 'https://bugs.python.org/zwol'

    bugs.python.org fields:

    activity = <Date 2018-11-02.08:39:30.411>
    actor = 'lemburg'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Unicode']
    creation = <Date 2015-10-15.18:13:07.708>
    creator = 'zwol'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 25416
    keywords = ['patch']
    message_count = 4.0
    messages = ['253061', '328990', '329093', '329115']
    nosy_count = 5.0
    nosy_names = ['lemburg', 'loewis', 'ezio.melotti', 'zwol', 'fbidu']
    pr_nums = ['10237']
    priority = 'normal'
    resolution = None
    stage = 'test needed'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue25416'
    versions = ['Python 3.8']

    @zwol
    Mannequin Author

    zwol mannequin commented Oct 15, 2015

    The codecs registry (as of 3.4) is unaware of two of the canonical encoding names from <https://encoding.spec.whatwg.org/#names-and-labels>: "windows-874" and "x-mac-cyrillic". For interoperability's sake, please make these aliases for "cp874" and "mac_cyrillic" respectively.

    (For full interop, *every* name and label in that list should be understood by str.encode(), but the canonical names are most useful. Lack of support for iso-8859-i is already reported as https://bugs.python.org/issue18624 . I have not tested the full set of non-canonical labels.)
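Until such aliases exist in the standard library, a caller could approximate them locally with a codec search function. The following is only a sketch: the `WHATWG_ALIASES` mapping and the `whatwg_search` function are hypothetical names, and the mapping assumes the WHATWG and Python definitions of these two encodings are close enough to treat as equivalent, which is the very point under discussion in this issue.

```python
import codecs

# Hypothetical local mapping: WHATWG name -> existing Python codec name.
# Keys use underscores; the search function normalizes hyphens itself
# because the name passed in differs between Python versions.
WHATWG_ALIASES = {
    "windows_874": "cp874",
    "x_mac_cyrillic": "mac_cyrillic",
}


def whatwg_search(name):
    """Codec search function: resolve the extra names, else return None."""
    target = WHATWG_ALIASES.get(name.lower().replace("-", "_"))
    return codecs.lookup(target) if target else None


# Search functions are consulted in registration order, so this one is only
# reached for names the built-in registry does not already know.
codecs.register(whatwg_search)

for label in ("windows-874", "x-mac-cyrillic"):
    print(label, "->", codecs.lookup(label).name)
```

The reported `name` is that of the underlying Python codec, since the search function simply hands back the existing `CodecInfo`.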

    @zwol zwol mannequin added the type-feature A feature request or enhancement label Oct 15, 2015
    @ezio-melotti
    Member

    Adding those aliases sounds good to me. I think it would be good to add some tests first (possibly as a separate issue/PR), though I'm not sure what the best way to test the aliases would be.

    Testing whether the list is complete and correct should be done against the HTML5/Unicode specs, but automating that would require downloading and parsing the specs, which is probably not worth doing.

    We can also check that all the aliases are accepted by str.encode/decode and that corresponding aliases give the same result. However, if str.encode/decode use the aliases dict, the test is nothing more than a sanity check and won't detect, e.g., typos in alias names or wrongly assigned aliases.

    @ezio-melotti ezio-melotti added the 3.8 only security fixes label Oct 31, 2018
    @fbidu
    Mannequin

    fbidu mannequin commented Nov 2, 2018

    Ezio, I have opened a simple PR that adds just the two aliases cited in the issue's initial message. I would like to implement tests but, as I wrote in the PR's message, I'm not really sure how to proceed with that. bpo-18624 is closely related to this issue, and it references a test_codecs.py file that I could not find.

    If you could give me a few pointers on how to proceed, I'll be glad to improve my PR, add tests and even add all the other aliases that are missing.

    @malemburg
    Member

    Please note that we can only add aliases if the encodings are indeed the same. Given that WhatWG has made changes to several standard encodings, this is especially important, since our codecs are mostly based on what the Unicode consortium defines as these encodings.

    Tests for aliases can be minimal: just verify that the codecs subsystem detects them and results in the correct codec being used. There's no need to download any WhatWG specs for this.
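A minimal check along these lines might look like the sketch below. It is not taken from CPython's test suite; `alias_mismatches` is a hypothetical helper, and the two pairs used in the demonstration are aliases that already exist, standing in for the ones proposed here. Because the expected codec is written out by hand in the table, a typo or wrongly assigned alias in the registry would show up as a mismatch.

```python
import codecs


def alias_mismatches(aliases):
    """Return (alias, problem) pairs for aliases that do not resolve
    to the same codec as their expected canonical name."""
    problems = []
    for alias, canonical in aliases.items():
        try:
            found = codecs.lookup(alias).name
        except LookupError:
            problems.append((alias, "not found"))
            continue
        expected = codecs.lookup(canonical).name
        if found != expected:
            problems.append((alias, f"resolved to {found}, expected {expected}"))
    return problems


# Existing aliases used as stand-ins; proposed aliases would be added here.
print(alias_mismatches({"latin1": "iso8859_1", "utf8": "utf_8"}))
```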

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @Mr0grog

    Mr0grog commented Oct 17, 2023

    > Please note that we can only add aliases if the encodings are indeed the same. Given that WhatWG has made changes to several standard encodings, this is especially important, since our codecs are mostly based on what the Unicode consortium defines as these encodings.

    I know this is old, but I ran across it today as part of a discussion about encoding detection in a wrapper for Firefox’s chardetng library (which detects based on WHATWG encoding definitions). I wrote a little script to compare Unicode consortium and WHATWG definitions for that. Feel free to use or fork in case it’s useful here: https://gist.github.com/Mr0grog/70ec66c2ed0e7ee9a5d50406534dad46

    Anyway, it turns out that even some encodings that already use the same names between Python and WHATWG differ in their definition (e.g. KOI8-U). My naive first reaction to that is to say that Python should just map all the relevant names — they are meant to be conceptually the same thing, and you can’t get around the fact that different systems may interpret the same encodings differently, and are already doing so. Not handling some aliases probably isn’t actually saving anyone from this problem (and there’s really no “right” answer, since it is hard to know whether a given piece of KOI8-U (or windows-874, or whatever) content was created with an encoder that was spec-compliant or non-compliant or buggy or whatever).
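To illustrate the kind of comparison involved, here is a rough sketch (not the gist linked above) that diffs the decoding tables of two single-byte codecs byte by byte. The `table_diff` helper is hypothetical, and the example compares two of Python's own codecs; comparing against the WHATWG definitions would additionally require loading that spec's index files, which is not shown.

```python
def table_diff(enc_a, enc_b):
    """List (byte, char_a, char_b) for every byte value that two
    single-byte encodings decode differently."""
    diffs = []
    for byte in range(256):
        raw = bytes([byte])
        # errors="replace" turns unmapped bytes into U+FFFD instead of raising.
        a = raw.decode(enc_a, errors="replace")
        b = raw.decode(enc_b, errors="replace")
        if a != b:
            diffs.append((byte, a, b))
    return diffs


for byte, a, b in table_diff("koi8_r", "koi8_u"):
    print(f"0x{byte:02X}: {a!r} vs {b!r}")
```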
