Encodings and aliases do not match runtime #42237

Open
liturgist mannequin opened this issue Aug 1, 2005 · 11 comments
Labels
docs Documentation in the Doc dir type-feature A feature request or enhancement

Comments

@liturgist
Mannequin

liturgist mannequin commented Aug 1, 2005

BPO 1249749
Nosy @malemburg, @loewis
Files
  • encodingaliases.py: encodingaliases.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2005-08-01.18:23:30.000>
    labels = ['type-feature', 'docs']
    title = 'Encodings and aliases do not match runtime'
    updated_at = <Date 2020-09-19.19:04:08.068>
    user = 'https://bugs.python.org/liturgist'

    bugs.python.org fields:

    activity = <Date 2020-09-19.19:04:08.068>
    actor = 'georg.brandl'
    assignee = 'docs@python'
    closed = False
    closed_date = None
    closer = None
    components = ['Documentation']
    creation = <Date 2005-08-01.18:23:30.000>
    creator = 'liturgist'
    dependencies = []
    files = ['1749']
    hgrepos = []
    issue_num = 1249749
    keywords = []
    message_count = 11.0
    messages = ['25927', '25928', '25929', '25930', '25931', '25932', '25933', '25934', '25935', '25936', '25937']
    nosy_count = 4.0
    nosy_names = ['lemburg', 'loewis', 'liturgist', 'docs@python']
    pr_nums = []
    priority = 'low'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue1249749'
    versions = ['Python 3.2']

    @liturgist
    Mannequin Author

    liturgist mannequin commented Aug 1, 2005

    The 2.4.1 documentation has a list of standard encodings in
    section 4.9.2. However, this list does not seem to match what
    is returned by the runtime. Below is code to dump out
    the encodings and aliases. Please tell me if anything
    is incorrect.

    In some cases, there are many more valid aliases than
    listed in the documentation. See 'cp037' as an example.

    I see that the identifiers are intended to be case
    insensitive. I would prefer to see the documentation
    provide the identifiers as they will appear in
    encodings.aliases.aliases. The only alias containing
    any upper case letters appears to be 'csHPRoman8' (for hp_roman8).
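A quick way to check both points on a given interpreter is sketched below. It assumes Python 3, and the helper name `mixed_case_aliases` is made up for illustration. The result depends on the Python version, so no particular output is claimed here.

```python
import codecs
from encodings.aliases import aliases

def mixed_case_aliases():
    """Return the alias keys that are not entirely lower case."""
    return sorted(a for a in aliases if a != a.lower())

# Lookup is case-insensitive, so these spellings should resolve to one codec.
names = {codecs.lookup(s).name for s in ("latin_1", "LATIN_1", "Latin_1")}
print(names)

# Any alias stored with upper-case letters in encodings.aliases.aliases.
print(mixed_case_aliases())
```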

    $ cat encodingaliases.py
    #!/usr/bin/env python
    import sys
    import encodings.aliases

    def main():
        # Map each codec name to the list of aliases that resolve to it.
        enchash = {}
        for encalias, enc in encodings.aliases.aliases.items():
            enchash.setdefault(enc, []).append(encalias)

        # Print the codecs in sorted order, one per line.
        for enc in sorted(enchash):
            print("%s %s" % (enc, enchash[enc]))

    if __name__ == '__main__':
        main()
        sys.exit(0)
    $ ./encodingaliases.py
    ascii ['iso_ir_6', 'ansi_x3_4_1968', 'ibm367',
    'iso646_us', 'us', 'cp367', '646', 'us_ascii',
    'csascii', 'ansi_x3.4_1986', 'iso_646.irv_1991',
    'ansi_x3.4_1968']
    base64_codec ['base_64', 'base64']
    big5 ['csbig5', 'big5_tw']
    big5hkscs ['hkscs', 'big5_hkscs']
    bz2_codec ['bz2']
    cp037 ['ebcdic_cp_wt', 'ebcdic_cp_us', 'ebcdic_cp_nl',
    '037', 'ibm039', 'ibm037', 'csibm037', 'ebcdic_cp_ca']
    cp1026 ['csibm1026', 'ibm1026', '1026']
    cp1140 ['1140', 'ibm1140']
    cp1250 ['1250', 'windows_1250']
    cp1251 ['1251', 'windows_1251']
    cp1252 ['windows_1252', '1252']
    cp1253 ['1253', 'windows_1253']
    cp1254 ['1254', 'windows_1254']
    cp1255 ['1255', 'windows_1255']
    cp1256 ['1256', 'windows_1256']
    cp1257 ['1257', 'windows_1257']
    cp1258 ['1258', 'windows_1258']
    cp424 ['ebcdic_cp_he', 'ibm424', '424', 'csibm424']
    cp437 ['ibm437', '437', 'cspc8codepage437']
    cp500 ['csibm500', 'ibm500', '500', 'ebcdic_cp_ch',
    'ebcdic_cp_be']
    cp775 ['cspc775baltic', '775', 'ibm775']
    cp850 ['ibm850', 'cspc850multilingual', '850']
    cp852 ['ibm852', '852', 'cspcp852']
    cp855 ['csibm855', 'ibm855', '855']
    cp857 ['csibm857', 'ibm857', '857']
    cp860 ['csibm860', 'ibm860', '860']
    cp861 ['csibm861', 'cp_is', 'ibm861', '861']
    cp862 ['cspc862latinhebrew', 'ibm862', '862']
    cp863 ['csibm863', 'ibm863', '863']
    cp864 ['csibm864', 'ibm864', '864']
    cp865 ['csibm865', 'ibm865', '865']
    cp866 ['csibm866', 'ibm866', '866']
    cp869 ['csibm869', 'ibm869', '869', 'cp_gr']
    cp932 ['mskanji', '932', 'ms932', 'ms_kanji']
    cp949 ['uhc', 'ms949', '949']
    cp950 ['ms950', '950']
    euc_jis_2004 ['eucjis2004', 'jisx0213', 'euc_jis2004']
    euc_jisx0213 ['eucjisx0213']
    euc_jp ['eucjp', 'ujis', 'u_jis']
    euc_kr ['ksc5601', 'korean', 'euckr', 'ksx1001',
    'ks_c_5601', 'ks_c_5601_1987', 'ks_x_1001']
    gb18030 ['gb18030_2000']
    gb2312 ['chinese', 'euc_cn', 'csiso58gb231280',
    'iso_ir_58', 'euccn', 'eucgb2312_cn', 'gb2312_1980',
    'gb2312_80']
    gbk ['cp936', 'ms936', '936']
    hex_codec ['hex']
    hp_roman8 ['csHPRoman8', 'r8', 'roman8']
    hz ['hzgb', 'hz_gb_2312', 'hz_gb']
    iso2022_jp ['iso2022jp', 'iso_2022_jp', 'csiso2022jp']
    iso2022_jp_1 ['iso_2022_jp_1', 'iso2022jp_1']
    iso2022_jp_2 ['iso_2022_jp_2', 'iso2022jp_2']
    iso2022_jp_2004 ['iso_2022_jp_2004', 'iso2022jp_2004']
    iso2022_jp_3 ['iso_2022_jp_3', 'iso2022jp_3']
    iso2022_jp_ext ['iso2022jp_ext', 'iso_2022_jp_ext']
    iso2022_kr ['iso_2022_kr', 'iso2022kr', 'csiso2022kr']
    iso8859_10 ['csisolatin6', 'l6', 'iso_8859_10_1992',
    'iso_ir_157', 'iso_8859_10', 'latin6']
    iso8859_11 ['iso_8859_11', 'thai', 'iso_8859_11_2001']
    iso8859_13 ['iso_8859_13']
    iso8859_14 ['iso_celtic', 'iso_ir_199', 'l8',
    'iso_8859_14_1998', 'iso_8859_14', 'latin8']
    iso8859_15 ['iso_8859_15']
    iso8859_16 ['iso_8859_16_2001', 'l10', 'iso_ir_226',
    'latin10', 'iso_8859_16']
    iso8859_2 ['l2', 'csisolatin2', 'iso_ir_101',
    'iso_8859_2', 'iso_8859_2_1987', 'latin2']
    iso8859_3 ['iso_8859_3_1988', 'l3', 'iso_ir_109',
    'csisolatin3', 'iso_8859_3', 'latin3']
    iso8859_4 ['csisolatin4', 'l4', 'iso_ir_110',
    'iso_8859_4', 'iso_8859_4_1988', 'latin4']
    iso8859_5 ['iso_8859_5_1988', 'iso_8859_5', 'cyrillic',
    'csisolatincyrillic', 'iso_ir_144']
    iso8859_6 ['iso_8859_6_1987', 'iso_ir_127',
    'csisolatinarabic', 'asmo_708', 'iso_8859_6',
    'ecma_114', 'arabic']
    iso8859_7 ['ecma_118', 'greek8', 'iso_8859_7',
    'iso_ir_126', 'elot_928', 'iso_8859_7_1987',
    'csisolatingreek', 'greek']
    iso8859_8 ['iso_8859_8_1988', 'iso_ir_138',
    'iso_8859_8', 'csisolatinhebrew', 'hebrew']
    iso8859_9 ['l5', 'iso_8859_9_1989', 'iso_8859_9',
    'csisolatin5', 'latin5', 'iso_ir_148']
    johab ['cp1361', 'ms1361']
    koi8_r ['cskoi8r']
    latin_1 ['iso8859', 'csisolatin1', 'latin', 'l1',
    'iso_ir_100', 'ibm819', 'cp819', 'iso_8859_1',
    'latin1', 'iso_8859_1_1987', '8859']
    mac_cyrillic ['maccyrillic']
    mac_greek ['macgreek']
    mac_iceland ['maciceland']
    mac_latin2 ['maccentraleurope', 'maclatin2']
    mac_roman ['macroman']
    mac_turkish ['macturkish']
    mbcs ['dbcs']
    ptcp154 ['cp154', 'cyrillic-asian', 'csptcp154', 'pt154']
    quopri_codec ['quopri', 'quoted_printable',
    'quotedprintable']
    rot_13 ['rot13']
    shift_jis ['s_jis', 'sjis', 'shiftjis', 'csshiftjis']
    shift_jis_2004 ['shiftjis2004', 's_jis_2004', 'sjis_2004']
    shift_jisx0213 ['shiftjisx0213', 'sjisx0213', 's_jisx0213']
    tactis ['tis260']
    tis_620 ['tis620', 'tis_620_2529_1', 'tis_620_2529_0',
    'iso_ir_166', 'tis_620_0']
    utf_16 ['utf16', 'u16']
    utf_16_be ['utf_16be', 'unicodebigunmarked']
    utf_16_le ['utf_16le', 'unicodelittleunmarked']
    utf_7 ['u7', 'utf7']
    utf_8 ['u8', 'utf', 'utf8_ucs4', 'utf8_ucs2', 'utf8']
    uu_codec ['uu']
    zlib_codec ['zlib', 'zip']

    @liturgist liturgist mannequin added docs Documentation in the Doc dir labels Aug 1, 2005
    @malemburg
    Member

    Logged In: YES
    user_id=38388

    Doc patches are welcome - perhaps you could enhance your
    script to have the doc table generated from the available
    codecs and aliases ?!

    Thanks.

    @liturgist
    Mannequin Author

    liturgist mannequin commented Aug 5, 2005

    Logged In: YES
    user_id=197677

    I would very much like to produce the doc table from code.
    However, I have a few questions.

    It seems that encodings.aliases.aliases is a list of all
    encodings and not necessarily those supported on all
    machines, e.g. mbcs on UNIX, or embedded systems that might
    exclude some large character sets to save space. Is this
    correct? If so, will it remain that way?

    To find out if an encoding is supported on the current
    machine, the code should handle the exception generated when
    codecs.lookup() fails. Right?
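The check described above could look roughly like this. It is a sketch assuming Python 3, where a failed `codecs.lookup()` raises `LookupError`; the function name `available_codecs` is hypothetical.

```python
import codecs
from encodings.aliases import aliases

def available_codecs():
    """Split the codec names in the alias table into usable and unavailable."""
    usable, missing = [], []
    for name in sorted(set(aliases.values())):
        try:
            codecs.lookup(name)
        except LookupError:
            # Not supported by this build or platform (for example mbcs on UNIX).
            missing.append(name)
        else:
            usable.append(name)
    return usable, missing

usable, missing = available_codecs()
print(len(usable), "usable;", "unavailable:", missing)
```

Note that this only covers codecs that have at least one alias; a codec module with no entry in `encodings.aliases.aliases` would not be seen.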

    To generate the table, I need to produce the "Languages"
    field. This information does not seem to be available from
    the Python runtime. I would much rather see this
    information, including a localized version of the string,
    come from the Python runtime, rather than hardcode it into
    the script. Is that a possibility? Would it be a better
    approach?

    The non-language oriented encodings such as base_64 and
    rot_13 do not seem to have anything that distinguishes them
    from human languages. How can these be separated out
    without hardcoding?

    Likewise, the non-language encodings have an "Operand type"
    field which would need to be generated. My feeling is,
    again, that this should come from the Python runtime and not
    be hardcoded into the doc generation script. Any suggestions?

    @loewis
    Mannequin

    loewis mannequin commented Aug 6, 2005

    Logged In: YES
    user_id=21627

    I would not like to see the documentation contain a complete
    list of all aliases. The documentation points out that these
    are "a few common aliases", i.e. I selected aliases that
    people are likely to encounter, and are encouraged to use.

    I don't think it is useful to produce the table from the
    code. If you want to know everything in aliases, just look
    at aliases directly.
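For readers who do want to look at the alias table directly, a small sketch (Python 3 assumed; `aliases_for` is an illustrative helper, not part of the standard library) that lists every alias registered for one codec:

```python
from encodings.aliases import aliases

def aliases_for(codec_name):
    """Return all aliases in encodings.aliases.aliases that map to codec_name."""
    return sorted(alias for alias, target in aliases.items()
                  if target == codec_name)

print(aliases_for("cp037"))
```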

    @malemburg
    Member

    Logged In: YES
    user_id=38388

    Martin, I don't see any problem with putting the complete
    list of aliases into the documentation.

    liturgist, don't worry about hard-coding things into the
    script. The extra information Martin gave in the table is
    not likely going to become part of the standard lib, because
    there's not a lot you can do with it programmatically.

    @liturgist
    Mannequin Author

    liturgist mannequin commented Aug 10, 2005

    Logged In: YES
    user_id=197677

    The script attached generates two HTML tables in files
    specified on the command line.

    usage:  encodingaliases.py <language-oriented-codecs-html-file> <non-language-oriented-codecs-html-file>

    A static list of codecs in this script is used because the
    language description is not available in the python runtime.
    Codecs found in the encodings.aliases.aliases list are
    added to the list, but will be described as "unknown" encodings.

    The "bijectiveType" was, like the language descriptions,
    taken from the current (2.4.1) documentation.

    It would be much better for the descriptions and "bijective"
    type settings to come from the runtime. The problem is one
    of maintenance. Without these available for introspection
    in the runtime, a new encoding with no alias will never be
    identified. When it does appear with an alias, it can only
    be described as "unknown."
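The approach described above, a static description list merged with whatever the alias table reports, might be sketched as follows. This is not the attached script: `DESCRIPTIONS` and `table_rows` are hypothetical names, the three sample entries are illustrative only, and Python 3 is assumed.

```python
from encodings.aliases import aliases

# Hypothetical hand-maintained descriptions; a real list would cover
# every documented codec.
DESCRIPTIONS = {
    "ascii": "English",
    "latin_1": "West Europe",
    "cp037": "English",
}

def table_rows(descriptions=DESCRIPTIONS, alias_table=aliases):
    """Build (codec, aliases, description) rows for a documentation table.

    Codecs found only in the alias table get the description "unknown".
    """
    codec_names = set(descriptions) | set(alias_table.values())
    rows = []
    for name in sorted(codec_names):
        names = sorted(a for a, target in alias_table.items() if target == name)
        rows.append((name, ", ".join(names), descriptions.get(name, "unknown")))
    return rows

for codec, alias_list, description in table_rows()[:5]:
    print(codec, "|", alias_list, "|", description)
```

The maintenance problem is visible in the sketch: a codec that appears in neither `DESCRIPTIONS` nor the alias table never produces a row.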

    @loewis
    Mannequin

    loewis mannequin commented Aug 10, 2005

    Logged In: YES
    user_id=21627

    I do see a problem with generating these tables
    automatically. It suggests to the reader that the aliases are
    all equally relevant. However, I bet few people have ever
    heard of or used, say, 'cspc850multilingual'.

    As for the actual patch: Please don't generate HTML.
    Instead, TeX should be generated, as this is the primary
    source. Also please add a patch to the current TeX file,
    updating it appropriately.

    @liturgist
    Mannequin Author

    liturgist mannequin commented Aug 11, 2005

    Logged In: YES
    user_id=197677

    For example: there appears to be a codec for iso8859-1, but
    it has no alias in the encodings.aliases.aliases list and it
    is not in the current documentation.

    What is the relationship of iso8859_1 to latin_1? Should
    iso8859_1 be considered a base codec? When should iso8859_1
    be used rather than latin_1?

    @loewis
    Mannequin

    loewis mannequin commented Aug 11, 2005

    Logged In: YES
    user_id=21627

    I think the presence of iso8859_1.py is a bug, resulting
    from automatic generation of these files. The file should be
    deleted; iso8859-1 should be encoded through the alias to
    latin-1. Thanks for pointing that out.

    @liturgist
    Mannequin Author

    liturgist mannequin commented Aug 11, 2005

    Logged In: YES
    user_id=197677

    If it does not present a problem, making latin_1 an alias
    for iso8859_1, with iso8859_1 as the base codec, would present the ISO
    standards as a complete, orthogonal set. The alias would
    mean that no existing code is broken. Right?

    Would this approach present any problem? Should this
    become a separate bug entry?

    @loewis
    Mannequin

    loewis mannequin commented Aug 11, 2005

    Logged In: YES
    user_id=21627

    It does present a problem: the latin-1 codec is faster than
    the iso8859-1 codec, as it is a special case in C (employing
    the fact that Latin-1 and Unicode share the first 256 code
    points). So I think the iso8859-1 should be dropped. But, as
    you guess, this is an issue independent of the documentation
    issue at hand, and should be reported (and resolved) separately.
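The property mentioned above, that Latin-1 and Unicode share the first 256 code points, can be checked with a short sketch (Python 3 assumed). The last lines only report how the running interpreter resolves the two names; the answer may differ between versions.

```python
import codecs

# Every byte value decodes to the code point with the same number.
all_bytes = bytes(range(256))
text = all_bytes.decode("latin_1")
assert [ord(ch) for ch in text] == list(range(256))

# Encoding is the exact inverse.
assert text.encode("latin_1") == all_bytes

# Do the two names resolve to the same registered codec on this interpreter?
same_codec = codecs.lookup("iso8859_1").name == codecs.lookup("latin_1").name
print(same_codec)
```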

    @devdanzin devdanzin mannequin added type-feature A feature request or enhancement labels Feb 16, 2009
    @BreamoreBoy BreamoreBoy mannequin assigned docspython and unassigned birkenfeld Aug 21, 2010
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022