Encodings and aliases do not match runtime #42237

liturgist · 2005-08-01T18:23:30Z

BPO	1249749
Nosy	@malemburg, @loewis
Files	encodingaliases.py: encodingaliases.py

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2005-08-01.18:23:30.000>
labels = ['type-feature', 'docs']
title = 'Encodings and aliases do not match runtime'
updated_at = <Date 2020-09-19.19:04:08.068>
user = 'https://bugs.python.org/liturgist'

bugs.python.org fields:

activity = <Date 2020-09-19.19:04:08.068>
actor = 'georg.brandl'
assignee = 'docs@python'
closed = False
closed_date = None
closer = None
components = ['Documentation']
creation = <Date 2005-08-01.18:23:30.000>
creator = 'liturgist'
dependencies = []
files = ['1749']
hgrepos = []
issue_num = 1249749
keywords = []
message_count = 11.0
messages = ['25927', '25928', '25929', '25930', '25931', '25932', '25933', '25934', '25935', '25936', '25937']
nosy_count = 4.0
nosy_names = ['lemburg', 'loewis', 'liturgist', 'docs@python']
pr_nums = []
priority = 'low'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue1249749'
versions = ['Python 3.2']

liturgist · 2005-08-01T18:23:30Z

The 2.4.1 documentation has a list of standard encodings in
section 4.9.2. However, this list does not seem to match what
the runtime returns. Below is code to dump out
the encodings and aliases. Please tell me if anything
is incorrect.

In some cases, there are many more valid aliases than
listed in the documentation. See 'cp037' as an example.

I see that the identifiers are intended to be case
insensitive. I would prefer to see the documentation
provide the identifiers as they will appear in
encodings.aliases.aliases. The only alias containing
any upper case letters appears to be 'csHPRoman8' (an
alias for 'hp_roman8').

$ cat encodingaliases.py
#!/usr/bin/env python
import sys
import encodings

def main():
    enchash = {}

    for enc in encodings.aliases.aliases.values():
        enchash[enc] = []
    for encalias in encodings.aliases.aliases.keys():
        enchash[encodings.aliases.aliases[encalias]].append(encalias)

    elist = enchash.keys()
    elist.sort()
    for enc in elist:
        print enc, enchash[enc]

if __name__ == '__main__':
    main()
    sys.exit(0)
13:12 pwatson [ ruth.knightsbridge.com:/home/pwatson/src/python ] 366
$ ./encodingaliases.py
ascii ['iso_ir_6', 'ansi_x3_4_1968', 'ibm367',
'iso646_us', 'us', 'cp367', '646', 'us_ascii',
'csascii', 'ansi_x3.4_1986', 'iso_646.irv_1991',
'ansi_x3.4_1968']
base64_codec ['base_64', 'base64']
big5 ['csbig5', 'big5_tw']
big5hkscs ['hkscs', 'big5_hkscs']
bz2_codec ['bz2']
cp037 ['ebcdic_cp_wt', 'ebcdic_cp_us', 'ebcdic_cp_nl',
'037', 'ibm039', 'ibm037', 'csibm037', 'ebcdic_cp_ca']
cp1026 ['csibm1026', 'ibm1026', '1026']
cp1140 ['1140', 'ibm1140']
cp1250 ['1250', 'windows_1250']
cp1251 ['1251', 'windows_1251']
cp1252 ['windows_1252', '1252']
cp1253 ['1253', 'windows_1253']
cp1254 ['1254', 'windows_1254']
cp1255 ['1255', 'windows_1255']
cp1256 ['1256', 'windows_1256']
cp1257 ['1257', 'windows_1257']
cp1258 ['1258', 'windows_1258']
cp424 ['ebcdic_cp_he', 'ibm424', '424', 'csibm424']
cp437 ['ibm437', '437', 'cspc8codepage437']
cp500 ['csibm500', 'ibm500', '500', 'ebcdic_cp_ch',
'ebcdic_cp_be']
cp775 ['cspc775baltic', '775', 'ibm775']
cp850 ['ibm850', 'cspc850multilingual', '850']
cp852 ['ibm852', '852', 'cspcp852']
cp855 ['csibm855', 'ibm855', '855']
cp857 ['csibm857', 'ibm857', '857']
cp860 ['csibm860', 'ibm860', '860']
cp861 ['csibm861', 'cp_is', 'ibm861', '861']
cp862 ['cspc862latinhebrew', 'ibm862', '862']
cp863 ['csibm863', 'ibm863', '863']
cp864 ['csibm864', 'ibm864', '864']
cp865 ['csibm865', 'ibm865', '865']
cp866 ['csibm866', 'ibm866', '866']
cp869 ['csibm869', 'ibm869', '869', 'cp_gr']
cp932 ['mskanji', '932', 'ms932', 'ms_kanji']
cp949 ['uhc', 'ms949', '949']
cp950 ['ms950', '950']
euc_jis_2004 ['eucjis2004', 'jisx0213', 'euc_jis2004']
euc_jisx0213 ['eucjisx0213']
euc_jp ['eucjp', 'ujis', 'u_jis']
euc_kr ['ksc5601', 'korean', 'euckr', 'ksx1001',
'ks_c_5601', 'ks_c_5601_1987', 'ks_x_1001']
gb18030 ['gb18030_2000']
gb2312 ['chinese', 'euc_cn', 'csiso58gb231280',
'iso_ir_58', 'euccn', 'eucgb2312_cn', 'gb2312_1980',
'gb2312_80']
gbk ['cp936', 'ms936', '936']
hex_codec ['hex']
hp_roman8 ['csHPRoman8', 'r8', 'roman8']
hz ['hzgb', 'hz_gb_2312', 'hz_gb']
iso2022_jp ['iso2022jp', 'iso_2022_jp', 'csiso2022jp']
iso2022_jp_1 ['iso_2022_jp_1', 'iso2022jp_1']
iso2022_jp_2 ['iso_2022_jp_2', 'iso2022jp_2']
iso2022_jp_2004 ['iso_2022_jp_2004', 'iso2022jp_2004']
iso2022_jp_3 ['iso_2022_jp_3', 'iso2022jp_3']
iso2022_jp_ext ['iso2022jp_ext', 'iso_2022_jp_ext']
iso2022_kr ['iso_2022_kr', 'iso2022kr', 'csiso2022kr']
iso8859_10 ['csisolatin6', 'l6', 'iso_8859_10_1992',
'iso_ir_157', 'iso_8859_10', 'latin6']
iso8859_11 ['iso_8859_11', 'thai', 'iso_8859_11_2001']
iso8859_13 ['iso_8859_13']
iso8859_14 ['iso_celtic', 'iso_ir_199', 'l8',
'iso_8859_14_1998', 'iso_8859_14', 'latin8']
iso8859_15 ['iso_8859_15']
iso8859_16 ['iso_8859_16_2001', 'l10', 'iso_ir_226',
'latin10', 'iso_8859_16']
iso8859_2 ['l2', 'csisolatin2', 'iso_ir_101',
'iso_8859_2', 'iso_8859_2_1987', 'latin2']
iso8859_3 ['iso_8859_3_1988', 'l3', 'iso_ir_109',
'csisolatin3', 'iso_8859_3', 'latin3']
iso8859_4 ['csisolatin4', 'l4', 'iso_ir_110',
'iso_8859_4', 'iso_8859_4_1988', 'latin4']
iso8859_5 ['iso_8859_5_1988', 'iso_8859_5', 'cyrillic',
'csisolatincyrillic', 'iso_ir_144']
iso8859_6 ['iso_8859_6_1987', 'iso_ir_127',
'csisolatinarabic', 'asmo_708', 'iso_8859_6',
'ecma_114', 'arabic']
iso8859_7 ['ecma_118', 'greek8', 'iso_8859_7',
'iso_ir_126', 'elot_928', 'iso_8859_7_1987',
'csisolatingreek', 'greek']
iso8859_8 ['iso_8859_8_1988', 'iso_ir_138',
'iso_8859_8', 'csisolatinhebrew', 'hebrew']
iso8859_9 ['l5', 'iso_8859_9_1989', 'iso_8859_9',
'csisolatin5', 'latin5', 'iso_ir_148']
johab ['cp1361', 'ms1361']
koi8_r ['cskoi8r']
latin_1 ['iso8859', 'csisolatin1', 'latin', 'l1',
'iso_ir_100', 'ibm819', 'cp819', 'iso_8859_1',
'latin1', 'iso_8859_1_1987', '8859']
mac_cyrillic ['maccyrillic']
mac_greek ['macgreek']
mac_iceland ['maciceland']
mac_latin2 ['maccentraleurope', 'maclatin2']
mac_roman ['macroman']
mac_turkish ['macturkish']
mbcs ['dbcs']
ptcp154 ['cp154', 'cyrillic-asian', 'csptcp154', 'pt154']
quopri_codec ['quopri', 'quoted_printable',
'quotedprintable']
rot_13 ['rot13']
shift_jis ['s_jis', 'sjis', 'shiftjis', 'csshiftjis']
shift_jis_2004 ['shiftjis2004', 's_jis_2004', 'sjis_2004']
shift_jisx0213 ['shiftjisx0213', 'sjisx0213', 's_jisx0213']
tactis ['tis260']
tis_620 ['tis620', 'tis_620_2529_1', 'tis_620_2529_0',
'iso_ir_166', 'tis_620_0']
utf_16 ['utf16', 'u16']
utf_16_be ['utf_16be', 'unicodebigunmarked']
utf_16_le ['utf_16le', 'unicodelittleunmarked']
utf_7 ['u7', 'utf7']
utf_8 ['u8', 'utf', 'utf8_ucs4', 'utf8_ucs2', 'utf8']
uu_codec ['uu']
zlib_codec ['zlib', 'zip']
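
For comparison, here is a minimal sketch of the same dump for a current Python 3 interpreter. This is an assumption, since the script above targets 2.4, and the helper name `aliases_by_codec` is hypothetical. The alias lists it prints will likely differ from the 2.4.1 output above.

```python
from collections import defaultdict
from encodings.aliases import aliases


def aliases_by_codec():
    """Group every alias under the codec module name it maps to."""
    grouped = defaultdict(list)
    for alias, codec in aliases.items():
        grouped[codec].append(alias)
    # Sort the codec names and each alias list so the output is stable.
    return {codec: sorted(names) for codec, names in sorted(grouped.items())}


if __name__ == "__main__":
    for codec, names in aliases_by_codec().items():
        print(codec, names)
```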

malemburg · 2005-08-04T14:47:22Z

Doc patches are welcome - perhaps you could enhance your
script to have the doc table generated from the available
codecs and aliases ?!

Thanks.

liturgist · 2005-08-05T17:53:44Z

I would very much like to produce the doc table from code.
However, I have a few questions.

It seems that encodings.aliases.aliases is a list of all
encodings and not necessarily those supported on all
machines, e.g. mbcs on UNIX, or embedded systems that might
exclude some large character sets to save space. Is this
correct? If so, will it remain that way?

To find out if an encoding is supported on the current
machine, the code should handle the exception generated when
codecs.lookup() fails. Right?
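
A minimal sketch of that check, assuming a current Python 3 interpreter, where a failed `codecs.lookup()` raises `LookupError`. The helper name `is_supported` is hypothetical.

```python
import codecs
from encodings.aliases import aliases


def is_supported(name):
    """Return True if this interpreter can load the named codec."""
    try:
        codecs.lookup(name)
    except LookupError:
        # Raised when no codec with this name is available in this build.
        return False
    return True


if __name__ == "__main__":
    # Codec names listed in the alias table that cannot be loaded here.
    missing = sorted({c for c in aliases.values() if not is_supported(c)})
    print("listed in aliases but not loadable here:", missing)
```

On a non-Windows build, one would expect entries such as mbcs to show up in the printed list, but the exact result depends on the platform and build.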

To generate the table, I need to produce the "Languages"
field. This information does not seem to be available from
the Python runtime. I would much rather see this
information, including a localized version of the string,
come from the Python runtime, rather than hardcode it into
the script. Is that a possibility? Would it be a better
approach?

The non-language oriented encodings such as base_64 and
rot_13 do not seem to have anything that distinguishes them
from human languages. How can these be separated out
without hardcoding?

Likewise, the non-language encodings have an "Operand type"
field which would need to be generated. My feeling is,
again, that this should come from the Python runtime and not
be hardcoded into the doc generation script. Any suggestions?

loewis · 2005-08-06T12:41:24Z

I would not like to see the documentation contain a complete
list of all aliases. The documentation points out that these
are "a few common aliases", i.e. I selected aliases that
people are likely to encounter, and are encouraged to use.

I don't think it is useful to produce the table from the
code. If you want to know everything in aliases, just look
at aliases directly.

malemburg · 2005-08-06T12:49:05Z

Martin, I don't see any problem with putting the complete
list of aliases into the documentation.

liturgist, don't worry about hard-coding things into the
script. The extra information Martin gave in the table is
not likely to become part of the standard lib, because
there's not a lot you can do with it programmatically.

liturgist · 2005-08-10T21:29:12Z

The script attached generates two HTML tables in files
specified on the command line.

usage:  encodingaliases.py <language-oriented-codecs-html-file> <non-language-oriented-codecs-html-file>

A static list of codecs in this script is used because the
language description is not available in the Python runtime.
Codecs found in encodings.aliases.aliases but missing from
the static list are added to it, but are described as
"unknown" encodings.

The "bijectiveType" was, like the language descriptions,
taken from the current (2.4.1) documentation.

It would be much better for the descriptions and "bijective"
type settings to come from the runtime. The problem is one
of maintenance. Without these available for introspection
in the runtime, a new encoding with no alias will never be
identified. When it does appear with an alias, it can only
be described as "unknown."
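
The approach described above can be sketched roughly as follows. This is not the attached script; it assumes Python 3, and the names `KNOWN_LANGUAGES` and `describe_codecs` as well as the three sample entries are illustrative only.

```python
from encodings.aliases import aliases

# Hypothetical hand-maintained descriptions. The runtime does not
# provide these, so they have to live in the script itself.
KNOWN_LANGUAGES = {
    "ascii": "English",
    "latin_1": "West Europe",
    "cp037": "English",
}


def describe_codecs(known=KNOWN_LANGUAGES):
    """Map every discoverable codec name to a description.

    Codecs that appear only as alias targets get the placeholder "unknown".
    """
    names = set(known) | set(aliases.values())
    return {name: known.get(name, "unknown") for name in sorted(names)}


if __name__ == "__main__":
    for name, language in describe_codecs().items():
        print(name, language)
```

The maintenance problem is visible here: a codec that has neither an entry in the static dictionary nor an alias never reaches `names`, so it is missing from the table entirely.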

loewis · 2005-08-10T21:59:38Z

I do see a problem with generating these tables
automatically. It suggests to the reader that the aliases are
all equally relevant. However, I bet few people have ever
heard of or used, say, 'cspc850multilingual'.

As for the actual patch: Please don't generate HTML.
Instead, TeX should be generated, as this is the primary
source. Also please add a patch to the current TeX file,
updating it appropriately.

liturgist · 2005-08-11T02:54:17Z

For example: there appears to be a codec for iso8859-1, but
it has no alias in the encodings.aliases.aliases list and it
is not in the current documentation.

What is the relationship of iso8859_1 to latin_1? Should
iso8859_1 be considered a base codec? When should iso8859_1
be used rather than latin_1?

loewis · 2005-08-11T05:56:42Z

I think the presence of iso8859_1.py is a bug, resulting
from automatic generation of these files. The file should be
deleted; iso8859-1 should be encoded through the alias to
latin-1. Thanks for pointing that out.

liturgist · 2005-08-11T14:31:40Z

If it does not present a problem, making latin_1 an alias
for iso8859_1 as the base codec would present the ISO
standards as a complete, orthogonal set. The alias would
mean that no existing code is broken. Right?

Would this approach present any problem? Should this
become a separate bug entry?

loewis · 2005-08-11T22:10:10Z

It does present a problem: the latin-1 codec is faster than
the iso8859-1 codec, as it is a special case in C (employing
the fact that Latin-1 and Unicode share the first 256 code
points). So I think the iso8859-1 codec should be dropped. But, as
you guess, this is an issue independent of the documentation
issue at hand, and should be reported (and resolved) separately.
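
The shared code points can be checked with a short sketch, assuming a current Python 3 interpreter. It only illustrates the mapping and says nothing about the relative speed of the two codecs.

```python
all_bytes = bytes(range(256))

# Decoding with latin-1 maps each byte value to the code point
# with the same number.
decoded = all_bytes.decode("latin-1")
same_code_points = all(ord(ch) == b for ch, b in zip(decoded, all_bytes))

# Both spellings should produce the same text for every byte value.
same_result = decoded == all_bytes.decode("iso8859-1")

print(same_code_points, same_result)
```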

liturgist (mannequin) added the docs (Documentation in the Doc dir) label on Aug 1, 2005

devdanzin (mannequin) added the type-feature (A feature request or enhancement) label on Feb 16, 2009

birkenfeld self-assigned this on Apr 5, 2009

BreamoreBoy (mannequin) assigned docspython and unassigned birkenfeld on Aug 21, 2010

ezio-melotti transferred this issue from another repository on Apr 9, 2022
