Mismatch between glibc and X11 locale.alias #64286
The locale module uses a locale alias table derived from the X11 locale.alias file to map bare locale names without encodings to locale names with encodings. However, the glibc default encoding for a locale sometimes differs from the one used in X11 locale.alias. Here is the full table of differences:
az_az az_AZ.UTF-8 az_AZ.ISO8859-9E
For example, with the en_IN locale:
>>> import locale, _locale
>>> _locale.setlocale(locale.LC_CTYPE)
'en_IN'
>>> locale.getlocale()
('en_IN', 'ISO8859-1')
>>> locale.nl_langinfo(locale.CODESET)
'UTF-8'
>>> locale.setlocale(locale.LC_CTYPE, locale.getlocale())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/serhiy/py/cpython/Lib/locale.py", line 592, in setlocale
    return _setlocale(category, locale)
locale.Error: unsupported locale setting |
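The mismatch in the session above can be checked programmatically. The following is a minimal sketch, not part of the locale module: `table_encoding` and `encodings_agree` are hypothetical helper names, and the comparison assumes that two names denote the same encoding when Python's codec registry resolves them to the same codec.

```python
import codecs
import locale


def table_encoding(name):
    """Encoding that Python's alias table assigns to a locale name, or None."""
    normalized = locale.normalize(name)
    if "." not in normalized:
        return None  # the table has no entry that adds an encoding
    encoding = normalized.split(".", 1)[1]
    return encoding.split("@", 1)[0]  # drop a trailing @modifier, if any


def encodings_agree(a, b):
    """Compare two encoding names, treating spellings such as
    'ISO8859-1' and 'ISO-8859-1' as the same codec."""
    try:
        return codecs.lookup(a).name == codecs.lookup(b).name
    except LookupError:
        # Python has no codec for one of the names; fall back to a
        # punctuation-insensitive string comparison.
        def squash(s):
            return "".join(ch for ch in s.lower() if ch.isalnum())
        return squash(a) == squash(b)
```

With these helpers, the report above amounts to `encodings_agree(table_encoding('en_IN'), 'UTF-8')` being false on an affected Python version, where 'UTF-8' is what `locale.nl_langinfo(locale.CODESET)` returned.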
A test is needed for a few common locales (en_IN, ru_RU) and maybe for unusual locales (uz_uz, uz_uz@cyrillic). I would prefer to have a separate issue that updates the aliases table to glibc 2.24. |
I agree that it's reasonable to have glibc's aliases override While some changes obviously make sense (e.g. 'ca_AD.ISO8859-1' to 'ca_AD.ISO8859-15'), others are less clear (e.g. 'cy_GB.ISO8859-1' to 'cy_GB.ISO8859-14' or 'tg_TJ.KOI8-C' to 'tg_TJ.KOI8-T' or several of the moves from ISO encodings to UTF-8). Is there some reference for why glibc chose different values than X.org for these? I also don't understand why some "xx.utf-8" locale mappings were removed - I don't think we should remove those, unless they are no longer needed due to some other logic implying these mappings. Since these are major changes, we need an appropriate warning in the NEWS file (and the "What's New" document), an update of the top comment (under "### Database") to mention that the glibc database takes precedence and where to find it, |
This looks like just fixing an error. The default West-European ISO8859-1 is changed to the Celtic cy_GB.ISO8859-14. This looks like a better option for Welsh.
KOI8-C is not supported by Python, but KOI8-T is. I don't know what KOI8-C means; there are several rarely used, incompatible encodings with this name.
The aliases table is a table of exceptions. Removed entries are no longer exceptional. |
On 07.03.2017 18:23, Serhiy Storchaka wrote:
While all this may make sense, I'm missing some more reasoning
This change also looks strange:
- 'ka_ge': 'ka_GE.GEORGIAN-ACADEMY',
+ 'ka_ge': 'ka_GE.GEORGIAN_PS',
'ka_ge.georgianacademy': 'ka_GE.GEORGIAN-ACADEMY',
'ka_ge.georgianps': 'ka_GE.GEORGIAN-PS',
'ka_ge.georgianrs': 'ka_GE.GEORGIAN-ACADEMY',
Why is GEORGIAN_PS written with an underscore whereas the other
Or this one:
Why would a locale switch away from an encoding having Or why is this latin variant removed:
Why should Russians switch back to ISO?
or from ISO to KOI?
The more I look at these changes, the more I believe we
It's not a table of exceptions, it's a table mapping commonly
But regardless, I checked the code and it is already
In some cases, it's probably better to drop the ".utf8"
+ 'bhb_in.utf8': 'bhb_IN.UTF-8',
or
though I'd expect that mapping to be:
as for all other "de" entries. |
Why is the X11 locale alias map used at all? It seems like it can only create confusion with libc. |
Not all platforms use glibc 2.24 as libc. Ideally, most of the entries should not even exist. We should ask libc for the default encoding if it is not included in the locale name. The aliases table should be used only for mapping commonly used locale names that libc does not support to names that it does support. |
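Asking libc for the encoding of a named locale could look roughly like the sketch below. `libc_codeset` is a hypothetical helper, not an existing function in the locale module. It relies on the "set the locale temporarily, query, restore" approach, so it is not thread-safe, and it assumes a platform where `locale.nl_langinfo` is available (it is not on Windows).

```python
import locale


def libc_codeset(name):
    """Return the codeset libc uses for locale `name`, or None if libc
    does not support that locale. Changes LC_CTYPE temporarily."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query only, no change
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return locale.nl_langinfo(locale.CODESET)
    except locale.Error:
        return None  # locale not installed or not known to libc
    finally:
        # Put the previous LC_CTYPE setting back in every case.
        locale.setlocale(locale.LC_CTYPE, saved)
```

The result depends on which locales are installed and how they were compiled on the machine, which is the system dependence discussed in the following comments.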
On 08.03.2017 08:20, Serhiy Storchaka wrote:
True. Many don't even use glibc.
I think you have a wrong understanding of what this alias table The alias table is there to avoid having to go to the lib C I know that Python still plays the usual "save current locale, Regarding the patch: we cannot simply use the output from the E.g. this entry in the table is clearly a typo:
(it should read en_ZW.UTF-8) This entry appears wrong as well:
(XX is not a valid country ISO code) How should we go about this? Mark all the problems in the PR? |
The problem is that the table can give incorrect results on non-Linux platforms (or on Linux with an old glibc). |
On 08.03.2017 07:27, Benjamin Peterson wrote:
Because it was the only such maintained mapping available at the |
On 08.03.2017 10:37, Serhiy Storchaka wrote:
Sure, it's a best effort approach. Also note that on today's systems you often don't have the full set of Our locale database works on all these systems, regardless of |
Why was the PR merged while we were still discussing it? |
"eo_XX" is just something that appears in the X11 locale.alias file. My change doesn't add that; it was already there (for Esperanto, which I suppose explains the "XX"). Most of the changes you identify are the glibc aliases taking precedence over the X11 ones, e.g., glibc has "fi_FI ISO-8859-1" while the X11 locale list has "fi_FI.ISO8859-15". That seems correct to me as far as the intent of this change is concerned. How do you propose to pick and choose what we use from the X11 locale alias list? |
On 09.03.2017 08:15, Benjamin Peterson wrote:
Yes, I know. That was an example of a bug in the X.org list.
No, it's not correct. ISO-8859-1 is the older version of Latin-1
We have to go through the list one by one to check whether This will be difficult in a few cases where the glibc mapping My take on this is that the X.org folks know better than the Also note that you are parsing the SUPPORTED file from https://github.com/bminor/glibc/blob/master/localedata/SUPPORTED This file does not provide a locale alias mapping as https://github.com/bminor/glibc/blob/73dfd088936b9237599e4ab737c7ae2ea7d710e1/localedata/Makefile In glibc you can define both the locale and the encoding separately As such, I don't see how you can derive a default alias It's simply an indication of what glibc would have installed So the file doesn't even provide a hint at what could Here's the history: https://github.com/bminor/glibc/commits/master/localedata/SUPPORTED It's merely a list of additions and removals from the Overall, I believe the file is pretty useless to use as On the other hand, you have the locale.alias master file: https://cgit.freedesktop.org/xorg/lib/libX11/tree/nls/locale.alias.pre together with the history of why changes were made and when. I'd suggest making the override optional in makelocalealias.py If you absolutely want to parse the glibc file per default as |
Originally only the X11 locale alias map was used. The support of the glibc locale alias map was added 2.5 years ago (bpo-20079). |
The SUPPORTED file from glibc is used for determining the default encoding for locales that don't include it explicitly. For example en_IN uses UTF-8 rather than ISO8859-1. |
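To make that reading of the SUPPORTED file concrete, here is a minimal sketch. It assumes the layout of the glibc source file, with one `name/charmap \` entry per line, and it treats the charmap listed for a name without an explicit encoding as that locale's default. That interpretation is exactly what is disputed in this thread. `default_charmaps` is a hypothetical helper and is not the code used by makelocalealias.py.

```python
def default_charmaps(supported_text):
    """Map bare locale names (no '.encoding' part) to the charmap listed
    for them in SUPPORTED-style text."""
    defaults = {}
    for raw in supported_text.splitlines():
        # Drop surrounding whitespace and the trailing line-continuation '\'.
        line = raw.strip().rstrip("\\").strip()
        if not line or line.startswith("#") or "/" not in line:
            continue  # blank line, comment, or the SUPPORTED-LOCALES= header
        name, charmap = line.split("/", 1)
        if "." in name:
            continue  # encoding is already explicit in the name
        defaults.setdefault(name, charmap)
    return defaults


# Illustrative excerpt in the assumed format.
sample = (
    "# comment\n"
    "SUPPORTED-LOCALES=\\\n"
    "en_IN/UTF-8 \\\n"
    "en_US.UTF-8/UTF-8 \\\n"
    "en_US/ISO-8859-1 \\\n"
)
print(default_charmaps(sample))
```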
On 09.03.2017 11:47, Serhiy Storchaka wrote:
No, the glibc locales don't say anything about default encodings http://manpages.ubuntu.com/manpages/wily/en/man5/locale.5.html These encodings are just used for determining the default glibc does have a locale.alias file: https://github.com/bminor/glibc/blob/73dfd088936b9237599e4ab737c7ae2ea7d710e1/intl/locale.alias which uses the X.org format, but this is completely out of Serhiy: If you believe that there's anything authoritative about It doesn't help, trying to interpret things into such build |
The original issue is bpo-29571. The locale module returned encoding ISO8859-1 for locale en_IN (as in the X11 locale alias map), but glibc uses UTF-8 (as in the glibc SUPPORTED file). |
Do you believe this program should work?

import locale, os
for l in open("/usr/share/i18n/SUPPORTED"):
    alias, encoding = l.strip().split()
    locale.setlocale(locale.LC_ALL, alias)
    try:
        enc = locale.getlocale()[1]
    except ValueError:
        continue # not in table
    normalized = enc.replace("ISO", "ISO-"). \
                     replace("_", "-"). \
                     replace("euc", "EUC-"). \
                     replace("big5", "big5-").upper()
    assert normalized == locale.nl_langinfo(locale.CODESET)

After my change it does: the encoding returned from getlocale() is the one actually being used by glibc. It fails dramatically on earlier versions of Python (for example, on the en_IN example from bpo-29571). I don't understand why Python needs to editorialize whatever choices libc or the system administrator has made. Is getlocale() expected to return something different from the underlying C locale? In fact, why have this table at all instead of using nl_langinfo to return the encoding for the current locale? |
On 10.03.2017 08:37, Benjamin Peterson wrote:
Your program essentially tests what alias is configured What we want in Python is a consistent mapping of aliases to locales Also note that a lot of these discussions are really academic, While Unix gravitates to UTF-8 for all system related things, This is why defaulting to UTF-8 for locales (as e.g. So to answer your question: No, I don't believe that SUPPORTED The SUPPORTED file can serve as an extra resource for fixing bugs
getlocale() will return whatever is currently configured via Of course, it can return something different from what some glibc
The table is meant to normalize locale names and enrich Regarding nl_langinfo(): nl_langinfo() will only work if you have called If you don't have a problem with calling setlocale() for Going forward, I think that the following changes make
As for the other changes: please undo them and also We can re-add some of the modifications later on if there's Thanks, Marc-Andre Lemburg |
I feel there is something wrong with the current locale design. See issues bpo-504219, bpo-10466, bpo-20088, bpo-25191, bpo-29571. |
I'm still confused about what getlocale() is supposed to do. Why do we attempt to return an encoding anyway if the underlying setlocale call doesn't return one? Is getlocale() not supposed to be a simple wrapper over the C locale? If not, how is one supposed to get the encoding associated with the C locale? The old alias table code meant that the encoding returned from getlocale() could be related to or completely unrelated to the actual C locale. Misunderstanding this results in issues like bpo-29571. |
The main purpose of the alias table is to support normalization, and this is used for getdefaultencoding(), which was created to be able to determine the default encoding based on what X.org uses as default without doing temporary setlocale() tricks. Now, normalization also happens when passing a locale value to the underlying setlocale(), mainly to avoid many common bugs due to setlocale() being extremely picky about the locale value. A side effect of this is that normalization will also kick in to add the encoding in case no encoding is given in the parameter. Note that no normalization is necessary to simply set the default locale configured on the system. In such a case, you'd run setlocale('LC_ALL') and get what's configured. If you run the lib C setlocale() with a locale without encoding, the encoding used depends entirely on what's configured on the system. The SUPPORTED file only gives a hint at what glibc thinks it should install per default, but any admin or distributor could change these settings simply by running localedef with some other encoding (charmap in locale speak). I suppose that we could resolve some of the confusion by adding a parameter to disable this normalization in setlocale(). |
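The side effect described above, normalization adding an encoding that the caller never supplied, can be observed without touching the process locale. This is only a sketch: `normalization_adds_encoding` is a hypothetical helper name, and the encoding it reports comes from whatever alias table ships with the running Python version.

```python
import locale


def normalization_adds_encoding(name):
    """True if locale.normalize() turns a bare name into 'name.ENCODING'."""
    return "." not in name and "." in locale.normalize(name)


for candidate in ("en_US", "en_US.UTF-8", "zz_unknown"):
    # Names missing from the alias table come back unchanged.
    print(candidate, "->", locale.normalize(candidate),
          normalization_adds_encoding(candidate))
```

Since `locale.normalize()` only consults the table, the encoding it adds can differ from the one libc would pick for the same bare name.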
Hi all, The locale data in the latest Ubuntu 18.04 contains en_IL as a valid locale, but Python cannot resolve it. en_IL has significant impact because it is an English locale and is now supported in the latest Ubuntu. Is there any plan to add only en_IL? (Note that I've already created the PR: #6707.)
|
Benjamin's patch did two things: 1) made the glibc alias table take precedence over the X11 one; 2) updated the alias mapping from a newer glibc. The first part is controversial, but updating the alias mapping from a newer glibc is done regularly. PR 6708 updates it with glibc 2.27. This adds 39 new aliases and fixes bpo-32781 and bpo-33432. |
Thanks, Serhiy. |
I believe we can close this old issue. The discussion was certainly a useful one. I guess we should stop updating the alias table automatically and instead add new aliases or change existing ones based on more research and using the X11 files as well as glibc and other resources to help. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.