[Windows] test_locale.TestMiscellaneous.test_getsetlocale_issue1813() fails #82126
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
assignee = 'https://github.com/tjguk'
closed_at = <Date 2021-03-31.12:08:34.105>
created_at = <Date 2019-08-25.17:33:21.926>
labels = ['type-bug', '3.8', '3.9', '3.10', 'library', 'OS-windows']
title = '[Windows] test_locale.TestMiscellaneous.test_getsetlocale_issue1813() fails'
updated_at = <Date 2021-03-31.22:47:06.662>
user = 'https://github.com/tjguk'

activity = <Date 2021-03-31.22:47:06.662>
actor = 'vstinner'
assignee = 'tim.golden'
closed = True
closed_date = <Date 2021-03-31.12:08:34.105>
closer = 'vstinner'
components = ['Library (Lib)', 'Windows']
creation = <Date 2019-08-25.17:33:21.926>
creator = 'tim.golden'
dependencies =
files = ['48561']
hgrepos =
issue_num = 37945
keywords = ['patch']
message_count = 38.0
messages = ['350466', '350470', '350471', '350485', '350491', '350510', '350548', '350549', '350559', '350568', '350569', '350571', '350573', '350574', '350598', '350820', '350823', '351788', '361804', '361806', '379068', '382739', '389005', '389657', '389661', '389784', '389839', '389867', '389883', '389884', '389885', '389886', '389888', '389890', '389891', '389892', '389938', '389940']
nosy_count = 16.0
nosy_names = ['terry.reedy', 'paul.moore', 'db3l', 'vstinner', 'tim.golden', 'jkloth', 'ned.deily', 'methane', 'guy.linton', 'lukasz.langa', 'zach.ware', 'serhiy.storchaka', 'eryksun', 'steve.dower', 'miss-islington', 'xtreak']
pr_nums = ['25110', '25112', '25113']
priority = None
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue37945'
versions = ['Python 3.8', 'Python 3.9', 'Python 3.10']
On a Win10 machine I'm consistently seeing test_locale (and test__locale) fail. I'll attach pythoninfo.
```
Traceback (most recent call last):
  File "C:\Users\tim\work-in-progress\cpython\lib\test\test_locale.py", line 531, in test_getsetlocale_issue1813
    locale.setlocale(locale.LC_CTYPE, loc)
  File "C:\Users\tim\work-in-progress\cpython\lib\locale.py", line 604, in setlocale
    return _setlocale(category, locale)
locale.Error: unsupported locale setting
```
Ok; so basically this doesn't work:
It gives "locale.Error: unsupported locale setting" which comes from https://github.com/python/cpython/blob/master/Modules/_localemodule.c#L107
(For locale.getdefaultlocale() you could substitute locale.getlocale() or simply ("en_GB", "cp1252")). On my machine it raises that exception on Python 2.7.15, 3.6.6 and on master.
Interestingly, none of the other tests in test_locale appear to exercise the 2-tuple 2nd param to setlocale. When you call setlocale and it returns the previous setting, it's a single string, eg "en_GB" etc. Passing that back in works. But when you call getlocale, it returns the 2-tuple, eg ("en_GB", "cp1252"). But all the other tests use the setlocale-returns-current trick for their setup/teardown.
I've quickly tested on 3.5 on Linux and the 2-tuple version works ok. I assume it's working on buildbots or we'd see the Turkish test failing every time. So is there something different about my C runtime, I wonder?
Just to save you looking, the code in https://github.com/python/cpython/blob/master/Modules/_localemodule.c#L107 converts the 2-tuple to lang.encoding form so the C module is seeing "en_GB.cp1252"
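To see exactly what the C module receives, the conversion can be reproduced with the stdlib's private `_build_localename` helper (shown for illustration only; it is not a public API):

```python
import locale

# _build_localename joins a (language, encoding) tuple into the single
# string handed to C setlocale(); with encoding=None it returns just
# the language part.
print(locale._build_localename(('en_GB', 'cp1252')))  # en_GB.cp1252
print(locale._build_localename(('en_GB', None)))      # en_GB
```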
locale.normalize is generally wrong in Windows. It's meant for POSIX systems. Currently "tr_TR" is parsed as follows:
```
>>> locale._parse_localename('tr_TR')
('tr_TR', 'ISO8859-9')
```
The encoding "ISO8859-9" is meaningless to Windows. Also, the old CRT only ever supported either full language/country names or non-standard abbreviations -- e.g. either "Turkish_Turkey" or "trk_TUR". Having locale.getdefaultlocale() return ISO two-letter codes (e.g. "en_GB") was fundamentally wrong for the old CRT. (2.7 will die with this wart.)
3.5+ uses the Universal CRT, which does support standard ISO codes, but only in BCP 47 locale names of the following form:
BCP 47 locale names have been preferred by Windows for the past 13 years, since Vista was released. Windows extends BCP 47 with a non-standard sort-order field (e.g. "de-Latn-DE_phoneb" is the German language with Latin script in the region of Germany with phone-book sort order). Another departure from strict BCP 47 in Windows is allowing underscore to be used as the delimiter instead of hyphen.
In a concession to existing C code, the Universal CRT also supports an encoding suffix in BCP 47 locales, but this can only be either ".utf-8" or ".utf8". (Windows itself does not support specifying an encoding in a locale name, but it's Unicode anyway.) No other encoding is allowed. If ".utf-8" isn't specified, a BCP 47 locale defaults to the locale's ANSI codepage. However, there's no way to convey this in the locale name itself. Also, if a locale is Unicode only (e.g. Hindi), the CRT implicitly uses UTF-8 even without the ".utf-8" suffix.
The following are valid BCP 47 locale names in the CRT: "tr", "tr.utf-8", "tr-TR", "tr_TR", "tr_TR.utf8", or "tr-Latn-TR.utf-8". But note that "tr_TR.1254" is not supported.
The following shows that omitting the optional "utf-8" encoding in a BCP 47 locale makes the CRT default to the associated ANSI codepage.
```
>>> locale.setlocale(locale.LC_CTYPE, 'tr_TR')
'tr_TR'
>>> ucrt.___lc_codepage_func()
1254
```
C ___lc_codepage_func() queries the codepage of the current locale. We can directly query this codepage for a BCP 47 locale via GetLocaleInfoEx:
```
>>> cpstr = (ctypes.c_wchar * 6)()
>>> kernel32.GetLocaleInfoEx('tr-TR',
...     LOCALE_IDEFAULTANSICODEPAGE, cpstr, len(cpstr))
5
>>> cpstr.value
'1254'
```
If the result is '0', it's a Unicode-only locale (e.g. 'hi-IN' -- Hindi, India). Recent versions of the CRT use UTF-8 (codepage 65001) for Unicode-only locales:
```
>>> locale.setlocale(locale.LC_CTYPE, 'hi-IN')
'hi-IN'
>>> ucrt.___lc_codepage_func()
65001
```
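The decision rule above (use the ANSI codepage unless the locale is Unicode-only) can be captured in a few lines of pure Python. `encoding_from_ansi_codepage` is a hypothetical name, and the actual GetLocaleInfoEx lookup is omitted:

```python
def encoding_from_ansi_codepage(cp: str) -> str:
    """Hypothetical helper: GetLocaleInfoEx returns the default ANSI
    codepage as a string, where '0' marks a Unicode-only locale; recent
    CRTs then fall back to UTF-8 (codepage 65001)."""
    if cp == '0':
        return 'utf-8'   # Unicode-only locale, e.g. 'hi-IN'
    return 'cp' + cp     # e.g. '1254' -> 'cp1254'

print(encoding_from_ansi_codepage('1254'))  # cp1254
print(encoding_from_ansi_codepage('0'))     # utf-8
```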
Here are some example locale tuples that should be supported, given that the CRT continues to support full English locale names and non-standard abbreviations, in addition to the new BCP 47 names:
The return value from C setlocale can be normalized to replace hyphen delimiters with underscores, and "utf8" can be normalized as "utf-8". If it's a BCP 47 locale that has no encoding, GetLocaleInfoEx can be called to query the ANSI codepage. UTF-8 can be assumed if it's a Unicode-only locale.
As to prefixing a codepage with 'cp', we don't really need to do this. We have aliases defined for most, such as '1252' -> 'cp1252'. But if the 'cp' prefix does get added, then the locale module should at least know to remove it when building a locale name from a tuple.
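The normalization proposed above can be sketched as a small pure function. This is a hypothetical illustration, not the stdlib implementation; the GetLocaleInfoEx step for encoding-less BCP 47 locales is omitted:

```python
def normalize_crt_result(name: str) -> str:
    """Sketch: normalize the string returned by C setlocale -- hyphen
    delimiters become underscores, and a 'utf8' suffix becomes
    'utf-8'. (Querying the ANSI codepage for a BCP 47 locale with no
    encoding is left out.)"""
    loc, _, enc = name.partition('.')
    loc = loc.replace('-', '_')
    if enc.lower() == 'utf8':
        enc = 'utf-8'
    return f'{loc}.{enc}' if enc else loc

print(normalize_crt_result('tr-TR.utf8'))  # tr_TR.utf-8
print(normalize_crt_result('de-Latn-DE'))  # de_Latn_DE
```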
test_getsetlocale_issue1813 fails for me as well. I can't imagine how setlocale(LC_CTYPE, "tr_TR.ISO8859-9") would succeed with recent versions of the Universal CRT in Windows. It parses "tr_TR" as a BCP 47 locale name, which only supports UTF-8 (e.g. "tr_TR.utf-8") and implicit ANSI (e.g. "tr_TR"). Plus "ISO8859-9" in general isn't a supported encoding of the form ".<codepage>", ".ACP" (ANSI), ".utf8", or ".utf-8".
With the old CRT (2.x and <=3.4) and older versions of the Universal CRT, the initial locale.setlocale(locale.LC_CTYPE, 'tr_TR') call fails as an unsupported locale, so the test is skipped:
The old CRT only supports "trk_TUR", "trk_Turkey", "turkish_TUR", and "turkish_Turkey".
I agree that that could be a fix. And certainly, if it turns out that this could never have (recently) worked as Eryk is suggesting, then let's go for it.
But I still have this uneasy feeling that it's not failing on the buildbots and I can't see any sign of a skipped test in the test stdio. I just wonder whether there's something else at play here.
I pushed a custom buildbot run that only runs this test in verbose mode, and it looks like the test is being skipped some other way?
I don't see any evidence there that it's running at all, though I do on my own machine.
Perhaps one of the other buildbot settings causes it to run in a different order and something skips the entire class? I haven't dug in enough to figure that out yet.
We get into trouble with test_getsetlocale_issue1813 because normalize() maps "tr_TR" (supported) to "tr_TR.ISO8859-9" (not supported).
```
>>> locale.normalize('tr_TR')
'tr_TR.ISO8859-9'
```
We should skip normalize() in Windows. It's based on a POSIX locale_alias mapping that can only cause problems. The work for normalizing locale names in Windows is best handled inline in _build_localename and _parse_localename.
For the old long form, C setlocale always returns the codepage encoding (e.g. "Turkish_Turkey.1254") or "utf8", so that's simple to parse. For BCP 47 locales, the encoding is either "utf8" or "utf-8", or nothing at all. For the latter, there's an implied legacy ANSI encoding. This is used by the CRT wherever we depend on byte strings, such as in time.strftime:
```
>>> locale.setlocale(locale.LC_CTYPE, 'en_GB')
'en_GB'
>>> time.strftime("\u0100")
'A'
```
```
>>> locale.setlocale(locale.LC_CTYPE, 'en_GB.utf-8')
'en_GB.utf-8'
>>> time.strftime("\u0100")
'Ā'
```
(We should switch back to using wcsftime if possible.)
The implicit BCP-47 case can be parsed as
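As a sketch, the implicit-encoding BCP 47 form described above (ISO 639 language, optional script, optional region, with '-' or '_' as the delimiter) could be matched with a regex along these lines. This is an illustrative assumption, not the pattern from any actual patch:

```python
import re

# Language: 2-3 lowercase letters; optional script: 4 letters with an
# initial capital (e.g. Latn); optional region: 2 uppercase letters or
# 3 digits. Either '-' or '_' is accepted, per the ucrt extension.
BCP47_IMPLICIT = re.compile(
    r'^[a-z]{2,3}'
    r'(?:[-_][A-Z][a-z]{3})?'        # optional script, e.g. Latn
    r'(?:[-_](?:[A-Z]{2}|\d{3}))?$'  # optional region, e.g. TR or 029
)

print(bool(BCP47_IMPLICIT.match('tr_TR')))           # True
print(bool(BCP47_IMPLICIT.match('tr-Latn-TR')))      # True
print(bool(BCP47_IMPLICIT.match('Turkish_Turkey')))  # False
```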
For _build_localename, we can't simply limit the encoding to UTF-8. We need to support the old long/abbreviated forms (e.g. "trk_TUR", "turkish_Turkey") in addition to the newer BCP 47 locale names. In the old form we have to support the following encodings:
(The CRT documentation says "OEM" should also be supported, but it's not.)
A locale name can also omit the language in the old form -- e.g. (None, "ACP") or (None, "cp1254"). The CRT uses the current language in this case. This is discouraged because the result may be nonsense.
Oh yeah, that locale_alias table is useless on Windows :(
But at least the function is documented in such a way that we can change it: "The returned locale code is formatted for use with :func:`setlocale`."
Alternatively, we could make setlocale() do its own normalization step on Windows and ignore (or otherwise validate/reject) the encoding.
None of that explains why the test doesn't seem to run at all on the buildbots though.
Are the buildbots using an older version of UCRT? BCP 47 locales used to strictly require a hyphen as the delimiter (e.g. 'tr-TR') instead of underscore (e.g. 'tr_TR'). Supporting underscore and UTF-8 are relatively recent additions that aren't documented yet. Even WINAPI GetLocaleInfoEx supports underscore as the delimiter now, which is also undocumented behavior.
Oh man, this is too broken for me to think about today...
If someone feels like writing a Windows-specific normalize() function to totally replace the Unix one, feel free, but it looks like we won't be able to get away with anything less. The "easy" change breaks a variety of other tests.
This feels like one of those changes where what's in place is clearly flawed but any change seems like it'll break stuff which people have had in place for years.
I'll try to look at a least-breaking change but I'm honestly not sure what that would look like.
Here's some additional background information for work on this issue.
A Unix locale identifier has the following form:
(X/Open Portability Guide, Issue 4, 1992 -- aka XPG4)
Some systems also implement "C.UTF-8".
The language and territory should use ISO 639 and ISO 3166 alpha-2 codes. The "@" modifier may indicate an alternate script such as "sr_RS@latin" or an alternate currency such as "de_DE@euro". For the optional codeset, IANA publishes the following table of character sets:
In Debian Linux, the available encodings are defined by mapping files in "/usr/share/i18n/charmaps". But encodings can't be arbitrarily used in locales at run time. A locale has to be generated (see "/etc/locale.gen") before it's available.
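For reference, the XPG4 syntax is language[_territory][.codeset][@modifier]; a small regex sketch (not the stdlib's actual parser) can pull the pieces apart:

```python
import re

# XPG4 locale identifier: language[_territory][.codeset][@modifier]
XPG4 = re.compile(
    r'^(?P<language>[a-z]{2,3})'
    r'(?:_(?P<territory>[A-Z]{2}))?'
    r'(?:\.(?P<codeset>[\w-]+))?'
    r'(?:@(?P<modifier>\w+))?$'
)

m = XPG4.match('sr_RS@latin')
print(m.group('language'), m.group('territory'), m.group('modifier'))
# sr RS latin
print(XPG4.match('de_DE.UTF-8@euro').group('codeset'))  # UTF-8
```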
A Windows (not ucrt) locale name has the following form:
The invariant locale provides stable data. The system and user default locales vary according to the Control Panel "Region" settings.
A locale name is based on BCP 47 language tags, with the form "<language>-<script>-<region>" (e.g. "en-Latn-GB"), for which the script and region codes are optional. The language is an ISO 639 alpha-2 or alpha-3 code, with alpha-2 preferred. The script is an initial-uppercase ISO 15924 code. The region is an ISO 3166-1 alpha-2 or numeric-3 code, with alpha-2 preferred.
As specified, the sort-order code should be delimited by an underscore, but Windows 10 (maybe older versions also?) accepts a hyphen instead. Here's a list of the sort-order codes that I've seen:
One final note of interest about Windows locales is that the user-interface language has been functionally isolated from the locale. The display language is handled by the Multilingual User Interface (MUI) API, which depends on .mui files in locale-named subdirectories of a binary, such as "kernel32.dll" -> "en-US\kernel32.dll.mui". Windows 10 has an option to configure the user locale to match the preferred display language. This helps to keep the two in sync, but they're still functionally independent.
The Universal CRT (ucrt) in Windows supports the following syntax for a locale identifier:
NULL is used with setlocale to query the current value of a category. The empty string "" is the current-user locale. "C" is a minimal locale. For LC_CTYPE, "C" uses Latin-1, but for LC_TIME it uses the system ANSI codepage (possibly multi-byte), which can lead to mojibake. The "POSIX" locale is not supported, nor is "C.UTF-8".
Note that UTF-8 support is relatively new, as is the ability to set the encoding without also specifying a region (e.g. "english.utf8").
Recent versions of ucrt extend BCP-47 support in a couple of ways. Underscore is allowed in addition to hyphen as the tag delimiter (e.g. "en_GB" instead of "en-GB"), and specifying UTF-8 as the encoding (and only UTF-8) is supported. If UTF-8 isn't specified, internally the locale defaults to the language's ANSI codepage. ucrt has to parse BCP 47 locales manually if they include an encoding, and also in some cases when underscore is used. Currently this fails to handle a sort-order tag, so we can't use, for example, "de_DE_phoneb.utf8".
If normalize() is implemented for Windows, then the tests should be split out into POSIX and Windows versions. Currently, most of the tests in NormalizeTest are not checking a result that's properly normalized for ucrt.
A useful implementation of locale.normalize should allow a script to use ("en_US", "iso8859_1") in Windows without having to know that Latin-1 is Windows codepage 28591, or that ucrt requires a classic locale name if the encoding isn't UTF-8. The required result for setlocale() is "English_United States.28591".
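A minimal sketch of that mapping step, assuming a tiny alias table (a real one would be much larger, and the English long name would come from GetLocaleInfoEx rather than being passed in; `to_classic_crt_locale` and the table name are hypothetical):

```python
import codecs

# Illustrative subset only: canonical Python codec names mapped to
# Windows codepage numbers.
WINDOWS_CODEPAGE_ALIAS = {
    'iso8859-1': '28591',
    'utf-8': '65001',
    'cp1252': '1252',
}

def to_classic_crt_locale(english_name: str, encoding: str) -> str:
    """Hypothetical helper: build a classic ucrt locale string such as
    'English_United States.28591' from an English long locale name and
    a Python encoding name."""
    canonical = codecs.lookup(encoding).name  # 'iso8859_1' -> 'iso8859-1'
    return f'{english_name}.{WINDOWS_CODEPAGE_ALIAS[canonical]}'

print(to_classic_crt_locale('English_United States', 'iso8859_1'))
# English_United States.28591
```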
As far as aliases are concerned, at a minimum, we need to map "posix" and "c" to "C". We can also support "C.UTF-8" as "en_US.UTF-8". Do we need to support the Unix locale_alias mappings from X.org? If so, I suppose we could use a double mapping. First try the Unix locale_alias mapping. Then try that result in a windows_locale_alias mapping that includes additional mappings from Unix to Windows. For example:
Note that the last one doesn't currently work. "ca-ES-valencia" is a valid Windows locale name for the Valencian variant of Catalan (ca), which lacks an ISO 639 code of its own since it's officially (and somewhat controversially) designated as a dialect of Catalan. This is an unusual case that has a subtag after the region, which ucrt's manual BCP-47 parsing cannot handle. (It tries to parse "ES" as the script and "valencia" as an ISO 3166-1 country code.)
After mapping aliases, if the result still has "@" in it, normalize() should fail. We don't know what the "@" modifier means.
Otherwise, split the locale name and encoding parts. If the encoding isn't UTF-8, try to map it to a codepage. For this we need a windows_codepage_alias dict that maps IANA official and Python-specific encoding names to Windows codepages. Next, check the locale name via WINAPI IsValidLocaleName. If it's not valid, try replacing underscore with hyphen and check again. Otherwise assume it's a classic ucrt locale name. (It may not be valid, but implementing all of the work ucrt does to parse a classic locale name is too much I think.) If it's a valid Windows locale name, and we have a codepage encoding, then try to translate it as a classic ucrt locale name. This requires two WINAPI GetLocaleInfoEx calls to look up the English versions of the language and country name.
For me, the locale.getlocale() function is broken. Python attempts to guess too many things about the current locale. For example, it relies on a hard-coded locale.locale_encoding_alias dictionary which seems to come from X11!? In 2020, Wayland has replaced X11 and locales are updated frequently. This dictionary makes no sense on Windows. For example, 'ka_ge' is mapped to 'ka_GE.GEORGIAN-ACADEMY': "GEORGIAN-ACADEMY" is not an encoding; what is the purpose of this string?
I fixed dozens and dozens of locale issues and I never ever used locale.getlocale(). I never understood its purpose nor how it guesses the encoding.
I always use locale.setlocale(category) or locale.setlocale(category, None), which returns a simple string that I can pass back to locale.setlocale(category, string) to restore the locale when I need to temporarily change a locale.
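That save-and-restore pattern can be wrapped in a small context manager. A sketch; the 'C' locale is used in the example only because it is always available:

```python
import locale
from contextlib import contextmanager

@contextmanager
def temporary_locale(category, name):
    """Temporarily switch a locale category, keeping the exact string
    reported by C setlocale and passing it straight back to restore --
    never the getlocale() tuple."""
    old = locale.setlocale(category)  # query only, same as (category, None)
    try:
        yield locale.setlocale(category, name)
    finally:
        locale.setlocale(category, old)

with temporary_locale(locale.LC_CTYPE, 'C') as cur:
    print(cur)  # C
```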
This seemingly useless test is the only test failure for me with installed 3.9.1. Why keep it, at least on Windows?
The failure with "locale.Error: unsupported locale setting" is not limited to Windows. bpo-25191 and duplicate bpo-31636 report the same error on user machines (non-buildbot, as here). bpo-25191 proposes the following patch to skip this failure.
```diff
- locale.setlocale(locale.LC_CTYPE, loc)
- self.assertEqual(loc, locale.getlocale(locale.LC_CTYPE))
+ try:
+     locale.setlocale(locale.LC_CTYPE, loc)
+     self.assertEqual(loc, locale.getlocale(locale.LC_CTYPE))
+ except locale.Error:
+     # Unsupported locale setting
+     self.skipTest('unsupported locale setting')
```
I believe that this is effectively the same as deleting the test. But if we believe it is being skipped on at least Windows buildbots, then we should do the same at least on user Windows machines. Or, if we could detect user machines, skip on them.
I don't have much of a horse in the race, but since the test has historically been skipped on Windows, and the test hasn't and doesn't work on Windows, modifications to restore the skip behavior seem reasonable to me. The trigger for this issue was Windows adding support for underscore in locale names (like tr_TR) so the test began executing. But it's not a regression or new issue, it's just existing reality becoming exposed.
The user machine and buildbot discrepancy can be attributed to version differences, as the buildbot hadn't yet received the same underscore locale name support.
I'd be fine with removing the test entirely - always skipping on a failure just seems pointless. Then again, bpo-1813 created the test for a purpose on other systems, though even back then it appears it was complicated. Leaving the test but skipping known failing systems (I guess at least Windows and OpenBSD) might be slightly less intrusive of a change, assuming the test is still serving a purpose elsewhere.
Separately, there's a lot of useful/interesting detail here which could inform any eventual normalization changes on Windows, should the underlying issue be deemed worthy of addressing. But that seems like something that could be a distinct operation from clearing up the test issue.
In lieu of the patch in bpo-25191, what about a pair of skips to deal with the issues at hand without killing the test entirely? I'm including OpenBSD since those issues were closed in favor of this one, and am assuming that skipping there is also appropriate.
```diff
--- a/Lib/test/test_locale.py
+++ b/Lib/test/test_locale.py
@@ -552,6 +552,10 @@ def test_setlocale_category(self):
         # crasher from bug python/cpython#51668
         self.assertRaises(locale.Error, locale.setlocale, 12345)
 
+    @unittest.skipIf(sys.platform == 'win32',
+                     "Test broken on Windows (issue python/cpython#82126)")
+    @unittest.skipIf(sys.platform.startswith('openbsd'),
+                     "Test broken on OpenBSD (issues python/cpython#75817 and python/cpython#69378)")
     def test_getsetlocale_issue1813(self):
         # Issue python/cpython#46138: setting and getting the locale under a Turkish locale
         oldlocale = locale.setlocale(locale.LC_CTYPE)
```
On Windows 10 build 1903,
```
vstinner@WIN C:\vstinner\python\master>python -m test test_locale -m test_getsetlocale_issue1813 -v
Traceback (most recent call last):
  File "C:\vstinner\python\master\lib\test\test_locale.py", line 567, in test_getsetlocale_issue1813
    locale.setlocale(locale.LC_CTYPE, loc)
  File "C:\vstinner\python\master\lib\locale.py", line 610, in setlocale
    return _setlocale(category, locale)
locale.Error: unsupported locale setting
```
It's a bug in the weird locale.getlocale() function which produces a locale name which doesn't exist:
```
vstinner@WIN C:\vstinner\python\master>python
>>> import locale
>>> locale.setlocale(locale.LC_CTYPE, "tr_TR")
'tr_TR'
>>> locale.setlocale(locale.LC_CTYPE, None)
'tr_TR'
>>> locale.getlocale(locale.LC_CTYPE)
('tr_TR', 'ISO8859-9')
>>> locale.setlocale(locale.LC_CTYPE, ('tr_TR', 'ISO8859-9'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\vstinner\python\master\lib\locale.py", line 610, in setlocale
    return _setlocale(category, locale)
locale.Error: unsupported locale setting
>>> locale.setlocale(locale.LC_CTYPE, 'tr_TR')
'tr_TR'
```
If you use setlocale(LC_CTYPE, None) to get the locale, it works as expected.
IMO the getlocale() function is dangerous and should be removed: see bpo-43557 "Deprecate getdefaultlocale(), getlocale() and normalize() functions".
I wrote PR 25110 to simply skip the test if setlocale() fails. It fixes the issue on Windows (I tested manually; see my comment on the PR), but it should also fix the issue on OpenBSD and any platform where getlocale() returns a locale not accepted by setlocale().
Again, don't use getlocale(category); use setlocale(category, None).