test_re is failing when locale is set to en_IN
#73757
Comments
Description:
Traceback:
======================================================================
Traceback (most recent call last):
  File "/home/bigj/Jaysinh/cpython_git/cpython/Lib/test/test_re.py", line 1422, in test_locale_flag
    self.assertTrue(pat.match(bletter))
AssertionError: None is not true
Ran 120 tests in 2.079s
FAILED (failures=1, skipped=1)
1 test failed:
Total duration: 2 sec
Local value:
Operating system: Ubuntu 16.04 LTS (64 bit) |
I'm just wondering whether the problem is just due to the locale's encoding being UTF-8. The locale support in re really only works with encodings that use 1 byte/character. |
The locale encoding is ISO8859-1. This test is skipped on non-8-bit locales. This is a problem with the tests, not with the re module. I don't have a solution. |
The report says "== encodings: locale=UTF-8, FS=utf-8". It says that "test_locale_caching" was skipped, but also that "test_locale_flag" failed. |
Good point. The test used locale.getlocale(), which returned ('en_IN', 'ISO8859-1'). The following patch makes the test use locale.getpreferredencoding(False), the same encoding that was reported in the header of the test report. |
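To make the disagreement described above easier to see on a given machine, here is a minimal sketch that reports both values side by side. The helper names normalize_encoding and compare_locale_encodings are hypothetical (they are not part of the patch or of test_re); only locale.getlocale(), locale.getpreferredencoding() and codecs.lookup() are real standard-library calls.

```python
import codecs
import locale


def normalize_encoding(name):
    """Return the canonical codec name, or None if no name was given."""
    if name is None:
        return None
    try:
        return codecs.lookup(name).name
    except LookupError:
        # Unknown to Python's codec registry: fall back to a lowercase name.
        return name.lower()


def compare_locale_encodings():
    """Report both encoding guesses for LC_CTYPE and whether they agree."""
    try:
        from_getlocale = locale.getlocale(locale.LC_CTYPE)[1]
    except ValueError:
        # getlocale() raises ValueError for locale names it cannot parse.
        from_getlocale = None
    preferred = locale.getpreferredencoding(False)
    return {
        "getlocale": from_getlocale,
        "preferred": preferred,
        "agree": normalize_encoding(from_getlocale)
        == normalize_encoding(preferred),
    }


if __name__ == "__main__":
    # The result depends on the environment, e.g. on the LANG variable.
    print(compare_locale_encodings())
```

Running it once with LANG=en_IN and once with LANG=en_IN.UTF-8 should show whether the two functions agree on a given system; the names are normalized through the codec registry so that spellings such as "utf8" and "UTF-8" compare equal.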
Seriously? Not a GitHub pull request? ;-) (old habit?) |
I'm not experienced with git, and the devguide still doesn't look ready. |
I have a few folks hitting this at the PyCon Pune sprints, so I'm going to apply Serhiy's patch :) |
Looking into this at the PyCon Pune sprints, the problem appears to be arising due to the following difference in behaviour when the unqualified
$ LANG=en_IN.UTF-8 python3 -c "import locale; print(locale.getlocale(locale.LC_CTYPE), locale.getpreferredencoding(False), sep='\n')"
('en_IN', 'UTF-8')
UTF-8
$ LANG=en_IN python3 -c "import locale; print(locale.getlocale(locale.LC_CTYPE), locale.getpreferredencoding(False), sep='\n')"
('en_IN', 'ISO8859-1')
UTF-8
re.LOCALE is presumably picking up the "UTF-8" rather than the "ISO8859-1", and hence the test is failing. |
Yes, please push it Nick. |
This seems to have broken test_re on Windows, see https://ci.appveyor.com/project/python/cpython/build/3.7.0a0.1 I found this change to be the culprit via git bisect. Unfortunately, we didn't have any working CI on Windows (the buildbots were otherwise broken) at the time this was merged. |
Yep, I think we should merge #422 and revert ncoghlan's change. |
I'm not sure this will help on Windows. |
And I don't understand why my fix doesn't work on Windows. |
But the test was never broken on windows. On Sun, Mar 5, 2017, at 23:54, Serhiy Storchaka wrote:
|
getpreferredencoding() takes a completely different path on Windows
I'm with Serhiy on this one: if the "re" module isn't using locale.getpreferredencoding(), then there's something odd going on. It just sounds like the disconnect on Windows is the opposite of the one we hit on Linux without Benjamin's patch, perhaps due to the UTF-8 mode changes - it wouldn't surprise me to learn that the re module is still using mbcs there instead of utf-8. |
I don't see what's odd about it. re.LOCALE uses the C locale, which one |
Thanks for the explanation - given that, I agree that simply reverting the attempted test-based fix and instead relying on the bpo-20087 updates is the way to go. |
Hmm, even though we reverted the original test_re based change, and the initial attempted fix for bpo-20087 was also reverted, I'm still not currently seeing the failure for:
LANG=en_IN.utf8 ./python -m test -v test_re
I do have the locale installed, so it's not a result of falling back to the C locale and that getting coerced to C.UTF-8:
$ LANG=en_IN.utf8 locale -k currency_symbol
currency_symbol="₹"
Jaysinh, are you still seeing this test failure on a fresh checkout? |
Hmm, this actually works for me on Fedora 27 even if I go back to 1b3d88e, the commit just before the initially merged (and subsequently reverted) test change above. Unassigning, since I can't readily reproduce it myself. |
Hello Nick,
|
I've also added Matthias and Barry to the cc list, in case this does turn out to be a Debian or Ubuntu specific quirk. Restating the problem, the issue is that test_locale_flag in test_re may fail for at least the en_IN locale, and we're not sure yet whether that's a test bug, a locale module bug, or a distro bug:
LANG=en_IN ./python -m test -v test_re
We've only confirmed it on Ubuntu so far though - I haven't been able to reproduce it on Fedora, and Jaysinh hasn't been able to reproduce it since switching to Gentoo. |
A similar issue was reported on Debian 9.8 (stretch) with Python 3.7.2 and en_IN: bpo-36134 |
Ah, I can reproduce the bug on Fedora 29 using "LANG=en_IN ./python -m test -v test_re". The problem is that locale.getlocale() is not reliable: it pretends that the locale encoding is ISO8859-1, whereas the real encoding is UTF-8:
$ LANG=en_IN ./python
Python 3.8.0a2+ (heads/master:4cbea518a0, Feb 28 2019, 18:19:44)
>>> chr(224).encode('ISO8859-1')
b'\xe0'
>>> import _testcapi
>>> _testcapi.DecodeLocaleEx(b'\xe0', 0, 'strict')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: decode error: pos=0, reason=decoding error
# Wrong encoding
>>> locale.getlocale(locale.LC_CTYPE)
('en_IN', 'ISO8859-1')
>>> locale.setlocale(locale.LC_CTYPE, None)
'en_IN'
>>> locale._parse_localename('en_IN')
('en_IN', 'ISO8859-1')
# Real encoding
>>> locale.getpreferredencoding()
'UTF-8'
>>> locale.nl_langinfo(locale.CODESET)
'UTF-8'
The attached PR 12099 fixes the issue. |
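The session above relies on _testcapi, which is only available in CPython's own test builds. As a rough pure-Python analogue, the sketch below checks whether a letter maps to exactly one byte in a given codec, which is what an 8-bit locale test needs. The helper names are hypothetical, and Python's codecs are used here only as a stand-in for the C library's locale decoding.

```python
def encodes_to_single_byte(ch, encoding):
    """True if the character is representable as exactly one byte."""
    try:
        return len(ch.encode(encoding)) == 1
    except UnicodeEncodeError:
        return False


def decodes_as_single_char(byte, encoding):
    """True if the lone byte decodes to exactly one character."""
    try:
        return len(byte.decode(encoding)) == 1
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    letter = chr(224)  # 'à', the same code point used in the session above
    for enc in ("ISO8859-1", "UTF-8"):
        print(enc,
              encodes_to_single_byte(letter, enc),
              decodes_as_single_char(b"\xe0", enc))
```

In ISO8859-1 the letter is the single byte b'\xe0', while in UTF-8 it needs two bytes and a lone b'\xe0' is an incomplete sequence, which is consistent with the decode error shown in the session.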
It seems like the ANSI code page is 1252 ("cp1252").
== CPython 3.7.0a0 (master:d31b28e16a2387d0251df948ef5d1b33d4357652, Mar 5 2017, 21:47:06) [MSC v.1900 32 bit (Intel)]
...
FAIL: test_locale_flag (test.test_re.ReTests)
Traceback (most recent call last):
File "C:\projects\cpython\lib\test\test_re.py", line 1422, in test_locale_flag
self.assertTrue(pat.match(bletter))
AssertionError: None is not true
On my Windows 10 with Python 3.8, getpreferredencoding() (and getpreferredencoding(False)) returns "cp1252", and getlocale(LC_CTYPE)[1] returns "1252". Python has an alias "1252" for "cp1252". On Windows, getpreferredencoding() is implemented as _locale._getdefaultlocale()[1], and _getdefaultlocale()[1] is implemented with:
PyOS_snprintf(encoding, sizeof(encoding), "cp%d", GetACP());
In the end, it's the ANSI code page (1252).
--
I don't understand how the change ace5c0f introduced a regression, and so I don't understand how commit 21a7431 (the revert) could fix anything.
--
On my PR 12099, there were two Windows CI runs and both succeeded:
When the change ace5c0f was merged, Python had no working Windows CI. Things have evolved a lot in the meantime. I also manually tested my PR 12099 on my Windows 10 VM, which also uses cp1252: test_re passes.
--
The re.LOCALE flag of re.compile() for a bytes pattern uses the following function of Modules/_sre.c:
LOCAL(int)
char_loc_ignore(SRE_CODE pattern, SRE_CODE ch)
{
return ch == pattern
|| (SRE_CODE) sre_lower_locale(ch) == pattern
|| (SRE_CODE) sre_upper_locale(ch) == pattern;
} |
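To illustrate why the outcome of that comparison depends on the locale's character type, here is a small Python model of the C function above. It is only a sketch: make_case_funcs is a hypothetical helper that approximates the C library's tolower()/toupper() with a Python codec, which is an assumption and not how _sre.c works internally.

```python
def make_case_funcs(encoding):
    """Build byte-level lower/upper functions for a given codec.

    Assumption: a Python codec stands in for the C locale's character
    type. Bytes that cannot be converted are returned unchanged.
    """
    def convert(ch, method):
        try:
            text = bytes([ch]).decode(encoding)
            out = getattr(text, method)().encode(encoding)
        except (UnicodeDecodeError, UnicodeEncodeError):
            return ch
        # Keep the byte as-is when the result is not a single byte.
        return out[0] if len(out) == 1 else ch

    def lower(ch):
        return convert(ch, "lower")

    def upper(ch):
        return convert(ch, "upper")

    return lower, upper


def char_loc_ignore(pattern, ch, lower, upper):
    """Mirror of the C logic: match as-is, lowercased, or uppercased."""
    return ch == pattern or lower(ch) == pattern or upper(ch) == pattern


if __name__ == "__main__":
    for enc in ("iso8859-1", "utf-8"):
        lower, upper = make_case_funcs(enc)
        # 0xE0 is 'à' and 0xC0 is 'À' in ISO8859-1.
        print(enc, char_loc_ignore(0xC0, 0xE0, lower, upper))
```

With the single-byte codec the two bytes are case variants of each other and the model reports a match; with the UTF-8 stand-in neither byte is a complete character, so no case mapping applies. That is in line with the thread's observation that the test can only pass when the effective LC_CTYPE encoding is an 8-bit one.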
AppVeyor failed on the backport of my fix to Python 3.7: PR 12108. Ok, now I understand the bug in Python 3.7. locale.getlocale(locale.LC_CTYPE)[1] returns None because Python doesn't set LC_CTYPE to the user preferred locale. I'm not sure which locale is used in practice in that case, but at least I can say that None is not the expected encoding name... str.encode() and bytes.decode() use UTF-8 when None is passed as the encoding. locale.getpreferredencoding() returns 'cp1252', which is the ANSI code page. Python 3.8 is different. In bpo-34485, I modified Python 3.8 to set the LC_CTYPE locale to the user preference (ANSI code page):
--- |
I wrote C and Python code to check the effective encoding used by the LC_CTYPE locale before setlocale(LC_CTYPE, "") is called on Python 3.7. Result: Windows uses the Latin1 encoding. See the attached files: _testcapi.patch + loc.py, which produced loc.log (output). |
I don't understand the relationship with bpo-20087, so I removed the dependency. I fixed test_re in the 3.7 and master branches. I'm closing the issue. |