re.LOCALE is nonsensical for Unicode #66597
Comments
The current implementation of re.LOCALE support for Unicode strings is nonsensical. It works correctly only on Latin1 locales, because a Unicode string is interpreted as a Latin1-decoded bytes string and all characters outside the UCS1 range are considered non-word characters. On other locales it gives strange and useless results.
>>> import re, locale
>>> locale.setlocale(locale.LC_CTYPE, 'ru_RU.cp1251')
'ru_RU.cp1251'
>>> re.match(br'\w', 'µ'.encode('cp1251'), re.L)
<_sre.SRE_Match object; span=(0, 1), match=b'\xb5'>
>>> re.match(r'\w', 'µ', re.L)
<_sre.SRE_Match object; span=(0, 1), match='µ'>
>>> re.match(br'\w', 'ё'.encode('cp1251'), re.L)
<_sre.SRE_Match object; span=(0, 1), match=b'\xb8'>
>>> re.match(r'\w', 'ё', re.L)
The proposed patch fixes re.LOCALE support for Unicode strings. It uses the wide-character equivalents of the C character functions (towlower(), iswalpha(), etc.). The problem is that these functions do not exist in C89; they were introduced only in C99. GCC understands them, but we should check other compilers. However, these functions are already used on FreeBSD and MacOS. |
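A small sketch of why the two str calls above behave differently, assuming the explanation in the report (a str character is treated as if it were a Latin1 byte). It sets no locale and only compares code points with cp1251 bytes; `latin1_view` is a helper name made up for this note, not part of re or locale.

```python
# Illustration only: compares code points with cp1251 bytes; no locale is set.
def latin1_view(ch):
    """Return (code point, cp1251 byte, character that byte denotes in Latin1)."""
    raw = ch.encode("cp1251")
    return ord(ch), raw, raw.decode("latin-1")

for ch in "µё":
    code_point, raw, as_latin1 = latin1_view(ch)
    print(f"{ch!r}: U+{code_point:04X}, cp1251 byte {raw!r}, "
          f"same byte read as Latin1: {as_latin1!r}, "
          f"code point fits in one byte: {code_point <= 0xFF}")
```

'µ' has the same value (0xB5) as a code point and as a cp1251 byte, so the str and bytes checks happen to agree. 'ё' is U+0451, which does not fit in one byte, so under the behaviour described in the report it is treated as a non-word character, while its cp1251 byte 0xB8 is a letter in that locale.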
I don't think we should fix this in 2.x: some people may rely on the old behaviour, and it will be difficult for them to debug. |
Yes, one solution is to deprecate re.LOCALE for Unicode strings and then
Example.
>>> for a in 'Ii\u0130\u0131':
... for b in 'Ii\u0130\u0131':
... if a != b and re.match(a, b, re.I): print(a, '~', b)
...
I ~ i
I ~ İ
i ~ I
i ~ İ
İ ~ I
İ ~ i
This is an incorrect result in Turkish. Capital dotless "I" matches capital "İ"
Regex produces more relevant output, which includes matches for Turkish and
I ~ i
With locale tr_TR.utf8 (with the patch):
>>> for a in 'Ii\u0130\u0131':
... for b in 'Ii\u0130\u0131':
... if a != b and re.match(a, b, re.I|re.L): print(a, '~', b)
...
I ~ ı
i ~ İ
İ ~ i
ı ~ I
This is the correct result in Turkish. Therefore there is a use case for this feature. |
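For readers without a tr_TR locale installed, the Turkish pairing above can be sketched in pure Python. This is not what the patch does (the patch relies on the C library's wide-character functions); the `TURKISH_LOWER` table and `turkish_lower` helper are hypothetical names that hard-code only the two Turkish-specific mappings.

```python
# Hypothetical table: the two lowercase mappings that differ in Turkish.
TURKISH_LOWER = {
    "I": "\u0131",  # capital dotless I -> small dotless ı
    "\u0130": "i",  # capital dotted İ  -> small dotted i
}

def turkish_lower(ch):
    """Lowercase one character using Turkish rules for I and İ."""
    return TURKISH_LOWER.get(ch, ch.lower())

chars = "Ii\u0130\u0131"
pairs = [(a, b) for a in chars for b in chars
         if a != b and turkish_lower(a) == turkish_lower(b)]

for a, b in pairs:
    print(a, "~", b)
```

The loop prints the same four pairs as the patched session: I ~ ı, i ~ İ, İ ~ i, ı ~ I.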
Ha, I always forget about the Turkish locale case... |
Here is a simple patch which just deprecates use of the re.LOCALE flag with str patterns. It also deprecates use of the re.LOCALE flag together with the re.ASCII flag (with bytes patterns) and adds some re.LOCALE-related tests. |
If there are no objections I'll commit the re_deprecate_unicode_locale.patch patch. It would be good if someone reviewed the doc changes, though. |
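A rough way to check how a given interpreter treats the two combinations named above. This is a sketch, not taken from the patch: `locale_flag_status` is a made-up helper, and it assumes that interpreters with this change either emit a DeprecationWarning or, in later releases (3.6 and newer, as far as I know), raise ValueError.

```python
import re
import warnings

def locale_flag_status(pattern, flags):
    """Return 'rejected', 'deprecated' or 'accepted' for a pattern/flags pair."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        try:
            re.compile(pattern, flags)
        except ValueError:
            # Assumption: later releases turned the deprecation into an error.
            return "rejected"
    if any(issubclass(w.category, DeprecationWarning) for w in caught):
        return "deprecated"
    return "accepted"

print(locale_flag_status(r"\w", re.LOCALE))              # str pattern
print(locale_flag_status(br"\w", re.LOCALE | re.ASCII))  # bytes, conflicting flags
print(locale_flag_status(br"\w", re.LOCALE))             # bytes pattern only
```

Only the last call, a bytes pattern with re.LOCALE alone, is expected to come back as 'accepted'.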
Looks like revision 561d1d0de518 was to fix this issue, but the NEWS entry has the wrong reference number |
Indeed. Thank you Martin. |
New changeset abc7fe393016 by Serhiy Storchaka in branch 'default': |