re.LOCALE is nonsensical for Unicode #66597

Closed
serhiy-storchaka opened this issue Sep 14, 2014 · 9 comments
Assignees
Labels
expert-regex, expert-unicode, extension-modules (C modules in the Modules dir), stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)

Comments

@serhiy-storchaka
Member

serhiy-storchaka commented Sep 14, 2014

BPO: 22407
Nosy: @pitrou, @vstinner, @ezio-melotti, @vadmium, @serhiy-storchaka
Dependencies:
  • bpo-22838: Convert re tests to unittest
Files:
  • re_unicode_locale.patch
  • re_deprecate_unicode_locale.patch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2014-12-01.09:53:48.617>
    created_at = <Date 2014-09-14.15:43:18.738>
    labels = ['extension-modules', 'expert-regex', 'type-feature', 'library', 'expert-unicode']
    title = 're.LOCALE is nonsensical for Unicode'
    updated_at = <Date 2014-12-01.11:16:44.679>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2014-12-01.11:16:44.679>
    actor = 'python-dev'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2014-12-01.09:53:48.617>
    closer = 'serhiy.storchaka'
    components = ['Extension Modules', 'Library (Lib)', 'Regular Expressions', 'Unicode']
    creation = <Date 2014-09-14.15:43:18.738>
    creator = 'serhiy.storchaka'
    dependencies = ['22838']
    files = ['36615', '36853']
    hgrepos = []
    issue_num = 22407
    keywords = ['patch']
    message_count = 9.0
    messages = ['226871', '226949', '226959', '226960', '228876', '231022', '231924', '231927', '231931']
    nosy_count = 8.0
    nosy_names = ['pitrou', 'vstinner', 'ezio.melotti', 'mrabarnett', 'Arfrever', 'python-dev', 'martin.panter', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue22407'
    versions = ['Python 3.5']

    @serhiy-storchaka
    Member Author

    serhiy-storchaka commented Sep 14, 2014

    The current implementation of re.LOCALE support for Unicode strings is nonsensical. It works correctly only in Latin-1 locales, because a Unicode string is interpreted as a Latin-1 decoded bytes string and all characters outside the UCS1 range are treated as non-word characters. In other locales it produces strange and useless results.

    >>> import re, locale
    >>> locale.setlocale(locale.LC_CTYPE, 'ru_RU.cp1251')
    'ru_RU.cp1251'
    >>> re.match(br'\w', 'µ'.encode('cp1251'), re.L)
    <_sre.SRE_Match object; span=(0, 1), match=b'\xb5'>
    >>> re.match(r'\w', 'µ', re.L)
    <_sre.SRE_Match object; span=(0, 1), match='µ'>
    >>> re.match(br'\w', 'ё'.encode('cp1251'), re.L)
    <_sre.SRE_Match object; span=(0, 1), match=b'\xb8'>
    >>> re.match(r'\w', 'ё', re.L)
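
One way to see why the two str results in the session above differ, as a minimal sketch that takes the explanation given earlier at face value (str characters are classified by code point, as if they were Latin-1 bytes). The `describe` helper is hypothetical and only compares code points with cp1251 bytes; it does not call re or change the locale.

```python
# Illustration only: compare each character's Unicode code point with the
# byte it encodes to in cp1251. No regular expression or locale is involved.
def describe(ch):
    code_point = ord(ch)
    cp1251_byte = ch.encode("cp1251")[0]
    return {
        "code_point": code_point,
        "cp1251_byte": cp1251_byte,
        # Only code points below 0x100 can be looked up like a single byte.
        "fits_latin1": code_point < 0x100,
        # True when the code point happens to equal the locale's byte value.
        "same_value": code_point == cp1251_byte,
    }


for ch in "µё":
    print(repr(ch), describe(ch))
```

'µ' is U+00B5 and is also byte 0xB5 in cp1251, so a byte-style lookup happens to land on the right character. 'ё' is U+0451, which lies outside the 0-255 range, so it is treated as a non-word character even though its cp1251 byte 0xB8 is a letter in that locale.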

    The proposed patch fixes re.LOCALE support for Unicode strings. It uses the wide-character equivalents of the C character functions (towlower(), iswalpha(), etc.).

    The problem is that these functions do not exist in C89; they were introduced only in C99. GCC understands them, but we should check other compilers. However, these functions are already used on FreeBSD and MacOS.

    @serhiy-storchaka serhiy-storchaka added the type-bug (An unexpected behavior, bug, or error), stdlib (Python modules in the Lib dir), expert-regex, and extension-modules (C modules in the Modules dir) labels Sep 14, 2014
    @pitrou
    Member

    pitrou commented Sep 16, 2014

    I don't think we should fix this in 2.x: some people may rely on the old behaviour, and it will be difficult for them to debug.
    In 3.x, I simply propose we deprecate re.LOCALE for unicode strings and make it a no-op.

    @serhiy-storchaka
    Member Author

    serhiy-storchaka commented Sep 16, 2014

    Yes, one solution is to deprecate re.LOCALE for unicode strings and then
    make it incompatible with unicode strings. But I think it would be good to
    implement locale-aware matching.

    Example.

    >>> for a in 'Ii\u0130\u0131':
    ...     for b in 'Ii\u0130\u0131':
    ...         if a != b and re.match(a, b, re.I): print(a, '~', b)
    ... 
    I ~ i
    I ~ İ
    i ~ I
    i ~ İ
    İ ~ I
    İ ~ i

    This is an incorrect result for Turkish. Capital dotless "I" matches capital "İ"
    with a dot above, and small dotless "ı" doesn't match anything.

    The regex module produces more relevant output, which includes matches for both
    Turkish and English:

    I ~ i
    I ~ ı
    i ~ I
    i ~ İ
    İ ~ i
    ı ~ I

    With locale tr_TR.utf8 (with the patch):

    >>> for a in 'Ii\u0130\u0131':
    ...     for b in 'Ii\u0130\u0131':
    ...         if a != b and re.match(a, b, re.I|re.L): print(a, '~', b)
    ... 
    I ~ ı
    i ~ İ
    İ ~ i
    ı ~ I

    This is the correct result for Turkish.

    Therefore there is a use case for this feature.
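
The Turkish pairing listed above can be reproduced in pure Python, without the patch and without setting a locale. This is only a sketch of the casing rule itself: `turkish_lower` is a hypothetical helper that hard-codes the two Turkish exceptions, and it is not how the patch (which relies on the C wide-character functions) works.

```python
# Sketch of Turkish-style case-insensitive comparison for single characters.
# `turkish_lower` is a hypothetical helper, not part of re or of the patch.
TURKISH_LOWER = {
    "I": "\u0131",  # capital dotless I  -> small dotless ı
    "\u0130": "i",  # capital dotted İ   -> small dotted i
}


def turkish_lower(ch):
    # Fall back to the default Unicode lowercase for everything else.
    return TURKISH_LOWER.get(ch, ch.lower())


chars = "Ii\u0130\u0131"
pairs = [
    (a, b)
    for a in chars
    for b in chars
    if a != b and turkish_lower(a) == turkish_lower(b)
]

for a, b in pairs:
    print(a, "~", b)
```

The loop prints the same four pairs as the patched session with the tr_TR.utf8 locale: dotless letters pair only with dotless letters, and dotted letters only with dotted ones.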

    @pitrou
    Member

    pitrou commented Sep 16, 2014

    Ha, I always forget about the Turkish locale case...

    @serhiy-storchaka
    Member Author

    serhiy-storchaka commented Oct 9, 2014

    Here is a simple patch which just deprecates use of the re.LOCALE flag with str patterns. It also deprecates use of the re.LOCALE flag together with the re.ASCII flag (with bytes patterns) and adds some re.LOCALE-related tests.
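
For readers on current interpreters, the two combinations deprecated by this patch can be checked directly. A minimal sketch, assuming Python 3.6 or later, where the re documentation states that re.LOCALE can be used only with bytes patterns and is not compatible with re.ASCII; with the patch as described here (Python 3.5) the same calls would only emit a deprecation warning. `try_compile` is a hypothetical helper used for illustration.

```python
import re


def try_compile(pattern, flags):
    """Return "ok" if the pattern compiles, else the ValueError text.

    Hypothetical helper; assumes Python 3.6+ behaviour for re.LOCALE.
    """
    try:
        re.compile(pattern, flags)
    except ValueError as exc:
        return "ValueError: {}".format(exc)
    return "ok"


# re.LOCALE with a str pattern: the case this issue deprecates.
print(try_compile(r"\w", re.LOCALE))
# re.LOCALE with a bytes pattern: still supported.
print(try_compile(br"\w", re.LOCALE))
# re.LOCALE combined with re.ASCII on a bytes pattern: also deprecated here.
print(try_compile(br"\w", re.LOCALE | re.ASCII))
```

Only the bytes pattern with re.LOCALE alone is expected to compile; the exact error messages are left unquoted because they may vary between versions.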

    @serhiy-storchaka serhiy-storchaka self-assigned this Nov 11, 2014
    @serhiy-storchaka
    Member Author

    serhiy-storchaka commented Nov 11, 2014

    If there are no objections I'll commit the re_deprecate_unicode_locale.patch patch. But it would be good if someone reviewed the doc changes.

    @serhiy-storchaka serhiy-storchaka added the type-feature (A feature request or enhancement) label and removed the type-bug (An unexpected behavior, bug, or error) label Dec 1, 2014
    @vadmium
    Member

    vadmium commented Dec 1, 2014

    Looks like revision 561d1d0de518 was meant to fix this issue, but the NEWS entry has the wrong reference number.

    @serhiy-storchaka
    Member Author

    serhiy-storchaka commented Dec 1, 2014

    Indeed. Thank you Martin.

    @python-dev
    Mannequin

    python-dev mannequin commented Dec 1, 2014

    New changeset abc7fe393016 by Serhiy Storchaka in branch 'default':
    Fixed issue number in Misc/NEWS for issue bpo-22407.
    https://hg.python.org/cpython/rev/abc7fe393016

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022