re \w does not match some valid Unicode characters #75021

Closed
davidism mannequin opened this issue Jul 3, 2017 · 5 comments
Labels
3.7 (EOL) end of life topic-regex topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@davidism
Mannequin

davidism mannequin commented Jul 3, 2017

BPO 30838
Nosy @vstinner, @ezio-melotti, @serhiy-storchaka, @ThiefMaster, @davidism

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2017-07-05.15:19:42.232>
created_at = <Date 2017-07-03.15:39:29.866>
labels = ['expert-regex', 'invalid', 'type-bug', '3.7', 'expert-unicode']
title = 're \\w does not match some valid Unicode characters'
updated_at = <Date 2017-07-05.17:22:30.253>
user = 'https://github.com/davidism'

bugs.python.org fields:

activity = <Date 2017-07-05.17:22:30.253>
actor = 'mrabarnett'
assignee = 'none'
closed = True
closed_date = <Date 2017-07-05.15:19:42.232>
closer = 'davidism'
components = ['Regular Expressions', 'Unicode']
creation = <Date 2017-07-03.15:39:29.866>
creator = 'davidism'
dependencies = []
files = []
hgrepos = []
issue_num = 30838
keywords = []
message_count = 5.0
messages = ['297603', '297604', '297613', '297766', '297773']
nosy_count = 6.0
nosy_names = ['vstinner', 'ezio.melotti', 'mrabarnett', 'serhiy.storchaka', 'ThiefMaster', 'davidism']
pr_nums = []
priority = 'normal'
resolution = 'not a bug'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue30838'
versions = ['Python 3.3', 'Python 3.4', 'Python 3.5', 'Python 3.6', 'Python 3.7']

@davidism
Mannequin Author

davidism mannequin commented Jul 3, 2017

This came up while writing a regex for Jinja to match characters that are valid in Python identifiers (pallets/jinja#731). \w matches all valid identifier characters except for 4 special cases:

import unicodedata
import re
import sys

cre = re.compile(r'\w')

for cp in range(sys.maxunicode + 1):
    s = chr(cp)

    if s.isidentifier() and not cre.match(s):
        print(hex(cp), unicodedata.name(s))

0x1885 MONGOLIAN LETTER ALI GALI BALUDA
0x1886 MONGOLIAN LETTER ALI GALI THREE BALUDA
0x2118 SCRIPT CAPITAL P
0x212e ESTIMATED SYMBOL

Python < 3.6 matches the two Mongolian characters; I'm not sure why 3.6 stopped matching them.

For our case, we just added them to a character set, [\w\u1885\u1886\u2118\u212e].

It can cause unexpected behavior when using \b, since that's defined as the transition between \w and \W and those 4 characters aren't in \w. re.match(r'\b[\w\u212e]', '℮') fails to match.
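A minimal sketch of the \b behavior described above, using only the standard re module. The variable names are illustrative and not taken from the Jinja code:

```python
import re

ESTIMATED = "\u212e"  # ESTIMATED SYMBOL: valid in identifiers, not matched by \w

# The character class alone matches, because U+212E is listed explicitly.
without_boundary = re.match(r"[\w\u212e]", ESTIMATED)

# With a leading \b the match fails: \b needs a \w character on exactly one
# side of the position, and neither the string start nor U+212E counts as \w.
with_boundary = re.match(r"\b[\w\u212e]", ESTIMATED)

print(without_boundary is not None, with_boundary is None)
```

So adding characters to a set does not change where \b considers a word to begin or end.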

@davidism davidism mannequin added 3.7 (EOL) end of life topic-regex topic-unicode type-bug An unexpected behavior, bug, or error labels Jul 3, 2017
@davidism
Mannequin Author

davidism mannequin commented Jul 3, 2017

Adding or ('a' + s).isidentifier() to the test in the previous script, to catch valid id_continue characters, reveals many more characters that seem like valid word characters but aren't matched by \w.
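The extended check could look roughly like this. It is a sketch of the modification described above, not the exact script that was run; the function name is made up for illustration, and the exact set of code points depends on the Unicode database bundled with the interpreter:

```python
import re
import sys

WORD = re.compile(r"\w")


def identifier_chars_missed_by_w():
    """Return code points usable in an identifier that \\w does not match."""
    missed = []
    for cp in range(sys.maxunicode + 1):
        s = chr(cp)
        # s.isidentifier() covers id_start; prefixing 'a' covers id_continue.
        valid = s.isidentifier() or ("a" + s).isidentifier()
        if valid and not WORD.match(s):
            missed.append(cp)
    return missed


missed = identifier_chars_missed_by_w()
print(len(missed), "code points, for example:", [hex(cp) for cp in missed[:5]])
```

Most of the additional hits are combining marks such as U+0300, which may continue an identifier but are not alphanumeric.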

@mrabarnett
Mannequin

mrabarnett mannequin commented Jul 3, 2017

In Unicode 9.0.0, U+1885 and U+1886 changed from being General_Category=Other_Letter (Lo) to General_Category=Nonspacing_Mark (Mn).

U+2118 is General_Category=Math_Symbol (Sm) and U+212E is General_Category=Other_Symbol (So).

\w doesn't include Mn, Sm or So.

The .isidentifier method uses the Unicode properties XID_Start and XID_Continue, which include these codepoints.
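The categories can be checked from Python with unicodedata. The helper below is a hypothetical illustration; the category reported for the two Mongolian letters depends on the Unicode version that the running interpreter ships with:

```python
import re
import unicodedata


def describe(cp):
    """Return (general category, isidentifier result, matched by \\w) for a code point."""
    s = chr(cp)
    return (unicodedata.category(s), s.isidentifier(), bool(re.match(r"\w", s)))


for cp in (0x1885, 0x1886, 0x2118, 0x212E):
    category, is_ident, is_word = describe(cp)
    print(f"U+{cp:04X} category={category} isidentifier={is_ident} matches_w={is_word}")
```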

@davidism
Mannequin Author

davidism mannequin commented Jul 5, 2017

After thinking about it more, I guess I misunderstood what \w was doing compared to isidentifier. Since Python just relies on the Unicode database, there's not much to be done anyway. Closing this.

For anyone interested, we ended up with a hybrid approach for lexing identifiers: build a regex group that includes all valid ranges not matched by \w, then validate with isidentifier later. https://github.com/pallets/jinja/pull/731/files
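A rough sketch of that hybrid idea follows. It is an assumption-laden illustration, not the code from the linked pull request: the function names, the way ranges are collected, and the lexing helper are all hypothetical.

```python
import re
import sys


def extra_identifier_ranges():
    """Collect identifier code points that the word class misses, as [start, end] runs."""
    word = re.compile(r"\w")
    ranges = []
    for cp in range(sys.maxunicode + 1):
        s = chr(cp)
        if ("a" + s).isidentifier() and not word.match(s):
            if ranges and ranges[-1][1] == cp - 1:
                ranges[-1][1] = cp  # extend the current run
            else:
                ranges.append([cp, cp])
    return ranges


def build_name_pattern():
    """Build a permissive pattern: the word class plus the extra ranges."""
    parts = []
    for start, end in extra_identifier_ranges():
        if start == end:
            parts.append(f"\\U{start:08x}")
        else:
            parts.append(f"\\U{start:08x}-\\U{end:08x}")
    return re.compile(r"[\w" + "".join(parts) + "]+")


NAME_RE = build_name_pattern()


def lex_identifier(text, pos=0):
    """Match a candidate name with the regex, then validate it with isidentifier."""
    m = NAME_RE.match(text, pos)
    if m and m.group().isidentifier():
        return m.group()
    return None


print(lex_identifier("\u2118x = 1"), lex_identifier("1abc"))
```

The regex deliberately over-matches (it accepts a leading digit, for example), which is why the isidentifier check is still needed afterwards.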

@davidism davidism mannequin closed this as completed Jul 5, 2017
@davidism davidism mannequin added the invalid label Jul 5, 2017
@mrabarnett
Mannequin

mrabarnett mannequin commented Jul 5, 2017

Python identifiers match the regex:

[_\p{XID_Start}]\p{XID_Continue}*

The standard re module doesn't support \p{...}, but the third-party "regex" module does.
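For illustration, the pattern above could be used with the third-party regex module as sketched below. The helper name is hypothetical, and the fallback to str.isidentifier when regex is not installed is an assumption added only to keep the snippet runnable everywhere:

```python
try:
    import regex  # third-party package: pip install regex
except ImportError:
    regex = None  # fall back to the built-in check below


def is_identifier(text):
    """Check text against [_\\p{XID_Start}]\\p{XID_Continue}* when regex is available."""
    if regex is None:
        return text.isidentifier()
    return regex.fullmatch(r"[_\p{XID_Start}]\p{XID_Continue}*", text) is not None


print(is_identifier("\u2118x"), is_identifier("1abc"))
```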

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022