re \w does not match some valid Unicode characters #75021
Comments
This came up while writing a regex to match characters that are valid in Python identifiers for Jinja. pallets/jinja#731

    import unicodedata
    import re
    import sys

    cre = re.compile(r'\w')

    for cp in range(sys.maxunicode + 1):
        s = chr(cp)
        if s.isidentifier() and not cre.match(s):
            print(hex(cp), unicodedata.name(s))

    0x1885 MONGOLIAN LETTER ALI GALI BALUDA

Python < 3.6 matches the two Mongolian characters, not sure why 3.6 stopped matching them. For our case, we just added them to a character set, It can cause unexpected behavior when using
In Unicode 9.0.0, U+1885 and U+1886 changed from General_Category=Other_Letter (Lo) to General_Category=Nonspacing_Mark (Mn). U+2118 is General_Category=Math_Symbol (Sm) and U+212E is General_Category=Other_Symbol (So). \w doesn't include Mn, Sm or So. The str.isidentifier method instead uses the Unicode properties XID_Start and XID_Continue, which do include these codepoints.
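To see the mismatch directly, here is a minimal sketch that prints the general category of each of the four codepoints next to what isidentifier and \w say about it. It assumes an interpreter whose Unicode database is at 9.0.0 or later (Python 3.6+); `report` is a hypothetical helper name used only for illustration.

```python
import re
import unicodedata

WORD = re.compile(r'\w')

def report(cp):
    """Return (general category, isidentifier result, word-class match) for a code point."""
    s = chr(cp)
    return unicodedata.category(s), s.isidentifier(), WORD.match(s) is not None

# The four codepoints discussed in this issue.
for cp in (0x1885, 0x1886, 0x2118, 0x212E):
    print(hex(cp), *report(cp))
```

Each row should show a category outside the letter and number classes, True for isidentifier, and False for the \w match.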
After thinking about it more, I guess I misunderstood what \w was doing compared to isidentifier. Since Python just relies on the Unicode database, there's not much to be done anyway. Closing this. For anyone interested, we ended up with a hybrid approach for lexing identifiers: build a regex group that includes all valid ranges not matched by \w, then validate with isidentifier later. https://github.com/pallets/jinja/pull/731/files
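For readers who want the gist without opening the PR, here is a rough sketch of that hybrid idea using only the standard library. It is not the code from the Jinja PR: the names `extra`, `name_re` and `lex_identifier` are made up for illustration, and testing each character as a continuation of `'a'` is an assumption made here so that marks and other continue-only characters are picked up as well.

```python
import re
import sys

_word = re.compile(r'\w')

# Characters that may appear in an identifier (checked as a continuation
# of 'a') but that \w does not match. Scanning every code point takes a
# moment, so a real lexer would precompute this once.
extra = ''.join(
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if ('a' + chr(cp)).isidentifier() and not _word.match(chr(cp))
)

# Permissive pattern: \w plus the extra characters found above.
name_re = re.compile(r'[\w' + re.escape(extra) + r']+')

def lex_identifier(source, pos=0):
    """Return the identifier starting at pos, or None if there is none."""
    m = name_re.match(source, pos)
    if m is None:
        return None
    candidate = m.group()
    # The pattern over-matches (for example a leading digit), so the
    # final decision is left to str.isidentifier().
    return candidate if candidate.isidentifier() else None
```

The regex only has to be generous enough to find the candidate token; isidentifier stays the single source of truth, so the pattern does not need to encode the start/continue distinction itself.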
Python identifiers match the regex:
The standard re module doesn't support \p{...}, but the third-party "regex" module does.