re \w does not match some valid Unicode characters #75021

davidism · 2017-07-03T15:39:30Z

BPO	30838
Nosy	@vstinner, @ezio-melotti, @serhiy-storchaka, @ThiefMaster, @davidism

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2017-07-05.15:19:42.232>
created_at = <Date 2017-07-03.15:39:29.866>
labels = ['expert-regex', 'invalid', 'type-bug', '3.7', 'expert-unicode']
title = 're \\w does not match some valid Unicode characters'
updated_at = <Date 2017-07-05.17:22:30.253>
user = 'https://github.com/davidism'

bugs.python.org fields:

activity = <Date 2017-07-05.17:22:30.253>
actor = 'mrabarnett'
assignee = 'none'
closed = True
closed_date = <Date 2017-07-05.15:19:42.232>
closer = 'davidism'
components = ['Regular Expressions', 'Unicode']
creation = <Date 2017-07-03.15:39:29.866>
creator = 'davidism'
dependencies = []
files = []
hgrepos = []
issue_num = 30838
keywords = []
message_count = 5.0
messages = ['297603', '297604', '297613', '297766', '297773']
nosy_count = 6.0
nosy_names = ['vstinner', 'ezio.melotti', 'mrabarnett', 'serhiy.storchaka', 'ThiefMaster', 'davidism']
pr_nums = []
priority = 'normal'
resolution = 'not a bug'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue30838'
versions = ['Python 3.3', 'Python 3.4', 'Python 3.5', 'Python 3.6', 'Python 3.7']

davidism · 2017-07-03T15:39:29Z

This came up in pallets/jinja#731 while writing a regex for Jinja to match characters that are valid in Python identifiers. \w matches all characters that are valid as a one-character identifier, with 4 exceptions:

import unicodedata
import re
import sys

cre = re.compile(r'\w')

for cp in range(sys.maxunicode + 1):
    s = chr(cp)

    if s.isidentifier() and not cre.match(s):
        print(hex(cp), unicodedata.name(s))

0x1885 MONGOLIAN LETTER ALI GALI BALUDA
0x1886 MONGOLIAN LETTER ALI GALI THREE BALUDA
0x2118 SCRIPT CAPITAL P
0x212e ESTIMATED SYMBOL

Python versions before 3.6 match the two Mongolian characters; I'm not sure why 3.6 stopped matching them.

For our case, we just added them to a character set, [\w\u1885\u1886\u2118\u212e].

It can cause unexpected behavior when using \b, since that's defined as the transition between \w and \W and those 4 characters aren't in \w. re.match(r'\b[\w\u212e]', '℮') fails to match.
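The boundary behavior can be checked directly. This is a minimal sketch; the variable names are only for illustration and are not taken from the Jinja change.

```python
import re

ch = "\u212e"  # ESTIMATED SYMBOL: valid in identifiers, not matched by \w

# The widened character class matches the character on its own.
plain = re.match(r"[\w\u212e]", ch)

# With a leading \b there is no \w character on either side of position 0,
# so no word boundary exists there and the match fails.
bounded = re.match(r"\b[\w\u212e]", ch)

# An ordinary letter does start at a word boundary.
letter = re.match(r"\b[\w\u212e]", "a")

print(plain is not None, bounded is None, letter is not None)
```

Widening the character class therefore does not widen \b; a pattern that needs boundaries around these characters has to express them some other way, for example with explicit lookarounds.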

davidism · 2017-07-03T16:21:56Z

Adding or ('a' + s).isidentifier() to the test in the previous script, to catch valid id_continue characters, reveals many more characters that seem like valid word characters but aren't matched by \w.
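For reference, the extended check might look like the sketch below. The function name missed_by_word_class is hypothetical, and the exact count depends on the Unicode database bundled with the Python version in use, so no figure is given here.

```python
import re
import sys

cre = re.compile(r"\w")


def missed_by_word_class():
    """Return code points valid somewhere in an identifier but not matched by \\w."""
    missed = []
    for cp in range(sys.maxunicode + 1):
        s = chr(cp)
        # s.isidentifier() covers id_start characters;
        # ("a" + s).isidentifier() covers id_continue characters.
        valid = s.isidentifier() or ("a" + s).isidentifier()
        if valid and not cre.match(s):
            missed.append(cp)
    return missed


missed = missed_by_word_class()
print(len(missed))
print([hex(cp) for cp in missed[:5]])
```

Most of the additional hits are combining marks such as U+0300 COMBINING GRAVE ACCENT, which may continue an identifier but are not alphanumeric.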

mrabarnett · 2017-07-03T20:21:46Z

In Unicode 9.0.0, U+1885 and U+1886 changed from being General_Category=Other_Letter (Lo) to General_Category=Nonspacing_Mark (Mn).

U+2118 is General_Category=Math_Symbol (Sm) and U+212E is General_Category=Other_Symbol (So).

\w doesn't include Mn, Sm or So.

The str.isidentifier method uses the Unicode properties XID_Start and XID_Continue, which include these codepoints.
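These properties can be inspected from Python itself. The sketch below uses a hypothetical helper, describe, and reports whatever the interpreter's own Unicode database says, so the category shown for U+1885 and U+1886 depends on the Python version.

```python
import re
import unicodedata


def describe(ch):
    """Return (general category, valid identifier start?, matched by \\w?)."""
    return (
        unicodedata.category(ch),
        ch.isidentifier(),
        re.match(r"\w", ch) is not None,
    )


for ch in "\u1885\u1886\u2118\u212e":
    category, is_start, is_word = describe(ch)
    print(f"U+{ord(ch):04X} {category} identifier={is_start} word={is_word}")
```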

davidism · 2017-07-05T15:19:42Z

After thinking about it more, I guess I misunderstood what \w was doing compared to isidentifier. Since Python just relies on the Unicode database, there's not much to be done anyway. Closing this.

For anyone interested, we ended up with a hybrid approach for lexing identifiers: build a regex group that includes all valid ranges not matched by \w, then validate with isidentifier later. https://github.com/pallets/jinja/pull/731/files
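A rough sketch of that hybrid idea follows. It is not the code from the linked pull request; the names extra_identifier_ranges, build_name_re, NAME_RE and lex_name are made up for illustration, and the ranges are computed at import time rather than generated ahead of time.

```python
import re
import sys

_WORD = re.compile(r"\w")


def extra_identifier_ranges():
    """Collect [low, high] code point ranges valid in identifiers but missed by \\w."""
    ranges = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if ("a" + ch).isidentifier() and not _WORD.match(ch):
            if ranges and ranges[-1][1] == cp - 1:
                ranges[-1][1] = cp  # extend the current run
            else:
                ranges.append([cp, cp])
    return ranges


def build_name_re():
    """Build a permissive pattern: \\w plus every range found above."""
    parts = []
    for low, high in extra_identifier_ranges():
        if low == high:
            parts.append(f"\\U{low:08x}")
        else:
            parts.append(f"\\U{low:08x}-\\U{high:08x}")
    return re.compile(r"[\w" + "".join(parts) + "]+")


NAME_RE = build_name_re()


def lex_name(source, pos=0):
    """Return the identifier starting at pos, or None if there is none."""
    m = NAME_RE.match(source, pos)
    # The regex is deliberately permissive (it accepts "1abc"),
    # so the final decision is left to str.isidentifier.
    if m is None or not m.group().isidentifier():
        return None
    return m.group()


print(lex_name("\u2118x = 1"))
```

The regex only has to find a candidate token quickly; isidentifier then rejects candidates that start with a digit or a continue-only character.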

mrabarnett · 2017-07-05T17:22:30Z

Python identifiers match the regex:

[_\p{XID_Start}]\p{XID_Continue}*

The standard re module doesn't support \p{...}, but the third-party "regex" module does.
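As an illustration, the pattern above could be used like this when the regex module is installed. The helper looks_like_identifier is hypothetical, and the fallback to str.isidentifier is an assumption added only so the sketch still runs without the third-party package.

```python
try:
    import regex  # third-party package: pip install regex
except ImportError:
    regex = None

if regex is not None:
    IDENT_RE = regex.compile(r"[_\p{XID_Start}]\p{XID_Continue}*")
else:
    IDENT_RE = None


def looks_like_identifier(s):
    """Check s against the XID-based pattern, or fall back to str.isidentifier."""
    if IDENT_RE is not None:
        return IDENT_RE.fullmatch(s) is not None
    return s.isidentifier()


print(looks_like_identifier("\u2118x"), looks_like_identifier("1abc"))
```

The regex module ships its own Unicode tables, so for recently added characters its answer may differ from the interpreter's isidentifier.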

davidism mannequin added the 3.7 (EOL), topic-regex, topic-unicode, and type-bug labels Jul 3, 2017

davidism mannequin closed this as completed Jul 5, 2017

davidism mannequin added the invalid label Jul 5, 2017

ezio-melotti transferred this issue from another repository Apr 10, 2022
