Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-30688: support \N{name} escapes in re patterns #2261

Closed
wants to merge 4 commits into from

Conversation

@jonathaneunice
Copy link
Contributor

jonathaneunice commented Jun 17, 2017

re specially handles Unicode escapes (\uXXXX and \UXXXXXXXX) so that even raw strings (r'...') have symbolic Unicode characters. But it has not supported named Unicode escapes such as r'\N{EM DASH}', making the escapes for string literals diverge from those for regular expressions.

This PR brings them back into alignment by supporting named Unicode escapes in re, along with accompanying updates to tests and docs.

The implementation is straightforward, but several notes:

  1. Uses ast.literal_eval rather than unicodedata.lookup to evaluate the names. While adequate and safe, it wasn't my first choice. But I had some issues importing unicodedata which seemed to be related to the sequencing of the build. Tinkering with CPython's build automation is above my pay grade.
  2. Tokenizer.getuntil() seems more apposite for tracking to the closing brace of a named escape, but I used the less-obvious .getwhile() as .getuntil() seems to have some baked-in assumptions about its use case, especially under error conditions. Wanting to tread lightly, I fell back to .getwhile().
  3. The 100-character maximum search width is sufficient; the longest current Unicode character name is 83 characters, for the delightful ARABIC LIGATURE UIGHUR KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA ISOLATED FORM, which seems unlikely to be excelled any time soon.

name length distribution

https://bugs.python.org/issue30688

@mention-bot

This comment has been minimized.

Copy link

mention-bot commented Jun 17, 2017

@jonathaneunice, thanks for your PR! By analyzing the history of the files in this pull request, we identified @birkenfeld, @serhiy-storchaka and @tiran to be potential reviewers.

Copy link
Member

serhiy-storchaka left a comment

Add a Misc/NEWS and "What's New" entries and your name in Misc/ACKS.

@@ -443,7 +443,7 @@ character ``'$'``.
Most of the standard escapes supported by Python string literals are also
accepted by the regular expression parser::

\a \b \f \n
\a \b \f \n \N{name}

This comment has been minimized.

Copy link
@serhiy-storchaka

serhiy-storchaka Jun 17, 2017

Member

Just \N.

Add '\N' to the note below:

``'\u'`` and ``'\U'`` escape sequences are only recognized ...
escape += source.getwhile(100, UNICODE_NAME)
escape += source.getwhile(1, CLOSING_BRACE)
try:
c = ord(literal_eval('"%s"' % escape))

This comment has been minimized.

Copy link
@serhiy-storchaka

serhiy-storchaka Jun 17, 2017

Member

I prefer using unicodedata. If there are issues with importing, import it not at the module startup, but only if it is needed. Since this feature is relatively complex, the implementation can be moved in a separate function for avoiding the code duplication.

elif c == "N" and source.istext:
# named unicode escape e.g. \N{EM DASH}
escape += source.getwhile(1, OPENING_BRACE)
escape += source.getwhile(100, UNICODE_NAME)

This comment has been minimized.

Copy link
@serhiy-storchaka

serhiy-storchaka Jun 17, 2017

Member

You could use modified getuntil(). Just add yet one parameter for specifying what is missing in error messages.

badly_formed = [r'\N{BUBBA DASH}', r'\N{EM DASH',
r'\NEM DASH}', r'\NOGGIN']
for bad in badly_formed:
with self.assertRaises(re.error):

This comment has been minimized.

Copy link
@serhiy-storchaka

serhiy-storchaka Jun 17, 2017

Member

Use checkPatternError() and test error messages.

for bad in badly_formed:
with self.assertRaises(re.error):
re.compile(bad)

This comment has been minimized.

Copy link
@serhiy-storchaka

serhiy-storchaka Jun 17, 2017

Member

Add also tests for \N in bytes patterns.

suites = [
[ # basic matches
['\u2014', r'\u2014', '\N{EM DASH}',
r'\N{EM DASH}'], # pattern

This comment has been minimized.

Copy link
@serhiy-storchaka

serhiy-storchaka Jun 17, 2017

Member

Isn't the last case enough?

for patterns, match_yes, match_no in suites:
for pat in patterns:
for target in match_yes:
self.assertTrue(re.match(pat, target))

This comment has been minimized.

Copy link
@serhiy-storchaka

serhiy-storchaka Jun 17, 2017

Member

Use subTest() when check in a loop.

Actually I think that loops are not needed. It is enough to test just one case for a pattern.

[ # basic matches
['\u2014', r'\u2014', '\N{EM DASH}',
r'\N{EM DASH}'], # pattern
['\u2014', '\N{EM DASH}', '—', '—and more'], # matches

This comment has been minimized.

Copy link
@serhiy-storchaka

serhiy-storchaka Jun 17, 2017

Member

It is hard to see differences between different dashes on terminal. Use just \u2014 or \N{EM DASH}.

@serhiy-storchaka serhiy-storchaka self-assigned this Jun 17, 2017
Copy link
Member

serhiy-storchaka left a comment

Please use unicodedata. It is faster, uses less memory and is safer.

@brettcannon

This comment has been minimized.

Copy link
Member

brettcannon commented Feb 2, 2018

To try and help move older pull requests forward, we are going through and backfilling 'awaiting' labels on pull requests that are lacking the label. Based on the current reviews, the best we can tell in an automated fashion is that a core developer requested changes to be made to this pull request.

If/when the requested changes have been made, please leave a comment that says, I have made the requested changes; please review again. That will trigger a bot to flag this pull request as ready for a follow-up review.

@serhiy-storchaka

This comment has been minimized.

Copy link
Member

serhiy-storchaka commented Feb 8, 2018

Since the original author didn't respond for long time, but the proposed feature looks worthy, I have created a new #5588 for addressing my requests.

@serhiy-storchaka serhiy-storchaka removed their assignment Dec 6, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.