Support \u and \U escapes in regexes #47915

Closed
birkenfeld opened this issue Aug 24, 2008 · 14 comments
Assignees
Labels
stdlib Python modules in the Lib dir topic-regex topic-unicode type-feature A feature request or enhancement

Comments

@birkenfeld
Member

BPO 3665
Nosy @birkenfeld, @atsuoishimoto, @pitrou, @ezio-melotti, @merwok, @serhiy-storchaka
Files
  • re_unicode_escapes.diff: Regenerate georg.brandl's patch for review
  • 3665.patch: Regenerate ishimoto's patch for review
  • re_unicode_escapes-2.patch: + PEP 393, + cleanup, + tests
  • re_unicode_escapes-3.patch: + byte patterns, + tests, + docs
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/pitrou'
    closed_at = <Date 2012-06-23.11:33:41.166>
    created_at = <Date 2008-08-24.20:33:51.445>
    labels = ['expert-regex', 'type-feature', 'library', 'expert-unicode']
    title = 'Support \\u and \\U escapes in regexes'
    updated_at = <Date 2012-06-23.11:48:13.258>
    user = 'https://github.com/birkenfeld'

    bugs.python.org fields:

    activity = <Date 2012-06-23.11:48:13.258>
    actor = 'serhiy.storchaka'
    assignee = 'pitrou'
    closed = True
    closed_date = <Date 2012-06-23.11:33:41.166>
    closer = 'pitrou'
    components = ['Library (Lib)', 'Regular Expressions', 'Unicode']
    creation = <Date 2008-08-24.20:33:51.445>
    creator = 'georg.brandl'
    dependencies = []
    files = ['11235', '17939', '25783', '25784', '26035', '26040']
    hgrepos = []
    issue_num = 3665
    keywords = ['patch', 'needs review']
    message_count = 14.0
    messages = ['71861', '71864', '71865', '71868', '109961', '138219', '162052', '162830', '163065', '163094', '163580', '163584', '163585', '163590']
    nosy_count = 9.0
    nosy_names = ['georg.brandl', 'ishimoto', 'pitrou', 'timehorse', 'ezio.melotti', 'eric.araujo', 'mrabarnett', 'python-dev', 'serhiy.storchaka']
    pr_nums = []
    priority = 'critical'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue3665'
    versions = ['Python 3.3']

    @birkenfeld
    Member Author

    Since \u and \U aren't interpolated in raw strings anymore, the re
    module should support those escapes in addition to the \x and octal ones
    it already supports. Patch attached.
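
A minimal sketch of the requested behaviour, assuming an interpreter where this feature has landed (Python 3.4 or later here, because `re.fullmatch` is used). The sample characters are arbitrary choices for illustration:

```python
import re

# In a raw string the escapes are not interpolated: the pattern is
# 16 literal characters (6 for \uXXXX plus 10 for \UXXXXXXXX).
pattern = r"\u20ac\U0001F600"
pattern_length = len(pattern)

# The re module decodes the escapes itself, so the pattern matches
# the two actual characters U+20AC and U+1F600.
match = re.fullmatch(pattern, "\u20ac\U0001F600")
matched = match is not None
```

The point of the request is visible in `pattern_length`: the string literal leaves the backslashes alone, so the decoding has to happen inside the regex parser.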

    @birkenfeld birkenfeld added type-bug An unexpected behavior, bug, or error stdlib Python modules in the Lib dir labels Aug 24, 2008
    @pitrou
    Member

    pitrou commented Aug 24, 2008

    • Check that it also works for chars > 0xFFFF (even in UCS2 builds, at
      least when the chars are not part of [character range])
    • What happens with e.g. [\U00010000-\U00010001] on a UCS build?
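
For reference, a small sketch of the range case asked about above. It assumes Python 3.4 or later, where strings are no longer stored as UTF-16 code units, so it does not reproduce the narrow-build behaviour under discussion:

```python
import re

# A character class whose endpoints both lie above U+FFFF.
astral_range = re.compile(r"[\U00010000-\U00010001]")

lower_ok = astral_range.fullmatch("\U00010000") is not None
upper_ok = astral_range.fullmatch("\U00010001") is not None
# The next code point is outside the range and must not match.
outside_rejected = astral_range.fullmatch("\U00010002") is None
```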

    @pitrou
    Member

    pitrou commented Aug 24, 2008

    (in the last sentence, I meant UCS2. Sorry)

    @birkenfeld
    Member Author

    These concerns indeed must be handled: On narrow unicode builds, chars >
    0xffff must be converted to surrogates. In ranges, they should raise an
    error.

    Additionally, this should at least raise an error too:

    >>> re.compile("[\U00100000]").match("\U00100000").group()
    '\udbc0'
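
The '\udbc0' in that output is the high surrogate of U+100000: on a narrow build the character class holds two UTF-16 code units and the match consumes only the first. The arithmetic can be sketched as follows; `to_surrogate_pair` is a hypothetical helper written for this illustration, not part of the patch or of the re module:

```python
def to_surrogate_pair(codepoint):
    """Split a code point above U+FFFF into UTF-16 (high, low) surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = codepoint - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


high, low = to_surrogate_pair(0x100000)
```

Here `high` is 0xDBC0, the lone code unit returned by `.group()` in the session above, and `low` is 0xDC00.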

    @atsuoishimoto
    Mannequin

    atsuoishimoto mannequin commented Jul 11, 2010

    Here's an updated patch for the py3k branch.
    Following Georg's comment, I added a check for code points in character
    ranges and conversion to surrogate pairs. I also added a check that raises
    an exception if the code point is > 0x10ffff.
    I would like English speakers to fix the error messages in the patch.
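
A rough sketch of the upper-bound check as it can be observed from outside the module. It assumes a current Python 3 interpreter and shows the behaviour of the released re module, not necessarily of this patch; `compiles` is a hypothetical helper for the illustration:

```python
import re


def compiles(pattern):
    """Return True if re accepts the pattern, False if it raises re.error."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


max_codepoint_ok = compiles(r"\U0010FFFF")      # last valid code point
too_large_rejected = not compiles(r"\U00110000")  # above 0x10ffff
too_short_rejected = not compiles(r"\u12")        # \u needs four hex digits
```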

    @merwok
    Member

    merwok commented Jun 12, 2011

    FYI,
    + raise error("bogus escape: %s" % repr(escape))

    can be written simply as

    + raise error("bogus escape: %r" % escape)
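
A quick check of that equivalence, assuming `escape` is a plain str (an assumption about the patch; the sample value is made up):

```python
escape = "\\q"  # hypothetical bad escape: a backslash followed by "q"

with_repr_call = "bogus escape: %s" % repr(escape)
with_r_conversion = "bogus escape: %r" % escape

same_message = with_repr_call == with_r_conversion
```

One caveat: if the right-hand operand were a tuple, `%` would unpack it, so the two spellings are only interchangeable for non-tuple values.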

    @serhiy-storchaka
    Member

    I don't think it is worth targeting 2.7 and 3.2 (it's a new feature, not a bugfix), but it will be very useful for 3.3.

    Since PEP 393, conversion to surrogate pairs is no longer relevant.
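
To illustrate why the surrogate handling drops out, here is the earlier example from this thread rerun as a sketch, assuming Python 3.3 or later (PEP 393 string representation):

```python
import re

char = "\U00100000"
# With PEP 393 a supplementary-plane character is a single element of
# the string, not a pair of UTF-16 code units.
char_length = len(char)

# The character class therefore contains one item and the match
# returns the whole character rather than a lone surrogate.
result = re.compile("[\U00100000]").match("\U00100000").group()
```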

    @serhiy-storchaka serhiy-storchaka added topic-regex topic-unicode type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Jun 1, 2012
    @serhiy-storchaka
    Member

    Georg, Atsuo, how are you?

    @serhiy-storchaka
    Member

    Here is an updated patch (conforming to PEP 393). In addition, the handling of octal and hexadecimal escapes is cleaned up and an incorrect error message for hexadecimal escapes is fixed. New tests for octal and hexadecimal escapes are added.
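
For readers unfamiliar with the older escapes mentioned here, a short sketch of how they behave, assuming Python 3.4 or later. It shows the released re module and is not taken from the patch's test suite:

```python
import re

# \x takes exactly two hex digits; \101 is a three-digit octal escape.
# Both denote code point 65, the letter "A".
hex_match = re.fullmatch(r"\x41", "A")
octal_match = re.fullmatch(r"\101", "A")

# A \x escape with only one hex digit is rejected at compile time.
try:
    re.compile(r"\x4")
    incomplete_hex_rejected = False
except re.error:
    incomplete_hex_rejected = True
```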

    @serhiy-storchaka
    Member

    I forgot about byte patterns. Here is an updated patch.
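
A sketch of the distinction for byte patterns, assuming Python 3.6 or later, where unknown escapes made of a backslash and an ASCII letter are errors (earlier 3.x releases were more lenient). The \x escape works in bytes patterns, while \u is only recognised in str patterns:

```python
import re

# \x escapes are valid in bytes patterns.
bytes_hex_match = re.fullmatch(rb"\x41", b"A")

# \u is a Unicode-only escape; a bytes pattern containing it is
# rejected on the Python versions assumed above.
try:
    re.compile(rb"\u0041")
    bytes_accepts_u = True
except re.error:
    bytes_accepts_u = False
```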

    @serhiy-storchaka
    Member

    Any chance to commit the patch today and to get this feature in Python 3.3?

    @pitrou pitrou self-assigned this Jun 23, 2012
    @python-dev
    Mannequin

    python-dev mannequin commented Jun 23, 2012

    New changeset b1dbd8827e79 by Antoine Pitrou in branch 'default':
    Issue bpo-3665: \u and \U escapes are now supported in unicode regular expressions.
    http://hg.python.org/cpython/rev/b1dbd8827e79

    @pitrou
    Member

    pitrou commented Jun 23, 2012

    > Any chance to commit the patch today and to get this feature in Python
    > 3.3?

    Thanks for reminding us! It's now in 3.3.

    @pitrou pitrou closed this as completed Jun 23, 2012
    @serhiy-storchaka
    Member

    Thank you for the quick response.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022