Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

re.match loops forever on simple regexp #60767

Closed
lpd mannequin opened this issue Nov 27, 2012 · 7 comments
Closed

re.match loops forever on simple regexp #60767

lpd mannequin opened this issue Nov 27, 2012 · 7 comments
Labels
topic-regex type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@lpd
Copy link
Mannequin

lpd mannequin commented Nov 27, 2012

BPO 16563
Nosy @tim-one, @mdickinson, @ezio-melotti
Superseder
  • bpo-1662581: the re module can perform poorly: O(2n) versus O(n2)
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2012-11-27.08:45:44.639>
    created_at = <Date 2012-11-27.00:07:43.246>
    labels = ['expert-regex', 'type-crash']
    title = 're.match loops forever on simple regexp'
    updated_at = <Date 2012-11-27.08:58:46.617>
    user = 'https://bugs.python.org/lpd'

    bugs.python.org fields:

    activity = <Date 2012-11-27.08:58:46.617>
    actor = 'mark.dickinson'
    assignee = 'none'
    closed = True
    closed_date = <Date 2012-11-27.08:45:44.639>
    closer = 'mark.dickinson'
    components = ['Regular Expressions']
    creation = <Date 2012-11-27.00:07:43.246>
    creator = 'lpd'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 16563
    keywords = []
    message_count = 7.0
    messages = ['176458', '176459', '176460', '176461', '176465', '176466', '176469']
    nosy_count = 5.0
    nosy_names = ['tim.peters', 'lpd', 'mark.dickinson', 'ezio.melotti', 'mrabarnett']
    pr_nums = []
    priority = 'normal'
    resolution = 'duplicate'
    stage = None
    status = 'closed'
    superseder = '1662581'
    type = 'crash'
    url = 'https://bugs.python.org/issue16563'
    versions = ['Python 2.6']

    @lpd
    Copy link
    Mannequin Author

    lpd mannequin commented Nov 27, 2012

    I've read a number of reports of exponential-time regexp matching, but this regexp uses no unusual features, requires no backtracking, and only loops "forever" on certain input strings.

    I listed the Python version # as 2.6; I actually observed the behavior in 2.5.1 and 2.5.2, but I'm almost certain it's still there, because I saw the same behavior in a very recent build of Google's V8 interpreter, which I believe uses the same regexp engine.

    Here's the test case:

    import re
    re_utf8 = r'^([\x00-\x7f]+|[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf][\x80-\xbf]|[\xf0-\xf7][\x80-\xbf][\x80-\xbf][\x80-\xbf])*$'
    s = "\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x03\x00\x14\x00\x00\x00\x01\x00\x00,`\x00\x00\x004\x00\x01\x8d"
    print re.match(re_utf8, s)

    If you pass s[0:34] or s[34:35] as the argument of re.match, it returns the correct answer, but the code above loops apparently forever.

    @lpd lpd mannequin added topic-regex type-crash A hard crash of the interpreter, possibly with a core dump labels Nov 27, 2012
    @tim-one
    Copy link
    Member

    tim-one commented Nov 27, 2012

    There's actually enormous backtracking here. Try this much shorter regexp and you'll see much the same behavior:

    re_utf8 = r'^([\x00-\x7f]+)*$'

    That's the original re_utf8 with all but the first alternative removed.

    Looks like passing s[0:34] "works" because it eliminates the trailing \x8d that prevents the regexp from matching the whole string. Because the regexp cannot match the whole string, it takes a very long time to try all the futile combinations implied by the nested quantifiers. As the much simpler re_utf8 above shows, it's not the alternatives in the regexp that matter here, it's the nested quantifiers.

    @ezio-melotti
    Copy link
    Member

    I think the problem is the first '+', but I'm not sure if this is similar to other problems (I remember something similar to (x+)* causing problems).

    (On a side note: the regex matches non-valid utf-8, see table 3-7 on http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf).

    @tim-one
    Copy link
    Member

    tim-one commented Nov 27, 2012

    Yes, if you remove the first "+", the example quickly prints None (i.e., reports that the regexp cannot match the string).

    @lpd
    Copy link
    Mannequin Author

    lpd mannequin commented Nov 27, 2012

    It never occurred to me that the regexp package would be so poorly designed that a pattern that so clearly never requires backtracking could require exponential time. I'll change the pattern (taking out the + has no effect on what strings it matches) and leave it up to others to decide whether the performance issue is worth addressing. And thanks for the pointer to the table in the Unicode standard.

    @mdickinson
    Copy link
    Member

    Closing as a duplicate of bpo-1662581.

    @mdickinson
    Copy link
    Member

    @LPD: you may want to look at the 'regex' module on PyPI [1] to see if it solves your problems. The idea is that some form of this should eventually go into the standard library, but we've been lacking volunteers for the huge code review task involved.

    [1] http://pypi.python.org/pypi/regex.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    topic-regex type-crash A hard crash of the interpreter, possibly with a core dump
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants