Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

re.sub stalls forever on an unmatched non-greedy case #74163

Closed
RobertLujo mannequin opened this issue Apr 4, 2017 · 5 comments
Closed

re.sub stalls forever on an unmatched non-greedy case #74163

RobertLujo mannequin opened this issue Apr 4, 2017 · 5 comments
Labels
performance Performance or resource usage topic-regex

Comments

@RobertLujo
Copy link
Mannequin

RobertLujo mannequin commented Apr 4, 2017

BPO 29977
Nosy @ezio-melotti, @gareth-rees, @serhiy-storchaka

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2017-04-04.08:09:37.546>
created_at = <Date 2017-04-04.07:00:42.046>
labels = ['expert-regex', 'performance']
title = 're.sub stalls forever on an unmatched non-greedy case'
updated_at = <Date 2017-04-04.20:49:12.686>
user = 'https://bugs.python.org/RobertLujo'

bugs.python.org fields:

activity = <Date 2017-04-04.20:49:12.686>
actor = 'mrabarnett'
assignee = 'none'
closed = True
closed_date = <Date 2017-04-04.08:09:37.546>
closer = 'serhiy.storchaka'
components = ['Regular Expressions']
creation = <Date 2017-04-04.07:00:42.046>
creator = 'Robert Lujo'
dependencies = []
files = []
hgrepos = []
issue_num = 29977
keywords = []
message_count = 5.0
messages = ['291107', '291109', '291110', '291111', '291140']
nosy_count = 5.0
nosy_names = ['ezio.melotti', 'mrabarnett', 'gdr@garethrees.org', 'serhiy.storchaka', 'Robert Lujo']
pr_nums = []
priority = 'normal'
resolution = 'wont fix'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue29977'
versions = ['Python 2.7']

@RobertLujo
Copy link
Mannequin Author

RobertLujo mannequin commented Apr 4, 2017

Hello,

I assume I have hit some bug/misbehaviour in re module. I will provide you "working" example:

import re
RE_C_COMMENTS    = re.compile(r"/\*(.|\s)*?\*/", re.MULTILINE|re.DOTALL|re.UNICODE)
text = "Special section /* valves:\n\n\nsilicone\n\n\n\n\n\n\nHarness:\n\n\nmetal and plastic fibre\n\n\n\n\n\n\nInner frame:\n\n\nmultibutylene\n\n\n\n\n\n\nWeight:\n\n\n147 g\n\n\n\n\n\n\n\n\n\n\n\n\n\nSelection guide\n"

and then this command takes forever:
RE_C_COMMENTS.sub(" ", text, re.MULTILINE|re.DOTALL|re.UNICODE)

and the same problem you can notice on first 90 chars, it takes 10s on my machine:
RE_C_COMMENTS.sub(" ", text[:90], re.MULTILINE|re.DOTALL|re.UNICODE)

Some clarification: I try to remove the C style comments from text with non-greedy regular expression, and in this case start of comment (/) is found, and end of comment (/) can not be found. Notice the multiline and other re options.

Python versions used:

'2.7.11 (default, Jan 22 2016, 16:30:50) \n[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]' / macOs 10.12.13

and:
'2.7.12 (default, Nov 19 2016, 06:48:10) \n[GCC 5.4.0 20160609]' ->
Linux 84-Ubuntu SMP Wed Feb 1 17:20:32 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

@RobertLujo RobertLujo mannequin added topic-regex performance Performance or resource usage labels Apr 4, 2017
@gareth-rees
Copy link
Mannequin

gareth-rees mannequin commented Apr 4, 2017

The problem here is that both "." and "\s" match a whitespace character, and because you have the re.DOTALL flag turned on this includes "\n", and so the number of different ways in which (.|\s)* can be matched against a string is exponential in the number of whitespace characters in the string.

It is best to design your regular expression so as to limit the number of different ways it can match. Here I recommend the expression:

/\*(?:[^*]|\*[^/])*\*/

which can match in only one way.

@gareth-rees
Copy link
Mannequin

gareth-rees mannequin commented Apr 4, 2017

See also bpo-28690, bpo-212521, bpo-753711, bpo-1515829, etc.

@serhiy-storchaka
Copy link
Member

This is a well known issue called catastrophic backtracking. It can't be solved with the current implementation of the regular expression engine. The best you can rewrite your regular expression. Even replacing "(.|\s)" with just "." can help.

@mrabarnett
Copy link
Mannequin

mrabarnett mannequin commented Apr 4, 2017

A slightly shorter form:

/\*(?:(?!\*/).)*\*/

Basically it's:

match start
    while not match end:
        consume character
match end

If the "match end" is a single character, you can use a negated character set, for example:

[^\n]*

otherwise you need a negative lookahead, for example:

(?:(?!\r\n).)*

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage topic-regex
Projects
None yet
Development

No branches or pull requests

1 participant