re.sub stalls forever on an unmatched non-greedy case #74163
Comments
Hello, I assume I have hit some bug/misbehaviour in the re module. I will provide you a "working" example:
    import re
    RE_C_COMMENTS = re.compile(r"/\*(.|\s)*?\*/", re.MULTILINE|re.DOTALL|re.UNICODE)
    text = "Special section /* valves:\n\n\nsilicone\n\n\n\n\n\n\nHarness:\n\n\nmetal and plastic fibre\n\n\n\n\n\n\nInner frame:\n\n\nmultibutylene\n\n\n\n\n\n\nWeight:\n\n\n147 g\n\n\n\n\n\n\n\n\n\n\n\n\n\nSelection guide\n"
and then this command takes forever:
and the same problem you can notice on the first 90 chars, it takes 10s on my machine:
Some clarification: I am trying to remove C-style comments from text with a non-greedy regular expression, and in this case the start of a comment (/*) is found, but the end of the comment (*/) cannot be found. Notice the multiline and other re options.
Python versions used: '2.7.11 (default, Jan 22 2016, 16:30:50) \n[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]' / macOs 10.12.13 and:
The problem here is that both "." and "\s" match a whitespace character, and because you have the re.DOTALL flag turned on this includes "\n", and so the number of different ways in which (.|\s)* can be matched against a string is exponential in the number of whitespace characters in the string. It is best to design your regular expression so as to limit the number of different ways it can match. Here I recommend the expression:
which can match in only one way.
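To make the exponential growth concrete, here is a minimal sketch of a naive backtracking matcher for the tail of the pattern, `(.|\s)*?\*/`, that simply counts attempts. It is a simplified model written for illustration, not CPython's actual engine, and the name `count_steps` is hypothetical.

```python
def count_steps(body: str) -> int:
    # Simplified model (an assumption, not the real sre engine) of how a
    # backtracking matcher tries  (.|\s)*?\*/  against `body` with DOTALL on.
    steps = 0

    def attempt(i: int) -> bool:
        nonlocal steps
        steps += 1                      # one attempt to match "*/" at position i
        if body.startswith("*/", i):
            return True
        if i == len(body):
            return False
        if attempt(i + 1):              # alternative 1: "." consumes body[i]
            return True
        if body[i].isspace():
            return attempt(i + 1)       # alternative 2: "\s" consumes the same char
        return False

    attempt(0)
    return steps


for k in (5, 10, 15):
    # Unterminated comment body made only of newlines: steps double per newline.
    print(k, count_steps("\n" * k))
# 5 63
# 10 2047
# 15 65535
```

With no closing `*/`, every whitespace character doubles the work, while a run of non-whitespace characters only adds one step each (`count_steps("a" * 10)` is 11).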
See also bpo-28690, bpo-212521, bpo-753711, bpo-1515829, etc.
This is a well-known issue called catastrophic backtracking. It can't be solved with the current implementation of the regular expression engine. The best you can do is rewrite your regular expression. Even replacing "(.|\s)" with just "." can help.
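As a rough sketch of that last suggestion: with re.DOTALL set, "." already matches newlines, so the "\s" alternative adds nothing except ambiguity. The names `C_COMMENT_RE` and `strip_c_comments` below are hypothetical, chosen only for this example.

```python
import re

# "(.|\s)" replaced by "."; re.DOTALL keeps "." matching newlines, so
# multi-line comments are still covered, but each character can now be
# consumed in only one way.
C_COMMENT_RE = re.compile(r"/\*.*?\*/", re.DOTALL)


def strip_c_comments(text: str) -> str:
    # Remove every terminated /* ... */ comment; unterminated ones are left alone.
    return C_COMMENT_RE.sub("", text)


# A terminated comment is removed.
print(repr(strip_c_comments("int a; /* one\n two */ int b;")))

# An unterminated comment followed by many newlines returns promptly, unchanged.
unterminated = "Special section /* valves:" + "\n" * 40 + "Selection guide\n"
print(strip_c_comments(unterminated) == unterminated)
```

On an unterminated comment this pattern scans forward once from the `/*` and then gives up, instead of retrying every combination of alternatives.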
A slightly shorter form:
Basically it's:
    while not match end:
        consume character
If the "match end" is a single character, you can use a negated character set, for example:
otherwise you need a negative lookahead, for example:
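A minimal sketch of both forms of the "while not match end: consume character" idea, using patterns chosen here for illustration (they are assumptions, not necessarily the ones posted in this thread). `QUOTED_RE` and `C_COMMENT_RE` are hypothetical names.

```python
import re

# Single-character terminator ("): a negated character set consumes
# everything that is not the terminator, then the terminator itself.
QUOTED_RE = re.compile(r'"[^"]*"')

# Multi-character terminator (*/): a negative lookahead checks, before each
# consumed character, that the terminator does not start at this position.
C_COMMENT_RE = re.compile(r"/\*(?:(?!\*/).)*\*/", re.DOTALL)

print(repr(QUOTED_RE.sub("", 'say "hi" now')))
print(repr(C_COMMENT_RE.sub("", "a /* x\n y */ b")))

# An unterminated comment is left untouched and fails quickly.
unterminated = "a /* x" + "\n" * 40
print(C_COMMENT_RE.sub("", unterminated) == unterminated)
```

In both patterns each character of the input can be consumed in exactly one way, so a failed match costs time roughly proportional to the length of the text rather than exponential in it.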