important performance regression on regular expression parsing #81904
On complex cases, parsing regular expressions takes much, much longer on Python >= 3.7.

Example (ipython):

In [1]: import re

The test was run on Amazon Linux AMI 2017.03. On Python 3.6.1, the regexp compiled in 2.6 seconds; on Python 3.7.3, it compiled in 15 minutes (a ~350x increase in this case).

Profiling with cProfile shows that the slowdown is caused by the sre_parse._uniq function, which does not exist in Python <= 3.6. The average complexity of this function is O(N^2), but it can easily be reduced to O(N).

The issue might not be noticeable with simple regexps, but programs like text tokenizers, which use complex regexps, can be seriously impacted by this regression.
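The quadratic cost comes from testing membership against a growing list. A minimal sketch of that pattern (hypothetical illustration, not necessarily the exact CPython 3.7 code):

```python
def uniq_quadratic(items):
    # Deduplicate while preserving order of first occurrence.
    # The membership test scans the whole output list each time,
    # so the loop as a whole is O(N^2) on average.
    newitems = []
    for item in items:
        if item not in newitems:  # O(len(newitems)) per check
            newitems.append(item)
    return newitems

print(uniq_quadratic([3, 1, 3, 2, 1]))  # → [3, 1, 2]
```

With a character set of hundreds of thousands of items, an O(N^2) pass like this dominates the entire compile time.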
Indeed, it was not expected that the character set would contain hundreds of thousands of items. What is its size in your real code? Could you please show benchmarking results for different implementations and different sizes?
I can't answer that precisely, but sacremoses (a tokenization package), for example, is strongly impacted. See https://github.com/alvations/sacremoses/issues/61#issuecomment-516401853
Oh, this is convincing.
I've just had a look at _uniq, and the code surprises me. The obvious way to detect duplicates is with a set, but that requires the items to be hashable. Are they? Well, the first line of the function uses set, so they are. Why, then, isn't it using a set to detect the duplicates? How about this:

def _uniq(items):
    newitems = []
    seen = set()
    for item in items:
        if item not in seen:
            newitems.append(item)
            seen.add(item)
    return newitems
Hey Matthew, we decided to go for this, which is simpler and more straightforward:

def _uniq(items):
    return list(dict.fromkeys(items))

(see #15030)
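Since Python 3.7, dicts are guaranteed to preserve insertion order, so dict.fromkeys gives an O(N) deduplication that keeps the first occurrence of each item. A quick sketch of the behavior:

```python
def _uniq(items):
    # dict keys are unique and, since Python 3.7, ordered by insertion,
    # so this deduplicates in O(N) while preserving first occurrence.
    return list(dict.fromkeys(items))

print(_uniq([3, 1, 3, 2, 1]))  # → [3, 1, 2]
```

Like the set-based version above, this requires the items to be hashable, which they are here.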
Thank you for your contribution, yannvgn.