-
-
Notifications
You must be signed in to change notification settings - Fork 30.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize case-insensitive regular expressions #74471
Comments
Matching and searching case-insensitive regular expressions is much slower than matching and searching case-sensitive regular expressions. Case-insensitivity requires converting every character in input string to lower case and disables some optimizations. But there are only 2669 cased characters (52 in ASCII mode). For all other characters in the pattern we can use case-sensitive matching. Results of microbenchmarks: $ ./python -m timeit -s "import re; p = re.compile(r'(?ai) +x'); s = ' '*10000+'x'" "p.match(s)"
Unpatched: 5000 loops, best of 5: 47.1 usec per loop
Patched: 20000 loops, best of 5: 17.6 usec per loop
$ ./python -m timeit -s "import re; p = re.compile(r'(?i) +x'); s = ' '*10000+'x'" "p.match(s)"
Unpatched: 2000 loops, best of 5: 109 usec per loop
Patched: 20000 loops, best of 5: 17.6 usec per loop
$ ./python -m timeit -s "import re; p = re.compile(r'(?ai)[ \t]+x'); s = ' '*10000+'x'" "p.match(s)"
Unpatched: 500 loops, best of 5: 537 usec per loop
Patched: 5000 loops, best of 5: 84.2 usec per loop
$ ./python -m timeit -s "import re; p = re.compile(r'(?i)[ \t]+x'); s = ' '*10000+'x'" "p.match(s)"
Unpatched: 500 loops, best of 5: 608 usec per loop
Patched: 5000 loops, best of 5: 84.1 usec per loop
$ ./python -m timeit -s "import re; p = re.compile(r'(?ai)/x'); s = ' '*10000+'/x'" "p.search(s)"
Unpatched: 1000 loops, best of 5: 226 usec per loop
Patched: 20000 loops, best of 5: 13.4 usec per loop
$ ./python -m timeit -s "import re; p = re.compile(r'(?i)/x'); s = ' '*10000+'/x'" "p.search(s)"
Unpatched: 1000 loops, best of 5: 284 usec per loop
Patched: 20000 loops, best of 5: 13.4 usec per loop
$ ./python -m timeit -s "import re; p = re.compile(r'(?ai)[/%]x'); s = ' '*10000+'/x'" "p.search(s)"
Unpatched: 500 loops, best of 5: 483 usec per loop
Patched: 1000 loops, best of 5: 279 usec per loop
$ ./python -m timeit -s "import re; p = re.compile(r'(?i)[/%]x'); s = ' '*10000+'/x'" "p.search(s)"
Unpatched: 500 loops, best of 5: 549 usec per loop
Patched: 1000 loops, best of 5: 279 usec per loop The patch also optimizes searching some complex case-sensitive patterns. This is a side effect, I'll provide more comprehensive optimization in other issue. $ ./python -m timeit -s "import re; p = re.compile(r'([xy])'); s = ' '*10000+'x'" "p.search(s)"
Unpatched: 500 loops, best of 5: 633 usec per loop
Patched: 1000 loops, best of 5: 250 usec per loop
$ ./python -m timeit -s "import re; p = re.compile(r'((x|y)z)'); s = ' '*10000+'xz'" "p.search(s)"
Unpatched: 500 loops, best of 5: 716 usec per loop
Patched: 1000 loops, best of 5: 250 usec per loop |
This seems like a great idea. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: