re.escape() escapes too much #74181
Comments
re.escape() escapes every character except ASCII letters, digits and '_'. This is excessive: it makes escaping and compiling slower and makes the pattern less human-readable. The characters "!\"%&\',/:;<=>@_`~", as well as non-ASCII characters, are always literal in a regular expression and don't need escaping.

The proposed patch makes re.escape() escape only the minimal set of characters that can have special meaning in regular expressions. This includes the special characters ".\\[]{}()*+?^$|", "-" (a range in a character set), "#" (starts a comment in verbose mode) and ASCII whitespace (ignored in verbose mode). The null character no longer needs special escaping.

The patch also increases the speed of re.escape() (even if it produces the same result).

$ ./python -m perf timeit -s 'from re import escape; s = "()[]{}?*+-|^$\\.# \t\n\r\v\f"' -- --duplicate 100 'escape(s)'
Unpatched: Median +- std dev: 42.2 us +- 0.8 us
Patched: Median +- std dev: 11.4 us +- 0.1 us
$ ./python -m perf timeit -s 'from re import escape; s = b"()[]{}?*+-|^$\\.# \t\n\r\v\f"' -- --duplicate 100 'escape(s)'
Unpatched: Median +- std dev: 38.7 us +- 0.7 us
Patched: Median +- std dev: 18.4 us +- 0.2 us
$ ./python -m perf timeit -s 'from re import escape; s = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"' -- --duplicate 100 'escape(s)'
Unpatched: Median +- std dev: 40.3 us +- 0.5 us
Patched: Median +- std dev: 33.1 us +- 0.6 us
$ ./python -m perf timeit -s 'from re import escape; s = b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"' -- --duplicate 100 'escape(s)'
Unpatched: Median +- std dev: 54.4 us +- 0.7 us
Patched: Median +- std dev: 40.6 us +- 0.5 us
$ ./python -m perf timeit -s 'from re import escape; s = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюяАБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯ"' -- --duplicate 100 'escape(s)'
Unpatched: Median +- std dev: 156 us +- 3 us
Patched: Median +- std dev: 43.5 us +- 0.5 us
$ ./python -m perf timeit -s 'from re import escape; s = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюяАБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯ".encode()' -- --duplicate 100 'escape(s)'
Unpatched: Median +- std dev: 200 us +- 4 us
Patched: Median +- std dev: 77.0 us +- 0.6 us

Compiling the escaped string is faster as well.

$ ./python -m perf timeit -s 'from re import escape; from sre_compile import compile; s = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюяАБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯ"; p = escape(s)' -- --duplicate 100 'compile(p)'
Unpatched: Median +- std dev: 1.96 ms +- 0.02 ms
Patched: Median +- std dev: 1.16 ms +- 0.02 ms
$ ./python -m perf timeit -s 'from re import escape; from sre_compile import compile; s = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюяАБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯ".encode(); p = escape(s)' -- --duplicate 100 'compile(p)'
Unpatched: Median +- std dev: 3.69 ms +- 0.04 ms
Patched: Median +- std dev: 2.13 ms +- 0.03 ms
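
For illustration, the character set described above can be sketched as a translation table. This is only a rough sketch of the idea, not the actual patch: the names `_SPECIAL`, `_ESCAPE_TABLE` and `minimal_escape` are hypothetical, and the set is taken literally from the description in this report.

```python
import re

# Characters listed in the proposal as needing a backslash:
# regex metacharacters, "-" (range in a character set),
# "#" and ASCII whitespace (both significant in verbose mode).
_SPECIAL = ".\\[]{}()*+?^$|" + "-" + "#" + " \t\n\r\v\f"
_ESCAPE_TABLE = {ord(c): "\\" + c for c in _SPECIAL}


def minimal_escape(pattern: str) -> str:
    """Hypothetical helper: escape only the characters listed above."""
    return pattern.translate(_ESCAPE_TABLE)


samples = ["a.b*c", "price: $5 (approx)", "key=value;x<y", "абв-ґд", "# comment\n"]
for s in samples:
    escaped = minimal_escape(s)
    # The escaped pattern should match the original text literally,
    # in normal mode and in verbose mode alike.
    assert re.fullmatch(escaped, s)
    assert re.fullmatch(escaped, s, re.VERBOSE)
```

Strings such as "key=value;x<y" pass through unchanged, which is where the readability gain comes from.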
Serhiy, please nosy me when you change idlelib files.
Aaaand this broke my unit tests when moving from 3.6 to 3.7!