Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Complementary re patterns such as [\s\S] or [\w\W] are much slower than . with DOTALL #111259

Open
pan324 opened this issue Oct 24, 2023 · 3 comments
Assignees
Labels
performance Performance or resource usage topic-regex

Comments

@pan324
Copy link
Contributor

pan324 commented Oct 24, 2023

Bug report

Bug description:

import re
from time import perf_counter as time

p1 = re.compile(r"[\s\S]*")
p2 = re.compile(".*", re.DOTALL)

s = "a"*10000
for p in (p1,p2):
    t0 = time()
    for i in range(10000): _=p.match(s)
    print(time()-t0)

Runtimes are 0.44 s vs 0.0016 s on my system. Instead of simplification, the [\s\S] is stepped through one after another. \s does not match so then \S is checked (the order [\S\s] is twice as fast for the string here). This is not solely an issue for larger matches. A 40 char string is processed half as fast when using [\s\S]. Even 10 chars take about 25% longer to process. I'm not completely sure whether this qualifies as a bug or an issue with documentation. Other languages don't have the DOTALL option and always rely on the first option. Plenty of posts on SO and elsewhere will thus advocate using [\s\S] as an all-matching regex pattern. Unsuspecting Python programmers such as @barneygale may expect [\s\S] to be identical to using a dot with DOTALL as seen below.

@serhiy-storchaka

cpython/Lib/pathlib.py

Lines 126 to 133 in 9bb202a

elif part == '**\n':
# '**/' component: we use '[\s\S]' rather than '.' so that path
# separators (i.e. newlines) are matched. The trailing '^' ensures
# we terminate after a path separator (i.e. on a new line).
part = r'[\s\S]*^'
elif part == '**':
# '**' component.
part = r'[\s\S]*'

CPython versions tested on:

3.11, 3.13

Operating systems tested on:

Linux, Windows

Linked PRs

@pan324 pan324 added the type-bug An unexpected behavior, bug, or error label Oct 24, 2023
@AlexWaygood AlexWaygood added performance Performance or resource usage topic-regex labels Oct 24, 2023
@serhiy-storchaka
Copy link
Member

It is neither a bug in the re module nor issue an issue with documentation. It works as expected, and there is no promise that it will in the work the fastest way (it does not in many other cases).

This can be qualified as a bug in the pathlib module, as well as a performance improvement request for the re module or a documentation enhancement request for the re module.

You can use the (?s:.) pattern to match any character regardless of the pattern flags.

@pan324
Copy link
Contributor Author

pan324 commented Oct 24, 2023

Is there a document of slow and fast patterns for the Python regex engine? I just stumbled across this pattern here by chance.

If it does work as expected then I think it cannot qualify as a bug in pathlib.

Feel free to set different labels that match the issue more closely.

@serhiy-storchaka
Copy link
Member

serhiy-storchaka commented Oct 25, 2023

  • Use more efficient pattern (?s:.) instead of [\s\S] in pathlib.
  • Explicitly document (?s:.) (and maybe other idiomatic patterns).
  • (?) Compile [\s\S] in the same opcode as (?s:.).

I see that [\s\S] is used in third-party code, for example in pyparsing and pygments. I am not even sure that [\s\S] and [\w\W] are always equivalent to (?s:.). Currently it is true, but it is because the support of the Unicode standard (https://www.unicode.org/reports/tr18/) in the re module is so loose. After resolving #44795 it may no longer be true, at least for [\w\W]. So I am not sure about making such optimization automatically.

serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Oct 25, 2023
Regular expression pattern `(?s:.)` is mach faster than `[\s\S]`.
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Oct 25, 2023
Regular expression pattern `(?s:.)` is much faster than `[\s\S]`.
@serhiy-storchaka serhiy-storchaka self-assigned this Oct 25, 2023
@serhiy-storchaka serhiy-storchaka removed the type-bug An unexpected behavior, bug, or error label Oct 25, 2023
serhiy-storchaka added a commit that referenced this issue Oct 26, 2023
Regular expression pattern `(?s:.)` is much faster than `[\s\S]`.
aisk pushed a commit to aisk/cpython that referenced this issue Feb 11, 2024
…1303)

Regular expression pattern `(?s:.)` is much faster than `[\s\S]`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage topic-regex
Projects
None yet
Development

No branches or pull requests

3 participants