Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize (?!) in regular expressions #106566

Closed
serhiy-storchaka opened this issue Jul 9, 2023 · 4 comments
Closed

Optimize (?!) in regular expressions #106566

serhiy-storchaka opened this issue Jul 9, 2023 · 4 comments
Labels
performance Performance or resource usage topic-regex type-feature A feature request or enhancement

Comments

@serhiy-storchaka
Copy link
Member

serhiy-storchaka commented Jul 9, 2023

Some regular expression engines support (*FAIL) as a pattern which fails to match anything. (?!) is an idiomatic way to write this in engines which do not support (*FAIL).

It works pretty well, but it can be optimized. Instead of compiling it as ASSERT_NOT opcode

>>> re.compile(r'12(?!)|3', re.DEBUG)
BRANCH
  LITERAL 49
  LITERAL 50
  ASSERT_NOT 1
OR
  LITERAL 51

 0. INFO 9 0b100 1 2 (to 10)
      in
 5.     LITERAL 0x31 ('1')
 7.     LITERAL 0x33 ('3')
 9.     FAILURE
10: BRANCH 11 (to 22)
12.   LITERAL 0x31 ('1')
14.   LITERAL 0x32 ('2')
16.   ASSERT_NOT 3 0 (to 20)
19.     SUCCESS
20:   JUMP 7 (to 28)
22: branch 5 (to 27)
23.   LITERAL 0x33 ('3')
25.   JUMP 2 (to 28)
27: FAILURE
28: SUCCESS
re.compile('12(?!)|3', re.DEBUG)

it can be compiled as FAILURE opcode.

>>> re.compile(r'12(?!)|3', re.DEBUG)
BRANCH
  LITERAL 49
  LITERAL 50
  FAILURE
OR
  LITERAL 51

 0. INFO 9 0b100 1 2 (to 10)
      in
 5.     LITERAL 0x31 ('1')
 7.     LITERAL 0x33 ('3')
 9.     FAILURE
10: BRANCH 8 (to 19)
12.   LITERAL 0x31 ('1')
14.   LITERAL 0x32 ('2')
16.   FAILURE
17.   JUMP 7 (to 25)
19: branch 5 (to 24)
20.   LITERAL 0x33 ('3')
22.   JUMP 2 (to 25)
24: FAILURE
25: SUCCESS
re.compile('12(?!)|3', re.DEBUG)

Unfortunately I do not know good examples of using (*FAIL) in regular expressions (without using (*SKIP)) to include them in the documentation. Perhaps other patterns of using (*FAIL) could be optimized future, but I do not know what to optimize.

Linked PRs

@serhiy-storchaka serhiy-storchaka added type-feature A feature request or enhancement performance Performance or resource usage topic-regex labels Jul 9, 2023
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Jul 9, 2023
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Jul 9, 2023
@picnixz
Copy link
Contributor

picnixz commented Jul 11, 2023

As an example of a (*FAIL) without a (*SKIP) (I think?):

>>> pattern = re.compile(r'^(\[)?\d+(?:\]|;(?(1)(?!)))$')
>>> pattern.match('[123]')
<_sre.SRE_Match object; span=(0, 5), match='[123]'>
>>> pattern.match('123;')
<_sre.SRE_Match object; span=(0, 4), match='123;'>
>>> pattern.match('123') is None
True

What it does: you want to match integers either enclosed in brackets or ending with a semi-colon (didn't say it's an interesting example!). A more interesting pattern would be in Perl1:

$ perl -e ''aaab' =~ /a+b?(?{print "$&\n";})(*FAIL)/;'
aaab
aaa
aa
a
aab
aa
a
ab
a

Footnotes

  1. https://perldoc.perl.org/perlre#(*PRUNE)-(*PRUNE:NAME)

@serhiy-storchaka
Copy link
Member Author

Thank you for your example @picnixz. But is not it equivalent to a simpler r'^(\[)?\d+(?(1)\]|;)$'?

Perl's example is interesting, but Python does not support embedding arbitrary code (and perhaps should not, this feature looks too unsafe).

@picnixz
Copy link
Contributor

picnixz commented Jul 13, 2023

But is not it equivalent to a simpler r'^(\[)?\d+(?(1)\]|;)$'?

Yes it is but then my example won't be using (?!) anymore 😢. I honestly cannot come up with a better idea in pure Python.

Perl's example is interesting, but Python does not support embedding arbitrary code (and perhaps should not, this feature looks too unsafe).

I totally agree that this feature is unsafe but AFAIK, Perl doesn't care (it only gives you the possibility to do it although I didn't find an advice telling you not do it due to security concerns).

In .NET, you can use (?!) as follows to match balanced brackets:

^(?<stack>\[)+[^[\]]*(?<-stack>\])+(?(stack)(?!))$

What it does:

  • (?<stack>\[)+ matches the opening brackets and store them in the stack.
  • [^[\]]* matches all non-brackets delimiters.
  • (?<-stack>\])+ matches the closing brackets, popping one item from the group stack at every match.
  • (?(stack)(?!)) checks if the stack is empty. If so, this means that the number of opening and closing brackets are the same.

I think this feature (IIRC, .NET refers it to as balancing groups) could be one day incorporated in Python (I don't know if regex has any plan towards that direction).

@serhiy-storchaka
Copy link
Member Author

Seems it is mostly useful with constructions which Python does not support. When (if) these are added, we'll see if (?!) can be specially handled in these cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage topic-regex type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

2 participants