Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Undocumented feature prevents re module from finding certain matches #67880

Open
abacabadabacaba mannequin opened this issue Mar 17, 2015 · 10 comments
Open

Undocumented feature prevents re module from finding certain matches #67880

abacabadabacaba mannequin opened this issue Mar 17, 2015 · 10 comments
Labels
3.7 (EOL) end of life topic-regex type-bug An unexpected behavior, bug, or error

Comments

@abacabadabacaba
Copy link
Mannequin

abacabadabacaba mannequin commented Mar 17, 2015

BPO 23692
Nosy @ezio-melotti, @serhiy-storchaka, @animalize, @LewisGaul, @jacksonriley

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2015-03-17.18:49:44.539>
labels = ['expert-regex', 'type-bug', '3.7']
title = 'Undocumented feature prevents re module from finding certain matches'
updated_at = <Date 2019-11-12.10:33:07.346>
user = 'https://bugs.python.org/abacabadabacaba'

bugs.python.org fields:

activity = <Date 2019-11-12.10:33:07.346>
actor = 'jacksonriley'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Regular Expressions']
creation = <Date 2015-03-17.18:49:44.539>
creator = 'abacabadabacaba'
dependencies = []
files = []
hgrepos = []
issue_num = 23692
keywords = []
message_count = 10.0
messages = ['238330', '355471', '355493', '355646', '355950', '355953', '355955', '355982', '355989', '356435']
nosy_count = 7.0
nosy_names = ['ezio.melotti', 'mrabarnett', 'abacabadabacaba', 'serhiy.storchaka', 'malin', 'LewisGaul', 'jacksonriley']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'needs patch'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue23692'
versions = ['Python 2.7', 'Python 3.5', 'Python 3.6', 'Python 3.7']

@abacabadabacaba
Copy link
Mannequin Author

abacabadabacaba mannequin commented Mar 17, 2015

This pattern matches:

re.match('(?:()|(?(1)()|z)){2}(?(2)a|z)', 'a')

But this doesn't:

re.match('(?:()|(?(1)()|z)){0,2}(?(2)a|z)', 'a')

The difference is that {2} is replaced by {0,2}. This shouldn't prevent the pattern from matching anywhere where it matched before.

The reason for this misbehavior is a feature which is designed to protect re engine from infinite loops, but in fact it sometimes prevents patterns from matching where they should. I think that this feature should be at least properly documented, by properly I mean that it should be possible to reconstruct the exact behavior from documentation, as the implementation is not particularly easy to understand.

@abacabadabacaba abacabadabacaba mannequin added topic-regex type-bug An unexpected behavior, bug, or error labels Mar 17, 2015
@serhiy-storchaka serhiy-storchaka added the 3.7 (EOL) end of life label Oct 16, 2016
@LewisGaul
Copy link
Mannequin

LewisGaul mannequin commented Oct 27, 2019

Hi there, if anyone's able to provide any guidance on this issue I'd be happy to take a look into it.

Is this a behaviour that is feasible to fix, or should this just be documented in some way as suggested by Evgeny?

@mrabarnett
Copy link
Mannequin

mrabarnett mannequin commented Oct 27, 2019

Suppose you had a pattern:

.*

It would advance one character on each iteration of the * until the . failed to match. The text is finite, so it would stop matching eventually.

Now suppose you had a pattern:

(?:)*

On each iteration of the * it wouldn't advance, so it would keep matching forever.

A way to avoid that is to stop the * if it hasn't advanced.

The example pattern shows that there's still a problem. It advances if a group has matched, but that group doens't match until the first iteration, after the test, and does not, itself, advance. The * stops because it hasn't advanced, but, in this instance, that doesn't mean it never will.

The solution is for the * to check not only whether it has advanced, but also whether a group has changed. (Strictly speaking, the latter check is needed only if the repeated part tests whether a group also in the repeated part has changed, but it's probably not worth "optimising" for that possibility.)

In the regex module, it increments a "capture changed" counter whenever any group is changed (a group's first match or a change to a group's span). That makes it easier for the * to check. The code needs to save that counter for backtracking and restore it when backtracking.

I've mentioned only the *, but the same remarks apply to + and {...}, except that the {...} should keep repeating until it has reached its prescribed minimum.

@LewisGaul
Copy link
Mannequin

LewisGaul mannequin commented Oct 29, 2019

Thanks for the explanation Matthew, I'll take a further look at some point in the coming weeks.

@jacksonriley
Copy link
Mannequin

jacksonriley mannequin commented Nov 4, 2019

Hi Matthew, thank you for your suggestions of where to start.
Could you possibly give a pointer to the place in the code where the 'capture changed' counter is incremented? I had a bit of a hunt and couldn't find it but may have been looking in the wrong places!

@serhiy-storchaka
Copy link
Member

Matthew referred to the code of the regex module (of which he is the author).

https://pypi.org/project/regex/

@jacksonriley
Copy link
Mannequin

jacksonriley mannequin commented Nov 4, 2019

Ah thank you very much Serhiy, that's super helpful!

@jacksonriley
Copy link
Mannequin

jacksonriley mannequin commented Nov 4, 2019

I've got a bit confused and am doubting myself - is the below output expected?
>>> m = re.match('(?:()|(?(1)()|z)){1,2}(?(2)a|z)', 'a')
>>> m.groups()
('', '')
>>> m = re.match('(?:()|(?(1)()|z)){1,2}(?(1)a|z)', 'a')
>>> m.groups()
('', None)

The first pattern doesn't behave as I would (probably naively expect) given Matthew's explanation of this bug - wouldn't the bug cause the match to fail with {1,2} as well?
Also, it seems odd that changing the condition in the if clause should change what gets captured.

Anyone have any thoughts?

@mrabarnett
Copy link
Mannequin

mrabarnett mannequin commented Nov 5, 2019

It's been many years since I looked at the code, and there have been changes since then, so some of the details might not be correct.

As to have it should behave:

re.match('(?:()|(?(1)()|z)){1,2}(?(2)a|z)', 'a')

Iteration 1.
Match the repeated part. Group 1 matches.
Iteration 2.
Match the repeated part. Group 1 matches.
Has group 2 matched? No.
Try to match 'z'. Fail and backtrack.
Retry the repeated part.
Iteration 2.
Has group 1 matched? Yes.
Group 2 matches.
Has group 2 matched? Yes.
Try to match 'a'. Success. Group 1 matched and group 2 matched.

re.match('(?:()|(?(1)()|z)){1,2}(?(1)a|z)', 'a')

Iteration 1.
Match the repeated part. Group 1 matches.
Iteration 2.
Match the repeated part. Group 1 matches.
Has group 1 matched? Yes.
Try to match 'a'. Success. Group 1 matched and group 2 didn't match.

@jacksonriley
Copy link
Mannequin

jacksonriley mannequin commented Nov 12, 2019

Hi Matthew, Serhiy,

I tried to identify the right places in re to fix things but have found it a bit difficult.
I wrote up my attempt (at https://enhackathon.github.io/2019/11/04/JacksonRiley.html) which includes some examples that behave differently from how I would have expected this bug to manifest itself.
Any further guidance would be greatly appreciated!

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.7 (EOL) end of life topic-regex type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

1 participant