Undocumented feature prevents re module from finding certain matches #67880

abacabadabacaba · 2015-03-17T18:49:45Z

BPO	23692
Nosy	@ezio-melotti, @serhiy-storchaka, @animalize, @LewisGaul, @jacksonriley

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2015-03-17.18:49:44.539>
labels = ['expert-regex', 'type-bug', '3.7']
title = 'Undocumented feature prevents re module from finding certain matches'
updated_at = <Date 2019-11-12.10:33:07.346>
user = 'https://bugs.python.org/abacabadabacaba'

bugs.python.org fields:

activity = <Date 2019-11-12.10:33:07.346>
actor = 'jacksonriley'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Regular Expressions']
creation = <Date 2015-03-17.18:49:44.539>
creator = 'abacabadabacaba'
dependencies = []
files = []
hgrepos = []
issue_num = 23692
keywords = []
message_count = 10.0
messages = ['238330', '355471', '355493', '355646', '355950', '355953', '355955', '355982', '355989', '356435']
nosy_count = 7.0
nosy_names = ['ezio.melotti', 'mrabarnett', 'abacabadabacaba', 'serhiy.storchaka', 'malin', 'LewisGaul', 'jacksonriley']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'needs patch'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue23692'
versions = ['Python 2.7', 'Python 3.5', 'Python 3.6', 'Python 3.7']

abacabadabacaba · 2015-03-17T18:49:44Z

This pattern matches:

re.match('(?:()|(?(1)()|z)){2}(?(2)a|z)', 'a')

But this doesn't:

re.match('(?:()|(?(1)()|z)){0,2}(?(2)a|z)', 'a')

The difference is that {2} is replaced by {0,2}. This shouldn't prevent the pattern from matching anywhere where it matched before.

The reason for this misbehavior is a feature which is designed to protect re engine from infinite loops, but in fact it sometimes prevents patterns from matching where they should. I think that this feature should be at least properly documented, by properly I mean that it should be possible to reconstruct the exact behavior from documentation, as the implementation is not particularly easy to understand.

LewisGaul · 2019-10-27T15:14:56Z

Hi there, if anyone's able to provide any guidance on this issue I'd be happy to take a look into it.

Is this a behaviour that is feasible to fix, or should this just be documented in some way as suggested by Evgeny?

mrabarnett · 2019-10-27T18:36:00Z

Suppose you had a pattern:

.*

It would advance one character on each iteration of the * until the . failed to match. The text is finite, so it would stop matching eventually.

Now suppose you had a pattern:

(?:)*

On each iteration of the * it wouldn't advance, so it would keep matching forever.

A way to avoid that is to stop the * if it hasn't advanced.

The example pattern shows that there's still a problem. It advances if a group has matched, but that group doens't match until the first iteration, after the test, and does not, itself, advance. The * stops because it hasn't advanced, but, in this instance, that doesn't mean it never will.

The solution is for the * to check not only whether it has advanced, but also whether a group has changed. (Strictly speaking, the latter check is needed only if the repeated part tests whether a group also in the repeated part has changed, but it's probably not worth "optimising" for that possibility.)

In the regex module, it increments a "capture changed" counter whenever any group is changed (a group's first match or a change to a group's span). That makes it easier for the * to check. The code needs to save that counter for backtracking and restore it when backtracking.

I've mentioned only the *, but the same remarks apply to + and {...}, except that the {...} should keep repeating until it has reached its prescribed minimum.

LewisGaul · 2019-10-29T11:18:52Z

Thanks for the explanation Matthew, I'll take a further look at some point in the coming weeks.

jacksonriley · 2019-11-04T15:19:05Z

Hi Matthew, thank you for your suggestions of where to start.
Could you possibly give a pointer to the place in the code where the 'capture changed' counter is incremented? I had a bit of a hunt and couldn't find it but may have been looking in the wrong places!

serhiy-storchaka · 2019-11-04T15:35:22Z

Matthew referred to the code of the regex module (of which he is the author).

https://pypi.org/project/regex/

jacksonriley · 2019-11-04T15:45:50Z

Ah thank you very much Serhiy, that's super helpful!

jacksonriley · 2019-11-04T22:02:29Z

I've got a bit confused and am doubting myself - is the below output expected?
>>> m = re.match('(?:()|(?(1)()|z)){1,2}(?(2)a|z)', 'a')
>>> m.groups()
('', '')
>>> m = re.match('(?:()|(?(1)()|z)){1,2}(?(1)a|z)', 'a')
>>> m.groups()
('', None)

The first pattern doesn't behave as I would (probably naively expect) given Matthew's explanation of this bug - wouldn't the bug cause the match to fail with {1,2} as well?
Also, it seems odd that changing the condition in the if clause should change what gets captured.

Anyone have any thoughts?

mrabarnett · 2019-11-05T00:48:04Z

It's been many years since I looked at the code, and there have been changes since then, so some of the details might not be correct.

As to have it should behave:

re.match('(?:()|(?(1)()|z)){1,2}(?(2)a|z)', 'a')

Iteration 1.
Match the repeated part. Group 1 matches.
Iteration 2.
Match the repeated part. Group 1 matches.
Has group 2 matched? No.
Try to match 'z'. Fail and backtrack.
Retry the repeated part.
Iteration 2.
Has group 1 matched? Yes.
Group 2 matches.
Has group 2 matched? Yes.
Try to match 'a'. Success. Group 1 matched and group 2 matched.

re.match('(?:()|(?(1)()|z)){1,2}(?(1)a|z)', 'a')

Iteration 1.
Match the repeated part. Group 1 matches.
Iteration 2.
Match the repeated part. Group 1 matches.
Has group 1 matched? Yes.
Try to match 'a'. Success. Group 1 matched and group 2 didn't match.

jacksonriley · 2019-11-12T10:33:07Z

Hi Matthew, Serhiy,

I tried to identify the right places in re to fix things but have found it a bit difficult.
I wrote up my attempt (at https://enhackathon.github.io/2019/11/04/JacksonRiley.html) which includes some examples that behave differently from how I would have expected this bug to manifest itself.
Any further guidance would be greatly appreciated!

abacabadabacaba mannequin added topic-regex type-bug An unexpected behavior, bug, or error labels Mar 17, 2015

serhiy-storchaka added the 3.7 (EOL) end of life label Oct 16, 2016

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Undocumented feature prevents re module from finding certain matches #67880

Undocumented feature prevents re module from finding certain matches #67880

abacabadabacaba mannequin commented Mar 17, 2015

abacabadabacaba mannequin commented Mar 17, 2015

LewisGaul mannequin commented Oct 27, 2019

mrabarnett mannequin commented Oct 27, 2019

LewisGaul mannequin commented Oct 29, 2019

jacksonriley mannequin commented Nov 4, 2019

serhiy-storchaka commented Nov 4, 2019

jacksonriley mannequin commented Nov 4, 2019

jacksonriley mannequin commented Nov 4, 2019

mrabarnett mannequin commented Nov 5, 2019

jacksonriley mannequin commented Nov 12, 2019

Undocumented feature prevents re module from finding certain matches #67880

Undocumented feature prevents re module from finding certain matches #67880

Comments

abacabadabacaba mannequin commented Mar 17, 2015

abacabadabacaba mannequin commented Mar 17, 2015

LewisGaul mannequin commented Oct 27, 2019

mrabarnett mannequin commented Oct 27, 2019

LewisGaul mannequin commented Oct 29, 2019

jacksonriley mannequin commented Nov 4, 2019

serhiy-storchaka commented Nov 4, 2019

jacksonriley mannequin commented Nov 4, 2019

jacksonriley mannequin commented Nov 4, 2019

mrabarnett mannequin commented Nov 5, 2019

jacksonriley mannequin commented Nov 12, 2019