Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

2.3a1 computes lastindex incorrectly #37819

Closed
loewis mannequin opened this issue Jan 22, 2003 · 10 comments
Closed

2.3a1 computes lastindex incorrectly #37819

loewis mannequin opened this issue Jan 22, 2003 · 10 comments

Comments

@loewis
Copy link
Mannequin

loewis mannequin commented Jan 22, 2003

BPO 672491
Nosy @loewis

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2003-04-20.00:59:48.000>
created_at = <Date 2003-01-22.15:28:53.000>
labels = ['expert-regex']
title = '2.3a1 computes lastindex incorrectly'
updated_at = <Date 2003-04-20.00:59:48.000>
user = 'https://github.com/loewis'

bugs.python.org fields:

activity = <Date 2003-04-20.00:59:48.000>
actor = 'niemeyer'
assignee = 'niemeyer'
closed = True
closed_date = None
closer = None
components = ['Regular Expressions']
creation = <Date 2003-01-22.15:28:53.000>
creator = 'loewis'
dependencies = []
files = []
hgrepos = []
issue_num = 672491
keywords = []
message_count = 10.0
messages = ['14172', '14173', '14174', '14175', '14176', '14177', '14178', '14179', '14180', '14181']
nosy_count = 3.0
nosy_names = ['loewis', 'glchapman', 'niemeyer']
pr_nums = []
priority = 'high'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue672491'
versions = ['Python 2.3']

@loewis
Copy link
Mannequin Author

loewis mannequin commented Jan 22, 2003

In Python 2.[012], the code

import re
exp = re.compile("(?P<NCName>[a-zA-Z_](\w|[_.-])*)")
print exp.match("namespace").lastgroup

prints "NCName". In Python 2.3a1, it prints "None". The
problem is that last index is 2, instead of 1, as it
should be.

@loewis loewis mannequin closed this as completed Jan 22, 2003
@loewis loewis mannequin assigned niemeyer Jan 22, 2003
@loewis loewis mannequin added the topic-regex label Jan 22, 2003
@glchapman
Copy link
Mannequin

glchapman mannequin commented Feb 5, 2003

Logged In: YES
user_id=86307

I believe the discrepancy was deliberately introduced in revision 2.84 of _sre.c. I agree with you that lastindex should return the the index of the matching group with the rightmost closing parenthesis (perhaps some elaboration in the docs is also in order). If this is the correct interpretation, two places need to be patched: 1) the handling of SRE_OP_MARK needs to be reverted to the 2.22 code and 2) the code in the lastmark_restore function needs to be tweaked so that lastindex is not accidentally set to the last matched group entered.

Thinking further though, given a (contrived) pattern like this:

re.match('((x))y(?:(a)b|ac)', 'xyac')

what should lastindex be? I assume 1, given the definition above (lastindex = matching group with rightmost close parens). In 2.22 it is 3, since group 3 matched before the branch failed at the 'b'. In 2.3a1 it is 2, since lastindex is restored (after the branch fails) using the saved lastmark. Anyway, if it should be 1, then I think _sre.c will have to save lastindex as well as lastmark when processing the three opcodes which may end up calling lastmark_restore.

@loewis
Copy link
Mannequin Author

loewis mannequin commented Apr 19, 2003

Logged In: YES
user_id=21627

Assigning to Gustavo, since he wrote 2.84. Gustavo, can you
confirm that your change broke this feature? Can you also
agree that it should be fixed?

@niemeyer
Copy link
Mannequin

niemeyer mannequin commented Apr 19, 2003

Logged In: YES
user_id=7887

Martin, the lastgroup/lastindex handling was quite broken in
Python < 2.3. We can easily see this if testing your example
in the following way:

>>> import re
>>> exp = re.compile("(?P<NCName>[a-zA-Z_](\w|[_.-])*)")
>>> match = exp.match("namespace")
>>> match.groups()
('namespace', 'e')
>>> match.groupdict()
{'NCName': 'namespace'}

This has the same result in any python you execute. This
also demonstrates that the group numbering has always been
assigned by the opening parenthesis.

About the None result, that's also correct. In the example,
we notice that the second unnamed group is correctly
matching 'e'. Accordingly to the sre module documentation,
that's how lastgroups should behave:

lastgroup
The name of the last matched capturing group, or None if
the group didn't have a name, or if no group was matched at all.

In the case above, the group didn't have a name. If we check
lastindex, we'll see it's correctly set to "2".

Greg, your example is correctly showing one of the bugs in
lastgroup/index handling in Python < 2.3. OTOH, the current
result of 2 is right, given the "count on open" rule, which
was always used. Here's an example in 2.3:

>>> re.match('((x))y(?:(a)b|ac)', 'xyac').groups()
('x', 'x', None)
>>> re.match('((x))y(?:(a)b|ac)', 'xyac').lastindex
2

(notice that groups always start in 1, as group 0 is the
"whole match" group)

Martin, if you agree, please close the bug. If you have any
doubts, it'll be a pleasure to explain.

@loewis
Copy link
Mannequin Author

loewis mannequin commented Apr 19, 2003

Logged In: YES
user_id=21627

Gustavo, I agree that the numbering of groups is and should
be by opening parentheses, i.e. that group 1 is <NCName>,
and group 2 is unnamed.

I also agree that *if* lastindex is 2, lastgroup should be None.

I still think this value is incorrect, though. It is
illogical, unhelpful, and incompatible with earlier
versions. lastindex should be 1, and lastgroup should be
"NCName".

It is illogical because group 1 ends *after* group 2 ends,
as the closing parenthesis of group 1 is after the closing
parenthesis of group 2.

It is unhelpful because one of the primary purposes of
lastgroup is to find efficiently the entire match if you
have a list of top-level alternatives, such as you do in a
scanner. With Python <2.3, you can put named alternatives
into a regex, and use lastindex to find out the alternative.
Indeed, this is what sre.Scanner does; I believe you have
broken that class as well.

It is incompatible because earlier Python versions behaved
differently. As a result of this change, PyXML does not work
with Python 2.3.

@niemeyer
Copy link
Mannequin

niemeyer mannequin commented Apr 19, 2003

Logged In: YES
user_id=7887

Why do you think lastindex is incorrect? Isn't 2 the lastindex?

>>> import re
>>> exp = re.compile("(?P<NCName>[a-zA-Z_](\w|[_.-])*)")
>>> match = exp.match("namespace")
>>> match.group(0)
'namespace'
>>> match.group(1)
'namespace'
>>> match.group(2)
'e'

It works like this in all Python versions. Also, if you
agree that group 2 is unnamed, shouldn't lastindex be 2? It
actually *is* the lastindex, as you see above.

It is incompatible with old versions because old versions
were broken, and didn't follow what was documented. It was
indeed a side effect of a bug. For example, is it logical
that if we remove the second group, lastindex is still 1 (in
Python 2.2)?

>>> exp = re.compile("(?P<NCName>[a-zA-Z_])")
>>> print exp.match("namespace").lastindex
1

It is illogical because group 1 ends *after* group 2 ends,
as the closing parenthesis of group 1 is after the closing
parenthesis of group 2.

How this changes anything? As we agreed, groups are numbered
by the opening parenthesis.

Hummm.. perhaps you think that the old behavior was to show
what group had the last closing parenthesis? No.. that
wasn't the case either. Look at this example, in Python 2.2:

>>> re.compile("(a(b)?)((c)d)?").match("abce").lastindex
4
>>> re.compile("(a(b)?)((c)d)?").match("abce").groups()
('ab', 'b', None, None)

In Python 2.3:

>>> re.compile("(a(b)?)((c)d)?").match("abce").lastindex
2
>>> re.compile("(a(b)?)((c)d)?").match("abce").groups()
('ab', 'b', None, None)

It is unhelpful because one of the primary purposes of
lastgroup is to find efficiently the entire match if you
have a list of top-level alternatives

I don't understand this. If you want the entire match, just
use group 0. Did I misunderstand it? Can you show me some
example?

Can you show me how I've broken the Scanner?

If PyXML is broken, it trusted in an undocumented and broken
feature. Should we maintain it broken to avoid breaking PyXML?

@glchapman
Copy link
Mannequin

glchapman mannequin commented Apr 19, 2003

Logged In: YES
user_id=86307

I'll just add my two cents here. Gustavo, I think given your
defintion of lastindex, there's no reason for the state to track a
seperate lastindex field; the correct value could easily be
determined by examining the mark array after matching is
completed (in the process of initializing the Match object). On
the other hand, I don't think there is an obvious way of
determining lastindex given the 2.22 definition without either
examining the compiled pattern or tracking it as the pattern
executes.

I also agree that it is an incompatible change. Although the
implementation was partly broken, I think the clear intent of the
2.22 code was to report the matching group with the last closing
parens.

FYI, I posted a patch here to revert back to the previous behavior:

http://www.python.org/sf/712900

You two may want to look at it to see if it looks like it's on the right
track. Here is what a python with this patch reports on this
example from Gustavo:

>>> re.compile("(a(b)?)((c)d)?").match("abce").lastindex
1

As you can see, it reports the correct value for lastindex given
the 2.22 definition.

@niemeyer
Copy link
Mannequin

niemeyer mannequin commented Apr 19, 2003

Logged In: YES
user_id=7887

As you can see, it reports the correct value for lastindex
given
the 2.22 definition.

The problem here is defining what's the 2.2 definition.

I've checked with other examples, and it looks like your
definition is correct in all cases where the shown bug is
not acting.

If that's the case, the documentation is *very* misleading,
and should be fixed.

@niemeyer
Copy link
Mannequin

niemeyer mannequin commented Apr 19, 2003

Logged In: YES
user_id=7887

Concluding:

  • I agree that my implementation is uncompliant with the
    previous behavior, and I think it should be modified to
    behave like before.

  • There was a bug which was fixed by my implementation, and
    should be fixed when porting to the old behavior.

  • The documentation is very misleading, and seems closer to
    my implementation than the old behavior. This should be fixed.

@niemeyer
Copy link
Mannequin

niemeyer mannequin commented Apr 20, 2003

Logged In: YES
user_id=7887

I backed out the changes made in 2.84 which changed the
behavior, and maintained the changes which fixed the bug.
Applied as:

Modules/_sre.c: 2.90

I'm also including some examples in the lastindex
documentation, explaining how it works.

Sorry about the trouble this may have caused.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

0 participants