re.findall: '\Z' must consume end of string if it matched #87880

alegrigoriev · 2021-04-03T13:03:13Z

BPO	43714
Nosy	@terryjreedy, @ezio-melotti, @serhiy-storchaka

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2021-04-03.13:03:13.339>
labels = ['type-bug', 'library', '3.9', '3.10']
title = "re.findall: '\\Z' must consume end of string if it matched"
updated_at = <Date 2021-04-10.07:33:07.809>
user = 'https://bugs.python.org/alegrigoriev'

bugs.python.org fields:

activity = <Date 2021-04-10.07:33:07.809>
actor = 'serhiy.storchaka'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2021-04-03.13:03:13.339>
creator = 'alegrigoriev'
dependencies = []
files = []
hgrepos = []
issue_num = 43714
keywords = []
message_count = 5.0
messages = ['390124', '390127', '390169', '390686', '390699']
nosy_count = 5.0
nosy_names = ['terry.reedy', 'ezio.melotti', 'mrabarnett', 'serhiy.storchaka', 'alegrigoriev']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue43714'
versions = ['Python 3.9', 'Python 3.10']

alegrigoriev · 2021-04-03T13:03:13Z

If '\Z' matches as part of a pattern in re.sub() or re.split(), it should consume the end of string, and then '\Z' alone should not match the end of string again.

Current behavior:

Python 3.9.2 (tags/v3.9.2:1a79785, Feb 19 2021, 13:44:55) [MSC v.1928 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> print(re.split(r'/?\Z', 'a/b/c/d/'))
['a/b/c/d', '', '']
>>> print(re.sub(r'/?\Z', '-', 'a/b/c/d/'))
a/b/c/d--

Wanted behavior:

>>> print(re.split(r'/?\Z', 'a/b/c/d/'))
['a/b/c/d', '']
>>> print(re.sub(r'/?\Z', '-', 'a/b/c/d/'))
a/b/c/d-

mrabarnett · 2021-04-03T16:38:25Z

Do any other regex implementations behave the way you want?

In my experience, there's no single "correct" way for a regex to behave; different implementations might give slightly different results, so if the most common ones behave a certain way, then that's the de facto standard, even if it's not what you'd expect or want.

alegrigoriev · 2021-04-04T02:35:59Z

For example, sed:

$ sed --version
sed (GNU sed) 4.8
Copyright (C) 2020 Free Software Foundation, Inc.

$ sed -e 's/-\?$/x/g' <<<'a-b-'
a-bx

Perl:

$ perl --version

This is perl 5, version 32, subversion 0 (v5.32.0) built for x86_64-msys-thread-multi

Copyright 1987-2020, Larry Wall
$ perl -e 'my $x="a-b-"; $x =~ s/-?$/x/g; print $x'
a-bxx

https://www.freeformatter.com/java-regex-tester.html

Java Regular Expression :
-?$
Entry to test against :
a-b-c-
String replacement result:
a-b-cx

During replacement or split, a match consumes the matched character. It's easy to forget that "end of line" should be considered a (pseudo)character and must also be consumed if it matched.

terryjreedy · 2021-04-10T02:54:20Z

Python regexes match slices of a Python string s. The latter include the len(s)+1 empty slices of s. An re Match gives both the slice itself as match attribute and its slice coordinates (span) in the searched string.
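To make the slice view concrete, here is a minimal sketch (the variable names are illustrative and not from the thread) that lists the len(s)+1 empty slices of a short string and checks that an empty pattern matches at each of them:

```python
import re

s = "abc"
n = len(s)

# The len(s) + 1 empty slices s[i:i], one at each position 0..n.
empty_slices = [(i, i) for i in range(n + 1)]

# An empty pattern matches at each of those positions.
empty_matches = [m.span() for m in re.finditer(r"", s)]

print(empty_slices)
print(empty_matches)
```

Both lists should hold the same four spans, the last one being the final empty slice s[n:n].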

https://docs.python.org/3/library/re.html says "\Z Matches only at the end of the string." There are two possible interpretations:

'\Z', by itself, matches the final empty slice s[n:n] of search string s, where n = len(s).
'\Z' modifies the (preceding) re to match "only at the end of the string", where the preceding re can be empty.

For a single left to right search, I believe there is no difference. (I use '$' instead of '\Z', which I believe is the same without the re.MULTILINE flag.)

>>> re.search(r'', 'a')
<re.Match object; span=(0, 0), match=''>
>>> re.search(r'$', 'a')
<re.Match object; span=(1, 1), match=''>

Either interpretation explains and is consistent with the second result.

The issue is with functions that look for multiple sequential matches. re.sub and re.split are based on re.finditer, whose matches are listed by re.findall. The latter two return all non-overlapping matches (slices), including empty slices. Hence, with a regex that matches a final '/' or '',

>>> re.findall(r'/?$', '/')
['/', '']

I believe Alexander proposes that the 2nd member should not be there, but it is a match starting after '/' and does not overlap.
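The non-overlap can be checked by looking at the spans rather than only the matched text. The following is a small sketch, assuming Python 3.7 or later (the behavior discussed in this issue); the helper name spans_and_text is made up for illustration:

```python
import re

def spans_and_text(pattern, string):
    # Hypothetical helper: pair each match's span with the matched slice.
    return [(m.span(), m.group()) for m in re.finditer(pattern, string)]

short = spans_and_text(r"/?$", "/")
reported = spans_and_text(r"/?\Z", "a/b/c/d/")

print(short)     # the final '/' and then the empty slice after it
print(reported)  # the same two matches for the string from the report
```

In both cases the second match starts exactly where the first one ends, so the two slices do not overlap.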

The word 'consume' only appears in the current doc once -- "(?=...) Matches if ... matches next, but doesn’t consume any of the string." If we consider 'end of string' to be the final null slice, it does seem to be 'consumed' in that the final empty slice is only matched and added to the list once.

I think that this should be closed as 'not a bug'.

As for the desired results for the examples, they involve manipulating the result of deleting a final '/' if there is one (and re is not even needed for that).

>>> [re.sub('/$', '', 'a/b/c/d/'), '']
['a/b/c/d', '']
>>> re.sub('/$', '', 'a/b/c/d/') + '-'
'a/b/c/d-'
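As a sketch of two ways to get the requested output with current behavior (neither is proposed in the thread as a fix; strip_final_slash is a hypothetical helper): plain string methods avoid re entirely, and limiting re.sub or re.split to a single match stops before the second, empty match at the end of the string.

```python
import re

def strip_final_slash(path):
    # Hypothetical helper: drop one trailing '/' without using re.
    return path[:-1] if path.endswith("/") else path

s = "a/b/c/d/"

# Without re: remove the final '/', then append the replacement.
no_re = strip_final_slash(s) + "-"

# With re: allow only one substitution or split, so the empty
# match after the consumed '/' is never attempted.
once_sub = re.sub(r"/?\Z", "-", s, count=1)
once_split = re.split(r"/?\Z", s, maxsplit=1)

# A string without a trailing '/' still gets exactly one replacement.
no_slash = re.sub(r"/?\Z", "-", "a/b", count=1)

print(no_re, once_sub, once_split, no_slash)
```

Passing count and maxsplit by keyword keeps the calls unambiguous; the limit of one only suits patterns anchored at the end of the string, where at most one replacement is wanted.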

serhiy-storchaka · 2021-04-10T07:33:08Z

I concur with Matthew. I tested several implementations in different programming languages. Perl, PHP and Java behave the same way as Python. Sed, awk and Go behave the other way. We can argue that one way or the other is "better", but that looks subjective, and in any case such a change would be breaking. It is better to keep the current behavior until we have very good reasons to break things.

Old versions of Python had different behavior, but the implementation contained a bug which caused some characters to be skipped (see bpo-25054). It also prevented support of zero-width patterns in re.split(), and the behavior was inconsistent between different re functions. The simplest way of fixing that bug led to behavior consistent with Perl and Java.

serhiy-storchaka · 2022-04-22T17:42:32Z

Closed as "not a bug".

alegrigoriev mannequin added the 3.9 (only security fixes), stdlib (Python modules in the Lib dir), and type-bug (An unexpected behavior, bug, or error) labels Apr 3, 2021

terryjreedy added the 3.10 (only security fixes) label Apr 10, 2021

terryjreedy changed the title ~~re.split(), re.sub(): '\Z' must consume end of string if it matched~~ re.findall: '\Z' must consume end of string if it matched Apr 10, 2021

ezio-melotti transferred this issue from another repository Apr 10, 2022

AlexWaygood added the topic-regex label Apr 19, 2022

serhiy-storchaka closed this as completed Apr 22, 2022
