Bug in re.sub() #46101

Closed
jmravon mannequin opened this issue Jan 8, 2008 · 14 comments
Labels
topic-regex type-bug An unexpected behavior, bug, or error

Comments

@jmravon
Mannequin

jmravon mannequin commented Jan 8, 2008

BPO 1761
Nosy @gvanrossum, @birkenfeld, @facundobatista, @amauryfa

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2008-01-10.22:01:57.278>
created_at = <Date 2008-01-08.09:22:00.732>
labels = ['expert-regex', 'type-bug']
title = 'Bug in re.sub()'
updated_at = <Date 2008-01-10.22:01:57.277>
user = 'https://bugs.python.org/jmravon'

bugs.python.org fields:

activity = <Date 2008-01-10.22:01:57.277>
actor = 'amaury.forgeotdarc'
assignee = 'effbot'
closed = True
closed_date = <Date 2008-01-10.22:01:57.278>
closer = 'amaury.forgeotdarc'
components = ['Regular Expressions']
creation = <Date 2008-01-08.09:22:00.732>
creator = 'jmravon'
dependencies = []
files = []
hgrepos = []
issue_num = 1761
keywords = []
message_count = 14.0
messages = ['59526', '59528', '59532', '59533', '59534', '59535', '59537', '59540', '59564', '59566', '59597', '59606', '59607', '59681']
nosy_count = 6.0
nosy_names = ['gvanrossum', 'effbot', 'georg.brandl', 'facundobatista', 'amaury.forgeotdarc', 'jmravon']
pr_nums = []
priority = 'normal'
resolution = 'wont fix'
stage = None
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue1761'
versions = ['Python 2.5']

@jmravon
Mannequin Author

jmravon mannequin commented Jan 8, 2008

Here is my source:
import re

def truc():
    line = ' hi \n'
    line1 = re.sub('$', 'hello', line)
    line2 = re.sub('$', 'you', line1)
    print(line2)

Here is what I get:

>>> trace.truc()
 hi hello
helloyou
>>>

@jmravon jmravon mannequin added topic-regex type-bug An unexpected behavior, bug, or error labels Jan 8, 2008
@amauryfa
Member

amauryfa commented Jan 8, 2008

In other words, if I understand correctly:
>>> re.sub('$', '#', 'a\nb\nc')
'a\nb\nc#'
>>> re.sub('$', '#', 'a\nb\n')
'a\nb#\n#'

The first sample is correct, but the second one finds two matches, even
without the re.MULTILINE option.

Is this normal? The docs say:

'$' Matches the end of the string or just before the newline at
the end of the string [...]
It seems that it matches BOTH the end of the string AND just before the
newline at the end of the string.

@birkenfeld
Member

Fredrik?

@effbot
Mannequin

effbot mannequin commented Jan 8, 2008

re.findall has the same behaviour. Without looking at the code, I'm not
sure if this is a bug in the code or in the documentation, really.
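
A minimal sketch of the re.findall observation, using the same input as the sample above. The variable names are illustrative only; re.finditer is used here just to show where each empty match sits.

```python
import re

text = 'a\nb\n'

# re.findall returns one (empty) string per match of '$'.
matches = re.findall('$', text)

# re.finditer exposes the position of each match:
# index 3 is just before the final newline, index 4 is the end of the string.
positions = [m.start() for m in re.finditer('$', text)]

print(matches)    # ['', '']
print(positions)  # [3, 4]
```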

@facundobatista
Member

As re provides regular expression matching operations similar to those
found in Perl, I tried there to see what happens:

"""
use Data::Dumper;

$a = 'a\nb\nc';
$a =~ s/$/#/;
print Dumper($a);

$a = 'a\nb\n';
$a =~ s/$/#/;
print Dumper($a);
"""

$ perl pru_sub.pl
$VAR1 = 'a\\nb\\nc#';
$VAR1 = 'a\\nb\\n#';

@amauryfa
Member

amauryfa commented Jan 8, 2008

Careful, Perl strings must be double-quoted for \n to be understood as
the newline character:

"""
use Data::Dumper;

$a = "a\nb\nc";
$a =~ s/$/#/;
print Dumper($a);

$a = "a\nb\n";
$a =~ s/$/#/;
print Dumper($a);
"""

And the output is:

$VAR1 = 'a
b
c#';
$VAR1 = 'a
b#
';

Which is definitely different from Python's output.

@birkenfeld
Member

At least, the docs for re.M are consistent with the current behavior.

@gvanrossum
Member

So if the input ends in '\n', '$' matches both before and after that
character, and two substitutions are made (even though multiline is not
set). Seems a bug to me.

@amauryfa
Member

amauryfa commented Jan 8, 2008

In the previous samples we forgot the /g option needed to match ALL
occurrences of the pattern:

"""
use Data::Dumper;

$a = "a\nb\nc";
$a =~ s/$/#/g;
print Dumper($a);

$a = "a\nb\n";
$a =~ s/$/#/g;
print Dumper($a);
"""

Which now gives the same output as Python:

$VAR1 = 'a
b
c#';
$VAR1 = 'a
b#
#';

Perl is too difficult for us ;-)

What shall we do?

  • mark the issue as invalid
  • diverge from Perl regular expressions
  • file a bug in the PCRE issue tracker

And in every case: add these samples to the test suite.

@gvanrossum
Member

Then I'd say, this is the correct semantics, for better or for worse;
add an example to the docs, and a test to the test suite, and close this
as wont fix.
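
As a rough idea of what such a test could look like, here is a sketch built only from the samples posted in this thread. The class and method names are hypothetical; this is not the test that was committed to CPython.

```python
import re
import unittest


class DollarAnchorTests(unittest.TestCase):
    """Hypothetical tests; names are illustrative, not those used in CPython."""

    def test_no_trailing_newline(self):
        # '$' matches only at the end of the string.
        self.assertEqual(re.sub('$', '#', 'a\nb\nc'), 'a\nb\nc#')

    def test_trailing_newline(self):
        # '$' matches before the final newline and at the end of the string.
        self.assertEqual(re.sub('$', '#', 'a\nb\n'), 'a\nb#\n#')

    def test_reporter_example(self):
        # The two-step substitution from the original report.
        line1 = re.sub('$', 'hello', ' hi \n')
        self.assertEqual(line1, ' hi hello\nhello')
        self.assertEqual(re.sub('$', 'you', line1), ' hi hello\nhelloyou')
```

Saved in a module, this can be run with `python -m unittest`.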

@effbot
Mannequin

effbot mannequin commented Jan 9, 2008

For the record, $ is defined to match "before a newline at the end of
the string, or at the end of the string" in normal mode, and "before any
newline, or at the end of the string" in multiline mode.
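
A small sketch comparing the two modes on the sample string from this thread. The variable names are illustrative; re.compile is used so the flag can be passed in the same way for both searching and substituting.

```python
import re

text = 'a\nb\n'

normal_re = re.compile('$')
multiline_re = re.compile('$', re.MULTILINE)

# Normal mode: before the newline at the end of the string, and at the end.
normal = [m.start() for m in normal_re.finditer(text)]

# Multiline mode: before any newline, and at the end of the string.
multiline = [m.start() for m in multiline_re.finditer(text)]

normal_sub = normal_re.sub('#', text)
multiline_sub = multiline_re.sub('#', text)

print(normal, repr(normal_sub))        # [3, 4] 'a\nb#\n#'
print(multiline, repr(multiline_sub))  # [1, 3, 4] 'a#\nb#\n#'
```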

(and I have a vague memory that the "before a newline" behaviour in
normal mode was added for Perl compatibility)

> It seems that it matches BOTH the end of the string AND just before
> the newline at the end of the string.

Of course it does: re.sub scans the string for matches from left to
right, and does the substitution everywhere the pattern matches, only
skipping over the matched parts. In other words, whether a pattern
matches N characters at position X has no influence on whether it
matches at position X+N.
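
The left-to-right scan described here can be sketched roughly as follows. The naive_sub helper is hypothetical and much simpler than the real implementation: it assumes a literal replacement string (no backreferences) and only approximates how empty matches next to non-empty ones are treated.

```python
import re


def naive_sub(pattern, repl, text):
    """Simplified model of re.sub: try the pattern at every position."""
    regex = re.compile(pattern)
    out = []
    pos = 0
    while pos <= len(text):
        m = regex.match(text, pos)
        if m:
            out.append(repl)
            if m.end() > pos:
                # Non-empty match: skip over the matched characters.
                pos = m.end()
                continue
        # No match, or an empty match: copy one character and move on.
        if pos < len(text):
            out.append(text[pos])
        pos += 1
    return ''.join(out)


print(repr(naive_sub('$', '#', 'a\nb\n')))   # 'a\nb#\n#'
print(repr(naive_sub('$', '#', 'a\nb\nc')))  # 'a\nb\nc#'
```

For 'a\nb\n' the pattern is tried at positions 0 through 4; it matches (with zero width) at 3 and again at 4, so two substitutions are made.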

@gvanrossum
Member

Which is why I like to use \Z to match *only* the end of the string.
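
A short comparison of the two anchors on the sample string from this thread (variable names are illustrative):

```python
import re

text = 'a\nb\n'

# '$' matches before the final newline and at the end: two substitutions.
with_dollar = re.sub(r'$', '#', text)

# '\Z' matches only at the very end of the string: one substitution.
with_z = re.sub(r'\Z', '#', text)

print(repr(with_dollar))  # 'a\nb#\n#'
print(repr(with_z))       # 'a\nb\n#'
```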

@amauryfa
Member

amauryfa commented Jan 9, 2008

Aha, I always thought that \Z was an alias for $.

@amauryfa
Member

This behaviour may be surprising, but it is consistent with Perl and the
PCRE library.
I added a sentence to the documentation, and specific tests.

Committed as r59896.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022