\0 in re.sub substitutes to space #61628

Closed
techtonik mannequin opened this issue Mar 15, 2013 · 18 comments
Labels
stdlib Python modules in the Lib dir topic-regex

Comments

@techtonik
Mannequin

techtonik mannequin commented Mar 15, 2013

BPO 17426
Nosy @gvanrossum, @amauryfa, @ezio-melotti

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2013-03-15.05:20:54.778>
created_at = <Date 2013-03-15.04:39:02.762>
labels = ['expert-regex', 'library']
title = '\\0 in re.sub substitutes to space'
updated_at = <Date 2013-03-15.17:30:06.224>
user = 'https://bugs.python.org/techtonik'

bugs.python.org fields:

activity = <Date 2013-03-15.17:30:06.224>
actor = 'amaury.forgeotdarc'
assignee = 'none'
closed = True
closed_date = <Date 2013-03-15.05:20:54.778>
closer = 'gvanrossum'
components = ['Library (Lib)', 'Regular Expressions']
creation = <Date 2013-03-15.04:39:02.762>
creator = 'techtonik'
dependencies = []
files = []
hgrepos = []
issue_num = 17426
keywords = []
message_count = 18.0
messages = ['184210', '184212', '184213', '184214', '184217', '184218', '184223', '184225', '184228', '184229', '184235', '184236', '184238', '184239', '184240', '184242', '184243', '184244']
nosy_count = 5.0
nosy_names = ['gvanrossum', 'amaury.forgeotdarc', 'techtonik', 'ezio.melotti', 'mrabarnett']
pr_nums = []
priority = 'normal'
resolution = 'rejected'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue17426'
versions = ['Python 2.7', 'Python 3.2', 'Python 3.3', 'Python 3.4']

@techtonik
Mannequin Author

techtonik mannequin commented Mar 15, 2013

According to docs, group 0 is equivalent to the whole match, which is not true for Python.

import re
print( re.sub('aaa', r'__\0__', 'argaaagra') )

arg__ __gra

import re
print( re.sub('(aaa)', r'__\1__', 'argaaagra') )

arg__aaa__gra

See also:
http://www.php.net/manual/en/function.preg-replace.php
http://www.regular-expressions.info/ruby.html

@techtonik techtonik mannequin added the stdlib Python modules in the Lib dir label Mar 15, 2013
@gvanrossum
Member

It's not a space, it's a null byte.

Would you mind pointing out exactly where the Python docs state that \0 in re.sub() refers to the whole group? (IIRC it should only say that group 0 refers to the whole string in the argument to the .group() method on a match object as returned by re.match() or re.search().)

@ezio-melotti
Member

The space you see is the character \x00:
>>> re.sub('a+', r'__\0__', 'bbaaabb')
'bb__\x00__bb'

The re documentation says:
"""
\number
Matches the contents of the group of the same number. Groups are numbered starting from 1.
"""
so the re module is behaving as documented (i.e. \0 can't be used to indicate the whole match).

I agree that this is somewhat inconsistent with the behavior of .group(0) and with other languages. However, adding support for \0 would probably be backward incompatible, and as you already mentioned in your message, there's a simple workaround that can be used instead.

Matthew, does regex.py support \0?
Do you know if there's a reason why this is not supported?

@gvanrossum
Member

The doc Ezio quotes for \number is describing the regex syntax, not the substitution string syntax. Unfortunately this syntax is documented somewhat less formally than the regex syntax. Fortunately, it does mention explicitly that \g<0> substitutes the entire string, and that does work:

>>> re.sub(r'xxx', r'(\g<0>)', 'abcxxxdef')
'abc(xxx)def'
>>> 

For backward compatibility reasons I don't think we can change this, and I don't see a need either, given that \g<0> works. Regex syntax in Python is what it is -- other languages can have only limited influence. (We once started out with an approximation of what Perl offered at the time, knowing that we would eventually get out of sync with Perl, and we were okay with that.)
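
A minimal sketch of the workarounds mentioned in this thread, reusing the reporter's sample string. The variable names are illustrative only; the three calls are meant to show that `\g<0>`, a callable replacement using `m.group(0)`, and an explicit capturing group all reach the whole match.

```python
import re

text = "argaaagra"

# \g<0> in the replacement template refers to the whole match.
whole_match = re.sub(r"aaa", r"__\g<0>__", text)

# A callable replacement receives the match object, so group(0) works too.
via_callable = re.sub(r"aaa", lambda m: "__" + m.group(0) + "__", text)

# Wrapping the pattern in a group makes \1 equivalent to the whole match.
via_group = re.sub(r"(aaa)", r"__\1__", text)

print(whole_match)   # arg__aaa__gra
print(via_callable)  # arg__aaa__gra
print(via_group)     # arg__aaa__gra
```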

@techtonik
Mannequin Author

techtonik mannequin commented Mar 15, 2013

You're right - groups are defined here: http://docs.python.org/2/library/re.html#re.MatchObject.group

The reason to fix this is to gain internal language consistency and external consistency with other major implementations, to reduce the docs and the number of exceptions to remember, and thus to make this part intuitive.

The external inconsistency is that other languages use \0 and don't make a distinction between "match group" and "substitution group". I wonder if there are any other differences that justify the presence of this distinction? Internal inconsistency is in substitution groups notation:

\0 is \x0 but is not \g<0>
\1 is not \x1 but is \g<1>

Let me also stress that re is a module - not a feature of the Python language - that provides the ability to work with "regular expressions". Evolution led to the "best practices" that became an unwritten standard in different implementations - like using \x for backreferences in replacements. If some library invents its own standards that add nothing for the user other than "another thing to remember" [1] then the user is automatically granted the right to wear the sign "this library suxx" on her t-shirt.

I'd classify a 'language wart' as an inconsistency in expected behavior, independent of the language, where a technical fix is possible, but can not be fixed due to backward compatibility concerns. Language independent, because single language is by definition "works as documented" and has different "features in implementation details".

  1. http://php.net/manual/en/types.comparisons.php

@techtonik
Mannequin Author

techtonik mannequin commented Mar 15, 2013

Am I right that \0 is not supported just because nobody thought about supporting it?

@ezio-melotti
Member

Perl uses $& for the whole match rather than $0. That would explain why \0 is not supported. For .group() it probably made sense to access the whole match using 0 rather than passing something else, and that was likely reflected in the \g<...> form, but not in the \X form.

@techtonik
Mannequin Author

techtonik mannequin commented Mar 15, 2013

Perl syntax supported $0 according to this doc http://turtle.ee.ncku.edu.tw/docs/perl/manual/pod/perlre.html but it was removed for an unknown reason. Using the fact that support was removed without knowing the true reason is a "cargo cult argument", which I hope is not acceptable for Python development.

Among the possible reasons could be the binding of $0 to the __file__ analogue according to this doc - http://www.cs.cmu.edu/afs/cs/usr/rgs/mosaic/pl-predef.html

@mrabarnett
Mannequin

mrabarnett mannequin commented Mar 15, 2013

The regex behaves the same as re.

The reason it isn't supported is that \0 starts an octal escape sequence.
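
As a rough illustration of the octal-escape point, the sketch below runs a few replacement templates through `re.sub`. The sample values are chosen for illustration: a leading `\0` followed by up to two more octal digits is read as a character code, not as a group reference.

```python
import re

# \0 alone is the octal escape for character code 0 (NUL).
nul = re.sub("a", r"\0", "a")

# \012 is octal 12, i.e. character code 10 (newline).
newline = re.sub("a", r"\012", "a")

# \060 is octal 60, i.e. character code 48 (the digit "0").
zero = re.sub("a", r"\060", "a")

# \g<1>0 is the unambiguous way to write "group 1, then a literal 0";
# \10 would be read as a reference to group 10.
group_then_digit = re.sub("(a)", r"\g<1>0", "a")

print(repr(nul), repr(newline), repr(zero), repr(group_then_digit))
# '\x00' '\n' '0' 'a0'
```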

@gvanrossum
Member

Anatoly, your argument for consistency with other languages is ridiculous.

@techtonik
Mannequin Author

techtonik mannequin commented Mar 15, 2013

Matthew, finally the right answer. Thanks!

Looking further, there is a bug in processing backslashes in raw literal replacement strings. re.sub ignores raw strings as replacements. This can be even more confusing for people who are looking for a more advanced equivalent of string replace().

  patt = "aaa"
  repl = r"zed \0 org"

  print(" aaa ".replace(patt, repl))

  import re
  print(re.sub(patt, repl, " aaa "))

This gives:

zed \0 org
zed org

With repl = "zed \0 org", the output matches:

zed org
zed org

@gvanrossum
Member

Anatoly, your question belongs on python-list or stack overflow, not in the tracker.

--Guido van Rossum (sent from Android phone)

@techtonik
Mannequin Author

techtonik mannequin commented Mar 15, 2013

I thought that trackers are used to track the sources of the bugs. Aren't they?

@techtonik
Mannequin Author

techtonik mannequin commented Mar 15, 2013

Users report the effect. Then research is done to find the source. Then a decision is made about the right cause of the bug, and then about whether a fix is possible.

The bug is closed, but that doesn't mean we cannot dedicate some time to researching the cause. This research can be used to develop other languages and to explain why this feature works the way it does. If people are not interested, they can opt out.

@amauryfa
Member

Anatoly, your last question about re.sub is covered by the documentation:
re.sub will process the replacement string, and interpret the sequence \0 as the NUL character. So you get the NUL character in the returned string.

This is unrelated to raw literal strings.

And yes, sometimes you need 4 backslashes to get one in the output:

>>> print(re.sub("b", "\\\\", "abc"))
a\c
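
The two layers involved can be sketched as follows. This is an illustrative example, not part of the original report: the variable names are made up, and the last two calls show two common ways to get a replacement inserted verbatim (doubling its backslashes, or passing a callable, whose return value is not escape-processed).

```python
import re

# Layer 1: the Python string literal. Both spellings yield the same two
# characters (backslash, backslash).
repl_plain = "\\\\"
repl_raw = r"\\"
print(repl_plain == repl_raw, len(repl_raw))  # True 2

# Layer 2: re.sub reads the two-character template as one literal backslash.
one_backslash = re.sub("b", repl_raw, "abc")
print(one_backslash)  # a\c

# To insert a replacement verbatim, either double its backslashes first...
literal = r"zed \0 org"
escaped = re.sub("aaa", literal.replace("\\", "\\\\"), " aaa ")

# ...or pass a callable, whose return value is used without escape processing.
via_callable = re.sub("aaa", lambda m: literal, " aaa ")

print(escaped == via_callable == " aaa ".replace("aaa", literal))  # True
```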

@techtonik
Mannequin Author

techtonik mannequin commented Mar 15, 2013

Amaury, the documentation could make it clearer that it is a double replacement. Of course I paid attention to the repeated instructions about string substitution, but I thought that it was just a reminder, not an extra processing layer on top of the standard string processing logic. Currently it reads like:

...if it is a string, any backslash escapes in it are processed. That is, \n is...

The correct text would be like:

...if it is a string, any backslash escapes in it are processed in addition to standard string escapes. That is, \n is... ... Note that re.sub backslash processing for the replacement string occurs even with raw strings.

@techtonik
Mannequin Author

techtonik mannequin commented Mar 15, 2013

FWIW, I reimplemented substitution logic in my wikify [1] engine some time ago. I was kind of disappointed that I had to reinvent the wheel, but now I see that this was for the best. Thanks to people in this report I now understand the whole thing much better, and this will definitely make wikify more useful and easier to use.

  1. https://bitbucket.org/techtonik/wikify

@amauryfa
Member

It's not a double replacement: chr(92)+chr(0) is processed only once.
And the second paragraph of the re documentation already contains such a warning.
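
A small sketch of why each replacement string is processed only once, using the strings from the earlier comment. The variable names are illustrative. In the non-raw literal, Python itself has already produced the NUL character before re.sub sees it; in the raw literal, the backslash and the 0 reach re.sub, which turns them into NUL.

```python
import re

raw_repl = r"zed \0 org"   # backslash and "0" are two separate characters
plain_repl = "zed \0 org"  # the literal already contains a single NUL

print(len(raw_repl), len(plain_repl))  # 10 9

# str.replace does no escape processing, so the backslash survives.
print(repr(" aaa ".replace("aaa", raw_repl)))  # ' zed \\0 org '

# re.sub processes the template once: chr(92) + "0" becomes chr(0).
from_raw = re.sub("aaa", raw_repl, " aaa ")

# Here there is no backslash left for re.sub to process; NUL passes through.
from_plain = re.sub("aaa", plain_repl, " aaa ")

print(repr(from_raw))          # ' zed \x00 org '
print(from_raw == from_plain)  # True
```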

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022