\0 in re.sub substitutes to space #61628

techtonik · 2013-03-15T04:39:03Z

BPO	17426
Nosy	@gvanrossum, @amauryfa, @ezio-melotti

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2013-03-15.05:20:54.778>
created_at = <Date 2013-03-15.04:39:02.762>
labels = ['expert-regex', 'library']
title = '\\0 in re.sub substitutes to space'
updated_at = <Date 2013-03-15.17:30:06.224>
user = 'https://bugs.python.org/techtonik'

bugs.python.org fields:

activity = <Date 2013-03-15.17:30:06.224>
actor = 'amaury.forgeotdarc'
assignee = 'none'
closed = True
closed_date = <Date 2013-03-15.05:20:54.778>
closer = 'gvanrossum'
components = ['Library (Lib)', 'Regular Expressions']
creation = <Date 2013-03-15.04:39:02.762>
creator = 'techtonik'
dependencies = []
files = []
hgrepos = []
issue_num = 17426
keywords = []
message_count = 18.0
messages = ['184210', '184212', '184213', '184214', '184217', '184218', '184223', '184225', '184228', '184229', '184235', '184236', '184238', '184239', '184240', '184242', '184243', '184244']
nosy_count = 5.0
nosy_names = ['gvanrossum', 'amaury.forgeotdarc', 'techtonik', 'ezio.melotti', 'mrabarnett']
pr_nums = []
priority = 'normal'
resolution = 'rejected'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue17426'
versions = ['Python 2.7', 'Python 3.2', 'Python 3.3', 'Python 3.4']

techtonik · 2013-03-15T04:39:02Z

According to docs, group 0 is equivalent to the whole match, which is not true for Python.

import re
print( re.sub('aaa', r'__\0__', 'argaaagra') )

arg__ __gra

import re
print( re.sub('(aaa)', r'__\1__', 'argaaagra') )

arg__aaa__gra

See also:
http://www.php.net/manual/en/function.preg-replace.php
http://www.regular-expressions.info/ruby.html

gvanrossum · 2013-03-15T05:02:19Z

It's not a space, it's a null byte.

Would you mind pointing out exactly where the Python docs state that \0 in re.sub() refers to the whole group? (IIRC it should only say that group 0 refers to the whole string in the argument to the .group() method on a match object as returned by re.match() or re.search().)

ezio-melotti · 2013-03-15T05:05:00Z

The space you see is the character \x00:
>>> re.sub('a+', r'__\0__', 'bbaaabb')
'bb__\x00__bb'

The re documentation says:
"""
\number
Matches the contents of the group of the same number. Groups are numbered starting from 1.
"""
so the re module is behaving as documented (i.e. \0 can't be used to indicate the whole match).

I agree that this is somewhat inconsistent with the behavior of .group(0) and with other languages, however adding support for \0 would probably be backward incompatible, and as you already mentioned in your message there's a simple workaround that can be used instead.

Matthew, does regex.py support \0?
Do you know if there's a reason why this is not supported?

gvanrossum · 2013-03-15T05:20:54Z

The doc Ezio quotes for \number is describing the regex syntax, not the substitution string syntax. Unfortunately this syntax is documented somewhat less formally than the regex syntax. Fortunately, it does mention explicitly that \g<0> substitutes the entire string, and that does work:

>>> re.sub(r'xxx', r'(\g<0>)', 'abcxxxdef')
'abc(xxx)def'
>>>

For backward compatibility reasons I don't think we can change this, and I don't see a need either, given that \g<0> works. Regex syntax in Python is what it is -- other languages can have only limited influence. (We once started out with an approximation of what Perl offered at the time, knowing that we would eventually get out of sync with Perl, and we were okay with that.)
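
A minimal sketch of the workaround described above, assuming only the standard re module: both \g<0> in a replacement template and a callable that returns m.group(0) insert the whole match. The helper name wrap_match and its left/right parameters are hypothetical, introduced only for illustration.

```python
import re

def wrap_match(pattern, text, left="__", right="__"):
    """Wrap every match of `pattern` in `text` (hypothetical helper)."""
    # A callable replacement receives the match object; group(0) is the whole match.
    return re.sub(pattern, lambda m: left + m.group(0) + right, text)

# Template form: \g<0> refers to the entire match.
via_template = re.sub('aaa', r'__\g<0>__', 'argaaagra')
# Callable form: same result without any template escapes.
via_callable = wrap_match('aaa', 'argaaagra')

print(via_template)  # arg__aaa__gra
print(via_callable)  # arg__aaa__gra
```

The callable form may be easier to read when the replacement text itself contains backslashes, since its return value is not parsed as a template.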

techtonik · 2013-03-15T06:50:37Z

You're right - groups are defined here: http://docs.python.org/2/library/re.html#re.MatchObject.group

The reason to fix this is to gain internal language consistency and external consistency with other major implementations, to reduce the docs and the number of exceptions to remember, and thus to make this part intuitive.

The external inconsistency is that other languages use \0 and don't make a distinction between "match group" and "substitution group". I wonder if there are any other differences that justify the presence of this distinction? The internal inconsistency is in the substitution group notation:

\0 is \x0 but is not \g<0>
\1 is not \x1 but is \g<1>

Let me also stress that re is a module, not a feature of the Python language, that provides the ability to work with "regular expressions". Evolution led to "best practices" that became an unwritten standard across different implementations, like using \x for backreferences in replacements. If some library invents its own standards that add nothing for the user other than "another thing to remember" [1], then the user is automatically granted the right to wear the sign "this library suxx" on her t-shirt.

I'd classify a 'language wart' as an inconsistency in expected behavior, independent of the language, where a technical fix is possible but cannot be applied due to backward compatibility concerns. It has to be language independent, because a single language by definition "works as documented" and has different "features in implementation details".

[1] http://php.net/manual/en/types.comparisons.php

techtonik · 2013-03-15T06:51:54Z

Am I right that \0 is not supported just because nobody thought about supporting it?

ezio-melotti · 2013-03-15T07:43:20Z

Perl uses $& for the whole match rather than $0. That would explain why \0 is not supported. For .group() it probably made sense to access the whole match using 0 rather than passing something else, and that was likely reflected in the \g<...> form, but not in the \X form.

techtonik · 2013-03-15T09:04:28Z

The Perl syntax supported $0 according to this doc http://turtle.ee.ncku.edu.tw/docs/perl/manual/pod/perlre.html but it was removed for an unknown reason. Using the fact that support was removed, without knowing the true reason, is a "cargo cult argument", which I hope is not acceptable for Python development.

One possible reason is the binding of $0 to the __file__ analogue, according to this doc: http://www.cs.cmu.edu/afs/cs/usr/rgs/mosaic/pl-predef.html

mrabarnett · 2013-03-15T13:43:53Z

The regex module behaves the same as re.

The reason it isn't supported is that \0 starts an octal escape sequence.
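
The octal-escape behaviour can be checked with a short sketch, assuming CPython's standard re module. In a replacement template, \0 and three-digit octal sequences are read as character codes, while a single nonzero digit is a group reference.

```python
import re

# \0 in a replacement template is an octal escape for chr(0), not group 0.
nul_result = re.sub('aaa', r'__\0__', 'argaaagra')

# Three octal digits are also read as a character code: 0o101 == 65 == 'A'.
octal_result = re.sub('aaa', r'\101', 'argaaagra')

# A single nonzero digit is a group reference.
group_result = re.sub('(aaa)', r'__\1__', 'argaaagra')

print(repr(nul_result))    # 'arg__\x00__gra'
print(repr(octal_result))  # 'argAgra'
print(repr(group_result))  # 'arg__aaa__gra'
```

Printing with repr() makes the NUL character visible; a plain print() shows it as what looks like a space in many terminals.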

gvanrossum · 2013-03-15T14:17:41Z

Anatoly, your argument for consistency with other languages is ridiculous.

techtonik · 2013-03-15T16:28:41Z

Matthew, finally the right answer. Thanks!

Looking further, there is a bug in processing backslashes in raw literal replacement strings. re.sub ignores raw strings as replacements. This can be even more confusing for people who are looking for a more advanced equivalent of string replace().

  patt = "aaa"
  repl = r"zed \0 org"

  print(" aaa ".replace(patt, repl))

  import re
  print(re.sub(patt, repl, " aaa "))

This gives:

zed \0 org
zed org

With repl = "zed \0 org", the output matches:

zed org
zed org

gvanrossum · 2013-03-15T16:41:59Z

Anatoly, your question belongs on python-list or stack overflow, not in the tracker.

techtonik · 2013-03-15T17:09:04Z

I thought that trackers are used to track the sources of the bugs. Aren't they?

techtonik · 2013-03-15T17:12:58Z

Users list the effect. Then research is done to find the source. Then a decision is made about the right cause for the source of the bug, and then a decision about whether a fix is possible.

The bug is closed, but that doesn't mean we cannot dedicate some time to researching the cause. This research can be used to develop other languages and to explain why this feature works the way it does. If people are not interested, they can opt out.

amauryfa · 2013-03-15T17:17:29Z

Anatoly, your last question about re.sub is covered by the documentation:
re.sub will process the replacement string, and interpret the sequence \0 as the NUL character. So you get the NUL character in the returned string.

This is unrelated to raw literal strings.

And yes, sometimes you need 4 backslashes to get one in the output:

>>> print(re.sub("b", "\\\\", "abc"))
a\c
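
When the replacement should be inserted verbatim, as str.replace() does, there are two common ways to avoid template processing. The sketch below assumes the standard re module and reuses the patt and repl values from the earlier comment: either pass a callable, or double every backslash before handing the string to re.sub.

```python
import re

patt = "aaa"
repl = r"zed \0 org"   # a backslash followed by '0'

# str.replace inserts repl verbatim.
plain = " aaa ".replace(patt, repl)

# re.sub treats repl as a template, so \0 becomes chr(0).
templated = re.sub(patt, repl, " aaa ")

# A callable's return value is inserted as-is, with no escape processing.
literal_callable = re.sub(patt, lambda m: repl, " aaa ")

# Doubling each backslash makes the template produce the original text.
literal_escaped = re.sub(patt, repl.replace("\\", "\\\\"), " aaa ")

print(repr(plain))             # ' zed \\0 org '
print(repr(templated))         # ' zed \x00 org '
print(repr(literal_callable))  # ' zed \\0 org '
print(repr(literal_escaped))   # ' zed \\0 org '
```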

techtonik · 2013-03-15T17:24:33Z

Amaury, the documentation could make it clearer that it is a double replacement. Of course I paid attention to the repeated instructions about string substitution, but I thought that it was just a reminder, not an extra processing layer on top of the standard string processing logic. Currently it reads like:

...if it is a string, any backslash escapes in it are processed. That is, \n is...

The correct text would be like:

...if it is a string, any backslash escapes in it are processed in addition to standard string escapes. That is, \n is... ... Note that re.sub backslash processing for the replacement string occurs even with raw strings.

techtonik · 2013-03-15T17:28:07Z

FWIW, I reimplemented substitution logic in my wikify [1] engine some time ago. I was kind of disappointed that I had to reinvent the bicycle, but now I see that this was for the best. Thanks to the people in this report I now understand the whole stuff much better and this will definitely make wikify more useful and easier to use.

[1] https://bitbucket.org/techtonik/wikify

amauryfa · 2013-03-15T17:30:06Z

It's not a double replacement: chr(92)+chr(0) is processed only once.
And the second paragraph of the re documentation already contains such a warning.
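
The single-pass point can be illustrated with a small sketch, assuming the standard re module. A raw literal and a regular literal hand different strings to re.sub, and each escape is interpreted exactly once, either by the Python parser or by re.sub, never by both.

```python
import re

raw = r"\0"     # two characters: chr(92) + "0"
cooked = "\0"   # one character: chr(0), produced by the Python parser

assert len(raw) == 2 and len(cooked) == 1

# re.sub converts the two-character escape to chr(0).
from_raw = re.sub("aaa", raw, " aaa ")
# A NUL that is already in the string has no backslash and passes through unchanged.
from_cooked = re.sub("aaa", cooked, " aaa ")

print(repr(from_raw))     # ' \x00 '
print(repr(from_cooked))  # ' \x00 '
```

Both calls end with the same output, which is why the raw and non-raw variants in the earlier comment looked identical when printed.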

techtonik mannequin added the stdlib Python modules in the Lib dir label Mar 15, 2013

ezio-melotti added the topic-regex label Mar 15, 2013

gvanrossum closed this as completed Mar 15, 2013

ezio-melotti transferred this issue from another repository Apr 10, 2022
