\0 in re.sub substitutes to space #61628
According to the docs, group 0 is equivalent to the whole match, which is not true for Python.

import re
print( re.sub('aaa', r'__\0__', 'argaaagra') )
arg__ __gra

import re
print( re.sub('(aaa)', r'__\1__', 'argaaagra') )
arg__aaa__gra

See also: |
It's not a space, it's a null byte. Would you mind pointing out exactly where the Python docs state that \0 in re.sub() refers to the whole group? (IIRC it should only say that group 0 refers to the whole string in the argument to the .group() method on a match object as returned by re.match() or re.search().) |
The space you see is the character \x00:

>>> re.sub('a+', r'__\0__', 'bbaaabb')
'bb__\x00__bb'

The re documentation says:

I agree that this is somewhat inconsistent with the behavior of .group(0) and with other languages, however adding support for \0 would probably be backward incompatible, and as you already mentioned in your message there's a simple workaround that can be used instead. Matthew, does regex.py support \0? |
The doc Ezio quotes for \number is describing the regex syntax, not the substitution string syntax. Unfortunately this syntax is documented somewhat less formally than the regex syntax. Fortunately, it does mention explicitly that \g<0> substitutes the entire matched substring, and that does work:

>>> re.sub(r'xxx', r'(\g<0>)', 'abcxxxdef')
'abc(xxx)def'
>>>

For backward compatibility reasons I don't think we can change this, and I don't see a need either, given that \g<0> works. Regex syntax in Python is what it is -- other languages can have only limited influence. (We once started out with an approximation of what Perl offered at the time, knowing that we would eventually get out of sync with Perl, and we were okay with that.) |
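For readers looking for the workaround mentioned above, here is a minimal sketch of three ways to refer to the whole match in a replacement. The variable names are illustrative only and do not come from the thread.

```python
import re

text = "abcxxxdef"

# \g<0> in the replacement template refers to the entire match.
via_g0 = re.sub(r"xxx", r"(\g<0>)", text)

# Wrapping the whole pattern in an explicit group makes \1 equivalent.
via_group1 = re.sub(r"(xxx)", r"(\1)", text)

# A callable replacement receives the match object, where
# group(0) is the whole match; its return value is used as-is.
via_callable = re.sub(r"xxx", lambda m: "(" + m.group(0) + ")", text)

print(via_g0)        # abc(xxx)def
print(via_group1)    # abc(xxx)def
print(via_callable)  # abc(xxx)def
```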
You're right - groups are defined here: http://docs.python.org/2/library/re.html#re.MatchObject.group

The reason to fix this is to gain internal language consistency and external consistency with other major implementations, to reduce the docs and the number of exceptions to remember, and thus to make this part intuitive.

The external inconsistency is that other languages use \0 and don't make a distinction between "match group" and "substitution group". I wonder if there are any other differences that justify the presence of this distinction?

The internal inconsistency is in the substitution group notation: \0 is \x0 but is not \g<0>

Let me also stress that re is a module - not a feature of the Python language - that provides the ability to work with "regular expressions". Evolution led to "best practices" that became an unwritten standard in different implementations - like using \x for backreferences in replacements. If some library invents its own standards that add nothing for the user other than "another thing to remember" [1], then the user is automatically granted the right to wear the sign "this library suxx" on her t-shirt.

I'd classify a 'language wart' as an inconsistency in expected behavior, independent of the language, where a technical fix is possible, but cannot be applied due to backward compatibility concerns. Language independent, because a single language by definition "works as documented" and has different "features in implementation details". |
Am I right that \0 is not supported just because nobody thought about supporting it? |
PERL uses $& for the whole match rather than $0. That would explain why \0 is not supported. For .group() it probably made sense to access the whole match using 0 rather than passing something else, and that was likely reflected in the \g<...> form, but not in the \X form. |
The Perl syntax supported $0 according to this doc http://turtle.ee.ncku.edu.tw/docs/perl/manual/pod/perlre.html but it was removed for an unknown reason. Using the fact that support was removed, without knowing the true reason, is a "cargo cult argument", which I hope is not acceptable for Python development. One possible reason is the binding of $0 to the __file__ analogue, according to this doc - http://www.cs.cmu.edu/afs/cs/usr/rgs/mosaic/pl-predef.html |
The regex module behaves the same as re. The reason it isn't supported is that \0 starts an octal escape sequence. |
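A small sketch of the octal-escape behaviour described above, using the standard re module. It shows that \0 on its own produces a NUL character, and that \0 followed by further octal digits is read as one octal character code.

```python
import re

# \0 alone is the octal escape for the NUL character.
nul_result = re.sub("a+", r"__\0__", "bbaaabb")
print(repr(nul_result))  # 'bb__\x00__bb'

# \0 followed by octal digits is read as one escape:
# \060 is octal 60, which is the character "0".
octal_result = re.sub("a+", r"[\060]", "bbaaabb")
print(octal_result)  # bb[0]bb
```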
Anatoly, your argument for consistency with other languages is ridiculous. |
Matthew, finally the right answer. Thanks! Looking further, there is a bug in processing backslashes in raw literal replacement strings. re.sub ignores raw strings as replacements. This can be even more confusing for people who look for a more advanced equivalent of string replace().

patt = "aaa"
repl = r"zed \0 org"
print(" aaa ".replace(patt, repl))
import re
print(re.sub(patt, repl, " aaa "))

This gives:
zed \0 org
With
zed org |
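For anyone who wants re.sub to insert a replacement string literally, the way str.replace() does, here is a hedged sketch of two common approaches: passing a callable, or doubling the backslashes before handing the string to re.sub. The variable names are illustrative and not taken from the thread.

```python
import re

patt = "aaa"
repl = r"zed \0 org"  # contains a backslash followed by the digit 0

# str.replace() inserts the replacement verbatim.
plain = " aaa ".replace(patt, repl)

# re.sub() processes backslash escapes in a string replacement,
# so \0 becomes a NUL character here.
processed = re.sub(patt, repl, " aaa ")

# A callable's return value is inserted without escape processing.
literal_via_callable = re.sub(patt, lambda m: repl, " aaa ")

# Alternatively, double every backslash so that escape processing
# turns each pair back into a single literal backslash.
literal_via_escaping = re.sub(patt, repl.replace("\\", "\\\\"), " aaa ")

print(repr(plain))                 # ' zed \\0 org '
print(repr(processed))             # ' zed \x00 org '
print(repr(literal_via_callable))  # ' zed \\0 org '
print(repr(literal_via_escaping))  # ' zed \\0 org '
```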
Anatoly, your question belongs on python-list or stack overflow, not in the
--Guido van Rossum (sent from Android phone)
|
I thought that trackers are used to track the sources of the bugs. Aren't they? |
Users list the effect. Then research is done to find the source. Then a decision is made about the right cause for the source of the bug, and then about whether a fix is possible. The bug is closed, but that doesn't mean we cannot dedicate some time to researching the cause. This research can be used to develop other languages and to explain why this feature works the way it does. If people are not interested, they can opt out. |
Anatoly, your last question about re.sub is covered by the documentation:

This is unrelated to raw literal strings. And yes, sometimes you need 4 backslashes to get one in the output:

>>> print(re.sub("b", "\\\\", "abc"))
a\c |
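The backslash counting above can be checked step by step. This is only an illustrative sketch: the first layer is Python's own string-literal escaping, and the second is the escape processing that re.sub applies to a string replacement.

```python
import re

# Layer 1: the Python literal. Four backslashes in source code
# produce a string of two backslash characters.
two_backslashes = "\\\\"
print(len(two_backslashes))  # 2

# A raw literal with two backslashes is the same two-character string.
print(two_backslashes == r"\\")  # True

# Layer 2: re.sub reads the two-character template \\ as one
# literal backslash in the output.
result = re.sub("b", two_backslashes, "abc")
print(result)  # a\c
```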
Amaury, the documentation could make it clearer that this is a double replacement. Of course I paid attention to the repeated instructions about string substitution, but I thought that it was just a reminder, not an extra processing layer on top of the standard string processing logic.

Currently it reads like:
...if it is a string, any backslash escapes in it are processed. That is, \n is...

The correct text would be like:
...if it is a string, any backslash escapes in it are processed in addition to standard string escapes. That is, \n is...
...
Note that re.sub backslash processing for the replacement string occurs even with raw strings. |
FWIW, I reimplemented substitution logic in my wikify [1] engine some time ago. I was kind of disappointed that I have to reinvent the bicycle, but now I see that this was for good. Thanks to people in this report I now understand the whole stuff much better and this will definitely make wikify more useful and easier to use. |
It's not a double replacement: chr(92)+chr(0) is processed only once. |