Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace empty matches adjacent to a previous non-empty match in re.sub() #76489

Closed
serhiy-storchaka opened this issue Dec 13, 2017 · 16 comments
Closed
Assignees
Labels
3.7 expert-regex stdlib type-feature

Comments

@serhiy-storchaka
Copy link
Member

@serhiy-storchaka serhiy-storchaka commented Dec 13, 2017

BPO 32308
Nosy @ezio-melotti, @serhiy-storchaka, @boxed
PRs
  • #4846
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2018-01-04.09:06:40.130>
    created_at = <Date 2017-12-13.18:28:37.987>
    labels = ['expert-regex', 'type-feature', 'library', '3.7']
    title = 'Replace empty matches adjacent to a previous non-empty match in re.sub()'
    updated_at = <Date 2020-04-16.16:28:26.175>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2020-04-16.16:28:26.175>
    actor = 'serhiy.storchaka'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2018-01-04.09:06:40.130>
    closer = 'serhiy.storchaka'
    components = ['Library (Lib)', 'Regular Expressions']
    creation = <Date 2017-12-13.18:28:37.987>
    creator = 'serhiy.storchaka'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 32308
    keywords = ['patch']
    message_count = 14.0
    messages = ['308229', '309055', '309458', '339949', '339950', '339989', '340040', '340102', '360352', '360355', '366595', '366602', '366604', '366606']
    nosy_count = 6.0
    nosy_names = ['ezio.melotti', 'mrabarnett', 'mu_mind', 'serhiy.storchaka', 'Anders.Hovm\xc3\xb6ller', 'Mark Borgerding']
    pr_nums = ['4846']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue32308'
    versions = ['Python 3.7']

    @serhiy-storchaka
    Copy link
    Member Author

    @serhiy-storchaka serhiy-storchaka commented Dec 13, 2017

    Currently re.sub() replaces empty matches only when not adjacent to a previous match. This makes it inconsistent with re.findall() and re.finditer() which finds empty matches adjacent to a previous non-empty match and with other RE engines.

    Proposed PR makes all functions that makes repeated searching (re.split(), re.sub(), re.findall(), re.finditer()) mutually consistent.

    The PR change the behavior of re.split() too, but this doesn't matter, since it already is different from the 3.6 behavior.

    BDFL have approved this change.

    This change doesn't break any stdlib code. It is expected that it will not break much third-party code, and even if it will break some code, it can be easily rewritten. For example replacing re.sub('(.*)', ...) (which now matches an empty string at the end of the string) with re.sub('(.+)', ...) is an obvious fix.

    @serhiy-storchaka serhiy-storchaka self-assigned this Dec 13, 2017
    @serhiy-storchaka serhiy-storchaka added stdlib expert-regex type-feature labels Dec 13, 2017
    @serhiy-storchaka
    Copy link
    Member Author

    @serhiy-storchaka serhiy-storchaka commented Dec 26, 2017

    Could anybody please make a review of at least the documentation part?

    @serhiy-storchaka
    Copy link
    Member Author

    @serhiy-storchaka serhiy-storchaka commented Jan 4, 2018

    New changeset fbb490f by Serhiy Storchaka in branch 'master':
    bpo-32308: Replace empty matches adjacent to a previous non-empty match in re.sub(). (bpo-4846)
    fbb490f

    @boxed
    Copy link
    Mannequin

    @boxed boxed mannequin commented Apr 11, 2019

    This was a really bad idea in my opinion. We just found this and we have no way to know how this will impact production. It's really absurd that

    re.sub('(.*)', r'foo', 'asd')

    is "foo" in python 1 to 3.6 but 'foofoo' in python 3.7.

    @boxed
    Copy link
    Mannequin

    @boxed boxed mannequin commented Apr 11, 2019

    Just as a comparison, sed does the 3.6 thing:

    echo foo | sed 's/\(.*\)/x\1y/g'
    xfooy

    @mrabarnett
    Copy link
    Mannequin

    @mrabarnett mrabarnett mannequin commented Apr 11, 2019

    It's now consistent with Perl, PCRE and .Net (C#), as well as re.split(), re.sub(), re.findall() and re.finditer().

    @boxed
    Copy link
    Mannequin

    @boxed boxed mannequin commented Apr 12, 2019

    That might be true, but that seems like a weak argument. If anything, it means those others are broken. What is the logic behind "(.*)" returning the entire string (which is what you asked for) and exactly one empty string? Why not two empty strings? 3? 4? 5? Why not an empty string at the beginning? It makes no practical sense.

    We will have to spend considerable effort to work around this change and adapt our code to 3.7. The lack of a discussion about backwards compatibility in this, and the other, thread before making this change is also a problem I think.

    @mrabarnett
    Copy link
    Mannequin

    @mrabarnett mrabarnett mannequin commented Apr 12, 2019

    Consider re.findall(r'.{0,2}', 'abcde').

    It finds 'ab', then continues where it left off to find 'cd', then 'e'.

    It can also find ''; re.match(r'.*', '') does match, after all.

    It could, in fact, an infinite number of ''.

    And what about re.match(r'()*', '')?

    What should it do? Run forever? Raise an exception?

    At some point you have to make a decision as to what should happen, and the general consensus has been to match once.

    @mumind
    Copy link
    Mannequin

    @mumind mumind mannequin commented Jan 21, 2020

    We were also bitten by this behavior change in google/vroom#110. I'm kinda baffled by the new behavior and assumed it had to be an accidental regression, but I guess not. If you have any other context on the BDFL conversation and reasoning for calling this behavior correct, I'd love to see additional info.

    @boxed
    Copy link
    Mannequin

    @boxed boxed mannequin commented Jan 21, 2020

    We were also bitten by this. In fact we still run a compatibility shim in production where we log if the new and old behavior are different. We also didn't think this "bug fix" made sense or was treated with the appropriate gravity in the release notes.

    I understand the logic in the bug tracker and they it matches other languages is good. But the bahvior also makes no sense for the .* case unfortunately.

    On 21 Jan 2020, at 05:56, David Barnett <report@bugs.python.org> wrote:

    
    David Barnett <davidbarnett2@gmail.com> added the comment:

    We were also bitten by this behavior change in google/vroom#110. I'm kinda baffled by the new behavior and assumed it had to be an accidental regression, but I guess not. If you have any other context on the BDFL conversation and reasoning for calling this behavior correct, I'd love to see additional info.

    ----------
    nosy: +mu_mind


    Python tracker <report@bugs.python.org>
    <https://bugs.python.org/issue32308\>


    @MarkBorgerding
    Copy link
    Mannequin

    @MarkBorgerding MarkBorgerding mannequin commented Apr 16, 2020

    So third-party code was knowingly broken to satisfy an aesthetic notion that substitution should be more like iteration.

    Would not a FutureWarning have been a kinder way to stage this implementation?

    A foolish consistency, indeed.

    @serhiy-storchaka
    Copy link
    Member Author

    @serhiy-storchaka serhiy-storchaka commented Apr 16, 2020

    The former implementation was wrong. See bpo-25054 which contains more obvious examples of that bug:

    >>> re.sub(r"\b|:+", "-", "a::bc")
    '-a-:-bc-'

    Not all colons were replaced despite the fact that the pattern matches all colons.

    @MarkBorgerding
    Copy link
    Mannequin

    @MarkBorgerding MarkBorgerding mannequin commented Apr 16, 2020

    @serhiy.storchaka Thanks for the link to bpo-25054 to clarify this change was not done solely for aesthetics.
    Hopefully that will mollify others like me who find their way to this discussion as they try to figure out why their code broke with a new version of python.

    I wish it had been done in a more staged and overt way, but that is just spitting in the wind at this point.

    Thanks for all your work, my gripe du jour notwithstanding.

    @serhiy-storchaka
    Copy link
    Member Author

    @serhiy-storchaka serhiy-storchaka commented Apr 16, 2020

    If the behavior is obviously wrong (like in bpo-25054), we can fix it without warnings, and even backport the fix to older versions, because we do not expect that anybody depends on such weird behavior. If we are going to change the behavior, but expect that users can depend on the current behavior, we emit a FutureWarning first (and we did it for other changes in re). But this issue is the hard one. Before 3.7 we did not know that it is related to bpo-25054. We were not going to change this behavior (at least not in near future). But when a fix for bpo-25054 was written we did see that it is the same issue. We did not want to keep a bug in bpo-25054 few versions more, so we changed the behavior in this issue without warnings. It was an exceptional case.

    This change was documented, in the module documentation, and in "What's New in Python 3.7" (section "Porting to Python 3.7"). If this is not enough we will be happy to get help to make it better.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @Fernando-Cabral
    Copy link

    @Fernando-Cabral Fernando-Cabral commented May 24, 2022

    I still think re.sub() behaves wrongly in certain cases like the following:
    re.sub('.*', "needle", "One line haystack")
    should return 'needle', but returns 'needleneedle'. I find no rational reason for this result.
    So much so that the following operations return what is to be expected:
    re.sub('^.*$', "needle", "One line haystack")
    re.sub('^.*', "needle", "One line haystack")
    In this context "." == "^." == "^.$" so they sould all give the same result. Nevertheless, they don't. Besides, ".$" should, but does not return the same result as "^.". In summary:
    ".
    " == ".$" != "^.$" == "^.".
    This is highly inconsistent.
    Also,
    re.seach(".
    ", "One line haystack")
    re.match(".", "One line haystack")
    re.findall(".
    ", "One line haystack")
    they find "One line laystack". So, why re.sub doesn't? I don't think it makes sense.

    @serhiy-storchaka
    Copy link
    Member Author

    @serhiy-storchaka serhiy-storchaka commented May 24, 2022

    Please read all previous discussions and explanations scattered across several issues.

    In your example, look how many occurrences of the pattern are found in the string:

    >>> re.findall('.*', "One line haystack")
    ['One line haystack', '']
    

    And where they are found:

    >>> [m.span() for m in re.finditer('.*', "One line haystack")]
    [(0, 17), (17, 17)]
    

    Now, if you replace characters in the ranges 0 to 17 and 17 to 17 with the replacement string "needle", what do you get?

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 expert-regex stdlib type-feature
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants