re.split doesn't split with zero-width regex #47512

Closed
mrabarnett mannequin opened this issue Jul 2, 2008 · 15 comments
Labels
tests Tests in the Lib/test dir type-bug An unexpected behavior, bug, or error

Comments

@mrabarnett
Mannequin

mrabarnett mannequin commented Jul 2, 2008

BPO 3262
PRs
  • bpo-25054, bpo-1647489: Added support of splitting on zerowidth patterns. #4471
  • bpo-25054, bpo-1647489: Added support of splitting on zerowidth patterns (alternate version). #4678
Files
  • split_zero_width.diff

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2010-08-04.05:05:56.318>
    created_at = <Date 2008-07-02.22:07:48.608>
    labels = ['type-bug', 'tests']
    title = "re.split doesn't split with zero-width regex"
    updated_at = <Date 2021-11-04.14:19:04.511>
    user = 'https://bugs.python.org/mrabarnett'

    bugs.python.org fields:

    activity = <Date 2021-11-04.14:19:04.511>
    actor = 'eryksun'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-08-04.05:05:56.318>
    closer = 'terry.reedy'
    components = ['Tests']
    creation = <Date 2008-07-02.22:07:48.608>
    creator = 'mrabarnett'
    dependencies = []
    files = ['10797']
    hgrepos = []
    issue_num = 3262
    keywords = ['patch']
    message_count = 15.0
    messages = ['69134', '69139', '69146', '69150', '69157', '69408', '69438', '69852', '70749', '70752', '73523', '73567', '73592', '104226', '104257']
    nosy_count = 0.0
    nosy_names = []
    pr_nums = ['4471', '4678']
    priority = 'normal'
    resolution = 'rejected'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue3262'
    versions = ['Python 2.7']

    @mrabarnett
    Mannequin Author

    mrabarnett mannequin commented Jul 2, 2008

    re.split doesn't split a string when the regex matches zero characters.

    For example:

    re.split(r'\b', 'a b') returns ['a b'] instead of ['', 'a', ' ', 'b', ''].

    re.split(r'(?<!\w)(?=\w)', 'a b') returns ['a b'] instead of ['', 'a ', 'b'].
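A minimal sketch of where the expected results above come from: it lists the positions at which each pattern matches zero characters, then cuts the string at those positions. The helper names zero_width_positions and cut_at are made up for illustration and are not part of re or of the attached patch.

```python
import re


def zero_width_positions(pattern, text):
    """Return the offsets at which `pattern` matches an empty string."""
    return [m.start() for m in re.finditer(pattern, text)
            if m.start() == m.end()]


def cut_at(text, positions):
    """Cut `text` at each offset, keeping empty leading/trailing pieces."""
    pieces = []
    last = 0
    for pos in positions:
        pieces.append(text[last:pos])
        last = pos
    pieces.append(text[last:])
    return pieces


if __name__ == "__main__":
    for pattern in (r'\b', r'(?<!\w)(?=\w)'):
        positions = zero_width_positions(pattern, 'a b')
        print(pattern, positions, cut_at('a b', positions))
```

The word-boundary pattern matches at every edge of 'a' and 'b', which is why the expected list starts and ends with an empty string.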

    @mrabarnett mrabarnett mannequin added topic-regex type-bug An unexpected behavior, bug, or error labels Jul 2, 2008
    @mrabarnett
    Mannequin Author

    mrabarnett mannequin commented Jul 2, 2008

    The attached patch appears to work.

    @gvanrossum
    Member

    Probably by design. There's probably even a unittest for this behavior.

    @mrabarnett
    Mannequin Author

    mrabarnett mannequin commented Jul 2, 2008

    I've found that this issue has been discussed before: bpo-988761.

    @mrabarnett
    Mannequin Author

    mrabarnett mannequin commented Jul 3, 2008

    New patch version after studying bpo-988761 and doing more testing.

    @mkc
    Mannequin

    mkc mannequin commented Jul 8, 2008

    I don't want to discourage you, but bpo-852532, which is essentially the
    same bug report, was closed--without explanation--as 'wont fix' in
    April, after four-plus years. I wish you good luck--this is an
    important and irritating bug, in my opinion...

    @mrabarnett
    Mannequin Author

    mrabarnett mannequin commented Jul 8, 2008

    There appear to be 2 opinions on this issue:

    1. It's a bug, a corner case that got missed.

    2. It's always been like this, so it's probably a design decision,
      although no-one can point to where or when the decision was made...

    Looking at the code, I think it's a bug.

    Expected behaviour: if 'pattern' is a non-capturing regex, then
    re.split(pattern, text) == re.sub(pattern, MARKER, text).split(MARKER).
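The equivalence stated above can be sketched as follows. MARKER is assumed to be a string that never occurs in the text, and split_via_sub is a name invented here for illustration. The sketch only covers patterns without capturing groups, as in the statement above.

```python
import re

# Assumption: NUL does not occur in the text being split.
MARKER = "\x00"


def split_via_sub(pattern, text, marker=MARKER):
    """Split `text` by replacing every match with `marker`, then
    splitting on `marker` with str.split."""
    if marker in text:
        raise ValueError("marker already occurs in text")
    # A callable replacement avoids escape processing of the marker.
    marked = re.sub(pattern, lambda m: marker, text)
    return marked.split(marker)


if __name__ == "__main__":
    print(split_via_sub(r'\b', 'a b'))
    print(split_via_sub(r'(?<!\w)(?=\w)', 'a b'))
```

On Python 3.7 and later, where re.split does split on empty matches, re.split should agree with this helper for the two patterns from the report; on the versions discussed in this thread it did not.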

    @mkc
    Mannequin

    mkc mannequin commented Jul 16, 2008

    I think it's probably both. The original design was incorrect, though
    this probably wasn't apparent to the designer. But as a significant
    user of 're', it really stands out as a problem.

    @gvanrossum
    Member

    I think it's better to leave this alone. Such a subtle change is likely
    to trip over more people in worse ways than the alleged "bug".

    @mkc
    Mannequin

    mkc mannequin commented Aug 5, 2008

    Okay. For what it's worth, note that my original 2004 patch for this
    (bpo-988761) is completely backward compatible (a flag must be set in the
    call to get the new behavior).

    @mrabarnett
    Mannequin Author

    mrabarnett mannequin commented Sep 21, 2008

    I wonder whether it could be put into Python 3 where certain breaks in
    backwards compatibility are to be expected.

    @timehorse
    Mannequin

    timehorse mannequin commented Sep 22, 2008

    I think Mike Coleman's proposal of enabling this behaviour via a flag is
    probably best and IMHO we should consider it under these circumstances.
    Intuitively, I think your interpretation of what re.split should do
    under zero-width conditions is logical, and I almost think this should
    be a 2-minor number transition à la from __future__ import
    zeroWidthRegexpSplit if we are to consider it as the long-term 'right
    thing to do'. 3000 (3.0) seems a good place to also consider it for
    true overhaul / reexamination, especially as we are writing 'upgrade'
    scripts for many of the other Python features. However, I would say
    this: Guido has spoken and it may be too late for the pebbles to vote.

    I would like to add this patch as a new item to the general Regexp
    Enhancements thread of bpo-2636 though, as I think it is an idea worth
    considering when overhauling Regexp.

    @gvanrossum
    Member

    The problem with doing this per 3.0 is that it's impossible to write a
    conversion script.

    I'm okay with adding a flag to enable this behavior though. Please open
    a new bug with a new patch, preferably one that applies cleanly to the
    trunk, and a separate patch for the py3k branch unless the trunk patch
    merges cleanly. There should also be unittests and documentation. The
    patches should be marked for Python 2.7 and 3.1 -- it's way too late to
    get this into 2.6 and 3.0.
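A rough sketch of how an opt-in flag might look from the caller's side. The function split_with_flag and its split_on_empty parameter are hypothetical; they are not the API of re, nor of any patch attached to this issue. Capture groups and maxsplit are deliberately ignored to keep the sketch short.

```python
import re


def split_with_flag(pattern, text, split_on_empty=False):
    """Split `text` on matches of `pattern`.

    With split_on_empty=False (the default), zero-width matches are
    skipped, which keeps the behaviour described in this report.
    With split_on_empty=True, zero-width matches also split.
    """
    pieces = []
    last = 0
    for m in re.finditer(pattern, text):
        if m.start() == m.end() and not split_on_empty:
            continue  # ignore the empty match unless the caller opted in
        pieces.append(text[last:m.start()])
        last = m.end()
    pieces.append(text[last:])
    return pieces


if __name__ == "__main__":
    print(split_with_flag(r'\b', 'a b'))
    print(split_with_flag(r'\b', 'a b', split_on_empty=True))
```

Defaulting the flag to off means existing callers see no change, which is the backward-compatibility property mentioned for the earlier bpo-988761 patch.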

    @pietzcker
    Mannequin

    pietzcker mannequin commented Apr 26, 2010

    Sorry to revive this dormant (?) topic - has anybody brought this any further? This "feature" has tripped me up a few times, and I would be all for adding a flag to enable the "split on zero-size matches" behavior, but I myself am not competent enough to code a patch.

    @mrabarnett
    Mannequin Author

    mrabarnett mannequin commented Apr 26, 2010

    You could try the regex module mentioned in bpo-2636.

    @ahmedsayeed1982 ahmedsayeed1982 mannequin added tests Tests in the Lib/test dir and removed topic-regex labels Nov 4, 2021
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022