non-greedy regexp duplicating match bug #34572

Closed
donut mannequin opened this issue Jun 1, 2001 · 6 comments
Comments

donut mannequin commented Jun 1, 2001

BPO 429357
Files
  • python-sre-429357.patch: patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2002-11-06.16:39:22.000>
    created_at = <Date 2001-06-01.16:29:19.000>
    labels = ['expert-regex']
    title = 'non-greedy regexp duplicating match bug'
    updated_at = <Date 2002-11-06.16:39:22.000>
    user = 'https://bugs.python.org/donut'

    bugs.python.org fields:

    activity = <Date 2002-11-06.16:39:22.000>
    actor = 'niemeyer'
    assignee = 'effbot'
    closed = True
    closed_date = None
    closer = None
    components = ['Regular Expressions']
    creation = <Date 2001-06-01.16:29:19.000>
    creator = 'donut'
    dependencies = []
    files = ['69']
    hgrepos = []
    issue_num = 429357
    keywords = []
    message_count = 6.0
    messages = ['4937', '4938', '4939', '4940', '4941', '4942']
    nosy_count = 5.0
    nosy_names = ['nobody', 'effbot', 'gregsmith', 'donut', 'niemeyer']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue429357'
    versions = []

    donut mannequin commented Jun 1, 2001

    I found a strange bug: when a non-greedy group doesn't take part in the match,
    it reports the rest of the string instead of None.

    #pyrebug.py:
    import re
    urlrebug=re.compile("""
    	(.*?)://			#scheme
    	(
    		(.*?)			#user
    		(?:
    			:(.*)		#pass
    		)?
    	@)?
    	(.*?)				#addr
    	(?::([0-9]+))?			#port
    	(/.*)?$				#path
    """, re.VERBOSE)
    
    testbad='foo://bah:81/pth'

    print(urlrebug.match(testbad).groups())

    Bug Output:

    python2.1 pyrebug.py
    ('foo', None, 'bah:81/pth', None, 'bah', '81', '/pth')
    python-cvs pyrebug.py
    ('foo', None, 'bah:81/pth', None, 'bah', '81', '/pth')

    Good (expected) Output:

    python1.5 pyrebug.py
    ('foo', None, None, None, 'bah', '81', '/pth')
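A minimal, self-checking sketch of the report above, assuming an interpreter that already contains the fix. The pattern is the one from pyrebug.py, written as a raw string; the names `URL_RE` and `url_groups` are made up for this illustration.

```python
import re

# Same pattern as in the report, written as a raw string.
URL_RE = re.compile(r"""
    (.*?)://            # scheme
    (
        (.*?)           # user
        (?:
            :(.*)       # pass
        )?
    @)?
    (.*?)               # addr
    (?::([0-9]+))?      # port
    (/.*)?$             # path
""", re.VERBOSE)


def url_groups(url):
    """Return the captured groups for url, or None if it does not match."""
    m = URL_RE.match(url)
    return m.groups() if m else None


if __name__ == "__main__":
    # No user:pass@ part, so groups 2, 3 and 4 should all be None.
    print(url_groups("foo://bah:81/pth"))
```

With the bug present, the third element is 'bah:81/pth' instead of None, as shown in the bug output above.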

    @donut donut mannequin closed this as completed Jun 1, 2001
    @donut donut mannequin assigned effbot Jun 1, 2001
    @donut donut mannequin added the topic-regex label Jun 1, 2001

    nobody mannequin commented Jun 13, 2001

    Logged In: NO

    What's happening makes sense, on one level.
    When the regex engine gets to the user:pass@ part

    ((.*?)(?::(.*))?@)?

    which fills groups 2, 3, and 4, the .*? of group 3 has
    to try at every character in the rest of the string before
    admitting overall defeat. In doing that, the last time
    that group 3 successfully completes locally, it has the
    rest of the string matched. Of course, overall, group
    3 is enclosed within group 2, and when group 2
    can't complete successfully, the engine knows it can
    skip group 2 (due to the ? modifying it), so it totally
    bails on groups 2, 3 and 4 to continue with the rest of
    the expression.

    What you'd like to happen is that when that "bailing" happens
    for group 2, the enclosed groups 3 and 4 get zeroed
    out (since they didn't participate in the *final* overall
    match). That makes sense, and is what I would expect to
    happen. However, what *is* happening is that group 3 is
    keeping the string that *it* last matched (even though
    that last match didn't contribute to the final, overall
    match).

    I'm not explaining this well -- I hope you can understand
    it despite that. Sorry.

    Jeffrey
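
The behaviour described above can be reduced to a much smaller pattern. This is only a sketch: the pattern `((.*?)@)?(.*)$` and the names `REDUCED` and `groups_for` are hypothetical, chosen to show an optional outer group wrapping a lazy inner group, and the expected values assume a fixed interpreter.

```python
import re

# Hypothetical reduction: optional outer group 1 wraps lazy inner group 2.
REDUCED = re.compile(r"((.*?)@)?(.*)$")


def groups_for(text):
    """Return all captured groups for text."""
    return REDUCED.match(text).groups()


if __name__ == "__main__":
    # No '@' in the input: group 2 scans the whole string while trying,
    # but group 1 is skipped, so groups 1 and 2 should both be None.
    print(groups_for("bah"))
    # With an '@', both groups take part in the final match.
    print(groups_for("user@host"))
```

On an affected interpreter, the inner group would presumably keep its last tentative match rather than being reset.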
    

    donut mannequin commented Jun 14, 2001

    Logged In: YES
    user_id=65253

    I think I understand what you are saying, and in the context
    of the test, it doesn't seem too bad. BUT, my original code
    (and what I'd like to have) did not have the surrounding group.

    So I'd just get: ('foo', 'bah:81/pth', None, 'bah', '81', '/pth')

    Knowing how easy it is to mess up regexes when writing
    them, I'm sure you can imagine the pain I went through before
    actually realizing it was a Python bug :)
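
The original code without the surrounding group is not shown in this thread. As a sketch, assume the wrapper around the user:pass@ part was a non-capturing `(?:...)` group, which yields six groups like the tuple quoted above. The names `FLAT_URL_RE` and `flat_groups` are made up for this illustration.

```python
import re

# Assumed variant: the user:pass@ wrapper is non-capturing, so the
# groups are scheme, user, pass, addr, port, path.
FLAT_URL_RE = re.compile(r"""
    (.*?)://            # scheme
    (?:
        (.*?)           # user
        (?:
            :(.*)       # pass
        )?
    @)?
    (.*?)               # addr
    (?::([0-9]+))?      # port
    (/.*)?$             # path
""", re.VERBOSE)


def flat_groups(url):
    """Return the captured groups for url, or None if it does not match."""
    m = FLAT_URL_RE.match(url)
    return m.groups() if m else None


if __name__ == "__main__":
    # On a fixed interpreter the user group should be None here,
    # not 'bah:81/pth'.
    print(flat_groups("foo://bah:81/pth"))
```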

    gregsmith mannequin commented Aug 30, 2001

    Logged In: YES
    user_id=292741

    This looks like the same bug I reported (with a much simpler example)
    as bpo-448951 (I missed this one before because I was searching for 'group').
    What I found is consistent with Jeffrey's comments:
    if an optional part is fully scanned before the
    state machine can tell whether it should actually be matched, the tentative
    matches it contains are stored in group() even if the optional part turns out
    to fail. Presumably, such a case needs to be handled by going back and
    deleting them once the state machine determines that the optional part was not
    matched. In my example, I mention a small modification to the test case where
    the failure of the optional ? is decided one character later (at the end of the
    () group, not beyond it); this is enough to make it start working again.
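
The example from bpo-448951 is not reproduced in this thread. As a stand-in, the sketch below checks a single-character inner group inside an optional outer group, with no quantifier, `?`, and `*` on the inner dot. The pattern and the name `optional_prefix_groups` are assumptions for illustration, not the test case referred to above.

```python
import re


def optional_prefix_groups(op, text):
    """Match text against ((.<op>):)?z and return the captured groups."""
    pattern = r"((.%s):)?z" % op
    return re.match(pattern, text).groups()


if __name__ == "__main__":
    for op in ("", "?", "*"):
        # 'z' alone: the optional part is tried and fails, so both
        # groups should be None on a fixed interpreter.
        print(repr(op), optional_prefix_groups(op, "z"))
        # 'a:z': the optional part takes part in the match.
        print(repr(op), optional_prefix_groups(op, "a:z"))
```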

    donut mannequin commented Oct 5, 2001

    Logged In: YES
    user_id=65253

    OK, after poking and prodding the _sre.c code until
    I (hopefully) understand what is happening, I've created a
    patch. It passes all existing re tests as well as the new ones
    I added for this bug. (I've also made a patch for the
    similar, but separate, bug bpo-448951, which I will post there
    shortly.)

    niemeyer mannequin commented Nov 6, 2002

    Logged In: YES
    user_id=7887

    This problem was fixed in the following CVS revisions:

    Lib/test/re_tests.py:1.30->1.31
    Lib/test/test_sre.py:1.37->1.38
    Misc/NEWS:1.511->1.512
    Modules/_sre.c:2.83->2.84

    Thank you!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022