
Large regex handling very slow on Linux #52311

Closed
olivers mannequin opened this issue Mar 4, 2010 · 9 comments
Labels
performance (Performance or resource usage), topic-regex

Comments


olivers mannequin commented Mar 4, 2010

BPO 8064
Nosy @vstinner, @ezio-melotti, @voidspace
Files
  • regextest.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2010-03-05.00:00:31.746>
    created_at = <Date 2010-03-04.23:14:44.784>
    labels = ['expert-regex', 'invalid', 'performance']
    title = 'Large regex handling very slow on Linux'
    updated_at = <Date 2010-03-05.15:16:21.414>
    user = 'https://bugs.python.org/olivers'

    bugs.python.org fields:

    activity = <Date 2010-03-05.15:16:21.414>
    actor = 'michael.foord'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-03-05.00:00:31.746>
    closer = 'ezio.melotti'
    components = ['Regular Expressions']
    creation = <Date 2010-03-04.23:14:44.784>
    creator = 'olivers'
    dependencies = []
    files = ['16440']
    hgrepos = []
    issue_num = 8064
    keywords = []
    message_count = 9.0
    messages = ['100434', '100435', '100436', '100437', '100438', '100439', '100441', '100447', '100485']
    nosy_count = 6.0
    nosy_names = ['exarkun', 'vstinner', 'ezio.melotti', 'mrabarnett', 'michael.foord', 'olivers']
    pr_nums = []
    priority = 'normal'
    resolution = 'not a bug'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'resource usage'
    url = 'https://bugs.python.org/issue8064'
    versions = ['Python 2.6']


    olivers mannequin commented Mar 4, 2010

    The code in regextest.py (attached) uses a large regex to analyze a piece of text. I have tried this test program on two Macs, using the standard Python distributions.

    On a MacBook, 2.4 GHz dual core, Snow Leopard with Python 2.6.1, it takes 0.08 seconds
    On a MacPro, 2.something GHz 8 core, Leopard with Python 2.5.1, it takes 0.09 seconds

    Now I've also tried it on several Linux machines, all of them running Ubuntu. They are all extremely slow. The machine I've been testing with now is a 2.0 GHz dual core machine with Python 2.5.2. A test run of the program takes 97.5 (that's not a typo) seconds on this machine, 1200 times as long as on the Macs.
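regextest.py is attached rather than inlined. As a rough idea of what the test does — reconstructed from the comments below, so the names and details here are assumptions, not the attached file — makew() apparently builds one huge character class out of every Unicode letter up to sys.maxunicode:

# Hypothetical reconstruction (Python 2); the real regextest.py is
# attached to the issue and may differ in detail.
import re
import sys
import time
import unicodedata

def makew():
    # Collect every code point whose Unicode category is a letter (L*).
    # On a UCS4 build sys.maxunicode is 1114111; on a UCS2 build it is
    # 65535, so the class is roughly twice as long on UCS4.
    return u''.join(unichr(c) for c in xrange(sys.maxunicode + 1)
                    if unicodedata.category(unichr(c)).startswith('L'))

word_re = re.compile(u'[%s_-]' % makew(), re.UNICODE)

text = u'The quick brown fox jumps over the lazy dog. ' * 100
start = time.time()
for _ in xrange(500):
    word_re.findall(text)
print 'total: %.1f ms' % ((time.time() - start) * 1000)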

olivers mannequin added the topic-regex and performance labels Mar 4, 2010

    exarkun mannequin commented Mar 4, 2010

    I think it's likely that the test program does drastically different things on Linux than it does on OS X:

    Python 2.6.4 (r264:75706, Dec  7 2009, 18:45:15)
    [GCC 4.4.1] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import regextest
    >>> len(regextest.makew())
    93455
    >>>

    Compare to:

    Python 2.6.1 (r261:67515, Jul  7 2009, 23:51:51)
    [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import regextest
    >>> len(regextest.makew())
    47672
    >>> 

    If I modify it to use 65535 (sys.maxunicode on OS X) and then run it on Linux, it completes quite quickly.
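Presumably the modification amounts to something like the following sketch (an assumption, not exarkun's exact edit): cap the scan at the narrow-build limit so makew() returns the same string on both builds.

# Assumed shape of the change: cap at 65535 (sys.maxunicode on a
# narrow/UCS2 build) so the generated class is identical everywhere.
import sys
import unicodedata

NARROW_MAX = 65535

def makew():
    top = min(sys.maxunicode, NARROW_MAX)
    return u''.join(unichr(c) for c in xrange(top + 1)
                    if unicodedata.category(unichr(c)).startswith('L'))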


    vstinner commented Mar 4, 2010

    Results on Linux (Debian Sid) with different Python versions:

    • Python 2.4.6: 14112.8 ms
    • Python 2.5.5: 14246.7 ms
    • Python 2.6.4+: 14753.4 ms
    • Python trunk (2.7a3+): 69.3 ms

It looks like the re engine was optimized in trunk :-)

Note: I replaced the stopwatch with time.time() and used 100 iterations instead of 500. The absolute values are not important, only the ratios between the different results.
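The replacement harness is presumably no more than a sketch like this (assumed, not vstinner's actual code):

import time

def bench_ms(pattern, text, n=100):
    # Wall-clock milliseconds for n findall() calls on a compiled pattern.
    start = time.time()
    for _ in xrange(n):
        pattern.findall(text)
    return (time.time() - start) * 1000.0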

    --

    exarkun> I think it's likely that the test program does drastically
    exarkun> different things on Linux than it does on OS X

sys.maxunicode is probably different on the two OSes.


    vstinner commented Mar 4, 2010

Oops, my benchmark was wrong. It looks like the result depends on sys.maxunicode:

    $ python2.4 -c "import sys; print sys.maxunicode"
    1114111
    $ python2.5 -c "import sys; print sys.maxunicode"
    1114111
    $ python2.6 -c "import sys; print sys.maxunicode"
    1114111
    $ ./python -c "import sys; print sys.maxunicode"
    65535

The last one (./python) is Python trunk.

voidspace (Contributor) commented:

    So is it reasonable / unavoidable that UCS4 builds should be 1200 times slower at regex handling?


    exarkun mannequin commented Mar 4, 2010

    So is it reasonable / unavoidable that UCS4 builds should be 1200 times slower at regex handling?

    No, but it's probably reasonable / unavoidable that a more complex regex should be some number of times slower than a simpler regex.

On Linux, the regex being constructed is more complex; on OS X, it's simpler. The reason for the difference is that the Linux build is UCS4, but that matters only because the unicode character width is used by the function that constructs the regular expression.

    If you take this variable out of that function, so that it returns the same string regardless of the width of a unicode character, then performance evens out.

ezio-melotti (Member) commented:

A workaround could be using [^\\W\\d], but this includes some extra chars in the categories Pc, Nl, and No that maybe you don't want. Generating a list of the chars in these three categories and adding them to the regex should be cheaper, though (see the sketch below).
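Concretely, the suggested construction could look like this sketch (an assumption; note that the original class also wanted '_', which is in Pc, so you may need to re-add it):

# Sketch of the workaround: [^\W\d] minus the Pc, Nl and No characters
# it would otherwise include. Enumerating three small categories is much
# cheaper than enumerating every letter the way makew() does.
import sys
import unicodedata

extra = u''.join(unichr(c) for c in xrange(sys.maxunicode + 1)
                 if unicodedata.category(unichr(c)) in ('Pc', 'Nl', 'No'))
pattern = u'[^\\W\\d%s]' % extra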
Since this is not a bug in Python, I'll close this issue.

ezio-melotti (Member) commented:

This is proof that you can build an equivalent regex without including all the 'letter chars' (tested on both narrow and wide builds):
>>> import re, sys
>>> from regextest import makew
>>> s = u''.join(unichr(c) for c in range(sys.maxunicode))
    >>> diff = set(re.findall(u'[^\W\d]', s, re.U)) ^ set(re.findall(u'[%s_-]' % makew(), s, re.U))
    >>> diff.remove('-')
    >>> re.findall(u'(?:[^\W\d%s]|-)' % ''.join(diff), s, re.U) == re.findall(u'[%s_-]' % makew(), s, re.U)
    True

    (I don't like the way I included the '-' but I couldn't find anything better.)
It looks, however, like most of the time is spent in the findall, and from a quick benchmark it seems that my regex is slower to match (even though it's shorter and compiles faster).
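A rough way to check that split between compile time and match time might look like this sketch (assumed patterns and harness, not ezio's actual benchmark):

import re
import sys
import time
import unicodedata

def ms(func, n=5):
    # Average wall-clock milliseconds per call.
    start = time.time()
    for _ in xrange(n):
        func()
    return (time.time() - start) * 1000 / n

letters = u''.join(unichr(c) for c in xrange(sys.maxunicode + 1)
                   if unicodedata.category(unichr(c)).startswith('L'))
big_src = u'[%s_-]' % letters   # long explicit class
small_src = u'[^\\W\\d]'        # compact, only roughly equivalent
text = u''.join(unichr(c) for c in xrange(sys.maxunicode))

# re.purge() defeats re's pattern cache so each compile is real work.
print 'compile big:   %.1f ms' % ms(lambda: (re.purge(), re.compile(big_src, re.U)))
print 'compile small: %.1f ms' % ms(lambda: (re.purge(), re.compile(small_src, re.U)))
big, small = re.compile(big_src, re.U), re.compile(small_src, re.U)
print 'findall big:   %.1f ms' % ms(lambda: big.findall(text))
print 'findall small: %.1f ms' % ms(lambda: small.findall(text))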

voidspace (Contributor) commented:

    Interestingly, the code olivers is using was originally written by Martin v. Loewis:

    http://www.velocityreviews.com/forums/t646421-unicode-regex-and-hindi-language.html

    In response to a still open bug report on \w in the Python re module:

    http://bugs.python.org/issue1693050

ezio-melotti transferred this issue from another repository on Apr 10, 2022