
Large regex handling very slow on Linux #52311

Closed
olivers mannequin opened this issue Mar 4, 2010 · 9 comments
Labels
performance (Performance or resource usage), topic-regex

Comments


olivers mannequin commented Mar 4, 2010

BPO 8064
Nosy @vstinner, @ezio-melotti, @voidspace
Files
  • regextest.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2010-03-05.00:00:31.746>
    created_at = <Date 2010-03-04.23:14:44.784>
    labels = ['expert-regex', 'invalid', 'performance']
    title = 'Large regex handling very slow on Linux'
    updated_at = <Date 2010-03-05.15:16:21.414>
    user = 'https://bugs.python.org/olivers'

    bugs.python.org fields:

    activity = <Date 2010-03-05.15:16:21.414>
    actor = 'michael.foord'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-03-05.00:00:31.746>
    closer = 'ezio.melotti'
    components = ['Regular Expressions']
    creation = <Date 2010-03-04.23:14:44.784>
    creator = 'olivers'
    dependencies = []
    files = ['16440']
    hgrepos = []
    issue_num = 8064
    keywords = []
    message_count = 9.0
    messages = ['100434', '100435', '100436', '100437', '100438', '100439', '100441', '100447', '100485']
    nosy_count = 6.0
    nosy_names = ['exarkun', 'vstinner', 'ezio.melotti', 'mrabarnett', 'michael.foord', 'olivers']
    pr_nums = []
    priority = 'normal'
    resolution = 'not a bug'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'resource usage'
    url = 'https://bugs.python.org/issue8064'
    versions = ['Python 2.6']


    olivers mannequin commented Mar 4, 2010

    The code in regextest.py (attached) uses a large regex to analyze a piece of text. I have tried this test program on two Macs, using the standard Python distributions.

    On a MacBook, 2.4 GHz dual core, Snow Leopard with Python 2.6.1, it takes 0.08 seconds
    On a MacPro, 2.something GHz 8 core, Leopard with Python 2.5.1, it takes 0.09 seconds

    Now I've also tried it on several Linux machines, all of them running Ubuntu. They are all extremely slow. The machine I've been testing with now is a 2.0 GHz dual core machine with Python 2.5.2. A test run of the program takes 97.5 (that's not a typo) seconds on this machine, 1200 times as long as on the Macs.
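regextest.py is attached rather than inlined. As a rough idea of what the test does — reconstructed from the comments below, so the names and details here are assumptions, not the attached file — makew() apparently builds one huge character class out of every Unicode letter up to sys.maxunicode:

# Hypothetical reconstruction (Python 2); the real regextest.py is
# attached to the issue and may differ in detail.
import re
import sys
import time
import unicodedata

def makew():
    # Collect every code point whose Unicode category is a letter (L*).
    # On a UCS4 build sys.maxunicode is 1114111; on a UCS2 build it is
    # 65535, so the class is roughly twice as long on UCS4.
    return u''.join(unichr(c) for c in xrange(sys.maxunicode + 1)
                    if unicodedata.category(unichr(c)).startswith('L'))

word_re = re.compile(u'[%s_-]' % makew(), re.UNICODE)

text = u'The quick brown fox jumps over the lazy dog. ' * 100
start = time.time()
for _ in xrange(500):
    word_re.findall(text)
print 'total: %.1f ms' % ((time.time() - start) * 1000)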

olivers mannequin added the topic-regex and performance labels Mar 4, 2010

    exarkun mannequin commented Mar 4, 2010

    I think it's likely that the test program does drastically different things on Linux than it does on OS X:

    Python 2.6.4 (r264:75706, Dec  7 2009, 18:45:15)
    [GCC 4.4.1] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import regextest
    >>> len(regextest.makew())
    93455
    >>>

    Compare to:

    Python 2.6.1 (r261:67515, Jul  7 2009, 23:51:51)
    [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import regextest
    >>> len(regextest.makew())
    47672
    >>> 

    If I modify it to use 65535 (sys.maxunicode on OS X) and then run it on Linux, it completes quite quickly.
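Presumably the modification amounts to something like the following sketch (an assumption, not exarkun's exact edit): cap the scan at the narrow-build limit so makew() returns the same string on both builds.

# Assumed shape of the change: cap at 65535 (sys.maxunicode on a
# narrow/UCS2 build) so the generated class is identical everywhere.
import sys
import unicodedata

NARROW_MAX = 65535

def makew():
    top = min(sys.maxunicode, NARROW_MAX)
    return u''.join(unichr(c) for c in xrange(top + 1)
                    if unicodedata.category(unichr(c)).startswith('L'))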


    vstinner commented Mar 4, 2010

    Results on Linux (Debian Sid) with different Python versions:

    • Python 2.4.6: 14112.8 ms
    • Python 2.5.5: 14246.7 ms
    • Python 2.6.4+: 14753.4 ms
    • Python trunk (2.7a3+): 69.3 ms

It looks like the re engine was optimized in trunk :-)

Note: I replaced the stopwatch with time.time() and used 100 iterations instead of 500. The absolute values are not important, only the ratios between the different results.
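The replacement harness is presumably no more than a sketch like this (assumed, not vstinner's actual code):

import time

def bench_ms(pattern, text, n=100):
    # Wall-clock milliseconds for n findall() calls on a compiled pattern.
    start = time.time()
    for _ in xrange(n):
        pattern.findall(text)
    return (time.time() - start) * 1000.0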

    --

    exarkun> I think it's likely that the test program does drastically
    exarkun> different things on Linux than it does on OS X

sys.maxunicode is probably different on the two OSes.


    vstinner commented Mar 4, 2010

Oops, my benchmark was wrong. It looks like the result depends on sys.maxunicode:

    $ python2.4 -c "import sys; print sys.maxunicode"
    1114111
    $ python2.5 -c "import sys; print sys.maxunicode"
    1114111
    $ python2.6 -c "import sys; print sys.maxunicode"
    1114111
    $ ./python -c "import sys; print sys.maxunicode"
    65535

The last one (./python) is Python trunk.

voidspace (Contributor) commented:

    So is it reasonable / unavoidable that UCS4 builds should be 1200 times slower at regex handling?


    exarkun mannequin commented Mar 4, 2010

    So is it reasonable / unavoidable that UCS4 builds should be 1200 times slower at regex handling?

    No, but it's probably reasonable / unavoidable that a more complex regex should be some number of times slower than a simpler regex.

On Linux, the regex being constructed is more complex; on OS X, it's simpler. The reason for the difference is that the Linux build is UCS4, but that matters only because the unicode character width is used by the function that constructs the regular expression.

    If you take this variable out of that function, so that it returns the same string regardless of the width of a unicode character, then performance evens out.

ezio-melotti (Member) commented:

A workaround could be using [^\\W\\d], but this includes some extra chars in the categories Pc, Nl, and No that maybe you don't want. Generating a list of the chars in these three categories and adding them to the regex should be cheaper, though (see the sketch below).
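Concretely, the suggested construction could look like this sketch (an assumption; note that the original class also wanted '_', which is in Pc, so you may need to re-add it):

# Sketch of the workaround: [^\W\d] minus the Pc, Nl and No characters
# it would otherwise include. Enumerating three small categories is much
# cheaper than enumerating every letter the way makew() does.
import sys
import unicodedata

extra = u''.join(unichr(c) for c in xrange(sys.maxunicode + 1)
                 if unicodedata.category(unichr(c)) in ('Pc', 'Nl', 'No'))
pattern = u'[^\\W\\d%s]' % extra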
Since this is not a bug in Python, I'll close this issue.

ezio-melotti (Member) commented:

This is proof that you can build an equivalent regex without including all the 'letter chars' (tested on both narrow and wide builds):
>>> import re, sys
>>> from regextest import makew
>>> s = u''.join(unichr(c) for c in range(sys.maxunicode))
    >>> diff = set(re.findall(u'[^\W\d]', s, re.U)) ^ set(re.findall(u'[%s_-]' % makew(), s, re.U))
    >>> diff.remove('-')
    >>> re.findall(u'(?:[^\W\d%s]|-)' % ''.join(diff), s, re.U) == re.findall(u'[%s_-]' % makew(), s, re.U)
    True

    (I don't like the way I included the '-' but I couldn't find anything better.)
It looks, however, like most of the time is spent in the findall, and from a quick benchmark it seems that my regex is slower to match (even though it's shorter and compiles faster).
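A rough way to check that split between compile time and match time might look like this sketch (assumed patterns and harness, not ezio's actual benchmark):

import re
import sys
import time
import unicodedata

def ms(func, n=5):
    # Average wall-clock milliseconds per call.
    start = time.time()
    for _ in xrange(n):
        func()
    return (time.time() - start) * 1000 / n

letters = u''.join(unichr(c) for c in xrange(sys.maxunicode + 1)
                   if unicodedata.category(unichr(c)).startswith('L'))
big_src = u'[%s_-]' % letters   # long explicit class
small_src = u'[^\\W\\d]'        # compact, only roughly equivalent
text = u''.join(unichr(c) for c in xrange(sys.maxunicode))

# re.purge() defeats re's pattern cache so each compile is real work.
print 'compile big:   %.1f ms' % ms(lambda: (re.purge(), re.compile(big_src, re.U)))
print 'compile small: %.1f ms' % ms(lambda: (re.purge(), re.compile(small_src, re.U)))
big, small = re.compile(big_src, re.U), re.compile(small_src, re.U)
print 'findall big:   %.1f ms' % ms(lambda: big.findall(text))
print 'findall small: %.1f ms' % ms(lambda: small.findall(text))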

voidspace (Contributor) commented:

    Interestingly, the code olivers is using was originally written by Martin v. Loewis:

    http://www.velocityreviews.com/forums/t646421-unicode-regex-and-hindi-language.html

    In response to a still open bug report on \w in the Python re module:

    http://bugs.python.org/issue1693050

ezio-melotti transferred this issue from another repository on Apr 10, 2022