Large regex handling very slow on Linux #52311
The code in regextest.py (attached) uses a large regex to analyze a piece of text. I have tried this test program on two Macs, using the standard Python distributions. On a MacBook (2.4 GHz dual core, Snow Leopard, Python 2.6.1), it takes 0.08 seconds.

I've also tried it on several Linux machines, all of them running Ubuntu, and they are all extremely slow. The machine I've been testing with now is a 2.0 GHz dual-core machine with Python 2.5.2. A test run of the program takes 97.5 seconds (that's not a typo) on this machine, about 1200 times as long as on the Macs.
I think it's likely that the test program does drastically different things on Linux than it does on OS X:

Python 2.6.4 (r264:75706, Dec 7 2009, 18:45:15)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import regextest
>>> len(regextest.makew())
93455
>>>

Compare to:

Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import regextest
>>> len(regextest.makew())
47672
>>>

If I modify it to use 65535 (sys.maxunicode on OS X) and then run it on Linux, it completes quite quickly.
Results on Linux (Debian Sid) with different Python versions:
It looks like the re engine was optimized in trunk :-)

Note: I replaced stopwatch with time.time() and used 100 iterations instead of 500. The absolute values are not important, only the ratio between the different results.

--

exarkun> I think it's likely that the test program does drastically

sys.maxunicode should be different on the two OSes.
Oops, my benchmark was wrong. It looks like the result depends on sys.maxunicode:

$ python2.4 -c "import sys; print sys.maxunicode"
1114111
$ python2.5 -c "import sys; print sys.maxunicode"
1114111
$ python2.6 -c "import sys; print sys.maxunicode"
1114111
$ ./python -c "import sys; print sys.maxunicode"
65535

The last one (./python) is Python trunk.
So is it reasonable / unavoidable that UCS4 builds should be 1200 times slower at regex handling?
No, but it's probably reasonable / unavoidable that a more complex regex should be some number of times slower than a simpler regex. On Linux, the regex being constructed is more complex. On OS X, it's simpler. The difference comes from the Linux build being UCS4, but only because the unicode character width (sys.maxunicode) is used inside the function that constructs the regular expression. If you take this variable out of that function, so that it returns the same string regardless of the width of a unicode character, then performance evens out.
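To make the point above concrete, here is a minimal sketch of how a character class built by scanning code points grows with the scan limit. regextest.py is not reproduced in this thread, so `letter_class` is a hypothetical stand-in for `makew()`, not the original code, and the choice of the L* and M* categories is an assumption. It is written for Python 3, where `chr` replaces `unichr` and `sys.maxunicode` is always 0x10FFFF, so the two limits are passed explicitly.

```python
import re
import unicodedata


def letter_class(limit):
    """Hypothetical stand-in for makew(): concatenate every code point
    below `limit` whose general category is a letter (L*) or mark (M*)."""
    return "".join(
        chr(cp)
        for cp in range(limit)
        if unicodedata.category(chr(cp))[0] in ("L", "M")
    )


# Narrow-build style limit (BMP only) versus wide-build style limit.
bmp_class = letter_class(0x10000)
full_class = letter_class(0x110000)

# The second class is strictly larger, so the regex built from it is
# a different, more complex pattern than the BMP-only one.
print(len(bmp_class), len(full_class))

# Either string can be dropped into a character class as in the thread.
pattern = re.compile("[%s_-]" % bmp_class)
print(bool(pattern.match("é")))
```

Passing a fixed limit instead of reading `sys.maxunicode` inside the function is what makes the resulting pattern the same on every build.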
A workaround could be to use [^\\W\\d], but this includes some extra chars in the categories Pc, Nl, and No that you may not want. Generating a list of the chars in these three categories and adding them to the regex should be cheaper, though.
This is proof that you can have an equivalent regex without including all the 'letter chars' (tested on both narrow and wide builds):
>>> import re, sys
>>> from regextest import makew
>>> s = u''.join(unichr(c) for c in range(sys.maxunicode))
>>> diff = set(re.findall(u'[^\W\d]', s, re.U)) ^ set(re.findall(u'[%s_-]' % makew(), s, re.U))
>>> diff.remove('-')
>>> re.findall(u'(?:[^\W\d%s]|-)' % ''.join(diff), s, re.U) == re.findall(u'[%s_-]' % makew(), s, re.U)
True

(I don't like the way I included the '-' but I couldn't find anything better.)
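As a rough illustration of which extra characters the short class picks up, the sketch below tallies everything `[^\W\d]` matches that is not in a letter category. It is written for Python 3 (str patterns are Unicode-aware by default, so no re.U flag is needed) and does not use `makew()`; the exact set of extras may differ from what Python 2.6 reports, so treat it as an approximation of the check in the session above rather than a reproduction.

```python
import re
import sys
import unicodedata
from collections import Counter

# Every code point as one string (Python 3: chr replaces unichr).
all_chars = "".join(chr(cp) for cp in range(sys.maxunicode + 1))

# Word characters that are not decimal digits.
matched = set(re.findall(r"[^\W\d]", all_chars))

# Characters the short class matches that are not in a letter category (L*).
extras = {ch for ch in matched if not unicodedata.category(ch).startswith("L")}

# Tally the extras by general category to see what would need excluding.
extras_by_category = Counter(unicodedata.category(ch) for ch in extras)
print(extras_by_category)
```

The characters in `extras` are the ones that would have to be added to the negated class, in the same way `diff` is used in the session above.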
Interestingly, the code olivers is using was originally written by Martin v. Loewis: http://www.velocityreviews.com/forums/t646421-unicode-regex-and-hindi-language.html

In response to a still open bug report on \w in the Python re module: