important performance regression on regular expression parsing #81904

yannvgn · 2019-07-30T22:02:34Z

BPO	37723
Nosy	@ezio-melotti, @serhiy-storchaka, @miss-islington, @miss-islington, @yannvgn
PRs	bpo-37723: fix performance regression on regular expression parsing #15030 [3.8] bpo-37723: Fix performance regression on regular expression parsing. (GH-15030) #15059 [3.7] bpo-37723: Fix performance regression on regular expression parsing. (GH-15030) #15060

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2019-08-04.07:41:20.307>
created_at = <Date 2019-07-30.22:02:34.352>
labels = ['expert-regex', '3.7', '3.8', '3.9', 'performance']
title = 'important performance regression on regular expression parsing'
updated_at = <Date 2019-08-04.07:41:20.299>
user = 'https://github.com/yannvgn'

bugs.python.org fields:

activity = <Date 2019-08-04.07:41:20.299>
actor = 'serhiy.storchaka'
assignee = 'none'
closed = True
closed_date = <Date 2019-08-04.07:41:20.307>
closer = 'serhiy.storchaka'
components = ['Regular Expressions']
creation = <Date 2019-07-30.22:02:34.352>
creator = 'yannvgn'
dependencies = []
files = []
hgrepos = []
issue_num = 37723
keywords = ['patch']
message_count = 10.0
messages = ['348771', '348789', '348813', '348814', '348817', '348821', '348822', '348823', '348824', '348975']
nosy_count = 5.0
nosy_names = ['ezio.melotti', 'mrabarnett', 'serhiy.storchaka', 'miss-islington', 'miss-islington', 'yannvgn']
pr_nums = ['15030', '15059', '15060']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue37723'
versions = ['Python 3.7', 'Python 3.8', 'Python 3.9']

yannvgn · 2019-07-30T22:02:34Z

On complex cases, parsing regular expressions takes much, much longer on Python >= 3.7

Example (ipython):

In [1]: import re
In [2]: char_list = ''.join([chr(i) for i in range(0xffff)])
In [3]: long_char_list = char_list * 10
In [4]: pattern = f'[{re.escape(long_char_list)}]'
In [5]: %time compiled = re.compile(pattern)

The test was run on Amazon Linux AMI 2017.03.

On Python 3.6.1, the regexp compiled in 2.6 seconds:
CPU times: user 2.59 s, sys: 30 ms, total: 2.62 s
Wall time: 2.64 s

On Python 3.7.3, the regexp compiled in 15 minutes (~350x increase in this case):
CPU times: user 15min 6s, sys: 240 ms, total: 15min 7s
Wall time: 15min 9s

Doing some profiling with cProfile shows that the issue is caused by sre_parse._uniq function, which does not exist in Python <= 3.6

The complexity of this function is on average O(N^2) but can be easily reduced to O(N).

The issue might not be noticeable with simple regexps, but programs like text tokenizers - which use complex regexps - might really be impacted by this regression.

serhiy-storchaka · 2019-07-31T07:58:01Z

Indeed, it was not expected that the character set contains hundreds of thousands items. What is its size in your real code?

Could you please show benchmarking results for different implementations and different sizes?

yannvgn · 2019-07-31T16:28:13Z

Indeed, it was not expected that the character set contains hundreds of thousands items. What is its size in your real code?

Could you please show benchmarking results for different implementations and different sizes?

I can't precisely answer that, but sacremoses (a tokenization package) for example is strongly impacted. See https://github.com/alvations/sacremoses/issues/61#issuecomment-516401853

serhiy-storchaka · 2019-07-31T16:32:52Z

Oh, this is convincing.

mrabarnett · 2019-07-31T17:23:32Z

I've just had a look at _uniq, and the code surprises me.

The obvious way to detect duplicates is with a set, but that requires the items to be hashable. Are they?

Well, the first line of the function uses 'set', so they are.

Why, then, isn't it using a set to detect the duplicates?

How about this:

def _uniq(items):
    newitems = []
    seen = set()
    for item in items:
        if item not in seen:
            newitems.append(item)
            seen.add(item)
    return newitems

yannvgn · 2019-07-31T18:45:08Z

Hey Matthew,

we decided to go for this, which is simpler and straightforward:

def _uniq(items):
    return list(dict.fromkeys(items))

(see #15030)

serhiy-storchaka · 2019-07-31T18:51:06Z

New changeset 9f55551 by Serhiy Storchaka (yannvgn) in branch 'master':
bpo-37723: Fix performance regression on regular expression parsing. (GH-15030)
9f55551

miss-islington · 2019-07-31T20:22:38Z

New changeset 33b700b by Miss Islington (bot) in branch '3.7':
bpo-37723: Fix performance regression on regular expression parsing. (GH-15030)
33b700b

miss-islington · 2019-07-31T20:22:38Z

New changeset 77fcccb by Miss Islington (bot) in branch '3.8':
bpo-37723: Fix performance regression on regular expression parsing. (GH-15030)
77fcccb

serhiy-storchaka · 2019-08-04T07:41:20Z

Thank you for your contribution yannvgn.

yannvgn mannequin added 3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes topic-regex performance Performance or resource usage labels Jul 30, 2019

serhiy-storchaka closed this as completed Aug 4, 2019

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

important performance regression on regular expression parsing #81904

important performance regression on regular expression parsing #81904

yannvgn mannequin commented Jul 30, 2019

yannvgn mannequin commented Jul 30, 2019

serhiy-storchaka commented Jul 31, 2019

yannvgn mannequin commented Jul 31, 2019

serhiy-storchaka commented Jul 31, 2019

mrabarnett mannequin commented Jul 31, 2019

yannvgn mannequin commented Jul 31, 2019

serhiy-storchaka commented Jul 31, 2019

miss-islington commented Jul 31, 2019

miss-islington commented Jul 31, 2019

serhiy-storchaka commented Aug 4, 2019

important performance regression on regular expression parsing #81904

important performance regression on regular expression parsing #81904

Comments

yannvgn mannequin commented Jul 30, 2019

yannvgn mannequin commented Jul 30, 2019

serhiy-storchaka commented Jul 31, 2019

yannvgn mannequin commented Jul 31, 2019

serhiy-storchaka commented Jul 31, 2019

mrabarnett mannequin commented Jul 31, 2019

yannvgn mannequin commented Jul 31, 2019

serhiy-storchaka commented Jul 31, 2019

miss-islington commented Jul 31, 2019

miss-islington commented Jul 31, 2019

serhiy-storchaka commented Aug 4, 2019