Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

important performance regression on regular expression parsing #81904

Closed
yannvgn mannequin opened this issue Jul 30, 2019 · 10 comments
Closed

important performance regression on regular expression parsing #81904

yannvgn mannequin opened this issue Jul 30, 2019 · 10 comments
Labels
3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes performance Performance or resource usage topic-regex

Comments

@yannvgn
Copy link
Mannequin

yannvgn mannequin commented Jul 30, 2019

BPO 37723
Nosy @ezio-melotti, @serhiy-storchaka, @miss-islington, @miss-islington, @yannvgn
PRs
  • bpo-37723: fix performance regression on regular expression parsing #15030
  • [3.8] bpo-37723: Fix performance regression on regular expression parsing. (GH-15030) #15059
  • [3.7] bpo-37723: Fix performance regression on regular expression parsing. (GH-15030) #15060
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2019-08-04.07:41:20.307>
    created_at = <Date 2019-07-30.22:02:34.352>
    labels = ['expert-regex', '3.7', '3.8', '3.9', 'performance']
    title = 'important performance regression on regular expression parsing'
    updated_at = <Date 2019-08-04.07:41:20.299>
    user = 'https://github.com/yannvgn'

    bugs.python.org fields:

    activity = <Date 2019-08-04.07:41:20.299>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-08-04.07:41:20.307>
    closer = 'serhiy.storchaka'
    components = ['Regular Expressions']
    creation = <Date 2019-07-30.22:02:34.352>
    creator = 'yannvgn'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 37723
    keywords = ['patch']
    message_count = 10.0
    messages = ['348771', '348789', '348813', '348814', '348817', '348821', '348822', '348823', '348824', '348975']
    nosy_count = 5.0
    nosy_names = ['ezio.melotti', 'mrabarnett', 'serhiy.storchaka', 'miss-islington', 'miss-islington', 'yannvgn']
    pr_nums = ['15030', '15059', '15060']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue37723'
    versions = ['Python 3.7', 'Python 3.8', 'Python 3.9']

    @yannvgn
    Copy link
    Mannequin Author

    yannvgn mannequin commented Jul 30, 2019

    On complex cases, parsing regular expressions takes much, much longer on Python >= 3.7

    Example (ipython):

    In [1]: import re
    In [2]: char_list = ''.join([chr(i) for i in range(0xffff)])
    In [3]: long_char_list = char_list * 10
    In [4]: pattern = f'[{re.escape(long_char_list)}]'
    In [5]: %time compiled = re.compile(pattern)

    The test was run on Amazon Linux AMI 2017.03.

    On Python 3.6.1, the regexp compiled in 2.6 seconds:
    CPU times: user 2.59 s, sys: 30 ms, total: 2.62 s
    Wall time: 2.64 s

    On Python 3.7.3, the regexp compiled in 15 minutes (~350x increase in this case):
    CPU times: user 15min 6s, sys: 240 ms, total: 15min 7s
    Wall time: 15min 9s

    Doing some profiling with cProfile shows that the issue is caused by sre_parse._uniq function, which does not exist in Python <= 3.6

    The complexity of this function is on average O(N^2) but can be easily reduced to O(N).

    The issue might not be noticeable with simple regexps, but programs like text tokenizers - which use complex regexps - might really be impacted by this regression.

    @yannvgn yannvgn mannequin added 3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes topic-regex performance Performance or resource usage labels Jul 30, 2019
    @serhiy-storchaka
    Copy link
    Member

    Indeed, it was not expected that the character set contains hundreds of thousands items. What is its size in your real code?

    Could you please show benchmarking results for different implementations and different sizes?

    @yannvgn
    Copy link
    Mannequin Author

    yannvgn mannequin commented Jul 31, 2019

    Indeed, it was not expected that the character set contains hundreds of thousands items. What is its size in your real code?

    Could you please show benchmarking results for different implementations and different sizes?

    I can't precisely answer that, but sacremoses (a tokenization package) for example is strongly impacted. See https://github.com/alvations/sacremoses/issues/61#issuecomment-516401853

    @serhiy-storchaka
    Copy link
    Member

    Oh, this is convincing.

    @mrabarnett
    Copy link
    Mannequin

    mrabarnett mannequin commented Jul 31, 2019

    I've just had a look at _uniq, and the code surprises me.

    The obvious way to detect duplicates is with a set, but that requires the items to be hashable. Are they?

    Well, the first line of the function uses 'set', so they are.

    Why, then, isn't it using a set to detect the duplicates?

    How about this:

    def _uniq(items):
        newitems = []
        seen = set()
        for item in items:
            if item not in seen:
                newitems.append(item)
                seen.add(item)
        return newitems

    @yannvgn
    Copy link
    Mannequin Author

    yannvgn mannequin commented Jul 31, 2019

    Hey Matthew,

    we decided to go for this, which is simpler and straightforward:

    def _uniq(items):
        return list(dict.fromkeys(items))

    (see #15030)

    @serhiy-storchaka
    Copy link
    Member

    New changeset 9f55551 by Serhiy Storchaka (yannvgn) in branch 'master':
    bpo-37723: Fix performance regression on regular expression parsing. (GH-15030)
    9f55551

    @miss-islington
    Copy link
    Contributor

    New changeset 33b700b by Miss Islington (bot) in branch '3.7':
    bpo-37723: Fix performance regression on regular expression parsing. (GH-15030)
    33b700b

    @miss-islington
    Copy link
    Contributor

    New changeset 77fcccb by Miss Islington (bot) in branch '3.8':
    bpo-37723: Fix performance regression on regular expression parsing. (GH-15030)
    77fcccb

    @serhiy-storchaka
    Copy link
    Member

    Thank you for your contribution yannvgn.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes performance Performance or resource usage topic-regex
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants