Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use optimized string search function in mmap.find() #91004

Closed
rumpelsepp mannequin opened this issue Feb 24, 2022 · 7 comments
Closed

Use optimized string search function in mmap.find() #91004

rumpelsepp mannequin opened this issue Feb 24, 2022 · 7 comments
Labels
performance Performance or resource usage

Comments

@rumpelsepp
Copy link
Mannequin

rumpelsepp mannequin commented Feb 24, 2022

BPO 46848
Nosy @vstinner, @rumpelsepp, @sweeneyde
PRs
  • bpo-46848: Optimize mmap.find() with memmem(3) #31554
  • bpo-46848: Use stringlib/fastsearch in mmap #31625
  • bpo-46848: Move _PyBytes_Find() to internal C API #31642
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2022-03-02.04:49:27.432>
    created_at = <Date 2022-02-24.12:43:50.418>
    labels = ['performance']
    title = 'Use optimized string search function in mmap.find()'
    updated_at = <Date 2022-03-02.13:15:30.865>
    user = 'https://github.com/rumpelsepp'

    bugs.python.org fields:

    activity = <Date 2022-03-02.13:15:30.865>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2022-03-02.04:49:27.432>
    closer = 'Dennis Sweeney'
    components = []
    creation = <Date 2022-02-24.12:43:50.418>
    creator = 'rumpelsepp'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 46848
    keywords = ['patch']
    message_count = 7.0
    messages = ['413902', '413903', '413968', '414237', '414326', '414327', '414342']
    nosy_count = 4.0
    nosy_names = ['vstinner', 'benrg', 'rumpelsepp', 'Dennis Sweeney']
    pr_nums = ['31554', '31625', '31642']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue46848'
    versions = []

    @rumpelsepp
    Copy link
    Mannequin Author

    rumpelsepp mannequin commented Feb 24, 2022

    The mmap.find() in function uses a naive loop to search string matches. This can be optimized “for free” by using libc's memmap(3) function instead.

    The relevant file is Modules/mmapmodule.c, the relevant function is mmap_gfind().

    @rumpelsepp
    Copy link
    Mannequin Author

    rumpelsepp mannequin commented Feb 24, 2022

    Sorry, I mean memmem(3). :)

    @benrg
    Copy link
    Mannequin

    benrg mannequin commented Feb 25, 2022

    memmem isn't a standard C function, and some libraries don't have it, notably Microsoft's.

    newlib's memmem seems to be the same as glibc's, but is under a BSD 3-clause license instead of LGPL. An older version of newlib's memmem (prior to 2019-01-01) has the license "Permission to use, copy, modify, and distribute this software is freely granted, provided that this notice is preserved", and is still highly optimized and much better than a naive implementation.

    Of course, bundling it would no longer be quite so "free".

    Old newlib memmem: https://sourceware.org/git/?p=newlib-cygwin.git;a=blob_plain;f=newlib/libc/string/memmem.c;h=25704e467decff5971b34f4189ddfff04ac5fa8e

    New newlib memmem: https://sourceware.org/git/?p=newlib-cygwin.git;a=blob_plain;f=newlib/libc/string/memmem.c

    Helper file for both: https://sourceware.org/git/?p=newlib-cygwin.git;a=blob_plain;f=newlib/libc/string/str-two-way.h

    @sweeneyde
    Copy link
    Member

    PR 31625 is an alternative proposal.

    It uses the Crochemore and Perrin's Two-Way algorithm that @benrg references (see Objects/stringlib/fastsearch.h and Objects/stringlib/stringlib_find_two_way_notes.txt), and is platform-independent.

    @sweeneyde
    Copy link
    Member

    New changeset 6ddb09f by Dennis Sweeney in branch 'main':
    bpo-46848: Use stringlib/fastsearch in mmap (GH-31625)
    6ddb09f

    @sweeneyde
    Copy link
    Member

    Thanks for the report!

    @sweeneyde sweeneyde added the performance Performance or resource usage label Mar 2, 2022
    @vstinner
    Copy link
    Member

    vstinner commented Mar 2, 2022

    New changeset b6b711a by Victor Stinner in branch 'main':
    bpo-46848: Move _PyBytes_Find() to internal C API (GH-31642)
    b6b711a

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    performance Performance or resource usage
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants