
Faster implementation to collapse non-consecutive ip-addresses #67455

Closed
cmn mannequin opened this issue Jan 18, 2015 · 15 comments
Labels
performance Performance or resource usage stdlib Python modules in the Lib dir

Comments

cmn mannequin commented Jan 18, 2015

BPO 23266
Nosy @pitrou, @serhiy-storchaka
Files
  • ipaddress_collapse_non_consecutive_performance.diff: The patch on ipaddress.py
  • testrig.tar.gz: test rig for performance eval
  • collapse.patch
  • ipaddress_faster_collapse3.patch
  • ipaddress_faster_collapse4.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2015-01-18.22:43:28.773>
    created_at = <Date 2015-01-18.12:32:34.139>
    labels = ['library', 'performance']
    title = 'Faster implementation to collapse non-consecutive ip-addresses'
    updated_at = <Date 2015-01-20.23:48:27.416>
    user = 'https://bugs.python.org/cmn'

    bugs.python.org fields:

    activity = <Date 2015-01-20.23:48:27.416>
    actor = 'cmn'
    assignee = 'none'
    closed = True
    closed_date = <Date 2015-01-18.22:43:28.773>
    closer = 'serhiy.storchaka'
    components = ['Library (Lib)']
    creation = <Date 2015-01-18.12:32:34.139>
    creator = 'cmn'
    dependencies = []
    files = ['37762', '37763', '37764', '37768', '37794']
    hgrepos = []
    issue_num = 23266
    keywords = ['patch']
    message_count = 15.0
    messages = ['234239', '234240', '234241', '234242', '234245', '234253', '234254', '234282', '234283', '234286', '234287', '234288', '234382', '234389', '234410']
    nosy_count = 5.0
    nosy_names = ['pitrou', 'pmoody', 'python-dev', 'serhiy.storchaka', 'cmn']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue23266'
    versions = ['Python 3.5']


    cmn mannequin commented Jan 18, 2015

    I found the code used to collapse addresses to be very slow on a large number (64k) of island addresses which are not collapsible.

    The code at the

    # sort and dedup

    step was found to be the culprit, especially the index lookup.
    The patch changes the code to discard the index lookup and have _find_address_range return the number of items consumed.
    That way the set operation to dedup the addresses can be dropped as well.
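The idea can be sketched roughly as follows. This is not the actual stdlib patch; the helper name `find_run_end` is hypothetical, standing in for the patched `_find_address_range` returning how many items it consumed, so the caller can jump past a whole run instead of doing an index lookup per element:

```python
import ipaddress

def find_run_end(addresses, start):
    """Return one past the last address consecutive with addresses[start].

    Assumes `addresses` is sorted and duplicate-free. Hypothetical helper
    illustrating the "return number of items consumed" approach.
    """
    i = start
    while i + 1 < len(addresses) and addresses[i + 1] == addresses[i] + 1:
        i += 1
    return i + 1

def collapse_islands(addresses):
    """Collapse addresses into (first, last) runs in a single pass."""
    addresses = sorted(addresses)
    runs = []
    i = 0
    while i < len(addresses):
        end = find_run_end(addresses, i)
        runs.append((addresses[i], addresses[end - 1]))
        i = end  # jump past the whole run: no index() lookup per element
    return runs
```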

    Numbers from the testrig I adapted from http://bugs.python.org/issue20826 with 8k non-consecutive addresses:

    Execution time: 0.6893927365541458 seconds
    vs.
    Execution time: 12.116527611762285 seconds

    Kind regards,
    Markus Kötter

    @cmn cmn mannequin added stdlib Python modules in the Lib dir performance Performance or resource usage labels Jan 18, 2015

    cmn mannequin commented Jan 18, 2015

    Added the testrig.


    pitrou commented Jan 18, 2015

    This is great, thank you. Can you sign the contributor's agreement?
    https://www.python.org/psf/contrib/contrib-form/


    pitrou commented Jan 18, 2015

    Here is an updated patch with a fix to the tests and docstrings.


    cmn mannequin commented Jan 18, 2015

    I just signed the agreement, ewa@ is processing it.


    python-dev mannequin commented Jan 18, 2015

    New changeset f7508a176a09 by Antoine Pitrou in branch 'default':
    Issue bpo-23266: Much faster implementation of ipaddress.collapse_addresses() when there are many non-consecutive addresses.
    https://hg.python.org/cpython/rev/f7508a176a09


    pitrou commented Jan 18, 2015

    Ok, I've committed the patch. Thank you!

    @pitrou pitrou closed this as completed Jan 18, 2015

    serhiy-storchaka commented Jan 18, 2015

    Deduplication should not have been omitted; dropping it slowed down collapsing of duplicated addresses.

    $ ./python -m timeit -s "import ipaddress; ips = [ipaddress.ip_address('2001:db8::1000') for i in range(1000)]" -- "ipaddress.collapse_addresses(ips)"

    Before f7508a176a09:
    100 loops, best of 3: 13.4 msec per loop

    After f7508a176a09:
    10 loops, best of 3: 129 msec per loop

    Proposed patch restores performance for duplicated addresses and simplifies the code using generators.
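The generator-based approach can be illustrated with a minimal sketch (not the actual patch; `dedup_sorted` is a hypothetical name). Since sorting makes duplicates adjacent, a single remembered value is enough to skip them, without building a separate set:

```python
import ipaddress

def dedup_sorted(addresses):
    """Yield each address of a sorted iterable once, skipping adjacent duplicates.

    One remembered element replaces the extra O(n) memory of set(addresses),
    because sorted input guarantees duplicates sit next to each other.
    """
    last = None
    for addr in addresses:
        if addr != last:
            yield addr
        last = addr
```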


    pitrou commented Jan 18, 2015

    Good catch. What is the performance on the benchmark posted here?


    serhiy-storchaka commented Jan 18, 2015

    What is the performance on the benchmark posted here?

    The same as with current code.


    pitrou commented Jan 18, 2015

    Then +1. The patch looks fine to me.


    python-dev mannequin commented Jan 18, 2015

    New changeset 021b23a40f9f by Serhiy Storchaka in branch 'default':
    Issue bpo-23266: Restore the performance of ipaddress.collapse_addresses() with
    https://hg.python.org/cpython/rev/021b23a40f9f


    cmn mannequin commented Jan 20, 2015

    My initial patch was wrong w.r.t. _find_address_range: it did not loop over equal addresses.
    That's why performance with many equal addresses degraded when the set() was dropped.

    Here is a patch to fix _find_address_range, drop the set, and improve performance again.
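The nature of the fix can be shown with a hypothetical helper (again not the actual `_find_address_range`): comparing with `<=` instead of `==` makes a run also swallow equal addresses, so duplicates are consumed in the same pass and no separate set() is needed:

```python
import ipaddress

def find_run_end(addresses, start):
    """Return one past the last address equal to or consecutive with the run
    beginning at addresses[start] (input must be sorted).

    The `<=` also advances past duplicates; with `==` an equal address would
    end the run early and be re-scanned, which is the performance bug the
    fix addresses.
    """
    i = start
    while i + 1 < len(addresses) and addresses[i + 1] <= addresses[i] + 1:
        i += 1
    return i + 1
```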

    python3 -m timeit -s "import bipaddress; ips = [bipaddress.ip_address('2001:db8::1000') for i in range(1000)]" -- "bipaddress.collapse_addresses(ips)"
    1000 loops, best of 3: 1.76 msec per loop

    python3 -m timeit -s "import aipaddress; ips = [aipaddress.ip_address('2001:db8::1000') for i in range(1000)]" -- "aipaddress.collapse_addresses(ips)"
    1000 loops, best of 3: 1.32 msec per loop


    serhiy-storchaka commented Jan 20, 2015

    A single duplicated address is a degenerate case. When there are a lot of duplicated addresses in a range, the patch causes a regression.

    $ ./python -m timeit -s "import ipaddress; ips = [ipaddress.ip_address('2001:db8::%x' % (i%100)) for i in range(100000)]" -- "ipaddress.collapse_addresses(ips)"

    Unpatched: 10 loops, best of 3: 369 msec per loop
    Patched: 10 loops, best of 3: 1.04 sec per loop


    cmn mannequin commented Jan 20, 2015

    Eliminating duplicates before processing is faster once the overhead of the set operation is less than the time required to sort the larger dataset with duplicates.

    So we are basically comparing sort(data) to sort(set(data)).
    The optimum depends on the input data.

    python3 -m timeit -s "import random; import bipaddress; ips = [bipaddress.ip_address('2001:db8::') + i for i in range(100000)]; random.shuffle(ips)" -- "bipaddress.collapse_addresses(ips)"

    10 loops, best of 3: 1.49 sec per loop
    vs.
    10 loops, best of 3: 1.59 sec per loop

    If the data is pre-sorted, which is possible if you retrieve it from a database, things are drastically different:

    python3 -m timeit -s "import random; import bipaddress; ips = [bipaddress.ip_address('2001:db8::') + i for i in range(100000)]; " -- "bipaddress.collapse_addresses(ips)"
    10 loops, best of 3: 136 msec per loop
    vs
    10 loops, best of 3: 1.57 sec per loop

    So for my use case, with less than 0.1% duplicates (if any), dropping the set would be better, but other use cases exist.

    Still, it is easy to "emulate" sorted(set()) from a user's perspective: just call collapse_addresses(set(data)) when you expect duplicates, and you get the speedup from feeding in unique, possibly even pre-sorted, data.

    On the other hand, if you have a huge load of 99.99% sorted, non-collapsible addresses, there is no way for the user to drop the set() operation inside sorted(set()), no way to speed things up, and the slowdown is 10x.

    That said, I'd drop the set().
    Optimization depends on the input data, and dropping the set() lets users optimize based on the nature of their data.
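The call-site dedup described in this thread needs only the public API; a minimal example (the addresses are illustrative):

```python
import ipaddress

# 1000 copies of the same address: worst case for collapsing without dedup.
addrs = [ipaddress.ip_address('2001:db8::1000') for _ in range(1000)]

# Deduplicate at the call site when duplicates are expected;
# collapse_addresses accepts any iterable of addresses.
nets = list(ipaddress.collapse_addresses(set(addrs)))
```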

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022