Faster implementation to collapse non-consecutive ip-addresses #67455
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
assignee = None closed_at = <Date 2015-01-18.22:43:28.773> created_at = <Date 2015-01-18.12:32:34.139> labels = ['library', 'performance'] title = 'Faster implementation to collapse non-consecutive ip-addresses' updated_at = <Date 2015-01-20.23:48:27.416> user = 'https://bugs.python.org/cmn'
activity = <Date 2015-01-20.23:48:27.416> actor = 'cmn' assignee = 'none' closed = True closed_date = <Date 2015-01-18.22:43:28.773> closer = 'serhiy.storchaka' components = ['Library (Lib)'] creation = <Date 2015-01-18.12:32:34.139> creator = 'cmn' dependencies =  files = ['37762', '37763', '37764', '37768', '37794'] hgrepos =  issue_num = 23266 keywords = ['patch'] message_count = 15.0 messages = ['234239', '234240', '234241', '234242', '234245', '234253', '234254', '234282', '234283', '234286', '234287', '234288', '234382', '234389', '234410'] nosy_count = 5.0 nosy_names = ['pitrou', 'pmoody', 'python-dev', 'serhiy.storchaka', 'cmn'] pr_nums =  priority = 'normal' resolution = 'fixed' stage = 'resolved' status = 'closed' superseder = None type = 'performance' url = 'https://bugs.python.org/issue23266' versions = ['Python 3.5']
I found the code used to collapse addresses to be very slow on a large number (64k) of island addresses which are not collapsible.
The code at
was found to be guilty, especially the index lookup.
The patch changes the code to discard the index lookup and have _find_address_range return the number of items consumed.
That way the set operation to dedup the addresses can be dropped as well.
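The idea of returning the number of items consumed can be illustrated with a minimal sketch. This is not the patch itself: it works on plain sorted integers instead of address objects, and the names find_run and runs are hypothetical. Duplicates are absorbed into the current run, which is what makes a separate set() pass unnecessary in this simplified form.

```python
def find_run(values, start):
    """Return (first, last, consumed) for the run beginning at values[start].

    `values` must be sorted. A run is extended while the next value is
    either a duplicate of, or directly follows, the current last value.
    """
    first = last = values[start]
    i = start + 1
    while i < len(values) and values[i] <= last + 1:
        last = values[i]
        i += 1
    return first, last, i - start


def runs(values):
    """Yield (first, last) pairs, advancing by the consumed count
    instead of looking up the position of `last` with list.index()."""
    i = 0
    while i < len(values):
        first, last, consumed = find_run(values, i)
        yield first, last
        i += consumed


print(list(runs([1, 2, 3, 3, 5, 7, 8])))  # [(1, 3), (5, 5), (7, 8)]
```

Advancing by the consumed count keeps the whole scan linear, whereas an index lookup per run rescans the list from the start each time.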
Numbers from the test rig I adapted from http://bugs.python.org/issue20826 with 8k non-consecutive addresses:
Execution time: 0.6893927365541458 seconds
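For readers who want to reproduce a measurement of this kind, here is a rough sketch. It is not the test rig from the issue; the helper names make_islands and time_collapse are hypothetical, and spacing the addresses two apart is just one assumed way to get non-consecutive "island" addresses.

```python
import ipaddress
import timeit


def make_islands(count):
    """Build `count` IPv6 addresses spaced two apart, so no two are adjacent."""
    base = ipaddress.ip_address('2001:db8::')
    return [base + 2 * i for i in range(count)]


def time_collapse(addresses, repeat=3):
    """Return the best wall-clock time for fully collapsing `addresses`."""
    return min(timeit.repeat(
        lambda: list(ipaddress.collapse_addresses(addresses)),
        number=1, repeat=repeat))


islands = make_islands(8192)
print("Execution time: %s seconds" % time_collapse(islands))
```

The result is wrapped in list() so that the work is actually performed even if collapse_addresses returns a lazy iterator.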
Deduplication should not be omitted. Omitting it slows down the collapsing of duplicated addresses.
$ ./python -m timeit -s "import ipaddress; ips = [ipaddress.ip_address('2001:db8::1000') for i in range(1000)]" -- "ipaddress.collapse_addresses(ips)"
The proposed patch restores performance for duplicated addresses and simplifies the code by using generators.
My initial patch was wrong wrt. _find_address_range.
Here is a patch to fix _find_address_range, drop the set, and improve performance again.
python3 -m timeit -s "import bipaddress; ips = [bipaddress.ip_address('2001:db8::1000') for i in range(1000)]" -- "bipaddress.collapse_addresses(ips)"
python3 -m timeit -s "import aipaddress; ips = [aipaddress.ip_address('2001:db8::1000') for i in range(1000)]" -- "aipaddress.collapse_addresses(ips)"
A single duplicated address is a degenerate case. When there are a lot of duplicated addresses in a range, the patch causes a regression.
$ ./python -m timeit -s "import ipaddress; ips = [ipaddress.ip_address('2001:db8::%x' % (i%100)) for i in range(100000)]" -- "ipaddress.collapse_addresses(ips)"
Unpatched: 10 loops, best of 3: 369 msec per loop
Eliminating duplicates before processing is faster once the overhead of the set operation is less than the time required to sort the larger dataset with duplicates.
So we are basically comparing sort(data) to sort(set(data)).
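The trade-off can be measured in isolation with a small sketch like the following. It is an illustration under assumptions: plain integers stand in for address objects (whose hashing and comparison are more expensive), and compare_sorts is a hypothetical helper. Timings depend on the machine, so no expected numbers are given.

```python
import random
import timeit


def compare_sorts(data, number=5):
    """Return (seconds for sorted(data), seconds for sorted(set(data)))."""
    plain = timeit.timeit(lambda: sorted(data), number=number)
    deduped = timeit.timeit(lambda: sorted(set(data)), number=number)
    return plain, deduped


random.seed(0)
# Shuffled values with no duplicates: set() is pure overhead here.
mostly_unique = random.sample(range(10**6), 100000)
# Only 100 distinct values: set() shrinks the input to sort drastically.
many_duplicates = [i % 100 for i in range(100000)]

for name, data in (("mostly unique", mostly_unique),
                   ("many duplicates", many_duplicates)):
    plain, deduped = compare_sorts(data)
    print("%-16s sorted(data): %.4fs  sorted(set(data)): %.4fs"
          % (name, plain, deduped))
```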
python3 -m timeit -s "import random; import bipaddress; ips = [bipaddress.ip_address('2001:db8::') + i for i in range(100000)]; random.shuffle(ips)" -- "bipaddress.collapse_addresses(ips)"
10 loops, best of 3: 1.49 sec per loop
If the data is pre-sorted, which is possible if you retrieve it from a database, things are drastically different:
python3 -m timeit -s "import random; import bipaddress; ips = [bipaddress.ip_address('2001:db8::') + i for i in range(100000)]; " -- "bipaddress.collapse_addresses(ips)"
So for my use case, where I have fewer than 0.1% duplicates (if any), dropping the set would be better, but other use cases will exist.
Still, it is easy to "emulate" the use of "sorted(set())" from a user's perspective: just call collapse_addresses(set(data)) if you expect to have duplicates, and get a speedup by passing in unique, possibly even sorted, data.
On the other hand, if you have a huge load of 99.99% sorted, non-collapsible addresses, there is no way to drop the set() operation from sorted(set()) from a user's perspective, so nothing can be done to speed things up, and the slowdown you get is 10x.
That said, I'd drop the set().
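The caller-side deduplication mentioned above can be sketched as follows, using the public ipaddress API. The sample addresses are made up for illustration.

```python
import ipaddress

data = [
    ipaddress.ip_address('192.0.2.0'),
    ipaddress.ip_address('192.0.2.1'),
    ipaddress.ip_address('192.0.2.1'),  # duplicate entry
]

# Deduplicate on the caller's side before collapsing.
collapsed = list(ipaddress.collapse_addresses(set(data)))
print(collapsed)  # [IPv4Network('192.0.2.0/31')]
```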