Skip to content

Conversation

@groutr
Copy link
Contributor

@groutr groutr commented Oct 18, 2018

Dissoc was really fast when the keys to remove were relatively few
compared to the size of the dictionary. This version will pick the smaller
container to iterate over, either building up a new dictionary or tearing
down a copy.

A small heuristic is used to decide if the overhead of building a set is
worth the effort. My rudimentry testing shows that if the number of keys is
less than 60% the length of the dictionary, tearing down a copy is faster.
If keys is very large, we calculate the difference between mapping and keys
and build a new dictionary out of that difference.

This PR trys to change the signature of dissoc in a backwards compatible way by adding the factory keyword argument. This makes the function consistent with the call signatures of all the other functions that mutate mappings. This change was required otherwise, we'd be unable to initialize a new empty mapping from a factory. If this is undesirable, we can resort to using d.__class__() as the factory.

Microbenchmarks coming soon...

Do you have any feedback, @eriknw?

Dissoc was really fast when the keys to remove were relatively few
compared to the size of the dictionary.  This version will pick the smaller
container to iterate over, either building up a new dictionary or tearing
down a copy.

A small heuristic is used to decide if the overhead of building a set is
worth the effort.  My rudimentry testing shows that if the number of keys is
less than 60% the length of the dictionary, tearing down a copy is faster.
If keys is very large, we calculate the intersection between keys and mapping
and build a new dictionary out of that intersection.
@groutr
Copy link
Contributor Author

groutr commented Oct 18, 2018

The tests failing indicate that the addition of the factory keyword argument isn't totally backwards compatible.

Looks like it might have been forgotten.
@groutr
Copy link
Contributor Author

groutr commented Oct 19, 2018

For these benchmarks, I tried to exercise the behavior of the the loops and conditionals inside those loops. The conditional inside the original loop should evaluate false 50% of the time. The range removed also exceeds the size of the container to exercise the new heuristic and set-difference method vs the original looping method. It looks like there could be room for improvement in the heuristic to avoid any irregular jumps in runtimes.

Using this loop:

for i in range(1, len(<container>)*2, (len(<container>)*2)//10):
    keys = list(range(i))
    print "Removing {} keys".format(len(keys))
    %timeit dissoc(<container>, *keys)

Results

a = dict(zip(range(0, 10*2, 2), range(0, 10*2, 2)))

# old dissoc
Removing 1 keys
1000000 loops, best of 3: 1.16 µs per loop
Removing 5 keys
1000000 loops, best of 3: 1.41 µs per loop
Removing 9 keys
1000000 loops, best of 3: 1.6 µs per loop
Removing 13 keys
1000000 loops, best of 3: 1.81 µs per loop
Removing 17 keys
1000000 loops, best of 3: 2 µs per loop

# new dissoc
Removing 1 keys
1000000 loops, best of 3: 1.17 µs per loop
Removing 5 keys
1000000 loops, best of 3: 1.39 µs per loop
Removing 9 keys
1000000 loops, best of 3: 1.83 µs per loop
Removing 13 keys
1000000 loops, best of 3: 1.86 µs per loop
Removing 17 keys
1000000 loops, best of 3: 1.8 µs per loop


b = dict(zip(range(0, 1000*2, 2), range(0, 1000*2, 2)))

# old dissoc
Removing 1 keys
100000 loops, best of 3: 17 µs per loop
Removing 401 keys
10000 loops, best of 3: 37.7 µs per loop
Removing 801 keys
10000 loops, best of 3: 63.4 µs per loop
Removing 1201 keys
10000 loops, best of 3: 89.7 µs per loop
Removing 1601 keys
10000 loops, best of 3: 117 µs per loop


# new dissoc
Removing 1 keys
100000 loops, best of 3: 16.8 µs per loop
Removing 401 keys
10000 loops, best of 3: 38.4 µs per loop
Removing 801 keys
10000 loops, best of 3: 81.3 µs per loop
Removing 1201 keys
10000 loops, best of 3: 84.8 µs per loop
Removing 1601 keys
10000 loops, best of 3: 78.9 µs per loop


c = dict(zip(range(0, 10000*2, 2), range(0, 10000*2, 2)))

# old dissoc
Removing 1 keys
10000 loops, best of 3: 184 µs per loop
Removing 4001 keys
1000 loops, best of 3: 443 µs per loop
Removing 8001 keys
1000 loops, best of 3: 716 µs per loop
Removing 12001 keys
1000 loops, best of 3: 992 µs per loop
Removing 16001 keys
1000 loops, best of 3: 1.29 ms per loop

# new dissoc
Removing 1 keys
10000 loops, best of 3: 187 µs per loop
Removing 4001 keys
1000 loops, best of 3: 443 µs per loop
Removing 8001 keys
1000 loops, best of 3: 922 µs per loop
Removing 12001 keys
1000 loops, best of 3: 843 µs per loop
Removing 16001 keys
1000 loops, best of 3: 846 µs per loop


d = dict(zip(range(0, 100000*2, 2), range(0, 100000*2, 2)))

#old dissoc
Removing 1 keys
100 loops, best of 3: 2.87 ms per loop
Removing 40001 keys
100 loops, best of 3: 5.84 ms per loop
Removing 80001 keys
100 loops, best of 3: 9.04 ms per loop
Removing 120001 keys
100 loops, best of 3: 12.3 ms per loop
Removing 160001 keys
100 loops, best of 3: 15.5 ms per loop

# new dissoc
Removing 1 keys
100 loops, best of 3: 2.84 ms per loop
Removing 40001 keys
100 loops, best of 3: 5.87 ms per loop
Removing 80001 keys
100 loops, best of 3: 10.3 ms per loop
Removing 120001 keys
100 loops, best of 3: 11.4 ms per loop
Removing 160001 keys
100 loops, best of 3: 10.9 ms per loop

@eriknw
Copy link
Member

eriknw commented Nov 13, 2018

LGTM, thanks @groutr!

Looks like the tests need updated to use the factory= keyword argument. I understand this is backwards incompatible, but it's consistent with using factory= elsewhere, so I'm okay with it.

@groutr
Copy link
Contributor Author

groutr commented Nov 16, 2018

@eriknw, thanks. I'll update the tests.

Computing the remaining keys benefits a lot from being able to use mapping views (which are set-like). In Python 3, custom mapping classes automatically get these set-like views. Does toolz anticipate dropping Python 2 support in 2020?

@groutr
Copy link
Contributor Author

groutr commented Dec 6, 2018

ping @eriknw. This is ready for review.

@eriknw eriknw merged commit 1b23ada into pytoolz:master Jul 9, 2019
@eriknw
Copy link
Member

eriknw commented Jul 9, 2019

Thanks @groutr, sorry this took so long to get in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants