Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for OrderedDicts #232

Merged
merged 19 commits into from Mar 27, 2015
Merged

Add support for OrderedDicts #232

merged 19 commits into from Mar 27, 2015

Conversation

bartvm
Copy link
Contributor

@bartvm bartvm commented Mar 12, 2015

I like the Dicttoolz package, but for many of my use cases I need the deterministic behaviour of OrderedDict. Adapting Dicttoolz to return OrderedDict if all of its inputs are one is relatively straightforward. Is this something you would consider merging?

@mrocklin
Copy link
Member

Things like using type(d) instead of dict seem innocuous and should definitely go in. The other changes also seem fairly respectful. The two concerns that I would have are

  1. How much does this impact common case performance? It'd be good to see a couple of reassuring benchmarks
  2. How costly is it to implement this in cytoolz?

@mrocklin
Copy link
Member

Also thanks, this will be cool if we can make it work. I often bemoan the lack of universal support for OrderedDicts in the ecosystem. Never occurred to me to improve toolz in this way.

@bartvm
Copy link
Contributor Author

bartvm commented Mar 12, 2015

Great! I'll run some benchmarks tomorrow. I can't think of anything that I changed which would negatively affect performance too much. OrderedDict can be quite a bit slower than dict, but I switch to using dict in merge and merge_with as soon as it is clear that one of the inputs isn't an OrderedDict.

I guess there is a strange corner case which could be slow: If the first inputs to merge/merge_with are very large OrderedDict instances followed by a normal dict, the switch to using dict means copying all of the data so far, which could be slow and memory-intensive. I guess that's a pretty unlikely scenario though.

I haven't looked at cytoolz and am not too skilled in Cython, but I'd be happy to try and help to make it work if I can.

@bartvm
Copy link
Contributor Author

bartvm commented Mar 12, 2015

Not very scientific or exhaustive benchmarks, but gives an idea.

Things seem to be okay when using type(d) instead of dict:

In [1]: ordered_d = OrderedDict((i, i + 1) for i in range(10000))

In [2]: d = {i: i + 1 for i in range(10000)}

In [9]: %timeit valmap(lambda x: x * 2, d)
1000 loops, best of 3: 1.41 ms per loop

In [10]: %timeit valmap(lambda x: x * 2, ordered_d)
100 loops, best of 3: 3.93 ms per loop

In [11]: %timeit ordered_valmap(lambda x: x * 2, d)
1000 loops, best of 3: 1.42 ms per loop

In [12]: %timeit ordered_valmap(lambda x: x * 2, ordered_d)
100 loops, best of 3: 3.94 ms per loop

I made a small change, and now small number of dictionaries the overhead of converting the intermediary result from OrderedDict to dict in merge and merge_with is noticeable (about 2.5%), but for large number of dictionaries you can't tell:

In [11]: %timeit merge(d, d)
1000 loops, best of 3: 277 µs per loop

In [12]: %timeit ordered_merge(d, d)
1000 loops, best of 3: 284 µs per loop

In [13]: many_d = [d.copy() for _ in range(1000)]

In [7]: %timeit merge(many_d)
10 loops, best of 3: 128 ms per loop

In [10]: %timeit ordered_merge(many_d)
10 loops, best of 3: 128 ms per loop

The corner case I mentioned has a large effect on performance, but it seems pretty artificial:

In [14]: mix_d = [ordered_d.copy() for _ in range(100)] + [d]

In [15]: %timeit merge(mix_d)
100 loops, best of 3: 13.2 ms per loop

In [16]: %timeit ordered_merge(mix_d)
1 loops, best of 3: 552 ms per loop

@eriknw
Copy link
Member

eriknw commented Mar 12, 2015

Interesting. Regarding impact to cytoolz, it is very fast to check whether an item is a dict and not a subclass of dict via PyDict_CheckExact. There are actually a few places where we can speed up iteration over pure dicts in Cython that we have not yet done. Whether it's worth branching based on type (dict or not dict) is yet to be determined, but it's something we can do with minimal impact on performance with pure dicts.

if return_ordered_dict and not isinstance(d, OrderedDict):
result = dict(result)
dict_ = dict
return_ordered_dict = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this logic can be pulled out of the loop into a separate loop

if all(isinstance(d, OrderedDict) for d in dicts):
    result = OrderedDict()
else:
    result = dict()

This does walk through the dicts twice, but I suspect that this is cheap. It also requires that we make dicts concrete and not lazy, but this was already the case due to how we implement merge_with, which brings everything in to memory anyway.

If this doesn't significantly impact performance then it might be preferred for code simplicity's sake

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I didn't do that is because if dicts is an iterator it ends up being exhausted before the loop ever starts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion in that case is to actually make dicts concrete by calling list on it. This would be bad if we benfitted by laziness but, due to the current implementation of merge_with we're not lazy anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, read that too quickly, sorry! Good point, I'll change it.

@mrocklin
Copy link
Member

This should probably have some tests to ensure that OrderedDicts emerge in the appropriate cases.

As pointed out by @mrocklin, the function already loaded all
dictionaries into memory anyway. Timing before:

d = [{i: i + 1 for i in range(10)} for j in range(10000)]
%timeit merge_with(d)
100000 loops, best of 3: 11.2 µs per loop

And after

%timeit merge_with(d)
100000 loops, best of 3: 11.7 µs per loop
@@ -175,7 +190,7 @@ def assoc(d, key, value):
>>> assoc({'x': 1}, 'y', 3) # doctest: +SKIP
{'x': 1, 'y': 3}
"""
return merge(d, {key: value})
return merge(d, type(d)([(key, value)]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect that the following will be faster.

d = d.copy()
d[key] = value

@eriknw
Copy link
Member

eriknw commented Mar 15, 2015

I think we should think about supporting (for input and output) any dict-like object as long as it supports the mapping protocol.

Previously, functions in dicttoolz consume mappings and return dict. This PR adds support for returning OrderedDict, but breaks support for generic mappings. For example type(d)(rv) fails if d is a defaultdict.

One option to support generic mappings is for functions in dicttoolz to accept a factory function that creates a new mapping to return. This would default to dict. Note that copy and update are not part of the mapping protocol.

@bartvm
Copy link
Contributor Author

bartvm commented Mar 15, 2015

Something along those lines sounds good to me. Another approach, that wouldn't require the passing of a factory keyword argument, could be this. It's messier though and perhaps not a 100% fail-safe.

def factory(d):
    if type(d) is dict:
        return {}
    try:
        r = d.__reduce__()
    except TypeError:
        raise TypeError
    if len(r) == 2:
        callable_, args = r
        return callable_()
    elif len(r) == 5:
        callable_, args, state, list_items, dict_items = r
        if state is not None or list_items is not None:
            raise TypeError
        return callable_(*args)
    raise TypeError

@eriknw
Copy link
Member

eriknw commented Mar 15, 2015

That's pretty clever! Relying on the pickle protocol may not be particularly robust though.

A factory keyword will let you do this: factory=lambda: collections.defaultdict(lambda: 0). This is explicit and hopefully easy enough to understand.

@bartvm
Copy link
Contributor Author

bartvm commented Mar 15, 2015

I agree it's more explicit and robust.

How would you want to deal with mappings that don't support update and copy? For copy I guess you could switch from d.copy() to copy.copy(d), but not using update is quite a bit slower than doing it manually:

In [135]: def update(d1, d2):
    for k, v in d2.iteritems():
        d1[k] = v

In [137]: d1 = {i: i +1 for i in range(1000)}

In [138]: d2 = {i: i +1 for i in range(1000)}

In [139]: %timeit update(d1, d2)
10000 loops, best of 3: 63.9 µs per loop

In [140]: %timeit d1.update(d2)
10000 loops, best of 3: 20.6 µs per loop

Maybe something like this?

def update(d1, d2):
    if callable(getattr(d1, 'update', None)):
        d1.update(d2)
    else:
        for k, v in d2.iteritems():
            d1[k] = v

@bartvm
Copy link
Contributor Author

bartvm commented Mar 25, 2015

New version with factory keyword argument instead of type(d). Made it compatible with the mapping protocol, so .copy -> copy.copy and .update to my suggestion from #232 (comment), which seemed fast enough:

In [1]: def update(d1, d2):
    if callable(getattr(d1, 'update', None)):
        d1.update(d2)
    else:
        for k, v in d2.iteritems():
            d1[k] = v

In [2]: d1 = {i: i +1 for i in range(1000)}

In [3]: d2 = {i: i +1 for i in range(1000)}

In [4]: %timeit update(d1, d2)
10000 loops, best of 3: 20.3 µs per loop

In [5]: %timeit d1.update(d2)
10000 loops, best of 3: 19.9 µs per loop

@eriknw
Copy link
Member

eriknw commented Mar 25, 2015

Very cool.

@bartvm
Copy link
Contributor Author

bartvm commented Mar 25, 2015

After having a closer look I'd actually suggest switching back to the update method. It's part of the MutableMapping protocol, and update is only used in merge, where we can assume that the mapping used is mutable (it must be).

I made that change and all tests should pass now. Let me know if you'd like to see any other changes.

@eriknw
Copy link
Member

eriknw commented Mar 25, 2015

Oh, you're right about that. Thanks for being pedantic. I was looking at Mapping, not MutableMapping.

We should check to see if anything needs done for toolz.curried or toolz.curried_exceptions. Presumably, it should be easy to curry or partial everything in toolz.dicttoolz to use a different factory.

We should probably mention this in the documentation somewhere too.

This looks pretty good to me. @mrocklin, thoughts?

@eriknw
Copy link
Member

eriknw commented Mar 25, 2015

Would you like to add yourself to AUTHORS.md, @bartvm? Welcome to PyToolz!

@bartvm
Copy link
Contributor Author

bartvm commented Mar 25, 2015

Cool, I will, thanks!

Currying is problematic for merge and merge_with, because they unpack their arguments (so merge() without any arguments works just fine and returns {}). This means curry can't tell whether merge(factory=defaultdict) is a partial application or not, and simply returns an empty defaultdict.

You could force merge/merge_with to take at least one input by changing the signature, but I think it's probably preferable that merge works even with merge(*()).

@mrocklin
Copy link
Member

Seems good to me. It's a nice approach. Explicitness as release valve.

@eriknw
Copy link
Member

eriknw commented Mar 25, 2015

Okay, so merge_with is already in "toolz/curried_exceptions.py" and needs to be updated. merge should be added as well and should raise a TypeError if len(dicts) == 0.

@bartvm
Copy link
Contributor Author

bartvm commented Mar 25, 2015

Updated curried_exceptions.

I tested with OrderedDict myself so far, but ran into a problem when trying some of the other functions with factory=lambda: defaultdict(int): factory sometimes gets called without an argument, expected to return an empty mapping container, and sometimes with an argument (an iterable with items).

Two solutions:

  • Expect the user to solve this directly by passing e.g. factory=partial(defaultdict, int).
  • The solution I implemented for now: change the code to always construct an empty mapping first, and then use update to load the items from an iterable if needed. Speed-wise I couldn't tell a difference, and it should work for any factory that returns a mutable mapping object (factory=lambda: defaultdict(int) and factory=partial(defaultdict, int) both work).

@eriknw
Copy link
Member

eriknw commented Mar 26, 2015

This is excellent. I hope you find it convenient enough for your original use case with OrderedDicts.

+1 to merge (which I'll do soon if no comment).

It'll be interesting to see how easily and how well cytoolz handles this.

eriknw added a commit that referenced this pull request Mar 27, 2015
Add support for OrderedDicts
@eriknw eriknw merged commit 638f48c into pytoolz:master Mar 27, 2015
@eriknw
Copy link
Member

eriknw commented Mar 27, 2015

This is in!

@bartvm
Copy link
Contributor Author

bartvm commented Mar 27, 2015

Great, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants