Add support for OrderedDicts #232

bartvm · 2015-03-12T00:28:19Z

I like the Dicttoolz package, but for many of my use cases I need the deterministic behaviour of OrderedDict. Adapting Dicttoolz to return OrderedDict if all of its inputs are one is relatively straightforward. Is this something you would consider merging?

mrocklin · 2015-03-12T03:18:11Z

Things like using type(d) instead of dict seem innocuous and should definitely go in. The other changes also seem fairly respectful. The two concerns that I would have are:

1. How much does this impact common-case performance? It'd be good to see a couple of reassuring benchmarks.
2. How costly is it to implement this in cytoolz?

mrocklin · 2015-03-12T03:19:10Z

Also thanks, this will be cool if we can make it work. I often bemoan the lack of universal support for OrderedDicts in the ecosystem. Never occurred to me to improve toolz in this way.

bartvm · 2015-03-12T03:48:10Z

Great! I'll run some benchmarks tomorrow. I can't think of anything that I changed which would negatively affect performance too much. OrderedDict can be quite a bit slower than dict, but I switch to using dict in merge and merge_with as soon as it is clear that one of the inputs isn't an OrderedDict.

I guess there is a strange corner case which could be slow: If the first inputs to merge/merge_with are very large OrderedDict instances followed by a normal dict, the switch to using dict means copying all of the data so far, which could be slow and memory-intensive. I guess that's a pretty unlikely scenario though.

I haven't looked at cytoolz and am not too skilled in Cython, but I'd be happy to try and help to make it work if I can.

bartvm · 2015-03-12T17:55:31Z

These benchmarks are not very scientific or exhaustive, but they give an idea.

Things seem to be okay when using type(d) instead of dict:

In [1]: ordered_d = OrderedDict((i, i + 1) for i in range(10000))

In [2]: d = {i: i + 1 for i in range(10000)}

In [9]: %timeit valmap(lambda x: x * 2, d)
1000 loops, best of 3: 1.41 ms per loop

In [10]: %timeit valmap(lambda x: x * 2, ordered_d)
100 loops, best of 3: 3.93 ms per loop

In [11]: %timeit ordered_valmap(lambda x: x * 2, d)
1000 loops, best of 3: 1.42 ms per loop

In [12]: %timeit ordered_valmap(lambda x: x * 2, ordered_d)
100 loops, best of 3: 3.94 ms per loop

I made a small change, and now, for a small number of dictionaries, the overhead of converting the intermediate result from OrderedDict to dict in merge and merge_with is noticeable (about 2.5%), but for a large number of dictionaries you can't tell:

In [11]: %timeit merge(d, d)
1000 loops, best of 3: 277 µs per loop

In [12]: %timeit ordered_merge(d, d)
1000 loops, best of 3: 284 µs per loop

In [13]: many_d = [d.copy() for _ in range(1000)]

In [7]: %timeit merge(many_d)
10 loops, best of 3: 128 ms per loop

In [10]: %timeit ordered_merge(many_d)
10 loops, best of 3: 128 ms per loop

The corner case I mentioned has a large effect on performance, but it seems pretty artificial:

In [14]: mix_d = [ordered_d.copy() for _ in range(100)] + [d]

In [15]: %timeit merge(mix_d)
100 loops, best of 3: 13.2 ms per loop

In [16]: %timeit ordered_merge(mix_d)
1 loops, best of 3: 552 ms per loop

eriknw · 2015-03-12T18:22:09Z

Interesting. Regarding impact to cytoolz, it is very fast to check whether an item is a dict and not a subclass of dict via PyDict_CheckExact. There are actually a few places where we can speed up iteration over pure dicts in Cython that we have not yet done. Whether it's worth branching based on type (dict or not dict) is yet to be determined, but it's something we can do with minimal impact on performance with pure dicts.
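
For readers less familiar with the C API, a rough pure-Python analogue of the exact-type check described above is `type(d) is dict`, as opposed to `isinstance(d, dict)`, which also accepts subclasses such as OrderedDict. The helper name below is hypothetical and only illustrates the distinction; it is not cytoolz code.

```python
from collections import OrderedDict


def is_exact_dict(d):
    """Hypothetical helper: True only for a plain dict, not for subclasses."""
    return type(d) is dict


plain = {"a": 1}
ordered = OrderedDict(a=1)

print(is_exact_dict(plain), isinstance(plain, dict))      # True True
print(is_exact_dict(ordered), isinstance(ordered, dict))  # False True
```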

mrocklin · 2015-03-15T03:24:18Z

toolz/dicttoolz.py

+        if return_ordered_dict and not isinstance(d, OrderedDict):
+            result = dict(result)
+            dict_ = dict
+            return_ordered_dict = False


I wonder if this logic can be pulled out of the loop into a separate loop:

if all(isinstance(d, OrderedDict) for d in dicts):
    result = OrderedDict()
else:
    result = dict()

This does walk through the dicts twice, but I suspect that this is cheap. It also requires that we make dicts concrete and not lazy, but this was already the case due to how we implement merge_with, which brings everything into memory anyway.

If this doesn't significantly impact performance, then it might be preferred for the sake of code simplicity.

bartvm · 2015-03-15

The reason I didn't do that is that if dicts is an iterator, it ends up being exhausted before the loop ever starts.

mrocklin · 2015-03-15

My suggestion in that case is to actually make dicts concrete by calling list on it. This would be bad if we benefited from laziness, but due to the current implementation of merge_with we're not lazy anyway.

bartvm · 2015-03-15

Ah, read that too quickly, sorry! Good point, I'll change it.
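
A minimal sketch of the approach discussed in this review exchange, assuming the inputs are made concrete with list() before the type check. The name merge_sketch is hypothetical, and falling back to a plain dict for empty input is an assumption of this sketch; it is not the code from the PR.

```python
from collections import OrderedDict


def merge_sketch(*dicts):
    """Merge mappings; return an OrderedDict only if every input is one."""
    # Allow merge_sketch([d1, d2]) as well as merge_sketch(d1, d2).
    if len(dicts) == 1 and not isinstance(dicts[0], dict):
        dicts = dicts[0]
    # Make the inputs concrete so an iterator survives the type check below.
    dicts = list(dicts)
    # Assumption of this sketch: empty input yields a plain dict.
    if dicts and all(isinstance(d, OrderedDict) for d in dicts):
        result = OrderedDict()
    else:
        result = dict()
    for d in dicts:
        result.update(d)
    return result


a = OrderedDict([("x", 1), ("y", 2)])
b = OrderedDict([("z", 3), ("x", 10)])

ordered = merge_sketch(iter([a, b]))   # an iterator is fine after list()
mixed = merge_sketch(a, {"w": 0})      # any plain dict forces a dict result
```

The cost of this simplicity is a second pass over the inputs, which the thread above expects to be cheap.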

mrocklin · 2015-03-15T03:24:50Z

This should probably have some tests to ensure that OrderedDicts emerge in the appropriate cases.

@mrocklin

As pointed out by @mrocklin, the function already loaded all dictionaries into memory anyway.

Timing before:

d = [{i: i + 1 for i in range(10)} for j in range(10000)]
%timeit merge_with(d)
100000 loops, best of 3: 11.2 µs per loop

And after:

%timeit merge_with(d)
100000 loops, best of 3: 11.7 µs per loop

mrocklin · 2015-03-15T20:21:20Z

toolz/dicttoolz.py

@@ -175,7 +190,7 @@ def assoc(d, key, value):
    >>> assoc({'x': 1}, 'y', 3)   # doctest: +SKIP
    {'x': 1, 'y': 3}
    """
-    return merge(d, {key: value})
+    return merge(d, type(d)([(key, value)]))


I suspect that the following will be faster:

d = d.copy()
d[key] = value
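
A small sketch of the copy-based version suggested here, assuming d has a copy method (true for dict and OrderedDict, though not required of every mapping). The function name assoc_copy is hypothetical.

```python
from collections import OrderedDict


def assoc_copy(d, key, value):
    """Return a copy of d with key set to value; d itself is not modified."""
    d2 = d.copy()      # shallow copy keeps the type of d, e.g. OrderedDict
    d2[key] = value
    return d2


original = OrderedDict([("x", 1)])
updated = assoc_copy(original, "y", 3)
```

Binding the copy to a new name rather than rebinding d is only a readability choice; the behaviour is the same.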

eriknw · 2015-03-15T21:11:44Z

I think we should think about supporting (for input and output) any dict-like object as long as it supports the mapping protocol.

Previously, functions in dicttoolz consumed mappings and returned dict. This PR adds support for returning OrderedDict, but breaks support for generic mappings. For example, type(d)(rv) fails if d is a defaultdict.

One option to support generic mappings is for functions in dicttoolz to accept a factory function that creates a new mapping to return. This would default to dict. Note that copy and update are not part of the mapping protocol.
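
To make the two points above concrete, here is a hedged sketch. It first shows why type(d)(items) breaks for defaultdict (its first positional argument must be the default factory, not the items), and then a factory-keyword variant of a valmap-like function. valmap_sketch is a hypothetical name, not necessarily the signature that was merged.

```python
from collections import OrderedDict, defaultdict


def valmap_sketch(func, d, factory=dict):
    """Apply func to each value of d, building the result via factory()."""
    rv = factory()
    for k, v in d.items():   # only item iteration and item assignment are used
        rv[k] = func(v)
    return rv


dd = defaultdict(int, {"a": 1})
try:
    type(dd)([("a", 2)])     # defaultdict expects a callable or None first
    rebuilt = True
except TypeError:
    rebuilt = False

od = OrderedDict([("b", 1), ("a", 2)])
doubled = valmap_sketch(lambda x: x * 2, od, factory=OrderedDict)
counted = valmap_sketch(lambda x: x + 1, dd, factory=lambda: defaultdict(int))
```

Because the result is filled by item assignment, the sketch does not rely on copy or update.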

bartvm · 2015-03-15T21:42:06Z

Something along those lines sounds good to me. Another approach, which wouldn't require passing a factory keyword argument, could be this. It's messier, though, and perhaps not 100% fail-safe.

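def factory(d):
    # Fast path: a plain dict needs no introspection.
    if type(d) is dict:
        return {}
    # Use the pickle protocol to discover how d's type is constructed.
    try:
        r = d.__reduce__()
    except TypeError:
        raise TypeError
    if len(r) == 2:
        # (callable, args): call without args to get an empty container.
        callable_, args = r
        return callable_()
    elif len(r) == 5:
        # (callable, args, state, list_items, dict_items)
        callable_, args, state, list_items, dict_items = r
        # Give up if recreating d would need extra state or list items.
        if state is not None or list_items is not None:
            raise TypeError
        return callable_(*args)
    raise TypeError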

eriknw · 2015-03-15T21:49:49Z

That's pretty clever! Relying on the pickle protocol may not be particularly robust though.

A factory keyword will let you do this: factory=lambda: collections.defaultdict(lambda: 0). This is explicit and hopefully easy enough to understand.

bartvm · 2015-03-15T22:08:41Z

I agree it's more explicit and robust.

How would you want to deal with mappings that don't support update and copy? For copy I guess you could switch from d.copy() to copy.copy(d), but updating manually is quite a bit slower than using the update method:

In [135]: def update(d1, d2):
    for k, v in d2.iteritems():
        d1[k] = v

In [137]: d1 = {i: i +1 for i in range(1000)}

In [138]: d2 = {i: i +1 for i in range(1000)}

In [139]: %timeit update(d1, d2)
10000 loops, best of 3: 63.9 µs per loop

In [140]: %timeit d1.update(d2)
10000 loops, best of 3: 20.6 µs per loop

Maybe something like this?

def update(d1, d2):
    if callable(getattr(d1, 'update', None)):
        d1.update(d2)
    else:
        for k, v in d2.iteritems():
            d1[k] = v

bartvm · 2015-03-25T00:40:17Z

New version with a factory keyword argument instead of type(d). I made it compatible with the mapping protocol, so .copy became copy.copy and .update became the helper I suggested in my earlier comment on this PR, which seemed fast enough:

In [1]: def update(d1, d2):
    if callable(getattr(d1, 'update', None)):
        d1.update(d2)
    else:
        for k, v in d2.iteritems():
            d1[k] = v

In [2]: d1 = {i: i +1 for i in range(1000)}

In [3]: d2 = {i: i +1 for i in range(1000)}

In [4]: %timeit update(d1, d2)
10000 loops, best of 3: 20.3 µs per loop

In [5]: %timeit d1.update(d2)
10000 loops, best of 3: 19.9 µs per loop

eriknw · 2015-03-25T03:57:00Z

Very cool.


bartvm · 2015-03-25T04:25:25Z

After having a closer look I'd actually suggest switching back to the update method. It's part of the MutableMapping protocol, and update is only used in merge, where we can assume that the mapping used is mutable (it must be).

I made that change and all tests should pass now. Let me know if you'd like to see any other changes.
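
As a quick illustration of why relying on update is reasonable here: collections.abc.MutableMapping supplies update as a mixin method, so any class that implements the five abstract methods gets it for free. The ListBackedMapping class below is hypothetical and intentionally naive, and the import path shown is the Python 3 one.

```python
from collections.abc import MutableMapping


class ListBackedMapping(MutableMapping):
    """Toy mutable mapping storing (key, value) pairs in a list."""

    def __init__(self):
        self._items = []

    def __getitem__(self, key):
        for k, v in self._items:
            if k == key:
                return v
        raise KeyError(key)

    def __setitem__(self, key, value):
        for i, (k, _) in enumerate(self._items):
            if k == key:
                self._items[i] = (key, value)
                return
        self._items.append((key, value))

    def __delitem__(self, key):
        for i, (k, _) in enumerate(self._items):
            if k == key:
                del self._items[i]
                return
        raise KeyError(key)

    def __iter__(self):
        return (k for k, _ in self._items)

    def __len__(self):
        return len(self._items)


m = ListBackedMapping()
m.update({"a": 1})               # update is inherited from MutableMapping
m.update([("b", 2), ("a", 3)])   # it also accepts an iterable of pairs
```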

eriknw · 2015-03-25T04:54:26Z

Oh, you're right about that. Thanks for being pedantic. I was looking at Mapping, not MutableMapping.

We should check to see if anything needs to be done for toolz.curried or toolz.curried_exceptions. Presumably, it should be easy to curry or partial everything in toolz.dicttoolz to use a different factory.

We should probably mention this in the documentation somewhere too.

This looks pretty good to me. @mrocklin, thoughts?

eriknw · 2015-03-25T04:56:25Z

Would you like to add yourself to AUTHORS.md, @bartvm? Welcome to PyToolz!

bartvm · 2015-03-25T13:40:30Z

Cool, I will, thanks!

Currying is problematic for merge and merge_with, because they unpack their arguments (so merge() without any arguments works just fine and returns {}). This means curry can't tell whether merge(factory=defaultdict) is a partial application or not, and simply returns an empty defaultdict.

You could force merge/merge_with to take at least one input by changing the signature, but I think it's probably preferable that merge works even with merge(*()).

mrocklin · 2015-03-25T14:45:39Z

Seems good to me. It's a nice approach. Explicitness as release valve.

eriknw · 2015-03-25T16:41:44Z

Okay, so merge_with is already in "toolz/curried_exceptions.py" and needs to be updated. merge should be added as well and should raise a TypeError if len(dicts) == 0.
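
The ambiguity and the proposed fix can be sketched as follows. Both functions are hypothetical stand-ins written for illustration (using Python 3 keyword-only syntax); they are not the toolz implementations. The assumption behind the second function is that a curry-style wrapper treats a TypeError as "not enough arguments yet", so raising one when no dicts are supplied lets it fall back to a partial application.

```python
from collections import OrderedDict


def merge_with_factory(*dicts, factory=dict):
    """Merge mappings into a new mapping created by factory()."""
    rv = factory()
    for d in dicts:
        rv.update(d)
    return rv


def merge_strict(*dicts, factory=dict):
    """Like merge_with_factory, but refuse to run without input mappings."""
    if len(dicts) == 0:
        raise TypeError("merge_strict expects at least one mapping")
    return merge_with_factory(*dicts, factory=factory)


# With no positional arguments the plain version returns an empty mapping,
# so a caller cannot distinguish this from a partial application.
empty = merge_with_factory(factory=OrderedDict)

try:
    merge_strict(factory=OrderedDict)
    raised = False
except TypeError:
    raised = True

full = merge_strict({"a": 1}, {"b": 2}, factory=OrderedDict)
```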


bartvm · 2015-03-25T19:21:04Z

Updated curried_exceptions.

So far I had only tested with OrderedDict myself, but I ran into a problem when trying some of the other functions with factory=lambda: defaultdict(int): factory is sometimes called without an argument and expected to return an empty mapping container, and sometimes called with an argument (an iterable of items).

Two solutions:

1. Expect the user to solve this directly by passing e.g. factory=partial(defaultdict, int).
2. The solution I implemented for now: change the code to always construct an empty mapping first, and then use update to load the items from an iterable if needed. Speed-wise I couldn't tell a difference, and it should work for any factory that returns a mutable mapping object (factory=lambda: defaultdict(int) and factory=partial(defaultdict, int) both work).
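
A sketch of the second solution, assuming a hypothetical keymap_sketch helper: the factory is always called with no arguments and the items are loaded with update afterwards, so both spellings of the defaultdict factory work.

```python
from collections import defaultdict
from functools import partial


def keymap_sketch(func, d, factory=dict):
    """Apply func to each key of d; factory() must return an empty mutable mapping."""
    rv = factory()                                   # never called with items
    rv.update((func(k), v) for k, v in d.items())    # load the items afterwards
    return rv


d = {"a": 1, "b": 2}
via_lambda = keymap_sketch(str.upper, d, factory=lambda: defaultdict(int))
via_partial = keymap_sketch(str.upper, d, factory=partial(defaultdict, int))

# Calling the lambda factory with items would fail: it takes no arguments.
try:
    (lambda: defaultdict(int))(d.items())
    accepted_items = True
except TypeError:
    accepted_items = False
```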

eriknw · 2015-03-26T21:02:24Z

This is excellent. I hope you find it convenient enough for your original use case with OrderedDicts.

+1 to merge (which I'll do soon if no comment).

It'll be interesting to see how easily and how well cytoolz handles this.


eriknw · 2015-03-27T14:49:16Z

This is in!

bartvm · 2015-03-27T19:00:55Z

Great, thanks!

Add support for OrderedDicts

45679c0

bartvm mentioned this pull request Mar 12, 2015

Switch from custom dict_union to dicttoolz.merge/merge_with mila-iqia/blocks#456

Open

Simply ignore OrderedDict in Python 2.6

f52d767

bartvm force-pushed the ordereddict branch from 10d25c5 to f52d767 Compare March 12, 2015 00:44

Switch to dict when possible for increased performance

20da4e8

bartvm force-pushed the ordereddict branch from d1939bd to 20da4e8 Compare March 12, 2015 03:36

Avoid isinstance check when not needed

783f13f

mrocklin reviewed Mar 15, 2015

bartvm added 4 commits March 15, 2015 10:05

PEP8 of test_dictoolz.py

2896b67

Add tests for OrderedDict support in merge, merge_with

bbe0ddc

Ignore OrderedDict equalities in Python 2.6 tests

269ba3a

bartvm force-pushed the ordereddict branch from 63259a3 to 269ba3a Compare March 15, 2015 15:03

mrocklin reviewed Mar 15, 2015

bartvm added 4 commits March 24, 2015 18:52

Revert to master

96316e3

Add support for factory keyword argument

3532529

PEP8 of test_dictoolz.py

40ddfac

Use custom update function

a368e4b

bartvm added 2 commits March 24, 2015 20:26

Use copy module instead of method

d042a14

Add factory test

c691e64

bartvm added 2 commits March 25, 2015 00:16

Test passing of keyword arguments merge; default back to usual update…

8acf4d7

… method

Python 2.6 compatibility

86f6f0f

Add @bartvm to authors list

699d398

Update curry examples, don't expect factory to take items as optional…

d84ce56

… argument

Add tests for curried merge

1621454

eriknw added a commit that referenced this pull request Mar 27, 2015

Merge pull request #232 from bartvm/ordereddict

638f48c

Add support for OrderedDicts

eriknw merged commit 638f48c into pytoolz:master Mar 27, 2015
