OrderedSet is slower than OrderedDict; should use OrderedDict instead #73

danuker · 2020-11-16T16:34:09Z

Since it requires Python >= 3.5 anyway, OrderedSet can be reduced to an adapter of collections.OrderedDict. One can easily use a dict as a set by ignoring the dict's values.

A synthetic benchmark I ran shows that, if CPU caching does not come into play (a big if), an OrderedDict is faster (especially in the case of deletion, which is O(n) ):

Wrote profile results to set_test.py.lprof
Timer unit: 1e-06 s

Total time: 6.21423 s
File: set_test.py
Function: bench at line 13

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    13                                           @profile
    14                                           def bench():
    15                                               # Creation
    16         1       5228.0   5228.0      0.1      s = OrderedSet(included)
    17         1      17110.0  17110.0      0.3      d = OrderedDict({i: 1 for i in included})
    18                                           
    19                                               # Insertion
    20      3001       1518.0      0.5      0.0      for i in missing:
    21      3000       5976.0      2.0      0.1          s.add(i)
    22      3001       1518.0      0.5      0.0      for i in missing:
    23      3000       2231.0      0.7      0.0          d[i] = 1
    24                                           
    25                                               # Deletion
    26      3001       2161.0      0.7      0.0      for i in missing:
    27      3000    6159236.0   2053.1     99.1          s.remove(i)
    28      3001       1483.0      0.5      0.0      for i in missing:
    29      3000       2533.0      0.8      0.0          d.pop(i)
    30                                           
    31                                               # Membership test (included)
    32      3001       1492.0      0.5      0.0      for i in included:
    33      3000       2764.0      0.9      0.0          assert i in s
    34      3001       1534.0      0.5      0.0      for i in included:
    35      3000       1786.0      0.6      0.0          assert i in d
    36                                           
    37                                               # Membership test (missing)
    38      3001       1510.0      0.5      0.0      for i in missing:
    39      3000       2836.0      0.9      0.0          assert i not in s
    40      3001       1519.0      0.5      0.0      for i in missing:
    41      3000       1795.0      0.6      0.0          assert i not in d

I am quite convinced there would be performance improvements, if we changed the implementation to use an OrderedDict instead of a set and a list (or a Python 3.6 dict, which is ordered, as @Eric-Arellano suggests).

Edit: However, this would make the access by index (like OrderedSet[1]) much slower (complexity O(n) instead of the current O(1)).

Would you accept a PR doing so?

The text was updated successfully, but these errors were encountered:

Eric-Arellano · 2020-11-16T16:48:38Z

FYI at https://github.com/pantsbuild/pants/blob/master/src/python/pants/util/ordered_set.py, we ~forked this library to create an OrderedSet and FrozenOrderedSet type backed by a builtin dict, as we use CPython 3.6+ so have insertion ordering. The ordering is an implementation detail in 3.6, and guaranteed in 3.7+.

I don't have benchmarks handy, but I believe dict is generally faster than OrderedDict, so if it's possible to drop 3.5 support, you could use that. 3.5 is EOL, so might be fine to drop.

We didn't upstream our implementation, as we made several major changes (simplifications) like no longer implementing Sequence. But feel free to use our implementation (Pants is Apache 2).

danuker · 2020-11-16T17:01:46Z

@Eric-Arellano Funny how this implementation is also inspired from Raymond Hettinger's, but uses a list instead of a linked list, making access by index O(1), but deletion much slower - O(n).

With a linked list, index access would be O(n), and deletion O(1).

I am not so sure about the pull request anymore. I would never think of using index access on a set, but I suppose it also depends on the other people using this library.

Eric-Arellano · 2020-11-16T17:09:14Z

Funny how this implementation is also inspired from Raymond Hettinger's, but uses a list instead of a linked list, making access by index O(1), but deletion much slower - O(n).

Indeed! We 95% of the time use FrozenOrderedSet. For Pants's caching, it's important that things are immutable. The only time we use OrderedSet is inside a function when creating a collection by continually adding to it, then we convert to a FrozenOrderedSet.

So, we care far more about access by index than deletion.

--

Possibly interesting to you - we also added a simple FrozenDict class that uses dict under-the-hood and simply doesn't expose any of its mutation entry-points: https://github.com/pantsbuild/pants/blob/master/src/python/pants/util/frozendict.py

--

I would never think of using index access on a set

Yeah, this was a big reason we forked. I originally tried adding FrozenOrderedSet here in #63, and we realized this project's goals aren't quite the same as Pants, even though this is still an awesome library.

I think the Sequence behavior is useful for some of the data science use cases like Pandas iirc.

danuker · 2020-11-16T17:28:11Z

@Eric-Arellano

immutable
simply doesn't expose any of its mutation entry-points

Cool! I am into functional programming (functional core / imperative shell), and I agree that immutability is valuable. I might "steal" your ideas 😁

I wish you good luck with your project!

we care far more about access by index than deletion

Hmm, you have no access by index in your OrderedSet. Are you talking about the dict? Or perhaps you mean membership check?

Yeah, this was a big reason we forked.

I can totally understand. Perhaps the author and users might agree.

In that case, I am not sure how to deal with the license (MIT vs. Apache).

The author has to think about whether to keep your implementation under Apache (and have multiple licenses inside this project), or ask all contributors to that file for MIT permission (That is, if a copy of the file is desired instead of a re-implementation).

danuker · 2020-11-16T17:50:22Z

@Eric-Arellano One way you could still keep the mutation methods is by having them return a new object that is a copy of the old, but with some changed things.

That way, you get immutable mutation! 😆

rspeer · 2022-01-26T15:02:19Z

Added a note to the top of the README about when you would prefer to use OrderedDict.

rspeer closed this as completed Jan 26, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OrderedSet is slower than OrderedDict; should use OrderedDict instead #73

OrderedSet is slower than OrderedDict; should use OrderedDict instead #73

danuker commented Nov 16, 2020 •

edited

Eric-Arellano commented Nov 16, 2020

danuker commented Nov 16, 2020 •

edited

Eric-Arellano commented Nov 16, 2020

danuker commented Nov 16, 2020

danuker commented Nov 16, 2020

rspeer commented Jan 26, 2022

OrderedSet is slower than OrderedDict; should use OrderedDict instead #73

OrderedSet is slower than OrderedDict; should use OrderedDict instead #73

Comments

danuker commented Nov 16, 2020 • edited

Eric-Arellano commented Nov 16, 2020

danuker commented Nov 16, 2020 • edited

Eric-Arellano commented Nov 16, 2020

danuker commented Nov 16, 2020

danuker commented Nov 16, 2020

rspeer commented Jan 26, 2022

danuker commented Nov 16, 2020 •

edited

danuker commented Nov 16, 2020 •

edited