Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OrderedSet is slower than OrderedDict; should use OrderedDict instead #73

Closed
danuker opened this issue Nov 16, 2020 · 6 comments
Closed

Comments

@danuker
Copy link

danuker commented Nov 16, 2020

Since it requires Python >= 3.5 anyway, OrderedSet can be reduced to an adapter of collections.OrderedDict. One can easily use a dict as a set by ignoring the dict's values.

A synthetic benchmark I ran shows that, if CPU caching does not come into play (a big if), an OrderedDict is faster (especially in the case of deletion, which is O(n) ):

Wrote profile results to set_test.py.lprof
Timer unit: 1e-06 s

Total time: 6.21423 s
File: set_test.py
Function: bench at line 13

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    13                                           @profile
    14                                           def bench():
    15                                               # Creation
    16         1       5228.0   5228.0      0.1      s = OrderedSet(included)
    17         1      17110.0  17110.0      0.3      d = OrderedDict({i: 1 for i in included})
    18                                           
    19                                               # Insertion
    20      3001       1518.0      0.5      0.0      for i in missing:
    21      3000       5976.0      2.0      0.1          s.add(i)
    22      3001       1518.0      0.5      0.0      for i in missing:
    23      3000       2231.0      0.7      0.0          d[i] = 1
    24                                           
    25                                               # Deletion
    26      3001       2161.0      0.7      0.0      for i in missing:
    27      3000    6159236.0   2053.1     99.1          s.remove(i)
    28      3001       1483.0      0.5      0.0      for i in missing:
    29      3000       2533.0      0.8      0.0          d.pop(i)
    30                                           
    31                                               # Membership test (included)
    32      3001       1492.0      0.5      0.0      for i in included:
    33      3000       2764.0      0.9      0.0          assert i in s
    34      3001       1534.0      0.5      0.0      for i in included:
    35      3000       1786.0      0.6      0.0          assert i in d
    36                                           
    37                                               # Membership test (missing)
    38      3001       1510.0      0.5      0.0      for i in missing:
    39      3000       2836.0      0.9      0.0          assert i not in s
    40      3001       1519.0      0.5      0.0      for i in missing:
    41      3000       1795.0      0.6      0.0          assert i not in d

I am quite convinced there would be performance improvements, if we changed the implementation to use an OrderedDict instead of a set and a list (or a Python 3.6 dict, which is ordered, as @Eric-Arellano suggests).

Edit: However, this would make the access by index (like OrderedSet[1]) much slower (complexity O(n) instead of the current O(1)).

Would you accept a PR doing so?

@Eric-Arellano
Copy link
Contributor

FYI at https://github.com/pantsbuild/pants/blob/master/src/python/pants/util/ordered_set.py, we ~forked this library to create an OrderedSet and FrozenOrderedSet type backed by a builtin dict, as we use CPython 3.6+ so have insertion ordering. The ordering is an implementation detail in 3.6, and guaranteed in 3.7+.

I don't have benchmarks handy, but I believe dict is generally faster than OrderedDict, so if it's possible to drop 3.5 support, you could use that. 3.5 is EOL, so might be fine to drop.

We didn't upstream our implementation, as we made several major changes (simplifications) like no longer implementing Sequence. But feel free to use our implementation (Pants is Apache 2).

@danuker
Copy link
Author

danuker commented Nov 16, 2020

@Eric-Arellano Funny how this implementation is also inspired from Raymond Hettinger's, but uses a list instead of a linked list, making access by index O(1), but deletion much slower - O(n).

With a linked list, index access would be O(n), and deletion O(1).

I am not so sure about the pull request anymore. I would never think of using index access on a set, but I suppose it also depends on the other people using this library.

@Eric-Arellano
Copy link
Contributor

Funny how this implementation is also inspired from Raymond Hettinger's, but uses a list instead of a linked list, making access by index O(1), but deletion much slower - O(n).

Indeed! We 95% of the time use FrozenOrderedSet. For Pants's caching, it's important that things are immutable. The only time we use OrderedSet is inside a function when creating a collection by continually adding to it, then we convert to a FrozenOrderedSet.

So, we care far more about access by index than deletion.

--

Possibly interesting to you - we also added a simple FrozenDict class that uses dict under-the-hood and simply doesn't expose any of its mutation entry-points: https://github.com/pantsbuild/pants/blob/master/src/python/pants/util/frozendict.py

--

I would never think of using index access on a set

Yeah, this was a big reason we forked. I originally tried adding FrozenOrderedSet here in #63, and we realized this project's goals aren't quite the same as Pants, even though this is still an awesome library.

I think the Sequence behavior is useful for some of the data science use cases like Pandas iirc.

@danuker
Copy link
Author

danuker commented Nov 16, 2020

@Eric-Arellano

immutable
simply doesn't expose any of its mutation entry-points

Cool! I am into functional programming (functional core / imperative shell), and I agree that immutability is valuable. I might "steal" your ideas 😁

I wish you good luck with your project!

we care far more about access by index than deletion

Hmm, you have no access by index in your OrderedSet. Are you talking about the dict? Or perhaps you mean membership check?

Yeah, this was a big reason we forked.

I can totally understand. Perhaps the author and users might agree.

In that case, I am not sure how to deal with the license (MIT vs. Apache).

The author has to think about whether to keep your implementation under Apache (and have multiple licenses inside this project), or ask all contributors to that file for MIT permission (That is, if a copy of the file is desired instead of a re-implementation).

@danuker
Copy link
Author

danuker commented Nov 16, 2020

@Eric-Arellano One way you could still keep the mutation methods is by having them return a new object that is a copy of the old, but with some changed things.

That way, you get immutable mutation! 😆

@rspeer
Copy link
Owner

rspeer commented Jan 26, 2022

Added a note to the top of the README about when you would prefer to use OrderedDict.

@rspeer rspeer closed this as completed Jan 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants