Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v4.0 #57

Closed
wants to merge 3 commits into from
Closed

v4.0 #57

wants to merge 3 commits into from

Conversation

wimglenn
Copy link

@wimglenn wimglenn commented Aug 6, 2019

This changes equality comparisons completely to make them more consistent with how all the other data structures in Python stdlib behave.

  • Comparisons with regular set and frozenset are order agnostic.
  • Comparisons with other OrderedSet (and other ordered-set-like-things) are order sensitive.
  • Previous equality comparison with list/deque/sequence was removed (this was unpythonic and had no precedent)
  • Comparisons with unrelated types return NotImplemented - this is allowing the other type a chance to potentially handle the operation (as opposed to returning False)
  • Major version number is bumped up because this is a backwards-incompat change.

The behaviour of equality comparison was modeled after the way collections.OrderedDict behaves (order agnostic vs regular dict, order sensitive vs other ordered dict).

@rspeer
Copy link
Owner

rspeer commented Aug 7, 2019

I don't agree with breaking the ability to compare to lists. Clearly that's something that people use.

This implementation of OrderedSet (as opposed to the alternate implementation, an OrderedDict where the keys don't matter) turns out to be useful in list-like contexts, much like pandas.Index is. Maybe it should have been called UniqueList or something from the start.

For example, you can use the indexing operator on a list or an OrderedSet, and not on a set, a dict, or an OrderedDict.

I recognize that it's better to return NotImplemented when comparing to an unrelated type.

@wimglenn
Copy link
Author

wimglenn commented Aug 7, 2019

Is it really clear that people use that? How?

Different types in Python almost never compare equal, except in a handful of special cases (numeric types, subclasses). The developer must be explicit about it with constructs such as:

  • list(xs) == list(ys)
  • all(x==y for x,y in zip(xs, ys))

I was surprised to see the comparison special-cased with Sequences and deque in the source code, I think most Python developers would never have expected such a thing.

Be aware of the mathematical consequences as well, for example breaking transitivity of equality:

>>> OrderedSet([0,1]) == [0,1]
True
>>> OrderedSet([0,1]) == (0,1)
True
>>> [0,1] == (0,1)
False

@rspeer
Copy link
Owner

rspeer commented Aug 8, 2019

So I don't think that Python guarantees transitivity of equality, but the order-independent comparison with sets (which I don't have any indication that anyone has wanted) is definitely non-transitive. If it stays implemented:

OrderedSet([1, 2]) == {1, 2} == OrderedSet([2, 1])

The special case with deque, on the other hand, is because someone asked for it.

OrderedSets are sometimes used as a faster equivalent to pd.Index from Pandas, and Pandas also silently converts list into pd.Index in some contexts.

pd.Index implements equality across types, though it does it with NumPy shenanigans where == returns an array of comparisons instead of a bool, and you're expected to run .all() on the result. I don't want to emulate that NumPy gotcha, but regardless, look at how Indexes compare:

>>> pd.Index([1, 2, 3]) == (1, 2, 3)
array([ True,  True,  True])

>>> pd.Index([1, 2, 3]) == [1, 2, 3]
array([ True,  True,  True])

>>> pd.Index([1, 2, 3]) == (1, 2, 3)
array([ True,  True,  True])

>>> pd.Index([1, 2, 3]) == {1, 2, 3}
array([False, False, False])

>>> pd.Index([1, 2, 3]) == [2, 1, 3]
array([False, False,  True])

>>> pd.Index([1, 2, 3]) == OrderedSet([1, 2, 3])
array([ True,  True,  True])

@wimglenn
Copy link
Author

wimglenn commented Aug 8, 2019

The behaviour of array-likes (numpy and pandas broadcasting) is somewhat irrelevant in this context, and should not be taken into account for equality comparisons with stdlib containers which are returning a scalar.

If you want to keep that broadcasting behaviour for array-likes, there is nothing stopping you, but perhaps you should make sure to handle it from either side to avoid this bug-prone inconsistency:

>>> pd.Index([1, 2, 3]) == OrderedSet([1, 2, 3])
array([ True,  True,  True])
>>> OrderedSet([1, 2, 3]) == pd.Index([1, 2, 3])
True

As for OrderedSet([1, 2]) == {1, 2} == OrderedSet([2, 1]), the behaviour has precedent with OrderedDict (subclass), which is also transitivity-breaking. But I'm not aware of any equivalent example using three unrelated types in Python 3. I was using OrderedSet as a dependency, and hit some deal-breaking bugs of which the library's __eq__ implementation was the direct cause.
As a general rule, unrelated types just don't compare equal in Python. If it was, as you mentioned, called a UniqueList then I might expect it to compare equal with a list - however that would also come with other expectations (can store non-hashable items, membership checks are O(n), etc) and it may no longer be sensible to base off of Raymond Hettinger's set-like recipe in the first place.

Anyway, maybe neither of us can convince the other - I'm just putting forth my opinion and the arguments supporting it, it's of course up to you how you want to proceed with the library! I'm happy to fork and release the proposed changes under a different name if necessary.

@rspeer
Copy link
Owner

rspeer commented Aug 8, 2019

I don't want to implement NumPy's gotcha about equality, or depend on NumPy in any way, so the comparison between pd.Index and OrderedSet is necessarily non-commutative. No matter what decisions we chose about equality in OrderedSet, it could not return an np.array, so it can't be commutative.

Given that the bug you reported that introduced this is also about a non-commutative comparison, I'm thinking that probably the best thing to do is leave it alone. When I think about it, changing the value of a tested comparison from True to False requires not just a major version bump, but also a period of warnings and deprecations, and I think it's just not worth it.

Feel free to fork.

@rspeer rspeer closed this Aug 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants