Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conform to set API, update to pytest, add CI badges #27

Merged
merged 9 commits into from Jun 14, 2018

Conversation

Erotemic
Copy link
Contributor

@Erotemic Erotemic commented Jun 9, 2018

This is a fairly large PR in terms of text, but I think the content is manageable for review.

It seems like this is the best implementation of OrderedSet on PYPI, but unfortunately it is missing functions implemented by the builtin set object like: intersection, symmetric_difference_update, and issubset. This PR adds implementations for these functions.

A brief summary of changes is:

  • Implement all functions exposed in Python's builtin set API.
  • Replace nose with pytest
  • Add TravisCI and corresponding badges in the README
  • Change getitem(slice(None)) to return a copy instead of a self reference.

As for reasoning behind these changes:

  • I think OrderedSet should serve as a drop-in replacement for set, and therefore should implement the full API.
  • I've replaced nose with pytest, because nose is [deprecated]. (http://nose.readthedocs.io/en/latest/index.html).
  • TravisCI serves as a public facing CI, so it can see that all tests pass without having to trust text in the README. This also lets the readme publicly display test coverage, and there is a nice pypi integration badge as well.
  • The __getitem__ change may be the most debatable. The previous docs stated that OrderedSet()[:] would return an exact self reference, but I want to argue against this behavior. The main point is that this deviates from a standard behavior taken by other Mutable Python objects. Hopefully this code snippet illustrates my point:
>>> x = [1, 2, 3]
>>> x[:] is x
False

>>> x = np.array([1, 2, 3])
>>> x[:] is x
False

>>> # Why should OrderedSet have behavior different than a list here?
>>> x = OrderedSet([1, 2, 3])
>>> x[:] is x
True

A potential counter point may be that tuple([1, 2, 3])[:], does return a reference to itself, but tuple is an immutable object, where as OrderedSet is not.

Unfortunately, even though this change is relatively minor it is a potentially breaking change that requires a major version update. My PR includes that as well. If I fail to convince you that this is a good idea, it can obviously be taken out and the minor version can be incremented.


Note: I'm submitting the PR in its current state to gauge interest. Some of the code --- especially the tests --- needs a bit of tweaking to match the style of the other code. I will do this work if the maintainers like the general changes.

I have my own implementation of OrderedSet, also based on Raymond Hettiger's original recipe, in my utility package but I'd like to rely on a standard pypi package instead. Also, I agree that nobody wants O(1) deletes in exchange for O(N) lookups. Thus most of this PR was simply copy and pasted from my implementation.

All of the new tests were autogenerated from the doctests in my old package. As such, the stdout way of testing results doesn't work and all tests would need to be updated to actually check for the right values. Lastly, many of the tests might be redundant with existing tests. Again, I do the work to make these fixes if there is any interest in merging this PR.

A summary of the cleanup work that would need to be done is:

  • change travis targets in the readme to point at LuminosoInsight instead of Erotemic.
  • Ensure that all test outputs are being checked
  • Either enable doctests in CI testing? or Remove doctests from the main module?

@rspeer
Copy link
Owner

rspeer commented Jun 11, 2018

Thanks for all the improvements! I think most of this sounds like a good idea.

I see the tests marked with "doctest want" where the outputs aren't being checked. I'd say go ahead and make them into real doctests. Having code examples in the docstrings that are automatically checked is pretty nice.

I can see the advantage of using Travis -- a public-facing CI for a public-facing project. I haven't actually used it before, only Jenkins, but it sounds easy enough to use.

I think you're going to have to sell me on adding both Travis and Appveyor -- what additional benefit does Appveyor provide? Will I have to get an Appveyor account and manage it? This seems like quite a proliferation of tooling for a simple module.

@Erotemic
Copy link
Contributor Author

Actually I meant to remove Appveyor from this PR, but forgot. I think Travis is probably sufficient. The main difference is one runs code on Linux and the other runs on Windows. I don't see any reason why there would be a difference in this application (although I've been burned by that mindset before). I've gone ahead and removed the Appveyor badge. I've also changed all of the usernames to target your repo.

As for Travis CI, you simply go to their website: https://travis-ci.org/. You don't actually have to create a new account. When you click "Sign Up", it gives you the option to simply link to your existing GitHub account. Then you just find this repository and toggle enable CI testing. That's it. New commits and PRs will automatically trigger the appropriate build as long as your repo contains a .travis.yml file, which I've already set up for you.

Something I forgot to mention, is that there is also a badge in there for https://codecov.io. You can create an account for that the same way as you would for travis, but you don't have to enable the project. This is because codecov.io has tight travis integration and will detect when it performs a CI build that produces a coverage file.

Speaking of coverage, I've gone through and patched a few holes to bring this repo from ~97% coverage to 100% coverage. These were mostly due to not testing that type errors were raised.

I've changed all of the unit tests into doctests. Note, the doctests were originally targeted towards xdoctest (a shameless self plug). I converted them to work with "old-style" builtin doctests as to not add any new dependencies to this simple lib.

Lastly I wanted to bring up two other issues I saw when going through the code again:

(1) When a method makes a copy (in copy and __getitem__), it references the class as OrderedSet. I think it would be better to use self.__class__ instead, which will handle the case when another object inherits from OrderedSet.
(2) The behavior of __eq__ seems a bit strange. It checks for order only if the other class is an OrderedSet, but in all other circumstances it does not. Why would order not matter for a list or tuple? My suggestion would be to always have __eq__ ignore order. This provides consistent behavior. If order is truly important for the comparison I would recommend casting the ordered set to a list first (or maybe even comparing its items).

Let me know on the last two points. If you like the idea then I'll make the changes. Otherwise I think the PR is good to go.

@rspeer
Copy link
Owner

rspeer commented Jun 12, 2018

I think your view of how doctests should work is different from mine. I don't think that all the tests should be in the docstrings. Seeing doctests with 'assert' and pytest decorators is very strange to me.

To me, the purpose of doctests is that docstrings should provide usage examples, and when you have usage examples, you should test that they are correct.

Also, I'm not sure about this, but I seem to recall that "old-school" doctest doesn't want the doctests to be indented by 4 spaces, it instead wants them to be at the same indentation level and separated by blank lines.

@Erotemic
Copy link
Contributor Author

I tend to be fairly liberal with my doctests (in order to develop more rapidly, I usually just paste whatever test code I wrote in IPython into the code as a first-pass test), but I understand your point, especially for a package like this that is not in rapid development. I thought you were saying that I could leave what I had as-is, but I'll simply move the ones that don't provide a simple example back into the tests.py file.

The current doctests in this PR work with the builtin Python doctest module. The indentation before the >>> doesn't matter, but the indentation after does. You can have indentation in standard builtin doctests, but its just a pain. If you'd like I can remove the Example: and subsequent indentation. This is a holdover from the fact that I usually write Google-style docstrings.

I'll make this change later tonight. Again, let me know about points (1) and (2).

@rspeer
Copy link
Owner

rspeer commented Jun 12, 2018

(1) Having the methods return another object of the same class sounds like a good idea.

(2) I agree that equality is inconsistent here, but it wouldn't work to ignore order in all comparisons. OrderedSets in different orders should be unequal.

I frequently use OrderedSet when I need a bidirectional mapping between a sparse vocabulary and a dense array, so I can use fast NumPy operations but also keep track of what the data means. This use case is somewhat like pd.Index in Pandas, except I want to be able to construct it incrementally and use set operations on it, and I want to be able to enforce uniqueness so that the bidirectional mapping is valid.

I sometimes need to test these vocabularies for equality, when merging data. If you have arrays labeled with the same vocabulary in the same order, you can just concatenate them on that axis. If the vocabularies are in a different order, you can't.

Another person who contributed via GitHub, Andrew Magee, added the case that allows comparing an OrderedSet to a built-in set for equality. I believe the main justification for this is that OrderedSet is supposed to be a subclass of set (actually of MutableSet).

So the use cases of OrderedSet include both ordered and unordered equality. Maybe there could be more cases for lists and tuples; if I did end up comparing an OrderedSet to a list, then the least surprising and most useful thing would be to preserve order. But I can't think of anything more principled than just checking a bunch of types and deciding whether they're ordered or not.

@Erotemic
Copy link
Contributor Author

Erotemic commented Jun 14, 2018

All of this makes sense.

I can think of a principled approach to the ordered check problem. Borrowing ideas from PR #24, we can use multiple inheritance to inherit from both collections.MutableSet and collections.Sequence. Then we can use the much more simple and elegant rule: if you are checking against a sequence, then care about order, otherwise don't.

I've gone ahead and implemented this because I think its a nice idea, but I can revert it if you disagree.

One bit of the implementation I was unsure of. Should:

OrderedSet([1, 2, 3]) == [1, 1, 2, 2, 3, 3] 

return True of False? My instinct is that it should return False. My current implementation uses this behavior.

Overall I think the new __eq__ implementation is better, but there are a few tricky cases:

  1. iterators do not inherit from Sequence, so they are assumed to be unordered
  2. collections.OrderedDict is not a Sequence, so it is also treated as unordered
  3. In Python2.7 collections.deque not a Sequence but in Python3 it is. So behavior is consistent between versions, I've added an explicit check in __eq__ to always treat queue as ordered.

I've added 4 new tests to test.py to make sure all of these cases are handled properly.

All tests pass (on my machine) on python 2.7 and python 3.6, and the code has 100% coverage. If you like the way that I've addressed the aforementioned issues then I feel this PR is in a merge-able state. If not, lets discuss.

@rspeer
Copy link
Owner

rspeer commented Jun 14, 2018

That seems like a pretty good idea. We were basically implementing Sequence anyway, might as well inherit from the base class.

When testing the methods that come from the base classes, I noticed an inconsistency from the fact that we have this order-sensitive __eq__ but __le__ comes from MutableSet:

>>> a = OrderedSet(['b', 'a'])
>>> b = OrderedSet(['a', 'b'])
>>> a < b
False
>>> a > b
False
>>> a == b
False

But that's been there since before this PR. I need to make __le__ ordered as well.

@rspeer
Copy link
Owner

rspeer commented Jun 14, 2018

And, I agree that OrderedSet([1, 2, 3]) != [1, 1, 2, 2, 3, 3].

@rspeer rspeer merged commit 8e6f096 into rspeer:master Jun 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants