allow non hashable data to be used in unique #244
Codecov Report
@@            Coverage Diff             @@
##           master     #244      +/-   ##
==========================================
+ Coverage   93.86%   94.96%   +1.09%
==========================================
  Files          13       13
  Lines        1566     1609      +43
==========================================
+ Hits         1470     1528      +58
+ Misses         96       81      -15
|
martindurant
left a comment
Some thoughts
streamz/core.py
Outdated
| self.seen = LRU(self.history, self.seen) | ||
| # if not hashable use deque (since it doesn't need a hash) | ||
| else: | ||
| self.seen = deque(maxlen=self.history) |
With a deque, we get the specified number of items, not the specified number of unique items. I think that, if we now allow the data to be processed into something hashable, this branch is unnecessary.
To be fair the key kwarg was always there, just not documented.
The uniqueness is guaranteed by https://github.com/python-streamz/streamz/pull/244/files#diff-d4e512993f47710ea9e7155d66011a24R1080 not the deque itself.
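To illustrate the point about the membership check, here is a minimal standalone sketch (not the code in this PR; `unique_with_deque` is a hypothetical name). It assumes the history is a `deque` with `maxlen`, and that a value is appended only when an equality-based `in` check fails, so the deque only ever holds distinct values.

```python
from collections import deque


def unique_with_deque(items, history):
    """Emit values not found in a bounded history (hypothetical helper)."""
    seen = deque(maxlen=history)
    emitted = []
    for x in items:
        # Membership on a deque uses ==, so x does not need to be hashable.
        if x not in seen:
            seen.append(x)  # the oldest entry drops off once maxlen is hit
            emitted.append(x)
    return emitted, list(seen)


# Lists are equatable but not hashable.
emitted, seen = unique_with_deque([[1], [1], [2], [1], [3], [1]], history=2)
print(emitted)  # [[1], [2], [3], [1]]
print(seen)     # [[3], [1]]
```

Note that `[1]` is emitted a second time at the end: seeing it again earlier did not refresh its position, so it fell out of the history when `[3]` arrived.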
streamz/core.py
Outdated
| y = self.key(x) | ||
| # If this is the first piece of data make the cache | ||
| if self.seen is None: | ||
| if isinstance(y, Hashable): |
I don't think that behaviour should depend on what the first item happens to be: the cache should be set up in init
streamz/core.py
Outdated
| self.seen = dict() | ||
| if self.history: | ||
| # if it is hashable use LRU cache | ||
| if isinstance(y, Hashable): |
This seems to be an and, and I would say that the latter should always be True
| Parameters | ||
| ---------- | ||
| history : int or None, optional |
maxsize, I think
The parameter name changed, but not this doc line (sorry :))
| ---------- | ||
| history : int or None, optional | ||
| number of stored unique values to check against | ||
| key : function, optional |
You could call this hashfunc to make it clear: it is not a key (in the example, this would be simply "a"), and the purpose is to make the data hashable.
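As a rough illustration of a function whose purpose is to make the data hashable, the sketch below uses a hypothetical `to_hashable` helper and a plain-Python `unique` function. Neither is the streamz API; the assumption is only that each record is a flat dict with hashable values.

```python
def to_hashable(record):
    """Hypothetical helper: reduce a flat dict to a hashable stand-in."""
    return frozenset(record.items())


def unique(items, hashfunc=lambda x: x):
    """Plain-Python sketch: keep the first item for each distinct hash value."""
    seen = set()
    out = []
    for x in items:
        h = hashfunc(x)  # only the stand-in is stored, never the raw item
        if h not in seen:
            seen.add(h)
            out.append(x)
    return out


records = [{"a": 1}, {"a": 1}, {"a": 2}]
result = unique(records, hashfunc=to_hashable)
print(result)  # [{'a': 1}, {'a': 2}]
```

The original dicts pass through unchanged; the function only decides what is compared.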
|
The reason why I put in this PR was that I found that specifying hash functions for each type, which is equatable but not hashable, to be frustrating. My approach was to extend the existing |
|
That does make more sense, sorry I misunderstood in the first place |
|
Rather than testing whether the first input happened to be a Hashable, do you think it may be simpler to provide another argument? I'm thinking that there will be cases where the first item is hashable but subsequent ones are not. |
|
I can do that, although I'm not certain how much support we should give to heterogeneous streams, since usually functions don't handle more than one kind of data. |
|
|
Coming back to this now. Are you certain that |
|
Ah now I understand what you mean. Maybe we could refresh the position of the element in the queue. I like the deque structure because it handles things falling out of history nicely (we don't need to explicitly remove anything). |
Right, but if the order isn't useful in this sense, or we have to move things around within a q, may as well use list. In fact, it makes me wonder about the |
|
ping |
|
I think that |
|
mmm, ok. Then I would request that this code be changed to keep a list instead of a deque, so that the correct order of historical uniques can be maintained. |
|
OK, do you have any suggestions on how to do the data dropping mechanic? |
|
I'm afraid you'll have to do (note that I would prefer to rename |
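One way the list-based dropping and refreshing discussed here might look, as a sketch only and not necessarily the code suggested in this thread. The names `unique_with_list` and `maxsize` are assumptions for illustration.

```python
def unique_with_list(items, maxsize):
    """Emit unseen values; keep a bounded, recency-ordered list as history."""
    seen = []  # oldest first, most recently seen last
    emitted = []
    for x in items:
        if x in seen:
            # Already known: refresh its position instead of emitting it.
            seen.remove(x)
            seen.append(x)
        else:
            seen.append(x)
            emitted.append(x)
            if len(seen) > maxsize:
                del seen[0]  # explicitly drop the least recently seen value
    return emitted, seen


emitted, seen = unique_with_list([[1], [2], [1], [3], [1]], maxsize=2)
print(emitted)  # [[1], [2], [3]]
print(seen)     # [[3], [1]]
```

Because the repeated `[1]` is moved to the end of the list, `[2]` is the value dropped when `[3]` arrives, and the final `[1]` is still recognised as seen.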
|
Ah, so I went back to produce a test which would get at the bug that you found with my previous implementation. However, I think the behavior for hashable and non-hashable is the same. The issue here is that checking membership of a thing in the
For instance:
from zict import LRU
lru = LRU(2, {}, on_evict=lambda k, v: print("Lost", k, v))
lru['a'] = 1
lru['b'] = 1
'a' in lru
lru['c'] = 1
will print that
We could change the current behavior so that the history gets updated. |
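For the hashable case, a history that does get updated on a repeat could be sketched with the standard library alone. This is an illustration built on `collections.OrderedDict`, not zict's `LRU`; the class name `RefreshingHistory` and its `is_new` method are hypothetical.

```python
from collections import OrderedDict


class RefreshingHistory:
    """Hypothetical bounded history that refreshes a key when it is seen again."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self._seen = OrderedDict()  # oldest key first
        self.evicted = []           # recorded only so the effect is visible

    def is_new(self, key):
        if key in self._seen:
            # A repeat counts as a use: move the key to the newest position.
            self._seen.move_to_end(key)
            return False
        self._seen[key] = 1
        if len(self._seen) > self.maxsize:
            oldest, _ = self._seen.popitem(last=False)
            self.evicted.append(oldest)
        return True


history = RefreshingHistory(2)
results = [history.is_new(k) for k in ["a", "b", "a", "c", "a"]]
print(results)          # [True, True, False, True, False]
print(history.evicted)  # ['b']
```

With the refresh in place, adding `"c"` evicts `"b"` and `"a"` survives, since `"a"` was the more recently seen of the two.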
|
|
|
How about |
|
That's fine too |
|
@martindurant I think this is ready for review again, I'm not certain why py2.7 is failing. |
|
Hm, I don't seem to get notifications of commits on this PR.
|
|
I don't care about the py2k tests either. |
|
OK, so either those things should be fixed/skipped (I doubt it's more than half an hour's work), or the build matrix should be updated to run only py36/7 - and then also the setup.py should reflect this change; that means we would need to remember to update the feedstock also when the time comes. |
|
Done. |
No description provided.