allow non hashable data to be used in unique #244

CJ-Wright · 2019-05-01T19:57:21Z

No description provided.

codecov-io · 2019-05-01T20:14:03Z

Codecov Report

Merging #244 into master will increase coverage by 1.09%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #244      +/-   ##
==========================================
+ Coverage   93.86%   94.96%   +1.09%     
==========================================
  Files          13       13              
  Lines        1566     1609      +43     
==========================================
+ Hits         1470     1528      +58     
+ Misses         96       81      -15

Impacted Files	Coverage Δ
streamz/compatibility.py	`66.66% <0%> (-33.34%)`	⬇️
streamz/sources.py	`95.3% <0%> (-0.47%)`	⬇️
streamz/collection.py	`94.44% <0%> (ø)`	⬆️
streamz/dataframe/utils.py	`100% <0%> (ø)`	⬆️
streamz/dataframe/aggregations.py	`99.1% <0%> (ø)`	⬆️
streamz/dataframe/core.py	`92.56% <0%> (+0.07%)`	⬆️
streamz/graph.py	`89.89% <0%> (+18.47%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6445fa9...ba3c164. Read the comment docs.

martindurant

Some thoughts

martindurant · 2019-05-06T14:49:07Z

streamz/core.py

+                        self.seen = LRU(self.history, self.seen)
+            # if not hashable use deque (since it doesn't need a hash)
+            else:
+                self.seen = deque(maxlen=self.history)


With a deque, we get the specified number of items, not the specified number of unique items. I think that, if we are allowing now to process data to get something hashable, this branch is unnecessary.

To be fair the key kwarg was always there, just not documented.

The uniqueness is guaranteed by https://github.com/python-streamz/streamz/pull/244/files#diff-d4e512993f47710ea9e7155d66011a24R1080 not the deque itself.

martindurant · 2019-05-06T14:49:53Z

streamz/core.py

        y = self.key(x)
+        # If this is the first piece of data make the cache
+        if self.seen is None:
+            if isinstance(y, Hashable):


I don't think that behaviour should depend on what the first item happens to be: the cache should be set up in init

martindurant · 2019-05-06T14:50:26Z

streamz/core.py

+                self.seen = dict()
+                if self.history:
+                    # if it is hashable use LRU cache
+                    if isinstance(y, Hashable):


This seems to be an and, and I would say that the latter should always be True

martindurant · 2019-05-06T14:51:06Z

streamz/core.py


+    Parameters
+    ----------
+    history : int or None, optional


maxsize, I think

The parameter name changed, but not this doc line (sorry :))

martindurant · 2019-05-06T14:52:57Z

streamz/core.py

+    ----------
+    history : int or None, optional
+        number of stored unique values to check against
+    key : function, optional


You could call this hashfunc to make it clear: it is not a key (in the example, this would be simply "a"), and the purpose is to make the data hashable.

CJ-Wright · 2019-05-06T23:07:20Z

The reason why I put in this PR was that I found that specifying hash functions for each type, which is equatable but not hashable, to be frustrating. My approach was to extend the existing unique node class to provide support for non-hashable types.

martindurant · 2019-05-06T23:14:56Z

That does make more sense, sorry I misunderstood in the first place

martindurant · 2019-05-07T15:10:36Z

Rather than testing whether the first input happend to be a Hashable, do you think it may be simpler to provide another argument? I'm thinking that there will be cases where the first item is hashable but subsequent ones are not.

CJ-Wright · 2019-05-07T15:14:12Z

I can do that, although I'm not certain how much support we should give to heterogeneous streams, since usually functions don't handle more than one kind of data.

martindurant · 2019-05-07T17:06:19Z

streamz/core.py:3:1: F401 'collections.Hashable' imported but unused

martindurant · 2019-05-17T17:27:49Z

Coming back to this now.

Are you certain that deque is the right structure here? I ask, because if we only test with in, then the position in the q is not updated, something can drop out of the history despite being seen relatively recently.

CJ-Wright · 2019-05-18T15:02:09Z

Ah now I understand what you mean. Maybe we could refresh the position of the element in the queue. I like the deque structure because it handles things falling out of history nicely (we don't need to explicitly remove anything).

martindurant · 2019-05-19T22:07:01Z

like the deque structure because it handles things falling out

Right, but if the order isn't useful in this sense, or we have to move things around within a q, may as well use list.

In fact, it makes me wonder about the in operation in this sense - this will defer to the hash() function by default anyway, right (i.e., check is against each element)? So I'm not certain this has the right behaviour in any case for the non-hashable, non-hashfunction case.

martindurant · 2019-05-28T20:30:30Z

ping

CJ-Wright · 2019-05-30T15:11:59Z

I think that in checks equality if hash doesn't exist. For instance {1:2} in [{1:2}, {3:4}] is a valid statement, even though hash({1:2}) is not.

martindurant · 2019-05-30T15:27:16Z

mmm, ok.

Then I would request that this code be changed to keep a list instead of a deque, so that the correct order of historical uniques can be maintained.

CJ-Wright · 2019-05-30T15:42:26Z

Ok, Do you have any suggestions on how to do the data dropping mechanic?

martindurant · 2019-05-30T15:48:50Z

I'm afraid you'll have to do del seen[history:] every time; insert to the front and move to the front (seen.remove(item); seen.insert(0, item)). Or the same from the other end, if you prefer. This does not scale very well for very big lists.

(note that I would prefer to rename history, in a comment above)

CJ-Wright · 2019-05-30T16:39:50Z

Ah, so I went back to produce a test which would get at the bug that you found with my previous implementation. However, I think the behavior for hashable and non-hashable are the same. The issue here is that checking membership of a thing in the LRU cache doesn't count as a "use" of the element.

For instance:

from zict import LRU
lru = LRU(2, {}, on_evict=lambda k, v: print("Lost", k, v))
lru['a'] = 1
lru['b'] = 1
'a' in lru
lru['c'] = 1

will print that a has been lost.

We could change the current behavior so that the history gets updated.

martindurant · 2019-05-30T16:42:23Z

'a' in lru -> lru['a']?

CJ-Wright · 2019-05-30T16:51:27Z

How about lru.get('a', None) that way we don't need to wrap it in a try except and if the result is None we know we need to add it to the cache.

martindurant · 2019-05-30T17:14:57Z

That's fine too

CJ-Wright · 2019-05-31T17:29:47Z

@martindurant I think this is ready for review again, I'm not certain why py2.7 is failing.

martindurant · 2019-05-31T17:33:29Z

Hm, I don't seem to get notifications of commits on this PR.
For py2 (although I don't really care):

tests that depend on distributed (master) should be skipped as it may contain py3-only syntax
the rest looks like circular imports, which can likely be fixed by deferring, moving some imports into the methods that actually need them

CJ-Wright · 2019-05-31T17:53:01Z

I don't care about the py2k tests either.

martindurant · 2019-05-31T18:04:19Z

OK, so either those things should be fixed/skipped (I doubt it's more than half hour's work), or the build matrix should be updated to run only py36/7 - and then also the setup.py should reflect this change; that means we would need to remember to update the feedstock also when the time comes.

martindurant · 2019-05-31T20:59:10Z

Done.

allow non hashable data to be used in unique

f81489b

martindurant reviewed May 6, 2019

View reviewed changes

use hashable kwarg

2301991

remove unused import

1f425f3

add history test and change history to maxsize

cb328f2

handle unique maxsize properly

8c5679f

py3k only

ba3c164

martindurant merged commit ffed062 into python-streamz:master May 31, 2019

CJ-Wright deleted the unique_non_hashable branch May 31, 2019 21:16

CJ-Wright mentioned this pull request Jun 13, 2019

unique fails on dicts #140

Closed

allow non hashable data to be used in unique #244

allow non hashable data to be used in unique #244

Uh oh!

Conversation

CJ-Wright commented May 1, 2019

Uh oh!

codecov-io commented May 1, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

martindurant left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

CJ-Wright commented May 6, 2019

Uh oh!

martindurant commented May 6, 2019

Uh oh!

martindurant commented May 7, 2019

Uh oh!

CJ-Wright commented May 7, 2019

Uh oh!

martindurant commented May 7, 2019

Uh oh!

martindurant commented May 17, 2019

Uh oh!

CJ-Wright commented May 18, 2019

Uh oh!

martindurant commented May 19, 2019

Uh oh!

martindurant commented May 28, 2019

Uh oh!

CJ-Wright commented May 30, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

martindurant commented May 30, 2019

Uh oh!

CJ-Wright commented May 30, 2019

Uh oh!

martindurant commented May 30, 2019

Uh oh!

CJ-Wright commented May 30, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

martindurant commented May 30, 2019

Uh oh!

CJ-Wright commented May 30, 2019

Uh oh!

martindurant commented May 30, 2019

Uh oh!

CJ-Wright commented May 31, 2019

Uh oh!

martindurant commented May 31, 2019

Uh oh!

CJ-Wright commented May 31, 2019

Uh oh!

martindurant commented May 31, 2019

Uh oh!

martindurant commented May 31, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

codecov-io commented May 1, 2019 •

edited

Loading

CJ-Wright commented May 30, 2019 •

edited

Loading

CJ-Wright commented May 30, 2019 •

edited

Loading