Skip to content

Conversation

@CJ-Wright
Copy link
Member

No description provided.

@codecov-io
Copy link

codecov-io commented May 1, 2019

Codecov Report

Merging #244 into master will increase coverage by 1.09%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #244      +/-   ##
==========================================
+ Coverage   93.86%   94.96%   +1.09%     
==========================================
  Files          13       13              
  Lines        1566     1609      +43     
==========================================
+ Hits         1470     1528      +58     
+ Misses         96       81      -15
Impacted Files Coverage Δ
streamz/compatibility.py 66.66% <0%> (-33.34%) ⬇️
streamz/sources.py 95.3% <0%> (-0.47%) ⬇️
streamz/collection.py 94.44% <0%> (ø) ⬆️
streamz/dataframe/utils.py 100% <0%> (ø) ⬆️
streamz/dataframe/aggregations.py 99.1% <0%> (ø) ⬆️
streamz/dataframe/core.py 92.56% <0%> (+0.07%) ⬆️
streamz/graph.py 89.89% <0%> (+18.47%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6445fa9...ba3c164. Read the comment docs.

Copy link
Member

@martindurant martindurant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some thoughts

streamz/core.py Outdated
self.seen = LRU(self.history, self.seen)
# if not hashable use deque (since it doesn't need a hash)
else:
self.seen = deque(maxlen=self.history)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With a deque, we get the specified number of items, not the specified number of unique items. I think that, if we are allowing now to process data to get something hashable, this branch is unnecessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be fair the key kwarg was always there, just not documented.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

streamz/core.py Outdated
y = self.key(x)
# If this is the first piece of data make the cache
if self.seen is None:
if isinstance(y, Hashable):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that behaviour should depend on what the first item happens to be: the cache should be set up in init

streamz/core.py Outdated
self.seen = dict()
if self.history:
# if it is hashable use LRU cache
if isinstance(y, Hashable):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be an and, and I would say that the latter should always be True

Parameters
----------
history : int or None, optional
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maxsize, I think

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameter name changed, but not this doc line (sorry :))

----------
history : int or None, optional
number of stored unique values to check against
key : function, optional
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could call this hashfunc to make it clear: it is not a key (in the example, this would be simply "a"), and the purpose is to make the data hashable.

@CJ-Wright
Copy link
Member Author

The reason why I put in this PR was that I found that specifying hash functions for each type, which is equatable but not hashable, to be frustrating. My approach was to extend the existing unique node class to provide support for non-hashable types.

@martindurant
Copy link
Member

That does make more sense, sorry I misunderstood in the first place

@martindurant
Copy link
Member

Rather than testing whether the first input happend to be a Hashable, do you think it may be simpler to provide another argument? I'm thinking that there will be cases where the first item is hashable but subsequent ones are not.

@CJ-Wright
Copy link
Member Author

I can do that, although I'm not certain how much support we should give to heterogeneous streams, since usually functions don't handle more than one kind of data.

@martindurant
Copy link
Member

streamz/core.py:3:1: F401 'collections.Hashable' imported but unused

@martindurant
Copy link
Member

Coming back to this now.

Are you certain that deque is the right structure here? I ask, because if we only test with in, then the position in the q is not updated, something can drop out of the history despite being seen relatively recently.

@CJ-Wright
Copy link
Member Author

Ah now I understand what you mean. Maybe we could refresh the position of the element in the queue. I like the deque structure because it handles things falling out of history nicely (we don't need to explicitly remove anything).

@martindurant
Copy link
Member

like the deque structure because it handles things falling out

Right, but if the order isn't useful in this sense, or we have to move things around within a q, may as well use list.

In fact, it makes me wonder about the in operation in this sense - this will defer to the hash() function by default anyway, right (i.e., check is against each element)? So I'm not certain this has the right behaviour in any case for the non-hashable, non-hashfunction case.

@martindurant
Copy link
Member

ping

@CJ-Wright
Copy link
Member Author

CJ-Wright commented May 30, 2019

I think that in checks equality if hash doesn't exist. For instance {1:2} in [{1:2}, {3:4}] is a valid statement, even though hash({1:2}) is not.

@martindurant
Copy link
Member

mmm, ok.

Then I would request that this code be changed to keep a list instead of a deque, so that the correct order of historical uniques can be maintained.

@CJ-Wright
Copy link
Member Author

Ok, Do you have any suggestions on how to do the data dropping mechanic?

@martindurant
Copy link
Member

I'm afraid you'll have to do del seen[history:] every time; insert to the front and move to the front (seen.remove(item); seen.insert(0, item)). Or the same from the other end, if you prefer. This does not scale very well for very big lists.

(note that I would prefer to rename history, in a comment above)

@CJ-Wright
Copy link
Member Author

CJ-Wright commented May 30, 2019

Ah, so I went back to produce a test which would get at the bug that you found with my previous implementation. However, I think the behavior for hashable and non-hashable are the same. The issue here is that checking membership of a thing in the LRU cache doesn't count as a "use" of the element.

For instance:

from zict import LRU
lru = LRU(2, {}, on_evict=lambda k, v: print("Lost", k, v))
lru['a'] = 1
lru['b'] = 1
'a' in lru
lru['c'] = 1

will print that a has been lost.

We could change the current behavior so that the history gets updated.

@martindurant
Copy link
Member

'a' in lru -> lru['a']?

@CJ-Wright
Copy link
Member Author

How about lru.get('a', None) that way we don't need to wrap it in a try except and if the result is None we know we need to add it to the cache.

@martindurant
Copy link
Member

That's fine too

@CJ-Wright
Copy link
Member Author

@martindurant I think this is ready for review again, I'm not certain why py2.7 is failing.

@martindurant
Copy link
Member

Hm, I don't seem to get notifications of commits on this PR.
For py2 (although I don't really care):

  • tests that depend on distributed (master) should be skipped as it may contain py3-only syntax
  • the rest looks like circular imports, which can likely be fixed by deferring, moving some imports into the methods that actually need them

@CJ-Wright
Copy link
Member Author

I don't care about the py2k tests either.

@martindurant
Copy link
Member

OK, so either those things should be fixed/skipped (I doubt it's more than half hour's work), or the build matrix should be updated to run only py36/7 - and then also the setup.py should reflect this change; that means we would need to remember to update the feedstock also when the time comes.

@martindurant
Copy link
Member

Done.

@martindurant martindurant merged commit ffed062 into python-streamz:master May 31, 2019
@CJ-Wright CJ-Wright deleted the unique_non_hashable branch May 31, 2019 21:16
@CJ-Wright CJ-Wright mentioned this pull request Jun 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants