Dataset types use __nonzero/bool__ method for truthiness #992
Conversation
""" | ||
return 1 | ||
def nonzero(cls, dataset): | ||
return True | ||
|
This is a definite improvement! Hardcoding nonzero is vastly better than hardcoding length. Even so, is there no way to determine the actual value of nonzero in a way that doesn't load the entire dataset?
I tried various things such as using:

    try:
        next(df.iterrows())
    except StopIteration:
        nonzero = False
    else:
        nonzero = True
But all of these approaches still take a considerable amount of time.
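The pattern above generalizes to any iterable: pulling a single item tests for emptiness without ever computing the full length. A minimal sketch (the `is_nonempty` helper is hypothetical, not part of this PR):

```python
_SENTINEL = object()

def is_nonempty(iterable):
    """Return True if the iterable yields at least one item.

    Only a single element is pulled, so the full sequence is never
    materialized; for a lazy dask dataframe, however, even producing
    that first row can trigger substantial computation.
    """
    return next(iter(iterable), _SENTINEL) is not _SENTINEL
```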
Maybe worth asking the dask maintainers if there is a quick method for testing nonzero, then.
I've asked, hopefully they'll have a good solution.
Matt Rocklin suggested using `.head` and checking the length of that, though it is no faster than my solution above.
Discussed it properly now, and it's fairly clear that there won't be a cheap general solution here. A dask dataframe can be the result of a bunch of chained operations which all have to be evaluated before the length or even a nonzero length can be determined.
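The cost can be illustrated with a toy stand-in for a lazy, chained computation (pure Python, not dask itself; the `LazyFrame` class is purely illustrative): each chained operation defers work, and asking for the length forces the entire chain to run.

```python
class LazyFrame:
    """Toy stand-in for a lazy dask-style dataframe (illustration only)."""

    def __init__(self, source, ops=()):
        self._source = source  # factory producing the underlying data
        self._ops = ops        # chain of deferred operations

    def map(self, fn):
        # Chaining only records the operation; nothing is computed yet.
        return LazyFrame(self._source, self._ops + (fn,))

    def compute(self):
        # Evaluation pulls the data and replays the whole chain.
        data = list(self._source())
        for fn in self._ops:
            data = [fn(x) for x in data]
        return data

    def __len__(self):
        # The length is only knowable after evaluating the entire chain,
        # so len() is as expensive as compute().
        return len(self.compute())
```

Here `len(frame)` costs as much as `frame.compute()`, which is exactly why truthiness on the dask interface should not fall back to `__len__`.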
    while i < len(self):
        yield tuple(self.data[i, ...])
        i += 1
Not sure of the implications of deleting this.
This was old and broken because it assumed the data was always an array. There is an existing issue somewhere about adding an iterator method to all the Dataset interfaces.
Looks great! Happy to see it merged.
Very happy with this improvement and tests have passed. Merging.
This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
As discussed in #988, this PR implements the `__nonzero__` (Py2) and `__bool__` (Py3) methods on Dataset, which are used for truthiness checks instead of `__len__`. This ensures that the dask interface can implement the `__len__` method correctly, without forcing code throughout HoloViews to compute the length of a large out-of-core dask dataframe all the time.
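The mechanism can be sketched in isolation: when a class defines `__bool__` (or `__nonzero__` on Py2), truthiness checks like `if dataset:` use it directly and never fall back to `__len__`. A minimal illustration (the `LazyDataset` class is hypothetical, not HoloViews code):

```python
class LazyDataset:
    """Sketch: truthiness short-circuits without touching __len__."""

    def __init__(self):
        self.len_calls = 0  # count how often the length is computed

    def __len__(self):
        # Imagine this triggers an expensive out-of-core computation.
        self.len_calls += 1
        return 1_000_000

    def __bool__(self):          # Python 3
        return True

    __nonzero__ = __bool__       # Python 2 spelling of the same hook
```

With this in place, `if ds:` succeeds immediately, and the expensive `__len__` only runs when the length is explicitly requested.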