
Dataset types use __nonzero/bool__ method for truthiness #992

Merged
merged 1 commit into master from interface_nonzero on Nov 29, 2016

Conversation

philippjfr (Contributor) commented Nov 29, 2016

As discussed in #988, this PR implements the __nonzero__ (Py2) and __bool__ (Py3) methods on Dataset, which are now used for truthiness instead of __len__. This ensures that the dask interface can implement __len__ correctly, without forcing code throughout holoviews to compute the length of a large out-of-core dask dataframe every time a Dataset is checked for truthiness.
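For illustration, a minimal sketch (not the actual holoviews implementation) of the mechanism: once __bool__/__nonzero__ are defined, `if dataset:` no longer falls back to __len__, so truthiness checks never force a length computation on an out-of-core backend.

```python
class Dataset(object):
    """Minimal sketch; the real Dataset delegates to its data interface."""

    def __init__(self, data, interface):
        self.data = data
        self.interface = interface

    def __len__(self):
        # Potentially expensive: an out-of-core dask dataframe has to be
        # evaluated before its length is known.
        return self.interface.length(self)

    def __bool__(self):
        # Cheap truthiness check; used by `if dataset:` instead of __len__.
        return self.interface.nonzero(self)

    __nonzero__ = __bool__  # Python 2 spelling
```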

philippjfr added the data label on Nov 29, 2016

"""
return 1
def nonzero(cls, dataset):
return True


jbednar (Contributor) commented Nov 29, 2016

This is a definite improvement! Hardcoding nonzero is vastly better than hardcoding length. Even so, is there no way to determine the actual value of nonzero in a way that doesn't load the entire dataset?


philippjfr (Author, Contributor) commented Nov 29, 2016

I tried various things such as using:

try:
    # Pull a single row off the dataframe; an empty frame raises StopIteration
    next(df.iterrows())
except StopIteration:
    nonzero = False
else:
    nonzero = True

But all of these approaches still take a considerable amount of time.


jbednar (Contributor) commented Nov 29, 2016

Maybe worth asking the dask maintainers if there is a quick method for testing nonzero, then.


philippjfr (Author, Contributor) commented Nov 29, 2016

I've asked, hopefully they'll have a good solution.


philippjfr (Author, Contributor) commented Nov 29, 2016

Matt Rocklin suggested using .head and checking the length of that, but it's not any faster than my solution above.
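For reference, a minimal sketch of the suggested check (the dataframe here is made up for illustration); .head(1) only materializes a single row from the first partition, but that partition's task graph still has to run:

```python
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({'x': range(100)}), npartitions=4)

# head(1) computes just enough of the graph to return one row from the
# first partition, so it is cheaper than len(ddf) but still not free.
nonzero = len(ddf.head(1)) > 0
```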


philippjfr (Author, Contributor) commented Nov 29, 2016

Having discussed it properly now, it's fairly clear that there won't be a cheap general solution here. A dask dataframe can be the result of a chain of operations which all have to be evaluated before the length, or even whether the length is nonzero, can be determined.
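To illustrate the point with a made-up pipeline (not from the PR): every lazy operation upstream of the check has to be executed, for at least one partition, before dask can say whether any rows exist.

```python
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({'x': range(1000)}), npartitions=10)

# A chain of lazy operations; nothing has been computed yet.
filtered = ddf[ddf.x % 7 == 0]
doubled = filtered.assign(y=filtered.x * 2)

# Even a single-row check forces the filter and assign steps to run on
# at least the first partition, and a selective filter may leave early
# partitions empty, making the check unreliable or more expensive.
has_rows = len(doubled.head(1)) > 0
```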

while i < len(self):
    yield tuple(self.data[i, ...])
    i += 1



jbednar (Contributor) commented Nov 29, 2016

Not sure of the implications of deleting this.

philippjfr (Author, Contributor) commented Nov 29, 2016

This was old and broken because it assumed the data was always an array. There is an existing issue somewhere about adding an iterator method to all the Dataset interfaces.

jbednar (Contributor) commented Nov 29, 2016

Looks great! Happy to see it merged.

jlstevens (Contributor) commented Nov 29, 2016

Very happy with this improvement, and the tests have passed. Merging.

jlstevens merged commit 8c03983 into master on Nov 29, 2016

4 checks passed

continuous-integration/travis-ci/pr: The Travis CI build passed
continuous-integration/travis-ci/push: The Travis CI build passed
coverage/coveralls: Coverage increased (+0.02%) to 75.778%
s3-reference-data-cache: Test data is cached.

philippjfr deleted the interface_nonzero branch on Dec 10, 2016
