add an option to cache selection in memory #13

Closed
bombrun opened this issue Nov 22, 2016 · 2 comments

Comments

bombrun commented Nov 22, 2016

It seems that vaex is not caching selections in memory. I think it would be useful to have an option to cache a selection in memory. That would avoid going through all the data again when working on a subset.

Memory limits would need to be managed, for example by setting a memory buffer limit. The user should be warned if a selection exceeds the buffer limit.
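
To make the request concrete, here is a minimal sketch of what a memory-bounded selection cache with a warning could look like. It is purely illustrative and is not vaex code: `BoundedSelectionCache`, `max_bytes`, `put` and `get` are hypothetical names, and the sketch assumes a mask costs one byte per row and that least-recently-used eviction is acceptable.

```python
import warnings
from collections import OrderedDict


class BoundedSelectionCache:
    """Hypothetical LRU cache for selection masks with a byte budget (not a vaex API)."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._masks = OrderedDict()  # name -> bytearray mask, least recently used first

    def put(self, name, mask):
        size = len(mask)  # assumption: one byte per row
        if size > self.max_bytes:
            # The selection can never fit, so warn and skip caching.
            warnings.warn(
                f"selection {name!r} ({size} bytes) exceeds the buffer limit; not cached"
            )
            return False
        if name in self._masks:
            # Replacing an existing entry: release its bytes first.
            self.used_bytes -= len(self._masks.pop(name))
        # Evict least recently used masks until the new one fits.
        while self.used_bytes + size > self.max_bytes:
            _, evicted = self._masks.popitem(last=False)
            self.used_bytes -= len(evicted)
        self._masks[name] = mask
        self.used_bytes += size
        return True

    def get(self, name):
        mask = self._masks.get(name)
        if mask is not None:
            self._masks.move_to_end(name)  # mark as recently used
        return mask
```

A caller would fall back to recomputing the selection whenever `get` returns `None`.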

@maartenbreddels
Member

Actually it should, but only for 'named' selections. For instance,

ds.plot("x", "y", selection="z > 0")

would not cache anything, but

ds.select("z > 0", name="zpos")
ds.plot("x", "y", selection="zpos")

should cache it, so executing the plot a second time should go faster.
Although caching all selections with some memory limit would be an option, I want to leave it out to keep vaex as simple as possible. I hope this will suffice for the moment.
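
As a rough illustration of the difference between the two cases, here is a toy sketch. It is not how vaex is implemented: `TinyDataset` and its `evaluations` counter are hypothetical, predicates are plain Python callables instead of expression strings, and the mask is computed eagerly in `select` for simplicity.

```python
class TinyDataset:
    """Toy stand-in for a dataset; hypothetical, not the vaex implementation."""

    def __init__(self, columns):
        self.columns = columns
        self.n_rows = len(next(iter(columns.values())))
        self._named_masks = {}  # selection name -> cached list of booleans
        self.evaluations = 0    # number of full passes over the data

    def _evaluate(self, predicate):
        # One full pass: build a row dict per index and apply the predicate.
        self.evaluations += 1
        return [
            predicate({name: values[i] for name, values in self.columns.items()})
            for i in range(self.n_rows)
        ]

    def select(self, predicate, name):
        # Named selection: evaluate once and keep the mask.
        self._named_masks[name] = self._evaluate(predicate)

    def count(self, selection):
        if isinstance(selection, str):
            mask = self._named_masks[selection]  # reuse the cached mask
        else:
            mask = self._evaluate(selection)     # unnamed: recompute every time
        return sum(mask)


ds = TinyDataset({"x": [1, 2, 3, 4], "z": [-1.0, 0.5, 2.0, -3.0]})
ds.select(lambda row: row["z"] > 0, name="zpos")  # one pass, mask stored
named_first = ds.count("zpos")                    # no extra pass
named_second = ds.count("zpos")                   # no extra pass
unnamed = ds.count(lambda row: row["z"] > 0)      # one more pass
```

In this sketch the two named counts reuse the stored mask, while the unnamed count walks the data again, which is the behaviour described above.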

However, while testing the performance of this, I found that it was not being cached, and I have fixed that. Don't expect massive speedups for simple selections, though: a selection currently means the data has to be copied (this may change in the future), which carries a performance penalty.

I'll get back to you on this. In combination with performance improvements for multiple plots, you may see some speedup, which should also fix #10.

@maartenbreddels
Member

This is working well now (using arctan2 since it's quite expensive to calculate):

ds = vaex.datasets.helmi_de_zeeuw.fetch()
ds.select("arctan2(y,x) > 0")
%%timeit -r5
ds.count(selection=True)

Gives:

10 loops, best of 5: 18.5 ms per loop

While using an unnamed selection (which will not get cached):

%%timeit -r5
ds.count(selection="arctan2(y,x) > 0")

Gives:

1 loop, best of 5: 120 ms per loop

That is roughly a 6.5x speedup.
