Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve skip key #47

Open
CarstVaartjes opened this issue Sep 22, 2015 · 5 comments
Open

Improve skip key #47

CarstVaartjes opened this issue Sep 22, 2015 · 5 comments

Comments

@CarstVaartjes
Copy link
Member

It's a slightly ugly method to handle irrelevant results (because of filtering) during a groupby:

  • finding it can be slow as it's now (now cythonized)
  • getmask would be nicer; also have to look how bcolz iterblocks handles filters
  • how it's removed at the end (manipulating a list of ndarrays) is not performance efficient
@waylonflinn
Copy link
Contributor

Would whereblocks be a good way to handle this?
I'm working on a PR that will add pivot table style aggregation, and I'm pursuing a method that would make extensive use of the existing filtering functionality. If there are speed gains to be had, I'd like to pursue them.

@CarstVaartjes
Copy link
Member Author

whereblocks aren't cython based; @FrancescElies his work on iterblocks will work (and make it a lot more readable) but we still have to apply the filter ourselves :/
we could do a np.getmask on the chunk array but not sure if that will make it quicker or not than just looping through it. Also depends on the amount that you filter out of course

@waylonflinn
Copy link
Contributor

I found this excellent investigation of timings from @FrancescElies: Comment in Chunks Class iterator, PR 153

It looks like he's just using the top level iterblocks though: test_v5 in bench_iter_carray

Isn't that defined in python here? iterblocks in toplevel.py

@waylonflinn
Copy link
Contributor

I'm also happy to put together a PR that reproduces his earlier PR, since that will likely be hard to merge. Would that be helpful?

@CarstVaartjes
Copy link
Member Author

@FrancescAlted also did some work to improve performance here (also with tuples vs namedtuples), I will check if we can use that to improve this and save ourselves unneeded chunk decompressions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants