Filtering grouped data is slow #3294
Comments
|
Indeed, we'd need to scan the expression for filter functions (or maybe any non-operator functions) to avoid this problem. This will be better with #2295, but still not as fast as the ungrouped version. |
|
In a discussion I had with @alandipert, this idea came up: if dplyr had implementations of common functions that were group-aware, then if For example, imagine this:
mtcars %>%
  group_by(cyl) %>%
  filter(mpg > mean(mpg))
If dplyr had a version of |
|
This sounds very similar to hybrid evaluation to me: https://github.com/tidyverse/dplyr/blob/master/vignettes/internals/hybrid-evaluation.Rmd. Here, we know what
For the use case of
Currently, we always process in groups; we don't scan the expression to see whether it contains only group-aware (or perhaps "mutating" or "creating", http://r4ds.had.co.nz/transform.html#mutate-funs?) functions. This would be a useful shortcut that would speed up many scenarios.
I like the idea; I've opened #3326 and am closing this issue. I hope I got the idea right. |
|
My understanding of #3226 is that it's somewhat different from what you (and I) have described here, with group-aware functions. But it's possible I'm just not understanding "mutating" or "creating" functions correctly. I could imagine it working something like this: if it finds that all the functions are group-aware, then it calls a special version of them and passes them modified vectors, which have a group index attached as an attribute. |
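The mechanism described above could be sketched roughly as follows in base R. This is only an illustration of the idea, not how dplyr works: `group_aware_mean()` and the `"group_index"` attribute are hypothetical names invented for this sketch, and it assumes the group index is a dense integer vector running from 1 to the number of groups.

```r
# Hypothetical group-aware mean: NOT a dplyr function.
# It expects a numeric vector carrying an integer "group_index" attribute
# (assumed dense, 1..k) and returns, for every element, the mean of its group.
group_aware_mean <- function(x) {
  idx <- attr(x, "group_index")
  stopifnot(is.integer(idx), length(idx) == length(x))
  values <- as.numeric(x)                   # drop the attribute before doing arithmetic
  sums <- as.vector(rowsum(values, idx))    # one sum per group, ordered by index
  counts <- tabulate(idx, nbins = length(sums))
  (sums / counts)[idx]                      # broadcast each group mean back to its rows
}

# Attach the group index once, then evaluate the filter condition on the
# whole column in a single vectorised pass instead of once per group.
mpg <- mtcars$mpg
attr(mpg, "group_index") <- match(mtcars$cyl, unique(mtcars$cyl))

keep <- as.numeric(mpg) > group_aware_mean(mpg)
filtered <- mtcars[keep, ]
```

The point of the design is that the per-group work happens inside one vectorised call, so the cost no longer scales with the number of times the expression has to be evaluated.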
|
I see, maybe I confused "group-aware" with "group-agnostic"? We already have hybrid evaluation; this feels like a sufficient approximation to the "group-aware" functions you're suggesting. In particular, all hybrid handlers are group-aware (and implemented in C++). |
|
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/ |
EDIT: I've added a much simpler example at the top.
I've found that filtering grouped data is slow when there are many groups, even when the filter condition is orthogonal to the grouping.
It's possible I'm hoping for too much intelligence here -- that dplyr can detect whether the grouping is relevant to the filtering condition(s).
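For readers who want to try this themselves, here is a minimal sketch of the kind of comparison being described. It is not the original example from this report: the data, the column names `g` and `x`, and the sizes are made up for illustration, and the timings will depend heavily on the dplyr version installed.

```r
library(dplyr)

# Made-up data: many groups (one row per group), and a filter condition
# that does not depend on the grouping at all.
set.seed(1)
n <- 20000
df <- tibble(g = seq_len(n), x = runif(n)) %>% group_by(g)

# Grouped filter: the condition may be evaluated group by group.
t_grouped <- system.time(
  res_grouped <- df %>% filter(x > 0.5)
)

# Workaround: drop the grouping, filter, then restore the grouping.
t_ungrouped <- system.time(
  res_ungrouped <- df %>% ungroup() %>% filter(x > 0.5) %>% group_by(g)
)

cat("grouped:  ", t_grouped[["elapsed"]], "seconds\n")
cat("ungrouped:", t_ungrouped[["elapsed"]], "seconds\n")
```

Both pipelines keep the same rows; only the amount of per-group bookkeeping differs.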
Example:
EDIT: Original example below:
I have a gist here with data:
https://gist.github.com/wch/15bce85635d7e035126681f81900fa47
To reproduce, clone the gist, enter the directory, and run this code: