Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filtering grouped data is slow #3294

Closed
wch opened this issue Jan 8, 2018 · 6 comments
Closed

Filtering grouped data is slow #3294

wch opened this issue Jan 8, 2018 · 6 comments

Comments

@wch
Copy link
Member

@wch wch commented Jan 8, 2018

EDIT: I've added a much simpler example at the top.

I've found that filtering grouped data is slow when there are many groups, even when the filter condition is orthogonal to the grouping.

It's possible I'm hoping for too much intelligence here -- that dplyr can detect when the grouping is relevant for the filtering condition(s).

Example:

dat <- tibble(
  g = rep(1:1e5, length.out = 1e6),
  x = rnorm(1e6)
)

system.time({
  dat %>% filter(x > 0)
})
#  user  system elapsed 
# 0.012   0.000   0.012 

system.time({
  dat %>% 
    group_by(g) %>%
    filter(x > 0)
})
#  user  system elapsed 
# 2.325   0.016   2.344 

EDIT: Original example below:

I have a gist here with data:
https://gist.github.com/wch/15bce85635d7e035126681f81900fa47

To reproduce, clone the gist, enter the directory, and run this code:

x <- readRDS("x.rds")
# # A tibble: 102,524 x 4
# # Groups: Package, Version [60,215]
#    Package Version type         n
#    <chr>   <chr>   <chr>    <int>
#  1 A3      0.9.1   Depends      2
#  2 A3      0.9.1   Suggests     2
#  3 A3      0.9.2   Depends      2
#  4 A3      0.9.2   Suggests     2
#  5 A3      1.0.0   Depends      2
#  6 A3      1.0.0   Suggests     2
#  7 abbyyR  0.1     Imports      2
#  8 abbyyR  0.2     Imports      5
#  9 abbyyR  0.2.1   Imports      5
# 10 abbyyR  0.2.1   Suggests     2
# # ... with 102,514 more rows

system.time({
  x %>% filter(Package == "shiny", Version == "1.0.5")
})
#  user  system elapsed 
# 7.385   0.107   7.588 

system.time({
  x %>% ungroup() %>% filter(Package == "shiny", Version == "1.0.5")
})
#  user  system elapsed 
# 0.003   0.001   0.004 
@krlmlr
Copy link
Member

@krlmlr krlmlr commented Jan 17, 2018

Indeed, we'd need to scan the expression for filter functions (or maybe any non-operator functions) to avoid this problem. This will be better with #2295, but still not as fast as the ungrouped version.

@wch
Copy link
Member Author

@wch wch commented Jan 26, 2018

In a discussion I had with @alandipert, this idea came up: if dplyr had implementations of common functions, that were group-aware, then if filter() (or mutate() or summarise()) were called with an expression that contained only these group-aware functions, it could execute the expression once for the whole data set, instead of once for each group.

For example, imagine this:

mtcars %>%
  group_by(cyl) %>%
  filter(mpg > mean(mpg))

If dplyr had a version of mean that understood groups, it could calculate the mean for each group with only one call (instead of one call for each group); and if there were a version of > that understood groups, then it could calculate the T/F values with just one call. Since the expression uses only group-aware functions, then dplyr could execute it with these group-aware functions; but if the expression contained any functions that weren't group-aware, then it could fall back on the regular way of doing it. This glosses over details like how dplyr would pass grouping data to the functions, but hopefully it makes sense.

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Jan 26, 2018

This sounds very similar to hybrid evaluation to me: https://github.com/tidyverse/dplyr/blob/master/vignettes/internals/hybrid-evaluation.Rmd. Here, we know what mean() does and have our own implementation to quickly compute groupwise means.

For the use case of > and other arithmetic operators, we might just add hybrid handlers for each of them. Currently, hybrid evaluation is rather limited, e.g. mean(a) works but mean(a + b) will fall back to the slow implementation.

Currently, we always process in groups, we don't scan the expression to see if it only contains group-aware (or perhaps "mutating" or "creating", http://r4ds.had.co.nz/transform.html#mutate-funs?) functions. This would be a useful shortcut that will speed up many scenarios. I like the idea, opened #3326 and closing this issue. I hope I got the idea right.
Currently, we always process in groups, we don't scan the expression to see if it only contains group-aware functions. This would be a useful shortcut that will speed up many scenarios. I like the idea, opened #3326 and closing this issue (still happy to take further input).

@krlmlr krlmlr closed this as completed Jan 26, 2018
@wch
Copy link
Member Author

@wch wch commented Jan 26, 2018

My understanding of #3226 is that it's somewhat different from what you (and I) have described here, with group-aware functions. But it's possible I'm just not understanding "mutating" or "creating" functions correctly.

I could imagine it working something like this: if it finds that all the functions are group-aware, then it calls a special version of them, and passes them modified vectors, which have a group index attached an attribute.

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Jan 26, 2018

I see, maybe I confused "group-aware" with "group-agnostic"?

We alredy have hybrid evaluation, this feels like a sufficient approximation to the "group-aware" functions you're suggesting. In particular, all hybrid handlers are group-aware (and implemented in C++).

@lock
Copy link

@lock lock bot commented Jul 25, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Jul 25, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants