top_n performance problems in v0.6 #2822

Closed
mmuurr opened this issue May 27, 2017 · 7 comments

@mmuurr mmuurr commented May 27, 2017

top_n (and possibly other ranking functions) has become very slow within group_by chains in the latest master branch version (c7ca374).
Performance is fine when the number of groups is small, but in my light testing the slowdown grows with the number of groups.
Observe:

library(tidyverse); library(magrittr)

n_groups <- 10e3
n_obs <- 100e3

x <- sample(n_groups, n_obs, TRUE)
y <- runif(n_obs)
df <- tibble(x, y)

system.time(foo1 <- df %>% group_by(x) %>% top_n(1, y))
#    user  system elapsed
#  4.868   8.328  13.195

system.time(foo2 <- df %>% group_by(x) %>% arrange(desc(y)) %>% mutate(ix = row_number()) %>% filter(ix == 1) %>% select(-ix))
#    user  system elapsed
#   0.440   0.008   0.448

system.time(foo3 <- df %>% group_by(x) %>% arrange(desc(y)) %>% slice(1))
#    user  system elapsed
#  0.112   0.000   0.113

identical(sort(foo1$y), sort(foo2$y)) ## TRUE
identical(sort(foo2$y), sort(foo3$y)) ## TRUE

top_n performance previously appeared to be similar to the 'naive' foo2 and foo3 variants above.


Here's a performance analysis for n_groups varying between 1,000 and 10,000:

n_groups <- seq(1e3, 10e3, by = 1e3)
n_obs <- 100e3
results <- lapply(n_groups, function(n) {
    print(n)
    df <- tibble(x = sample(n, n_obs, TRUE), y = runif(n_obs))
    t1 <- system.time(df %>% group_by(x) %>% top_n(1, y))
    t2 <- system.time(df %>% group_by(x) %>% arrange(desc(y)) %>% mutate(ix = row_number()) %>% filter(ix == 1) %>% select(-ix))
    t3 <- system.time(df %>% group_by(x) %>% arrange(desc(y)) %>% slice(1))
    lst(t1, t2, t3)
}) %>% setNames(n_groups)

map_df(results, function(x) map_df(x, "elapsed"), .id = "n_groups")
 # A tibble: 10 x 4
    n_groups     t1    t2    t3
       <chr>  <dbl> <dbl> <dbl>
  1     1000  1.663 0.201 0.122
  2     2000  3.192 0.268 0.129
  3     3000  4.038 0.226 0.099
  4     4000  5.128 0.269 0.100
  5     5000  6.328 0.297 0.104
  6     6000  7.413 0.323 0.103
  7     7000  8.737 0.358 0.107
  8     8000 10.009 0.399 0.107
  9     9000 11.118 0.418 0.111
 10    10000 12.487 0.453 0.112

@lionel- lionel- (Member) commented May 27, 2017

Thanks, we'll investigate. That's weird, because top_n() calls filter() with min_rank(), which has a hybrid handler, so it shouldn't become slow as the number of groups grows.
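
For reference, the relationship described above can be sketched as follows. This is only a minimal illustration, assuming that `top_n(1, y)` on a grouped data frame behaves like a `filter()` on `min_rank(desc(y))`; the small data frame and the names `by_top_n` and `by_filter` are made up for the example and are not part of the original report.

```r
library(dplyr)

# Small made-up grouped data set (not the benchmark data from the report).
set.seed(1)
df <- tibble(x = sample(5, 50, TRUE), y = runif(50))

# top_n(1, y) within each group ...
by_top_n <- df %>% group_by(x) %>% top_n(1, y)

# ... versus the explicit form it is assumed to correspond to:
# keep rows whose descending rank of y within the group is <= 1.
by_filter <- df %>% group_by(x) %>% filter(min_rank(desc(y)) <= 1)

# Compare the selected values, as done for foo1/foo2/foo3 in the report.
identical(sort(by_top_n$y), sort(by_filter$y))
```

If the two forms agree, timing them separately with `system.time()` may help narrow down whether the slowdown sits in `top_n()` itself or in the underlying `filter()` call.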

@krlmlr krlmlr (Member) commented May 30, 2017

Confirmed that it was faster with dplyr 0.5.0:

Reprex with session info
suppressPackageStartupMessages(library(tidyverse))

n_groups <- 10e3
n_obs <- 100e3

x <- sample(n_groups, n_obs, TRUE)
y <- runif(n_obs)
df <- tibble(x, y)

system.time(foo1 <- df %>% group_by(x) %>% top_n(1, y))
#>    user  system elapsed 
#>   0.292   0.000   0.292
system.time(foo2 <- df %>% group_by(x) %>% arrange(desc(y)) %>% mutate(ix = row_number()) %>% filter(ix == 1) %>% select(-ix))
#>    user  system elapsed 
#>   0.280   0.008   0.286
system.time(foo3 <- df %>% group_by(x) %>% arrange(desc(y)) %>% slice(1))
#>    user  system elapsed 
#>   0.088   0.000   0.089

devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.4.0 (2017-04-21)
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language en_US:en                    
#>  collate  en_US.UTF-8                 
#>  tz       Europe/Busingen             
#>  date     2017-05-30
#> Packages -----------------------------------------------------------------
#>  package    * version     date       source                          
#>  assertthat   0.2.0       2017-04-11 CRAN (R 3.4.0)                  
#>  backports    1.1.0       2017-05-22 cran (@1.1.0)                   
#>  base       * 3.4.0       2017-04-21 local                           
#>  broom        0.4.2       2017-02-13 CRAN (R 3.4.0)                  
#>  cellranger   1.1.0       2016-07-27 CRAN (R 3.4.0)                  
#>  colorspace   1.3-2       2016-12-14 CRAN (R 3.4.0)                  
#>  compiler     3.4.0       2017-04-21 local                           
#>  datasets   * 3.4.0       2017-04-21 local                           
#>  DBI          0.6-1       2017-04-01 CRAN (R 3.4.0)                  
#>  devtools     1.13.1.9000 2017-05-23 local (hadley/devtools@c4099b3) 
#>  digest       0.6.12      2017-01-27 CRAN (R 3.4.0)                  
#>  dplyr      * 0.5.0       2016-06-24 CRAN (R 3.4.0)                  
#>  evaluate     0.10        2016-10-11 CRAN (R 3.4.0)                  
#>  forcats      0.2.0       2017-01-23 CRAN (R 3.4.0)                  
#>  foreign      0.8-68      2017-04-24 CRAN (R 3.4.0)                  
#>  ggplot2    * 2.2.1       2016-12-30 CRAN (R 3.4.0)                  
#>  graphics   * 3.4.0       2017-04-21 local                           
#>  grDevices  * 3.4.0       2017-04-21 local                           
#>  grid         3.4.0       2017-04-21 local                           
#>  gtable       0.2.0       2016-02-26 CRAN (R 3.4.0)                  
#>  haven        1.0.0       2016-09-23 CRAN (R 3.4.0)                  
#>  hms          0.3         2016-11-22 CRAN (R 3.4.0)                  
#>  htmltools    0.3.6       2017-04-28 CRAN (R 3.4.0)                  
#>  httr         1.2.1       2016-07-03 CRAN (R 3.4.0)                  
#>  jsonlite     1.4         2017-04-08 CRAN (R 3.4.0)                  
#>  knitr        1.16        2017-05-18 CRAN (R 3.4.0)                  
#>  lattice      0.20-35     2017-03-25 CRAN (R 3.4.0)                  
#>  lazyeval     0.2.0       2016-06-12 CRAN (R 3.4.0)                  
#>  lubridate    1.6.0       2016-09-13 CRAN (R 3.4.0)                  
#>  magrittr     1.5         2014-11-22 CRAN (R 3.4.0)                  
#>  memoise      1.1.0       2017-04-21 CRAN (R 3.4.0)                  
#>  methods    * 3.4.0       2017-04-21 local                           
#>  mnormt       1.5-5       2016-10-15 CRAN (R 3.4.0)                  
#>  modelr       0.1.0       2016-08-31 CRAN (R 3.4.0)                  
#>  munsell      0.4.3       2016-02-13 CRAN (R 3.4.0)                  
#>  nlme         3.1-131     2017-02-06 CRAN (R 3.4.0)                  
#>  parallel     3.4.0       2017-04-21 local                           
#>  pkgbuild     0.0.0.9000  2017-05-23 Github (r-pkgs/pkgbuild@8aab60b)
#>  pkgload      0.0.0.9000  2017-05-19 Github (r-pkgs/pkgload@119cf9a) 
#>  plyr         1.8.4       2016-06-08 CRAN (R 3.4.0)                  
#>  psych        1.7.5       2017-05-03 CRAN (R 3.4.0)                  
#>  purrr      * 0.2.2.2     2017-05-11 CRAN (R 3.4.0)                  
#>  R6           2.2.1       2017-05-10 CRAN (R 3.4.0)                  
#>  Rcpp         0.12.11     2017-05-22 CRAN (R 3.4.0)                  
#>  readr      * 1.1.1       2017-05-16 CRAN (R 3.4.0)                  
#>  readxl       1.0.0       2017-04-18 CRAN (R 3.4.0)                  
#>  reshape2     1.4.2       2016-10-22 CRAN (R 3.4.0)                  
#>  rlang        0.1.1.9000  2017-05-23 local (hadley/rlang@NA)         
#>  rmarkdown    1.5         2017-04-26 CRAN (R 3.4.0)                  
#>  rprojroot    1.2         2017-01-16 CRAN (R 3.4.0)                  
#>  rvest        0.3.2       2016-06-17 CRAN (R 3.4.0)                  
#>  scales       0.4.1       2016-11-09 CRAN (R 3.4.0)                  
#>  stats      * 3.4.0       2017-04-21 local                           
#>  stringi      1.1.5       2017-04-07 CRAN (R 3.4.0)                  
#>  stringr      1.2.0       2017-02-18 CRAN (R 3.4.0)                  
#>  tibble     * 1.3.3       2017-05-29 local (tidyverse/tibble@b2275d5)
#>  tidyr      * 0.6.3       2017-05-15 CRAN (R 3.4.0)                  
#>  tidyverse  * 1.1.1       2017-01-27 CRAN (R 3.4.0)                  
#>  tools        3.4.0       2017-04-21 local                           
#>  utils      * 3.4.0       2017-04-21 local                           
#>  withr        1.0.2       2016-06-20 CRAN (R 3.4.0)                  
#>  xml2         1.1.1       2017-01-24 CRAN (R 3.4.0)                  
#>  yaml         2.1.14      2016-11-12 CRAN (R 3.4.0)

@krlmlr krlmlr (Member) commented May 30, 2017

@lionel-: This looks like the performance penalty incurred by rlang, see r-lib/rlang#151.

@krlmlr krlmlr removed the bug label May 30, 2017

@mmuurr mmuurr (Author) commented May 30, 2017

@krlmlr @lionel- Any idea what versions/commits of tibble and dplyr we should downgrade to in the meantime to avoid these speed hits?

@krlmlr krlmlr (Member) commented May 30, 2017

Does dplyr v0.5.0 work for you? tibble shouldn't matter that much.

@mmuurr mmuurr (Author) commented May 30, 2017

@krlmlr Ah, yeah 0.5 works just fine for me... just didn't know if I should also revert tibble to avoid tidyverse/tibble#262.

@krlmlr krlmlr (Member) commented May 30, 2017

tibble 1.3.0 doesn't use rlang yet.

lionel- added a commit to lionel-/dplyr that referenced this issue Jun 16, 2017
lionel- added a commit to lionel-/dplyr that referenced this issue Jun 16, 2017
@lock lock bot locked as resolved and limited conversation to collaborators Jun 7, 2018