Performance issue with group_by, factor and first under 0.8.0.1 #4295

Closed
PhilippRuchser opened this issue Mar 19, 2019 · 5 comments

@PhilippRuchser PhilippRuchser commented Mar 19, 2019

I have encountered a performance issue with dplyr 0.8.0.1 when applying first() to a factor variable after group_by() on an identifier with many distinct values. The operation is significantly faster when the factor is first converted to character (see reprex output). In contrast, the execution times of both approaches were quite similar under dplyr 0.7.x.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tictoc)

n = 1e5
dat = tibble(
  id  = sample(x = seq(1, n),    size = n, replace = TRUE),
  val = sample(x = letters[1:4], size = n, replace = TRUE) %>%
    factor(., levels = letters[1:4]))

tic()
result_1 = dat %>%
  group_by(id) %>%
  summarise(first_val = dplyr::first(val))
toc()
#> 28.974 sec elapsed

tic()
result_2 = dat %>%
  mutate(val = as.character(val)) %>%
  group_by(id) %>%
  summarise(first_val = dplyr::first(val)) %>%
  mutate(first_val = factor(first_val, levels = letters[1:4]))
toc()
#> 0.098 sec elapsed

all_equal(result_1, result_2)
#> [1] TRUE

Created on 2019-03-19 by the reprex package (v0.2.1)

Session info
devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.5.3 (2019-03-11)
#>  os       macOS Mojave 10.14.1        
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       Europe/Zurich               
#>  date     2019-03-19                  
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package     * version date       lib source        
#>  assertthat    0.2.0   2017-04-11 [1] CRAN (R 3.5.0)
#>  backports     1.1.3   2018-12-14 [1] CRAN (R 3.5.0)
#>  callr         3.2.0   2019-03-15 [1] CRAN (R 3.5.3)
#>  cli           1.1.0   2019-03-19 [1] CRAN (R 3.5.3)
#>  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.0)
#>  desc          1.2.0   2018-05-01 [1] CRAN (R 3.5.0)
#>  devtools      2.0.1   2018-10-26 [1] CRAN (R 3.5.3)
#>  digest        0.6.18  2018-10-10 [1] CRAN (R 3.5.0)
#>  dplyr       * 0.8.0.1 2019-02-15 [1] CRAN (R 3.5.2)
#>  evaluate      0.13    2019-02-12 [1] CRAN (R 3.5.2)
#>  fs            1.2.6   2018-08-23 [1] CRAN (R 3.5.0)
#>  glue          1.3.0   2018-07-17 [1] CRAN (R 3.5.0)
#>  highr         0.7     2018-06-09 [1] CRAN (R 3.5.0)
#>  htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.5.0)
#>  knitr         1.22    2019-03-08 [1] CRAN (R 3.5.2)
#>  magrittr      1.5     2014-11-22 [1] CRAN (R 3.5.0)
#>  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.5.0)
#>  pillar        1.3.1   2018-12-15 [1] CRAN (R 3.5.0)
#>  pkgbuild      1.0.2   2018-10-16 [1] CRAN (R 3.5.0)
#>  pkgconfig     2.0.2   2018-08-16 [1] CRAN (R 3.5.0)
#>  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.5.0)
#>  prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.5.0)
#>  processx      3.3.0   2019-03-10 [1] CRAN (R 3.5.2)
#>  ps            1.3.0   2018-12-21 [1] CRAN (R 3.5.0)
#>  purrr         0.3.1   2019-03-03 [1] CRAN (R 3.5.2)
#>  R6            2.4.0   2019-02-14 [1] CRAN (R 3.5.2)
#>  Rcpp          1.0.0   2018-11-07 [1] CRAN (R 3.5.0)
#>  remotes       2.0.2   2018-10-30 [1] CRAN (R 3.5.0)
#>  rlang         0.3.1   2019-01-08 [1] CRAN (R 3.5.2)
#>  rmarkdown     1.12    2019-03-14 [1] CRAN (R 3.5.3)
#>  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.5.0)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.0)
#>  stringi       1.3.1   2019-02-13 [1] CRAN (R 3.5.2)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 3.5.2)
#>  tibble        2.0.1   2019-01-12 [1] CRAN (R 3.5.2)
#>  tictoc      * 1.0     2014-06-17 [1] CRAN (R 3.5.0)
#>  tidyselect    0.2.5   2018-10-11 [1] CRAN (R 3.5.0)
#>  usethis       1.4.0   2018-08-14 [1] CRAN (R 3.5.0)
#>  withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.0)
#>  xfun          0.5     2019-02-20 [1] CRAN (R 3.5.2)
#>  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.5.0)
#> 
#> [1] /Library/Frameworks/R.framework/Versions/3.5/Resources/library

@pwilczewski pwilczewski commented Mar 26, 2019

I am also seeing degraded performance on grouped data when using summarise() on factor variables with n_distinct(), e.g. group_by(id) %>% summarise(unique_vals = dplyr::n_distinct(val)). Similarly, I've noticed that top_n() also runs much more slowly on grouped data in dplyr 0.8.0.1.
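
For anyone hitting the n_distinct() case, here is a minimal sketch of one possible way to sidestep per-group evaluation. It was not proposed in this thread: it deduplicates the (id, val) pairs once and then counts rows per id. The data mirrors the reprex above at a smaller n, and set.seed() is added only for reproducibility. Whether this is actually faster on a given dplyr version is an assumption and is not benchmarked here.

```r
library(dplyr)

set.seed(1)
n <- 1e4
dat <- tibble(
  id  = sample(seq_len(n), size = n, replace = TRUE),
  val = factor(sample(letters[1:4], size = n, replace = TRUE),
               levels = letters[1:4])
)

# Grouped version from the comment above: n_distinct() is evaluated per group.
per_group <- dat %>%
  group_by(id) %>%
  summarise(unique_vals = n_distinct(val))

# Alternative sketch: drop duplicate (id, val) rows, then count rows per id.
dedup_count <- dat %>%
  distinct(id, val) %>%
  count(id) %>%
  rename(unique_vals = n)
```

Both tibbles hold one row per id with the same counts, so the second form can stand in for the first when the data has this shape.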


@nate-bf nate-bf commented Apr 8, 2019

In addition to the change in speed there is a change in use of RAM as well. For large datasets a group_by and summarize call that used to run smoothly now maxes out RAM and refuses to run. Rolling back to 0.7.8 makes things work smoothly again. (running an 8 GB dataset on a server with 128 GB of RAM). Not sure if knowing there's a change in RAM usage will help in figuring out what's happening, but thought I should at least note it here in case it does. Thanks!


@jangorecki jangorecki commented Apr 21, 2019

@nate-bf the memory regression is a different issue, unrelated to the factor problem reported here. I filed #4334

Member

@romainfrancois romainfrancois commented Apr 30, 2019

What happens here is that the hybrid version of first() does not handle factors, so we end up paying a high price to call R's version of dplyr::first() in a loop with many groups.
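
Given that explanation, one way to avoid the per-group R call altogether is to not summarise by group at all. The sketch below is an assumption on my part, not something suggested by the maintainers: it relies on distinct(id, .keep_all = TRUE) keeping the first row for each id, then sorts by id to match the ordering of the grouped summary. The factor column is never converted, so its levels are preserved. The data mirrors the original reprex at a smaller n, with set.seed() added for reproducibility.

```r
library(dplyr)

set.seed(1)
n <- 1e4
dat <- tibble(
  id  = sample(seq_len(n), size = n, replace = TRUE),
  val = factor(sample(letters[1:4], size = n, replace = TRUE),
               levels = letters[1:4])
)

# Grouped version from the original report: first() is evaluated once per group.
by_group <- dat %>%
  group_by(id) %>%
  summarise(first_val = dplyr::first(val))

# Alternative sketch: keep the first row seen for each id, no grouped evaluation.
by_rows <- dat %>%
  distinct(id, .keep_all = TRUE) %>%
  arrange(id) %>%
  rename(first_val = val)
```

Compared with the as.character() round trip in the original report, this keeps val as a factor throughout, so the levels do not have to be restored afterwards.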


@lock lock bot commented Nov 27, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Nov 27, 2019