Skip to content

Performance issue with group_by, factor and first under 0.8.0.1 #4295

@PhilippRuchser

Description

@PhilippRuchser

I have encountered a performance issue with dplyr 0.8.0.1 when applying first() to a factor variable in a group_by() setting with high-dimensional identifier. The operation is significantly faster when converting the factor variable to a character (see reprex output). In contrast, the execution times of both approaches used to be quite similar under dplyr 0.7.x.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tictoc)

n = 1e5
dat = tibble(
  id  = sample(x = seq(1, n),    size = n, replace = TRUE),
  val = sample(x = letters[1:4], size = n, replace = TRUE) %>%
    factor(., levels = letters[1:4]))

tic()
result_1 = dat %>%
  group_by(id) %>%
  summarise(first_val = dplyr::first(val))
toc()
#> 28.974 sec elapsed

tic()
result_2 = dat %>%
  mutate(val = as.character(val)) %>%
  group_by(id) %>%
  summarise(first_val = dplyr::first(val)) %>%
  mutate(first_val = factor(first_val, levels = letters[1:4]))
toc()
#> 0.098 sec elapsed

all_equal(result_1, result_2)
#> [1] TRUE

Created on 2019-03-19 by the reprex package (v0.2.1)

Session info
devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.5.3 (2019-03-11)
#>  os       macOS Mojave 10.14.1        
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       Europe/Zurich               
#>  date     2019-03-19                  
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package     * version date       lib source        
#>  assertthat    0.2.0   2017-04-11 [1] CRAN (R 3.5.0)
#>  backports     1.1.3   2018-12-14 [1] CRAN (R 3.5.0)
#>  callr         3.2.0   2019-03-15 [1] CRAN (R 3.5.3)
#>  cli           1.1.0   2019-03-19 [1] CRAN (R 3.5.3)
#>  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.0)
#>  desc          1.2.0   2018-05-01 [1] CRAN (R 3.5.0)
#>  devtools      2.0.1   2018-10-26 [1] CRAN (R 3.5.3)
#>  digest        0.6.18  2018-10-10 [1] CRAN (R 3.5.0)
#>  dplyr       * 0.8.0.1 2019-02-15 [1] CRAN (R 3.5.2)
#>  evaluate      0.13    2019-02-12 [1] CRAN (R 3.5.2)
#>  fs            1.2.6   2018-08-23 [1] CRAN (R 3.5.0)
#>  glue          1.3.0   2018-07-17 [1] CRAN (R 3.5.0)
#>  highr         0.7     2018-06-09 [1] CRAN (R 3.5.0)
#>  htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.5.0)
#>  knitr         1.22    2019-03-08 [1] CRAN (R 3.5.2)
#>  magrittr      1.5     2014-11-22 [1] CRAN (R 3.5.0)
#>  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.5.0)
#>  pillar        1.3.1   2018-12-15 [1] CRAN (R 3.5.0)
#>  pkgbuild      1.0.2   2018-10-16 [1] CRAN (R 3.5.0)
#>  pkgconfig     2.0.2   2018-08-16 [1] CRAN (R 3.5.0)
#>  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.5.0)
#>  prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.5.0)
#>  processx      3.3.0   2019-03-10 [1] CRAN (R 3.5.2)
#>  ps            1.3.0   2018-12-21 [1] CRAN (R 3.5.0)
#>  purrr         0.3.1   2019-03-03 [1] CRAN (R 3.5.2)
#>  R6            2.4.0   2019-02-14 [1] CRAN (R 3.5.2)
#>  Rcpp          1.0.0   2018-11-07 [1] CRAN (R 3.5.0)
#>  remotes       2.0.2   2018-10-30 [1] CRAN (R 3.5.0)
#>  rlang         0.3.1   2019-01-08 [1] CRAN (R 3.5.2)
#>  rmarkdown     1.12    2019-03-14 [1] CRAN (R 3.5.3)
#>  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.5.0)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.0)
#>  stringi       1.3.1   2019-02-13 [1] CRAN (R 3.5.2)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 3.5.2)
#>  tibble        2.0.1   2019-01-12 [1] CRAN (R 3.5.2)
#>  tictoc      * 1.0     2014-06-17 [1] CRAN (R 3.5.0)
#>  tidyselect    0.2.5   2018-10-11 [1] CRAN (R 3.5.0)
#>  usethis       1.4.0   2018-08-14 [1] CRAN (R 3.5.0)
#>  withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.0)
#>  xfun          0.5     2019-02-20 [1] CRAN (R 3.5.2)
#>  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.5.0)
#> 
#> [1] /Library/Frameworks/R.framework/Versions/3.5/Resources/library

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions