-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Closed
Labels
Milestone
Description
I have encountered a performance issue with dplyr 0.8.0.1 when applying first() to a factor variable in a group_by() setting with high-dimensional identifier. The operation is significantly faster when converting the factor variable to a character (see reprex output). In contrast, the execution times of both approaches used to be quite similar under dplyr 0.7.x.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tictoc)
n = 1e5
dat = tibble(
id = sample(x = seq(1, n), size = n, replace = TRUE),
val = sample(x = letters[1:4], size = n, replace = TRUE) %>%
factor(., levels = letters[1:4]))
tic()
result_1 = dat %>%
group_by(id) %>%
summarise(first_val = dplyr::first(val))
toc()
#> 28.974 sec elapsed
tic()
result_2 = dat %>%
mutate(val = as.character(val)) %>%
group_by(id) %>%
summarise(first_val = dplyr::first(val)) %>%
mutate(first_val = factor(first_val, levels = letters[1:4]))
toc()
#> 0.098 sec elapsed
all_equal(result_1, result_2)
#> [1] TRUECreated on 2019-03-19 by the reprex package (v0.2.1)
Session info
devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#> setting value
#> version R version 3.5.3 (2019-03-11)
#> os macOS Mojave 10.14.1
#> system x86_64, darwin15.6.0
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz Europe/Zurich
#> date 2019-03-19
#>
#> ─ Packages ──────────────────────────────────────────────────────────────
#> package * version date lib source
#> assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.5.0)
#> backports 1.1.3 2018-12-14 [1] CRAN (R 3.5.0)
#> callr 3.2.0 2019-03-15 [1] CRAN (R 3.5.3)
#> cli 1.1.0 2019-03-19 [1] CRAN (R 3.5.3)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.0)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 3.5.0)
#> devtools 2.0.1 2018-10-26 [1] CRAN (R 3.5.3)
#> digest 0.6.18 2018-10-10 [1] CRAN (R 3.5.0)
#> dplyr * 0.8.0.1 2019-02-15 [1] CRAN (R 3.5.2)
#> evaluate 0.13 2019-02-12 [1] CRAN (R 3.5.2)
#> fs 1.2.6 2018-08-23 [1] CRAN (R 3.5.0)
#> glue 1.3.0 2018-07-17 [1] CRAN (R 3.5.0)
#> highr 0.7 2018-06-09 [1] CRAN (R 3.5.0)
#> htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.5.0)
#> knitr 1.22 2019-03-08 [1] CRAN (R 3.5.2)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.0)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 3.5.0)
#> pillar 1.3.1 2018-12-15 [1] CRAN (R 3.5.0)
#> pkgbuild 1.0.2 2018-10-16 [1] CRAN (R 3.5.0)
#> pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.5.0)
#> pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.5.0)
#> prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.5.0)
#> processx 3.3.0 2019-03-10 [1] CRAN (R 3.5.2)
#> ps 1.3.0 2018-12-21 [1] CRAN (R 3.5.0)
#> purrr 0.3.1 2019-03-03 [1] CRAN (R 3.5.2)
#> R6 2.4.0 2019-02-14 [1] CRAN (R 3.5.2)
#> Rcpp 1.0.0 2018-11-07 [1] CRAN (R 3.5.0)
#> remotes 2.0.2 2018-10-30 [1] CRAN (R 3.5.0)
#> rlang 0.3.1 2019-01-08 [1] CRAN (R 3.5.2)
#> rmarkdown 1.12 2019-03-14 [1] CRAN (R 3.5.3)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.5.0)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.0)
#> stringi 1.3.1 2019-02-13 [1] CRAN (R 3.5.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 3.5.2)
#> tibble 2.0.1 2019-01-12 [1] CRAN (R 3.5.2)
#> tictoc * 1.0 2014-06-17 [1] CRAN (R 3.5.0)
#> tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.5.0)
#> usethis 1.4.0 2018-08-14 [1] CRAN (R 3.5.0)
#> withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.0)
#> xfun 0.5 2019-02-20 [1] CRAN (R 3.5.2)
#> yaml 2.2.0 2018-07-25 [1] CRAN (R 3.5.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/3.5/Resources/libraryReactions are currently unavailable