Problem Description
There is a huge performance regression between dplyr 0.8.1 and 0.8.2 when calling R functions on many groups. Since upgrading packages, a lot of my code no longer runs in a reasonable time.
The choice of grepl() below is not critical: with the new dplyr/rlang, my code gets stuck in many different places that share this code pattern. The example is much smaller than my actual workloads, but it illustrates the nature of the problem (a roughly 500x slowdown).
Reproducible Code
library(dplyr)

# 100,000 random four-letter ids, so nearly every row falls in its own small group
tmpdata <- data.frame(id=replicate(100000,paste(sample(letters,4),collapse='')),
                      stringsAsFactors=FALSE)
system.time(
  outdata <- tmpdata %>%
    group_by(id) %>%
    mutate(inxx=grepl("x",id))
)
I have two otherwise identical computers (hardware and software) on which I can test this. The session information and timings for each are below.
Computer with dplyr 0.8.1
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 30 (Workstation Edition)
Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.8.1 colorout_1.2-1
loaded via a namespace (and not attached):
[1] tidyselect_0.2.5 compiler_3.6.0 magrittr_1.5 assertthat_0.2.1
[5] R6_2.4.0 parallel_3.6.0 tools_3.6.0 pillar_1.4.1
[9] glue_1.3.1 tibble_2.1.3 crayon_1.3.4 Rcpp_1.0.1
[13] pkgconfig_2.0.2 rlang_0.3.4 purrr_0.3.2
> packageVersion("base")
[1] ‘3.6.0’
> system.time(
+ outdata <- tmpdata %>%
+ group_by(id) %>%
+ mutate(inxx=grepl("x",id))
+ )
   user  system elapsed
  1.242   0.005   1.251
Identical computer with dplyr 0.8.2
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 30 (Workstation Edition)
Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.8.2 colorout_1.2-1
loaded via a namespace (and not attached):
[1] tidyselect_0.2.5 compiler_3.6.0 magrittr_1.5 assertthat_0.2.1
[5] R6_2.4.0 parallel_3.6.0 tools_3.6.0 pillar_1.4.2
[9] glue_1.3.1 tibble_2.1.3 crayon_1.3.4 Rcpp_1.0.1
[13] pkgconfig_2.0.2 rlang_0.4.0 purrr_0.3.2
> packageVersion("base")
[1] ‘3.6.0’
> system.time(
+ outdata <- tmpdata %>%
+ group_by(id) %>%
+ mutate(inxx=grepl("x",id))
+ )
   user  system elapsed
621.149   0.059 623.224
My real code normally takes about half an hour to run, so a slowdown of this size makes it completely infeasible under the updated packages. I'm currently rerunning the example with 200K rows instead of 100K to see whether the slowdown is linear, but it will be a while before it finishes. The same 200K-row run took 2 seconds on the non-upgraded machine.
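One way to check whether the slowdown is linear without waiting on a single long run is to time the same call at several smaller sizes and look at the slope on a log-log scale. The following is only a sketch under that assumption: `time_scaling`, `scaling_exponent`, `make_ids`, and `grouped_grepl` are hypothetical helper names (not part of dplyr), and the sizes are arbitrary.

```r
# Hypothetical helper: build data for each size, then time only `run(data)`.
time_scaling <- function(make_data, run, sizes) {
  elapsed <- vapply(sizes, function(n) {
    data <- make_data(n)  # data is built outside the timed region
    system.time(run(data))[["elapsed"]]
  }, numeric(1))
  data.frame(n = sizes, elapsed = elapsed)
}

# Slope of log(elapsed) against log(n): about 1 suggests linear scaling,
# about 2 suggests quadratic. Zero timings are dropped because log(0) is -Inf.
scaling_exponent <- function(timings) {
  ok <- timings$elapsed > 0
  if (sum(ok) < 2) return(NA_real_)
  fit <- lm(log(elapsed) ~ log(n), data = timings[ok, ])
  unname(coef(fit)[2])
}

# Same data shape as the report: n random four-letter ids.
make_ids <- function(n) {
  data.frame(id = replicate(n, paste(sample(letters, 4), collapse = "")),
             stringsAsFactors = FALSE)
}

# The grouped mutate from the report, written without the pipe.
grouped_grepl <- function(data) {
  dplyr::mutate(dplyr::group_by(data, id), inxx = grepl("x", id))
}

# Sizes are kept small so the check stays quick even on an affected version.
if (requireNamespace("dplyr", quietly = TRUE)) {
  timings <- time_scaling(make_ids, grouped_grepl, c(2500, 5000, 10000, 20000))
  print(timings)
  print(scaling_exponent(timings))
}
```

Running this on both machines would show whether the exponent differs between versions or only the constant factor does. Very short timings are noisy, so the estimate is rough at best.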
Is rlang or Something Else the Culprit?
The two machines also have different rlang versions (0.3.4 and 0.4.0), so I downgraded only dplyr, left rlang at 0.4.0, and ran the example code again. It ran almost instantly, so the culprit is most definitely dplyr.