500X performance regression on mutate with many groups #4458

@gvfarns

Problem Description

There has been a huge performance regression between dplyr 0.8.1 and 0.8.2 when calling R functions on many groups. Much of my code no longer runs in a reasonable time after upgrading packages.

The choice of grepl() below is not critical: under the new dplyr/rlang, my code gets perpetually stuck in many different places that share this code pattern. The example is much smaller than my actual problems, but it illustrates the nature of the problem (a 500X slowdown). My actual code does not finish in a reasonable time after upgrading.

Reproducible Code

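library(dplyr)

# 100,000 random 4-letter ids, so almost every row ends up in its own small group
tmpdata <- data.frame(id=replicate(100000,paste(sample(letters,4),collapse='')),
                      stringsAsFactors=FALSE)
# Time a grouped mutate that calls a plain R function once per group
system.time(
  outdata <- tmpdata %>%
             group_by(id) %>%
             mutate(inxx=grepl("x",id))
)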
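
To separate the per-group overhead from the cost of grepl() itself, the same expression can be timed with and without grouping. This is only a sketch: the seed and the reduced row count are assumptions chosen to keep the run short, and they are not part of the original report.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed so the comparison is repeatable
n <- 5000    # assumption: far fewer rows than the report, to keep this quick
tmpdata <- data.frame(
  id = replicate(n, paste(sample(letters, 4), collapse = "")),
  stringsAsFactors = FALSE
)

# Grouped: the expression is evaluated once per group.
t_grouped <- system.time(
  grouped <- tmpdata %>% group_by(id) %>% mutate(inxx = grepl("x", id))
)

# Ungrouped: the expression is evaluated once over the whole column.
t_flat <- system.time(
  flat <- tmpdata %>% mutate(inxx = grepl("x", id))
)

# grepl() is elementwise, so both approaches should agree row for row.
stopifnot(identical(ungroup(grouped)$inxx, flat$inxx))

# Elapsed seconds for each variant.
c(grouped = t_grouped[["elapsed"]], flat = t_flat[["elapsed"]])
```

A large gap between the two timings would point at the per-group evaluation machinery rather than at grepl().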

I have two identical computers (hardware and software) on which I can test this. Below is the information and results.

Computer with dplyr 0.8.1

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 30 (Workstation Edition)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.8.1    colorout_1.2-1

loaded via a namespace (and not attached):
 [1] tidyselect_0.2.5 compiler_3.6.0   magrittr_1.5     assertthat_0.2.1
 [5] R6_2.4.0         parallel_3.6.0   tools_3.6.0      pillar_1.4.1    
 [9] glue_1.3.1       tibble_2.1.3     crayon_1.3.4     Rcpp_1.0.1      
[13] pkgconfig_2.0.2  rlang_0.3.4      purrr_0.3.2     
> packageVersion("base")
[1] ‘3.6.0’
> system.time(
+   outdata <- tmpdata %>%
+              group_by(id) %>%
+              mutate(inxx=grepl("x",id))
+ )
   user  system elapsed 
  1.242   0.005   1.251 

Identical computer with dplyr 0.8.2

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 30 (Workstation Edition)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.8.2    colorout_1.2-1

loaded via a namespace (and not attached):
 [1] tidyselect_0.2.5 compiler_3.6.0   magrittr_1.5     assertthat_0.2.1
 [5] R6_2.4.0         parallel_3.6.0   tools_3.6.0      pillar_1.4.2    
 [9] glue_1.3.1       tibble_2.1.3     crayon_1.3.4     Rcpp_1.0.1      
[13] pkgconfig_2.0.2  rlang_0.4.0      purrr_0.3.2     
> packageVersion("base")
[1] ‘3.6.0’
> system.time(
+   outdata <- tmpdata %>%
+              group_by(id) %>%
+              mutate(inxx=grepl("x",id))
+ )
   user  system elapsed 
621.149   0.059 623.224 

Since my real code takes about half an hour to run, this problem makes running it under the updated packages completely infeasible. I'm currently running the example with 200K rows instead of 100K to see whether the slowdown is linear, but it's going to be a while before it finishes. The 200K-row run took 2 seconds on the non-upgraded machine.
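
A quicker way to probe how the time scales is to run the same grouped mutate at several smaller sizes and compare the elapsed times. The following is a minimal sketch; the helper name time_grouped_mutate and the chosen sizes are illustrative assumptions, not something from the runs above.

```r
library(dplyr)

# Hypothetical helper: time the grouped mutate from the report on n random ids.
time_grouped_mutate <- function(n) {
  tmpdata <- data.frame(
    id = replicate(n, paste(sample(letters, 4), collapse = "")),
    stringsAsFactors = FALSE
  )
  elapsed <- system.time(
    tmpdata %>%
      group_by(id) %>%
      mutate(inxx = grepl("x", id))
  )[["elapsed"]]
  data.frame(rows = n, groups = length(unique(tmpdata$id)), elapsed = elapsed)
}

# Assumption: sizes are kept small so the check finishes even on a slow version.
sizes <- c(1000, 2000, 4000, 8000)
timings <- do.call(rbind, lapply(sizes, time_grouped_mutate))
print(timings)
```

If elapsed time roughly doubles each time the row count doubles, the cost is about linear in the number of groups; if it roughly quadruples, it is closer to quadratic.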

Is rlang or Something Else the Culprit?

These machines have different rlang versions as well, so I downgraded only dplyr, left rlang at 0.4.0, and ran the example code again. It ran almost instantly. The culprit is most definitely dplyr.
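
For anyone who wants to repeat this isolation step, one option is to pin dplyr to a specific release while leaving already-installed dependencies alone. This sketch assumes the remotes package is installed; the helper names pin_dplyr and report_versions are hypothetical, and the report does not say how the downgrade was actually performed.

```r
# Hypothetical helper: install one dplyr release without upgrading dependencies
# that are already installed (so rlang should stay at its current version).
pin_dplyr <- function(version) {
  remotes::install_version("dplyr", version = version, upgrade = "never")
}

# Hypothetical helper: report the installed versions relevant to this comparison.
report_versions <- function(pkgs = c("dplyr", "rlang")) {
  vapply(pkgs, function(p) as.character(utils::packageVersion(p)), character(1))
}

# Example usage (restart R after installing, before timing anything):
# pin_dplyr("0.8.1")
# report_versions()
```

Checking the reported versions before each timing run confirms that only dplyr changed between the two measurements.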
