Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

500X performance regression on mutate with many groups #4458

Closed
gvfarns opened this issue Jul 2, 2019 · 13 comments
Closed

500X performance regression on mutate with many groups #4458

gvfarns opened this issue Jul 2, 2019 · 13 comments
Milestone

Comments

@gvfarns
Copy link

@gvfarns gvfarns commented Jul 2, 2019

Problem Description

There has been a huge performance regression between dplyr 0.8.1 and 0.8.2 when calling r functions on many groups. This is causing a lot of my code to fail to run in a reasonable time after upgrading packages.

The choice of grepl() below is not critical as the new dplyr/rlang causes my code to get perpetually stuck in many different places with similar code patterns. This example code is much smaller than my actual problems, but illustrates the nature of the problem (500X slowdown). My actual code will not finish at all in a reasonable time after upgrading.

Reproducible Code

tmpdata <- data.frame(id=replicate(100000,paste(sample(letters,4),collapse='')),
                      stringsAsFactors=FALSE)
system.time(
  outdata <- tmpdata %>%                                                                                                                   
             group_by(id) %>%
             mutate(inxx=grepl("x",id))
)

I have two identical computers (hardware and software) on which I can test this. Below is the information and results.

Computer with dplyr 0.8.1

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 30 (Workstation Edition)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.8.1    colorout_1.2-1

loaded via a namespace (and not attached):
 [1] tidyselect_0.2.5 compiler_3.6.0   magrittr_1.5     assertthat_0.2.1
 [5] R6_2.4.0         parallel_3.6.0   tools_3.6.0      pillar_1.4.1    
 [9] glue_1.3.1       tibble_2.1.3     crayon_1.3.4     Rcpp_1.0.1      
[13] pkgconfig_2.0.2  rlang_0.3.4      purrr_0.3.2     
> packageVersion("base")
[1] ‘3.6.0> system.time(
+   outdata <- tmpdata %>%                                                                                                                   
+              group_by(id) %>%
+              mutate(inxx=grepl("x",id))
+ )
   user  system elapsed 
  1.242   0.005   1.251 

Identical computer with dplyr 0.8.2

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 30 (Workstation Edition)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.8.2    colorout_1.2-1

loaded via a namespace (and not attached):
 [1] tidyselect_0.2.5 compiler_3.6.0   magrittr_1.5     assertthat_0.2.1
 [5] R6_2.4.0         parallel_3.6.0   tools_3.6.0      pillar_1.4.2    
 [9] glue_1.3.1       tibble_2.1.3     crayon_1.3.4     Rcpp_1.0.1      
[13] pkgconfig_2.0.2  rlang_0.4.0      purrr_0.3.2     
> packageVersion("base")
[1] ‘3.6.0> system.time(
+   outdata <- tmpdata %>%                                                                                                                   
+              group_by(id) %>%
+              mutate(inxx=grepl("x",id))
+ )
   user  system elapsed 
621.149   0.059 623.224 

Since my real code takes about a half hour to run, this problem makes running my code under the updated packages completely infeasible. I'm currently running this with 200K rows instead of 100K to see if the slowdown is linear but it's going to be a while before it finishes. It took 2 seconds on the non-upgraded machine.

Is Rlang or something else the Culprit?

These machines have a different rlang version as well, so I downgraded only dplyr and left rlang at 0.4 and tried the example code again. It ran almost instantly. The culprit is most definitely dplyr.

@grantmcdermott
Copy link

@grantmcdermott grantmcdermott commented Jul 2, 2019

Was just about to file an issue myself. Hopefully this can be rectified quickly.

What's interesting is that, in my case (dataset = 2.5 million rows), performance when grouping by two variables and then summarizing does not appear to be that significantly affected. However, once I add a third grouping variable then the performance regression is many orders of magnitude.

@hadley
Copy link
Member

@hadley hadley commented Jul 2, 2019

Reprex that executes in a reasonable amount of time on dplyr 0.8.2:

library(dplyr, warn.conflicts = FALSE)
tmpdata <- data.frame(
  id = replicate(1000, paste(sample(letters, 4), collapse = "")),
  stringsAsFactors = FALSE
)

system.time(tmpdata %>%
  group_by(id) %>%
  mutate(inxx = grepl("x", id)))
#>    user  system elapsed 
#>   0.213   0.011   0.225

Created on 2019-07-02 by the reprex package (v0.3.0)

And on dplyr 0.8.1:

library(dplyr, warn.conflicts = FALSE)
tmpdata <- data.frame(
  id = replicate(1000, paste(sample(letters, 4), collapse = "")),
  stringsAsFactors = FALSE
)

system.time(tmpdata %>%
  group_by(id) %>%
  mutate(inxx = grepl("x", id)))
#>    user  system elapsed 
#>   0.020   0.001   0.021

Created on 2019-07-02 by the reprex package (v0.3.0)

@gvfarns
Copy link
Author

@gvfarns gvfarns commented Jul 2, 2019

Using 200K rows, my example executes in 1998.384 seconds on 0.8.2 and 2.406 on 0.8.1 (830X slowdown). Looks like the problem might get worse in a nonlinear way as the number of observations increases.

@hadley
Copy link
Member

@hadley hadley commented Jul 2, 2019

Even simpler reprex, and evidence that it's not related to strings:

library(dplyr, warn.conflicts = FALSE)
df <- tibble(x = as.character(1:5000))

system.time(df %>%
  group_by(x) %>%
  mutate(y = paste0(x, "1"))
)
#>    user  system elapsed 
#>   1.660   0.023   1.694

df <- tibble(x = 1:5000)
system.time(df %>%
  group_by(x) %>%
  mutate(y = x + 1)
)
#>    user  system elapsed 
#>   1.464   0.018   1.488

Created on 2019-07-02 by the reprex package (v0.3.0)

@hadley
Copy link
Member

@hadley hadley commented Jul 2, 2019

But a constant is much much faster:

library(dplyr, warn.conflicts = FALSE)
df <- tibble(x = as.character(1:5000))

df <- tibble(x = 1:5000)
system.time(df %>%
  group_by(x) %>%
  mutate(y = 1)
)
#>    user  system elapsed 
#>   0.019   0.001   0.021

Created on 2019-07-02 by the reprex package (v0.3.0)

@romainfrancois romainfrancois added this to the 0.8.3 milestone Jul 3, 2019
romainfrancois added a commit that referenced this issue Jul 3, 2019
👋 🐌
closes #4458
@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Jul 3, 2019

Please try remotes::install_github("tidyverse/dplyr#4461") see #4461 for details

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Jul 3, 2019

I get :

library(dplyr, warn.conflicts = FALSE)
tmpdata <- data.frame(id=replicate(100000,paste(sample(letters,4),collapse='')),
  stringsAsFactors=FALSE)

system.time(
  outdata <- tmpdata %>%                                                                                                                   
    group_by(id) %>%
    mutate(inxx=grepl("x",id))
)
#>    user  system elapsed 
#>   2.252   0.051   2.304
sessioninfo::package_info("dplyr", dependencies = FALSE)
#>  package * version    date       lib source
#>  dplyr   * 0.8.2.9000 2019-07-03 [1] local 
#> 
#> [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

and this with dplyr 0.8.1 (all things equal otherwise):

library(dplyr, warn.conflicts = FALSE)
tmpdata <- data.frame(id=replicate(100000,paste(sample(letters,4),collapse='')),
  stringsAsFactors=FALSE)

system.time(
  outdata <- tmpdata %>%                                                                                                                   
    group_by(id) %>%
    mutate(inxx=grepl("x",id))
)
#>    user  system elapsed 
#>   2.297   0.051   2.349
sessioninfo::package_info("dplyr", dependencies = FALSE)
#>  package * version date       lib source                          
#>  dplyr   * 0.8.1   2019-07-03 [1] Github (tidyverse/dplyr@73a8ced)
#> 
#> [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

Created on 2019-07-03 by the reprex package (v0.3.0.9000)

@gvfarns
Copy link
Author

@gvfarns gvfarns commented Jul 3, 2019

Confirmed. With this change, performance is restored.

@grantmcdermott
Copy link

@grantmcdermott grantmcdermott commented Jul 3, 2019

Running remotes::install_github("tidyverse/dplyr") fixes things for me too. (I don't think the separate tidyverse/dplyr#4461 patch is available any more.) Thanks Romain and Hadley for getting to the bottom of this so quickly!

@shabbybanks
Copy link
Contributor

@shabbybanks shabbybanks commented Jul 3, 2019

I was going to file an issue that I think is the same as this one concerning use of %in% in a 'group-summarize'. My MWE:

nobs <- 1e6
ngrp <- 1e3
set.seed(1234)
df <- tibble(grp_num=sample(1:ngrp,size=nobs,replace=TRUE),
             has_letters=sample(letters,size=nobs,replace=TRUE))
lookup_letters <- c('a','b','c')
system.time({
  asum <- df %>%
    group_by(grp_num) %>%
      summarize(has_all=all(lookup_letters %in% has_letters)) %>%
    ungroup()
})

In dplyr 0.7.7 (ok, I'm behind), this takes:

   user  system elapsed
  0.704   0.021   0.725

In 0.8.2 it is:

   user  system elapsed
 51.565   0.009  51.561

(I do not have github access on this machine, but will try the 4461 patch when at home.)

@gvfarns
Copy link
Author

@gvfarns gvfarns commented Jul 3, 2019

shabbybanks, your example ran very quickly on my machine using 0.8.2 after implementing the fix

   user  system elapsed 
  0.061   0.006   0.069 

And much slower on the machine without the fix

   user  system elapsed 
  3.674   0.955   4.671 

Though neither approaches the 51 seconds you got.

@shabbybanks
Copy link
Contributor

@shabbybanks shabbybanks commented Jul 5, 2019

I have confirmed on my home machine that 0.8.3 fixes my issue.
@gvfarns : I get paid the big bucks to work on slow machines with old versions of R...

@lock
Copy link

@lock lock bot commented Jan 1, 2020

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Jan 1, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
5 participants