group_by + summarise in dplyr v8.0 is so much slower than previous version #4202

Closed
snp opened this issue Feb 20, 2019 · 6 comments

@snp snp commented Feb 20, 2019

Simple operations on fairly large data frames now take forever to complete.

I have a tibble:

> pepData
# A tibble: 1,496,130 x 6
   id    Sequence Modifications Protein Gene  Description 
   <chr> <chr>    <chr>         <chr>   <chr> <chr>  

This operation used to complete in 2 sec:

> pepData %>% group_by(Protein) %>% summarise(n=n_distinct(id))
# A tibble: 6,614 x 2
   Protein        n
   <chr>      <int>

With the new dplyr v0.8.0 it takes 5 min or more. Adding .drop = TRUE to group_by() doesn't help.

Member

@romainfrancois romainfrancois commented Feb 20, 2019

Please submit a reprex; we don't have pepData.

There are no factors involved, so .drop won't make a difference.

Author

@snp snp commented Feb 20, 2019

Here is an example (1 min in v0.8.0 vs 2 sec in v0.7.8):

> devtools::install_github("tidyverse/dplyr", ref="v0.8.0.1")
Skipping install of 'dplyr' from a github remote, the SHA1 (0cefb86c) has not changed since last install.
  Use `force = TRUE` to force installation
> library(dplyr)
> timestamp()
##------ Wed Feb 20 14:16:30 2019 ------##
> df <- tibble(x=as.character(pi*sample.int(1e7,1e6,replace=TRUE)), y=as.character(pi*sample.int(3e3, 1e6, replace=TRUE)))
> df %>% group_by(y) %>% summarise(n_distinct(x))
# A tibble: 3,000 x 2
   y                `n_distinct(x)`
   <chr>                      <int>
 1 100.530964914873             333
 2 1002.16805649514             310
 3 1005.30964914873             348
 4 1008.45124180232             300
 5 1011.59283445591             337
 6 1014.7344271095              304
 7 1017.87601976309             349
 8 1021.01761241668             345
 9 1024.15920507027             314
10 1027.30079772386             316
# … with 2,990 more rows
> timestamp()
##------ Wed Feb 20 14:17:38 2019 ------##
> 

And with older version:

> devtools::install_github("tidyverse/dplyr", ref="v0.7.8")
Skipping install of 'dplyr' from a github remote, the SHA1 (ebdf2236) has not changed since last install.
  Use `force = TRUE` to force installation
> library(dplyr)
> timestamp()
##------ Wed Feb 20 14:21:59 2019 ------##
> df <- tibble(x=as.character(pi*sample.int(1e7,1e6,replace=TRUE)), y=as.character(pi*sample.int(3e3, 1e6, replace=TRUE)))
> df %>% group_by(y) %>% summarise(n_distinct(x))
# A tibble: 3,000 x 2
   y                `n_distinct(x)`
   <chr>                      <int>
 1 100.530964914873             339
 2 1002.16805649514             367
 3 1005.30964914873             336
 4 1008.45124180232             295
 5 1011.59283445591             325
 6 1014.7344271095              351
 7 1017.87601976309             333
 8 1021.01761241668             312
 9 1024.15920507027             359
10 1027.30079772386             329
# … with 2,990 more rows
> timestamp()
##------ Wed Feb 20 14:22:02 2019 ------##
> 
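The two transcripts above bracket the call with timestamp() and leave the subtraction to the reader. A minimal sketch of the same comparison using system.time(), which reports elapsed seconds directly, might look like this. The fixed seed and the variable names n, res and timing are assumptions added for illustration; they are not part of the original report.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed so every dplyr version sees the same data
n <- 1e6     # same row count as the example above; lower it for a quicker check

df <- tibble(
  x = as.character(pi * sample.int(1e7, n, replace = TRUE)),
  y = as.character(pi * sample.int(3e3, n, replace = TRUE))
)

# system.time() evaluates the expression and returns user, system and
# elapsed times in seconds; the assignment to res happens in this environment.
timing <- system.time(
  res <- df %>% group_by(y) %>% summarise(n = n_distinct(x))
)

print(packageVersion("dplyr"))  # record which version produced the timing
print(timing[["elapsed"]])
```

Running the script once per installed version and comparing the printed elapsed values should give the same kind of evidence as the timestamps, in a form that is easier to paste into a reprex.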

Member

@romainfrancois romainfrancois commented Feb 20, 2019

Thanks. It looks like it's more a summarise() problem:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df <- tibble(
  x = as.character(pi * sample.int(1e7, 1e6, replace = TRUE)),
  y = as.character(pi * sample.int(3e3, 1e6, replace = TRUE))
)

tf <- tempfile()
Rprof(tf)
  
grouped <- group_by(df, y)
res <- summarise(grouped, n_distinct(x))

Rprof(NULL)

summaryRprof(tf)
#> $by.self
#>                   self.time self.pct total.time total.pct
#> "summarise_impl"      45.74    98.32      45.74     98.32
#> "grouped_df_impl"      0.78     1.68       0.78      1.68
#> 
#> $by.total
#>                       total.time total.pct self.time self.pct
#> "<Anonymous>"              46.52    100.00      0.00     0.00
#> "block_exec"               46.52    100.00      0.00     0.00
#> "call_block"               46.52    100.00      0.00     0.00
#> "do.call"                  46.52    100.00      0.00     0.00
#> "doTryCatch"               46.52    100.00      0.00     0.00
#> "eval"                     46.52    100.00      0.00     0.00
#> "evaluate_call"            46.52    100.00      0.00     0.00
#> "evaluate::evaluate"       46.52    100.00      0.00     0.00
#> "evaluate"                 46.52    100.00      0.00     0.00
#> "handle"                   46.52    100.00      0.00     0.00
#> "in_dir"                   46.52    100.00      0.00     0.00
#> "knitr::knit"              46.52    100.00      0.00     0.00
#> "process_file"             46.52    100.00      0.00     0.00
#> "process_group.block"      46.52    100.00      0.00     0.00
#> "process_group"            46.52    100.00      0.00     0.00
#> "rmarkdown::render"        46.52    100.00      0.00     0.00
#> "saveRDS"                  46.52    100.00      0.00     0.00
#> "timing_fn"                46.52    100.00      0.00     0.00
#> "try"                      46.52    100.00      0.00     0.00
#> "tryCatch"                 46.52    100.00      0.00     0.00
#> "tryCatchList"             46.52    100.00      0.00     0.00
#> "tryCatchOne"              46.52    100.00      0.00     0.00
#> "withCallingHandlers"      46.52    100.00      0.00     0.00
#> "withVisible"              46.52    100.00      0.00     0.00
#> "summarise_impl"           45.74     98.32     45.74    98.32
#> "summarise.tbl_df"         45.74     98.32      0.00     0.00
#> "summarise"                45.74     98.32      0.00     0.00
#> "grouped_df_impl"           0.78      1.68      0.78     1.68
#> "group_by.data.frame"       0.78      1.68      0.00     0.00
#> "group_by"                  0.78      1.68      0.00     0.00
#> "grouped_df"                0.78      1.68      0.00     0.00
#> 
#> $sample.interval
#> [1] 0.02
#> 
#> $sampling.time
#> [1] 46.52

Something probably went wrong in the implementation of the hybrid n_distinct(). We will get to that as part of the effort in 0.9.0 to use hashing from vctrs.
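If the slowdown really is confined to the hybrid n_distinct() handler, as the profile suggests, one way to probe that is to write the same count as length(unique(x)). As far as I know that expression is not recognised by hybrid evaluation and falls back to standard R evaluation per group. The sketch below only checks that the two forms agree on a smaller, made-up data set; whether the fallback is actually faster on 0.8.0 is an assumption that is not measured here, and the names hybrid and fallback are illustrative.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed for a repeatable check
df <- tibble(
  x = as.character(sample.int(1e4, 1e5, replace = TRUE)),
  y = as.character(sample.int(300, 1e5, replace = TRUE))
)
grouped <- group_by(df, y)

# Form reported as slow in 0.8.0: handled by the hybrid n_distinct().
hybrid <- summarise(grouped, n = n_distinct(x))

# Equivalent count written with base functions, evaluated group by group.
fallback <- summarise(grouped, n = length(unique(x)))

# Both forms should report the same number of distinct values per group.
stopifnot(identical(hybrid$y, fallback$y))
stopifnot(identical(hybrid$n, fallback$n))
```

Wrapping each summarise() call in system.time() on an affected installation would show whether the fallback sidesteps the regression.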

Member

@romainfrancois romainfrancois commented Feb 20, 2019

@snp can you confirm that #4205 fixes the performance problem?

Author

@snp snp commented Feb 20, 2019

@romainfrancois yes, #4205 fixes the problem. Thx!


@lock lock bot commented Aug 19, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Aug 19, 2019