group_by + summarise in dplyr v8.0 is so much slower than previous version #4202
Comments
Please submit a reprex, we don't have There are no factors involved, so
Here is an example (1 min in v0.8.0 vs 2 sec in v0.7.8):
> devtools::install_github("tidyverse/dplyr", ref="v0.8.0.1")
Skipping install of 'dplyr' from a github remote, the SHA1 (0cefb86c) has not changed since last install.
Use `force = TRUE` to force installation
> library(dplyr)
> timestamp()
##------ Wed Feb 20 14:16:30 2019 ------##
> df <- tibble(x=as.character(pi*sample.int(1e7,1e6,replace=TRUE)), y=as.character(pi*sample.int(3e3, 1e6, replace=T)))
> df %>% group_by(y) %>% summarise(n_distinct(x))
# A tibble: 3,000 x 2
y `n_distinct(x)`
<chr> <int>
1 100.530964914873 333
2 1002.16805649514 310
3 1005.30964914873 348
4 1008.45124180232 300
5 1011.59283445591 337
6 1014.7344271095 304
7 1017.87601976309 349
8 1021.01761241668 345
9 1024.15920507027 314
10 1027.30079772386 316
# … with 2,990 more rows
> timestamp()
##------ Wed Feb 20 14:17:38 2019 ------##
And with the older version:
> devtools::install_github("tidyverse/dplyr", ref="v0.7.8")
Skipping install of 'dplyr' from a github remote, the SHA1 (ebdf2236) has not changed since last install.
Use `force = TRUE` to force installation
> library(dplyr)
> timestamp()
##------ Wed Feb 20 14:21:59 2019 ------##
> df <- tibble(x=as.character(pi*sample.int(1e7,1e6,replace=TRUE)), y=as.character(pi*sample.int(3e3, 1e6, replace=T)))
> df %>% group_by(y) %>% summarise(n_distinct(x))
# A tibble: 3,000 x 2
y `n_distinct(x)`
<chr> <int>
1 100.530964914873 339
2 1002.16805649514 367
3 1005.30964914873 336
4 1008.45124180232 295
5 1011.59283445591 325
6 1014.7344271095 351
7 1017.87601976309 333
8 1021.01761241668 312
9 1024.15920507027 359
10 1027.30079772386 329
# … with 2,990 more rows
> timestamp()
##------ Wed Feb 20 14:22:02 2019 ------##
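The two transcripts above measure the run by reading wall-clock `timestamp()` output before and after the call. A minimal sketch of a more direct measurement, assuming base R's `system.time()` is acceptable for this purpose: the `set.seed()` call, the smaller data size, and the names `n_rows`, `elapsed`, and `distinct_x` are illustrative additions and not part of the original report.

```r
library(dplyr)

# Assumption: a fixed seed so both dplyr versions see the same data.
set.seed(42)

# Smaller than the 1e6 rows in the report so the sketch finishes quickly;
# raise n_rows to reproduce the reported sizes.
n_rows <- 1e5
df <- tibble(
  x = as.character(pi * sample.int(1e6, n_rows, replace = TRUE)),
  y = as.character(pi * sample.int(300, n_rows, replace = TRUE))
)

# system.time() evaluates the expression once and reports timings in seconds.
elapsed <- system.time(
  res <- df %>% group_by(y) %>% summarise(distinct_x = n_distinct(x))
)[["elapsed"]]

cat("dplyr", as.character(packageVersion("dplyr")), "took", elapsed, "s\n")
```

Running the same script under each installed version gives two directly comparable numbers without subtracting timestamps by hand.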
Thanks. It looks like it's more a
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- tibble(
  x = as.character(pi * sample.int(1e7, 1e6, replace = TRUE)),
  y = as.character(pi * sample.int(3e3, 1e6, replace = TRUE))
)
tf <- tempfile()
Rprof(tf)
grouped <- group_by(df, y)
res <- summarise(grouped, n_distinct(x))
Rprof(NULL)
summaryRprof(tf)
#> $by.self
#> self.time self.pct total.time total.pct
#> "summarise_impl" 45.74 98.32 45.74 98.32
#> "grouped_df_impl" 0.78 1.68 0.78 1.68
#>
#> $by.total
#> total.time total.pct self.time self.pct
#> "<Anonymous>" 46.52 100.00 0.00 0.00
#> "block_exec" 46.52 100.00 0.00 0.00
#> "call_block" 46.52 100.00 0.00 0.00
#> "do.call" 46.52 100.00 0.00 0.00
#> "doTryCatch" 46.52 100.00 0.00 0.00
#> "eval" 46.52 100.00 0.00 0.00
#> "evaluate_call" 46.52 100.00 0.00 0.00
#> "evaluate::evaluate" 46.52 100.00 0.00 0.00
#> "evaluate" 46.52 100.00 0.00 0.00
#> "handle" 46.52 100.00 0.00 0.00
#> "in_dir" 46.52 100.00 0.00 0.00
#> "knitr::knit" 46.52 100.00 0.00 0.00
#> "process_file" 46.52 100.00 0.00 0.00
#> "process_group.block" 46.52 100.00 0.00 0.00
#> "process_group" 46.52 100.00 0.00 0.00
#> "rmarkdown::render" 46.52 100.00 0.00 0.00
#> "saveRDS" 46.52 100.00 0.00 0.00
#> "timing_fn" 46.52 100.00 0.00 0.00
#> "try" 46.52 100.00 0.00 0.00
#> "tryCatch" 46.52 100.00 0.00 0.00
#> "tryCatchList" 46.52 100.00 0.00 0.00
#> "tryCatchOne" 46.52 100.00 0.00 0.00
#> "withCallingHandlers" 46.52 100.00 0.00 0.00
#> "withVisible" 46.52 100.00 0.00 0.00
#> "summarise_impl" 45.74 98.32 45.74 98.32
#> "summarise.tbl_df" 45.74 98.32 0.00 0.00
#> "summarise" 45.74 98.32 0.00 0.00
#> "grouped_df_impl" 0.78 1.68 0.78 1.68
#> "group_by.data.frame" 0.78 1.68 0.00 0.00
#> "group_by" 0.78 1.68 0.00 0.00
#> "grouped_df" 0.78 1.68 0.00 0.00
#>
#> $sample.interval
#> [1] 0.02
#>
#> $sampling.time
#> [1] 46.52
Something probably went wrong in the implementation of the hybrid
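The profile above attributes nearly all of the time to `summarise_impl`. One possible way to check whether the cost is tied to `n_distinct()` itself is to time it against the equivalent `length(unique(x))` on the same grouped data. This is only a sketch: it assumes that `length(unique(x))` takes a different evaluation path than `n_distinct(x)` in the affected version, which the thread does not confirm, and the seed, the reduced data size, and the names `n_rows`, `t_nd`, `t_lu`, and `distinct_x` are illustrative.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed so the two runs use identical data

# Reduced size so the comparison runs quickly; the report used 1e6 rows.
n_rows <- 1e5
df <- tibble(
  x = as.character(pi * sample.int(1e6, n_rows, replace = TRUE)),
  y = as.character(pi * sample.int(300, n_rows, replace = TRUE))
)
grouped <- group_by(df, y)

# Time the call from the report.
t_nd <- system.time(
  res_nd <- summarise(grouped, distinct_x = n_distinct(x))
)[["elapsed"]]

# Time the plain-R equivalent inside summarise().
t_lu <- system.time(
  res_lu <- summarise(grouped, distinct_x = length(unique(x)))
)[["elapsed"]]

# Both approaches should count the same number of distinct values per group.
stopifnot(identical(res_nd$distinct_x, res_lu$distinct_x))

cat("n_distinct(x):     ", t_nd, "s\n")
cat("length(unique(x)): ", t_lu, "s\n")
```

If the first timing is much larger than the second on an affected version, that would be consistent with the slowdown sitting in the `n_distinct()` path rather than in grouping.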
@romainfrancois yes, #4205 fixes the problem. Thx!
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/
Simple operations on fairly large data frames now take forever to complete
I have a tibble
This operation used to complete in 2 sec.
With the new dplyr v0.8.0 it takes 5 min or more. Adding .drop = TRUE to group_by() doesn't help.