
Upgrading past dplyr 1.0.0 to either 1.0.1 or 1.0.2 causes an order-of-magnitude slowdown in grouped mutate() #5675

Closed
gmonaie opened this issue Jan 4, 2021 · 5 comments · Fixed by #5686

@gmonaie

gmonaie commented Jan 4, 2021

Using max(na.rm = TRUE) in a mutate() statement while grouping by two columns slows down by an order of magnitude from version 1.0.0 to 1.0.1.

The time to complete differs by orders of magnitude between versions 1.0.0 and 1.0.2; when the tibble is very large (thousands or hundreds of thousands of rows), the time explodes. Reproducing the example requires a session restart between installing versions, to ensure the old package is unloaded and the new version is loaded.

library(devtools)
library(rstudioapi)
#> Loading required package: usethis
packageVersion("dplyr")
#> [1] '1.0.0'
install_version("dplyr", version = "1.0.0", repos = "http://cran.us.r-project.org", upgrade = "never", quiet = TRUE)
restartSession()

Created on 2021-01-04 by the reprex package (v0.3.0)

library(tibble)
library(dplyr)
#> Warning: replacing previous import 'vctrs::data_frame' by
#> 'tibble::data_frame' when loading 'dplyr'
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tictoc)
packageVersion('dplyr')
#> [1] '1.0.0'
df <- tidyr::crossing(tibble(at_date=seq.Date(from = as.Date("2010-01-01"), to = as.Date("2011-01-31"), by = 1), risk_date=lag(at_date,2)) %>% 
                        mutate(risk_date=as.Date(ifelse(risk_date==as.Date("2010-01-01"), risk_date, NA), origin = "1970-01-01")), tibble(letters=rep(letters[1:3])))
df <- bind_rows(df, df) %>% arrange(at_date, letters)
tic()
result <- suppressWarnings(df %>% group_by(at_date, letters) %>% mutate(risk_date=max(risk_date, na.rm=T)))
toc()
#> 0.154 sec elapsed

Created on 2021-01-04 by the reprex package (v0.3.0)

library(devtools)
library(rstudioapi)
#> Loading required package: usethis
packageVersion("dplyr")
#> [1] '1.0.0'
install_version("dplyr", version = "1.0.2", repos = "http://cran.us.r-project.org", upgrade = "never", quiet = TRUE)
restartSession()
library(tibble)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tictoc)
packageVersion('dplyr')
#> [1] '1.0.2'
df <- tidyr::crossing(tibble(at_date=seq.Date(from = as.Date("2010-01-01"), to = as.Date("2011-01-31"), by = 1), risk_date=lag(at_date,2)) %>% 
                        mutate(risk_date=as.Date(ifelse(risk_date==as.Date("2010-01-01"), risk_date, NA), origin = "1970-01-01")), tibble(letters=rep(letters[1:3])))
df <- bind_rows(df, df) %>% arrange(at_date, letters)
tic()
result <- suppressWarnings(df %>% group_by(at_date, letters) %>% mutate(risk_date=max(risk_date, na.rm=T)))
toc()
#> 2.031 sec elapsed

Created on 2021-01-04 by the reprex package (v0.3.0)

@romainfrancois
Member

That's intriguing. It seems to be about the handling of the warnings:

[master] 256.1 MiB ❯ bench::workout({
+     g <- df %>% group_by(at_date, letters)
+     g %>% mutate(risk_date=max(risk_date, na.rm=FALSE))
+ })
# A tibble: 2 x 3
  exprs                                                    process     real
  <bch:expr>                                              <bch:tm> <bch:tm>
1 g <- df %>% group_by(at_date, letters)                    4.25ms   3.75ms
2 g %>% mutate(risk_date = max(risk_date, na.rm = FALSE))  28.98ms  28.85ms
[master] 256.1 MiB ❯ 
[master] 256.1 MiB ❯ bench::workout({
+     g <- df %>% group_by(at_date, letters)
+     g %>% mutate(risk_date=max(risk_date, na.rm=TRUE))
+ })
# A tibble: 2 x 3
  exprs                                                   process     real
  <bch:expr>                                             <bch:tm> <bch:tm>
1 g <- df %>% group_by(at_date, letters)                   3.92ms   3.46ms
2 g %>% mutate(risk_date = max(risk_date, na.rm = TRUE))    1.84s    1.84s

so this calling handler in dplyr's evaluation code appears to be the culprit:

warning = function(w) {

Suppressing the warnings before they are handled has a clear impact:

[master] 256.1 MiB ❯ bench::workout({
+   g <- df %>% group_by(at_date, letters)
+   g %>% mutate(risk_date=suppressWarnings(max(risk_date, na.rm=TRUE)))
+ })
# A tibble: 2 x 3
  exprs                                                                     process     real
  <bch:expr>                                                               <bch:tm> <bch:tm>
1 g <- df %>% group_by(at_date, letters)                                     3.64ms   3.25ms
2 g %>% mutate(risk_date = suppressWarnings(max(risk_date, na.rm = TRUE)))  95.97ms  95.98ms

@romainfrancois
Member

Running this without the warning= handler gives:

[master*] 243.4 MiB ❯ bench::workout({
+   g <- df %>% group_by(at_date, letters)
+   g %>% mutate(risk_date=max(risk_date, na.rm=TRUE))
+ })
# A tibble: 2 x 3
  exprs                                                   process     real
  <bch:expr>                                             <bch:tm> <bch:tm>
1 g <- df %>% group_by(at_date, letters)                   5.44ms   5.28ms
2 g %>% mutate(risk_date = max(risk_date, na.rm = TRUE))  82.65ms  82.71ms

which is similar to the performance when the warning is suppressed. Most of the difference originally observed comes from catching and promoting the warnings. However, that instrumentation adds value; see the original discussion in #5315.

This will potentially improve when we bring back some form of hybrid evaluation for max().
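A minimal standalone sketch (mine, not dplyr's actual internals) of why per-group warning instrumentation is costly: evaluating an expression that warns once per "group", with and without a calling handler that inspects and re-packages each warning, mirrors the cost of catching and promoting warnings. The handler body is a stand-in for whatever context-recording dplyr does.

```r
# Simulate many groups, each of which raises one warning:
# max(numeric(0), na.rm = TRUE) warns and returns -Inf.
groups <- replicate(5000, numeric(0), simplify = FALSE)

# Baseline: warnings are raised but simply suppressed.
plain <- system.time(
  for (g in groups) suppressWarnings(max(g, na.rm = TRUE))
)[["elapsed"]]

# Instrumented: a calling handler fires for every warning and builds a
# new condition object before muffling, as a stand-in for dplyr's
# context-adding instrumentation.
instrumented <- system.time(
  for (g in groups) {
    suppressWarnings(
      withCallingHandlers(
        max(g, na.rm = TRUE),
        warning = function(w) {
          cnd <- simpleWarning(paste0("in group: ", conditionMessage(w)))
          invokeRestart("muffleWarning")
        }
      )
    )
  }
)[["elapsed"]]

c(plain = plain, instrumented = instrumented)
```

On any recent R the instrumented loop should be measurably slower, since the handler runs once per group.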

@lionel-
Member

lionel- commented Jan 12, 2021

@romainfrancois I think the problem is more general than max() and so won't be solved with hybrid evaluation.

Maybe we should avoid instrumenting warnings if a muffleWarning restart is on the stack? This way using a suppressWarnings() on the outside wouldn't cause this surprising performance degradation. I can take a look if you'd like.

Edit: oops of course a muffle restart will always be on the stack. What we need is to resignal the condition to allow a potential calling handler to invoke that restart.
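A rough sketch of the resignaling idea described above, using a hypothetical `instrument_warnings()` wrapper (this is not dplyr's API; the actual fix landed via the linked commits): the handler re-signals the caught condition so that an outer calling handler, such as the one installed by suppressWarnings(), gets a chance to invoke the muffleWarning restart before any instrumentation work is kept.

```r
# Hypothetical wrapper: instrument warnings but let outer handlers muffle.
instrument_warnings <- function(expr) {
  withCallingHandlers(
    expr,
    warning = function(w) {
      # ... add group/context information to the condition here ...
      # Resignal: outer calling handlers (e.g. from suppressWarnings())
      # now see the condition and can muffle it.
      warning(w)
      # Then muffle the original signal so it isn't reported twice.
      invokeRestart("muffleWarning")
    }
  )
}

# The outer suppressWarnings() muffles the resignaled condition;
# the value (-Inf here) is returned without a visible warning.
res <- suppressWarnings(instrument_warnings(max(numeric(0), na.rm = TRUE)))
```

Without an outer suppressWarnings(), the resignaled warning propagates normally and is reported once.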

lionel- added a commit to lionel-/dplyr that referenced this issue Jan 13, 2021
@romainfrancois romainfrancois added this to the 1.0.4 milestone Jan 18, 2021
@norihikorihiko

I am still bothered by slowdowns in group_by(), summarize(), mutate_at(), and so on, even with dplyr 1.0.3. I tested dplyr versions 1.0.0 through 1.0.3, and in certain situations the slowdown occurs even with 1.0.0.

The slowdown appears with vctrs 0.3.1 and above, and/or after installing dtplyr 1.0.1, even when using dplyr 1.0.0.

@romainfrancois
Member

@norihikorihiko please be specific: review the issues we already have on the theme of performance (https://github.com/tidyverse/dplyr/issues?q=is%3Aopen+is%3Aissue+label%3A%22performance+%3Arocket%3A%22) and open a new one with a reprex if your findings are not covered.

lionel- added a commit to lionel-/dplyr that referenced this issue Jan 18, 2021