
Upgrading past dplyr 1.0.0 to either 1.0.1 or 1.0.2 causes an order-of-magnitude slowdown in grouped mutate() #5675

Closed
gmonaie opened this issue Jan 4, 2021 · 5 comments · Fixed by #5686

@gmonaie

gmonaie commented Jan 4, 2021

Using max(na.rm = TRUE) in a mutate() statement while grouping by two columns slows down by an order of magnitude from version 1.0.0 to 1.0.1.

The time to complete differs by orders of magnitude between versions 1.0.0 and 1.0.2; when the tibble is very large (thousands or hundreds of thousands of rows), the time explodes. Reproducing the example requires a session restart between installing versions, to ensure the old package is unloaded and the new version is loaded.

library(devtools)
library(rstudioapi)
#> Loading required package: usethis
packageVersion("dplyr")
#> [1] '1.0.0'
install_version("dplyr", version = "1.0.0", repos = "http://cran.us.r-project.org", upgrade = "never", quiet = TRUE)
restartSession()

Created on 2021-01-04 by the reprex package (v0.3.0)

library(tibble)
library(dplyr)
#> Warning: replacing previous import 'vctrs::data_frame' by
#> 'tibble::data_frame' when loading 'dplyr'
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tictoc)
packageVersion('dplyr')
#> [1] '1.0.0'
df <- tidyr::crossing(tibble(at_date=seq.Date(from = as.Date("2010-01-01"), to = as.Date("2011-01-31"), by = 1), risk_date=lag(at_date,2)) %>% 
                        mutate(risk_date=as.Date(ifelse(risk_date==as.Date("2010-01-01"), risk_date, NA), origin = "1970-01-01")), tibble(letters=rep(letters[1:3])))
df <- bind_rows(df, df) %>% arrange(at_date, letters)
tic()
result <- suppressWarnings(df %>% group_by(at_date, letters) %>% mutate(risk_date=max(risk_date, na.rm=T)))
toc()
#> 0.154 sec elapsed

Created on 2021-01-04 by the reprex package (v0.3.0)

library(devtools)
library(rstudioapi)
#> Loading required package: usethis
packageVersion("dplyr")
#> [1] '1.0.0'
install_version("dplyr", version = "1.0.2", repos = "http://cran.us.r-project.org", upgrade = "never", quiet = TRUE)
restartSession()
library(tibble)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tictoc)
packageVersion('dplyr')
#> [1] '1.0.2'
df <- tidyr::crossing(tibble(at_date=seq.Date(from = as.Date("2010-01-01"), to = as.Date("2011-01-31"), by = 1), risk_date=lag(at_date,2)) %>% 
                        mutate(risk_date=as.Date(ifelse(risk_date==as.Date("2010-01-01"), risk_date, NA), origin = "1970-01-01")), tibble(letters=rep(letters[1:3])))
df <- bind_rows(df, df) %>% arrange(at_date, letters)
tic()
result <- suppressWarnings(df %>% group_by(at_date, letters) %>% mutate(risk_date=max(risk_date, na.rm=T)))
toc()
#> 2.031 sec elapsed

Created on 2021-01-04 by the reprex package (v0.3.0)

@romainfrancois
Member

That's intriguing. It seems to be about the handling of the warnings:

[master] 256.1 MiB ❯ bench::workout({
+     g <- df %>% group_by(at_date, letters)
+     g %>% mutate(risk_date=max(risk_date, na.rm=FALSE))
+ })
# A tibble: 2 x 3
  exprs                                                    process     real
  <bch:expr>                                              <bch:tm> <bch:tm>
1 g <- df %>% group_by(at_date, letters)                    4.25ms   3.75ms
2 g %>% mutate(risk_date = max(risk_date, na.rm = FALSE))  28.98ms  28.85ms
[master] 256.1 MiB ❯ 
[master] 256.1 MiB ❯ bench::workout({
+     g <- df %>% group_by(at_date, letters)
+     g %>% mutate(risk_date=max(risk_date, na.rm=TRUE))
+ })
# A tibble: 2 x 3
  exprs                                                   process     real
  <bch:expr>                                             <bch:tm> <bch:tm>
1 g <- df %>% group_by(at_date, letters)                   3.92ms   3.46ms
2 g %>% mutate(risk_date = max(risk_date, na.rm = TRUE))    1.84s    1.84s

so this calling handler in dplyr's evaluation code appears to be the culprit:

warning = function(w) {

Suppressing the warnings before they are handled has a clear impact:

[master] 256.1 MiB ❯ bench::workout({
+   g <- df %>% group_by(at_date, letters)
+   g %>% mutate(risk_date=suppressWarnings(max(risk_date, na.rm=TRUE)))
+ })
# A tibble: 2 x 3
  exprs                                                                     process     real
  <bch:expr>                                                               <bch:tm> <bch:tm>
1 g <- df %>% group_by(at_date, letters)                                     3.64ms   3.25ms
2 g %>% mutate(risk_date = suppressWarnings(max(risk_date, na.rm = TRUE)))  95.97ms  95.98ms

@romainfrancois
Member

Running this without the warning= handler gives:

[master*] 243.4 MiB ❯ bench::workout({
+   g <- df %>% group_by(at_date, letters)
+   g %>% mutate(risk_date=max(risk_date, na.rm=TRUE))
+ })
# A tibble: 2 x 3
  exprs                                                   process     real
  <bch:expr>                                             <bch:tm> <bch:tm>
1 g <- df %>% group_by(at_date, letters)                   5.44ms   5.28ms
2 g %>% mutate(risk_date = max(risk_date, na.rm = TRUE))  82.65ms  82.71ms

which is similar to the performance when the warning is suppressed. Most of the difference originally observed comes from catching and promoting the warnings. However, that instrumentation adds value; see the original discussion in #5315.

This will potentially improve when we bring back some form of hybrid evaluation for max().
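A minimal standalone sketch (mine, not dplyr's actual internals) of why per-group warning instrumentation is costly: evaluating an expression that warns once per "group", with and without a calling handler that inspects and re-packages each warning, mirrors the cost of catching and promoting warnings. The handler body is a stand-in for whatever context-recording dplyr does.

```r
# Simulate many groups, each of which raises one warning:
# max(numeric(0), na.rm = TRUE) warns and returns -Inf.
groups <- replicate(5000, numeric(0), simplify = FALSE)

# Baseline: warnings are raised but simply suppressed.
plain <- system.time(
  for (g in groups) suppressWarnings(max(g, na.rm = TRUE))
)[["elapsed"]]

# Instrumented: a calling handler fires for every warning and builds a
# new condition object before muffling, as a stand-in for dplyr's
# context-adding instrumentation.
instrumented <- system.time(
  for (g in groups) {
    suppressWarnings(
      withCallingHandlers(
        max(g, na.rm = TRUE),
        warning = function(w) {
          cnd <- simpleWarning(paste0("in group: ", conditionMessage(w)))
          invokeRestart("muffleWarning")
        }
      )
    )
  }
)[["elapsed"]]

c(plain = plain, instrumented = instrumented)
```

On any recent R the instrumented loop should be measurably slower, since the handler runs once per group.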

@lionel-
Member

lionel- commented Jan 12, 2021

@romainfrancois I think the problem is more general than max() and so won't be solved with hybrid evaluation.

Maybe we should avoid instrumenting warnings if a muffleWarning restart is on the stack? This way using a suppressWarnings() on the outside wouldn't cause this surprising performance degradation. I can take a look if you'd like.

Edit: oops of course a muffle restart will always be on the stack. What we need is to resignal the condition to allow a potential calling handler to invoke that restart.
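A rough sketch of the resignaling idea described above, using a hypothetical `instrument_warnings()` wrapper (this is not dplyr's API; the actual fix landed via the linked commits): the handler re-signals the caught condition so that an outer calling handler, such as the one installed by suppressWarnings(), gets a chance to invoke the muffleWarning restart before any instrumentation work is kept.

```r
# Hypothetical wrapper: instrument warnings but let outer handlers muffle.
instrument_warnings <- function(expr) {
  withCallingHandlers(
    expr,
    warning = function(w) {
      # ... add group/context information to the condition here ...
      # Resignal: outer calling handlers (e.g. from suppressWarnings())
      # now see the condition and can muffle it.
      warning(w)
      # Then muffle the original signal so it isn't reported twice.
      invokeRestart("muffleWarning")
    }
  )
}

# The outer suppressWarnings() muffles the resignaled condition;
# the value (-Inf here) is returned without a visible warning.
res <- suppressWarnings(instrument_warnings(max(numeric(0), na.rm = TRUE)))
```

Without an outer suppressWarnings(), the resignaled warning propagates normally and is reported once.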

lionel- added a commit to lionel-/dplyr that referenced this issue Jan 13, 2021
@romainfrancois romainfrancois added this to the 1.0.4 milestone Jan 18, 2021
@norihikorihiko

I am still bothered by slowdowns in group_by(), summarize(), mutate_at(), and so on, even with dplyr 1.0.3. I tested dplyr versions 1.0.0 through 1.0.3, and in certain situations the slowdown occurs even with 1.0.0.

The slowdown appears with vctrs 0.3.1 and above, and/or after installing dtplyr 1.0.1, even when using dplyr 1.0.0.

@romainfrancois
Member

@norihikorihiko please be specific: review the issues we already have on the theme of performance (https://github.com/tidyverse/dplyr/issues?q=is%3Aopen+is%3Aissue+label%3A%22performance+%3Arocket%3A%22) and open a new one with a reprex if your findings are not covered.

lionel- added a commit to lionel-/dplyr that referenced this issue Jan 18, 2021