Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mutate and summarize speed related to parentheses #6681

Closed
dkutner opened this issue Feb 2, 2023 · 6 comments · Fixed by #6714
Closed

Mutate and summarize speed related to parentheses #6681

dkutner opened this issue Feb 2, 2023 · 6 comments · Fixed by #6714
Milestone

Comments

@dkutner
Copy link

dkutner commented Feb 2, 2023

Using dplyr 1.1.0 and vctrs 0.5.2, I'm noticing speed issues with mutate and summarize related to parentheses on the RHS.

df <- tibble::tibble(x = 1:10000)
bench::mark(
  b1 = dplyr::summarize(
    df,
    res = (1 + 1)
  ),
  b2 = dplyr::summarize(
    df,
    res = 1 + 1
  )
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 b1           1.47ms   1.53ms      625.    3.78MB     12.6
#> 2 b2           3.13ms   3.38ms      290.   546.9KB     15.6
bench::mark(
  b1 = dplyr::mutate(
    df,
    res = (1 + 1)
  ),
  b2 = dplyr::mutate(
    df,
    res = 1 + 1
  )
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 b1           1.64ms   1.73ms      557.     363KB     14.9
#> 2 b2           3.23ms    3.5ms      279.     102KB     15.4

On dplyr 1.0.10 with vctrs 0.5.0, all give the same result, ~1.5 ms. I'm using tibble 3.1.8 for both tests. With more complicated aggregation expressions, I've seen 15x slowdowns.

@DavisVaughan DavisVaughan added this to the 1.1.1 milestone Feb 7, 2023
@DavisVaughan
Copy link
Member

If you have an example of a 15x slowdown, we'd like to see it

@DavisVaughan
Copy link
Member

DavisVaughan commented Feb 7, 2023

I think I've isolated this particular slowdown to:

library(rlang)

bench::mark(
    as_label(quo((1 + 1)))
)
#> # A tibble: 1 × 6
#>   expression                  min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>             <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 as_label(quo((1 + 1)))   73.6µs     78µs    11784.     128KB     14.4
bench::mark(
    as_label(quo(1 + 1))
)
#> # A tibble: 1 × 6
#>   expression                min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 as_label(quo(1 + 1))   1.34ms   1.66ms      599.     495KB     17.3

Being called in summarise() through the error context setup here

expr <- quo_as_label(dots[[i]])

This actually happens to be the same exact slowdown as #6674, CC @lionel-

From as_label(), infix_overflows() is called and shortcircuits early with (1 + 1) but does the full expr_deparse() with 1 + 1.


I'm under the impression that we only pay this cost 1 time per expression in summarise(), so it really isn't that bad.

We can use the opt-out temporary global option that @lionel- is going to make for the case_when() issue here if we feel like we need it, but I think I might need more proof that this is a substantial slowdown.

@DavisVaughan
Copy link
Member

It also happens with mutate() because it uses the same error context infrastructure, but again it seems like we are paying a relatively small fixed cost in this case

library(rlang)
library(dplyr, warn.conflicts = FALSE)

# Another expression that does it (it's the infix `+` operator causing the issue)
bench::mark(
  as_label(quo((x + 1))),
  as_label(quo(x + 1)),
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression                  min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>             <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 as_label(quo((x + 1)))  73.69µs   89.9µs    10482.     128KB     16.7
#> 2 as_label(quo(x + 1))     1.44ms   1.86ms      527.     591KB     14.9

# Noticeable on small data
df <- tibble(x = 1:5 + 0L)

bench::mark(
  slow = mutate(df, y = x + 1),
  fast = mutate(df, y = (x + 1))
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 slow         2.77ms   3.37ms      303.    1.78MB     17.6
#> 2 fast         1.43ms   1.71ms      574.   22.41KB     12.8

# Mostly goes away at any scale
df <- tibble(
  x = 1:50000000 + 0L,
  g = sample(10, length(x), replace = TRUE)
)

bench::mark(
  slow = mutate(df, y = x + 1, .by = g),
  fast = mutate(df, y = (x + 1), .by = g),
  iterations = 10
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 slow          1.33s     1.4s     0.708    1.99GB     1.42
#> 2 fast          1.32s    1.39s     0.720    1.99GB     1.44

Created on 2023-02-07 with reprex v2.0.2.9000

@dkutner
Copy link
Author

dkutner commented Feb 7, 2023

Here's an example with a larger hit. The more terms there are on the RHS, the bigger the slowdown.

df <- tibble::tibble(x = 1:10000)
bench::mark(
  b1 = dplyr::summarize(
    df,
    res = (mean(x) + mean(x) + mean(x) + mean(x) + mean(x) + mean(x) + mean(x) + mean(x))
  ),
  b2 = dplyr::summarize(
    df,
    res = mean(x) + mean(x) + mean(x) + mean(x) + mean(x) + mean(x) + mean(x) + mean(x)
  )
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 b1            2.1ms   2.21ms     433.      3.8MB     15.2
#> 2 b2             54ms  53.95ms      18.5   698.2KB    167.

@DavisVaughan
Copy link
Member

Ah that is very helpful thanks!

@lionel-
Copy link
Member

lionel- commented Feb 10, 2023

I've added a private option to disable the slow part of as_label() until we find a more general solution (lazy names).

r-lib/rlang@33db700

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants