Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

across() performance improvements #5697

Merged
merged 32 commits into from Jan 28, 2021
Merged

across() performance improvements #5697

merged 32 commits into from Jan 28, 2021

Conversation

romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Jan 18, 2021

Replaces the use of compat pluck() with an internal loop. Because there is chunks[[i]] <- vec_cast_common(!!!chunks[[i]], .to = result_type) just before, we don't need to do any checks that each element of chunks[[i]] is the correctly types data frame.

image

The left part of the flame graph above is replaced by the selected bit here:

image

Part of #5264, although not strictly an across() performance speedup, would be relevant to anything that returns a data frame.

@romainfrancois
Copy link
Member Author

Using vctrs::new_data_frame() the visible new_tibble() above becomes:

image

@romainfrancois
Copy link
Member Author

@DavisVaughan any chance we can use sys.call() as is somehow as the key instead of deparse()ing it ?

@romainfrancois romainfrancois added this to the 1.0.4 milestone Jan 18, 2021
@romainfrancois romainfrancois changed the title replace compat pluck() with internal more efficient loop across() performance improvements Jan 18, 2021
@romainfrancois
Copy link
Member Author

boom

image

This might need some refinement, but this avoids the whole constructing data frames for the purpose of auto splicing them later.

%>% summarise(across(c(x, y), mean)) becomes %>% summarise(x = mean(x), y = mean(y))

R/summarise.R Outdated
@@ -204,6 +204,58 @@ summarise.rowwise_df <- function(.data, ..., .groups = NULL) {
out
}

top_across <- function(.cols = everything(), .fns = NULL, ..., .names = NULL) {
key <- key_deparse(sys.call())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be wishful thinking, but it seems like the C level r-dict type that @lionel- added to rlang might could be used here instead of deparsing (if that is still a performance factor). Lionel could probably say if that is a good idea or not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one thing I had not realized is that the call made by sys.call() is different each time (i.e. a different SEXP) so some deparsing is needed.

This might be a reach, but I was thinking perhaps instrumenting the call so that we squeeze in a .key argument. This would help for when across() is visible in the call, but not when called from another function, i.e. summarise(colSums(across(<select>, <fns>))) would become summarise(colSums(across(<select>, <fns>, .key = <something>))) but the deparse stuff would still be used for things like:

f <- function(<select>, <fns>) {
  colSums(across(<select>, <fns>))
}
summarise(f(<select>, <fns>))

Although, I suppose if we were willing to instrument the call, we might as well directly expand the across() call, i.e.

summarise(z = across(c(x, y), mean))

would be substituted to (a more complicated version of:, because of cur_column())

summarise(z = vctrs::new_data_frame(df_list(x = mean(x), y = mean(y))))

but then the error messages would look weird ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup the sys calls are duplicated each time.

@romainfrancois What are you using these keys for and what do they represent?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea was that calling e.g. tidyselect::eval_select(cols, data = mask$across_cols()) was expensive and so we wanted to do it ahead of time, once, and then retrieve it when the same across() call is used on another group.

The keys are for identifying which across() call it is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you're already modifying the expressions to replace them with calls to top_across(), it might be simpler to evaluate the tidy selection before evaluation and inline the results in the calls. Then you don't need a cache for across. Am I missing something?

Copy link
Member

@lionel- lionel- Jan 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh the cache will still be useful for wrapping across() in functions without degrading performance with large n_groups.

A few more toughts regarding the caching as I'm learning about this mechanism. I think there a few corner cases that are currently not well addressed:

  • The key should at the minimum include the environment in addition to the call.

  • Even if the key includes both the call and the environment, it is still incorrect to cache in some cases. A tidy selection can include arbitrary R code. An extreme example would be a sample() call. Even if we decide that a selection is evaluated once for every group (which is reasonable), there is still an ambiguity across user inputs. I think you only want to cache if the call is a pure data-expression (see https://tidyselect.r-lib.org/articles/syntax.html#data-expressions-and-env-expressions).

R/summarise.R Outdated Show resolved Hide resolved
R/summarise.R Outdated Show resolved Hide resolved
R/summarise.R Outdated
@@ -204,6 +204,58 @@ summarise.rowwise_df <- function(.data, ..., .groups = NULL) {
out
}

top_across <- function(.cols = everything(), .fns = NULL, ..., .names = NULL) {
key <- key_deparse(sys.call())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you're already modifying the expressions to replace them with calls to top_across(), it might be simpler to evaluate the tidy selection before evaluation and inline the results in the calls. Then you don't need a cache for across. Am I missing something?

@romainfrancois
Copy link
Member Author

top_across() is now used for both summarise() and mutate() and no longer does the caching dance because that's not necessary.

@romainfrancois
Copy link
Member Author

Some comments to help the review.

Quosures are captured with dplyr_quosures(...) instead of a raw enquos(...) to add some information to them: was it named, what name is it given, what auto name can be used instead. There are stored as attributes of the quosure .

Then in mutate() and summarise() before a quosure is processed for all the groups, it first goes through expand_quosure() which returns a list of quosures:

  • either a size 1 list of the one that was given (most of the time)
  • or in case the quosure wasn't named and it is a top level call to across() it returns a list of several quosures. one for each of the columns that would have been created in across() for the sole purpose of later auto splicing them. This way we don't have to pack and unpack.

expand_quosure() works by calling top_across() which shares some code with across() but does not need or use the caching mechanism.

I believe expand_quosure() can also do something for if_any() / if_all() once #5660 is merged. I'll do this in another pull request.

@romainfrancois
Copy link
Member Author

closes #5264

@romainfrancois
Copy link
Member Author

The code in #5254 gives:

library(dplyr, warn.conflicts = FALSE)
library(vroom)

mun2014 <- vroom("https://data.regardscitoyens.org/elections/2014_municipales/MN14_Bvot_T1_01-49.txt", 
                 col_select = -c('X4','X9','X10','X11'),
                 col_names = FALSE, 
                 locale = locale(encoding = "WINDOWS-1252"), altrep = FALSE) 
#> Rows: 275,446
#> Columns: 9
#> Delimiter: ";"
#> chr [4]: X2, X3, X5, X12
#> dbl [5]: X1, X6, X7, X8, X13
#> 
#> Use `spec()` to retrieve the guessed column specification
#> Pass a specification to the `col_types` argument to quiet this message
bench::workout({
  a <- mun2014 %>% group_by_if( is.character )
  b <- a %>% summarise_if( is.numeric, sum ) 
})
#> # A tibble: 2 x 3
#>   exprs                                       process     real
#>   <bch:expr>                                 <bch:tm> <bch:tm>
#> 1 a <- mun2014 %>% group_by_if(is.character)    125ms    126ms
#> 2 b <- a %>% summarise_if(is.numeric, sum)      698ms    699ms
bench::workout({
  c <- mun2014 %>% group_by( across(where(is.character)))
  d <- c %>% summarise( across(where(is.numeric), sum) ) 
})
#> `summarise()` has grouped output by 'X2', 'X3', 'X5'. You can override using the `.groups` argument.
#> # A tibble: 2 x 3
#>   exprs                                                   process     real
#>   <bch:expr>                                             <bch:tm> <bch:tm>
#> 1 c <- mun2014 %>% group_by(across(where(is.character)))    123ms    123ms
#> 2 d <- c %>% summarise(across(where(is.numeric), sum))      690ms    692ms
all.equal(b, d)
#> [1] TRUE

Created on 2021-01-27 by the reprex package (v0.3.0)

so we're on par with summarise_if() / mutate_if() when the across() call is an unnamed top level call (most of the time), but we can still have the flexibility of e.g. wrapping across() or call it from a function, however in that case there's a cost, but that's ok I guess.

@romainfrancois
Copy link
Member Author

closes #4953

R/summarise.R Outdated Show resolved Hide resolved
Copy link
Member

@lionel- lionel- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe needs a high level comment explaining how and why top-level across() calls are expanded to individual expressions to avoid overhead. Would also be nice to mention when unexpanded across() calls have overhead and why.

R/across.R Outdated
@@ -102,7 +102,7 @@ across <- function(.cols = everything(), .fns = NULL, ..., .names = NULL) {

if (is.null(fns)) {
nrow <- length(mask$current_rows())
data <- new_tibble(data, nrow = nrow)
data <- new_data_frame(data, n = nrow)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to check the data frame assumptions of data here? In that case you can use vctrs::data_frame() instead of new_data_frame() as the latter is a fast constructor that does not type-check.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checks are not needed but I reverted to making tibbles now, so I'm inclined to keep using new_data_frame()

R/across.R Outdated
Comment on lines 142 to 145
size <- vec_size_common(!!!out)
out <- vec_recycle_common(!!!out, .size = size)
names(out) <- names
new_tibble(out, nrow = size)
new_data_frame(out, n = size)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace these with vctrs::data_frame(!!!out)?

R/summarise.R Outdated Show resolved Hide resolved
expand_quosure <- function(quo) {
if (quo_is_call(quo, "across", ns = c("", "dplyr")) && !attr(quo, "is_named")) {
quo_env <- quo_get_env(quo)
quo <- new_quosure(node_poke_car(quo_get_expr(quo), top_across), quo_env)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This node-poke seems a bit unsafe.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But at that point, we know quo's expression is a call to across() or dplyr::across(). Do you have a suggestion ?

R/across.R Outdated Show resolved Hide resolved
R/across.R Show resolved Hide resolved
R/across.R Outdated Show resolved Hide resolved
tests/testthat/test-across.R Outdated Show resolved Hide resolved
@romainfrancois romainfrancois merged commit a0d1c5b into master Jan 28, 2021
@romainfrancois romainfrancois deleted the across_perf branch January 28, 2021 15:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants