New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More Efficient group_by(...) %>% sample_*(...)
#3193
Conversation
Great - much easier to start with this minimal change and then figure out the other stuff. Can you please add a bullet to NEWS? It should briefly describe the change (starting with name of the function), and crediting yourself with |
f74e872
to
9fc9696
Compare
Done. Can you please take another look? |
@@ -265,11 +265,11 @@ sample_n.grouped_df <- function(tbl, size, replace = FALSE, | |||
inform("`.env` is deprecated and no longer has any effect") | |||
} | |||
weight <- enquo(weight) | |||
weight <- mutate(tbl, w = !!weight)[["w"]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a space after the !!
please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind, !!
is going to have very high precedence so now it makes sense to have it close to its argument just like unary -
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great. Let me know if there is anything else I can help to help this PR merged.
9fc9696
to
cd23d86
Compare
Improves `df %>% group_by(...) %>% sample_*(...)` performance by 10-100x for dataset with large number of groups. The motivation is that when performing stratified sampling using `group_by %>% sample_n` on 100k+ strata, it can take minutes or longer. A toy example shows every 1k groups increases runtime by ~2s. A quick profiling shows most of time is spent in `eval_tidy(weight, ...)` for each group. This PR performs the weight calculation using `mutate` instead to preserve the semantics but eliminates the repeated overscope lookup across groups.
cd23d86
to
ce4ba23
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Looks good to me except for the small nit.
@@ -292,11 +292,11 @@ sample_frac.grouped_df <- function(tbl, size = 1, replace = FALSE, | |||
) | |||
} | |||
weight <- enquo(weight) | |||
weight <- mutate(tbl, w = !!weight)[["w"]] | |||
|
|||
index <- attr(tbl, "indices") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks a bit cleaner (and maybe faster) if we assign index <- attr(...) + 1L
here and don't add in sample_group()
.
Thanks! |
Improves
df %>% group_by(...) %>% sample_*(...)
performance by 10-100x for dataset with large number of groups.The motivation is that when performing stratified sampling using
group_by %>% sample_n
on 100k+ strata, it can take minutes or longer. A toy example shows every 1k groups increases runtime by ~2s. A quick profiling shows most of time is spent ineval_tidy(weight, ...)
for each group. This PR performs the weight calculation usingmutate
instead to preserve the semantics but eliminates the repeated overscope lookup across groups.Before (dplyr master)
After