More Efficient `group_by(...) %>% sample_*(...)` #3193

saurfang · 2017-11-06T08:12:14Z

Improves `df %>% group_by(...) %>% sample_*(...)` performance by 10-100x for datasets with a large number of groups.

The motivation: stratified sampling with `group_by() %>% sample_n()` on 100k+ strata can take minutes or longer. In a toy example, every additional 1k groups adds ~2s of runtime. Quick profiling shows that most of the time is spent in `eval_tidy(weight, ...)`, which runs once per group. This PR computes the weights with a single `mutate()` call instead, which preserves the semantics but eliminates the repeated overscope lookup across groups.
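As a rough illustration of the idea (a minimal sketch, not the PR's actual code): the weight expression is captured once and evaluated through a single grouped `mutate()`, so group-wise expressions such as `x / sum(x)` still see only their own group's rows. The helper name `weights_once`, the scratch column `w`, and the toy data are assumptions made for this example.

```r
library(dplyr)

# Hypothetical helper: evaluate a weight expression for all groups in one pass.
weights_once <- function(tbl, weight) {
  weight <- enquo(weight)
  # One grouped mutate() instead of one evaluation per group;
  # "w" is a scratch column name chosen for this sketch.
  mutate(tbl, w = !!weight)[["w"]]
}

toy <- tibble(g = c("a", "a", "b"), x = c(1, 3, 5))

# The expression is still evaluated within each group, in original row order.
w <- weights_once(group_by(toy, g), x / sum(x))
```

Here `w` holds one weight per row of `toy`, normalized within each group, which is what a per-group sampler would then subset by group.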

library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union

n_strata <- 1000
# number of original observations in each stratum (drawn from a normal distribution)
n_per_strata_mean <- 100
n_per_strata_sd <- 10

# how many rows to sample from each stratum
n_sample <- 10

source_df <-
  data_frame(group = 1:n_strata) %>%
  group_by(group) %>%
  do(sample_n(iris, round(rnorm(1, n_per_strata_mean, n_per_strata_sd)), replace = TRUE)) %>%
  ungroup() %>%
  # shuffle
  sample_frac(1)

source_df
#> # A tibble: 100,151 x 6
#>    group Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#>    <int>        <dbl>       <dbl>        <dbl>       <dbl>     <fctr>
#>  1   406          5.0         3.6          1.4         0.2     setosa
#>  2   998          5.5         2.4          3.8         1.1 versicolor
#>  3   101          6.4         2.8          5.6         2.2  virginica
#>  4   803          5.6         2.7          4.2         1.3 versicolor
#>  5   897          6.7         3.3          5.7         2.5  virginica
#>  6   743          4.9         3.1          1.5         0.2     setosa
#>  7   885          6.2         2.2          4.5         1.5 versicolor
#>  8    37          5.0         3.6          1.4         0.2     setosa
#>  9   276          5.2         4.1          1.5         0.1     setosa
#> 10   225          6.9         3.2          5.7         2.3  virginica
#> # ... with 100,141 more rows

# dplyr master ###########################
## sample without replacement no weights
system.time({
  source_df %>%
    group_by(group) %>%
    sample_n(n_sample)
})
#>    user  system elapsed
#>   1.677   0.006   1.687
## sample without replacement with weights
system.time({
  source_df %>%
    group_by(group) %>%
    sample_n(n_sample, weight = Petal.Length)
})
#>    user  system elapsed
#>   1.827   0.017   1.857
## sample with replacement
system.time({
  source_df %>%
    group_by(group) %>%
    sample_n(n_sample, replace = TRUE)
})
#>    user  system elapsed
#>   1.860   0.015   1.895

# dplyr dev ##################################
devtools::load_all("~/workspace/dplyr")
#> Loading dplyr
## sample without replacement no weights
system.time({
  source_df %>%
    group_by(group) %>%
    sample_n(n_sample)
})
#>    user  system elapsed
#>   0.019   0.000   0.023
## sample without replacement with weights
system.time({
  source_df %>%
    group_by(group) %>%
    sample_n(n_sample, weight = Petal.Length)
})
#>    user  system elapsed
#>   0.065   0.002   0.070
## sample with replacement
system.time({
  source_df %>%
    group_by(group) %>%
    sample_n(n_sample, replace = TRUE)
})
#>    user  system elapsed
#>   0.017   0.000   0.018

[Images: before (dplyr master) and after]

hadley · 2017-11-06T18:12:49Z

Great - much easier to start with this minimal change and then figure out the other stuff.

Can you please add a bullet to NEWS? It should briefly describe the change (starting with the name of the function), and credit yourself with (@yourname, #issuenumber).

saurfang · 2017-11-13T07:06:14Z

Done. Can you please take another look?

lionel- · 2017-11-29T15:53:44Z

R/grouped-df.r

@@ -265,11 +265,11 @@ sample_n.grouped_df <- function(tbl, size, replace = FALSE,
    inform("`.env` is deprecated and no longer has any effect")
  }
  weight <- enquo(weight)
+  weight <- mutate(tbl, w = !!weight)[["w"]]


Could you add a space after the `!!` please?

lionel- · 2017-12-04

Never mind, `!!` is going to have very high precedence, so now it makes sense to keep it close to its argument, just like unary `-`.

saurfang · 2017-12-06

Great. Let me know if there is anything else I can do to help get this PR merged.
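For readers following the precedence remark in this thread, here is a small sketch of the difference, assuming rlang 0.2.0 or later (where, as far as I know, `!!` is treated like a unary operator inside quasiquotation). The symbol `a` and the variable names are placeholders for this example.

```r
library(rlang)

x <- quote(a)

# Base R parser: `!` binds loosely, so both negations wrap the whole sum,
# i.e. this is parsed as !(!(x + 1)).
base_parse <- quote(!!x + 1)

# Inside rlang quasiquotation (assumed: rlang >= 0.2.0), `!!` applies to `x`
# only, so just `x` is unquoted and `+ 1` stays outside.
unquoted <- expr(!!x + 1)
```

With that behavior, writing `!!weight` without a space mirrors how one writes `-x`.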

krlmlr

Thanks! Looks good to me except for the small nit.

krlmlr · 2017-12-12T14:50:33Z

R/grouped-df.r

@@ -292,11 +292,11 @@ sample_frac.grouped_df <- function(tbl, size = 1, replace = FALSE,
    )
  }
  weight <- enquo(weight)
+  weight <- mutate(tbl, w = !!weight)[["w"]]

  index <- attr(tbl, "indices")


It looks a bit cleaner (and maybe faster) if we assign `index <- attr(...) + 1L` here and don't add 1 in `sample_group()`.
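A minimal sketch of that suggestion, assuming the `"indices"` attribute is a list holding one vector of zero-based row positions per group (an assumption about the layout, for illustration only). `index0`, `index1`, and `sample_rows` are hypothetical names, and `sample_rows` is a stand-in rather than dplyr's `sample_group()`.

```r
# Hypothetical stand-in for attr(tbl, "indices"): zero-based row positions,
# one integer vector per group.
index0 <- list(c(0L, 3L), c(1L, 2L, 4L))

# Shift to R's one-based positions once, up front ...
index1 <- lapply(index0, function(i) i + 1L)

# ... so the per-group sampler can use the positions directly instead of
# adding 1 on every call.
sample_rows <- function(rows, size) rows[sample.int(length(rows), size)]

set.seed(1)
picked <- unlist(lapply(index1, sample_rows, size = 2))
```

Because the stand-in is a list, the shift is applied element-wise with `lapply()`; the review comment's `attr(...) + 1L` is shorthand for the same idea.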

krlmlr · 2017-12-30T08:39:09Z

Thanks!

saurfang mentioned this pull request Nov 6, 2017

Efficient and Adaptive Stratified Sampling #3182

Closed

saurfang force-pushed the efficient_sample_group branch from f74e872 to 9fc9696 Compare November 13, 2017 07:05

hadley mentioned this pull request Nov 29, 2017

sample_frac() performance issue on grouped df #3229

Closed

lionel- reviewed Nov 29, 2017

saurfang force-pushed the efficient_sample_group branch from 9fc9696 to cd23d86 Compare December 6, 2017 05:06

saurfang force-pushed the efficient_sample_group branch from cd23d86 to ce4ba23 Compare December 6, 2017 05:08

krlmlr reviewed Dec 12, 2017

Merge branch 'master' into efficient_sample_group

7de3757

krlmlr merged commit a0d0e72 into tidyverse:master Dec 30, 2017

krlmlr added a commit that referenced this pull request Mar 14, 2018

use eval_tidy() instead of mutate() to support weights that evaluate to NULL for saeSim, CC @wasabi. Introduced in #3193.

45a13b8

lock bot locked as resolved and limited conversation to collaborators Jun 28, 2018
