Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement balance_prop_strata() (Fixes #317) #365

Merged
merged 2 commits into from
Sep 26, 2022
Merged

Implement balance_prop_strata() (Fixes #317) #365

merged 2 commits into from
Sep 26, 2022

Conversation

mikemahoney218
Copy link
Member

This PR fixes #317 by implementing balance_prop_strata() and adding strata = NULL, prop = 0.1 to the relevant grouped resampling functions. As before, these arguments are after ... to allow for future expansion (specifically for if breaks ever gets implemented).

library(rsample)
set.seed(11)

group_table <- tibble::tibble(
  group = 1:100,
  outcome = sample(c(rep(0, 89), rep(1, 11)))
)
observation_table <- tibble::tibble(
  group = sample(1:100, 1e5, replace = TRUE),
  observation = 1:1e5
)
sample_data <- dplyr::full_join(group_table, observation_table, by = "group")
group_mc_cv(sample_data, group, strata = outcome, pool = 0.1)
#> # Group Monte Carlo cross-validation (0.75/0.25) with 25 resamples using stratification 
#> # A tibble: 25 × 2
#>    splits                id        
#>    <list>                <chr>     
#>  1 <split [74791/25209]> Resample01
#>  2 <split [74733/25267]> Resample02
#>  3 <split [75219/24781]> Resample03
#>  4 <split [75040/24960]> Resample04
#>  5 <split [75013/24987]> Resample05
#>  6 <split [75142/24858]> Resample06
#>  7 <split [74945/25055]> Resample07
#>  8 <split [75121/24879]> Resample08
#>  9 <split [74891/25109]> Resample09
#> 10 <split [74934/25066]> Resample10
#> # … with 15 more rows
group_validation_split(sample_data, group, strata = outcome, pool = 0.1)
#> # Group Validation Set Split (0.75/0.25) using stratification 
#> # A tibble: 1 × 2
#>   splits                id        
#>   <list>                <chr>     
#> 1 <split [74886/25114]> validation
group_initial_split(sample_data, group, strata = outcome, pool = 0.1)
#> <Training/Testing/Total>
#> <74869/25131/100000>
group_bootstraps(sample_data, group, strata = outcome, pool = 0.1)
#> # Group bootstrap sampling using stratification 
#> # A tibble: 25 × 2
#>    splits                 id         
#>    <list>                 <chr>      
#>  1 <split [100554/34134]> Bootstrap01
#>  2 <split [100019/37058]> Bootstrap02
#>  3 <split [99577/41119]>  Bootstrap03
#>  4 <split [99862/39010]>  Bootstrap04
#>  5 <split [100037/34010]> Bootstrap05
#>  6 <split [100000/37967]> Bootstrap06
#>  7 <split [100273/36879]> Bootstrap07
#>  8 <split [99922/31872]>  Bootstrap08
#>  9 <split [99863/36441]>  Bootstrap09
#> 10 <split [100209/38823]> Bootstrap10
#> # … with 15 more rows

# mc_cv

strata_rate <- purrr::map(
  1:100,
  \(x) {
    rs4 <- group_mc_cv(sample_data, group, strata = outcome, pool = 0.1)
    purrr::map_dbl(
      rs4$splits,
      function(x) {
        dat <- as.data.frame(x)$outcome
        mean(dat == "1")
      }
    )
  }
) |> unlist()

base_rate <- purrr::map(
  1:100,
  \(x) {
    rs4 <- group_mc_cv(sample_data, group, pool = 0.1)
    purrr::map_dbl(
      rs4$splits,
      function(x) {
        dat <- as.data.frame(x)$outcome
        mean(dat == "1")
      }
    )
  }
) |> unlist()

# Mean absolute error of strata proportions, versus expected
mean(abs(0.10 - strata_rate))
#> [1] 0.006504722
mean(abs(0.10 - base_rate))
#> [1] 0.01725373

# validation_split

strata_rate <- purrr::map(
  1:100,
  \(x) {
    rs4 <- group_validation_split(sample_data, group, strata = outcome, pool = 0.1)
    purrr::map_dbl(
      rs4$splits,
      function(x) {
        dat <- as.data.frame(x)$outcome
        mean(dat == "1")
      }
    )
  }
) |> unlist()

base_rate <- purrr::map(
  1:100,
  \(x) {
    rs4 <- group_validation_split(sample_data, group, pool = 0.1)
    purrr::map_dbl(
      rs4$splits,
      function(x) {
        dat <- as.data.frame(x)$outcome
        mean(dat == "1")
      }
    )
  }
) |> unlist()

# Mean absolute error of strata proportions, versus expected
mean(abs(0.10 - strata_rate))
#> [1] 0.006540856
mean(abs(0.10 - base_rate))
#> [1] 0.01639703


# initial_split

strata_rate <- purrr::map(
  1:100,
  \(x) {
    rs4 <- group_initial_split(sample_data, group, strata = outcome, pool = 0.1)
    mean(as.data.frame(rs4)$outcome == "1")
  }
) |> unlist()

base_rate <- purrr::map(
  1:100,
  \(x) {
    rs4 <- group_initial_split(sample_data, group, pool = 0.1)
    mean(as.data.frame(rs4)$outcome == "1")
  }
) |> unlist()

# Mean absolute error of strata proportions, versus expected
mean(abs(0.10 - strata_rate))
#> [1] 0.006545343
mean(abs(0.10 - base_rate))
#> [1] 0.01699527

# bootstraps

strata_rate <- purrr::map(
  1:100,
  \(x) {
    rs4 <- group_bootstraps(sample_data, group, strata = outcome, pool = 0.1)
    purrr::map_dbl(
      rs4$splits,
      function(x) {
        dat <- as.data.frame(x)$outcome
        mean(dat == "1")
      }
    )
  }
) |> unlist()

base_rate <- purrr::map(
  1:100,
  \(x) {
    rs4 <- group_bootstraps(sample_data, group, pool = 0.1)
    purrr::map_dbl(
      rs4$splits,
      function(x) {
        dat <- as.data.frame(x)$outcome
        mean(dat == "1")
      }
    )
  }
) |> unlist()

# Mean absolute error of strata proportions, versus expected
mean(abs(0.10 - strata_rate))
#> [1] 0.009791294
mean(abs(0.10 - base_rate))
#> [1] 0.02536413

Created on 2022-09-16 with reprex v2.0.2

@mikemahoney218
Copy link
Member Author

How about you don't send a PR until you're past your exam and fully recovered? 😄 🤝

Originally posted by @hfrick in #364 (comment)

whooooooops

(in all seriousness, this looked pretty easy to tackle and I was stuck inside on a Friday night anyway...)

@mikemahoney218 mikemahoney218 marked this pull request as ready for review September 17, 2022 02:18
@mikemahoney218
Copy link
Member Author

The pkgdown error happens because downlit::href_package("mlbench") fails with a mysterious error:

downlit::href_package("mlbench")
#> Error in readRDS(con): error reading from connection

Created on 2022-09-16 with reprex v2.0.2

This is called by pkgdown::build_home_index(), with a wild trace:

> pkgdown::build_home_index()
Error in readRDS(con) : error reading from connection
> rlang::last_trace()
<callr_error/rlib_error_3_0/rlib_error/error>
Error: 
! error in callr subprocess
Caused by error in `readRDS(con)`:
! error reading from connection
---
Backtrace:
1. pkgdown::build_site()
2. pkgdown:::build_site_external(pkg = pkg, examples = examples, run_dont_run = run_dont_run, …
3. callr::r(function(..., cli_colors, pkgdown_internet) { …
4. callr:::get_result(output = out, options)
5. callr:::throw(callr_remote_error(remerr))
6. callr:::callr_remote_error(remerr)
7. callr:::throw(err, parent = remerr[[3]])
---
Subprocess backtrace:
 1. pkgdown::build_site(...)
 2. pkgdown:::build_site_local(pkg = pkg, examples = examples, run_dont_run = run_dont_run, …
 3. pkgdown::build_home(pkg, override = override, preview = FALSE)
 4. pkgdown::build_home_index(pkg, quiet = quiet)
 5. pkgdown::render_page(pkg, "home", data, "index.html", quiet = quiet)
 6. pkgdown:::tweak_page(html, name, pkg = pkg)
 7. downlit::downlit_html_node(html)
 8. downlit:::tweak_children(x, xpath_block, highlight, pre_class = "downlit sourceCode r", …
 9. downlit:::map_chr(text, fun, ...)
10. base::vapply(.x, .f, ..., FUN.VALUE = character(1), USE.NAMES = FALSE)
11. local FUN(X[[i]], ...)
12. downlit:::token_href(out$token, out$text)
13. downlit:::map_chr(gsub("['\"]", "", text[pkg]), href_package)
14. base::vapply(.x, .f, ..., FUN.VALUE = character(1), USE.NAMES = FALSE)
15. local FUN(X[[i]], ...)
16. downlit:::package_urls(package)
17. downlit:::CRAN_urls()
18. base::withVisible(eval(mc, parent.frame()))
19. base::eval(mc, parent.frame())
20. base::eval(mc, parent.frame())
21. (function () …
22. tools::CRAN_package_db()
23. base::as.data.frame(read_CRAN_object(CRAN_baseurl_for_web_area(), "web/packages/packages.rds"), …
24. tools:::read_CRAN_object(CRAN_baseurl_for_web_area(), "web/packages/packages.rds")
25. base::readRDS(con)
26. base::.handleSimpleError(function (e) …
27. global h(simpleError(msg, call))

We intentionally removed mlbench from website dependencies (#314, #340). Putting it back doesn't fix this error locally.

At any rate, I don't think that has anything to do with this PR, so I've marked it as good to review.

@mikemahoney218
Copy link
Member Author

Found it:

try(remove.packages("mlbench"), silent = TRUE)
#> Removing package from '/home/mikemahoney218/R/x86_64-pc-linux-gnu-library/4.2'
#> (as 'lib' is unspecified)
downlit::href_package("mlbench")
#> Error in readRDS(con): error reading from connection

install.packages("mlbench")
#> Installing package into '/home/mikemahoney218/R/x86_64-pc-linux-gnu-library/4.2'
#> (as 'lib' is unspecified)
downlit::href_package("mlbench")
#> [1] NA

Created on 2022-09-16 with reprex v2.0.2

So it might be the case that having mlbench in website dependencies, or installing it as an extra package in the pkgdown job, would be a good workaround here.

Copy link
Member

@hfrick hfrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! 🎉 Could you add one code comment, please? (I tagged the exact place.) I found myself going back to a comment you left on the previous PR so I assume it would be good to add that in here directly.

R/make_groups.R Show resolved Hide resolved
@mikemahoney218
Copy link
Member Author

Added comments to the relevant section 😄

@hfrick hfrick merged commit adb24c5 into main Sep 26, 2022
@hfrick hfrick deleted the mike/prop_strata branch September 26, 2022 08:52
@github-actions
Copy link

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Oct 11, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Stratification in grouped resampling
2 participants