Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rep_sample_n() error when prob vector specified #279

Closed
rudeboybert opened this issue Jan 22, 2020 · 10 comments
Closed

rep_sample_n() error when prob vector specified #279

rudeboybert opened this issue Jan 22, 2020 · 10 comments
Assignees

Comments

@rudeboybert
Copy link
Contributor

An issue was brought to my attention by @unoe w.r.t. to specifying non-uniform sampling probabilities. Here is a reprex:

library(tidyverse)
library(infer)

tbl <- tibble(out = c('S', 'F'))
size <- 12
replace <- TRUE
prob <- c(1/5, 4/5)

# No problem:
rep_sample_n(tbl = tbl, size = size, replace = replace)
#> # A tibble: 12 x 2
#> # Groups:   replicate [1]
#>    replicate out  
#>        <int> <chr>
#>  1         1 F    
#>  2         1 S    
#>  3         1 S    
#>  4         1 S    
#>  5         1 F    
#>  6         1 S    
#>  7         1 S    
#>  8         1 S    
#>  9         1 F    
#> 10         1 S    
#> 11         1 S    
#> 12         1 F

# Problem when including prob argument:
rep_sample_n(tbl = tbl, size = size, replace = replace, prob = prob)
#> Error: All columns in a tibble must be 1d or 2d objects:
#> * Column `vals` is NULL

I've traced the issue to https://github.com/tidymodels/infer/blob/master/R/rep_sample_n.R#L59. In fact, there is a residual there should be a better way!! comment! In short, rep_sample_n() currently assumes that the variable of interest to sample is

  • in the first column of the input data frame
  • more importantly, is of type factor

I'm happy to file a PR to fix this, however I'm not quite sure I understand why the following code on lines 69-73 are necessary, instead of just using the prob argument as is

prob <- tibble::tibble(vals = levels(dplyr::pull(tbl, 1))) %>%
  dplyr::mutate(probs = prob) %>%
  dplyr::inner_join(tbl) %>%
  dplyr::select(probs) %>%
  dplyr::pull()

Judging by #82, it seems @mine-cetinkaya-rundel wrote this function back in the day for oilabs. Mine, could you shed some light as to necessity of the above wrangling?

@rudeboybert
Copy link
Contributor Author

Also, I'm wondering if per tidy principles, we should force/require that the sampling probabilities be specified as a variable prob of the input data frame. This does break with how various base::sample() functions work (like the base::sample.int() engine for rep_sample_n()), but I think this is ultimately safer.

@mine-cetinkaya-rundel
Copy link
Collaborator

Indeed, this was based on oilabs::rep_sample_n() (which, by the way, is destined to live in the openintro package going forward) but I didn't implement the probabilities so I'm not the best person to answer:

Also, since we're talking about rep_sample_n(), we should introduce a new function called rep_slice_sample() or something like that as sample_n() will be superseeded by slice_sample() in the next release of dplyr. See https://github.com/tidyverse/dplyr/blob/master/R/slice.R and https://github.com/tidyverse/dplyr/blob/master/R/sample.R.

@rudeboybert
Copy link
Contributor Author

Sounds good. @andrewpbray, could you please let me know your reasoning? I'll then PR rep_sample_n() in infer, which then later on can be ported to oilabs along with the new slice_sample() functions.

@unoe
Copy link

unoe commented Jan 22, 2020

Hi all, thanks for your help in understanding how to do this.
I basically wanted to show a randomization distribution for the null that the probability of success (coded as S) is p = 1/5, using rep_sample_n and then summarising with prop = mean(output == 'S').

Currently the code:

  • uses the first column of the tibble: dplyr::pull(tbl, 1)

  • assumes that it is a factor: levels(dplyr::pull(tbl, 1))

  • assumes that the column with the outcomes has name vals.

A temporary workaround, which returns a warning, is to make sure exactly of those points: that the tibble contains a column called vals which is a factor.

tbl <- tibble(vals = factor(c("S","F")))
samples <- rep_sample_n(tbl, size = 12, replace = TRUE, prob = c(1/5, 4/5), reps = 1)
# Warning message:
# Column `vals` joining character vector and factor, coercing into character vector 

@echasnovski

This comment has been minimized.

@ismayc
Copy link
Collaborator

ismayc commented Jun 17, 2020

Removing @andrewpbray as the assignee here. @echasnovski or @simonpcouch, can you look into this whenever you get a chance?

@unoe
Copy link

unoe commented Jul 23, 2020

Hi everyone,

I hope you are well. I just thought this might help in getting started.
I made this little draft - it might not be the most efficient implementation or perhaps you might not want to load the libraries in there, but it does the job:

rep_sample_n <- function(tbl, size, replace = FALSE, reps = 1, prob = NULL) {
    
    require(magrittr)
    require(dplyr)
    require(purrr)
    
    samples <- 1:reps %>%
        map_dfr(~ tbl %>%
                    slice_sample(n = size, weight_by = prob, replace = replace)) %>%
        mutate(replicate = rep(1:reps, each = size)) %>%
        group_by(replicate)
    
    return(samples)
}

Hope this helps,
Umberto

simonpcouch added a commit that referenced this issue Jul 24, 2020
also adds a new rep_slice_sample wrapper that has a more similar interface to dplyr::slice_sample().

still need to extend unit testing and rewrite examples.
@simonpcouch
Copy link
Collaborator

simonpcouch commented Jul 24, 2020

Starting to give a go at this on the branch linked above—will drop a link to here in the PR once it's ready!

@unoe, your draft was really helpful. :-)

@simonpcouch
Copy link
Collaborator

🦋

@github-actions
Copy link

github-actions bot commented Mar 8, 2021

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 8, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants