Feature request: Permutation resampling #198

Closed
mattwarkentin opened this issue Oct 17, 2020 · 11 comments · Fixed by #200
Labels
feature a feature request or enhancement

@mattwarkentin
Contributor

mattwarkentin commented Oct 17, 2020

Hi,

I propose adding a function for row-wise permutation-based resampling. Happy to discuss whether it fits into the vision of rsample. Basically, I want to consider resurrecting #13. recipes::step_shuffle() is helpful but not quite sufficient for large-scale permutation resampling. infer::generate() is closer to what is desired, but it doesn't exactly follow the rsample paradigm. I should add that, if there is interest, I would be happy to contribute this feature via a PR.

Motivation

  • I think this type of resampling function fits into the natural ecosystem with rsample
  • I am a proponent of using permutation-based resampling for generating null distributions for statistics (e.g. p-values), such as when the null distribution is unknown or if distributional assumptions for parametric tests may be violated
  • Counterpoint: Unlike the other resampling methods in rsample which serve to generate samples that have the appearance of being new draws from the same underlying data-generating mechanism (or the alternative dist.), permutation resampling serves to generate samples under the null. Despite this difference, I still think it possibly fits in here.
  • Issue: Permutation sampling doesn't have defined splits, so what would training()/testing() or analysis()/assessment() return? This is a bit of a sticking point, perhaps.

Considerations

  • Function could be named permutations() to have the same feel as bootstraps(), but open for suggestions
  • Could function in one of two ways:
    1. Return a fixed number of permutations, akin to bootstraps()
    2. Return all possible permutations, as is common for permutation tests (this could be an enormous number of permutations for large data or multiple columns). In most cases one would only need to permute the response variable, which simplifies things a great deal...
  • I think the function should also include a strata argument to perform stratified permutation; probably should also have all the same arguments as bootstraps
  • I think the function should maybe contain a cols argument specifying which columns should be permuted, while all other columns remain in their natural order. cols could accept tidyselect functions to select columns to permute.
  • Could add some helper functions to:
    • Summarize permutation-based statistics with mean, SE, and confidence intervals (similar to int_pctl())
    • Plot the permutation distribution of a chosen statistic and show where the apparent value falls in relation
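
To make the motivation concrete, here is a minimal base-R sketch of a permutation null distribution. It does not use any proposed rsample API; the choice of statistic (absolute correlation between `mpg` and `wt` in `mtcars`) and the number of permutations are assumptions made only for illustration.

```r
# Base-R sketch of a permutation null distribution (illustrative only).
# Assumption: the statistic of interest is |cor(mpg, wt)| in mtcars.
set.seed(123)

stat <- function(d) abs(cor(d$mpg, d$wt))

# The "apparent" value computed on the unshuffled data
observed <- stat(mtcars)

times <- 1000
null_stats <- vapply(seq_len(times), function(i) {
  shuffled <- mtcars
  # Permute only the response; every other column keeps its original order
  shuffled$mpg <- mtcars$mpg[sample.int(nrow(mtcars))]
  stat(shuffled)
}, numeric(1))

# Monte Carlo p-value with the usual +1 correction
p_value <- (sum(null_stats >= observed) + 1) / (times + 1)

# Exhaustive enumeration is rarely feasible: 32 rows already give
# factorial(32) orderings, roughly 2.6e35
n_orderings <- factorial(nrow(mtcars))
```

A fixed number of random permutations (option 1 above) is therefore the practical default for anything but very small data sets.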

Proposed API

For bootstrap-like functionality...

permutations(data, times = 25, strata = NULL, cols = everything(), breaks = 4, apparent = FALSE, ...)

For permutation-test like functionality...

permutations(data, strata = NULL, cols = everything(), breaks = 4, apparent = FALSE, ...)

@juliasilge added the "feature" label (a feature request or enhancement) Oct 19, 2020

@topepo
Member

topepo commented Oct 19, 2020

Excellent proposal. I think that it would be a great addition. Some details:

  • I don't know that strata and breaks are needed. I can't think of a case where that might be part of an analysis. Do you know of any?

  • cols is good but let's think about other names. permute or shuffle? Also, I would not have a default and make people fill it in. everything() doesn't really fit since that would keep the rows intact and just reorder the data frame. I agree that the original column order should be maintained.

  • I think that analysis() is still needed to get the data out but assessment() should throw an error.

@mattwarkentin
Contributor Author

mattwarkentin commented Oct 20, 2020

Thanks for the reply, @topepo. Glad to hear there is interest in implementing this feature.

> I don't know that strata and breaks are needed. I can't think of a case where that might be part of an analysis. Do you know of any?

Hmm, I can't really think of a compelling use case. I would have to think about this some more, but I actually think stratified permutation sampling might be invalid. I think it's possible that permuting within strata might not have the desired property of producing apparent samples from the null (i.e. if an association exists, it might not break it).
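
One way to see the concern is a small base-R sketch of within-strata shuffling. The stratifying variable (`cyl`) and the response (`mpg`) are assumptions chosen for illustration. Because values only move between rows of the same stratum, any part of the association that is carried by the strata themselves survives the shuffle.

```r
# Sketch: permuting a response within strata (illustrative only)
set.seed(1)
n <- nrow(mtcars)
strata <- mtcars$cyl

# Shuffle row indices within each stratum; indices never cross strata
idx <- ave(seq_len(n), strata, FUN = function(i) i[sample.int(length(i))])

permuted_mpg <- mtcars$mpg[idx]

# Per-stratum means of the response before and after the shuffle
means_before <- tapply(mtcars$mpg, strata, mean)
means_after  <- tapply(permuted_mpg, strata, mean)

# The stratum-level relationship between cyl and mpg is left intact
same_means <- isTRUE(all.equal(means_before, means_after))
```

Whether this is a problem depends on the hypothesis being tested, so the sketch only shows what a stratified shuffle preserves, not that it is invalid in general.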

> cols is good but let's think about other names. permute or shuffle? Also, I would not have a default and make people fill it in. everything() doesn't really fit since that would keep the rows intact and just reorder the data frame. I agree that the original column order should be maintained.

This was actually something I wanted specific clarification on. Based on your comment it seems like all of the columns selected by shuffle/permute/whatever should be permuted to the same order (rather than permuted independently). In essence, the selected columns would be extracted from the data frame and shuffled together, then spliced back in. Alternatively, I could imagine a setup where each of the selected columns is extracted and independently shuffled before reassembly. Thoughts?
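
The two alternatives can be sketched in base R as follows. The toy data frame and the column names are made up for illustration and are not part of any proposed API.

```r
# Sketch: shuffling selected columns together vs. independently
set.seed(42)
df <- data.frame(id = 1:6, x = (1:6) * 10, y = (1:6) * 100)
cols <- c("x", "y")

# Option A: one shared permutation, so the selected columns move together
shared <- sample.int(nrow(df))
together <- df
together[cols] <- lapply(df[cols], function(col) col[shared])

# Option B: a separate permutation per selected column
independent <- df
independent[cols] <- lapply(df[cols], function(col) col[sample.int(length(col))])

# Under option A the relationship between x and y is preserved (y == 10 * x),
# while the link between (x, y) and the unshuffled id column is broken
rows_kept_together <- all(together$y == 10 * together$x)
```

Option A breaks the association between the permuted block and the remaining columns only; option B additionally breaks associations among the permuted columns themselves.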

> I think that analysis() is still needed to get the data out but assessment() should throw an error.

Makes a lot of sense. You obviously have a deep sense of the inner workings of parsnip and workflows. Is assessment() throwing an error enough to halt users from accidentally using a permutation resample object as resamples when model building (and provide an informative error message)? Is it problematic for other tidymodels packages if permutations() inherits the rset/rsplit classes?

@mattwarkentin
Contributor Author

mattwarkentin commented Oct 21, 2020

I think the lowest-impact way to do this would be to include a new list-item in the rsplit class object which indicates the column indices to resample. It would mean a slight recode of the last line in the function below, but otherwise it uses all the other rsample machinery so it would generally be non-invasive. I've coded up a prototype and it's working pretty well.

Right now the last line will shuffle rows for all columns together, but it would be pretty straightforward to add column indices to only shuffle certain columns and then splice them back into the data. Thoughts?

# as.data.frame.rsplit()
function (x, row.names = NULL, optional = FALSE, data = "analysis", 
          ...) 
{
  if (!is.null(row.names)) 
    warning("`row.names` is kept for consistency with the ", 
            "underlying class but non-NULL values will be ", 
            "ignored.", call. = FALSE)
  if (optional) 
    warning("`optional` is kept for consistency with the ", 
            "underlying class but TRUE values will be ", "ignored.", 
            call. = FALSE)
  x$data[as.integer(x, data = data, ...), , drop = FALSE]
}
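
As a rough sketch of the "shuffle only some columns, then splice them back" idea, the following uses a stand-in class rather than the real rsplit. The class name `perm_split_sketch`, the constructor `make_perm_split()`, and the `col_id` field are all hypothetical and only mirror the idea described above.

```r
# Hypothetical stand-in for a permutation split; this is NOT rsample's rsplit
make_perm_split <- function(data, col_id) {
  structure(
    list(
      data   = data,
      in_id  = sample.int(nrow(data)),  # shuffled row order, fixed at creation
      col_id = col_id                   # indices of the columns to permute
    ),
    class = "perm_split_sketch"
  )
}

# Analogue of the last line of as.data.frame.rsplit(): only the selected
# columns are reordered by in_id; all other columns stay in place
as.data.frame.perm_split_sketch <- function(x, row.names = NULL,
                                            optional = FALSE, ...) {
  out <- x$data
  cols <- unname(x$col_id)
  out[cols] <- lapply(x$data[cols], function(col) col[x$in_id])
  out
}

set.seed(1)
sp <- make_perm_split(mtcars, col_id = c(cyl = 2L, carb = 11L))
permuted <- as.data.frame(sp)
```

Here `mpg` and the other unselected columns keep their original row order, while `cyl` and `carb` are reordered by the same `in_id`.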

@mattwarkentin
Contributor Author

Getting the column indices is easy and we could support tidyselect functions for choosing columns to permute.

tidyselect::eval_select(tidyselect::starts_with("c"), mtcars)
#>  cyl carb 
#>    2   11

Just need to pack that into the rsplit object. Maybe as col_id??

@topepo
Member

topepo commented Oct 21, 2020

> Right now the last line will shuffle rows for all columns together, but it would be pretty straightforward to add column indices to only shuffle certain columns and then splice them back into the data. Thoughts?

Yes, I would do that.

Based on your other comments though... since the permuted columns are all permuted the same way, maybe the rsplit object should have the rearranged integer sequence. If not, the rset isn't really reproducible. To illustrate:

library(rsample)
set.seed(1)
tmp <- apparent(mtcars)
# permute these when you make the permutation rsplit objects:
tmp$splits[[1]]$in_id
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30 31 32

Created on 2020-10-21 by the reprex package (v0.3.0)

@mattwarkentin
Contributor Author

> Based on your other comments though... since the permuted columns are all permuted the same way, maybe the rsplit object should have the rearranged integer sequence. If not, the rset isn't really reproducible.

Hmm, maybe I'm not entirely clear on what you mean by this. Could you elaborate?

The rsplit object will contain in_id which is the rearranged integer sequence, no?

@mattwarkentin
Contributor Author

Also, how do you feel about throwing an error when the user selects ALL columns to be permuted? This effectively creates bootstrap resamples and I think it is a bad idea to accidentally or purposefully use permutations() to bootstrap.

We could produce an informative error, something like: You have selected all columns to permute. This effectively creates bootstrap resamples. If this is what you intended, please use rsample::bootstraps() instead, or select fewer columns to permute.

@topepo
Member

topepo commented Oct 21, 2020

Let's say that the in_id elements are all 1:n and the shuffling happens in analysis(). If we were doing something like

set.seed(1)
perms <- permutations(mtcars) 

perms <- 
  perms %>% 
  mutate(models = map(splits, ~ lm(mpg ~ ., data = analysis(.x)))) #<- randomness happens here

it would be really different than how the other rsample functions work; you set the seed when you make them and the randomness is embedded in the object. Certainly you can remember to do it before the mutate() above. However, if the model uses random numbers, then it gets dicey if you want to do it for a different model.

> Also, how do you feel about throwing an error when the user selects ALL columns to be permuted?

👍

@mattwarkentin
Contributor Author

My current implementation follows the rsample system of the randomness happening in the permutations() function call, and the shuffled IDs are embedded in the rsplit object.

I wanted permutations() to adhere to the same system as other resamplers as closely as possible. Like the other resampling functions, analysis() just uses the embedded ID and data objects to construct the resampled data, but nothing new is happening at that point.
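
The design point here (randomness at construction, deterministic retrieval) can be sketched in base R. The helpers `make_perm_ids()` and `get_permuted()` are hypothetical names used only for this illustration and are not the PR's implementation.

```r
# Sketch: draw the shuffled indices once, up front, and store them
make_perm_ids <- function(n, times) {
  lapply(seq_len(times), function(i) sample.int(n))
}

set.seed(1)
ids_a <- make_perm_ids(nrow(mtcars), times = 3)
set.seed(1)
ids_b <- make_perm_ids(nrow(mtcars), times = 3)

# Same seed at construction gives the same set of permutations
same_ids <- identical(ids_a, ids_b)

# Retrieval uses only the stored indices; no random numbers are drawn here
get_permuted <- function(data, id, col) {
  data[[col]] <- data[[col]][id]
  data
}

fit1 <- lm(mpg ~ wt, data = get_permuted(mtcars, ids_a[[1]], "mpg"))
noise <- runif(10)  # unrelated RNG use between the two fits
fit2 <- lm(mpg ~ wt, data = get_permuted(mtcars, ids_a[[1]], "mpg"))

# The fits agree regardless of what happened to the RNG state in between
same_fit <- identical(coef(fit1), coef(fit2))
```

Because the indices live in the object, fitting a second model that itself consumes random numbers does not change which permuted data sets it sees.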

I am aiming to get a PR draft spun up soon. Hopefully you'll like how it's structured and we can iterate from there.

@mattwarkentin
Contributor Author

Note to self:

> You have selected all columns to permute. This effectively creates bootstrap resamples. If this is what you intended, please use rsample::bootstraps() instead, or select fewer columns to permute.

This is not true. It effectively shuffles the original data but there is no replacement so it is not bootstrap-like. Throwing an error is still good - just reword.
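
A minimal sketch of such a check, together with the shuffle-versus-bootstrap distinction, might look like this. The function name `check_permute_cols()` and the error wording are assumptions, not what the eventual PR uses.

```r
# Hypothetical validation helper; the name and message text are illustrative
check_permute_cols <- function(data, col_id) {
  if (length(col_id) == 0) {
    stop("No columns were selected to permute.", call. = FALSE)
  }
  if (length(unique(col_id)) >= ncol(data)) {
    stop(
      "All columns were selected to permute. This only reorders the rows ",
      "of the original data. Please select fewer columns to permute.",
      call. = FALSE
    )
  }
  invisible(col_id)
}

set.seed(1)
n <- nrow(mtcars)

# Shuffling: sampling WITHOUT replacement, every row appears exactly once
shuffle_id <- sample.int(n)
is_reordering <- identical(sort(shuffle_id), seq_len(n))

# Bootstrapping: sampling WITH replacement, rows can repeat or be left out
boot_id <- sample.int(n, replace = TRUE)
n_distinct_boot <- length(unique(boot_id))
```

Permuting every column with a shared index therefore yields the original data in a different row order, which is why the message should not mention bootstrapping.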

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Feb 21, 2021