Feature request: Permutation resampling #198

@mattwarkentin

Hi,

I propose adding a function for row-wise permutation-based resampling, and I'm happy to discuss whether it fits the vision of rsample. Basically, I want to consider resurrecting #13. recipes::step_shuffle() is helpful but not quite sufficient for large-scale permutation resampling. infer::generate() is closer to what is desired, but it doesn't exactly follow the rsample paradigm. If there is interest, I would be happy to contribute this feature via a PR.

Motivation

  • I think this type of resampling function fits naturally into the rsample ecosystem
  • I am a proponent of using permutation-based resampling to generate null distributions for test statistics (e.g. to compute p-values), such as when the null distribution is unknown or when the distributional assumptions of parametric tests may be violated
  • Counterpoint: Unlike the other resampling methods in rsample, which generate samples that look like new draws from the same underlying data-generating mechanism (or the alternative distribution), permutation resampling generates samples under the null hypothesis. Despite this difference, I still think it could fit here.
  • Issue: Permutation sampling doesn't have defined splits, so what would training()/testing() or analysis()/assessment() return? This is a bit of a sticking point, perhaps.

Considerations

  • Function could be named permutations() to have the same feel as bootstraps(), but I'm open to suggestions
  • Could function in one of two ways:
    1. Return a fixed number of permutations, akin to bootstraps()
    2. Return all possible permutations, as is common for permutation tests (this could be an enormous number of permutations for large data or multiple columns). In most cases one would only need to permute the response variable, which simplifies things a great deal...
  • I think the function should also include a strata argument to perform stratified permutation; it should probably also have all the same arguments as bootstraps()
  • I think the function should probably take a cols argument specifying which columns to permute, while all other columns remain in their natural order. cols could accept tidyselect helpers to select the columns to permute.
  • Could add some helper functions to:
    • Summarize permutation-based statistics with mean, SE, and confidence intervals (similar to int_pctl())
    • Plot the permutation distribution of a chosen statistic and show where the apparent value falls in relation

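To make the cols and strata ideas above concrete, here is a minimal base R sketch. The function name permute_cols_sketch() and its arguments are hypothetical and exist only for illustration; this is not rsample code, and it assumes the selected columns should be shuffled jointly (same row order for all of them).

```r
# Hypothetical helper (not part of rsample): shuffle the rows of `cols`
# jointly while every other column keeps its original order.
permute_cols_sketch <- function(data, cols, strata = NULL) {
  n <- nrow(data)
  if (is.null(strata)) {
    # Unstratified: one random reordering of all row indices.
    idx <- sample.int(n)
  } else {
    # Stratified: shuffle row indices only within each stratum.
    idx <- seq_len(n)
    for (g in split(idx, data[[strata]])) {
      # g[sample.int(length(g))] is safe even when a stratum has one row.
      idx[g] <- g[sample.int(length(g))]
    }
  }
  # Apply the same reordering to every selected column.
  data[cols] <- lapply(data[cols], function(x) x[idx])
  data
}

set.seed(13)
# A fixed number of permuted data sets, akin to bootstraps(times = 5).
perms <- replicate(
  5,
  permute_cols_sketch(mtcars, cols = "mpg", strata = "cyl"),
  simplify = FALSE
)
```

Each element of perms is a copy of mtcars in which only mpg has been shuffled, and only among cars with the same cyl value. A real implementation would presumably store indices rather than full copies of the data, as the other rsample functions do.
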
Proposed API

For bootstrap-like functionality...

permutations(data, times = 25, strata = NULL, cols = everything(), breaks = 4, apparent = FALSE, ...)

For permutation-test-like functionality (no times argument, so all possible permutations are returned)...

permutations(data, strata = NULL, cols = everything(), breaks = 4, apparent = FALSE, ...)
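
As a rough illustration of how the bootstrap-like variant might be used, the sketch below builds a permutation null distribution in base R. The replicate() call is only a stand-in for something like the proposed permutations(mtcars, times = 2000, cols = mpg), which does not exist yet; the statistic, the data set, and the number of permutations are assumptions chosen for the example.

```r
set.seed(198)

# Statistic of interest: difference in mean mpg between am == 1 and am == 0.
mean_diff <- function(d) mean(d$mpg[d$am == 1]) - mean(d$mpg[d$am == 0])

# The "apparent" value, computed on the original data.
observed <- mean_diff(mtcars)

# Stand-in for the proposed function: permute only the response column
# and recompute the statistic on each permuted copy.
null_stats <- replicate(2000, {
  permuted <- mtcars
  permuted$mpg <- sample(permuted$mpg)
  mean_diff(permuted)
})

# Two-sided Monte Carlo p-value; the +1 counts the observed arrangement
# itself, so the estimate can never be exactly zero.
p_value <- (sum(abs(null_stats) >= abs(observed)) + 1) / (length(null_stats) + 1)

# Summary of the permutation distribution: mean, SE, and percentile limits.
null_summary <- c(
  mean = mean(null_stats),
  se = sd(null_stats),
  quantile(null_stats, c(0.025, 0.975))
)
```

The helpers mentioned under Considerations would wrap the last two steps, and a plotting helper could draw a histogram of null_stats with a vertical line at observed.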

Labels: feature (a feature request or enhancement)