Skip to content

tsee/Statistics-CaseResampling

Repository files navigation

NAME
    Statistics::CaseResampling - Efficient resampling and calculation of
    medians with confidence intervals

SYNOPSIS
      use Statistics::CaseResampling ':all';
  
      my $sample = [1,3,5,7,1,2,9]; # ... usually MUCH more data ...
      my $confidence = 0.95; # ~2*sigma or "90% within confidence limits"
      #my $confidence = 0.37; # ~1*sigma or "~66% within confidence limits"
  
      # calculate the median of the sample with lower and upper confidence
      # limits using resampling/bootstrapping:
      my ($lower_cl, $median, $upper_cl)
        = median_simple_confidence_limits($sample, $confidence);
  
      # There are many auxiliary functions:
  
      my $resampled = resample($sample);
      # $resampled is now a random set of measurements from $sample,
      # including potential duplicates
  
      my $medians = resample_medians($sample, $n_resamples);
      # $medians is now an array reference containing the medians
      # of $n_resamples resample runs
      # This is vastly more efficient that doing the same thing with
      # repeated resample() calls!
      # Analogously:
      my $means = resample_means($sample, $n_resamples);
  
      # you can get the cl's from a set of separately resampled medians, too:
      my ($lower_cl, $median, $upper_cl)
        = simple_confidence_limits_from_samples($median, $medians, $confidence);
  
      # utility functions:
      print median([1..5]), "\n"; # prints 3
      print mean([1..5]), "\n"; # prints 3, too, surprise!
      print select_kth([1..5], 1), "\n"; # inefficient way to calculate the minimum

DESCRIPTION
    The purpose of this (XS) module is to calculate the median (or in
    principle also other statistics) with confidence intervals on a sample.
    To do that, it uses a technique called bootstrapping. In a nutshell, it
    resamples the sample a lot of times and for each resample, it calculates
    the median. From the distribution of medians, it then calculates the
    confidence limits.

    In order to implement the confidence limit calculation, various other
    functions had to be implemented efficiently (both algorithmically
    efficient and done in C). These functions may be useful in their own
    right and are thus exposed to Perl. Most notably, this exposes a median
    (and general selection) algorithm that works in linear time as opposed
    to the trivial implementation that requires "O(n*log(n))".

  Random numbers
    The resampling involves drawing many random numbers. Therefore, the
    module comes with an embedded Mersenne twister (taken from
    "Math::Random::MT").

    If you want to change the seed of the RNG, do this:

      $Statistics::CaseResampling::Rnd
        = Statistics::CaseResampling::RdGen::setup($seed);

    or

      $Statistics::CaseResampling::Rnd
        = Statistics::CaseResampling::RdGen::setup(@seed);

    Do not use the embedded random number generator for other purposes. Use
    "Math::Random::MT" instead! At this point, you cannot change the type of
    RNG.

  EXPORT
    None by default.

    Can export any of the functions that are documented below using standard
    "Exporter" semantics, including the customary ":all" group.

FUNCTIONS
    This list of functions is loosely sorted from *basic* to *comprehensive*
    because the more complicated functions are usually (under the hood, in
    C) implemented using the basic building blocks. Unfortunately, that
    means you may want to read the documentation backwards :)

    All of these functions are written in C for speed.

  approx_erf($x)
    Calculates an approximatation of the error function of *x*. Implemented
    after

      Winitzki, Sergei (6 February 2008).
      "A handy approximation for the error function and its inverse" (PDF). 
      http://homepages.physik.uni-muenchen.de/~Winitzki/erf-approx.pdf

    Quoting: Relative precision better than 1.3e-4.

  approx_erf_inv($x)
    Calculates an approximation of the inverse of the error function of *x*.

    Algorithm from the same source as "approx_erf".

    Quoting: Relative precision better than 2e-3.

  nsigma_to_alpha($nsigma)
    Calculates the probability that a measurement from a normal distribution
    is further away from the mean than $nsigma standard deviations.

    The confidence level (what you pass as the "CONFIDENCE" parameter to
    some functions in this module) is "1 - nsigma_to_alpha($nsigma)".

  alpha_to_nsigma($alpha)
    Inverse of "nsigma_to_alpha()".

  mean(ARRAYREF)
    Calculates the mean of a sample.

  median(ARRAYREF)
    Calculates the median (second quartile) of a sample. Works in linear
    time thanks to using a selection instead of a sort.

    Unfortunately, the way this is implemented, the median of an even number
    of parameters is, here, defined as the "n/2-1"th largest number and not
    the average of the "n/2-1"th and the "n/2"th number. This shouldn't
    matter for nontrivial sample sizes!

  median_absolute_deviation(ARRAYREF)
    Calculates the median absolute deviation (MAD) in what I believe is
    O(n). Take care to rescale the MAD before using it in place of a
    standard deviation.

  first_quartile(ARRAYREF)
    Calculates the first quartile of the sample.

  third_quartile(ARRAYREF)
    Calculates the third quartile of the sample.

  select_kth(ARRAYREF, K)
    Selects the kth smallest element from the sample.

    This is the general function that implements the median/quartile
    calculation. You get the median with this definition of K:

      my $k = int(@$sample/2) + (@$sample & 1);
      my $median = select_kth($sample, $k);

  resample(ARRAYREF)
    Returns a reference to an array containing N random elements from the
    input array, where N is the length of the original array.

    This is different from shuffling in that it's random drawing without
    removing the drawn elements from the sample.

  resample_medians(ARRAYREF, NMEDIANS)
    Returns a reference to an array containing the medians of "NMEDIANS"
    resamples of the original input sample.

  resample_means(ARRAYREF, NMEANS)
    Returns a reference to an array containing the means of "NMEANS"
    resamples of the original input sample.

  simple_confidence_limits_from_median_samples(STATISTIC, STATISTIC_SAMPLES, CONFIDENCE)
    Calculates the confidence limits for *STATISTIC*. Returns the lower
    confidence limit, the statistic, and the upper confidence limit.
    *STATISTIC_SAMPLES* is the output of, for example,
    "resample_means(\@sample)". *CONFIDENCE* indicates the fraction of data
    within the confidence limits (assuming a normal, symmetric distribution
    of the statistic => *simple confidence limits*).

    For example, to get the 90% confidence (~2 sigma) interval for the mean
    of your sample, you can do the following:

      my $sample = [<numbers>];
      my $means = resample_means($sample, $n_resamples);
      my ($lower_cl, $mean, $upper_cl)
        = simple_confidence_limits_from_samples(mean($sample), $means, 0.90);

    If you want to apply this logic to other statistics such as the first or
    third quartile, you have to manually resample and lose the advantage of
    doing it in C:

      my $sample = [<numbers>];
      my $quartiles = [];
      foreach (1..1000) {
        push @$quartiles, first_quartile(resample($sample));
      }
      my ($lower_cl, $firstq, $upper_cl)
        = simple_confidence_limits_from_samples(
            first_quartile($sample), $quartiles, 0.90
          );

    For a reliable calculation of the confidence limits, you should use at
    least 1000 samples if not more. Yes. This whole procedure is expensive.

    For medians, however, use the following function:

  median_simple_confidence_limits(SAMPLE, CONFIDENCE, [NSAMPLES])
    Calculates the confidence limits for the "CONFIDENCE" level for the
    median of *SAMPLE*. Returns the lower confidence limit, the median, and
    the upper confidence limit.

    In order to calculate the limits, a lot of resampling has to be done.
    *NSAMPLES* defaults to 1000. Try running this a couple of times on your
    data interactively to see how the limits still vary a little bit at this
    setting.

TODO
    One could calculate more statistics in C for performance.

SEE ALSO
    Math::Random::MT

    On the approximation of the error function:

      Winitzki, Sergei (6 February 2008).
      "A handy approximation for the error function and its inverse" (PDF). 
      http://homepages.physik.uni-muenchen.de/~Winitzki/erf-approx.pdf

AUTHOR
    Steffen Mueller, <smueller@cpan.org>

COPYRIGHT AND LICENSE
    Copyright (C) 2010, 2011, 2012 by Steffen Mueller

    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself, either Perl version 5.8.0 or, at
    your option, any later version of Perl 5 you may have available.