# Bootstrap
## Turning 1 set of data into many by resampling with replacement (and then using these samples to evaluate the accuracy of your analysis)


### Prof. Robert Quimby
&copy; 2020 Robert Quimby

## In this tutorial you will...

- Be introduced to the bootstrap resampling technique
- Use bootstrap resampling to determine uncertainty in a result
- Compare the uncertainty from bootstrap and standard uncertainty propagation techniques
- Apply bootstrap techniques where standard uncertainty propagation cannot be done
- Hear about the related Jackknife resampling method

## Bootstrap method

1. Start with a data sample composed of $N$ values, $A=[x_1, x_2, x_3, \ldots, x_N]$
  - For accurate results, $N$ must be reasonably large (e.g., $N>30$ if the distribution is Gaussian).

2. Derive whatever statistic you want from the sample; let us call it $\theta$.
  - For example, $\theta(A)$ might be the mean, median, or standard deviation of sample $A$.

3. Create a new sample, $B$, by randomly drawing $N$ values from $A$ with *replacement*. 
  - Replacement just means that after you have drawn a value from $A$, you put that value back into $A$ so that the next random draw has a chance to select that same value again. 
  - This means that the new sample, $B$, may include zero, one or more copies of each $x_i$ value.

4. Apply the same analysis as before to sample $B$ to determine the statistic, $\theta(B)$.

5. Record your statistic for this sample and then repeat steps 3 and 4 above a total of $M$ times.
  - $M$ should be large enough that the distribution of $\theta(B)$ values is well sampled (the `bootstrap_dist` function defined later in this notebook defaults to 10000 resamples).

6. Use the *distribution* of $\theta(B)$ values to estimate the uncertainty of $\theta(A)$
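
The six steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the sample `A` is synthetic Gaussian data, the statistic is the median, and the names `rng`, `theta`, `theta_B`, and `sigma_theta` are made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded only so the sketch is repeatable

# step 1: a data sample A with N values (synthetic data, assumed for illustration)
N = 100
A = rng.normal(18.5, 0.2, size=N)

# step 2: the statistic of interest (the median, chosen here as an example)
theta = np.median
theta_A = theta(A)

# steps 3-5: draw N values from A with replacement, M times, recording theta(B)
M = 2000
theta_B = np.zeros(M)
for i in range(M):
    B = rng.choice(A, size=N, replace=True)
    theta_B[i] = theta(B)

# step 6: the spread of the theta(B) values estimates the uncertainty of theta(A)
sigma_theta = np.sqrt(np.mean((theta_B - theta_A)**2))
print(f'{theta_A:.3f} +/- {sigma_theta:.3f}')
```

Because each `B` is drawn with `replace=True`, some values of `A` appear more than once in a given resample and others not at all, which is what makes the $\theta(B)$ values scatter.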

## Recall: statistical uncertainty

When you report the uncertainty of a measurement, you are telling other astronomers the range of values they can expect to obtain if they repeat your experiment exactly (this is true at least for ordinary, "frequentist" statistics).

## Determining uncertainty through bootstrapping

Each sample drawn from its parent distribution provides information about the parent distribution.

So drawing from a sample is similar to drawing directly from the parent.

Repeated draws from a sample can be used to approximate repeated draws from the parent. 

- Bootstrapping is a process that allows you to estimate the range of possible outcomes by using the **sample** population of your data as a stand-in for the **parent** population provided by the Universe. 

- If your sample is large enough it will span the range of possibilities offered by the Universe, so you can randomly draw test data from your existing sample to determine the range of results that others may find through new observations of the Universe. 

- Thus repeating this random selection many times can predict the range of possible results of your experiment, and this can be used to derive the uncertainty in your measurement. 

### Example: average brightness of a star

In [None]:
import numpy as np
def get_star_observations(mag=18.5, emag=0.2, nobs=50, varmag=0):
    dtype = [('mjd', float), ('mag', float), ('emag', float)]
    sample = np.zeros(nobs, dtype=dtype)
    sample['mjd'] = 59000 + np.random.uniform(0, nobs, size=nobs)
    sigma = np.sqrt(emag**2 + varmag**2)
    sample['mag'] = np.random.normal(mag, sigma, size=nobs)
    sample['emag'] = emag
    return sample

In [None]:
# make a set of observations
my_sample = get_star_observations()

# perform some analysis on the sample to produce a result
def analysis(sample):
    return ????

my_result = analysis(my_sample)
print(f'{my_result:0.3f}')
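
One possible way to fill in the blank in `analysis`, assuming the intended statistic is the mean magnitude (this matches the "recall" comment in the uncertainty-propagation cell of this section). The small `demo` array and its values are hypothetical and exist only to exercise the function.

```python
import numpy as np

def analysis(sample):
    # assumed completion: the mean of the 'mag' column
    return np.sum(sample['mag']) / sample.size

# tiny hand-made structured array (hypothetical values) with the same fields
# as the arrays returned by get_star_observations
demo = np.zeros(4, dtype=[('mjd', float), ('mag', float), ('emag', float)])
demo['mag'] = [18.4, 18.5, 18.6, 18.5]
demo['emag'] = 0.2

print(f'{analysis(demo):0.3f}')
```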

In [None]:
# other astronomers conduct the same experiment
nastronomers = ???
their_results = [???? for i in range(nastronomers)]
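
A hedged sketch of how the blanks in this cell might be completed. It assumes that `analysis` returns the mean magnitude and that `nastronomers = 5000` is "large enough"; both are choices made for this illustration, not values given in the notebook.

```python
import numpy as np

def get_star_observations(mag=18.5, emag=0.2, nobs=50, varmag=0):
    # same simulator as defined earlier in this notebook, repeated so the sketch runs alone
    dtype = [('mjd', float), ('mag', float), ('emag', float)]
    sample = np.zeros(nobs, dtype=dtype)
    sample['mjd'] = 59000 + np.random.uniform(0, nobs, size=nobs)
    sigma = np.sqrt(emag**2 + varmag**2)
    sample['mag'] = np.random.normal(mag, sigma, size=nobs)
    sample['emag'] = emag
    return sample

def analysis(sample):
    # assumed completion: the mean magnitude
    return np.sum(sample['mag']) / sample.size

# each "astronomer" takes a fresh set of observations and runs the same analysis
nastronomers = 5000  # assumed value; any large number works
their_results = [analysis(get_star_observations()) for i in range(nastronomers)]

print(f'{np.mean(their_results):0.3f} +/- {np.std(their_results, ddof=1):0.3f}')
```

The key point is that every entry in `their_results` comes from a new draw from the parent distribution, not from a resample of an existing data set.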

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.hist(their_results, bins=100);

In [None]:
avg_of_their_results = np.mean(their_results)
std_of_their_results = np.std(their_results, ddof=1)
print(f'{avg_of_their_results:0.3f} +/- {std_of_their_results:0.3f}')

In [None]:
# uncertainty in my result (using uncertainty propagation formula)
# recall: my_result = np.sum(sample['mag']) / sample.size
my_uncertainty = ????
print(f'{my_result:0.3f} +/- {my_uncertainty:0.3f}')
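
For a mean of $N$ independent measurements, standard propagation gives $\sigma_{\rm mean} = \sqrt{\sum_i \sigma_i^2} / N$, which reduces to $\sigma / \sqrt{N}$ when every point has the same uncertainty. A minimal sketch of one way to fill in the blank, using a hypothetical stand-in array rather than the notebook's `my_sample`:

```python
import numpy as np

# hypothetical stand-in for the structured sample used in this notebook
sample = np.zeros(50, dtype=[('mag', float), ('emag', float)])
sample['emag'] = 0.2

# propagate the per-point uncertainties through mean = sum(mag) / N
my_uncertainty = np.sqrt(np.sum(sample['emag']**2)) / sample.size

print(f'+/- {my_uncertainty:0.3f}')
```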

### Estimating the uncertainty through bootstrapping 

In [None]:
def bootstrap_dist(sample, func, m=10000):
    dist = np.zeros(m)
    for i in range(m):
        resample = ????
        dist[i] = func(resample)
    return dist
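
One possible completion of the `resample` line, assuming that drawing random indices and then indexing the array is acceptable (this also works for structured arrays). The `demo` array is hypothetical test data.

```python
import numpy as np

def bootstrap_dist(sample, func, m=10000):
    dist = np.zeros(m)
    for i in range(m):
        # draw sample.size indices with replacement, then index the sample
        inds = np.random.randint(0, sample.size, size=sample.size)
        resample = sample[inds]
        dist[i] = func(resample)
    return dist

# quick check on a plain array of hypothetical magnitudes
demo = np.random.normal(18.5, 0.2, size=50)
demo_dist = bootstrap_dist(demo, np.mean, m=1000)
print(demo_dist.size, f'{demo_dist.std():.3f}')
```

Drawing indices rather than values keeps all fields of a structured array (`mjd`, `mag`, `emag`) together for each resampled observation.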

In [None]:
# bootstrap to find the uncertainty in the result
sample = get_star_observations()
func = ????
m = ????
dist = bootstrap_dist(sample, func, m=m)

# distribution of results from others
plt.hist(their_results, bins=100, label='others')

# estimated distribution of results from bootstrapping
plt.hist(dist, bins=100, alpha=0.5, label='bootstrap')
plt.axvline(func(sample), c='r', ls='--')
plt.legend();

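$$\sigma_{\theta(A)} \approx \sqrt{ \frac{1}{M} \sum_{i=1}^{M} \left(\theta(B_i) - \theta(A)\right)^2} $$

where $\theta(B_i)$ is the statistic measured from the $i$-th bootstrap sample and $\theta(A)$ is the statistic measured from the original sample.

Note this differs from the usual standard deviation formula because:
- deviations are measured from the statistic of the original sample, $\theta(A)$, rather than from the mean of the bootstrap distribution
- because of this difference, the sum is divided by $M$ rather than $M-1$ (i.e., `ddof=0`)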

In [None]:
# use the bootstrap distribution to find the uncertainty
edist = ????
print(f"bootstrap uncertainty is +/- {edist:.4f} mag")
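
A minimal sketch of one way to compute `edist`, using the same root-mean-square form that appears in the median example later in this section. The data here are hypothetical, and the bootstrap loop is written inline so the block runs by itself.

```python
import numpy as np

rng = np.random.default_rng(1)
mags = rng.normal(18.5, 0.2, size=50)  # hypothetical sample of magnitudes
theta_A = np.mean(mags)                # statistic from the original sample

# bootstrap distribution of the mean (m resamples drawn with replacement)
m = 5000
dist = np.array([np.mean(rng.choice(mags, size=mags.size)) for _ in range(m)])

# root-mean-square deviation about theta(A), dividing by M (not M - 1)
edist = np.sqrt(np.sum((dist - theta_A)**2) / dist.size)
print(f"bootstrap uncertainty is +/- {edist:.4f} mag")
```

Since the deviations are taken about `theta_A` rather than about `dist.mean()`, `edist` can never be smaller than `np.std(dist)`.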

### Example 2: median brightness of a star

In [None]:
def analysis2(sample):
    sorted_mags = np.sort(sample['mag'])
    ind1 = (sample.size - 1) // 2
    ind2 = sample.size // 2
    return np.sum(sorted_mags[[ind1, ind2]]) / 2

my_result = analysis2(my_sample)
print(f'{my_result:0.3f}')
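
As a sanity check, `analysis2` should agree with `np.median` for both odd and even sample sizes. The sketch below repeats the function so it runs alone; the `demo` arrays are hypothetical.

```python
import numpy as np

def analysis2(sample):
    # median of the 'mag' column: average of the two central sorted values
    # (for odd sizes both indices point at the same element)
    sorted_mags = np.sort(sample['mag'])
    ind1 = (sample.size - 1) // 2
    ind2 = sample.size // 2
    return np.sum(sorted_mags[[ind1, ind2]]) / 2

dtype = [('mjd', float), ('mag', float), ('emag', float)]
checks = []
for n in (5, 6):
    demo = np.zeros(n, dtype=dtype)
    demo['mag'] = np.random.normal(18.5, 0.2, size=n)
    checks.append(np.isclose(analysis2(demo), np.median(demo['mag'])))

print(checks)
```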

In [None]:
their_results = [analysis2(get_star_observations()) for i in range(nastronomers)]
avg_of_their_results = np.mean(their_results)
std_of_their_results = np.std(their_results, ddof=1)
print(f'{avg_of_their_results:0.3f} +/- {std_of_their_results:0.3f}')

In [None]:
# can't use uncertainty propagation formula!
# bootstrap to find the uncertainty in the result
func = analysis2
dist = bootstrap_dist(sample, func)
# divide by the actual number of resamples (bootstrap_dist used its default m here)
edist = np.sqrt(np.sum( (dist - func(sample))**2 ) / dist.size)
print("bootstrap uncertainty is +/- {:.4f} mag".format(edist))

## Jackknife method

- Given $N$ data points, calculate the statistic $N$ times using subsets of the data that exclude one point each time.

This can be useful for determining if a relation is driven by a single data point or reflective of the whole sample.

In [None]:
def jackknife_dist(sample, func):
    dist = np.zeros(sample.size)
    inds = np.arange(sample.size)
    for i in range(sample.size):
        dist[i] = func(sample[????])
    return dist
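
One possible way to fill in the index blank, assuming a boolean mask built from `inds` is what is intended (other forms, such as `np.delete`, would work equally well). The `demo` values are hypothetical.

```python
import numpy as np

def jackknife_dist(sample, func):
    dist = np.zeros(sample.size)
    inds = np.arange(sample.size)
    for i in range(sample.size):
        # boolean mask keeps every point except the i-th
        dist[i] = func(sample[inds != i])
    return dist

demo = np.array([1.0, 2.0, 3.0, 4.0])
print(jackknife_dist(demo, np.mean))
```

Each entry of the result is the statistic computed with one data point left out, so the output always has exactly $N$ values.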

In [None]:
func = analysis
dist = jackknife_dist(sample, func)
plt.hist(dist, bins=20);
plt.axvline(func(sample), ls='dashed', c='r')
plt.axvline(dist.mean(), ls='dotted', c='g');

In [None]:
stat_jk = np.mean(dist)
estat_jk = np.sqrt( (sample.size - 1) / sample.size * np.sum( (dist - stat_jk)**2 ) )
print("measured statistic is {:.4f}".format(func(sample)))
print("Jackknife gives {:.4f} +/- {:.4f}".format(stat_jk, estat_jk))
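
For the special case where the statistic is the mean, the jackknife uncertainty formula used above reduces algebraically to the familiar standard error, $s / \sqrt{N}$. The sketch below checks this numerically on hypothetical data; the names `mags` and `sem` are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
mags = rng.normal(18.5, 0.2, size=50)  # hypothetical magnitudes
n = mags.size

# leave-one-out means
dist = np.array([np.mean(np.delete(mags, i)) for i in range(n)])

# jackknife uncertainty, same formula as in the cell above
estat_jk = np.sqrt((n - 1) / n * np.sum((dist - np.mean(dist))**2))

# textbook standard error of the mean
sem = np.std(mags, ddof=1) / np.sqrt(n)

print(f'{estat_jk:.6f} {sem:.6f}')
```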

## `astropy` tools for bootstrapping and jackknifing

In [None]:
from astropy.stats import bootstrap
from astropy.stats import jackknife_stats
from astropy.stats import jackknife_resampling

- [`astropy.stats.bootstrap`](http://docs.astropy.org/en/stable/api/astropy.stats.bootstrap.html) works similarly to the `bootstrap_dist` function defined above, but note the arguments are slightly different

- [`astropy.stats.jackknife_resampling`](http://docs.astropy.org/en/stable/api/astropy.stats.jackknife_resampling.html) is similar to the `jackknife_dist` function above, but it does not work with structured arrays

- [`astropy.stats.jackknife_stats`](http://docs.astropy.org/en/stable/api/astropy.stats.jackknife_stats.html) performs jackknife resampling and then returns 4 values: the estimate, bias, uncertainty, and the confidence interval

In [None]:
# example using astropy's bootstrap function
dist = bootstrap(????)