## Bootstrapping
Bootstrapping is a frequently used technique to generate inferential quantities, confidence intervals and p-values.  The exciting thing about the bootstrap is that it is very flexible and can be used in many, many circumstances since it makes fewer assumptions than other types of tests.  Formally the bootstrap is about understanding the sampling distribution of a statistic (like the mean or the standard deviation) and, in particularly, how much variability there is in the statistic due to the sampling process.  To get estimates of this variability, we will approximate the sampling process.  

The basic steps of the bootstrap: 
1. Define and calculate the statistic of interest
2. Resample the data with replacement many times, calculate and store the statistic of interest each time
3. Use the percentiles of the stored values to make inference

In [9]:
# Load in packages we need for doing graphical representations.  
from pathlib import Path
import pandas as pd

import matplotlib.pyplot as plt
import numpy as np

from matplotlib import colors

import scipy.stats as stats

We will motivate this by using some data on Penguins.  Some details about the data can be found here: https://allisonhorst.github.io/palmerpenguins/.  

In [10]:
# read in the data to dataframe called ames
penguins = pd.read_csv("https://webpages.charlotte.edu/mschuck1/classes/DTSC2301/Data/penguins.csv", na_values=['NA'])
# remove rows with missing data
penguins.dropna(inplace=True)
penguins.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,male,2007


Body mass, *body_mass_g*, is a measure of overall species health that is sometimes used, so we'll start by focusing on that.  And we'll bootstrap and calculate the mean body mass for these penguins.   

In [12]:
# set the random seed, we'll come back to this
np.random.seed(123)
#make a object that is just the body mass of the penguins
body_mass=penguins['body_mass_g']
mean_body_mass = np.mean(body_mass)
print("Mean Body Mass in the original data is", np.mean(body_mass))
# get and store the total number of 
n_penguins=len(body_mass)
rows = list(range(1, n_penguins))
# n_reps is the number of bootstrap replicates 
n_reps = 5000

# create an empty array to store the calculated means
boot_means = []
# loop through taking a sample and calculating a mean from the sample n_reps times
for i in range(n_reps):
  # create a sample with replacement of size n_penguins from body_mass
  random_out = np.random.choice(rows, n_penguins)
  bms=body_mass.iloc[random_out]
  # calculate the mean of the values in bms and save in avg
  avg = np.mean(bms)
  # add avg to the list of other bootstrapped means
  boot_means.append(avg)

# print the average of the bootstrapped means, boot_means
print("Average of the bootstrapped means is", np.mean(boot_means))
# print the standard deviation of the bootstrapped means
print("Standard deviation of the bootstrapped means is", np.std(boot_means))

Mean Body Mass in the original data is 4207.057057057057
Average of the bootstrapped means is 4208.672882882883
Standard deviation of the bootstrapped means is 44.20500443759891


We can next make a confidence interval based upon the bootstrapped means, *boot_means*, that we got as part of this process.

In [4]:
# take the 2.5th percentile and the 97.5th percentile of the bootstrapped means to make a 95% confidence interval
np.percentile(boot_means,(2.5, 97.5))

array([4121.32132132, 4294.22484985])

We get a 95% confidence interval for the mean body mass of all penguins to be between 4121.3g and 4294.2g.

Now for statistics like the mean (and proportion) whose behavior we --- we being statisticians and now you--- understand well, we get very similar results from the bootstrap as compared to the other methods we have done.  

Here's the confidence interval for the mean that we did previously.  

In [5]:
lower, upper = stats.t.interval(confidence=0.95, 
              df=len(body_mass)-1, 
              loc=np.mean(body_mass),  
              scale=stats.sem(body_mass))
print(lower, upper)

4120.256132623127 4293.857981490987


Note how similar the bootstrapped confidence interval is to the traditional confidence interval.  Of course, you can't do bootstrapped confidence intervals without computing.  

## Bootstrapping Proportions

We are going to do the same process that we did for the mean for the proportion of penguins with a body mass over 3500g.  This time we will do a hypothesis test.  

In [None]:
# create a variable big which is a 1 if body_mass >3500, and 0 otherwise.
# (body_mass>3500) creates an array of True's and False's
# by multiplying by 1, the True's and False's get converted to 1's and 0's respectively
big = (body_mass>3500)*1
# print the number of penguins in the sample with body_mass >3500
print(sum(big))
# set the random seed, we'll come back to this
np.random.seed(123)

# get and store the total number of 
n_penguins=len(big)
#print the total number of penguins in the sample.  
print(n_penguins)
# n_reps is the number of bootstrap replicates 
n_reps = 5000

# create an empty array to store the calculated means
boot_prop = []
# loop through taking a sample and calculating a mean from the sample n_reps times
for i in range(n_reps):
  # create a sample with replacement of size n_penguins from body_mass
  bms = np.random.choice(big.tolist(), n_penguins)
  # calculate the mean of the values in bms and save in avg
  avg = np.mean(bms)
  # add avg to the list of other bootstrapped means
  boot_prop.append(avg)

# print the average of the bootstrapped means, boot_means
print(np.mean(boot_prop))
# print the standard deviation of the bootstrapped means
print(np.std(boot_prop))

258
333
0.7751573573573575
0.022830323020816425


Now the values above which is the mean of the bootstrapped proportions, $0.7752$, and the standard deviation of the bootstrapped proportions, $0.0228$ should be close to the proportion in the sample ($\hat{p}$ =258/333 = 0.7748) and the standard error of the proportion which is $\sqrt{\hat{p}(1-\hat{p})/n}$ = 0.0229.  You may recall that the definition of the standard errors is the standard deviation of the sampling distribution of the statistic.  
 

Now suppose we want to test whether more than 75% of all penguins have body mass more than 3500g.  Formally we are testing $H_0: p = 0.75$ vs $H_a: p>0.75$.  Hypothesis testing is slightly different than making a confidence interval because the p-value is based upon an assumption the the null hypothesis is true so there is an adjustment that is made to the distribution of bootstrapped proportions to make sure their mean is the value in the null hypothesis, here that is $0.75$.  

In [7]:
# here is the adjustment for boot_prop
boot_prop_adjusted = boot_prop - (np.mean(boot_prop)-0.75)
# then the p-value is the proportion of n_reps that is more than our sample proportion which was 258/333.
p_value = sum((boot_prop_adjusted>258/333)*1)/n_reps
print(p_value)

0.1332


Since our p-value is large (0.1332), we fail to reject the null hypothesis that the proportion of penguins with body mass more tham 3500g is 75%.  

## random.seed
Above we used the command _np.random.seed(123)_ this was a way of ensuring the our randomization was consistent --- identically consistent across all uses.  Randomization is tricky process and so we often want to ensure that we have a process that is repeatable and we can do that by setting the 'seed' of the randomization.  That is the reason that I got the same values from the bootstrap above that you got.  A seed can be any integar.  I like to use the date, e.g. 20250115, as my seed but that is personal preference.  

One of the biggest uses for random seed is to ensure that two analyses are done identically and that the results are reproducible.  Let's try re-running the code from above with a new seed in the second line.

*BEFORE RUNNING THE NEXT SET OF CODE, PUT AN INTEGAR INSIDE THE  ( )'S IN THE COMMAND _np.random.seed()*

In [None]:
# ENTER A NEW SEED HERE BEFORE RUNNING THIS CODE
np.random.seed(2025)

# get and store the total number of 
n_penguins=len(big)
#print the total number of penguins in the sample.  
#print(n_penguins)
# n_reps is the number of bootstrap replicates 
n_reps = 5000

# create an empty array to store the calculated means
boot_prop = []
# loop through taking a sample and calculating a mean from the sample n_reps times
for i in range(n_reps):
  # create a sample with replacement of size n_penguins from body_mass
  bms = np.random.choice(big.tolist(), n_penguins)
  # calculate the mean of the values in bms and save in avg
  avg = np.mean(bms)
  # add avg to the list of other bootstrapped means
  boot_prop.append(avg)

# print the average of the bootstrapped means, boot_means
print(np.mean(boot_prop))
# print the standard deviation of the bootstrapped means
print(np.std(boot_prop))

0.7748924924924925
0.022815168496751383


A couple of chunks back, our output when we used _random.seed(123)_ was the following:

0.7751573573573575

0.022830323020816425

Now with a different random seed we get slightly different values.  Keeping the seed the same and rerunning the code, you get the the same output. 

Trying changing the seed again and rerunning the code.  This time you should get a slightly different output.   

Tasks

1. Create a 96% confidence interval for mean flipper length, *flipper_length_mm*, from these penguins data using both the traditional confidence interval methodology and the bootstrap methodology.  Compare the resulting intervals.


2. Create a 95% confidence interval for median body mass from these penguins data using bootstrapping.

3. Create a 90% confidence interval for the IQR, 75th percentile minus the 25th percentile, of body mass of penguins using bootstrapping.