In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import random

# Penguins Dataset

Part of our discussion has been sampling with or without replacement. We use random.sample(list, k=n) for sampling from a population without replacement, and random.choices(list, k=n) for sampling from a population with replacement. The general guidelines:

- We can use sampling with replacement if the sample size is small compared to the size of the population.
- Sampling without replacement is more appropriate if the sample size is large compared to the population (the question we should ask is whether the sample selection could change the distribution).
- Sampling with replacement is also called for if the category of interest in the population is small compared to the population size.
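
As a quick illustration of the difference between the two functions, here is a minimal sketch. The `population` list is made up for the example and is not one of our datasets.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

population = list(range(1, 11))  # hypothetical population of 10 labeled items

# Without replacement: each item can appear at most once
without = random.sample(population, k=5)

# With replacement: the same item may be drawn more than once
with_repl = random.choices(population, k=5)

print(without, len(set(without)))      # always 5 distinct values
print(with_repl, len(set(with_repl)))  # may have fewer than 5 distinct values

# Drawing more items than the population holds only works with replacement
oversized = random.choices(population, k=25)
try:
    random.sample(population, k=25)
    sample_failed = False
except ValueError:
    sample_failed = True  # random.sample refuses a sample larger than the population

print(len(oversized), sample_failed)
```

The last few lines show the practical limit: `random.sample` cannot draw more items than the population contains, while `random.choices` can.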

## Bootstrapping

Today's example is another case where sampling with replacement is called for. In many cases the data we have does not represent the whole population but is itself a sample from the population. Consider the examples we have seen so far for resampling:

- Baseball Players:  Our dataset was all of the players in one season.
- Presidential Pardons:  Our dataset was all of the recent pardon petitions.
- PPP Data: Our dataset was all of the large PPP loans (well except for the ones with no state listed or no jobs reported as saved)
- Airline Flights:  Our dataset was all of the flights in one year.
- Class Performance:  Our data (not really a dataset) was the grade distributions from multiple years' worth of students in MATH 120.

Compare these with the penguin data:  We have a dataset of measurements taken from selected penguins from three islands in Antarctica.  Clearly our dataset is not the population of all penguins but just a sample from it. That leaves several possible sources of variability in anything we compute from it:

- Problems with the original sampling methodology that built the dataset.
- The number of times we run the experiment.
- Irreducible errors in the data.

What we would like to understand is how much the mean could vary, given that the data we have is just a sample of the population.

In [None]:
# Let's continue exploring the penguins data set (our first Case Study that we will work on together)

penguins_url = 'https://drive.google.com/uc?export=download&id=1-SiGKvihMs9sP2I2FZd-sVRm-VnZFihi'
penguins_data = pd.read_csv(penguins_url)
penguins_data

Consider just the set of bill lengths of the Adelie penguins. Make a DataFrame, Series, or list of those numbers.

In [None]:
pop_sample = 

### Q1:  Make a plot of the distribution of the bill_lengths of the penguins.

### Q2: Find the mean bill length for the Adelie penguins.

### Q3:  How many Adelie penguins do we have in our population sample (with a bill length)? 

In [None]:
size = 

## Variability

### Standard Deviation

There are two types of measures of variability we use. One is the **Standard Deviation** and the related **Variance**.

The *Variance* of a list of numbers is the average of the squares of the values' differences from the mean:

$$ \mbox{Variance} = \frac{1}{N-1} \sum_{i=1}^N \left( x_i - \bar{x} \right)^2 $$

Where $\bar{x}$ is the mean of the $x_i$ and $N$ is the number of $x_i$. Note that it is not *exactly* the average: we divide by $N-1$ rather than $N$. This is because the mean $\bar{x}$ is itself computed from the data, which reduces the degrees of freedom by 1.

The problem with Variance is that the units it has are the square of the units of our variable. This is not great for actually comparing the spread of our data with the data itself, and so we take the square root of the variance to give the standard deviation:

$$ \mbox{Standard Deviation} = \sqrt{ \mbox{Variance} } $$

The larger the *Standard Deviation* the more our data is spread away from the mean. In the case of the Gaussian distribution *Standard Deviation* has a precise meaning and interpretation, but for our purposes it is more useful to just treat it as the appropriate measure of spread if we are working with the mean.
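
To make the two formulas concrete, here is a minimal sketch that computes them by hand and checks the result against Python's `statistics` module. The numbers in `values` are made up so the arithmetic comes out clean. They are not penguin measurements.

```python
import math
import statistics

values = [2.0, 4.0, 4.0, 5.0, 10.0]  # made-up numbers chosen for easy arithmetic

n = len(values)
mean = sum(values) / n  # 25 / 5

# Squared differences from the mean, summed and divided by N - 1
squared_diffs = [(x - mean) ** 2 for x in values]
variance = sum(squared_diffs) / (n - 1)
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)

# statistics.variance and statistics.stdev also divide by N - 1
print(statistics.variance(values), statistics.stdev(values))

# statistics.pvariance divides by N instead, so it comes out smaller
print(statistics.pvariance(values))
```

The `.std()` method on a pandas Series also divides by $N-1$ by default, while NumPy's `np.std` divides by $N$ unless `ddof=1` is passed, so the two can disagree slightly on the same data.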

Python can compute it.


In [None]:
pop_sample.std()

Larger standard deviations indicate more spread.

### Percentiles and Quartiles

The other notion of spread, related to the *median*, is the location of the quartiles and percentiles of the data. The advantage of this measurement over Standard Deviation is that it depends only on the ordering of the data points, not on how far apart they are, and so, like the median, it is less influenced by outliers and very large values.  The pth percentile is the value where p percent of the data is at that value or less.
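
To see the point about outliers, here is a small sketch using made-up numbers (not the penguin data). It replaces the largest value with a much larger one and compares what happens to the quartiles and to the standard deviation. It uses `statistics.quantiles` from the standard library, with `method='inclusive'` so the result interpolates linearly between the sorted data points.

```python
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up data
with_outlier = values[:-1] + [90]     # same data, largest value replaced by an outlier

def quartiles(data):
    """Return the 25th, 50th and 75th percentiles of data."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method='inclusive')
    return q1, q2, q3

print(quartiles(values), statistics.stdev(values))
print(quartiles(with_outlier), statistics.stdev(with_outlier))
```

The quartiles do not move at all, because only the ordering of the values matters, while the standard deviation grows by roughly a factor of ten.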

In [None]:
pop_sample.quantile(q=0.25), pop_sample.median(), pop_sample.quantile(q=0.75)

Note for this problem the quartiles and standard deviation are relatively close in their notion of spread. This indicates the data does not have values that are significantly different from the rest of the values.

### Minimums and Maximums

Finally we can also check the minimum and maximum values.

In [None]:
pop_sample.min(), pop_sample.max()

## How much does the mean vary?

Our question though is not really how much variability there is in individual Adelie penguins, but how much the mean of the population sample we have taken should be expected to vary.

What we have is a sample of the population and so the best we can do is use that. What we will do is choose sets of bill lengths from the Adelie lengths with replacement that are the same size as the original population sample.

Let me pause and point out what this means. Because each resample is as large as the original sample and is drawn with replacement, we are likely to repeat some penguins, and repeating some means we are also likely to skip others. This is actually a feature: we are going to learn how much small changes like that matter to our conclusions.
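
Here is a minimal sketch of how much repeating and skipping actually happens. The list `original` is a stand-in for individual penguins, and the size `n = 150` is an assumption for the example, not the real count from our data.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

n = 150                    # hypothetical sample size
original = list(range(n))  # stand-ins for individual penguins

# One bootstrap resample: same size as the original, drawn with replacement
resample = random.choices(original, k=n)

distinct = len(set(resample))  # how many different penguins were drawn
skipped = n - distinct         # how many were never drawn

print(distinct, skipped)
```

For large samples the expected share of distinct items in a resample like this is about $1 - 1/e \approx 63\%$, so roughly a third of the original penguins are left out of any one resample.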

The following function is going to run our **bootstrap** of the data. In this case the size of the sample we draw for the experiment will match the size of the population sample we have.

I'm writing the function so it returns the means.

In [None]:
def experiment(N, s=size, pop = list(pop_sample)):
    
    # The random functions are a little touchy. They work best when we pass them a list. random.choices looks items
    # up by integer position, and a filtered pandas Series keeps its original row labels, so those lookups can fail.
    # Note the default values are evaluated once, when the function is defined, so size and pop_sample must exist first.
    
    means = []
    
    for k in range(N):
        sample = random.choices(pop, k=s)  # We have to use replacement here because the set we are building
                                           # is the same size as the population sample.
        
        sample_mean = np.mean(sample)  # Note I have to use the Numpy Mean because we are acting on a list
        if pd.isna(sample_mean):
            print(sample)  # Show the offending sample if a missing value slipped into the data
        means.append(sample_mean)
        
    # Building the DataFrame once from the list gives a numeric (float) column
    return pd.DataFrame({'Sample Mean': means})

### Q4: Find the Standard Deviation, Quartiles, and minimum and maximums for the means of the bootstrapped samples. What do you observe?

Note that the function experiment returns a DataFrame. To compute the standard deviation and quartiles we need to convert it to a single column of the dataframe using a .loc or .iloc command. 

Increase the number of times you run the sample and see what happens to these measures of variation. Note that they do not get close to zero, though they are smaller than the individual variations. This is giving us a sense of *based on the information we have* how much the mean of this many Adelie penguins is expected to vary. 
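
The same behavior can be seen in a self-contained sketch that uses only the standard library. Everything here is an assumption for illustration: `fake_lengths` is simulated data standing in for `pop_sample`, and `bootstrap_means` is a hypothetical helper that returns a plain list rather than the DataFrame that `experiment` returns.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is repeatable

# Simulated stand-in for pop_sample: 150 made-up bill lengths
fake_lengths = [random.gauss(39.0, 2.5) for _ in range(150)]

def bootstrap_means(data, n_resamples):
    """Means of n_resamples resamples, each drawn with replacement at the size of data."""
    size = len(data)
    return [statistics.fmean(random.choices(data, k=size)) for _ in range(n_resamples)]

individual_sd = statistics.stdev(fake_lengths)

# Spread of the bootstrap means for an increasing number of resamples
sd_of_means = {}
for n_resamples in (100, 1000, 5000):
    means = bootstrap_means(fake_lengths, n_resamples)
    sd_of_means[n_resamples] = statistics.stdev(means)
    print(n_resamples, round(sd_of_means[n_resamples], 3))

# Rule of thumb for comparison: individual spread divided by sqrt(sample size)
predicted = individual_sd / len(fake_lengths) ** 0.5
print(round(individual_sd, 3), round(predicted, 3))
```

Running more resamples makes the estimate of the spread steadier, but it does not shrink it toward zero. It settles near the individual standard deviation divided by the square root of the sample size.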

### Q5: Do the same exercise with the other two species. How much do we expect their population sample means to vary?

Find some number of times to run the experiment so that the measures of spread are not changing too much and run the experiment for each population of Penguins keeping the results in separate variables.

### Q6: Show that there is a non-zero amount of overlap in the means of the bootstrap results for two of the species

This means that we are finding that it is possible (*based on the information we have*) that the mean bill lengths for these two species would be the same. Use the experiment results to give an estimate of how often that occurs.
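
One possible way to set up that estimate is sketched below. This is only one reading of "how often that occurs": it pairs up the runs from the two experiments and counts how often the species with the smaller sample mean comes out at or above the other. The data are simulated for two hypothetical species, and `bootstrap_means` is a hypothetical helper, not the `experiment` function above.

```python
import random
import statistics

random.seed(3)  # fixed seed so the sketch is repeatable

# Simulated bill lengths for two hypothetical species with nearby means
species_a = [random.gauss(48.0, 3.0) for _ in range(60)]
species_b = [random.gauss(48.8, 3.3) for _ in range(70)]

def bootstrap_means(data, n_resamples):
    """Means of n_resamples resamples, each drawn with replacement at the size of data."""
    size = len(data)
    return [statistics.fmean(random.choices(data, k=size)) for _ in range(n_resamples)]

n_resamples = 2000
means_a = bootstrap_means(species_a, n_resamples)
means_b = bootstrap_means(species_b, n_resamples)

# The two sets of bootstrap means overlap if their ranges intersect
overlap = max(means_a) >= min(means_b) and max(means_b) >= min(means_a)

# Pair the k-th run of each experiment and order each pair by which species
# has the smaller mean in the original samples
if statistics.fmean(species_a) <= statistics.fmean(species_b):
    low, high = means_a, means_b
else:
    low, high = means_b, means_a

# Fraction of paired runs where the usual ordering of the two means is reversed or tied
fraction = sum(lo >= hi for lo, hi in zip(low, high)) / n_resamples

print(overlap, fraction)
```

Other definitions are reasonable too, for example comparing every mean from one experiment against every mean from the other instead of pairing them run by run.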