# DSC 10 Discussion Week 9

<img src="data/panda_baby.jpg" width="500">

# Bootstrapping - sampling within a sample
- Problem : statistics about the data population are often unavailable, costly to acquire, unknown, etc.
- Solution : utilize random sampling (and re-sampling) of available data to estimate population statistics
    - The result of bootstrapping will be a distribution over sample statistics!
    - Hopefully we'll see that these *sample statistics* $\approx$ *population statistics*
    
### Bootstrapping basic procedure
- Sample from the population
- Re-sample from that same sample, keeping the same sample size (make sure to have replace=True!)
- Compute the statistic on the re-sample
- Repeat the re-sampling many times
- **Note** - after re-sampling, we will likely see duplicate data entries within a single re-sample, but that's okay!
    - If we re-sampled without replacement, every re-sample would contain exactly the same data as the original sample (this would be bad!)
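
The procedure above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the `population` array is made-up data (not the course dataset), and the names `sample` and `boot_means` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical "population": 10,000 made-up values (not the course data).
population = rng.normal(loc=70, scale=8, size=10_000)

# Step 1: draw one sample from the population (without replacement).
sample = rng.choice(population, size=50, replace=False)

# Steps 2 and 3: re-sample from that sample WITH replacement, many times,
# recording the statistic (here, the mean) of each re-sample.
boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(1000)
])

# Re-sampling WITHOUT replacement at the full sample size only reshuffles
# the same values, so the mean never changes.
no_replace_mean = rng.choice(sample, size=len(sample), replace=False).mean()

print("re-sample means vary:", boot_means.std() > 0)
print("no-replacement mean equals sample mean:",
      np.isclose(no_replace_mean, sample.mean()))
```

The last two lines show why `replace=True` matters: only the re-samples drawn with replacement produce a spread of statistics.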

# Bootstrapped Confidence Intervals
- Goal : return a range of values that we are confident contain the true population statistic 
    - Bootstrapping gives us a distribution of sample statistics
    - The true population statistic often lies within the bulk of that distribution 
- $X$% confidence interval 
    - Interpretation
        - **YES**: $X$% of all bootstrapped sample statistics fall within that interval
        - **YES**: ~$X$% of the time, the interval will capture the correct population statistic
        - **YES**: I'm $X$% confident that the true population statistic is in the interval
        - **NO**: the true population statistic has an $X$% chance of being in the interval
    - Computation
        - Use $\frac{100-X}{2}$ and $100-\frac{100-X}{2}$ for lower and upper percentiles
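
As a rough sketch of the percentile computation above, the snippet below uses $X = 70$ on made-up data. The helper `percentile_bounds` and the array `boot_stats` are hypothetical names introduced only for this illustration.

```python
import numpy as np

def percentile_bounds(x):
    """Return the (lower, upper) percentiles for an x% confidence interval."""
    tail = (100 - x) / 2
    return tail, 100 - tail

rng = np.random.default_rng(1)
# Hypothetical stand-in for an array of bootstrapped statistics.
boot_stats = rng.normal(loc=70, scale=1, size=1000)

lower_perc, upper_perc = percentile_bounds(70)
ci = np.percentile(boot_stats, [lower_perc, upper_perc])

# By construction, roughly 70% of the bootstrapped statistics fall inside.
inside = np.mean((boot_stats >= ci[0]) & (boot_stats <= ci[1]))
print(lower_perc, upper_perc, ci, inside)
```

Splitting the leftover $100 - X$ percent evenly between the two tails is what centers the interval on the bulk of the distribution.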
        
### CIs for testing
- Given a p-value cutoff of $p$% and null hypothesis "population statistic = $a$":
    - Construct a $(100-p)$% CI for the population statistic
    - Reject null hypothesis if $a$ is not in the interval
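
A minimal sketch of this test, assuming a 5% cutoff and made-up bootstrapped statistics; `boot_stats` and `reject_null` are hypothetical names, not part of the course code.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical bootstrapped statistics, centered near 70.
boot_stats = rng.normal(loc=70, scale=1, size=1000)

p = 5                       # p-value cutoff, in percent
tail = p / 2                # half of the cutoff goes in each tail
lower, upper = np.percentile(boot_stats, [tail, 100 - tail])  # 95% CI

def reject_null(a):
    """Reject 'population statistic = a' when a lies outside the CI."""
    return not (lower <= a <= upper)

print(reject_null(70))  # hypothesized value near the center of the distribution
print(reject_null(80))  # hypothesized value far outside the interval
```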

# Describing a Distribution : Center and Spread
- Center of a distribution 
    - *Mean* : balance point
    - *Median* : half-way point (robust to outliers) 
- Spread of distribution 
    - *Range* : biggest - smallest
    - *Standard deviation* : variability around the mean
- Chebyshev's Inequality
    - Proportion of values in the range "average $\pm\ z$ SDs" is ≥ $1-\frac{1}{z^2}$
- Looking forward
    - We'll look at other types of distributions and ones that can be parameterized 
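
Chebyshev's Inequality from the list above can be checked numerically. The sketch below assumes made-up, skewed data (`values` is a hypothetical array) to show that the bound does not depend on the shape of the distribution.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical skewed data; Chebyshev's bound holds for any distribution.
values = rng.exponential(scale=10, size=5000)

mean, sd = values.mean(), values.std()

for z in [2, 3, 4]:
    # Proportion of values within z SDs of the mean.
    within = np.mean(np.abs(values - mean) <= z * sd)
    bound = 1 - 1 / z**2
    print(f"z={z}: observed {within:.3f} >= bound {bound:.3f}")
```

The observed proportions are usually well above the bound, since Chebyshev's Inequality is a worst-case guarantee.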

#### Extra
- You can find additional help on these topics in the course [textbook](https://eldridgejm.github.io/dive_into_data_science/front.html).
- [Here](https://ucsd-ets.github.io/dsc10-2020-fa/published/default/reference/babypandas-reference.pdf) is a pointer to that reference sheet we saw last time.

In [None]:
import babypandas as bpd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import otter
grader = otter.Notebook()

from notebook.services.config import ConfigManager

cm = ConfigManager()
cm.update(
    "livereveal", {
        'width': 1500,
        'height': 700,
        "scroll": True,
})

<div style="padding: 15px; border: 1px solid transparent; border-color: transparent; margin-bottom: 20px; border-radius: 4px; color: #3c763d; background-color: #dff0d8; border-color: #d6e9c6;">

# RECALL FROM LAST WEEK
    
Quick outline
- data : life expectancy 
- population : all countries
- sample : smaller random selection of countries
- visualizations : histograms of life expectancy data and mean
    
</div>

## Life expectancy data

This data comes from the World Health Organization.  We can learn more about the meanings of the columns by looking here: https://www.kaggle.com/kumarajarshi/life-expectancy-who

Let's travel back in time to the year 2015 and collect some data!  For the duration of this discussion, we're going to consider the following data to be our *"population"*.

Let's take a look at it.

In [None]:
# load in all the data
life_expectancy = bpd.read_csv("data/Life Expectancy Data.csv")

# choose only data from 2015
recent_data = life_expectancy[life_expectancy.get("Year") == 2015]

recent_data

In [None]:
# compute population mean to compare

pop_mean = recent_data.get('Life expectancy ').mean()
pop_mean

# Quick recap about sampling!

Here we'll take a look at the same life expectancy data and do some sampling exercises.

In [None]:
# Let's visualize our population distribution.

# Defining a function to create bins easily
def get_bins(array, bin_size=1):
    smallestNum = int(array.min())
    
    largestNum = int(array.max())
    upperLimit = largestNum + bin_size + 1
    
    return np.arange(smallestNum, upperLimit, bin_size)

In [None]:
measured = recent_data.get("Life expectancy ")

# generate the bin edges
n_bins = get_bins(measured, 1) # <-- Try playing around with the bin size

# let's plot the histogram
measured.plot(kind='hist', bins=n_bins, density=True)

## POPULATION DISTRIBUTION
- life expectancy of all countries in our POPULATION (entire dataset)

In [None]:
# different sample sizes

num_samples = 50

#num_samples = 40

In [None]:
# How do we create a representative sample?
collected = recent_data.sample(n=num_samples, replace=False)

# we need new bin edges for the sample
n_bins = get_bins(collected.get('Life expectancy '),1)

# let's plot the histogram
plt.title("Sample Distribution")
collected.get('Life expectancy ').plot(kind='hist', bins=n_bins, density=True)
# plt.show()

## SAMPLE DISTRIBUTION
- life expectancy of all countries in our SAMPLE (random selection of 50 countries)

In [None]:
sample_mean = collected.get('Life expectancy ').mean()
sample_mean

# We can show our mean in relation to the sample:

# plot the histogram again
collected.get('Life expectancy ').plot(kind='hist', bins=n_bins, density=True)

#draw the sample mean
plt.title("Sample Mean")
plt.axvline(sample_mean, c='r')
# plt.show()

## SAMPLE MEAN
- **mean** life expectancy of all countries in our SAMPLE (random selection of 50 countries)

In [None]:
# Run this multiple times to see what changes.

resampled = collected.sample(num_samples,replace=True)
resampled_mean = resampled.get('Life expectancy ').mean()
n_bins = get_bins(collected.get('Life expectancy '), 1)

print("The resampled mean is:\t\t", resampled_mean, "\nCompared to the original:\t", sample_mean)

# plot the histogram again
resampled.get('Life expectancy ').plot(kind='hist', bins=n_bins, density=True)

# let's show the resampled mean (red) and the original sample mean (blue)
plt.title("Resampled mean")
plt.axvline(resampled_mean, c='r')
plt.axvline(sample_mean, c='b')
# plt.show()

## RESAMPLED MEAN
- **mean** life expectancy of all countries in our NEW SAMPLE

In [None]:
# bootstrapping loop

sample_means = np.array([])

for i in range(1000):
    bootstrapped = collected.sample(num_samples,replace=True)
    boot_mean = bootstrapped.get('Life expectancy ').mean()
    sample_means = np.append(sample_means, boot_mean)

In [None]:
plt.title("Distribution of Sample Means")
plt.hist(sample_means, bins=get_bins(sample_means, 0.5))
# plt.show()

## DISTRIBUTION OF SAMPLE MEANS
- distribution of **mean** life expectancy from 1000 different re-samples of our sample (bootstrapping!)

In [None]:
plt.hist(sample_means, bins=get_bins(sample_means, 0.5))
plt.scatter(pop_mean, 0, color='red', s=80).set_zorder(2)

## POPULATION MEAN TOO
- comparing population mean to the distribution of sample means

<div style="padding: 15px; border: 1px solid transparent; border-color: transparent; margin-bottom: 20px; border-radius: 4px; color: #3c763d; background-color: #dff0d8; border-color: #d6e9c6;">

### Everything above should hopefully be familiar from last week
- If not, [here](https://ucsd.zoom.us/rec/play/FNbiilaGa1BNkSAKGdIAIuBtvTMnua2wyZGWCKZ7SEj1l426mV18AkgUHnFTMypCepd5t5mm8cD85Ukp.dSsb2q67ubzlcB55?startTime=1605664878000&_x_zm_rtaid=0Oagfxv1TH6w3vwvZqPTbQ.1606170859635.ea29f4b90ffe94632dd6376dbf87eb23&_x_zm_rhtaid=235) is a link to last week's discussion!

</div>

# So now what? 

- What conclusions can we make about the **population mean** based on our distribution of sample means?

# Confidence Intervals

- We would like to come up with a range of values that contains X% of all bootstrapped sample means.
- This range is our X% confidence interval for the population mean

### How to do this?
- We need our array of sample means and we need to compute a few percentiles based on what X% confidence interval we'd like to return

# Question 1 
Suppose we'd like to construct 90% and 82% confidence intervals for some statistic.

What are the upper and lower percentiles we need in each case?

In [None]:
# compute the lower percentile given a confidence level (e.g. 90 for a 90% CI)
def compute_lower_percentile(perc_conf):
    
    lower_perc = 
    
    return lower_perc

# compute the upper percentile given a confidence level (e.g. 90 for a 90% CI)
def compute_upper_percentile(perc_conf):
    
    upper_perc = 
    
    return upper_perc

In [None]:
lower_perc_90 = 
print(f"Lower percentile for 90% C.I. : {lower_perc_90}")

upper_perc_90 = 
print(f"Upper percentile for 90% C.I. : {upper_perc_90}")

In [None]:
lower_perc_82 = 
print(f"Lower percentile for 82% C.I. : {lower_perc_82}")

upper_perc_82 = 
print(f"Upper percentile for 82% C.I. : {upper_perc_82}")

# Question 2 

Which of the two confidence intervals (90% or 82%) is wider? Why?

In [None]:
# choose 90 or 82
 
larger_ci = 



# Question 3

Compute the upper and lower bounds of a 95% confidence interval for our ```sample_means``` data from above.

In [None]:
def compute_ci(confidence_level, sample_means):

    # What is the mean we're estimating?
    mean = 

    # What are the percentiles?
    # Use the functions we made above
    lower_perc = 
    upper_perc = 

    # And then our lower and upper bounds?
    lower_bound = 
    upper_bound = 

    # Printing it out so we can easily see our results.
    print("""
    Mean:\t{}

    Lower Percentile:\t{}
    Upper Percentile:\t{}

    Lower Bound:\t{}
    Upper Bound:\t{}

    Confidence Level:\t{}%
    """.format(mean, lower_perc, upper_perc, lower_bound, upper_bound, confidence_level))
    
    return lower_bound, upper_bound

In [None]:
confidence_level = 

# compute the ci
lower_bound, upper_bound = 

### Let's visualize the confidence interval on the histogram from earlier

In [None]:
def plot_ci(ci, lower_bound, upper_bound, sample_means, pop_mean):
    plt.title(f"{ci}% confidence interval")
    plt.hist(sample_means, bins=get_bins(sample_means, 0.5))
    plt.scatter(pop_mean, 0, color='red', s=80).set_zorder(3)
    plt.plot([lower_bound, upper_bound], [0,0], color='lime', linewidth=4, zorder=2)

In [None]:
plot_ci(confidence_level, lower_bound, upper_bound, sample_means, pop_mean)

# Question 4

Interpret what the confidence interval means in the context of this problem. 

Answer: 

In [None]:
lower_bound

In [None]:
upper_bound

# Question 5

Compute 100%, 80%, and 50% confidence intervals using the same ```sample_means``` and visualize the results of each.

In [None]:
# compute the bounds
print("100% CI")
lower_100, upper_100 = 

print("80% CI")
lower_80, upper_80 = 

print("50% CI")
lower_50, upper_50 = 

In [None]:
# visualize the results
plot_ci(100, lower_100, upper_100, sample_means, pop_mean)
plot_ci(80, lower_80, upper_80, sample_means, pop_mean)
plot_ci(50, lower_50, upper_50, sample_means, pop_mean)

# Question 6

Do any of the above confidence intervals (100%, 95%, 80%, 50%) NOT contain the true population mean?

In [None]:
pop_mean

In [None]:
# answer True or False
exists_interval = 
exists_interval

# Question 7

Is it possible for the 80% confidence interval to contain the true population mean while the 95% confidence interval does not?

In [None]:
# answer True or False
possible = 
possible