# DSC 10 Discussion Week 4

<img src="data/panda_baby.jpg" width="500">

# Bootstrapping - sampling within a sample
- Problem : statistics about the data population are often unavailable, costly to acquire, unknown, etc.
- Solution : utilize random sampling (and re-sampling) of available data to estimate population statistics
    - The result of bootstrapping will be a distribution over sample statistics!
    - Hopefully we'll see that these *sample statistics* $\approx$ *population statistics*
    
### Bootstrapping basic procedure
- Sample from the population
- Re-sample from that same sample (make sure to have replace=True!)
- Repeat
- **Note** - after re-sampling, we will likely see duplicate data entries within a single sample, but that's okay! 
    - If we resampled *without* replacement at the same size, every resample would contain exactly the same data as the original sample (this would be bad!)
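
The procedure above can be sketched with plain NumPy. This is only a minimal illustration, not the notebook's own code: the `population` array is synthetic, and the sample size (50), number of resamples (1000), and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

# Hypothetical "population": 10,000 synthetic values (not the course data)
population = rng.normal(loc=70, scale=8, size=10_000)

# Step 1: draw one sample from the population (without replacement)
sample = rng.choice(population, size=50, replace=False)

# Steps 2-3: resample from that sample WITH replacement, many times,
# and record the statistic (here, the mean) of each resample
boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(1000)
])

print("sample mean:", sample.mean())
print("mean of bootstrapped means:", boot_means.mean())
print("population mean:", population.mean())
```

The bootstrapped means cluster around the mean of the one sample we drew, so the quality of the estimate depends on how representative that sample is.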

# Bootstrapped Confidence Intervals
- Goal : return a range of values that we are confident contain the true population statistic 
    - Bootstrapping gives us a distribution of sample statistics
    - The true population statistic often lies within the bulk of that distribution 
- $X$% confidence interval 
    - Interpretation
        - **YES**: $X$% of all bootstrapped sample statistics fall within that interval
        - **YES**: ~$X$% of the time, the interval will capture the correct population statistic
        - **YES**: I'm $X$% sure that the true population statistic is in the interval
        - **NO**: the true population statistic has an $X$% chance of being in the interval
    - Computation
        - Use $\frac{100-X}{2}$ and $100-\frac{100-X}{2}$ for lower and upper percentiles
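
As a rough illustration of the percentile rule above, the sketch below computes the lower and upper percentiles for a chosen confidence level and applies them with `np.percentile`. The helper name `percentile_bounds` and the synthetic `boot_stats` array are assumptions made for this example only.

```python
import numpy as np

def percentile_bounds(conf_level):
    """Return (lower, upper) percentiles for a conf_level% interval."""
    tail = (100 - conf_level) / 2   # percent left out on each side
    return tail, 100 - tail

rng = np.random.default_rng(1)
# Stand-in for an array of bootstrapped statistics (synthetic values)
boot_stats = rng.normal(loc=70, scale=1, size=1000)

lower_perc, upper_perc = percentile_bounds(95)   # 2.5 and 97.5
ci = (np.percentile(boot_stats, lower_perc),
      np.percentile(boot_stats, upper_perc))
print(lower_perc, upper_perc, ci)
```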
        
### CIs for testing
- Given a p-value cutoff of $p$% and null hypothesis "population statistic = $a$":
    - Construct a $(100-p)$% CI for the population statistic
    - Reject null hypothesis if $a$ is not in the interval
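
The testing recipe above might look like the following in code. Everything here is illustrative: `boot_means` is synthetic, and the cutoff (5%) and null value (75) are hypothetical choices, not values from the course data.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic stand-in for bootstrapped sample means
boot_means = rng.normal(loc=70, scale=1, size=2000)

p = 5              # cutoff, as a percentage
null_value = 75    # hypothetical null: "population mean = 75"

tail = p / 2       # a (100 - p)% interval leaves p/2 % in each tail
lower = np.percentile(boot_means, tail)
upper = np.percentile(boot_means, 100 - tail)

# Reject the null if the hypothesized value falls outside the interval
reject_null = not (lower <= null_value <= upper)
print(f"{100 - p}% CI: [{lower:.2f}, {upper:.2f}]  reject null? {reject_null}")
```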

# Describing a Distribution : Mean and Spread
- Center of a distribution 
    - *Mean* : balance point
    - *Median* : half-way point (robust to outliers) 
- Spread of distribution 
    - *Range* : biggest - smallest
    - *Standard deviation* : variability around the mean
- Chebyshev's Inequality
    - Proportion of values in the range "average $\pm\ z$ SDs" is ≥ $1-\frac{1}{z^2}$
- Looking forward
    - We'll look at other types of distributions and ones that can be parameterized 
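
To make the center and spread measures above concrete, here is a small sketch on made-up numbers (the `values` array is a toy example with one deliberate outlier).

```python
import numpy as np

values = np.array([2, 3, 3, 4, 5, 100])   # toy data with one outlier

print("mean  :", np.mean(values))      # pulled toward the outlier
print("median:", np.median(values))    # barely affected by the outlier
print("range :", values.max() - values.min())
print("SD    :", np.std(values))       # population SD (ddof=0)
```

Dropping the outlier would change the mean and SD a lot, while the median would move only slightly, which is what "robust to outliers" refers to.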

#### Extra
- You can find additional help on these topics in the course [textbook](https://eldridgejm.github.io/dive_into_data_science/front.html).
- [Here](https://ucsd-ets.github.io/dsc10-2020-fa/published/default/reference/babypandas-reference.pdf) is a pointer to that reference sheet we saw last time.

In [1]:
from scipy import stats
import babypandas as bpd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import otter
grader = otter.Notebook()

from notebook.services.config import ConfigManager

cm = ConfigManager()
cm.update(
    "livereveal", {
        'width': 1500,
        'height': 700,
        "scroll": True,
})

<div style="padding: 15px; border: 1px solid transparent; border-color: transparent; margin-bottom: 20px; border-radius: 4px; color: #3c763d; background-color: #dff0d8; border-color: #d6e9c6;">

# RECALL FROM LAST WEEK
    
Quick outline
- data : life expectancy 
- population : all countries
- sample : smaller random selection of countries
- visualizations : histograms of life expectancy data and mean
    
</div>

## Life expectancy data

This data comes from the World Health Organization.  We can learn more about the meanings of the columns by looking here: https://www.kaggle.com/kumarajarshi/life-expectancy-who

Let's travel back in time to the year 2015 and collect some data!  For the duration of this discussion, we're going to consider the following data to be our *"population"*.

Let's take a look at it.

In [2]:
# load in all the data
life_expectancy = bpd.read_csv("data/life_expectancy.csv")

# choose only data from 2015
recent_data = life_expectancy[life_expectancy.get("Year") == 2015]

recent_data

In [3]:
# compute population mean to compare

pop_mean = recent_data.get('Life expectancy ').mean()
pop_mean

# Quick recap about sampling!

Here we'll take a look at the same life expectancy data and do some sampling exercises.

In [4]:
# Let's visualize our population distribution.

# Defining a function to create bins easily
def get_bins(array, bin_size=1):
    smallestNum = int(array.min())
    
    largestNum = int(array.max())
    upperLimit = largestNum + bin_size + 1
    
    return np.arange(smallestNum, upperLimit, bin_size)

In [5]:
measured = recent_data.get("Life expectancy ")

# generate the bin edges
n_bins = get_bins(measured, 1) # <-- Try playing around with the bin size

# let's plot the histogram
recent_data.get('Life expectancy ').plot(kind='hist', bins=n_bins, density=True)

## POPULATION DISTRIBUTION
- life expectancy of all countries in our POPULATION (entire dataset)

In [6]:
# different sample sizes

num_samples = 50

# num_samples = 40

In [7]:
# How do we create a representative sample?
collected = recent_data.sample(n=num_samples, replace=False)

#we need new bin sizes
n_bins = get_bins(collected.get('Life expectancy '),1)


# let's plot the histogram
plt.title("Sample Distribution")
collected.get('Life expectancy ').plot(kind='hist', bins=n_bins, density=True)
# plt.show()

## SAMPLE DISTRIBUTION
- life expectancy of all countries in our SAMPLE (random selection of `num_samples` countries)

In [8]:
sample_mean = collected.get('Life expectancy ').mean()
sample_mean

# We can show our mean in relation to the sample:

# plot the histogram again
collected.get('Life expectancy ').plot(kind='hist', bins=n_bins, density=True)

#draw the sample mean
plt.title("Sample Mean")
plt.axvline(sample_mean, c='r')
# plt.show()

## SAMPLE MEAN
- **mean** life expectancy of all countries in our SAMPLE (random selection of `num_samples` countries)

In [9]:
# Run this multiple times to see what changes.

resampled = collected.sample(num_samples,replace=True)
resampled_mean = resampled.get('Life expectancy ').mean()
n_bins = get_bins(collected.get('Life expectancy '), 1)

print("The resampled mean is:\t\t", resampled_mean, "\nCompared to the original:\t", sample_mean)

# plot the histogram again
resampled.get('Life expectancy ').plot(kind='hist', bins=n_bins, density=True)

# let's show the sample_mean (blue) and resampled_mean (red)
plt.title("Resampled mean")
plt.axvline(resampled_mean, c='r')
plt.axvline(sample_mean, c='b')
# plt.show()

## RESAMPLED MEAN
- **mean** life expectancy of all countries in our NEW SAMPLE

In [10]:
# bootstrapping loop

sample_means = np.array([])

for i in range(1000):
    bootstrapped = collected.sample(num_samples,replace=True)
    boot_mean = bootstrapped.get('Life expectancy ').mean()
    sample_means = np.append(sample_means, boot_mean)

In [11]:
plt.title("Distribution of Sample Means")
plt.hist(sample_means, bins=get_bins(sample_means, 0.5))
# plt.show()

## DISTRIBUTION OF SAMPLE MEANS
- distribution of **mean** life expectancy from 1000 different samples (bootstrapping!)

In [12]:
plt.hist(sample_means, bins=get_bins(sample_means, 0.5))
plt.scatter(pop_mean, 0, color='red', s=80).set_zorder(2)

## POPULATION MEAN TOO
- comparing population mean to the distribution of sample means

<div style="padding: 15px; border: 1px solid transparent; border-color: transparent; margin-bottom: 20px; border-radius: 4px; color: #3c763d; background-color: #dff0d8; border-color: #d6e9c6;">

### Everything above should hopefully be familiar from last week
- if not [here](https://ucsd.zoom.us/rec/play/FNbiilaGa1BNkSAKGdIAIuBtvTMnua2wyZGWCKZ7SEj1l426mV18AkgUHnFTMypCepd5t5mm8cD85Ukp.dSsb2q67ubzlcB55?startTime=1605664878000&_x_zm_rtaid=0Oagfxv1TH6w3vwvZqPTbQ.1606170859635.ea29f4b90ffe94632dd6376dbf87eb23&_x_zm_rhtaid=235) is a link to last week's discussion!

</div>

# So now what? 

- What conclusions can we make about the **population mean** based on our distribution of sample means?

# Confidence Intervals

- We would like to come up with a range of values that contain X% of all bootstrapped sample means. 
- This interval corresponds to an X% confidence interval

### How to do this?
- We need our array of sample means and we need to compute a few percentiles based on what X% confidence interval we'd like to return

# Question 1 
Suppose we'd like to construct 90% and 82% Confidence Intervals over some statistic.

What are the upper and lower percentiles we need in each case?

In [13]:
# compute the lower percentile given a confidence interval
def compute_lower_percentile(perc_conf):
    
    lower_perc = ...
    
    return lower_perc

# compute the upper percentile given a confidence interval
def compute_upper_percentile(perc_conf):
    
    upper_perc = ...
    
    return upper_perc

In [14]:
lower_perc_90 = ...
print(f"Lower percentile for 90% C.I. : {lower_perc_90}")

upper_perc_90 = ...
print(f"Upper percentile for 90% C.I. : {upper_perc_90}")

In [15]:
lower_perc_82 = ...
print(f"Lower percentile for 82% C.I. : {lower_perc_82}")

upper_perc_82 = ...
print(f"Upper percentile for 82% C.I. : {upper_perc_82}")

# Question 2 

Which of the two confidence intervals (90% or 82%) is larger? Why?

In [16]:
# choose 90 or 82
 
larger_ci = ...

...

# Question 3

Compute the upper and lower bounds of a 95% confidence interval for our ```sample_means``` data from above.

In [17]:
def compute_ci(confidence_level, sample_means):

    # What is the mean we're estimating?
    mean = ...

    # What are the percentiles?
    # Use the functions we made above
    lower_perc = ...
    upper_perc = ...

    # And then our lower and upper bounds?
    lower_bound = ...
    upper_bound = ...

    # Printing it out so we can easily see our results.
    print("""
    Mean:\t{}

    Lower Percentile:\t{}
    Upper Percentile:\t{}

    Lower Bound:\t{}
    Upper Bound:\t{}

    Confidence Level:\t{}%
    """.format(mean, lower_perc, upper_perc, lower_bound, upper_bound, confidence_level))
    
    return lower_bound, upper_bound

In [18]:
confidence_level = ...

# compute the ci
...

### Let's visualize the confidence interval on the histogram from earlier

In [19]:
def plot_ci(ci, lower_bound, upper_bound, sample_means, pop_mean):
    plt.title(f"{ci}% confidence interval")
    plt.hist(sample_means, bins=get_bins(sample_means, 0.5))
    plt.scatter(pop_mean, 0, color='red', s=80).set_zorder(3)
    plt.plot([lower_bound, upper_bound], [0,0], color='lime', linewidth=4, zorder=2)

In [20]:
plot_ci(confidence_level, lower_bound, upper_bound, sample_means, pop_mean)

# Question 4

Interpret what the confidence interval means in the context of this problem. 

In [21]:
...

In [22]:
lower_bound

In [23]:
upper_bound

# Question 5

Compute 100%, 80%, and 50% confidence intervals using the same ```sample_means``` and visualize the results of each.

In [24]:
# compute the bounds
print("100% CI")
...

print("80% CI")
...

print("50% CI")
...

In [25]:
# visualize the results
plot_ci(100, lower_100, upper_100, sample_means, pop_mean)
plot_ci(80, lower_80, upper_80, sample_means, pop_mean)
plot_ci(50, lower_50, upper_50, sample_means, pop_mean)

# Question 6

Do any of the above confidence intervals (100%, 95%, 80%, 50%) NOT contain the true population mean?

In [26]:
pop_mean

In [27]:
# answer True or False
exists_interval = ...
exists_interval

# Question 7

Is it possible for the 80% confidence interval to contain the true population mean while the 95% confidence interval does not?

In [28]:
# answer True or False
possible = ...
possible


# Area Under the Curve

Regardless of the shape of the distribution, the area under the curve obeys Chebyshev's bounds:

For all lists, and all numbers $z$, the proportion of entries that are in the range "average $\pm z$ SDs" is at least $1 - \frac{1}{z^{2}}$.

In other words, at least $1-\frac{1}{z^2}$ of the data in a sample must fall within $z$ standard deviations of the mean.

How is this useful? It gives a guaranteed minimum for the proportion of entries that lie within a given number of standard deviations of the mean, which in turn is a lower bound on the area under that part of the curve.

NOTE : Chebyshev's inequality holds for any shaped distribution!
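
As an informal check of the bound, the sketch below compares the observed proportion within $z$ SDs against $1-\frac{1}{z^2}$ on deliberately skewed data. The exponential `data` array and the particular values of $z$ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
# Skewed synthetic data (exponential), deliberately not bell-shaped
data = rng.exponential(scale=2.0, size=5000)

mean, sd = data.mean(), data.std()

for z in (1.5, 2.5, 4):
    within = np.mean(np.abs(data - mean) <= z * sd)   # observed proportion
    bound = 1 - 1 / z**2                              # Chebyshev's guarantee
    print(f"z={z}: observed {within:.3f} >= bound {bound:.3f}")
```

The observed proportion is typically well above the bound; Chebyshev only promises a minimum.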

### Question 0.1

According to Chebyshev's inequality, at least what proportion of entries are in the range average $\pm 1$ SD?

In [29]:
cheby_area_pm_1 = ...
cheby_area_pm_1

### Question 0.2

According to Chebyshev's inequality, at least what proportion of entries are in the range average $\pm 2$ SD?

In [30]:
cheby_area_pm_2 = ...
cheby_area_pm_2

### Question 0.3

According to Chebyshev's inequality, at least what proportion of entries are in the range average $\pm 3$ SD?

In [31]:
cheby_area_pm_3 = ...
cheby_area_pm_3

## Area Under the Curve : Normal Distribution

For a normal distribution, the area within a given number of SDs of the mean is much larger than Chebyshev's minimum, because the normal curve concentrates most of its area near the mean.
Let us explore what the same ranges look like under a normal distribution with the help of `scipy.stats`.

We will use the `stats.norm.cdf` function, which evaluates the cumulative distribution function of the standard normal curve at a given point. For example, `stats.norm.cdf(1)` gives the area under the standard normal curve to the left of 1.

In general, the area within $[a, b]$ is ```stats.norm.cdf(b) - stats.norm.cdf(a)```
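
A short sketch of that subtraction, using arbitrary endpoints `a` and `b` in standard units (chosen only for illustration):

```python
from scipy import stats

a, b = -1.5, 0.5   # arbitrary endpoints in standard units

left_of_b = stats.norm.cdf(b)            # area to the left of b
left_of_a = stats.norm.cdf(a)            # area to the left of a
area_between = left_of_b - left_of_a     # area within [a, b]

print(f"Area within [{a}, {b}]: {area_between:.4f}")
print("Area left of 0:", stats.norm.cdf(0))   # half the curve, by symmetry
```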

### Question 0.4

What is the proportion of entries that are in the range average $\pm 1$ SD under the normal curve?

In [32]:
normal_area_pm_1 = ...
normal_area_pm_1

### Question 0.5

What is the proportion of entries that are in the range average $\pm 2$ SD under the normal curve?

In [33]:
normal_area_pm_2 = ...
normal_area_pm_2

### Question 0.6

What is the proportion of entries that are in the range average $\pm 3$ SD under the normal curve?

In [34]:
normal_area_pm_3 = ...
normal_area_pm_3

In [35]:
# comparing AUC results

print(f"For ±1 SD --> Cheby. : {round(cheby_area_pm_1,3)}\t Normal : {round(normal_area_pm_1,3)}")
print(f"For ±2 SD --> Cheby. : {round(cheby_area_pm_2,3)}\t Normal : {round(normal_area_pm_2,3)}")
print(f"For ±3 SD --> Cheby. : {round(cheby_area_pm_3,3)}\t Normal : {round(normal_area_pm_3,3)}")

Chebyshev's inequality is valid for every distribution, but the lower bound it gives on the proportion of data within $z$ standard deviations of the mean is much weaker than what the normal curve provides.

# Central Limit Theorem

The Central Limit Theorem says that the probability distribution of the sum or average of a large random sample drawn with replacement will be roughly normal, regardless of the distribution of the population from which the sample is drawn.

This is really useful since it can allow us to work with normal curves in most problems. 

You have already been relying on this fact when computing p-values. For a statistic that is roughly normally distributed, a p-value cutoff of 0.05 corresponds to the statistic being about 2 SDs or more away from the mean, which is rare under a normal curve: roughly 95% of the area lies within $\pm 2$ SDs. For an arbitrary distribution, Chebyshev's bound only guarantees that at least 75% of the values lie within $\pm 2$ SDs.
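
To see where the 0.05 figure comes from, the sketch below compares the two-sided tail area under the standard normal curve with the largest tail area Chebyshev's inequality allows. The helper names `normal_tail_area` and `chebyshev_tail_bound` are made up for this example.

```python
from scipy import stats

# Two-sided tail area beyond z SDs under the standard normal curve
def normal_tail_area(z):
    return 2 * (1 - stats.norm.cdf(z))

# Largest tail area Chebyshev allows beyond z SDs (any distribution)
def chebyshev_tail_bound(z):
    return 1 / z**2

for z in (1.96, 2.5):
    print(f"z={z}: normal tail {normal_tail_area(z):.4f}, "
          f"Chebyshev bound {chebyshev_tail_bound(z):.4f}")
```

The cutoff of "about 2 SDs" is more precisely 1.96 SDs, where the normal tail area is very close to 0.05.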

In [36]:
# Let us introduce a random uniformly distributed dataset

data = np.random.uniform(0, 20, 200)
plt.hist(data)

As you can see above, we have a dataset that is clearly not normal. Let's try bootstrapping it and computing the mean of each resample.

In [37]:
num_simulations = 500
sample_means = np.array([])
for _ in range(num_simulations):
    sample = np.random.choice(data, 200) # Note: Using .sample is better. I am working with a numpy array which is why I use this
    mean_of_sample = np.mean(sample)
    sample_means = np.append(sample_means, mean_of_sample)

plt.hist(sample_means)

This may be surprising, but the bootstrapped means are roughly normally distributed!

This is extremely useful: we can compute a p-value even when the data are not normal, because the distribution of the statistic is roughly normal (as a result of the CLT).

### Question 2.1

What is the p value if we have an observed statistic of 9?

In [38]:
p_value_with_obs_9 = ...
p_value_with_obs_9

### Question 2.2

What is the p value if we have an observed statistic of 10?

In [39]:
p_value_with_obs_10 = ...
p_value_with_obs_10

# Recall from last time...

### Plot the distribution of sample means
- Grab a sample
- Use bootstrapping to sample from the sample
- Compute the statistic for each bootstrapped sample
- Plot them all

## Population Mean

In [40]:
# compute population mean to compare

pop_data = recent_data.get('Life expectancy ')
pop_mean = pop_data.mean()
pop_mean

# Population Distribution

In [41]:
# Visualization help
def get_bins(array, bin_size=1):
    smallestNum = int(array.min())
    
    largestNum = int(array.max())
    upperLimit = largestNum + bin_size + 1
    
    return np.arange(smallestNum, upperLimit, bin_size)

plt.title("Population Distribution")
plt.hist(pop_data, bins=get_bins(pop_data,0.5))

## The population distribution is clearly not a normal distribution

# Distribution of Sample Means

In [42]:
# Get a sample
num_samples = 60
collected = recent_data.sample(n=num_samples, replace=False)

# Bootstrap
sample_means = np.array([])

for i in range(1000):
    bootstrapped = collected.sample(num_samples,replace=True)
    boot_mean = bootstrapped.get('Life expectancy ').mean()
    sample_means = np.append(sample_means, boot_mean)
    
plt.title("Distribution of Sample Means (with Population Mean)")
plt.hist(sample_means, bins=get_bins(sample_means, 0.5))
plt.scatter(pop_mean, 0, color='red', s=80).set_zorder(2)

## However, the distribution of sample means is roughly normal, thanks to the Central Limit Theorem!

# Bootstrapped Confidence Intervals

In [43]:
# compute the lower percentile given a confidence interval
def compute_lower_percentile(perc_conf):
    
    lower_perc = (100-perc_conf)/2
    
    return lower_perc

# compute the upper percentile given a confidence interval
def compute_upper_percentile(perc_conf):
    
    upper_perc = 100 - (100-perc_conf)/2 
    
    return upper_perc

def compute_ci(confidence_level, sample_means):

    # What is the mean we're estimating?
    mean = np.mean(sample_means) 

    # What are the percentiles?
    # Use the functions we made above
    lower_perc = compute_lower_percentile(confidence_level)
    upper_perc = compute_upper_percentile(confidence_level)

    # And then our lower and upper bounds?
    lower_bound = np.percentile(sample_means, lower_perc) 
    upper_bound = np.percentile(sample_means, upper_perc) 

    # Printing it out so we can easily see our results.
    print("""
    Mean:\t{}

    Lower Percentile:\t{}
    Upper Percentile:\t{}

    Lower Bound:\t{}
    Upper Bound:\t{}

    Confidence Level:\t{}%
    """.format(mean, lower_perc, upper_perc, lower_bound, upper_bound, confidence_level))
    
    return lower_bound, upper_bound

lower_bound, upper_bound = compute_ci(95, sample_means)

def plot_ci(ci, lower_bound, upper_bound, sample_means, pop_mean):
    plt.title(f"{ci}% confidence interval")
    plt.hist(sample_means, bins=get_bins(sample_means, 0.5), density=True)
    plt.scatter(pop_mean, 0, color='red', s=80).set_zorder(3)
    plt.plot([lower_bound, upper_bound], [0,0], color='lime', linewidth=4, zorder=2)
    
plot_ci(95, lower_bound, upper_bound, sample_means, pop_mean)

# A bit of recap info...

---
- Our **POPULATION DISTRIBUTION** is unknown, and can be any shape.


- A **SAMPLE DISTRIBUTION** should have a shape roughly similar to the population distribution.  
(provided that the sample was large enough and was properly randomized)


- A **SAMPLE MEAN** is just the mean of that sample distribution. This is just a single value.


- We can collect a handful of sample means (or fake it by bootstrapping)


- The **DISTRIBUTION OF SAMPLE MEANS** will resemble a normal distribution when each sample is large enough (this is the Central Limit Theorem). Collecting more sample means makes its histogram smoother.


- The **CENTER/MEAN** of the distribution of sample means should be similar to the true population mean.  
(provided that our original sample was proper)

## So what does this all mean...

---

Since we know that the distribution of sample means will be roughly normal when the sample is large enough, do we really need to go through all the effort of running a bootstrap?

Instead, we can rely on what we know about normal distributions!  The two defining features of a normal distribution are its center/mean and its spread/standard deviation.

Let us compute the **mean** and **standard deviation** of our **DISTRIBUTION OF SAMPLE MEANS** and parameterize a normal curve!
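
As a rough sketch of why a parameterized normal curve can stand in for the bootstrapped values, the example below reads a 95% interval off a normal curve and compares it with the percentile interval. The `boot_means` array is synthetic (not the life expectancy data), and the 95% level is an arbitrary choice.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
# Synthetic stand-in for a bootstrapped distribution of sample means
boot_means = rng.normal(loc=70, scale=1.2, size=5000)

center = np.mean(boot_means)
spread = np.std(boot_means)

# 95% interval read off a normal curve with this center and spread
normal_ci = norm.interval(0.95, loc=center, scale=spread)

# 95% interval read off the bootstrapped values directly
percentile_ci = (np.percentile(boot_means, 2.5),
                 np.percentile(boot_means, 97.5))

print("normal curve :", normal_ci)
print("percentiles  :", percentile_ci)
```

When the distribution is close to normal, the two intervals should nearly coincide.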


In [44]:
from scipy.stats import norm

# compute the mean
sample_dist_mean = ...
sample_dist_std = ...

# set limits for plot
start = sample_dist_mean-5*sample_dist_std
stop = sample_dist_mean+5*sample_dist_std
x = np.linspace(start, stop, 100)

plt.title("Distribution of Sample Means (and Normal Curve)")

# plot histogram
plt.hist(sample_means, bins=get_bins(sample_means, 0.5), density=True)

# plot normal curve
plt.plot(x, norm.pdf(x, sample_dist_mean, sample_dist_std), c='r')

print(f"Center (mean) : {round(sample_dist_mean,3)}")
print(f"Spread (std) : {round(sample_dist_std,3)}")

### We now know the Mean and Standard Deviation of the normal curve associated with the distribution of sample means

As you can see above, this normal curve is centered at our sample mean (70.609 years) and has a standard deviation of 1.118 years.

However, we often want to standardize this distribution to be centered at 0 and have a standard deviation of 1.

Standardizing distributions makes it very easy to compare multiple normal distributions that originally had vastly different centers and spreads. It also makes it really easy to compute different statistics about the distribution.

Let's take a look at how to do that now.

# Standard Normal Curve 

## CENTERING
- Mean = 0

In [45]:
# recall our sample of means
print(f"First 5 sample means : \t\t\t{sample_means[:5]}")
print(f"Center of sample distribution : \t{round(sample_dist_mean,3)}")
print(f"Std of sample distribution : \t\t{round(sample_dist_std,3)}")

In [46]:
# center the data to have mean = 0
centered_sample_means = ...
centered_sample_dist_mean = ...
centered_sample_dist_std = ...

print(f"First 5 centered sample means : \t\t{centered_sample_means[:5]}")
print(f"Center of centered sample distribution : \t{round(centered_sample_dist_mean,3)}")
print(f"Std of centered sample distribution : \t\t{round(centered_sample_dist_std,3)}")

In [47]:
# visualize 
plt.title("Distribution of Sample Means (Centered)")
plt.hist(centered_sample_means, bins=get_bins(centered_sample_means, 0.5), density=True)

## SCALING
- Standard Deviation = 1

In [48]:
# scale the data to have std = 1
centered_and_scaled_means = ...
centered_and_scaled_sample_dist_mean = ...
centered_and_scaled_sample_dist_std = ...

print(f"First 5 centered and scaled sample means : \t\t{centered_and_scaled_means[:5]}")
print(f"Center of centered and scaled sample distribution : \t{round(centered_and_scaled_sample_dist_mean,3)}")
print(f"Std of centered and scaled sample distribution : \t{round(centered_and_scaled_sample_dist_std,3)}")

In [49]:
# visualize 
plt.title("Distribution of Sample Means (Centered and Scaled)")
plt.hist(centered_and_scaled_means, bins=get_bins(centered_and_scaled_means, 0.5), density=True)

# Plot the normal curve

In [50]:
# get the mean and std

centered_and_scaled_sample_dist_mean = ...
centered_and_scaled_sample_dist_std = ...

# set limits for plot
start = centered_and_scaled_sample_dist_mean-5*centered_and_scaled_sample_dist_std
stop = centered_and_scaled_sample_dist_mean+5*centered_and_scaled_sample_dist_std
x = np.linspace(start, stop, 100)

plt.title("Distribution of Sample Means (and Normal Curve)")

# plot histogram
plt.hist(centered_and_scaled_means, bins=get_bins(centered_and_scaled_means, 0.5), density=True)

# plot normal curve
plt.plot(x, norm.pdf(x, centered_and_scaled_sample_dist_mean, centered_and_scaled_sample_dist_std), c='r')

print(f"Center (mean) : {round(centered_and_scaled_sample_dist_mean,3)}")
print(f"Spread (std) : {round(centered_and_scaled_sample_dist_std,3)}")

Now that we are looking at a normal distribution, let's talk about standard units and area.

# Standard Units and Area
- Define $z(x) = \frac{x-\text{mean}}{\text{std}}$
- $z(x)$ maps $x$ to standard units 
    - If a distribution is roughly normal, then the proportion of values between $a$ and $b$ is approx. equal to the area under the standard normal curve between $z(a)$ and $z(b)$
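
The conversion to standard units can be sketched as follows. The function name `to_standard_units` and the numbers (mean 100, SD 15, endpoints 85 and 130) are hypothetical and unrelated to the life expectancy data.

```python
from scipy.stats import norm

def to_standard_units(x, mean, std):
    """Convert x to standard units: how many SDs x lies from the mean."""
    return (x - mean) / std

# Hypothetical roughly-normal distribution: mean 100, SD 15
mean, std = 100, 15
a, b = 85, 130

za = to_standard_units(a, mean, std)   # -1.0
zb = to_standard_units(b, mean, std)   #  2.0

# Area between a and b, computed two equivalent ways
area_original = norm.cdf(b, mean, std) - norm.cdf(a, mean, std)
area_standard = norm.cdf(zb) - norm.cdf(za)
print(za, zb, area_original, area_standard)
```

Both computations give the same area, which is why working in standard units loses nothing.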

## What proportion of sample means fall between 72 and 74 years?

### Using Standard Units

In [51]:
# define z(x)
def z(x):
    ...

In [52]:
# define age bounds
lower_age = 72
upper_age = 74

# compute standard units
lower_standard = ...
upper_standard = ...

print(f"Mean life expectancy : {round(sample_dist_mean,2)}")

print(f"LOWER : {lower_age} years --> {round(lower_standard,2)} standard units --> {round(lower_standard,2)} stdev's above the mean")
print(f"UPPER : {upper_age} years --> {round(upper_standard,2)} standard units --> {round(upper_standard,2)} stdev's above the mean")

In [53]:
# compute the area under the standard normal curve between the two bounds
approx_prop_standard = ...
approx_prop_standard

In [54]:
# plot area under curve

plt.title("Area Under Curve (Standard)")

start = centered_and_scaled_sample_dist_mean-5*centered_and_scaled_sample_dist_std
stop = centered_and_scaled_sample_dist_mean+5*centered_and_scaled_sample_dist_std
x = np.linspace(start, stop, 100)
y = norm.pdf(x, centered_and_scaled_sample_dist_mean, centered_and_scaled_sample_dist_std)

# plot normal curve
plt.plot(x, y, c='r')

ix = (x>=lower_standard) & (x<=upper_standard)
plt.fill_between(x[ix],y[ix],alpha=0.5)

plt.axvline(lower_standard,color='C1')
plt.axvline(upper_standard,color='C1')

### Using Sample Distribution

In [55]:
# compute proportion using distribution
approx_prop_dist = ...
approx_prop_dist

In [56]:
# plot area under curve

plt.title("Area Under Curve (Distribution)")

# plot histogram
plt.hist(sample_means, bins=get_bins(sample_means, 0.5), density=True)

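# normal curve in the original units (years); the earlier x and y were in standard units
x_years = np.linspace(sample_dist_mean - 5*sample_dist_std, sample_dist_mean + 5*sample_dist_std, 100)
y_years = norm.pdf(x_years, sample_dist_mean, sample_dist_std)

ix = (x_years>=lower_age) & (x_years<=upper_age)
plt.fill_between(x_years[ix],y_years[ix],alpha=0.5)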

plt.axvline(lower_age,color='C1')
plt.axvline(upper_age,color='C1')