In [None]:
#: the usual imports
import babypandas as bpd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
import warnings; warnings.simplefilter('ignore')

plt.style.use('fivethirtyeight')

from notebook.services.config import ConfigManager

cm = ConfigManager()
cm.update(
   "livereveal", {
       'width': 1200,
       'height': 700,
       "scroll": True,
})

## Announcements

* CAPE evaluations
* Final exam released tomorrow 3pm PST, due Sunday 11:59pm PDT
    - Note the time change

Which topic do you **most** want to cover?

A. hypothesis testing  
B. permutation testing  
C. central limit theorem and sample means  
D. bootstrapping  
E. histograms  
F. percentiles  



Which topic do you **least** want to cover?

A. hypothesis testing  
B. permutation testing  
C. central limit theorem and sample means  
D. bootstrapping  
E. histograms  
F. percentiles  



## Hypothesis Testing and Permutation Testing

In [None]:
restaurants = bpd.read_csv('data/restaurants.csv')
restaurants.columns

In [None]:
restaurants = restaurants.get(['business_name', 'inspection_date', 'inspection_score', 'risk_category', 'Neighborhoods', 'Zip Codes'])
restaurants 

In [None]:
at_risk = restaurants[(restaurants.get('inspection_score')>-1)&
                      ((restaurants.get('risk_category')=="High Risk")|(restaurants.get('risk_category')=="Low Risk")|
                       (restaurants.get('risk_category')=="Moderate Risk"))]
at_risk

In [None]:
(
    at_risk[at_risk.get('risk_category') == 'High Risk']
    .get('inspection_score')
    .plot(kind='hist', label='High Risk', color='green', alpha = 0.6, bins = 25, density = True)
)
(
    at_risk[at_risk.get('risk_category') == 'Low Risk']
    .get('inspection_score').plot(kind='hist', label='Low Risk', color='red', alpha = 0.5, bins = 25, density = True)
)
plt.xlabel('Inspection Score')

plt.legend(['High Risk', 'Low Risk'])

We want to compare high risk restaurants to low risk restaurants and see if their inspection scores are different. What technique should we use?

A. A/B testing  
B. Standard hypothesis testing  
C. Bootstrapping  
D. Confidence intervals

Can you state the null and alternative hypotheses?

Null: The inspection scores of low-risk and high-risk restaurants are drawn from the same distribution.

Alternative: The inspection scores of low-risk and high-risk restaurants are NOT drawn from the same distribution.

It looks like the high risk and low risk groups have pretty different histograms.
What test statistic(s) can we use to quantify the difference between the two groups displayed in a given histogram?

A. total variation distance  
B. difference in the means  
C. either of the above
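
As a rough sketch of how each candidate statistic could be computed from two arrays of scores: the scores below are made up (they are not the restaurant data), the helper names `difference_in_means` and `tvd_of_histograms` are hypothetical, and the bin edges are an assumption.

```python
import numpy as np

def difference_in_means(a, b):
    """Mean of group a minus mean of group b."""
    return np.mean(a) - np.mean(b)

def tvd_of_histograms(a, b, bins):
    """Total variation distance between the binned distributions of a and b."""
    # Proportion of each group that falls in each bin
    prop_a = np.histogram(a, bins=bins)[0] / len(a)
    prop_b = np.histogram(b, bins=bins)[0] / len(b)
    return np.abs(prop_a - prop_b).sum() / 2

# Made-up scores, for illustration only
low = np.array([96, 94, 92, 98, 90, 96])
high = np.array([82, 88, 76, 90, 84, 80])
bins = np.arange(70, 105, 5)  # assumed bin edges: 70, 75, ..., 100

diff = difference_in_means(low, high)
tvd = tvd_of_histograms(low, high, bins)
```

Note that the TVD computed this way depends on the choice of bins, while the difference in means does not.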

In [None]:
at_risk

In [None]:
high_low = at_risk[(at_risk.get('risk_category')=="High Risk")|(at_risk.get('risk_category')=="Low Risk")]
high_low = high_low.get(['inspection_score', 'risk_category'])
high_low

In [None]:
high_low_shuffled = high_low.assign(shuffled_label=np.random.permutation(high_low.get('risk_category')))
high_low_shuffled

Does the next cell calculate the observed value of the test statistic or a simulated value of the test statistic?

A. observed  
B. simulated

How would you calculate the other?

In [None]:
grouped = high_low_shuffled.groupby('risk_category').mean().get('inspection_score')
grouped.loc["Low Risk"] - grouped.loc["High Risk"]

In [None]:
grouped = high_low_shuffled.groupby('shuffled_label').mean().get('inspection_score')
grouped.loc["Low Risk"] - grouped.loc["High Risk"]

In [None]:
def calculate_test_statistic():
    high_low_shuffled = high_low.assign(shuffled_label=np.random.permutation(high_low.get('risk_category')))
    grouped = high_low_shuffled.groupby('shuffled_label').mean().get('inspection_score')
    return grouped.loc["Low Risk"] - grouped.loc["High Risk"]

In [None]:
calculate_test_statistic()

In [None]:
simulated_stats = np.array([])

for i in np.arange(1000):
    sim_stat = calculate_test_statistic()
    simulated_stats = np.append(simulated_stats, sim_stat)

In [None]:
np.count_nonzero(simulated_stats>0)

In [None]:
grouped = high_low_shuffled.groupby('risk_category').mean().get('inspection_score')
observed = grouped.loc["Low Risk"] - grouped.loc["High Risk"]

bpd.DataFrame().assign(DifferenceInMeans=simulated_stats).plot(kind='hist', density=True)
plt.axvline(observed, color='red')

What's the p-value?

In [None]:
np.count_nonzero(simulated_stats>=observed)/1000  # divide by the number of simulations (1000)
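
The whole permutation test can also be sketched in plain NumPy. This is a minimal illustration under stated assumptions: the scores and labels are made up, `low_minus_high` is a hypothetical helper, and the random generator is seeded only so the sketch is reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility

# Made-up scores and labels, for illustration only
scores = np.array([96, 94, 92, 98, 90, 96, 82, 88, 76, 90, 84, 80])
labels = np.array(['Low Risk'] * 6 + ['High Risk'] * 6)

def low_minus_high(scores, labels):
    """Mean score of the Low Risk group minus mean score of the High Risk group."""
    return scores[labels == 'Low Risk'].mean() - scores[labels == 'High Risk'].mean()

# Observed statistic: uses the original labels
observed_stat = low_minus_high(scores, labels)

# Simulated statistics: shuffle the labels, then recompute
n_repetitions = 1000
simulated = np.array([
    low_minus_high(scores, rng.permutation(labels))
    for _ in range(n_repetitions)
])

# Proportion of shuffles with a statistic at least as large as the observed one
p_value = np.count_nonzero(simulated >= observed_stat) / n_repetitions
```

The denominator is the number of repetitions, so the p-value is always a proportion between 0 and 1.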

You work as a family physician and you want to test the following hypotheses:

Null Hypothesis: Family physicians see an equal number of children and adults.

Alternative Hypothesis: Family physicians see an unequal number of children and adults.

You collect data and you find that in 6354 patients, 3115 were children and 3239 were adults.

Which test statistic(s) could be used for this hypothesis test? Which values of the test statistic point towards the alternative?

A. proportion of children seen   
B. number of children seen  
C. number of children minus number of adults seen  
D. absolute value of number of children minus number of adults seen  

What if we used a different alternative hypothesis? Which test statistics would work then? 

Null Hypothesis: Family physicians see an equal number of children and adults.

Alternative Hypothesis: Family physicians see more adults than children.

How do you generate one value of the test statistic? Let's use number of children seen, in 6354 patients.

In [None]:
np.random.multinomial(6354, [0.5, 0.5])[0]

Can you do it without using `np.random.multinomial`?

In [None]:
patients = bpd.DataFrame().assign(Patient=['Adult', 'Child'])

In [None]:
many_patients = patients.sample(6354, replace=True)
many_patients

In [None]:
np.count_nonzero(many_patients.get('Patient')=='Child')
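
The same simulation can be sketched without a DataFrame. The version below assumes NumPy's `Generator` API (`np.random.default_rng`); the seed and variable names are illustrative choices, not part of the lecture code.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded for reproducibility

n_patients = 6354

# Draw each patient independently: 'Adult' or 'Child' with equal probability
draws = rng.choice(['Adult', 'Child'], size=n_patients)
num_children = np.count_nonzero(draws == 'Child')

# Same distribution in one step: a single binomial draw
num_children_binomial = rng.binomial(n_patients, 0.5)
```

Both values are simulated under the null hypothesis, so each should land near 6354 / 2 = 3177.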

Is this an example of bootstrapping?  
A. Yes, because we are sampling with replacement.  
B. No, this is not bootstrapping.

In [None]:
test_stats = np.array([])

for i in np.arange(10000):
    stat = np.random.multinomial(6354, [0.5, 0.5])[0]
    test_stats = np.append(test_stats, stat)

In [None]:
bpd.DataFrame().assign(NumChildren=test_stats).plot(kind='hist', density=True)


Observed data: You collect data and you find that in 6354 patients, 3115 were children and 3239 were adults.

Null Hypothesis: Family physicians see an equal number of children and adults.

Alternative Hypothesis: Family physicians see more adults than children.

A. reject the null  
B. fail to reject the null  
C. not sure  

How do we calculate a p-value?

A. `np.count_nonzero(test_stats<3115)/10000`  
B. `np.count_nonzero(test_stats<=3115)/10000`  
C. `np.count_nonzero(test_stats>3115)/10000`  
D. `np.count_nonzero(test_stats>=3115)/10000`  

## The Central Limit Theorem

> The distribution of sums (and averages) of large random samples (drawn with replacement) is roughly normal, regardless of the distribution of the population from which the samples were drawn.

In [None]:
bakeries = restaurants[(restaurants.get('business_name').str.contains('Bake'))&(restaurants.get('inspection_score')>-1)]
bakeries

In [None]:
bakeries.sample(200, replace=True)

In [None]:
bakeries.sample(200, replace=True).get('inspection_score').mean()

In [None]:
sample_means = np.array([])

for i in np.arange(10000):
    sample_mean = bakeries.sample(200, replace=True).get('inspection_score').mean()
    sample_means = np.append(sample_means, sample_mean)

## Distribution of the Sample Mean (aka, Sampling Distribution of Mean)

In [None]:
bpd.DataFrame().assign(SampleMean=sample_means).plot(kind='hist', density=True)

Is this a probability histogram or an empirical histogram?  
A. probability histogram  
B. empirical histogram  

In [None]:
np.mean(sample_means), np.std(sample_means)

Based on this distribution, which of the following represents the standard deviation of inspection scores for all bakeries?

A. `0.6`  
B. `0.6/(200)**0.5`  
C. `0.6*(200)**0.5`

In [None]:
np.std(bakeries.get('inspection_score'))

### Sample: A random 200 bakeries

In [None]:
one_sample = bakeries.sample(200, replace=True)
one_sample.plot(kind='hist', y='inspection_score', density=True)

In [None]:
np.mean(one_sample.get('inspection_score').values), np.std(one_sample.get('inspection_score'))

### Population: All bakeries in San Francisco with an inspection score

In [None]:
bakeries.plot(kind='hist', y='inspection_score')

In [None]:
np.mean(bakeries.get('inspection_score').values), np.std(bakeries.get('inspection_score'))

### According to the Central Limit Theorem, the SD of the distribution of the sample mean

In [None]:
np.std(bakeries.get('inspection_score'))/np.sqrt(200)  # if you don't have access to the population SD, use the sample SD

In [None]:
np.std(sample_means)
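
The agreement between these two numbers can be checked on any population. The sketch below uses a made-up, right-skewed population (exponential) rather than the bakery data; the population size, number of repetitions, and seed are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # seeded for reproducibility

# A made-up, right-skewed population, for illustration only
population = rng.exponential(scale=10, size=100_000)
sample_size = 200

# 5000 samples drawn with replacement, one sample per row
samples = rng.choice(population, size=(5000, sample_size), replace=True)
means = samples.mean(axis=1)

clt_sd = np.std(population) / np.sqrt(sample_size)  # SD predicted by the CLT
empirical_sd = np.std(means)                         # SD seen in the simulation
```

Even though the population is far from normal, `clt_sd` and `empirical_sd` should be close, and the sample means should center on the population mean.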

In [None]:
one_sample

Based on my one sample of 200 bakeries, how can we estimate the median inspection score of all bakeries in San Francisco with an inspection score? What technique should we use?

A. A/B testing  
B. Standard hypothesis testing  
C. Bootstrapping  
D. Confidence intervals

Can I use a normal confidence interval? That is, can I take the mean and step out 2 SDs in either direction to get a 95% CI?  
A.  Yes  
B.  No  

In [None]:
one_sample

In [None]:
np.median(one_sample.get('inspection_score'))

In [None]:
# Bootstrapping resamples from the one sample we have, not from the population
resample_median = np.median(one_sample.sample(one_sample.shape[0], replace=True).get('inspection_score'))
resample_median

In [None]:
boot_medians = np.array([])

for i in np.arange(5000):
    # Resample from one_sample (the data in hand), with the same size as one_sample
    resample_median = np.median(one_sample.sample(one_sample.shape[0], replace=True).get('inspection_score'))
    boot_medians = np.append(boot_medians,resample_median)

In [None]:
bpd.DataFrame().assign(BootstrappedMedians=boot_medians).plot(kind='hist', density=True)

In [None]:
np.percentile(boot_medians, 2.5)

In [None]:
np.percentile(boot_medians, 97.5)
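
A percentile-based bootstrap interval for the median can be sketched in plain NumPy as follows. The sample here is made up (random integers standing in for scores), and the number of resamples and the seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)  # seeded for reproducibility

# A made-up sample of 200 scores, for illustration only
sample = rng.integers(70, 101, size=200)

# Resample from the sample itself, with replacement, at the sample's own size
boot = np.array([
    np.median(rng.choice(sample, size=len(sample), replace=True))
    for _ in range(2000)
])

# Middle 95% of the bootstrapped medians
left, right = np.percentile(boot, [2.5, 97.5])
```

The endpoints `left` and `right` come from the distribution of resampled medians, so the interval describes uncertainty about the population median, not the spread of individual scores.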

Which of the following interpretations of this confidence interval is valid?  

1. 95% of SF bakeries have an inspection score between 84 and 86.  
2. 95% of the resamples have a median inspection score between 84 and 86.  
3. There is a 95% chance that our sample has a median inspection score between 84 and 86.  
4. There is a 95% chance that the median inspection score of all SF bakeries is between 84 and 86.  
5.  If we had taken 100 samples from the same population, about 95 of these samples would have a median inspection score between 84 and 86.  
6.  If we had taken 100 samples from the same population, about 95 of the confidence intervals created would contain the median inspection score of all SF bakeries.  

## Histograms

In [None]:
happy_donuts = restaurants[(restaurants.get('business_name').str.contains('Happy Donuts'))&(restaurants.get('inspection_score')>-1)]
happy_donuts

In [None]:
happy_donuts.plot(kind='hist', y='inspection_score', density=True, bins=np.arange(70, 100, 4))                                           

How many Happy Donuts restaurants had an inspection score of at least 90?

A. 3  
B. 5  
C. 8  
D. 9  
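
One way to turn a density histogram back into counts: the area of each bar (height times width) is the proportion of values in that bin, and multiplying by the number of values gives a count. The sketch below uses made-up scores, not the Happy Donuts data, and assumes every value falls inside the bins.

```python
import numpy as np

# Made-up scores, for illustration only
scores = np.array([72, 75, 80, 83, 88, 90, 91, 94, 95, 97])
bins = np.arange(70, 100, 4)  # edges 70, 74, ..., 98

# Bar heights on the density scale, plus the bin edges
density, edges = np.histogram(scores, bins=bins, density=True)
widths = np.diff(edges)

# Area of a bar = proportion in that bin; times n gives the count
counts = density * widths * len(scores)

# Count of scores in bins whose left edge is at least 90
at_least_90 = counts[edges[:-1] >= 90].sum()
```

This only answers "at least 90" exactly because 90 is a bin edge; a cutoff that falls inside a bin cannot be recovered from the histogram alone.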