# DSC 10 Discussion Week 8

<img src="data/panda_tree.jpg" width="500">

#### Extra
- You can find additional help on these topics in the course [textbook](https://eldridgejm.github.io/dive_into_data_science/front.html).
- [Here](https://ucsd-ets.github.io/dsc10-2020-fa/published/default/reference/babypandas-reference.pdf) is a pointer to that reference sheet we saw last time.

# Permutation Testing

###  A/B testing through simulation
- Decide whether two random samples come from the same distribution
- Statistic : difference between means
- Null hypothesis : the two groups are sampled from the same distribution
    - *PROBLEM* : we don't know what that distribution is!
    
### Permutation tests
- We can't draw samples from a distribution like we're used to because we don't know what the distribution is!
- Instead : randomly shuffle (permute) group labels during simulation
    - Compute the "difference in means" test statistic between groups of shuffled data
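
The shuffling idea above can be sketched on toy data. This is a minimal illustration, not the worksheet solution: the values, the `"A"`/`"B"` labels, and the helper `difference_in_means` are all made up for the example, and it uses plain NumPy rather than babypandas.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Hypothetical toy data: one value and one group label per individual.
values = np.array([71.0, 74.5, 69.2, 80.1, 82.3, 79.8])
labels = np.array(["A", "A", "A", "B", "B", "B"])

def difference_in_means(values, labels):
    """Mean of group B minus mean of group A (hypothetical helper)."""
    return values[labels == "B"].mean() - values[labels == "A"].mean()

# Statistic for the data as actually observed.
observed = difference_in_means(values, labels)

# Under the null the labels are interchangeable, so shuffle them and recompute.
simulated = np.array([
    difference_in_means(values, rng.permutation(labels))
    for _ in range(1000)
])
```

Each entry of `simulated` is one difference in means that could have arisen if group membership had nothing to do with the values.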

### Causation
- Observational study - rejecting the null hypothesis does not establish causality
    - Correlation ≠ causation
    - Confounding factors
- Randomized Controlled Trial (RCT)
    - A/B test in an RCT does support causality

In [None]:
import babypandas as bpd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import otter
grader = otter.Notebook()

from notebook.services.config import ConfigManager

cm = ConfigManager()
cm.update(
    "livereveal", {
        'width': 1500,
        'height': 700,
        "scroll": True,
})

## Life expectancy data

This data comes from the World Health Organization.  We can learn more about the meanings of the columns by looking here: https://www.kaggle.com/kumarajarshi/life-expectancy-who

Let's travel back in time to the year 2015 and collect some data!  For the duration of this discussion, we're going to consider the following data to be our *"population"*.

Let's take a look at it.

In [None]:
# load in all the data
life_expectancy = bpd.read_csv("data/Life Expectancy Data.csv")
life_expectancy

In [None]:
# choose only data from 2015
recent_data = life_expectancy[life_expectancy.get("Year") == 2015]
recent_data

## Sampling from the data

From now on, the above data will be considered our population, so we will sample from this population to complete the following experiment.

First, let's take a sample of 50 countries from the population.

In [None]:
# grab a sample
recent_sample = recent_data.sample(50,replace=False).get(["Status","Life expectancy "])
recent_sample
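
To see what `replace=False` buys us, here is a small sketch using NumPy's random generator on a made-up population of 20 labels (the array `population` is hypothetical, not the WHO data). Without replacement, no individual can be drawn twice.

```python
import numpy as np

rng = np.random.default_rng(1)  # seeded only so the sketch is repeatable
population = np.arange(20)      # hypothetical stand-in for 20 country labels

# Draw 10 labels each way.
without_replacement = rng.choice(population, size=10, replace=False)
with_replacement = rng.choice(population, size=10, replace=True)

print(len(np.unique(without_replacement)))  # always 10: every draw is distinct
print(len(np.unique(with_replacement)))     # 10 or fewer: repeats are possible
```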

## Life expectancy and country status

Question : **Is the life expectancy of people born in developing countries significantly shorter than that of people born in developed countries?**

### Question 1

Set up the null and alternative hypotheses for this experiment.


**Null:** 

**Alternative:** 

### Question 2 

How many countries are in each group?

In [None]:
countries_per_group = 
countries_per_group

### Question 3

What is the average life expectancy in each group?

In [None]:
expectancy_per_group = 
expectancy_per_group

### Question 4

Visualize the distribution of life expectancy for each group in a histogram plot.

In [None]:
# create a new dataframe for each group
developed_expectancy = 
developing_expectancy = 

# check to make sure the counts match those from above
print(f"There are {developed_expectancy.shape[0]} developed countries",end='')
print(f" and {developing_expectancy.shape[0]} developing countries")

In [None]:
# plot on a histogram
fig, ax = plt.subplots()

# first plot developed 


# then plot developing

plt.legend(["Developed", "Developing"])
plt.xlabel("Life expectancy")

### Question 5

What test statistic should we use to compare these two sample distributions? 
Decide which test statistic is best, then compute it for the observed sample.

Hint (use ```expectancy_per_group``` from above)

In [None]:
expectancy_per_group

In [None]:
# observed test statistic
test_statistic_name = 

means = 
observed_test_statistic = 
observed_test_statistic

### Question 6

Randomly permute the group labels and create a new dataframe based on ```recent_sample``` with an additional column.

In [None]:
original_and_shuffled = 
original_and_shuffled
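
As a sanity check on what a permutation does, the sketch below shuffles a made-up array of status labels with NumPy (the array `status` is hypothetical). The shuffle changes which row gets which label, but the number of labels in each group stays the same.

```python
import numpy as np

rng = np.random.default_rng(2)  # seeded only so the sketch is repeatable

# Hypothetical labels, one per country in a tiny sample.
status = np.array(["Developed", "Developing", "Developing", "Developing", "Developed"])

# A permutation returns the same labels in a random order.
shuffled_status = rng.permutation(status)

# Sorting both arrays shows they contain exactly the same labels.
same_labels = np.array_equal(np.sort(status), np.sort(shuffled_status))
print(same_labels)  # True
```

Shuffling the labels and shuffling the measured values are interchangeable here: either one breaks the pairing between a country's status and its life expectancy.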

### Question 7

Compute the mean life expectancy for each group in the newly permuted data.

What do you notice?

In [None]:
diff_between_means = 
diff_between_means

### Question 8

Not clear yet? Let's compute the difference between the two group means.

In [None]:
obs_mean_difference = 
obs_mean_difference

### Question 9
Wow, that's a huge difference! Could it be due to chance? Let's repeat the shuffle 5000 times and store each shuffled difference in an array.

In [None]:
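simulated_stats = np.array([])
num_observations = 5000  # number of shuffles to simulate

for _ in range(num_observations):
    # shuffle the life expectancies, which breaks their pairing with "Status"
    original_and_shuffled = recent_sample.assign(
        shuffled_life_expectancy = np.random.permutation(recent_sample.get("Life expectancy ")))
    # mean of each numeric column within each status group
    diff_between_means = original_and_shuffled.groupby("Status").mean()
    # Developing minus Developed, computed for every column
    mean_difference = diff_between_means.loc['Developing'] - diff_between_means.loc['Developed']
    # keep only the difference for the shuffled column
    shuffled_difference = mean_difference.get('shuffled_life_expectancy')
    simulated_stats = np.append(simulated_stats, shuffled_difference)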
    
simulated_stats
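
A note on the loop pattern: `np.append` copies the whole array on every call. That is fine at this scale, but an alternative is to allocate the result array once and fill it by index. The sketch below shows that variant on made-up data with plain NumPy; the arrays `values` and `is_developing` are hypothetical stand-ins, not the worksheet's tables.

```python
import numpy as np

rng = np.random.default_rng(3)  # seeded only so the sketch is repeatable

# Hypothetical life expectancies and a mask marking the "Developing" group.
values = np.array([60.0, 62.5, 71.0, 80.0, 82.0, 65.5, 79.0, 58.0])
is_developing = np.array([True, True, True, False, False, True, False, True])

n_repetitions = 500
stats = np.empty(n_repetitions)  # allocate once instead of growing with np.append

for i in range(n_repetitions):
    # Shuffle the values so they are no longer tied to their group.
    shuffled = rng.permutation(values)
    # Developing mean minus Developed mean for this shuffle.
    stats[i] = shuffled[is_developing].mean() - shuffled[~is_developing].mean()
```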
    

### Question 10

It doesn't look like chance alone explains a difference this large. Let's quantify that: assuming the null hypothesis is true, what proportion of the simulated differences are at least as extreme as the observed one?



In [None]:
p_val = 
p_val
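
One way to turn simulated statistics into a p-value is to count the share of them that are at least as extreme as the observed one. The sketch below uses made-up numbers (both `simulated` and `observed` are hypothetical) and assumes the statistic is "developing minus developed", so "extreme" means small or very negative.

```python
import numpy as np

# Hypothetical simulated statistics and observed statistic (developing minus developed).
simulated = np.array([-1.2, 0.4, 2.1, -3.5, 0.0, 1.7, -0.8, -4.1, 0.9, -2.6])
observed = -3.0

# The alternative says developing is lower, so "as extreme" means <= observed.
p_value = np.count_nonzero(simulated <= observed) / len(simulated)
print(p_value)  # 0.2
```

If the statistic were defined the other way around (developed minus developing), the comparison would flip to `>=`.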

### Question 11

Compare the p-value to the cutoff. Should we reject the null hypothesis at a 10% significance threshold?

In [None]:
reject_null_hypothesis = 
reject_null_hypothesis

# Quick recap about sampling!

Here we'll take a look at the same life expectancy data and do some sampling exercises.

In [None]:
# Let's visualize our population distribution.

# Defining a function to create bins easily
def get_bins(array, bin_size=1):
    smallestNum = int(array.min())
    
    largestNum = int(array.max())
    upperLimit = largestNum + bin_size + 1
    
    return np.arange(smallestNum, upperLimit, bin_size)
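
To check what `get_bins` returns, here is a quick usage sketch on a made-up array. The function body is repeated (with shorter variable names) only so the snippet runs on its own; `toy` is hypothetical data.

```python
import numpy as np

def get_bins(array, bin_size=1):
    # Same logic as the helper above, repeated so this snippet is self-contained.
    smallest = int(array.min())
    largest = int(array.max())
    return np.arange(smallest, largest + bin_size + 1, bin_size)

toy = np.array([51.3, 60.0, 88.7])  # hypothetical values
edges = get_bins(toy, 10)
print(edges)  # [51 61 71 81 91]
```

Note that the result is an array of bin edges, not a bin count, and the last edge lands at or beyond the largest value so no data point is left out.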

In [None]:
measured = recent_data.get("Life expectancy ")

# generate the bin edges
n_bins = get_bins(measured, 1) # <-- Try playing around with the bin size

# let's plot the histogram
measured.plot(kind='hist', bins=n_bins, density=True)

In [None]:
# This is our ... ?

```
POPULATION DISTRIBUTION
```

So, what is our aim?  

We want to estimate the average life expectancy for the globe!  Let's say we don't have access to the entire population.  

Flying around the world is pretty expensive, so we can only collect data from 15 countries.

We can take a sample and use bootstrapping to estimate it.


In [None]:
# How do we create a representative sample?
collected = recent_data.sample(n=15, replace=False)


In [None]:
collected

In [None]:
#we need new bin sizes
n_bins = get_bins(collected.get('Life expectancy '),1)


# let's plot the histogram
collected.get('Life expectancy ').plot(kind='hist', bins=n_bins, density=True)

In [None]:
# This is our ...?

```
SAMPLE DISTRIBUTION
```

We're interested in estimating the mean life expectancy.  So, let's find the mean of our sample.

In [None]:
sample_mean = collected.get('Life expectancy ').mean()
sample_mean

In [None]:
# We can show our mean in relation to the sample:

# plot the histogram again
collected.get('Life expectancy ').plot(kind='hist', bins=n_bins, density=True)

#draw the sample mean
plt.axvline(sample_mean, c='r')

In [None]:
# This is our ... ?

```
SAMPLE MEAN
```

What happens when we resample?

In [None]:
# Run this multiple times to see what changes.

resampled = collected.sample(15,replace=True)
resampled_mean = resampled.get('Life expectancy ').mean()
n_bins = get_bins(collected.get('Life expectancy '), 1)

print("The resampled mean is:\t\t", resampled_mean, "\nCompared to the original:\t", sample_mean)

# plot the histogram again
resampled.get('Life expectancy ').plot(kind='hist', bins=n_bins, density=True)

# show the resampled mean (red) and the original sample mean (blue)
plt.axvline(resampled_mean, c='r')
plt.axvline(sample_mean, c='b')

In [None]:
# This is our ... ?

```
RESAMPLED MEAN
```

Now, let's run the bootstrap so we can create a distribution!

In [None]:
sample_means = np.array([])

for i in range(5000):
    bootstrapped = collected.sample(15,replace=True)
    boot_mean = bootstrapped.get('Life expectancy ').mean()
    sample_means = np.append(sample_means, boot_mean)
    

plt.hist(sample_means, bins=get_bins(sample_means, 0.5))

In [None]:
# This is our ... ?

```
DISTRIBUTION OF BOOTSTRAPPED SAMPLE MEANS
```
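
One thing worth noticing about this distribution: it is centered on the mean of the sample we collected, not on the (unknown) population mean. The sketch below illustrates that with plain NumPy on a made-up sample of ten values; `sample` is hypothetical data.

```python
import numpy as np

rng = np.random.default_rng(4)  # seeded only so the sketch is repeatable

# Hypothetical sample of life expectancies.
sample = np.array([52.4, 61.0, 66.7, 71.2, 73.5, 74.9, 76.1, 78.3, 81.0, 82.6])

n_resamples = 2000
boot_means = np.empty(n_resamples)
for i in range(n_resamples):
    # Resample with replacement, same size as the original sample.
    resample = rng.choice(sample, size=len(sample), replace=True)
    boot_means[i] = resample.mean()

# The bootstrapped means cluster around the original sample mean.
print(sample.mean(), boot_means.mean(), boot_means.std())
```

The spread of `boot_means` gives a sense of how much the sample mean would vary from sample to sample.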