# DSC 10 Discussion Week 3

<img src="data/panda_tree.jpg" width="500">

#### Extra
- You can find additional help on these topics in the course [textbook](https://eldridgejm.github.io/dive_into_data_science/front.html).
- [Here](https://eldridgejm.github.io/dsc10-2021-su/published/default/reference/reference.pdf) is a pointer to that reference sheet we saw last time.

## $\underline{Models\ and\ Statistics}$

### Model
- a set of assumptions about data
- assessing the quality of models $\rightarrow$ statistical inference!

### Terminology
- **Parameter** : a number associated with the *population* $\rightarrow$ rarely known exactly
- **Statistic** : a number calculated from the *sample* $\rightarrow$ estimate of a parameter

### Bias-Variance trade-off
- **Bias** : systematic error in one direction (too high or too low) $\rightarrow$ good estimates have *LOW bias*
- **Variance** : degree to which the value of an estimate varies $\rightarrow$ good estimates have *LOW variance*
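
As a rough illustration of the two ideas above, here is a minimal sketch on a made-up population (the integers 1 to 1000, not course data). It assumes we estimate the population maximum with the sample maximum, and the population mean with sample means from two different sample sizes. All names and sizes are hypothetical.

```python
import numpy as np

np.random.seed(2)  # seeded only so the sketch is repeatable

population = np.arange(1, 1001)  # made-up population: 1, 2, ..., 1000
repetitions = 500

max_estimates = np.array([])
small_means = np.array([])
large_means = np.array([])

for _ in range(repetitions):
    small = np.random.choice(population, 10, replace=False)
    large = np.random.choice(population, 100, replace=False)
    # the sample max can never exceed the population max, so it errs in one direction
    max_estimates = np.append(max_estimates, large.max())
    small_means = np.append(small_means, small.mean())
    large_means = np.append(large_means, large.mean())

# bias: average estimate minus the true value (negative means "too low")
bias_of_max = max_estimates.mean() - population.max()

# variance: how much the estimate moves from sample to sample
spread_small = small_means.std()
spread_large = large_means.std()

print("bias of sample max:", bias_of_max)
print("spread of means, n=10:", spread_small, " n=100:", spread_large)
```

In this sketch the sample maximum is systematically too low (bias), while the sample mean varies less from sample to sample as the sample size grows (variance).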

### Simulation
- **Single experiment** : ```np.random.multinomial(sample_size, pop_distribution)```
- **A bunch of experiments** : iteration!
- **Visualize** : plot! $\rightarrow$ often *histogram* to show distribution
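
The three steps above can be sketched as follows. This is a minimal example under assumed values: a fair two-outcome model, 100 draws per experiment, and 1000 experiments. None of these numbers come from the worksheet.

```python
import numpy as np

np.random.seed(0)  # seeded only so the sketch is repeatable

sample_size = 100               # assumed number of draws per experiment
pop_distribution = [0.5, 0.5]   # assumed model: [P(heads), P(tails)]
num_experiments = 1000

# Single experiment: one array of counts per category, summing to sample_size
single = np.random.multinomial(sample_size, pop_distribution)

# A bunch of experiments: iterate and record one statistic each time
head_counts = np.array([])
for _ in range(num_experiments):
    counts = np.random.multinomial(sample_size, pop_distribution)
    head_counts = np.append(head_counts, counts[0])

print(single, head_counts.mean())
```

To visualize, passing `head_counts` to `plt.hist` would show the empirical distribution of the statistic.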

## $\underline{Hypothesis\ Testing}$

### Two Viewpoints
- **Null Hypothesis** : default view $\rightarrow$ must be simulatable
- **Alternative Hypothesis** : opposite of Null Hypothesis

### Computing statistics under Null Hypothesis
- Choose a relevant *test statistic*
    - counts, ratios, differences, absolute differences, etc. depending on problem
    - **Total Variation Distance (TVD)** : measures how different two categorical distributions are
    - be careful with use of ```abs()```!
- Track experiment outcomes and compute the **empirical distribution of the statistic under the null hypothesis**
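
A minimal sketch of the total variation distance, and of why `abs()` matters. The helper name `total_variation_distance` and both distributions are made up for illustration.

```python
import numpy as np

def total_variation_distance(dist_a, dist_b):
    """Half the sum of the absolute differences between two distributions."""
    dist_a = np.asarray(dist_a, dtype=float)
    dist_b = np.asarray(dist_b, dtype=float)
    return np.abs(dist_a - dist_b).sum() / 2

# made-up distributions over three categories (each sums to 1)
model = np.array([0.5, 0.3, 0.2])
observed = np.array([0.4, 0.4, 0.2])

tvd = total_variation_distance(observed, model)

# without abs(), positive and negative differences cancel out
signed_sum = (observed - model).sum()

print(tvd, signed_sum)
```

Because both distributions sum to 1, the signed differences always cancel to (approximately) zero, so a statistic built without `abs()` cannot tell the distributions apart.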

### Drawing conclusions
- Compare the following : 
    - **observed test statistic** (red dot/line from class) 
    - **empirical distribution under the null hypothesis** (histograms from experiments)
- Determine if observed value is consistent
    - by visualization or some other conventional quantitative measure
    - **p-value** : probability, assuming the null hypothesis is true, of seeing a result *at least* as extreme as the observed one
        - common cutoff is 5% for statistical significance
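
As a small sketch of the p-value computation, assume large values of the statistic point towards the alternative. The simulated statistics and the observed value below are made-up numbers, chosen so the arithmetic is easy to follow.

```python
import numpy as np

# made-up statistics from ten simulated experiments under the null
simulated_stats = np.array([0, 1, 1, 2, 2, 2, 3, 3, 4, 6])
observed_stat = 3  # made-up observed value

# proportion of simulated statistics at least as extreme as the observation
p_value = np.count_nonzero(simulated_stats >= observed_stat) / len(simulated_stats)

# compare against the conventional 5% cutoff
significant_at_5_percent = p_value < 0.05

print(p_value, significant_at_5_percent)
```

Note the `>=`: using a strict `>` would drop the simulations that tie with the observed value and understate the p-value.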

#### Extra
- You can find additional help on these topics in the course [textbook](https://eldridgejm.github.io/dive_into_data_science/front.html).
- [Here](https://ucsd-ets.github.io/dsc10-2020-fa/published/default/reference/babypandas-reference.pdf) is a pointer to that reference sheet we saw last time.

In [1]:
import babypandas as bpd
import numpy as np
import matplotlib.pyplot as plt

import otter
grader = otter.Notebook()

%matplotlib inline

# Example 1: Fighting Professors

Two professors are fighting about who is the better teacher. To settle the matter, they decide to give each of their classes the same exam. The professor whose class performs better will be considered the better teacher.

## The Data

In [2]:
scores = bpd.read_csv('data/scores.csv')
scores

## Exploration

<!-- BEGIN QUESTION -->

Which professor (A or B) appears to have "won"?

<!--
BEGIN QUESTION
name: q10
manual: true
-->

In [3]:
won_prof = ...

In [None]:
grader.check("q10")

<!-- END QUESTION -->



## Question 1

The winning professor claims that they are significantly better than the other professor -- and it isn't just due to random chance. What technique can we use to evaluate their claim?

**Answer**:

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q11
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



## Question 2

What are the null and alternative hypotheses?

- **Null**:
- **Alternative**:

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q12
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



## Question 3

What test statistic can we use? Remember: it is usually better for *large* values of the test statistic to point towards the alternative hypothesis.

**Answer**:

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q13
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



## Question 4

What was the *observed value* of your test statistic?

In [5]:
obs = ...
obs

In [None]:
grader.check("q14")

## Question 5

Implement your chosen technique to test whether the null hypothesis should be rejected.

In [7]:
num_simulations = 1000
simulated_stats = ...
simulated_stats

## Question 6

What is the probability that we see a value of the test statistic at least as extreme as our observed value if the null hypothesis is true?

In [8]:
p_val = ...
p_val

In [None]:
grader.check("q16")

## Question 7

The "winning" professor claims that the results show that they are the better teacher. Is this correct?

In [10]:
claim_true_or_false = ...
claim_true_or_false

In [None]:
grader.check("q17")

# Example 2: Fun with Test Statistics

## Question 8

You want to test whether a coin is fair. Your hypotheses are:

- **Null**: the coin is fair
- **Alternative**: the coin is not fair

You'll flip the coin 100 times. What test statistic should you use to assess your claim?

In [12]:
# fill out the following code to set up this experiment

num_flips = ...

# model the probability of our coin
model = ...

# flip our coin ... times
flip_outcomes = ...

# flip_outcomes = [num_heads, num_tails]
num_heads = flip_outcomes[0]

# What is our test statistic?
def test_statistic(num_heads):
    ...

# compute test statistic
print(f"Test statistic result : {test_statistic(num_heads)}")

## Question 9

In your experiment, you saw 61 heads. What is the observed value of your test statistic?

In [13]:
num_heads_experiment = 61

observed_test_statistic = ...

print(f"Test statistic result : {observed_test_statistic}")

## Question 10

You want to test whether an *n*-sided die is fair. Your hypotheses are:

- **Null**: the die is fair
- **Alternative**: the die is not fair

You'll roll the die 100 times. What test statistic should you use to assess your claim?

In [14]:
# fill out the following code to set up this experiment


# specify number of sides
N = 20
num_rolls = ...

# model the probability of our die
model_die = ...

# roll our die ... times
roll_outcomes = ...

# roll_outcomes = [count_num_side_1 ,..., ..., count_num_side_N]
# roll_outcomes_prob = [perc_num_side_1 ,..., ..., perc_num_side_N]

roll_outcomes_prob = ...

# What is our test statistic?
def test_statistic_die(roll_outcomes_prob, model_die):
    ...

# compute test statistic
print(f"Test statistic result : {test_statistic_die(roll_outcomes_prob, model_die)}")

## Question 11

You rolled a 4-sided die 100 times and got "one" 20 times, "two" 30 times, "three" 40 times, and "four" 10 times. What is the observed value of your test statistic?

In [15]:
# specify number of sides
N = 4
num_rolls = ...

# Given roll outcomes
roll_outcomes = np.array([20, 30, 40, 10]) 
roll_outcomes_prob = ...

# model the probability of our die
model_die = ...

# compute the test statistic
test_statistic = ...

# display results
print(f"Test statistic result : {test_statistic_die(roll_outcomes_prob, model_die)}")

## Question 12

You rolled a 2-sided die 100 times and got "one" 61 times and "two" 39 times. What is the observed value of your test statistic?

In [16]:
# specify number of sides
N = 2
num_rolls = ...

# Given roll outcomes
roll_outcomes = np.array([61, 39]) 
roll_outcomes_prob = ...

# model the probability of our die
model_die = ...

# compute the test statistic
test_statistic = ...

# display results
print(f"Test statistic result : {test_statistic_die(roll_outcomes_prob, model_die)}")

# Permutation Testing

###  A/B testing through simulation
- Decide whether two random samples come from the same distribution
- Statistic : difference between means
- Null hypothesis : the two groups are sampled from the same distribution
    - *PROBLEM* : we don't know what that distribution is!
    
### Permutation tests
- We can't draw samples from a distribution like we're used to because we don't know what the distribution is!
- Instead : randomly shuffle (permute) group labels during simulation
    - Compute the "difference in means" test statistic between groups of shuffled data
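
The shuffling idea can be sketched with plain NumPy on made-up data. This is only an illustration: the two groups, their values, and the helper `difference_in_means` are hypothetical, and the worksheet below uses `babypandas` on real data instead.

```python
import numpy as np

np.random.seed(3)  # seeded only so the sketch is repeatable

# made-up data: six values in group "A" and six in group "B"
labels = np.array(["A"] * 6 + ["B"] * 6)
values = np.array([72, 75, 78, 80, 83, 85, 60, 64, 66, 70, 71, 74], dtype=float)

def difference_in_means(values, labels):
    """Mean of group A minus mean of group B."""
    return values[labels == "A"].mean() - values[labels == "B"].mean()

observed = difference_in_means(values, labels)

num_simulations = 2000
simulated = np.array([])
for _ in range(num_simulations):
    # under the null the labels carry no information, so shuffle them
    shuffled_labels = np.random.permutation(labels)
    simulated = np.append(simulated, difference_in_means(values, shuffled_labels))

# proportion of shuffles with a difference at least as large as the observed one
p_value = np.count_nonzero(simulated >= observed) / num_simulations

print(observed, p_value)
```

Shuffling keeps the group sizes fixed and only changes which values are assigned to which group, which is exactly what "the two groups come from the same distribution" implies.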

### Causation
- Observational study - rejecting the null hypothesis does not establish causality
    - Correlation ≠ causation
    - Confounding factors
- Randomized Controlled Trial (RCT)
    - A/B test in an RCT does support causality

## Life expectancy data

This data comes from the World Health Organization.  We can learn more about the meanings of the columns by looking here: https://www.kaggle.com/kumarajarshi/life-expectancy-who

Let's travel back in time to the year 2015 and collect some data!  For the duration of this discussion, we're going to consider the following data to be our *"population"*.

Let's take a look at it.

In [17]:
# load in all the data
life_expectancy = bpd.read_csv("data/life_expectancy.csv")
life_expectancy

In [18]:
# choose only data from 2015
recent_data = life_expectancy[life_expectancy.get("Year") == 2015]
recent_data

## Sampling from the data

From now on, the above data will be considered our population, so we will sample from this population to complete the following experiment.

First, let's take a sample of 50 countries from the population.

In [19]:
# grab a sample
recent_sample = recent_data.sample(50,replace=False).get(["Status","Life expectancy "])
recent_sample

## Life expectancy and country status

Question : **Is life expectancy of people born in developing countries significantly shorter than that of people born in developed countries?**

### Question 1

Set up the null and alternative hypotheses for this experiment.

In [20]:
# NULL Hypothesis

# Alternative Hypothesis

### Question 2 

How many countries are in each group?

In [21]:
countries_per_group = ...
countries_per_group

### Question 3

What is the average life expectancy in each group?

In [22]:
expectancy_per_group = ...
expectancy_per_group

### Question 4

Visualize the distribution of life expectancy for each group in a histogram plot.

In [23]:
# create a new dataframe for each group
developed_expectancy = ...
developing_expectancy = ...

# check to make sure the counts match those from above
print(f"There are {developed_expectancy.shape[0]} developed countries",end='')
print(f" and {developing_expectancy.shape[0]} developing countries")

In [24]:
# plot on a histogram
fig, ax = plt.subplots()

# first plot developed 
...

# then plot developing
...

plt.legend(["Developed", "Developing"])
plt.xlabel("Life expectancy")

### Question 5

What test statistic should we use to compare these two sample distributions? 
Decide which test statistic is best, then compute it for the observed sample.

Hint: use ```expectancy_per_group``` from above.

In [25]:
expectancy_per_group

In [26]:
# observed test statistic
test_statistic_name = ...

observed_test_statistic = ...
observed_test_statistic

### Question 6

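Randomly permute the data and create a new dataframe based on ```recent_sample``` with an additional column. (Shuffling the life expectancy values while keeping the labels fixed is equivalent to shuffling the group labels.)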

In [27]:
original_and_shuffled = recent_sample.assign(
    shuffled_life_expectancy = ...
)
original_and_shuffled

### Question 7

Compute the mean life expectancy for each group in the newly permuted data.

What do you notice?

In [28]:
diff_between_means = ...
diff_between_means

### Question 8

Not clear yet? Let's compute the difference between the group means.

In [29]:
obs_mean_difference = ...
obs_mean_difference

### Question 9
Wow! That's a huge difference! Could it be due to chance? Let's repeat the shuffle 5000 times and store each shuffled difference in an array.

In [30]:
simulated_stats = np.array([])
num_observations = 5000

...
simulated_stats
    

### Question 10

It doesn't look like chance alone explains a difference this large. Let's quantify that: assuming the null hypothesis is true, how likely is a difference at least as extreme as the one we observed?



In [31]:
p_val = ...
p_val

### Question 11

Given our p-value, should we reject the null hypothesis at a 10% significance threshold?

In [32]:
reject_null_hypothesis = ...
reject_null_hypothesis

# Quick recap about sampling!

Here we'll take a look at the same life expectancy data and do some sampling exercises.

In [33]:
# Let's visualize our population distribution.

# Defining a function to create bins easily
def get_bins(array, bin_size=1):
    smallestNum = int(array.min())
    
    largestNum = int(array.max())
    upperLimit = largestNum + bin_size + 1
    
    return np.arange(smallestNum, upperLimit, bin_size)
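
To see what `get_bins` returns, here is a quick check on a made-up array (the values are illustrative, not from the dataset). The function is repeated so the snippet runs on its own.

```python
import numpy as np

def get_bins(array, bin_size=1):
    smallestNum = int(array.min())

    largestNum = int(array.max())
    upperLimit = largestNum + bin_size + 1

    return np.arange(smallestNum, upperLimit, bin_size)

# made-up life expectancy values
example = np.array([61.2, 70.5, 88.9])
edges = get_bins(example, 1)

# the result is an array of bin edges, not a bin count
print(edges[0], edges[-1], len(edges))
```

The edges start at the floor of the smallest value and extend past the largest value, so every data point falls inside some bin.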

In [34]:
measured = recent_data.get("Life expectancy ")

# generate the bin edges
n_bins = get_bins(measured, 1) # <-- Try playing around with the bin size

# let's plot the histogram
measured.plot(kind='hist', bins=n_bins, density=True)

In [35]:
# This is our ... ?

```
POPULATION DISTRIBUTION
```

So, what is our aim?  

We want to estimate the average life expectancy for the globe!  Let's say we don't have access to the entire population.  

Flying around the world is pretty expensive, so we can only collect data from 15 countries.

We can take a sample and use bootstrapping to estimate it.


In [36]:
# How do we create a representative sample?
collected = recent_data.sample(n=15, replace=False)


In [37]:
collected

In [38]:
# we need new bin edges for the sample
n_bins = get_bins(collected.get('Life expectancy '),1)


# let's plot the histogram
collected.get('Life expectancy ').plot(kind='hist', bins=n_bins, density=True)

In [39]:
# This is our ...?

```
SAMPLE DISTRIBUTION
```

We're interested in estimating the mean life expectancy.  So, let's find the mean of our sample.

In [40]:
sample_mean = collected.get('Life expectancy ').mean()
sample_mean

In [41]:
# We can show our mean in relation to the sample:

# plot the histogram again
collected.get('Life expectancy ').plot(kind='hist', bins=n_bins, density=True)

#draw the sample mean
plt.axvline(sample_mean, c='r')

In [42]:
# This is our ... ?

```
SAMPLE MEAN
```

What happens when we resample?

In [43]:
# Run this multiple times to see what changes.

resampled = collected.sample(15,replace=True)
resampled_mean = resampled.get('Life expectancy ').mean()
n_bins = get_bins(collected.get('Life expectancy '), 1)

print("The resampled mean is:\t\t", resampled_mean, "\nCompared to the original:\t", sample_mean)

# plot the histogram again
resampled.get('Life expectancy ').plot(kind='hist', bins=n_bins, density=True)

# let's show the resampled mean (red) and the original sample mean (blue)
plt.axvline(resampled_mean, c='r')
plt.axvline(sample_mean, c='b')

In [44]:
# This is our ... ?

```
RESAMPLED MEAN
```

Now, let's run the bootstrap so we can create a distribution!

In [45]:
sample_means = np.array([])

for i in range(5000):
    bootstrapped = collected.sample(15,replace=True)
    boot_mean = bootstrapped.get('Life expectancy ').mean()
    sample_means = np.append(sample_means, boot_mean)
    

plt.hist(sample_means, bins=get_bins(sample_means, 0.5))

In [46]:
# This is our ... ?

```
DISTRIBUTION OF SAMPLE MEANS
```