# Lecture 06-pt2

Continue with the covid data set for:
* Hypothesis Testing
* Bootstrap Sampling

In [1]:
import numpy as np
import numpy.random as npr
import random
import itertools

import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

# Reading:
    
    

## Visualizing Multiple Data Sets

In [None]:
import pandas as pd

df = pd.read_csv( 'https://www.fdsp.net/data/covid-merged.csv' )
df.set_index('state', inplace=True)
df['cases_norm'] = df['cases'] / df['population'] * 1000
df['gdp_norm'] = df['gdp'] / df['population'] * 1000;


### Partitioning Based on GDP

The gross domestic product (GDP) provides a measure of how affluent a state is. Let's use it to partition into groups of equal size -- so we will use the ______


In [None]:
#DEMO


To partition the data, we introduce a new dataframe method called `query()` 

In [None]:
#DEMO


In [None]:
#DEMO


In [None]:
#DEMO


In [None]:
#DEMO
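
As a minimal sketch of how `query()` can partition a dataframe, the following uses a small made-up table rather than the COVID data. The column values, the row labels, and the choice of the median as the split point are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data (made-up values, not the real COVID data set)
toy = pd.DataFrame(
    {"gdp_norm": [40.0, 55.0, 62.0, 48.0, 71.0, 52.0]},
    index=["A", "B", "C", "D", "E", "F"],
)

# Assumed split point: the median divides the rows into two equal halves
threshold = toy["gdp_norm"].median()

# Inside a query string, `@name` refers to a Python variable
higher = toy.query("gdp_norm > @threshold")
lower = toy.query("gdp_norm <= @threshold")

print(threshold)
print(higher.index.tolist(), lower.index.tolist())
```

The query string is an ordinary boolean expression over column names, so the same pattern should carry over to other columns.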


Let's compare using histograms:

In [None]:
#DEMO


**ISSUE:**

In [None]:
#DEMO
import numpy as np



In [None]:
#DEMO
# Copy and use new edges
# First time no transparency
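
One issue that often arises when overlaying two histograms is that each data set gets its own bin edges, which makes the bars hard to compare. A possible remedy, sketched below with made-up numbers, is to compute one set of edges from the combined data and reuse it for both groups. The arrays and the choice of 8 bins are assumptions.

```python
import numpy as np

# Hypothetical case rates for two groups (made-up numbers)
group_a = np.array([1.2, 2.5, 3.1, 3.8, 4.4, 6.0])
group_b = np.array([2.0, 3.5, 4.9, 5.2, 7.5, 8.8])

# Compute a single set of bin edges from the combined data
edges = np.histogram_bin_edges(np.concatenate([group_a, group_b]), bins=8)

# Count both groups against the same edges
counts_a, _ = np.histogram(group_a, bins=edges)
counts_b, _ = np.histogram(group_b, bins=edges)

# With Matplotlib, the same edges could be passed as
# plt.hist(group_a, bins=edges, alpha=0.5), and likewise for group_b
print(edges)
print(counts_a, counts_b)
```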


**DISCUSSION**

The histogram shows that states with GDPs per capita over the median have higher case counts in general (for instance, the larger bars in the range 2-5), as well as larger maximum values (above 8).  Unfortunately, the data does not provide any insight into why that might be the case. One reason might be that states with higher GDPs may have larger network effects, where more of the population interacts with a larger number of people. For instance, such states may have more of their population employed in office jobs and more likely to travel for their work. 

When an effect is observed visually, we usually want to quantify that effect in some way and then test whether any observed difference is "real".

### Partitioning Based on Urban Index


Now we use that median to partition the data and plot the histograms:

As with normalized GDP, a substantial difference in the distribution of the COVID case rates is seen for states that are more urban versus less urban. The average normalized case rates for these partitions are:

# Null Hypothesis Testing with Real Data

In [None]:
# Magic cell to make MPL save to both png and pdf
from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats('png', 'pdf')

When dealing with data from different groups, we will generally observe differences. For instance, we may measure the means or medians of the data sets, and we will usually find that these summary statistics differ between the groups.  At this point, we wish to understand whether the observed difference is "significant": is this difference likely to be a property of the underlying groups, or is it just caused by random variations in the data?

The number of samples of random phenomena has a significant impact on how accurately we can estimate those phenomena. When it comes to real data, the problem is amplified because we do not know the ground-truth characterization of the random phenomena producing the data. 

Let's start by restricting our analysis to a small subset of our data to see how effects observed with small data may be caused simply by random sampling. By restricting our data, we can show examples of the technique that we will be using, and the examples will be small enough to understand easily.


## Small Data Example Using Covid Rates

Start by loading the data and calculating the COVID cases per 1000 residents and GDP per 1000 residents:

Now consider the normalized COVID case rates from the first six states and partition them into two sets using alphabetical order. Since Arizona is the third state alphabetically, we can get the first three as follows:

Similarly, we can get the next three using the following:

Because the data is so small, histograms do not make sense.

Instead, we directly turn to using summary statistics. 

Let's calculate the means for these two groups:

**Observation**: The mean COVID case rate for states in `second3` is much larger than for states in `first3`.

**How can we test whether this difference could be caused by random variations in the data?**

**First step:** reduce the observed data to a single statistic, called the *test statistic*:

**test statistic**
>  A single numerical value that can be used in a statistical test.

In [None]:
#DEMO
# Choose ordering to make positive!
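
A minimal sketch of a difference-of-means test statistic follows. The values in `first3` and `second3` here are made up for illustration and are not the real normalized case rates.

```python
import numpy as np

# Hypothetical normalized case rates (made-up, not the real values)
first3 = np.array([1.0, 2.0, 3.0])
second3 = np.array([4.0, 5.0, 9.0])

# Order the subtraction so that the observed difference is positive
diff = second3.mean() - first3.mean()
print(diff)
```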


**This is a ______ difference!**

Possible hypotheses (explanations):
1. States that come later alphabetically might have higher COVID rates. (Hard to justify?)
2. The states in `second3` might differ from the states in group `first3` in some significant way. (Maybe but... why?)
3. The differences could be caused by random sampling. That is, the COVID rates all come from the same underlying random phenomena, but because of randomness, some states end up with higher rates in the observed data than others. In `first3` and `second3`, it just happened by randomness that `second3` ended up with more states with higher COVID rates, and `first3` ended up with more states with lower rates.

In statistics, we want to create hypotheses that can be *tested*, which means that we can assess the likelihood that the hypothesis is true or false.  In this case, we use the term *statistical hypothesis*:


<div class="alert alert-info">
    <b>Statistical hypothesis</b>
    
A **statistical hypothesis** is an explanation for phenomena observed in a data set that can be formally tested using the data.
</div>

Hypothesis 3 is a very common type of hypothesis when working with two groups of data that have different values for some summary statistic of interest. This type of hypothesis is called the *null hypothesis*:


**Null hypothesis (for multiple groups)**
>   The *null hypothesis* is that there is no real difference between the two data sets, and any differences are just based on random sampling from the underlying population.


Usually denoted by $H_0$

*Intuitively, the zero can be read as implying zero difference between the populations (in the feature being compared).*

**Alternate hypothesis ($H_a$)**
>  The *alternate hypothesis* is that the feature being measured actually comes from random distributions that differ by group.

We will conduct a form of *binary hypothesis test*:

**Binary hypothesis test**
>   A binary hypothesis test is a statistical test that decides between two competing statistical hypotheses.

Usually, $H_a$ cannot be tested directly because we do not know ahead of time *how* the underlying phenomena differ. 

Instead, we conduct tests by assuming $H_0$ is true, called a *null hypothesis test*: 

**Null hypothesis significance test (NHST)**
>   A type of binary hypothesis test that estimates the probability of observing a value of the test statistic at least as large as the one seen in the data, under the condition that some null hypothesis, $H_0$, is true.

Two broad approaches to NHSTs:

**Model-based methods**
>   The data is assumed to come from some known statistical distribution; often allows the use of analytical methods.

**Model-free methods**
>   No assumption is made about the data fitting to some statistical model; analytical  methods are usually not possible. 

In model-free methods, techniques must be applied to the data itself to answer questions about the data. 

Either approach relies on assumptions.

We will use the model-free approach because it has several important advantages:
1. Model-based methods typically require that the data reach a certain size before the model is a reasonable approximation for the test statistic. This is not an issue with model-free methods (although data size is important for other reasons).
2. It does not require analytical formulas to be available for every summary statistic that may be considered.
3. It does not rely on mathematical formulas that may seem obscure or that require a significant amount of foundational work in probability to understand.
4. The model-free approach we will use is easier to get started with, especially for engineers who are experienced in programming and simulating phenomena.

We want to draw data from the distribution of the data under $H_0$, but we don't have a model. Instead, we use 
*resampling* to approximate drawing from the distribution of the data under $H_0$:

**resampling**
>   Resampling is a type of statistical simulation in which new samples are repeatedly drawn from the existing data for each of the groups under consideration, and the statistical measures being used are evaluated for each new sample group. 

Our $H_0$:  the data in the two groups come from the same underlying random phenomena

We will _______ the data and draw samples:

<div class="alert alert-info">
    <b>Pooling</b>
    
**Pooling** describes the practice of gathering together small sets of data that are assumed to have been *drawn* from the same underlying population and using the combined larger set (the *pool*) to obtain a more precise estimate of that population.
</div>

In [None]:
#DEMO
import numpy as np
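
Pooling two groups can be sketched with `np.concatenate()`. The group values below are made up for illustration.

```python
import numpy as np

# Hypothetical groups (made-up values, not the real case rates)
first3 = np.array([1.0, 2.0, 3.0])
second3 = np.array([4.0, 5.0, 9.0])

# Under H0 the group labels carry no information, so combine everything
pooled = np.concatenate([first3, second3])
print(pooled)
```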


For numerical data, we use the `numpy.random` submodule to draw samples, which we will import as `npr`:

In [None]:
#DEMO


We can randomly draw data from `pooled` using `npr.choice()`, which requires a variable to draw from as the first argument and can take a second argument as the number of items to draw (if not given, the default is one).

In [None]:
#DEMO
#Note that the resulting array has a repeated value, even though there are no repeated values in the variable `pooled`. 

There are two ways to sample from data:
1. **Sampling with replacement:** Items drawn are placed back into the array from which data is being sampled. Any number of items may be drawn.
2. **Sampling without replacement:** Items drawn are removed from the array from which data is being sampled. The maximum number of items that can be drawn is the size of the original array.

`npr.choice()` defaults to sampling with replacement but can perform sampling without replacement if passed the keyword argument `replace=False`.

For example, here are two draws of 6 items using each approach:

In [None]:
#DEMO
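
The two sampling modes can be compared with a short sketch. The contents of `pooled` below are made up, and no seed is set, so the draws change from run to run.

```python
import numpy as np
import numpy.random as npr

# Hypothetical pooled data (made-up values)
pooled = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 9.0])

# With replacement (the default): values may repeat
with_repl = npr.choice(pooled, 6)

# Without replacement: drawing all 6 items gives a shuffled copy of pooled
without_repl = npr.choice(pooled, 6, replace=False)

print(with_repl)
print(without_repl)
```

Asking for more than 6 items with `replace=False` would raise a `ValueError`, since the pool would run out.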


To perform resampling, we simulate how `first3` and `second3` **could have been created** if they came from the same random phenomena.

* Create new vectors `first3_sample` and `second3_sample` by sampling from the pooled data
* Each sample vector should be the **same size** as the original vectors

The most popular method draws the data **with replacement**.  
* Called **bootstrap sampling**
* often used to simulate random values of the test statistic under $H_0$

In [None]:
#DEMO -- run a few times

print(f'new group first3_sample: {first3_sample}')
print(f'new group second3_sample: {second3_sample}')
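
One way the bootstrap draw might look is sketched below, assuming made-up values for the two original groups. Each new group is drawn with replacement from the pool and keeps the size of the group it stands in for.

```python
import numpy as np
import numpy.random as npr

# Hypothetical original groups (made-up values)
first3 = np.array([1.0, 2.0, 3.0])
second3 = np.array([4.0, 5.0, 9.0])
pooled = np.concatenate([first3, second3])

# Bootstrap sampling: draw with replacement, matching the original sizes
first3_sample = npr.choice(pooled, len(first3))
second3_sample = npr.choice(pooled, len(second3))

print(f'new group first3_sample: {first3_sample}')
print(f'new group second3_sample: {second3_sample}')
```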

Goal of an NHST: determine whether the observed value of the test statistic could be attributed to randomness alone
* Use resampling to estimate the probability of seeing such a large test statistic under $H_0$

In [None]:
#DEMO -- copy from above and add new lines. Run a few times

print(f'new group first3_sample: {first3_sample}')
print(f'new group second3_sample: {second3_sample}')

print(f'original value of test statistic: {diff}')
print(f'sample value of test statistic: ',
      f'{second3_sample.mean() - first3_sample.mean()}')

Note that the sample test statistic can be either positive or negative.

Should we consider the signed value or magnitude? It depends on the original hypothesis, and we will discuss this more later.

For now, let's consider only the *magnitude* of the sample test statistic.


The probability of seeing a value of the test statistic at least as large as the observed one under $H_0$ is called the $\mathbf{p}$**-value**.

For now, we will say that the difference is *statistically significant* if the observed $p$-value is smaller than a threshold
* i.e., it is very unlikely to observe such an extreme difference in summary statistics under $H_0$

**IMPORTANT** Need to declare ahead of time:
* Exactly what is being tested
* Criterion for statistical significance

The significance threshold (for rejecting $H_0$) determines the maximum probability of rejecting the null hypothesis when it is actually true.

## Resampling Simulation for Estimating $p$-value


In [None]:
#DEMO
# These are common to most simulations:
# 1) Set up the number of iterations (no. of samples from the pool)
# 2) Initialize our counter to zero
num_sims = 10_000
count = 0

# Put these outside the loop to save execution time 
# since they don't change.
# Even though we know these, it is good to get in the habit of 
# setting them dynamically from the data



for sim in range(num_sims):
    # Bootstrap sampling


    # Calculate the test statistic


    # Update the counter if observed difference as large as original


print(f'Prob. of seeing a result this extreme =~ {count / num_sims}')
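
A minimal sketch of how the simulation skeleton could be completed follows. The group values are made up, so the printed probability is illustrative only and will not match the result for the real data.

```python
import numpy as np
import numpy.random as npr

# Hypothetical original groups (made-up values, not the real case rates)
first3 = np.array([1.0, 2.0, 3.0])
second3 = np.array([4.0, 5.0, 9.0])

num_sims = 10_000
count = 0

# Quantities that do not change are computed once, outside the loop
pooled = np.concatenate([first3, second3])
diff = abs(second3.mean() - first3.mean())
len1, len2 = len(first3), len(second3)

for sim in range(num_sims):
    # Bootstrap sampling: draw both groups from the pool with replacement
    sample1 = npr.choice(pooled, len1)
    sample2 = npr.choice(pooled, len2)

    # Test statistic: magnitude of the difference of sample means
    sample_diff = abs(sample2.mean() - sample1.mean())

    # Count the draws that are at least as extreme as the observed value
    if sample_diff >= diff:
        count += 1

p_value = count / num_sims
print(f'Prob. of seeing a result this extreme =~ {p_value}')
```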

**Draw conclusions**

Since this $p$-value is much larger than our threshold of 5% (i.e., 0.05), we **fail to reject the null hypothesis**. The observed difference will occur approximately 20% of the time, even if there is no difference in normalized COVID rates among these states.

**NOTE** This does not mean that there is no difference between these two groups. There could be a difference, but the data is not sufficient to be sure that they come from different distributions.

## Testing the Observed Differences in COVID Rates

Now let's apply bootstrap resampling to test the previously observed differences in COVID rates based on GDP per capita using the full set of states:

In [None]:
#DEMO


We will use the difference in sample means as the test statistic for an NHST, so one of our first steps is to calculate the value of this test statistic:

For this case, the pooled data is simply all of the normalized COVID data:

Copy the simulation from the previous section and modify it to draw data to represent new `higher_gdp` and `lower_gdp` groups:

In [None]:
# These are common to most simulations:
# 1) Set up the number of iterations (draws from the pool)
# 2) Initialize our counter to zero
num_sims = 10_000
count = 0




for sim in range(num_sims):
    # Bootstrap sampling


    # Calculate the absolute value of the difference of means


    # Update the counter if observed difference as large as original


print(f'Prob. of seeing a result this extreme =~ {count / num_sims: .3f}')

**CONCLUSIONS:**

The observed $p$-value is approx. 0.07, which is above our threshold of 0.05. We **fail to reject the null hypothesis**.

The two groups may have data coming from different distributions, but the data is not sufficient to be confident that the observed difference is not simply an effect of random sampling with small sample sizes.

**Make into function**

In [None]:
def resample_mean(pooled_data, diff, len1, len2, num_sims=10_000):
    '''Resample from pooled data and conduct a two-tailed NHST 
    on the mean-difference
    
    Inputs
    ------
    pooled_data: a NumPy array containing all the data 
                 in the original 2 groups
    
    diff: the observed difference in sample means 
          in the original groups
    
    len1, len2: the lengths of the original groups
    
    num_sims: the number of simulation iterations
    
    Output
    ------
    prints resulting $p$-value
    '''
  


    print(f'Prob. of seeing a result this extreme =~ {count / num_sims}')
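
One possible body for such a function is sketched below. It follows the signature above but also returns the estimate in addition to printing it, which is an assumption made here so that the result can be reused. The example data at the bottom is made up.

```python
import numpy as np
import numpy.random as npr


def resample_mean(pooled_data, diff, len1, len2, num_sims=10_000):
    '''Sketch of a two-tailed bootstrap NHST on the difference of means.'''
    count = 0
    for sim in range(num_sims):
        # Bootstrap sampling from the pooled data
        sample1 = npr.choice(pooled_data, len1)
        sample2 = npr.choice(pooled_data, len2)

        # Two-tailed test: compare magnitudes
        sample_diff = abs(sample2.mean() - sample1.mean())
        if sample_diff >= abs(diff):
            count += 1

    p_value = count / num_sims
    print(f'Prob. of seeing a result this extreme =~ {p_value}')
    return p_value  # returning the value is an addition in this sketch


# Hypothetical usage with made-up data
pooled = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 9.0])
p = resample_mean(pooled, 4.0, 3, 3, num_sims=2_000)
```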

Now let's apply bootstrap resampling when the states are grouped using the urban index:

We can use our `resample_mean()` function to perform the two-sided test via bootstrap resampling:

**CONCLUSIONS**

The observed $p$-value is approximately 0.01, which is below our $5\%$ threshold, so the result is statistically significant. More specifically, our conclusion is that the 25 states with a higher urban index have a statistically significant difference in COVID rates in comparison to the 25 states with a lower urban index. 

## Distribution of the bootstrap mean-difference

In each iteration of the bootstrap simulation, random sampling produces a new difference of sample means, which is a numerical random value.

These random values can be characterized by the set of values and the mapping of probability to those values. 

We call this the *distribution* of the random values.

The distribution of the test statistic can be estimated using a histogram of  the observed values.

Create new function to store and plot histogram of test statistics:

In [None]:
def resample_mean_hist(pooled_data, diff, len1, len2, num_sims=10_000):
    '''Resample from pooled data and conduct a two-tailed NHST
    on the mean-difference
    
    Inputs
    ------
    pooled_data: a NumPy array containing all the data
                 in the original 2 groups
    
    diff: the observed difference in sample means
          in the original groups
    
    len1, len2: the lengths of the original groups
    
    num_sims: the number of simulation iterations
    
    Output
    ------
    prints resulting $p$-value
    plot histogram of the differences of sample means
    '''


    print(f'Prob. of seeing a result this extreme =~ {count / num_sims}')
    
    # *** Plot the histogram with 40 bins ***
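
A sketch of one way to collect the simulated differences follows. The helper name `bootstrap_mean_diffs` is hypothetical, the data is made up, and the draws are vectorized rather than looped, which is a design choice of this sketch. The bin counts are computed with `np.histogram()`. Passing the same array to `plt.hist(diffs, bins=40)` would draw the plot.

```python
import numpy as np
import numpy.random as npr


def bootstrap_mean_diffs(pooled_data, len1, len2, num_sims=10_000):
    '''Hypothetical helper: return signed differences of bootstrap means.'''
    # Each row is one simulated group drawn with replacement from the pool
    means1 = npr.choice(pooled_data, size=(num_sims, len1)).mean(axis=1)
    means2 = npr.choice(pooled_data, size=(num_sims, len2)).mean(axis=1)
    return means2 - means1


# Hypothetical usage with made-up data
pooled = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 9.0])
diff = 4.0
diffs = bootstrap_mean_diffs(pooled, 3, 3, num_sims=2_000)

# Two-tailed p-value estimate from the stored differences
p_value = np.mean(np.abs(diffs) >= abs(diff))
print(f'Prob. of seeing a result this extreme =~ {p_value}')

# Bin the signed differences into 40 bins
counts, edges = np.histogram(diffs, bins=40)
```

Storing the signed differences keeps the option of a one-sided test open, since the magnitude is only taken when the $p$-value is computed.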

**Note:** histogram does not depend on the mean difference observed in the data. 

**OBSERVATIONS**
The majority of the values fall between -2 and +2. Thus, it is not surprising that getting a mean difference with magnitude as large as 2.34 is rare.


___