# Ch7 - Hypothesis and Inference

The *science* part of data science frequently involves forming and testing hypotheses about data and the processes that generate it.

## Statistical Hypothesis Testing

Often we want to test whether a certain hypothesis is likely to be true. For our purposes, hypotheses are assertions like “this coin is fair” or “data scientists prefer Python to R” that can be translated into statistics about data.

*Under various assumptions*, those statistics can be thought of as **observations of random variables from known distributions**, which allows us to make statements about how likely those assumptions are to hold.

In the classical setup, we have a null hypothesis `H0` that represents some default position and an alternative hypothesis `H1` that we’d like to compare it with. We use statistics to decide whether we can reject `H0` as false or must fail to reject it.

### Example: Flipping a Coin

Imagine we have a coin and want to test whether it’s fair. We assume the coin has some probability `p` of landing heads, so our null hypothesis `H0` is that the coin is fair, that is, `p = .5`. We’ll test this against the alternative hypothesis `H1`: `p != .5`.

In particular, our test will involve flipping the coin some number `n` of times and counting the number of heads `X`. *Each coin flip is a Bernoulli trial*, which means that `X` is a *Binomial(n,p) random variable*, which we can approximate using the normal (Gaussian) distribution:



In [1]:
import math

def normal_approx_to_binomial(n,p):
    """Find mu and sigma corresponding to 
    a Binomial(n,p) random variable"""
    mu = n*p
    sigma = math.sqrt(n*p*(1-p))
    return mu, sigma

print(normal_approx_to_binomial(10000,.5))
print(normal_approx_to_binomial(10000,.8))
print(normal_approx_to_binomial(10000,.2))

(5000.0, 50.0)
(8000.0, 39.99999999999999)
(2000.0, 40.0)
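
To get a feel for how good the approximation is, here is a minimal sketch that compares an exact binomial probability with its normal estimate. The helper `binomial_prob_between` and the choice of `n = 100` and the range 40 to 60 are illustrative assumptions, not part of the original notes, and `math.comb` assumes Python 3.8 or later.

```python
import math

def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def binomial_prob_between(lo, hi, n, p):
    """Exact P(lo <= X <= hi) for X ~ Binomial(n, p) (hypothetical helper)."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(lo, hi + 1))

n, p = 100, 0.5
mu, sigma = n * p, math.sqrt(n * p * (1 - p))

exact = binomial_prob_between(40, 60, n, p)
# X only takes integer values, so widen the interval by 0.5 on each side
approx = normal_cdf(60.5, mu, sigma) - normal_cdf(39.5, mu, sigma)

print(exact, approx)
```

The two numbers should agree to a couple of decimal places, and the agreement generally improves as `n` grows.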


Whenever a random variable follows a normal distribution, we can use `normal_cdf()` to figure out the probability its realized value lies within (or outside) a particular interval:

In [3]:
## Normal/Gaussian CDF = probability the variable is BELOW a threshold
def normal_cdf(x,mu=0,sigma=1):
    return (1+math.erf((x-mu)/math.sqrt(2)/sigma))/2
normal_prob_below = normal_cdf

# it is ABOVE threshold if it's NOT BELOW threshold
def normal_prob_above(lo,mu=0,sigma=1):
    return 1 - normal_cdf(lo,mu,sigma)

# it is WITHIN if it's LESS THAN HI + MORE THAN LO
def normal_prob_between(lo,hi,mu=0,sigma=1):
    return normal_cdf(hi,mu,sigma) - normal_cdf(lo,mu,sigma)

# it is OUTSIDE if it's NOT BETWEEN LO AND HI
def normal_prob_outside(lo,hi,mu=0,sigma=1):
    return 1 - normal_prob_between(lo,hi,mu,sigma)
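
As a quick sanity check on these helpers, the sketch below recovers the familiar rule of thumb that roughly 68% of a normal distribution lies within one standard deviation of the mean and roughly 95% within two. The values of `mu` and `sigma` are arbitrary illustrative choices.

```python
import math

def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def normal_prob_between(lo, hi, mu=0, sigma=1):
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

mu, sigma = 500, 15.8  # arbitrary example values

within_1 = normal_prob_between(mu - sigma, mu + sigma, mu, sigma)
within_2 = normal_prob_between(mu - 2 * sigma, mu + 2 * sigma, mu, sigma)
outside_2 = 1 - within_2

print(within_1, within_2, outside_2)
```

The result does not depend on the particular `mu` and `sigma`, because the helpers standardize their input before calling `math.erf`.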

We can also do the reverse: *find either the nontail region or the (symmetric) interval around the mean that accounts for a certain level of likelihood*.

For example, to find an interval centered at the mean and containing 60% probability, we find the cutoffs where the upper and lower tails each contain 20% of the probability (leaving 60%):

In [4]:
def inverse_normal_cdf(p,mu=0,sigma=1,tolerance=.00001):
    """Find the approximate inverse using binary search"""
    
    # if not standard, standardize and re-scale
    if mu != 0 or sigma != 1:
        return mu + sigma*inverse_normal_cdf(p,tolerance=tolerance)
    
    low_z, low_p = -10.0, 0  # normal_cdf(-10) = very close to 0
    hi_z, hi_p = 10.0, 1     # normal_cdf(10) = very close to 1
    
    while hi_z - low_z > tolerance: 
        mid_z = (low_z + hi_z) / 2 # consider midpoint
        mid_p = normal_cdf(mid_z)  # and the CDF's value there
        
        if mid_p < p:
            # if midpoint is still too low, search above it
            low_z, low_p = mid_z, mid_p
        elif mid_p > p:
            # if midpoint is still too high, search below it
            hi_z, hi_p = mid_z, mid_p 
        else:
            break
    
    return mid_z

def normal_upper_bound(p,mu=0,sigma=1):
    """Return the z for which P(Z <= z) = given p"""
    return inverse_normal_cdf(p,mu,sigma)

def normal_lower_bound(p,mu=0,sigma=1):
    """Return the z for which P(Z >= z) = given p"""
    return inverse_normal_cdf(1-p,mu,sigma)

def normal_two_sided_bounds(p,mu=0,sigma=1):
    """Returns symmetric (about the mean) interval
    that contains given p"""
    tail_probability = (1-p)/2
    
    # upper bound should have tail_probability above it
    upper_bound = normal_lower_bound(tail_probability,mu,sigma)
    
    # lower bound should have tail_probability below it
    lower_bound = normal_upper_bound(tail_probability,mu,sigma)
    
    return lower_bound,upper_bound
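
The 60% example can be cross-checked against the standard library. This sketch assumes Python 3.8 or later, where `statistics.NormalDist` provides `cdf` and `inv_cdf`. The helper name `two_sided_bounds` is hypothetical and is a stand-in for the binary-search version above, not a replacement the notes prescribe.

```python
from statistics import NormalDist

def two_sided_bounds(p, mu=0.0, sigma=1.0):
    """Symmetric interval around mu that contains probability p."""
    tail = (1 - p) / 2
    dist = NormalDist(mu, sigma)
    # lower cutoff leaves `tail` below it, upper cutoff leaves `tail` above it
    return dist.inv_cdf(tail), dist.inv_cdf(1 - tail)

lo, hi = two_sided_bounds(0.60)
covered = NormalDist().cdf(hi) - NormalDist().cdf(lo)

print(lo, hi, covered)
```

For the standard normal the cutoffs come out near -0.84 and 0.84, and the probability between them is 0.60, with 20% left in each tail.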

Let's say we choose to flip the coin `n=1000` times. If `H0` is true, then `X` should be distributed approximately normally with mean 500 and standard deviation 15.8:

In [9]:
mu_0, sigma_0 = normal_approx_to_binomial(1000,.5)
print(mu_0, sigma_0)

500.0 15.811388300841896


Now, we need to make a decision about **significance** — how willing we are to make a **type 1 error (“false positive”, or FP)**, in which we reject `H0` even though it’s true. This willingness is often set at 5% or 1%.

Consider the test that rejects if X falls outside the bounds given by:

In [11]:
# 5% significance level
normal_two_sided_bounds(.95,mu_0,sigma_0)

(469.01026640487555, 530.9897335951244)

Assuming `p` *really* equals 0.5 (i.e., `H0` is true), there is just a 5% chance we observe an `X` that lies outside this interval, `(469,531)`, which is the *exact* significance we wanted. 

Said differently, if `H0` is true, then, approximately 19 times out of 20, this test will give the correct result.
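
Because `X` is really a discrete count, the actual type 1 error rate of this test is only approximately 5%. A minimal sketch, assuming a fair coin and the rounded bounds from the cell above, computes it exactly by counting flip sequences. The variable names are illustrative, and `math.comb` assumes Python 3.8 or later.

```python
import math

n = 1000
lo, hi = 469.01, 530.99  # rounded bounds from normal_two_sided_bounds above

# With a fair coin all 2**n flip sequences are equally likely, so count
# the sequences whose number of heads falls outside the interval.
rejected = sum(math.comb(n, k) for k in range(n + 1) if k < lo or k > hi)
type_1_rate = rejected / 2**n

print(type_1_rate)
```

The result should land close to, but not exactly at, 0.05, since the cutoffs came from a continuous approximation.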

We are also often interested in the **power** of a test, which is **the probability of not making a type 2 error ("false negative", or FN)**, in which we fail to reject `H0` even though it’s false.

To measure this, we must specify *exactly* what "`H0` being false" means. (Merely knowing that `p` is *not* 0.5 doesn’t give us much information about the distribution of `X`.)

In particular, let’s check what happens if p is *actually* 0.55, so that the coin is slightly biased toward heads. In that case, we can calculate the power of the test with:

In [12]:
# 95% bounds based on assumption that p=.5 (coin is fair)
lo,hi = normal_two_sided_bounds(.95,mu_0,sigma_0)

# actual mu + sigma based on p=.55
mu_1, sigma_1 = normal_approx_to_binomial(1000,.55)

# Type II Error (FN) = fail to reject H0 when it is indeed false
# Occurs when random variable X is still in the original interval
type_2_probability = normal_prob_between(lo,hi,mu_1,sigma_1)

power = 1 - type_2_probability
print(power)

0.886548001295367
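
Power depends on how far the true `p` is from 0.5. The sketch below wraps the same calculation in a hypothetical helper, `two_sided_power`, and evaluates it for a few illustrative values of `p`. It uses `statistics.NormalDist` (Python 3.8 or later) in place of the hand-written CDF functions, which is a substitution made here for brevity.

```python
import math
from statistics import NormalDist

def normal_approx_to_binomial(n, p):
    return n * p, math.sqrt(n * p * (1 - p))

def two_sided_power(n, true_p, null_p=0.5, significance=0.05):
    """Power of the two-sided test when the real heads probability is true_p."""
    null = NormalDist(*normal_approx_to_binomial(n, null_p))
    lo = null.inv_cdf(significance / 2)
    hi = null.inv_cdf(1 - significance / 2)

    # type 2 error: X still lands inside the acceptance interval
    actual = NormalDist(*normal_approx_to_binomial(n, true_p))
    type_2_probability = actual.cdf(hi) - actual.cdf(lo)
    return 1 - type_2_probability

for true_p in (0.50, 0.52, 0.55, 0.60):
    print(true_p, round(two_sided_power(1000, true_p), 3))
```

When `true_p` equals 0.5 the "power" is just the significance level, and it climbs toward 1 as the coin becomes more biased.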


Imagine instead that `H0` was that the coin is *not* biased toward heads, or that `p<=.5`. In that case we want a *one-sided test* that rejects `H0` when `X` is much larger than 500, but *not* when `X` is smaller than 500.

So a 5%-significance test involves using `normal_upper_bound` to find the cutoff below which 95% of the probability lies:

In [13]:
hi = normal_upper_bound(.95,mu_0,sigma_0)
print(hi)

526.0073585242053


This cutoff of 526 is less than 531, the upper bound of the two-sided interval, since we need more probability in the upper tail.

In [15]:
type_2_probability = normal_prob_below(hi,mu_1,sigma_1)
power = 1-type_2_probability
print(power)

0.9363794803307173


This is a more powerful test, since it no longer rejects `H0` when X is below 469 (very unlikely to happen if `H0` is true) and instead rejects when X is between 526 and 531 (somewhat likely to happen if `H0` is true). 

### p-values

An alternative way of thinking about the preceding test involves **p-values**. Instead of choosing bounds based on some probability cutoff, we compute the probability (assuming `H0` is true) that we'd see a value *at least as extreme* as the one actually observed.

For our two-sided test of whether the coin is fair, we compute:

In [16]:
def two_sided_p_val(x,mu=0,sigma=1):
    # if observed value > mean, the tail is what's GREATER than x
    if x >= mu:
        return 2*normal_prob_above(x,mu,sigma)
    # if observed value < mean, the tail is what's LESS than x
    else:
        return 2*normal_prob_below(x,mu,sigma)

If we saw 530 heads in 1000 flips, we'd compute:

In [18]:
mu_0, sigma_0 = normal_approx_to_binomial(1000,.5)
print(mu_0, sigma_0)

print(two_sided_p_val(530,mu_0,sigma_0))

500.0 15.811388300841896
0.05777957112359733


In [19]:
mu_0, sigma_0 = normal_approx_to_binomial(1000,.5)
print(mu_0, sigma_0)

print(two_sided_p_val(529.5,mu_0,sigma_0))

500.0 15.811388300841896
0.06207721579598857


* Using `529.5` instead of `530` is a **continuity correction**. It reflects the fact that `normal_prob_between(529.5,530.5,mu_0,sigma_0)` is a better estimate of the probability of seeing 530 heads than `normal_prob_between(530,531,mu_0,sigma_0)` is.

* Correspondingly, `normal_prob_above(529.5,mu_0,sigma_0)` is a better estimate of the probability of seeing at least 530 heads.
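
The claim about the continuity correction can be checked directly. This is a minimal sketch, assuming a fair coin and 1000 flips, that compares both normal estimates with the exact binomial probability of seeing exactly 530 heads. The variable names are illustrative, and `math.comb` assumes Python 3.8 or later.

```python
import math

def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def normal_prob_between(lo, hi, mu=0, sigma=1):
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

n = 1000
mu_0, sigma_0 = n * 0.5, math.sqrt(n * 0.5 * 0.5)

# exact probability of exactly 530 heads with a fair coin
exact = math.comb(n, 530) / 2**n

corrected = normal_prob_between(529.5, 530.5, mu_0, sigma_0)
uncorrected = normal_prob_between(530, 531, mu_0, sigma_0)

print(exact, corrected, uncorrected)
```

The interval centered on 530 should track the exact value much more closely than the interval that starts at 530.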

One way to convince yourself this is a sensible estimate is with a simulation:

In [20]:
import random

extreme_value_count = 0
for _ in range(100000):
    # count heads in 1000 flips
    num_heads = sum(1 if random.random() < .5 else 0
                   for _ in range(1000))
    
    # count how often this value is at least "extreme"
    if num_heads >= 530 or num_heads <= 470:
        extreme_value_count += 1
    
print(extreme_value_count/100000)    

0.06217
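
The simulation is random, so its result wobbles from run to run. As a deterministic cross-check, the sketch below computes the same two-sided probability exactly from binomial counts, assuming a fair coin. The variable names are illustrative and `math.comb` assumes Python 3.8 or later.

```python
import math

n, observed = 1000, 530
distance = abs(observed - n / 2)  # 30 heads away from the expected 500

# count flip sequences whose head count is at least as far from 500
extreme = sum(math.comb(n, k) for k in range(n + 1)
              if abs(k - n / 2) >= distance)
exact_p_value = extreme / 2**n

print(exact_p_value)
```

This covers the same event the simulation counts (`num_heads >= 530 or num_heads <= 470`), so the two numbers should be close.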


Since this p-value is greater than our 5% significance level, we do NOT reject `H0`. If instead we saw 532 heads:

In [21]:
print(two_sided_p_val(531.5,mu_0,sigma_0))

0.046345287837786575


Now we CAN reject `H0`. This is the exact same test as before, just a different way of approaching the statistics.

Similarly, for one-sided tests we'd have:

In [23]:
upper_p_value = normal_prob_above
lower_p_value = normal_prob_below

For our one-sided test, if we saw 525 heads vs. 527 heads:

In [25]:
print(upper_p_value(524.5, mu_0, sigma_0))
print(upper_p_value(526.5, mu_0, sigma_0))

0.06062885772582083
0.04686839508859242


With 525 heads we'd fail to reject `H0` (the p-value is above .05), while with 527 heads we'd reject it (the p-value is below .05).

### WARNING

Make sure the data is roughly normally distributed before using `normal_prob_above` to compute p-values. The annals of bad data science are filled with examples of people opining that the chance of some observed event occurring at random is one in a million, when what they really mean is “the chance, assuming the data is distributed normally,” which is pretty meaningless if the data isn’t.

There are various statistical tests for normality, but even plotting the data is a good start.
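
Short of a plot or a formal test, one very rough screen is to compare how much of the data falls within one and two standard deviations of the mean against the roughly 68% and 95% a normal distribution would give. The sketch below is only that, a crude screen on synthetic data. The helper `fraction_within` and both sample datasets are assumptions made for illustration.

```python
import random
import statistics

def fraction_within(data, num_sds):
    """Fraction of data within num_sds standard deviations of the mean."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)
    inside = sum(1 for x in data if abs(x - mean) <= num_sds * sd)
    return inside / len(data)

random.seed(0)
normal_data = [random.gauss(0, 1) for _ in range(10000)]
skewed_data = [random.expovariate(1) for _ in range(10000)]

for name, data in (("normal", normal_data), ("skewed", skewed_data)):
    print(name, fraction_within(data, 1), fraction_within(data, 2))
```

The skewed sample puts noticeably more than 68% of its values within one standard deviation, which is a hint that normal-based p-values would be misleading for it. Passing this check does not establish normality.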