In [None]:
import numpy as np

# Hypothesis Testing

The goal of hypothesis testing is to construct a logical sequence of steps that follows a standard pattern and fits into the probability computations we have been doing with point estimators.

## Statistical Tests

The common elements for a test will be:

- A Null Hypothesis: $H_0$
- An Alternative Hypothesis:  $H_a$
- A Test Statistic
- A Rejection Region

The Alternative Hypothesis must be the logical complement of the Null Hypothesis:  $H_a = \neg H_0$

The test looks for sufficient evidence to accept the *Alternative Hypothesis*; in the absence of such evidence we continue to treat the *Null Hypothesis* as true. This distinction affects how we should construct the test - we choose for the Alternative Hypothesis the statement that needs evidence (the thing we are hoping to show).

### Example: Car Accidents

A city has made a major change to the timing of lights in the city to improve safety. Prior to this, accidents in lighted intersections in the city fit a Poisson Distribution with a mean of 25 accidents per month. In the first month after the changes there were 8 accidents. Do we have sufficient evidence to conclude that intersections are safer?

Notice we are asking:  

- Could 8 or fewer accidents just be the result of randomness in a Poisson Distribution that still has mean 25? (This will be our Null Hypothesis.)
- The alternative hypothesis is thus that the mean of the distribution has changed (here, decreased). Ideally, for hypothesis testing in general, we are confident that not so much has changed that the entire distribution has changed.

Note the choice: the alternative hypothesis is the one for which we are searching for sufficient evidence. Or, to rephrase, it is the one for which we are trying to determine whether the evidence we have is significant. The null hypothesis, by contrast, boils down to the claim that the evidence we appear to have is just the result of the randomness in the problem.

- The Test Statistic is $Y$, the number of accidents in the month since the changes.
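
Written out formally, the setup above can be sketched as follows. Evaluating the probability at the boundary case $\mu = 25$ is the usual convention and is an assumption about the intended null hypothesis; $y$ denotes the observed count.

$$ H_0: \mu \geq 25 \qquad H_a: \mu < 25 $$

$$ P(Y \leq y \mid \mu = 25) = \sum_{k=0}^{y} \frac{25^k e^{-25}}{k!} $$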


In [None]:
from scipy.stats import poisson

In [None]:
mu = 25
y = 8  # observed number of accidents in the first month after the change

# This is our first time using scipy.stats for a discrete variable. For discrete variables the method .pmf 
# is the probability mass function and gives the probability at a single value.

# The .cdf method is the cumulative distribution function and gives P(Y \leq y)

poisson.cdf(y, mu)

In [None]:
# Verifying that the cdf is the \leq variety

sum( [ poisson.pmf(k, mu) for k in range(y+1) ])

So we see that there is less than a 5% chance that a Poisson random variable with mean 25 would return a value of $Y \leq 8$. We can reject the null hypothesis at the 5% significance level (95% confidence) and conclude that the modifications to the light timings have reduced the mean number of accidents per month. 

#### Rejection Region

For the moment forget that we know what $Y$ is:

- The Rejection Region is the set of values of $Y$ for which we would conclude there is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis.

In [None]:
# Using the ppf we can ask our distribution where the cdf crosses 0.05.
# For a discrete variable it returns the smallest value at which the cdf reaches or exceeds 0.05.

poisson.ppf(0.05, mu)

In [None]:
# note that at 17 the probability is bigger than 0.05 so we use 16.

poisson.cdf(16, mu)

And so any value of $Y\leq 16$ would lead to us rejecting the Null Hypothesis.
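
The two cells above can be wrapped into a small helper. This is a minimal sketch: the function name `lower_tail_rejection_region` and its arguments are hypothetical, chosen only for illustration. It returns the largest cutoff whose lower tail probability does not exceed the chosen level, along with the tail probability actually attained.

```python
from scipy.stats import poisson

def lower_tail_rejection_region(mu0, alpha=0.05):
    """Return (c, size): the largest c with P(Y <= c | mu0) <= alpha, and that probability.

    Hypothetical helper for illustration only.
    """
    # ppf gives the smallest k whose cdf is at least alpha
    k = int(poisson.ppf(alpha, mu0))
    # If the cdf at k overshoots alpha, step back one value
    c = k if poisson.cdf(k, mu0) <= alpha else k - 1
    return c, poisson.cdf(c, mu0)

c, size = lower_tail_rejection_region(25)
print(c, size)
```

Because $Y$ is discrete, the attained probability is generally somewhat smaller than the nominal 0.05.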

### Type I Errors: Rejecting $H_0$ when it is TRUE

There are two ways we could reach the wrong conclusion. The first, called a Type I error, is that we reject the null hypothesis when it was in fact true. We use $\alpha$ to denote the probability of a Type I error.



### Type II Errors:  Accepting $H_0$ when it is FALSE

Type II error is when we accept the null hypothesis but it is in fact false. We use $\beta$ to denote the probability of a Type II error.


### Errors

Taking a step back and looking at our test, a natural question is: given the rejection region $Y\leq 16$, how likely is each of the two errors?

Type I Error:  The test is built for 95% confidence, so the probability that we incorrectly reject the null hypothesis is at most 5%. Because $Y$ is discrete, the actual value is $P(Y \leq 16)$ under mean 25, which is somewhat below 5%. Note this is the parameter we can tune for the test.

Type II Error:  This one takes some work to compute. It is the likelihood that $Y$ falls outside the rejection region (i.e. $Y \geq 17$) even though $H_0$ is false.

$$ \beta = P(\mbox{type II error}) = P(Y \geq 17 \quad \mbox{when the alternative hypothesis is true}) $$

It is a little bit open ended because if the mean is not 25, we do not know what it is. However, generally we expect that the smaller the difference between the true mean and 25, the greater the likelihood of a Type II error. So to get a value for $\beta$, we compute the above probability for some specific mean less than 25. Generally I approach it this way: if we are rejecting $H_0$ and accepting $H_a$, then our best estimate for the new mean comes from the sample we have just collected, and so $\mu_a = Y$. 
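
To see how strongly $\beta$ depends on the assumed alternative mean, here is a small sketch that evaluates $P(Y \geq 17)$ for several candidate means. The helper name `type_ii_error` and the list of candidate means are illustrative assumptions, not values taken from the data.

```python
from scipy.stats import poisson

def type_ii_error(mu_a, reject_max=16):
    """P(Y > reject_max | mean mu_a): the chance of landing outside the rejection region.

    Hypothetical helper; reject_max=16 is the rejection region Y <= 16 found above.
    """
    # sf is the survival function, 1 - cdf, so this is P(Y >= reject_max + 1)
    return poisson.sf(reject_max, mu_a)

# Illustrative alternative means (assumed values)
candidate_means = [8, 12, 17, 20, 24]
betas = [type_ii_error(m) for m in candidate_means]

for m, b in zip(candidate_means, betas):
    print(m, round(b, 4))
```

The closer the alternative mean sits to 25, the larger $\beta$ becomes.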

In [None]:
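# In practice this is then something we tune. 
# We choose the cutoff that balances the chance of a Type I and Type II error 
# to what we can accept. 

# The rejection region is Y <= 16, so 17 is the smallest value that keeps H_0.
cutoff = 17
mu_a = cutoff   # here the alternative mean is placed at the cutoff itself
alpha = poisson.cdf(cutoff - 1, 25)        # P(Y <= 16 | mean 25)
beta = 1 - poisson.cdf(cutoff - 1, mu_a)   # P(Y >= 17 | mean mu_a)

alpha, beta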

In [None]:
import matplotlib.pyplot as plt

In [None]:
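x = np.arange(0, 50, 1)
plt.figure(figsize = (10, 6) )

# Null distribution (mean 25) in blue, alternative with mean 17 in red
plt.plot(x, poisson.pmf(x, 25), 'b-', label='Poisson(25)' )
plt.plot(x, poisson.pmf(x, 17), 'r-', label='Poisson(17)' )
# Dashed line at Y = 17, the first value outside the rejection region Y <= 16
plt.vlines(17, ymin = 0, ymax = poisson.pmf(17, 17), color='r', linestyle='dashed')
plt.legend()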


## This is unsatisfactory

Play around with it a bit and you should notice that it is quite difficult to make both $\alpha$ and $\beta$ small at the same time (especially with a Poisson or other discrete variable).

So what are we to do?  The natural remedy is to increase the sample size (in this case we went with data from just one month). 

Increasing the sample size has two effects:

- The Central Limit Theorem will start to apply as we sample from a distribution and then take the mean of the samples (if the mean is what our hypothesis is about).
- The variance of the sample mean decreases.

----

For Poisson random variables, the other way to think about increasing the sample size is that we are running the time period out longer, which gives a Poisson random variable with a larger mean. 

In [None]:
# The mean number of car accidents per year in this city's intersections is

mu = 25*12
mu

In this case our rejection region would be:


In [None]:
poisson.ppf(0.05, 300)

In [None]:
poisson.cdf(271, 300)

In [None]:
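# Here cutoff is the largest value in the rejection region (Y <= 271)
cutoff = 271
alpha = poisson.cdf(cutoff, 300)          # P(Y <= 271 | mean 300)
beta = 1 - poisson.cdf(cutoff, cutoff)    # P(Y >= 272 | mean 271)
alpha, beta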
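
To see the effect of a longer observation window more directly, the sketch below repeats the computation for several window lengths. It assumes a fixed alternative rate of 17 accidents per month, which is an illustrative value rather than something estimated from data, and the variable names are hypothetical.

```python
from scipy.stats import poisson

rate_null = 25   # accidents per month under H_0
rate_alt = 17    # assumed alternative rate per month (illustrative only)
level = 0.05

results = []
for months in (1, 3, 6, 12):
    mu0 = rate_null * months
    mu_a = rate_alt * months
    # Largest c with P(Y <= c | mu0) <= level
    k = int(poisson.ppf(level, mu0))
    c = k if poisson.cdf(k, mu0) <= level else k - 1
    alpha = poisson.cdf(c, mu0)   # attained Type I error probability
    beta = poisson.sf(c, mu_a)    # P(Y > c | mu_a), the Type II error probability
    results.append((months, c, alpha, beta))
    print(months, c, round(alpha, 4), round(beta, 6))
```

With the alternative rate held fixed, $\alpha$ stays at or below 0.05 while $\beta$ drops quickly as the window grows.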

## Large Sample Hypothesis Test Example

Of course the real power comes when we work with unknown distributions but have a sample size large enough that we are confident the Central Limit Theorem applies. 


In [None]:
import numpy as np
from scipy.stats import norm

The claim is that students at a university work no more than 20 hours per week at their jobs. To test this claim, 25 students are selected at random and asked how many hours per week they work:

In [None]:
data = [0, 5, 10, 12, 12, 12, 19, 19, 19, 19, 20, 20, 25, 30, 30, 35, 35, 35, 39, 39, 40, 41, 42, 50, 51]

In [None]:
# Some items we need for our computation

Ybar = np.mean(data)
S = np.std(data, ddof = 1)
n = len(data)
Ybar

The hypotheses are $H_0: \mu \leq 20$ and $H_a: \mu > 20$.

Our test statistic is $Z_\star = \dfrac{\bar{Y} - 20}{S / \sqrt{n}}$:

In [None]:
Zstar = (Ybar - 20) / S * np.sqrt(n)
Zstar

So the p-value is $P( Z \geq Z_\star)$:

In [None]:
pvalue = 1 - norm.cdf(Zstar)
pvalue

In [None]:
from scipy.stats import t

pvalue_t = 1 - t.cdf(Zstar, n-1)
pvalue_t
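
As a cross-check on the t-based p-value, SciPy's `ttest_1samp` can run the same one-sided test directly. This sketch assumes SciPy 1.6 or later, where the `alternative` argument is available; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import ttest_1samp

hours = [0, 5, 10, 12, 12, 12, 19, 19, 19, 19, 20, 20, 25,
         30, 30, 35, 35, 35, 39, 39, 40, 41, 42, 50, 51]

# One-sided test of H_0: mu <= 20 against H_a: mu > 20
result = ttest_1samp(hours, popmean=20, alternative='greater')

# The same statistic computed by hand, for comparison
by_hand = (np.mean(hours) - 20) / (np.std(hours, ddof=1) / np.sqrt(len(hours)))

print(result.statistic, by_hand, result.pvalue)
```

The statistic should agree with `Zstar` above, and the p-value with `pvalue_t`.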

We have insufficient evidence to reject the null hypothesis at 99% confidence, but sufficient evidence to reject it at 95% confidence. Note there is a difference between the t and normal distributions, but it is not large enough to change either conclusion. 

### Rejection Region

Treating $S$ as fixed, we can ask what our rejection region would look like, i.e. for what values of $\bar{Y}$ we would conclude that we reject the null hypothesis. At 95% confidence ($\alpha = 0.05$) that region would be $\bar{Y}$ bigger than:




In [None]:
cutoff = norm.ppf(0.95)*S/np.sqrt(n) + 20
cutoff

And with that we can then ask what the probability of a Type II error with this test would be. Again to answer this we need to have an alternative mean in mind.

In [None]:
mu_a = 25
zbeta = (cutoff - mu_a) / S * np.sqrt(n) 
zbeta

The probability that we observe $\bar{Y}$ below the rejection cutoff, and so incorrectly keep the null hypothesis when the true mean is 25, is then: 

In [None]:
norm.cdf(zbeta)
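
The same computation can be repeated for other alternative means to see how $\beta$ responds. This is a sketch under the assumption that $S$ is treated as fixed, as above; the helper name `z_test_beta` and the list of alternative means are hypothetical.

```python
import numpy as np
from scipy.stats import norm

hours = [0, 5, 10, 12, 12, 12, 19, 19, 19, 19, 20, 20, 25,
         30, 30, 35, 35, 35, 39, 39, 40, 41, 42, 50, 51]
s = np.std(hours, ddof=1)

def z_test_beta(mu_a, mu0, s, n, alpha=0.05):
    """Type II error probability of the upper-tailed z test when the true mean is mu_a."""
    se = s / np.sqrt(n)
    cutoff = mu0 + norm.ppf(1 - alpha) * se   # reject when the sample mean exceeds this
    return norm.cdf((cutoff - mu_a) / se)     # P(sample mean <= cutoff | mean mu_a)

# Illustrative alternative means (assumed values)
for mu_a in (21, 23, 25, 27, 30):
    print(mu_a, round(z_test_beta(mu_a, 20, s, len(hours)), 4))
```

Alternatives close to 20 are very hard to detect with only 25 students.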

In [None]:
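x = np.arange(15, 30, 0.1)
plt.figure(figsize = (10, 6) )

# Sampling distribution of the sample mean: normal with standard error S/sqrt(n),
# centered at the null mean 20 (blue) and the alternative mean 25 (red)
se = S/np.sqrt(n)
plt.plot(x, norm.pdf(x, loc=20, scale=se), 'b-' )
plt.plot(x, norm.pdf(x, loc=25, scale=se), 'r-' )
plt.vlines(cutoff, ymin = 0, ymax = norm.pdf(cutoff, loc=25, scale=se), color='r', linestyle='dashed')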


### Tuning the test

Note that the likelihood of a Type I error is set by the test, and from that we produce the rejection region. 

The probability of a Type II error is determined by first choosing the mean to assume under the alternative hypothesis and then standardizing the cutoff against that mean. Note that the opposite tail of the distribution is used. The gap between this alternative mean and the null value represents the **sensitivity** of the test.

However, in order to control $\beta$ we notice that we have one more free variable we can tune:  $n$, the number of samples to collect. 

Varying $n$ changes the location of the rejection region, and for a fixed $\mu_a$ it also changes the size of $\beta$. Note it gets complicated because the rejection region also changes as we change $n$.

Let $ z_{\alpha} $ be chosen such that $P ( Z > z_\alpha) = \alpha = 0.05$, which we computed above using the .ppf() method.

Then our rejection region is given by:

$$ \bar{Y} > z_\alpha S / \sqrt{n} + 20 $$

Normalizing this with our alternative mean $25$ for a test sensitivity of 5 hours per week we get:

$$ z_\beta = \sqrt{n} \frac{z_\alpha S / \sqrt{n} + 20 - 25}{S} = z_\alpha - \sqrt{n} \frac{5}{S} $$

and we then choose an $n$ large enough that 

$$\beta = P( Z < z_\beta ) = 0.05 $$ 

You could do some algebra and find it exactly or just plug and check values:



In [None]:
n = 25
zbeta = norm.ppf(0.95) - np.sqrt(n) * 5 / S
beta = norm.cdf(zbeta)
beta
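
The algebra mentioned above can be sketched as follows. Requiring $z_\alpha - \sqrt{n}\,\delta / S \leq -z_{\beta^\ast}$, where $\delta$ is the sensitivity and $P(Z > z_{\beta^\ast})$ equals the target $\beta$, gives $\sqrt{n} \geq (z_\alpha + z_{\beta^\ast}) S / \delta$. The function name `required_n` is hypothetical, and $S$ is treated as fixed at the value from the 25 students above.

```python
import math
import numpy as np
from scipy.stats import norm

hours = [0, 5, 10, 12, 12, 12, 19, 19, 19, 19, 20, 20, 25,
         30, 30, 35, 35, 35, 39, 39, 40, 41, 42, 50, 51]
s = np.std(hours, ddof=1)   # sample standard deviation, assumed unchanged for larger n

def required_n(s, delta, alpha=0.05, beta=0.05):
    """Smallest n with Type II error at most beta at sensitivity delta (upper-tailed z test)."""
    z_alpha = norm.ppf(1 - alpha)
    z_beta = norm.ppf(1 - beta)
    return math.ceil(((z_alpha + z_beta) * s / delta) ** 2)

n_needed = required_n(s, delta=5)

# Check the result against the formula for z_beta used in the cell above
beta_at_n = norm.cdf(norm.ppf(0.95) - np.sqrt(n_needed) * 5 / s)
print(n_needed, beta_at_n)
```

Plugging the result back into the formula for $z_\beta$ confirms that $\beta$ is at or below the target.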

In [None]:
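x = np.arange(15, 30, 0.1)
plt.figure(figsize = (10, 6) )

# Recompute the standard error and cutoff for the current n before plotting
se = S/np.sqrt(n)
cutoff = norm.ppf(0.95)*se + 20
plt.plot(x, norm.pdf(x, loc=20, scale=se), 'b-' )
plt.plot(x, norm.pdf(x, loc=25, scale=se), 'r-' )
plt.vlines(cutoff, ymin = 0, ymax = norm.pdf(cutoff, loc=25, scale=se), color='r', linestyle='dashed')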


It has always surprised me that you do not actually need to take $n$ to be that much larger in most cases.

The only wrinkle in practice is that, because you are going to redo the experiment, the value of $S$ is going to change. The estimate here for $\beta$ is what we would call an *a priori* estimate, and you would want to redo this computation with the new $S$ to get an *a posteriori* estimate. 

### Sensitivity

Note why we say that the difference between $\mu_0$ and $\mu_a$ is the test sensitivity. If we move $\mu_a$ closer to $\mu_0$, we need a larger $n$ to keep $\beta$ as small. Note that once $\mu_a = \mu_0$ we have $z_\beta = z_\alpha$, so $\beta = 1 - \alpha$ no matter how large $n$ is, and there is no freedom available.

Try it by computing the $n$ needed if we want the test to be sensitive to within 1 hour per week and $\alpha = \beta = 0.05$ or less.
