### Hypothesis Testing

[Hypothesis testing](https://en.wikipedia.org/wiki/Statistical_hypothesis_testing) is used to determine whether a sample provides enough statistical evidence to reject a specific hypothesis. It allows us to draw conclusions about an entire population based on a representative sample. These tests evaluate two **mutually exclusive** statements about a population to determine which one is better supported by the sample data. They answer a very precise question with a yes-or-no answer. The statement that is assumed to be true by default is called the **null hypothesis**, while its negation is called the **alternative hypothesis**.

For example, we may want to determine whether a new drug (called "0") is effective for treating insomnia: a sample of patients is randomly selected; half of them are given drug "0", while the other half are given a pill that doesn't contain any medicine (a placebo). The conditions of the patients are then measured and compared. The null hypothesis in this case is the efficacy of the new drug against insomnia, and the test aims to answer the question: "is the new drug effective? Yes or no?". Note that if several drugs are tested at the same time (i.e., there is more than one alternative hypothesis), this kind of test is not suitable. We can only reject the hypothesis that drug "0" is effective; we cannot say anything about other possibilities (say, drugs "B", "C", ...).

Conventionally, the two hypotheses are denoted as follows:
- The null hypothesis is represented by $H_0$ (falling asleep thanks to drug "0")
- The alternative hypothesis is represented by $H_A$ (all other cases)

#### *p*-values and $\alpha$-values
In the context of hypothesis testing, the ***p*-value** is the probability of obtaining results at least as extreme as the results actually observed in the sample, *under the assumption that the null hypothesis is true*. A small *p*-value is evidence **against** the null hypothesis. In the example above, the *p*-value would be the probability of seeing at least as many patients stay awake as were actually observed, if drug "0" really worked.

The *p*-value is almost never considered alone: it is usually compared with the **$\alpha$-value** (also called the *significance level*). The $\alpha$-value is the probability of rejecting the null hypothesis when the null hypothesis is true. It corresponds to **the probability of making a mistake** of this specific kind (i.e., rejecting the null hypothesis incorrectly). This threshold is chosen a priori, before running the test. For example, if we want a confidence level of 95%, we take an $\alpha$-value of 5%. That means that, when the null hypothesis is true, there is only a 5% chance that the test rejects it by mistake.
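
The meaning of $\alpha$ as a false-rejection rate can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the data are synthetic draws from a standard normal distribution (so the null hypothesis "the mean is 0" is true by construction), the population standard deviation is assumed known, and the seed, number of experiments and sample size are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
alpha = 0.05
n_experiments, n_samples = 10_000, 50  # arbitrary, illustrative sizes

# Every experiment draws from N(0, 1), so H0 (mean = 0) is true by construction.
samples = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, n_samples))

# z statistic with known sigma = 1, and the two-sided p-value of each experiment.
z = samples.mean(axis=1) / (1.0 / np.sqrt(n_samples))
p_values = 2 * stats.norm.sf(np.abs(z))

# Fraction of experiments in which a true H0 is (wrongly) rejected.
false_rejection_rate = np.mean(p_values <= alpha)
print(f"Fraction of true nulls rejected: {false_rejection_rate:.3f} (alpha = {alpha})")
```

The printed fraction should be close to $\alpha$, up to simulation noise.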

After choosing the $\alpha$-value (how strict we want to be), we can run the test. We then have two cases:
- if $p \leq \alpha$, then $H_0$ is **rejected** (the observed data are unlikely under $H_0$)
- if $p >\alpha$, then we **fail to reject** $H_0$ (often loosely phrased as "accepting" it: the data are compatible with $H_0$, but they do not prove it)
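
The two cases can be written as a tiny helper. This is a minimal sketch: the function name `decide` and its return strings are hypothetical, and a *p*-value above the threshold is reported as "fail to reject", the usual wording for not having enough evidence against $H_0$.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the decision rule: reject H0 only when p <= alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be a probability in [0, 1]")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    return "reject H0" if p_value <= alpha else "fail to reject H0"


print(decide(0.003))  # well below the default threshold of 0.05
print(decide(0.40))   # well above the default threshold of 0.05
```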

In our example, if $p \leq \alpha$, results at least as extreme as the observed ones (this many patients staying awake) would be rarer under $H_0$ than the error rate we agreed to tolerate, so we reject the hypothesis that drug "0" works. If instead $p > \alpha$, we're not saying that drug "0" works or that other hypotheses don't. We're just saying that the sample we have does not give us enough evidence to reject $H_0$.

#### What type of test to run?

Graphically, the *p*-value is the area of the tail of a probability distribution, beyond the observed value of the test statistic. It is calculated using the sampling distribution of the test statistic under the null hypothesis. Depending on the alternative hypothesis, the rejection region can lie in a single tail or be split between both tails, so a hypothesis test can be **one-tailed** or **two-tailed**.

![](images/hyp_test_types1.png)

From the image above, it's easy to see what the *p*- and the $\alpha$-value represent. The $\alpha$-value is the threshold we set: it fixes the area of the rejection region in the tail(s). If the observed test statistic falls beyond this threshold, i.e., further out in the tails, then the *p*-value is smaller than $\alpha$: such an extreme result still has a non-zero probability of happening when the null hypothesis is true, but that probability is smaller than the one we agreed to tolerate.

Depending on the kind of test you want to run (one- or two-tailed, on a symmetric or asymmetric probability distribution), the *p*-value is defined as follows, where $X$ is the test statistic and $x$ is its observed value:
- $p = \Pr(X \geq x \mid H_0)$ for a one-sided right-tail test,
- $p = \Pr(X \leq x \mid H_0)$ for a one-sided left-tail test,
- $p = 2\min\{\Pr(X \geq x \mid H_0),\Pr(X \leq x \mid H_0)\}$ for a two-sided test with an asymmetric test-statistic distribution,
- $p =\Pr(|X| \geq |x| \mid H_0)$ for a two-sided test with a symmetric test-statistic distribution
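
The four tail probabilities listed above can be sketched with `scipy.stats`. The observed values (`x_obs`, `c_obs`) and the two null distributions (a standard normal and a chi-square with 4 degrees of freedom) are assumptions chosen only for illustration; `sf` is the survival function, i.e. the upper-tail probability, and `cdf` is the lower-tail probability.

```python
from scipy import stats

# Symmetric case: a standard normal null distribution (illustrative choice).
null_dist = stats.norm()
x_obs = 1.8  # hypothetical observed value of the test statistic

p_right = null_dist.sf(x_obs)             # one-sided, right tail
p_left = null_dist.cdf(x_obs)             # one-sided, left tail
p_two_sym = 2 * null_dist.sf(abs(x_obs))  # two-sided, symmetric distribution

# Asymmetric case: a chi-square null distribution with 4 degrees of freedom.
skewed = stats.chi2(df=4)
c_obs = 9.5  # hypothetical observed value of the test statistic

p_two_asym = 2 * min(skewed.sf(c_obs), skewed.cdf(c_obs))  # two-sided, asymmetric

print(f"right tail: {p_right:.4f}, left tail: {p_left:.4f}")
print(f"two-sided (symmetric): {p_two_sym:.4f}")
print(f"two-sided (asymmetric): {p_two_asym:.4f}")
```

For a continuous distribution the right- and left-tail probabilities of the same observed value add up to one, which is why only the smaller of the two is doubled in the asymmetric two-sided case.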

#### Example
We can use a dataset that contains some information about human body temperature, restricting ourselves to male subjects. From the medical literature, we know that the average body temperature should be 98.6 degrees Fahrenheit. Let's see if we can infer this result from our dataset. In other words, we want to test whether the sample mean is compatible with the theoretical mean, targeting a confidence level of 95%.

We will set the hypotheses as follows:
- $H_0: \mu = 98.6 \leftrightarrow$ the mean male body temperature is 98.6 degrees
- $H_A: \mu \neq 98.6 \leftrightarrow$ the mean male body temperature is NOT 98.6 degrees

In this case, since the alternative hypothesis has a $\neq$ condition, we will use a **two-sided test**. Also, $\alpha = 1-0.95 = 0.05$.
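
For a two-sided test, $\alpha$ is split equally between the two tails. As a minimal sketch, assuming a standard normal test statistic, the corresponding critical value can be obtained from the inverse CDF (`ppf`) in `scipy.stats`; the variable name `z_crit` is just a label used here.

```python
from scipy import stats

alpha = 1 - 0.95  # significance level for a 95% confidence level

# Each tail gets alpha/2, so the critical value is the (1 - alpha/2) quantile.
z_crit = stats.norm.ppf(1 - alpha / 2)
print(f"Reject H0 when |z| > {z_crit:.3f}")

# Sanity check: the two tails beyond +/- z_crit together have area alpha.
tail_area = 2 * stats.norm.sf(z_crit)
print(f"Total tail area: {tail_area:.2f}")
```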

In [2]:
import pandas as pd
import numpy as np
import scipy.stats as stats
from statistics import mean, stdev

In [3]:
body = pd.read_csv('data/bodytemp.csv')
body.head() # 0 means male and 1 means female

| Unnamed: 0 | temp | sex | bpm |
|---|---|---|---|
| 0 | 96.3 | 0 | 70 |
| 1 | 96.7 | 0 | 71 |
| 2 | 96.9 | 0 | 74 |
| 3 | 97.0 | 0 | 80 |
| 4 | 97.1 | 0 | 73 |


In [4]:
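male_temp = body.loc[body['sex'] == 0, 'temp']  # temperatures of the male subjects only
n = len(male_temp)      # sample size
m = male_temp.mean()    # sample mean
s = male_temp.std()     # sample standard deviation (pandas divides by n-1)
err = s/np.sqrt(n)      # standard error of the mean
alpha = 1-0.95          # significance level for a 95% confidence level
print(f"N. of Samples:\t{n}\nMean:\t\t{m}\nStd.Dev.:\t{s}\nError:\t\t{err}\nAlpha:\t\t{alpha:.2f}")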

N. of Samples:	65
Mean:		98.10461538461539
Std.Dev.:	0.6987557623265904
Error:		0.08666998552285868
Alpha:		0.05


As we can see, the sample mean is about 0.5 degrees below the theoretical mean ($\mu=98.6$). However, we cannot decide whether to reject the null hypothesis that this sample belongs to that population without performing a hypothesis test. If we reject the null hypothesis, then it means that our sample comes from another population or, at least, that its mean is not 98.6.

In this case, the test statistic we can take is the z-score, which measures the distance between the sample and population means, rescaled by the standard error of the mean. We can compute the z-score, measure the corresponding $p$-value (i.e., the probability of a z-score at least that extreme under $H_0$) and compare it to $\alpha$, i.e., one minus the confidence level we want.

In [7]:
# calculate the test statistic (z-score)
z_score = (m - 98.6) / err
print(f"z-score: {z_score:.6f}")

# calculate the p-value
p_value = 2 * stats.norm.sf(abs(z_score)) # two-sided test: twice the upper-tail probability (survival function), not the density (pdf)
print(f"p-value: {p_value:.2e}")
print(f"alpha-value: {alpha:.2f}")
print(f"==> Since {p_value:.2e} << {alpha:.2f}, we REJECT the null hypothesis.")

z-score: -5.715757
p-value: 1.09e-08
alpha-value: 0.05
==> Since 1.09e-08 << 0.05, we REJECT the null hypothesis.


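Now, the population standard deviation is actually unknown: we estimated it from the sample, so the z-score only approximately follows a Gaussian distribution (an approximation that is usually considered acceptable for more than 30 samples, as is the case here). Let's see whether the conclusion changes if we assume instead that the test statistic follows a Student's t-distribution: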

In [8]:
n = len(body[body['sex'] == 0].temp)
m = mean(body[body['sex'] == 0].temp)  # same value as pandas .mean()
s = stdev(body[body['sex'] == 0].temp) # sample standard deviation (n-1), same as pandas .std() up to floating-point rounding
err = s/np.sqrt(n)
alpha = 1-0.95
print(f"N. of Samples:\t{n}\nMean:\t\t{m}\nStd.Dev.:\t{s}\nError:\t\t{err}\nAlpha:\t\t{alpha:.2f}")

N. of Samples:	65
Mean:		98.10461538461539
Std.Dev.:	0.698755762326591
Error:		0.08666998552285875
Alpha:		0.05


In [9]:
# calculate the test statistic (t-score)
t_score = (m - 98.6) / err
print(f"t-score: {t_score:.6f}")

# calculate the p-value
p_value = 2 * stats.t.sf(abs(t_score), n-1) # two-sided test using the survival function (upper-tail probability) => times 2
print(f"p-value: {p_value:e}")
print(f"alpha-value: {alpha:.2f}")
print(f"==> Since {p_value:e} << {alpha:.2f}, we REJECT the null hypothesis.")
print("The test statistic lies beyond the threshold we set (i.e., further out in the tail).")

t-score: -5.715757
p-value: 3.083840e-07
alpha-value: 0.05
==> Since 3.083840e-07 << 0.05, we REJECT the null hypothesis.
The test statistic lies beyond the threshold we set (i.e., further out in the tail).


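Above we can see that the conclusion is the same: we reject the null hypothesis, i.e., the data do not support the claim that this sample is drawn from a population of mean 98.6. More importantly, notice that the two reference distributions give different *p*-values for the same value of the test statistic: the Student's t-distribution has heavier tails than the Gaussian, so its *p*-value is larger.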
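
The effect of the reference distribution can be isolated by feeding the same statistic to both. The sketch below reuses the value of the statistic and the sample size printed above, so it does not need the dataset; the variable names are only labels chosen here.

```python
from scipy import stats

statistic = -5.715757  # test statistic printed in the cells above
n = 65                 # number of samples printed in the cells above

# Two-sided p-values: twice the upper-tail probability of |statistic|.
p_normal = 2 * stats.norm.sf(abs(statistic))
p_student = 2 * stats.t.sf(abs(statistic), df=n - 1)

print(f"Gaussian:  {p_normal:.2e}")
print(f"Student t: {p_student:.2e}")
print(f"Ratio t / Gaussian: {p_student / p_normal:.1f}")
```

Far out in the tails the ratio between the two *p*-values is large even for 64 degrees of freedom, although both remain far below $\alpha$.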

Remember that the *p*-value here is obtained by taking twice the tail probability (the survival function `sf`), because we're performing a two-sided test using a *symmetric* probability distribution. In general, one-sided tests, or tests where the probability distribution has different left and right tails, can be performed following steps similar to the ones in this notebook.
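
As a cross-check of the manual steps, and as an example of a one-sided variant, the sketch below compares the hand-computed t-test with `scipy.stats.ttest_1samp`. Two assumptions are worth stating: the sample is synthetic (drawn with a mean and spread loosely inspired by the numbers above, not the actual dataset), and the `alternative` keyword requires a reasonably recent SciPy (1.6 or later).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic stand-in for the male temperatures (NOT the real dataset).
sample = rng.normal(loc=98.1, scale=0.7, size=65)
mu0 = 98.6  # mean under the null hypothesis

# Manual computation, following the same steps as the cells above.
n = sample.size
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))
p_two_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)  # H_A: mean != mu0
p_left_manual = stats.t.cdf(t_manual, df=n - 1)         # H_A: mean <  mu0

# The same tests through SciPy's one-sample t-test.
res_two = stats.ttest_1samp(sample, popmean=mu0)
res_left = stats.ttest_1samp(sample, popmean=mu0, alternative="less")

print(f"t (manual / scipy): {t_manual:.6f} / {res_two.statistic:.6f}")
print(f"two-sided p (manual / scipy): {p_two_manual:.3e} / {res_two.pvalue:.3e}")
print(f"left-tail p (manual / scipy): {p_left_manual:.3e} / {res_left.pvalue:.3e}")
```

When the statistic is negative, as here, the left-tail *p*-value is half of the two-sided one, because the whole rejection region is placed in the tail where the observation fell.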