### Hypothesis Testing

[Hypothesis testing](https://en.wikipedia.org/wiki/Statistical_hypothesis_testing) is used to determine whether there is enough statistical evidence in favor of a specific hypothesis. It allows us to draw conclusions about an entire population based on a representative sample. These tests evaluate two **mutually exclusive** statements about a population to determine which statement is best supported by the sample data. They answer a very precise question with a definite answer: yes or no. The statement representing the *status quo*, i.e., the currently accepted thesis, is called the **null hypothesis**, while the potential new thesis that challenges the established one is called the **alternative hypothesis**.

Conventionally, the two hypotheses are denoted as follows:
- The null hypothesis is represented by $H_0$ (established knowledge)
- The alternative hypothesis is represented by $H_A$ (challenging thesis)

#### *p*-values and $\alpha$-values
In the context of hypothesis testing, the ***p*-value** is the probability of obtaining results at least as extreme as the results actually observed in the sample *under the assumptions of the null hypothesis*. A small *p*-value is evidence **against** the null hypothesis, since it means that a sample like the observed one would very rarely be drawn from the reference distribution. In other words, you check how surprising the new sample would be if the null hypothesis were true.

Note that the *p*-value is almost never considered alone and is usually compared against the **$\alpha$-value** (also called the *significance level*). The $\alpha$-value is the *target* probability of rejecting the null hypothesis when the null hypothesis is true. It corresponds to **how strict we want to be** about making that mistake. This threshold is chosen before running the test. For example, if we want a confidence level of 95%, we take an $\alpha$-value of 5%: we accept at most a 5% chance of rejecting a null hypothesis that is actually true. If we then collect a new sample and calculate its *p*-value, we can compare it against the $\alpha$-value and check whether we're seeing something compatible with the original distribution or something that requires new knowledge.

More formally, after choosing the $\alpha$-value (how strict we want to be), we can run the hypothesis test. We then have two cases:
- if $p \leq \alpha$, then $H_0$ is **rejected** (a sample this extreme is unlikely to be generated under the assumptions of $H_0$)
- if $p > \alpha$, then $H_0$ is **not rejected** (the sample is compatible with the original distribution *within* our tolerated error level); strictly speaking, this does not prove $H_0$, it only means that the data give insufficient evidence against it
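
The meaning of $\alpha$ as a false-positive rate can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: it repeatedly draws samples from a standard normal population for which $H_0$ ($\mu=0$) is true, runs a two-sided *z*-test with known $\sigma=1$, and counts how often $H_0$ gets rejected. The function name `false_positive_rate`, the sample size, and the number of trials are hypothetical choices made for this example.

```python
import random
from statistics import NormalDist, fmean

def false_positive_rate(alpha=0.05, n=30, trials=5000, seed=0):
    """Fraction of two-sided z-tests that reject a TRUE null hypothesis (mu = 0, sigma = 1)."""
    rng = random.Random(seed)
    std_normal = NormalDist()  # distribution of the z-score under H0
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = fmean(sample) * n ** 0.5          # (mean - 0) / (sigma / sqrt(n)) with sigma = 1
        p = 2 * (1 - std_normal.cdf(abs(z)))  # two-sided p-value
        if p <= alpha:
            rejections += 1
    return rejections / trials

rate = false_positive_rate()
print(f"Rejection rate under a true H0: {rate:.3f}")
```

With enough trials, the printed rate should settle close to the chosen $\alpha$.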


#### How to choose between *one-tailed* and *two-tailed* tests

When performing a statistical test, we may want to answer different questions:
1. Is the sample mean, $\hat x$, *different from* the population mean, $\mu$?
2. Is the sample mean, $\hat x$, *greater than* the population mean, $\mu$?
3. Is the sample mean, $\hat x$, *smaller than* the population mean, $\mu$?

Question n.1 requires a **two-tailed** test, as we're testing one hypothesis ($H_A$) against the other ($H_0$) and we need to check extreme cases at both the lower and upper ends of the distribution. Questions n.2 and n.3, instead, require a **one-tailed** test, since we're looking at only one tail of the distribution.

Note, however, that the PDF can be symmetrical (Gaussian, Student's *t*, ...) or asymmetrical (Gamma, ...). For a symmetrical distribution the two tails mirror each other, so a two-tailed *p*-value is simply twice the one-tailed one; for an asymmetrical distribution the upper and lower tails must be evaluated separately.

Let $T$ be the test statistic and $t$ its value observed in the sample. Depending on the test we want to perform, the *p*-value is defined as:
- $p = \Pr(T \geq t \mid H_0)$ for a *one-sided*, *right-tail* test,
- $p = \Pr(T \leq t \mid H_0)$ for a *one-sided*, *left-tail* test,
- $p = 2\cdot\min\{\Pr(T \geq t \mid H_0),\Pr(T \leq t \mid H_0)\}$ for a *two-sided* test with an *asymmetric* test-statistic distribution,
- $p =\Pr(|T| \geq |t| \mid H_0)\,=\,2\cdot\Pr(T \geq |t| \mid H_0)\,=\,2\cdot\Pr(T \leq -|t| \mid H_0)$ for a *two-sided* test with a *symmetric* test-statistic distribution (centered at zero).
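
As a minimal sketch of the one-sided and two-sided definitions, the snippet below computes the three kinds of *p*-value for a test statistic that follows a standard normal distribution under $H_0$. The normal distribution is an assumption made only to keep the example dependency-free, and `tail_p_values` is a hypothetical helper name.

```python
from statistics import NormalDist

def tail_p_values(observed, dist=NormalDist()):
    """Return right-tail, left-tail and two-sided p-values for an observed test statistic."""
    left = dist.cdf(observed)           # Pr(T <= t | H0)
    right = 1.0 - left                  # Pr(T >= t | H0) for a continuous distribution
    two_sided = 2.0 * min(left, right)  # general rule, also valid for asymmetric distributions
    return {"right": right, "left": left, "two-sided": two_sided}

p = tail_p_values(1.96)
for tail, value in p.items():
    print(f"{tail:>9}: {value:.4f}")
```

For an observed value of 1.96 this gives roughly 0.025 (right tail), 0.975 (left tail) and 0.05 (two-sided), which is the familiar 5% threshold of the Gaussian.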

#### Example n.1: two-sided test
We can use a dataset that contains some information about the (male) human body temperature. From the medical literature, we know that the average body temperature is 98.6 degrees Fahrenheit (37 degrees Celsius). We have a sample of $N=65$ observations and the population st.dev. is unknown. We will, thus, perform a *t*-test. Let's see if the sample mean is compatible with the population mean. We want a confidence level of 95% (i.e., $\alpha$ = 5%). We can set up the problem as follows:
- $H_0: \mu = 98.6 \leftrightarrow$ the mean male body temperature is 98.6 degrees
- $H_A: \mu \neq 98.6 \leftrightarrow$ the mean male body temperature is NOT 98.6 degrees ($\rightarrow$ two-sided test)
- $\mu=98.6$ (population mean assumed under $H_0$)
- $\sigma$ is unknown ($\rightarrow$ *t*-test)
- $N=65$
- confidence level $r=0.95$, i.e., $\alpha=0.05$

Let's compute our *t*-score and perform the test:

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as stats
from statistics import mean, stdev

In [2]:
body = pd.read_csv('data/bodytemp.csv')
body.head() # 0 means male and 1 means female

Unnamed: 0,temp,sex,bpm
0,96.3,0,70
1,96.7,0,71
2,96.9,0,74
3,97.0,0,80
4,97.1,0,73


In [3]:
n = len(body[body['sex'] == 0].temp)
m = body[body['sex'] == 0].temp.mean()
s = body[body['sex'] == 0].temp.std()
err = s/np.sqrt(n)
alpha = 1-0.95
print(f"N. of Samples:\t{n}\nMean:\t\t{m:}\nStd.Dev.:\t{s}\nError:\t\t{err}\nAlpha:\t\t{alpha:.2f}")

N. of Samples:	65
Mean:		98.10461538461539
Std.Dev.:	0.6987557623265904
Error:		0.08666998552285868
Alpha:		0.05


As we can see, the sample mean is about 0.5 degrees below what we believe to be the population mean ($\mu=98.6$). Is this sample compatible with that population (thus, supporting the null hypothesis)? We can perform a *t*-test to check:

In [8]:
n = len(body[body['sex'] == 0].temp)   # sample size
m = mean(body[body['sex'] == 0].temp)  # sample mean
s = stdev(body[body['sex'] == 0].temp) # sample st.dev.
err = s/np.sqrt(n)
alpha = 1-0.95                         # significance level (1 - confidence level)
print(f"N. of Samples:\t{n}\nMean:\t\t{m:.1f}\nStd.Dev.:\t{s:.1f}\nError:\t\t{err:.2f}\nAlpha:\t\t{alpha:.2f}")

N. of Samples:	65
Mean:		98.1
Std.Dev.:	0.7
Error:		0.09
Alpha:		0.05


In [14]:
# calculate the test statistic (t-score)
t_score = (m - 98.6) / err
print(f"t-score: {t_score:.6f}")

# calculate the p-value
p_value = 2 * stats.t.sf(np.abs(t_score), n-1) # two-sided test: twice the upper-tail area (survival function) of the symmetrical Student's t
print(f"p-value: {p_value:e}")
print(f"alpha-value: {alpha:.2f}")

t-score: -5.715757
p-value: 3.083840e-07
alpha-value: 0.05


As we can see, we get a *t*-score of -5.7. The Student's *t* CDF (with $\nu=n-1$ degrees of freedom) evaluated at this point is $\text{CDF}(t=-5.7, \nu=64)\,=\,1.54\cdot10^{-7}$. Since this distribution is symmetrical, the two-sided *p*-value is twice this number. We get $p \ll \alpha$, so we reject $H_0$: the sample mean is not compatible with a population mean of 98.6 degrees. The difference is statistically significant.
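
As a cross-check, SciPy's `stats.ttest_1samp` performs the same one-sample *t*-test in a single call. The sketch below uses synthetic temperatures generated with NumPy, an assumption made so that the snippet runs without `data/bodytemp.csv`; the mean and spread are made up to roughly resemble the sample above. It compares the manual computation with the library result.

```python
import numpy as np
import scipy.stats as stats

# Synthetic stand-in for the male temperatures (NOT the real dataset)
rng = np.random.default_rng(42)
temps = rng.normal(loc=98.1, scale=0.7, size=65)

mu0 = 98.6  # population mean assumed under H0

# Manual computation, following the same steps as the cells above
n = len(temps)
err = temps.std(ddof=1) / np.sqrt(n)             # standard error (ddof=1 gives the sample st.dev.)
t_manual = (temps.mean() - mu0) / err
p_manual = 2 * stats.t.sf(abs(t_manual), n - 1)  # two-sided p-value

# Library computation (two-sided by default)
result = stats.ttest_1samp(temps, popmean=mu0)
t_lib, p_lib = result.statistic, result.pvalue

print(f"manual : t = {t_manual:.6f}, p = {p_manual:e}")
print(f"library: t = {t_lib:.6f}, p = {p_lib:e}")
```

Note that NumPy's `std` defaults to the population formula (`ddof=0`), while pandas' `std` and `statistics.stdev` use the sample formula, hence the explicit `ddof=1` here.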

For the sake of completeness, let's see what a *z*-test would give (assuming that $\sigma\approx s$ is a good approximation):

In [17]:
# calculate the test statistic (z-score)
z_score = (m - 98.6) / err
print(f"z-score: {z_score:.6f}")

# calculate the p-value
p_value = 2 * stats.norm.sf(abs(z_score)) # two-sided test: twice the upper-tail area (survival function) of the symmetrical Gaussian
print(f"p-value: {p_value:.2e}")
print(f"alpha-value: {alpha:.2f}")

z-score: -5.715757
p-value: 1.09e-08
alpha-value: 0.05


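As we can see, the *z*-test gives an even more compelling result. This is because the Gaussian has thinner tails than the Student's *t* distribution: by treating the sample st.dev. as the exact population value, the *z*-test ignores the uncertainty in the estimate of $\sigma$, so the same score looks even more extreme.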

#### Example n.2: one-sided test

Let's build a one-sided test on the same data. Let's say that we know from the literature that the human body temperature is considered to be in the range [96.8, 99.5] degrees Fahrenheit ([36, 37.5] in Celsius). Since the data are recorded in Fahrenheit, the hypotheses must be expressed in Fahrenheit as well. Here we ask whether the mean male temperature is *greater than* the reference value of 98.6 degrees used in Example n.1, which is the value the code below tests against:
- $H_0: \mu \leq 98.6$
- $H_A: \mu > 98.6$

Note that this is a one-sided test because we're not testing an inequality ($\neq$) but a *greater than/less than* relation. In this case, we're looking at the *right* tail of the distribution ($\mu > 98.6$). The procedure is very similar:

In [33]:
# calculate the test statistic (t-score)
t_score = (m - 98.6) / err
print(f"t-score: {t_score:.6f}")

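# calculate the p-value
p_value = stats.t.sf(t_score, n-1) # one-sided (right-tail) test using the Student's t survival function
print(f"p-value: {p_value:.7f}")
print(f"alpha-value: {alpha:.2f}") # the p-value is compared against alpha, whichever tail is tested

t-score: -5.715757
p-value: 0.9999998
alpha-value: 0.05


This time $p > \alpha$, so we cannot reject the null hypothesis: a sample mean *below* the reference value gives no evidence that the population mean is *above* it. This does not contradict the two-sided test, which only told us that the sample mean differs from the reference value, without saying in which direction. Note that the *p*-value is always compared against $\alpha$ (not $1-\alpha$), regardless of the tail being tested; what changes is the tail used to compute it. A *left*-sided test (alternative: the mean is *smaller than* the reference value) would use `stats.t.cdf(t_score, n-1)`, or equivalently `stats.t.sf(-t_score, n-1)` since the distribution is symmetrical, and would give $p \approx 1.54\cdot10^{-7}$, i.e., half of the two-sided *p*-value, leading to the rejection of its null hypothesis.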
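
The relation between the two one-sided tests can be checked directly. The sketch below plugs in the *t*-score and the degrees of freedom reported above (hard-coded here as an assumption, so that the snippet runs on its own) and computes both one-sided *p*-values.

```python
import scipy.stats as stats

t_score = -5.715757  # t-score printed above
dof = 64             # n - 1 with n = 65
alpha = 0.05

p_right = stats.t.sf(t_score, dof)   # Pr(T >= t | H0): alternative "mean is greater"
p_left = stats.t.cdf(t_score, dof)   # Pr(T <= t | H0): alternative "mean is smaller"

for name, p in [("right-tail", p_right), ("left-tail", p_left)]:
    decision = "reject H0" if p <= alpha else "fail to reject H0"
    print(f"{name}: p = {p:.3e} -> {decision}")
```

For a continuous test statistic the two one-sided *p*-values add up to one, so a sample that is extreme in one tail is unremarkable in the other.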