___

# HYPOTHESIS TESTING

## Goal 

Understanding relationships between response and predictors is about testing two opposite hypotheses:
+ null hypothesis $H_0$: there is no relationship between a predictor and a response.
+ alternative hypothesis: there is a relationship.


## Experimental Design

Hypothesis testing is used to **make decisions about a population** using sample data. 

+ We start with a **null hypothesis $H_0$** that we assume to be true. For instance:
    + the population parameter is equal to a given value.
    + samples with different characteristics are drawn from the same population.
+ We run an **experiment** to test this hypothesis:
    + **collect data** from a sample of predetermined size.
    + perform the appropriate **statistical test**.
+ Based on the experimental results, we can either **reject** or **fail to reject** this null hypothesis. 
+ If we reject it, we say that the data supports another, mutually exclusive, **alternative hypothesis**.


## Test Distribution

A statistical test relies on assumptions about the sampling distribution of the test statistic under the null hypothesis. These assumptions depend on the test.


## P-value

We measure the probability that a sample would have a test statistic at least as extreme as the one observed, if these assumptions (and thus the null hypothesis) were true. This probability is called the **p-value**.

To draw conclusions, we use a predetermined cutoff probability called the level of significance $\alpha$ (typically 5%). 

+ $\text{p-value }\leq\alpha$: the observed data is very unlikely under the null hypothesis so we reject it. The observed effect is statistically significant.
+ $\text{p-value }\gt\alpha$: we fail to reject the null hypothesis. The observed effect is not statistically significant.
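As a minimal sketch of this decision rule, the snippet below assumes a one-sample z-test with a known population standard deviation. The data, `mu0`, `sigma` and the helper `z_test_p_value` are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean


def z_test_p_value(sample, mu0, sigma):
    """Two-sided p-value for H0: population mean == mu0, with known sigma."""
    n = len(sample)
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    # Probability of a statistic at least as extreme as |z| if H0 were true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


sample = [5.1, 4.9, 5.6, 5.8, 5.3, 5.7, 5.4, 5.9, 5.2, 5.5]  # hypothetical data
alpha = 0.05  # level of significance, fixed before looking at the data

z, p_value = z_test_p_value(sample, mu0=5.0, sigma=0.5)
decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(f"z = {z:.2f}, p-value = {p_value:.4f} -> {decision}")
```

When the population standard deviation is unknown, a t-test would normally replace the z-test; the comparison of the p-value against $\alpha$ stays the same.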


___

# PARAMETERS OF A TEST

A statistical test has several important parameters:
+ alpha: probability of a Type I error (false positive).
+ beta: probability of a Type II error (false negative).
+ power (1 - beta): probability of a true positive (correctly rejecting a false null hypothesis).


## Types of Errors

There are four possible outcomes for our hypothesis testing, with two [types of errors](https://en.wikipedia.org/wiki/Type_I_and_type_II_errors):

| Decision          | $$H_0$$ is True                      | $$H_0$$ is False                     |
|-------------------:|:---------------------------------:|:---------------------------------:|
| **Reject H0** | **Type I error**: False Positive   | Correct inference: True Positive |
| **Fail to reject H0** | Correct inference: True Negative | **Type II error**: False Negative |

_Note: For a fixed sample size and effect size, decreasing the probability of a Type I error increases the probability of a Type II error._


### Type I error

A Type I error is incorrectly rejecting a null hypothesis that is actually true: the sample does come from the population described by $H_0$ but happens to have extreme values. The probability of this error is equal to the level of significance $\alpha$. It is also called a False Positive: falsely stating that the alternative hypothesis is true.
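The link between $\alpha$ and the false positive rate can be checked with a small simulation. This is a sketch under stated assumptions: data drawn from a standard normal population for which $H_0$ (mean equal to 0) is true by construction, a two-sided z-test with known $\sigma = 1$, and a hypothetical helper named `type_i_error_rate`.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def type_i_error_rate(n_trials=2000, n=30, alpha=0.05, seed=0):
    """Fraction of simulated experiments that reject H0 although H0 is true."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    rejections = 0
    for _ in range(n_trials):
        # H0 is true by construction: the population mean is 0 and sigma is 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = mean(sample) * sqrt(n)
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
        if p_value <= alpha:
            rejections += 1
    return rejections / n_trials


rate = type_i_error_rate()
print(f"empirical Type I error rate: {rate:.3f}")
```

The empirical rate should land close to the chosen `alpha`, up to simulation noise.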


### Type II error

A Type II error is incorrectly failing to reject a null hypothesis that is actually false; its probability is denoted $\beta$. It is also called a False Negative.


## Statistical Power

[Power](https://en.wikipedia.org/wiki/Statistical_power), also called sensitivity, is the probability of correctly rejecting a false $H_0$; it is equal to $1 - \beta$.

Calculating the power helps assess how much we can trust a failure to reject the null hypothesis (i.e. a large p-value). A low power means that the effect might be too small to detect in the conditions of our test.

Two key things impact statistical power:
+ the effect size: a large difference between groups is easier to detect.
+ the sample size: it directly impacts the test statistic and the p-value.
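To illustrate how effect size and sample size drive power, here is a minimal sketch for a two-sided one-sample z-test with known standard deviation. The function name `z_test_power` and the numeric values are assumptions made for this example only.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(delta, sigma, n, alpha=0.05):
    """Power of a two-sided one-sample z-test when the true mean shift is delta."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = delta * sqrt(n) / sigma  # mean of the z statistic when H0 is false
    # Probability that the statistic falls in either rejection region.
    upper_tail = 1 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail


for n in (10, 50, 200):
    small = z_test_power(delta=0.2, sigma=1.0, n=n)
    large = z_test_power(delta=0.5, sigma=1.0, n=n)
    print(f"n = {n:3d}: power for delta=0.2 is {small:.3f}, for delta=0.5 is {large:.3f}")
```

With `delta=0` the function returns `alpha`, which matches the definition of the Type I error rate.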


## P-Value vs Errors

The p-value is linked to both error types:

+ alpha is the maximum p-value we consider low enough to safely reject $H_0$.
+ power is the probability that the p-value falls at or below alpha when $H_0$ is false, i.e. that we correctly reject $H_0$.


## Bonferroni Corrections

The chance of capturing a rare event increases when testing multiple hypotheses. It means the likelihood of incorrectly rejecting at least one null hypothesis (false positive) increases.

The Bonferroni correction rejects the null hypothesis of each test $i$ for which $p_{i} \leq \frac {\alpha}{m}$, where $m$ is the number of hypotheses tested. This ensures the [Family Wise Error Rate](https://en.wikipedia.org/wiki/Family-wise_error_rate) stays at or below the significance level $\alpha$. More information can be found [here](https://stats.stackexchange.com/questions/153122/bonferroni-correction-for-post-hoc-analysis-in-anova-regression).

It is useful for post-hoc tests after performing one-way ANOVA or Chi-Square tests that reject the null hypothesis. When comparing $N$ groups, we can either do:
+ pairwise testing. In that case, $m$ will be ${N \choose 2}$.
+ one vs the rest. In that case, $m$ will be $N$.
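The correction itself is a one-line threshold change. The sketch below uses hypothetical p-values for the six pairwise comparisons of four groups; `bonferroni_reject` is an illustrative helper, not a library function.

```python
from math import comb


def bonferroni_reject(p_values, alpha=0.05):
    """Return one reject/fail-to-reject flag per test, using alpha / m as cutoff."""
    m = len(p_values)
    threshold = alpha / m
    return [p <= threshold for p in p_values]


n_groups = 4
m_pairwise = comb(n_groups, 2)   # number of tests for pairwise comparisons
m_one_vs_rest = n_groups         # number of tests for one-vs-rest comparisons

# Hypothetical p-values, one per pairwise comparison.
p_values = [0.001, 0.012, 0.03, 0.2, 0.008, 0.049]
rejected = bonferroni_reject(p_values, alpha=0.05)
print(f"m = {m_pairwise}, cutoff = {0.05 / m_pairwise:.4f}, rejected = {rejected}")
```

Note that several p-values below the uncorrected 5% cutoff are no longer rejected once the cutoff is divided by $m$.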


___

# PARAMETRIC vs NON-PARAMETRIC

Outliers?

References:
+ discussions around the importance of normality assumptions can be found [here](https://www.annualreviews.org/doi/pdf/10.1146/annurev.publhealth.23.100901.140546) and [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6676026/).
+ discussions around t-tests vs non-parametric tests can be found [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3445820/#B5) and [here](https://www.contemporaryclinicaltrials.com/article/S1551-7144(09)00109-8/fulltext).


___

# EFFECT SIZE vs SAMPLE SIZE
## Effect Size

The effect size quantifies how large the difference in outcome due to the input variable is.

+ continuous outcome in relation to discrete factors (t-tests & ANOVA): Cohen's d (difference of means divided by pooled standard deviation).
+ discrete variables: odds ratio (strength of association between events).
+ continuous variables: correlation coefficient.
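As an example of the first measure, Cohen's d can be computed from two samples as follows. This is a minimal sketch using the pooled sample standard deviation; the function name and the data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Difference of means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


treatment = [2, 4, 6]  # hypothetical measurements
control = [1, 3, 5]
print(f"Cohen's d = {cohens_d(treatment, control):.2f}")
```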


## Choosing the right Sample Size

Given the standard deviation of the data $\sigma$ and the minimum difference to detect $\delta$, a typical formula to assess [sample size](https://en.wikipedia.org/wiki/Sample_size_determination) is:

$$N = (z_\alpha + z_\beta)^2 \times \frac{\sigma^2}{\delta^2}$$

Where $z_\alpha$ and $z_\beta$ are the standard normal z-scores associated with $\alpha$ and $\beta$, respectively (for a two-sided test, $z_{\alpha/2}$ replaces $z_\alpha$).

You need to collect data for the sample of size $N$ calculated above before being able to draw conclusions.
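The formula above can be evaluated with the standard library. The sketch below assumes a one-sample comparison against a fixed value; `required_sample_size` and its `two_sided` switch are illustrative names, and the inputs are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(sigma, delta, alpha=0.05, beta=0.20, two_sided=True):
    """N = (z_alpha + z_beta)^2 * sigma^2 / delta^2, rounded up."""
    std_normal = NormalDist()
    # A two-sided test splits alpha between both tails.
    tail = alpha / 2 if two_sided else alpha
    z_alpha = std_normal.inv_cdf(1 - tail)
    z_beta = std_normal.inv_cdf(1 - beta)
    n = (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
    return ceil(n)


# Detect a shift of half a standard deviation with 80% power at alpha = 5%.
n_two_sided = required_sample_size(sigma=1.0, delta=0.5)
n_one_sided = required_sample_size(sigma=1.0, delta=0.5, two_sided=False)
print(n_two_sided, n_one_sided)
```

Halving `delta` multiplies the required sample size by roughly four, since $\delta$ enters the formula squared.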

In a normal distribution, about 95% of the data lies between -2 and +2 standard deviations from the mean. Even for skewed data, going two standard deviations away from the mean often captures nearly all of the data.

If we know the minimum and maximum values that the population is likely to take (excluding outliers), we can suppose they represent this interval of four standard deviations.

It means the standard deviation of a population $\sigma$ can be approximated by:

$\sigma \simeq 1/4 \times \Delta_{range}$

If we know the margin of error $E$ we are ready to accept at $1 - \alpha$ confidence, the sample size we need can be approximated by:

$n \simeq \left[Z_{\alpha/2} \times \frac{\sigma}{E}\right]^2 \simeq \left[Z_{\alpha/2} \times \frac{\Delta_{range}}{4E}\right]^2$
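A short sketch of this approximation, assuming the range rule $\sigma \simeq \Delta_{range} / 4$ holds for the data at hand; the function name and the numbers are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_margin(value_range, margin, alpha=0.05):
    """Sample size for a margin of error E, with sigma estimated as range / 4."""
    sigma = value_range / 4  # range rule: the range spans about 4 standard deviations
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil((z * sigma / margin) ** 2)


# Values expected between 30 and 70 (range of 40), accepted margin of error of 2.
n = sample_size_for_margin(value_range=40, margin=2)
print(n)
```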

A more accurate method to estimate the sample size: iteratively evaluate the following formula, until the $n$ value chosen to calculate the t-value matches the resulting $n$.

$n \simeq \left[t_{\alpha/2, n-1} \times \frac{\Delta_{range}}{4E}\right]^2$


## Large samples

Tests become more sensitive with large sample sizes (i.e. they can capture smaller variations). Applying small-sample statistical inference to large samples means that even minuscule effects can become statistically significant.

> The question is not whether differences are ‘significant’ (they nearly always are in large samples), but whether they are interesting. Forget statistical significance, what is the practical significance of the results?

*More details can be found in this [academic paper](https://pdfs.semanticscholar.org/262b/854628d8e2b073816935d82b5095e1703977.pdf/).*

More useful links:
+ https://statmodeling.stat.columbia.edu/2009/06/18/the_sample_size/
+ https://stats.stackexchange.com/questions/2516/are-large-data-sets-inappropriate-for-hypothesis-testing


## Appendix - Further Reading

A few interesting Wikipedia articles:

Generalities
+ https://en.wikipedia.org/wiki/Sampling_distribution
+ https://en.wikipedia.org/wiki/Statistical_hypothesis_testing 

Probabilities
+ https://en.wikipedia.org/wiki/Probability_interpretations
+ https://en.wikipedia.org/wiki/Frequentist_probability
+ https://en.wikipedia.org/wiki/Bayesian_probability

Inference paradigms:
+ https://en.wikipedia.org/wiki/Frequentist_inference
+ https://en.wikipedia.org/wiki/Bayesian_inference
+ https://en.wikipedia.org/wiki/Lindley%27s_paradox
+ https://www.stat.berkeley.edu/~stark/Preprints/611.pdf

Parametric vs Ordinal
+ https://tech.snmjournals.org/content/46/3/318.2#:~:text=Currie%20writes%2C%20%E2%80%9CThe%20Likert%20scale,the%20data%20ordinal%20in%20nature.&text=Moreover%2C%20he%20concludes%20that%20parametric,distribution%20of%20data)%20are%20violated.
+ https://www.researchgate.net/post/What_is_the_most_suitable_statistical_test_for_ordinal_data_eg_Likert_scales


## TODO - Topics

+ Power & Significance & Effect size

+ t-test
+ linear regression & link with t-test
+ ANOVA
+ Chi-square
+ F-statistic

https://www.annualreviews.org/doi/pdf/10.1146/annurev.publhealth.23.100901.140546


## TODO

Can we use common distributions to approximate the sampling distribution of the test statistic, and under what conditions?

_Note: many statistical methods assume the data is roughly normal. This assumption must always be checked first: many things that you might assume are normally distributed are actually not. In particular, outliers are extremely unlikely for normally distributed data; if your data does have extreme values, the normal distribution might not be the best description._

_Note: We can compare the ECDF to the theoretical CDF of the normal distribution with same mean and standard deviation to assess if the data is normally distributed._
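One way to make the ECDF comparison concrete is to measure the largest gap between the two curves. This is a rough sketch on simulated data; `max_ecdf_gap` is a hypothetical helper, and the gap is only a descriptive distance here, not a formal test.

```python
import random
from statistics import NormalDist, mean, stdev


def max_ecdf_gap(data):
    """Largest gap between the ECDF and the CDF of a normal with same mean and std."""
    fitted = NormalDist(mean(data), stdev(data))
    xs = sorted(data)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # The ECDF jumps from i/n to (i+1)/n at x, so check both sides of the step.
        gap = max(gap, abs(cdf - i / n), abs((i + 1) / n - cdf))
    return gap


rng = random.Random(0)
normal_data = [rng.gauss(50, 5) for _ in range(500)]         # roughly normal
skewed_data = [rng.expovariate(1 / 50) for _ in range(500)]  # strongly right-skewed

gap_normal = max_ecdf_gap(normal_data)
gap_skewed = max_ecdf_gap(skewed_data)
print(f"normal data gap: {gap_normal:.3f}, skewed data gap: {gap_skewed:.3f}")
```

A small gap suggests the normal curve describes the data reasonably well; a large gap, as for the skewed sample, suggests it does not.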

_Note: If the data distribution is far from normal, using a t-test that focuses on the mean [might not be the most relevant test](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6676026/)._

If the t-distribution cannot be used, it is possible to use a more robust procedure such as the one-sample [**Wilcoxon procedure**](https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test).


+ skewness / kurtosis
+ mean/ std vs boxplots & outliers


Sometimes, we need to summarize the data in one or two numbers. The most common ones for continuous data are the mean and the median.

+ mean, based on the values of the data, is heavily influenced by outliers.
+ median, based on the ranking of the values, is robust to outliers.
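A tiny example with made-up numbers shows the difference in sensitivity:

```python
from statistics import mean, median

values = [10, 12, 11, 13, 12]     # hypothetical measurements
with_outlier = values + [100]     # same data plus one extreme value

mean_before, mean_after = mean(values), mean(with_outlier)
median_before, median_after = median(values), median(with_outlier)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before:.2f} -> {median_after:.2f}")
```

The single extreme value more than doubles the mean, while the median does not move.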

The spread of the data is commonly measured by the standard deviation around the mean.
