# HYPOTHESIS TESTING

## Goal 

Understanding relationships between response and predictors is about testing two opposite hypotheses:
+ null hypothesis $H_0$: there is no relationship between a predictor and a response.
+ alternative hypothesis: there is a relationship.


## Experimental Design

Hypothesis testing is used to **make decisions about a population** using sample data. 

+ We start with a **null hypothesis $H_0$** that we assume to be true. For instance:
    + a population parameter is equal to a given value.
    + samples with different characteristics are drawn from the same population.
+ We run an **experiment** to test this hypothesis:
    + **collect data** from a sample of predetermined size.
    + perform the appropriate **statistical test**.
+ Based on the experimental results, we can either **reject** or **fail to reject** this null hypothesis. 
+ If we reject it, we say that the data supports another, mutually exclusive, **alternative hypothesis**.


## Test Distribution

A statistical test is based on assumptions about the sampling distribution of the test statistic under the null hypothesis; these assumptions depend on the test.


## P-value

We measure the probability that a sample would have a test statistic at least as extreme as the one observed, if these assumptions (and thus the null hypothesis) were true. This probability is called the **p-value**.

To draw conclusions, we use a predetermined cutoff probability called the level of significance $\alpha$ (typically 5%). 

+ $\text{p-value }\leq\alpha$: the observed data is very unlikely under the null hypothesis so we reject it. The observed effect is statistically significant.
+ $\text{p-value }\gt\alpha$: we fail to reject the null hypothesis. The observed effect is not statistically significant.
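
The decision rule above can be sketched with a one-sample z-test. This is a minimal illustration, not a general recipe: it assumes the population standard deviation `sigma` is known, and the data, `mu0`, and the helper name `z_test_p_value` are made up for the example.

```python
from math import sqrt
from statistics import NormalDist, mean

def z_test_p_value(sample, mu0, sigma):
    """Two-sided p-value for H0: population mean == mu0 (sigma assumed known)."""
    n = len(sample)
    # Test statistic: distance between sample mean and mu0, in standard errors.
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    # Probability of a statistic at least as extreme as |z| if H0 were true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

sample = [5.1, 4.9, 5.6, 5.8, 5.3, 5.5, 5.7, 5.2, 5.4, 5.9]  # hypothetical data
alpha = 0.05  # level of significance, chosen before looking at the data

z, p_value = z_test_p_value(sample, mu0=5.0, sigma=0.5)
decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(f"z = {z:.2f}, p-value = {p_value:.4f} -> {decision}")
```

When `sigma` is unknown, a t-test would be the usual choice instead; the comparison of the p-value with `alpha` stays the same.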


___

# PROBABILITY OF FALSE INFERENCE

A statistical test has several important parameters to indicate acceptable probabilities of false inference:
+ alpha (Type I Error): probability of a false positive.
+ beta (Type II Error): probability of a false negative.
+ power (1 - beta): probability of a true positive (correctly rejecting a false null hypothesis).

These parameters are linked to:
+ effect size: the stronger an effect, the easier it is to correctly infer it.
+ sample size: the larger the sample, the easier it is to capture even small effects.


## Types of Errors

There are four possible outcomes for our hypothesis testing, with two [types of errors](https://en.wikipedia.org/wiki/Type_I_and_type_II_errors):

| Decision          | $H_0$ is True                      | $H_0$ is False                     |
|-------------------:|:---------------------------------:|:---------------------------------:|
| **Reject $H_0$** | **Type I error**: False Positive   | Correct inference: True Positive |
| **Fail to reject $H_0$** | Correct inference: True Negative | **Type II error**: False Negative |

_Note: For a fixed sample size and effect size, decreasing the probability of a Type I error increases the probability of a Type II error._


### Type I error

A Type I error occurs when we incorrectly reject a true null hypothesis: the sample does belong to the population, but happens to have extreme values. Its probability is equal to the level of significance $\alpha$. It is also called a False Positive: falsely stating that the alternate hypothesis is true.


### Type II error

A Type II error occurs when we incorrectly fail to reject a false null hypothesis; its probability is denoted $\beta$. It is also called a False Negative.
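
Both error rates can be estimated by simulation. The sketch below repeats a one-sample z-test many times on synthetic normal data, once with $H_0$ true and once with $H_0$ false. The helper names, the true mean of 0.3, the sample size and the number of trials are all arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

def rejects_h0(sample, mu0, sigma, alpha):
    """Two-sided one-sample z-test (sigma assumed known): True if H0 is rejected."""
    z = (mean(sample) - mu0) / (sigma / sqrt(len(sample)))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value <= alpha

def rejection_rate(true_mean, mu0=0.0, sigma=1.0, n=30, alpha=0.05,
                   trials=5000, seed=0):
    """Fraction of simulated experiments in which H0: mean == mu0 is rejected."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        rejections += rejects_h0(sample, mu0, sigma, alpha)
    return rejections / trials

# H0 is true: every rejection is a false positive, so the rate estimates alpha.
type_1_rate = rejection_rate(true_mean=0.0)
# H0 is false: every non-rejection is a false negative, so this estimates beta.
type_2_rate = 1 - rejection_rate(true_mean=0.3)

print(f"estimated Type I error rate:  {type_1_rate:.3f}")
print(f"estimated Type II error rate: {type_2_rate:.3f}")
```

The estimated Type I rate should land close to the chosen `alpha`, while the Type II rate depends on the effect size and sample size used in the simulation.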


## Statistical Power

[Power](https://en.wikipedia.org/wiki/Statistical_power), also called sensitivity, is the probability of correctly rejecting a false $H_0$; it is equal to $1 - \beta$.

Calculating the power helps assess how much weight to give a failure to reject the null hypothesis (when the p-value is large). A low power (typically less than 80%) means that a real but small effect might not be detected in the conditions of our test.

This [article](https://www.statisticsteacher.org/2017/09/15/what-is-power/) provides examples of fictional studies that illustrate the need for careful experimental design.
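
For a simple case, power can also be computed in closed form. The sketch below does this for a two-sided one-sample z-test, assuming a known standard deviation and an effect size expressed in standard deviation units. The function name `z_test_power` and the effect size of 0.2 are illustrative choices.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(effect_size, n, alpha=0.05):
    """Power of a two-sided one-sample z-test.

    effect_size is (true mean - mu0) / sigma, i.e. in standard deviation units.
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)  # rejection threshold on |z|
    shift = effect_size * sqrt(n)         # mean of z when the effect is real
    # Probability that z lands in either rejection region.
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

for n in (20, 50, 100, 200):
    print(f"n = {n:>3}: power = {z_test_power(0.2, n):.3f}")
```

With a zero effect the function returns `alpha`, which matches the definition of the Type I error rate. For this small effect, the 80% threshold mentioned above is only reached with the largest sample size in the loop.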


## P-Value vs Errors

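The p-value is compared against the parameters that control both error types:

+ alpha is the largest p-value at which we still reject $H_0$; it caps the probability of a Type I error.
+ power is the probability that the p-value falls at or below alpha when $H_0$ is false; a p-value above alpha in that case is a Type II error.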


___

# EFFECT SIZE

The [effect size](https://en.wikipedia.org/wiki/Effect_size) is a number that measures the strength of the relationship between two variables. Examples of effect sizes include the correlation between two variables, the regression coefficient in a regression, the mean difference, or the risk of a particular event (such as a heart attack) happening. 


## Difference Between Means

An effect size based on means usually considers the standardized mean difference between two populations or samples. For instance, [Cohen's d](https://en.wikipedia.org/wiki/Effect_size#Cohen's_d) is the difference of means divided by the pooled standard deviation. The smaller its absolute value, the smaller the effect.
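
As a minimal sketch of that definition, the function below computes Cohen's d from two samples, using the sample variances weighted by their degrees of freedom for the pooled standard deviation. The function name and the two small groups are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Difference of means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

treatment = [2.0, 4.0, 6.0]  # hypothetical measurements
control = [1.0, 3.0, 5.0]
print(cohens_d(treatment, control))  # 0.5: means differ by half a pooled SD
```

The sign of d only reflects the order of the two groups; its magnitude is what measures the strength of the effect.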


## Correlation & Variance Explained

These effect sizes estimate the amount of the variance within an experiment that is "explained" or "accounted for" by the experiment's model. 

The [Pearson's Correlation Coefficient](https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient) is the covariance of two variables, divided by the product of their standard deviations. Its values range from −1 to 1, with −1 indicating a perfect negative linear relation, 1 indicating a perfect positive linear relation, and 0 indicating no linear relation between two variables.

[Cohen's $f^2$](https://en.wikipedia.org/wiki/Effect_size#Cohen's_%C6%922) is one of several effect size measures used in the context of an F-test for ANOVA or multiple regression.
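
These measures can be sketched in a few lines. The example below computes Pearson's r directly from its definition, then derives the share of variance explained ($r^2$) and Cohen's $f^2 = R^2 / (1 - R^2)$, using $R^2 = r^2$ for the single-predictor case. The function names and the data are made up for illustration.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    mx, my = mean(x), mean(y)
    # The 1/(n-1) factors of the covariance and the two SDs cancel out.
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cohens_f2(r_squared):
    """Cohen's f^2 from the proportion of variance explained."""
    return r_squared / (1 - r_squared)

x = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical predictor
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical response
r = pearson_r(x, y)
r_squared = r ** 2               # variance explained with one predictor
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}, f^2 = {cohens_f2(r_squared):.1f}")
```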


## Categorical variables

Commonly used measures of association for categorical variables are the Phi coefficient and [Cohen's w](https://en.wikipedia.org/wiki/Effect_size#Cohen's_w). Another useful measure is the [Odds Ratio](https://en.wikipedia.org/wiki/Odds_ratio).
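
For a 2×2 contingency table, the Phi coefficient and the odds ratio can be computed directly from the four cell counts. The sketch below assumes the layout described in the comments; the counts and function names are hypothetical.

```python
from math import sqrt

# Assumed 2x2 layout (counts):
#                  event   no event
#   exposed          a        b
#   not exposed      c        d

def odds_ratio(a, b, c, d):
    """Odds of the event among the exposed divided by the odds among the unexposed."""
    return (a * d) / (b * c)

def phi_coefficient(a, b, c, d):
    """Association between two binary variables, between -1 and 1."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

a, b, c, d = 30, 10, 15, 25  # made-up counts
print(f"odds ratio = {odds_ratio(a, b, c, d):.1f}")
print(f"phi        = {phi_coefficient(a, b, c, d):.3f}")
```

An odds ratio of 1 and a Phi coefficient of 0 both indicate no association. The odds ratio is undefined when `b` or `c` is zero, which this sketch does not handle.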


___

# SAMPLE SIZE

## Choosing the Right Sample Size

For most statistical tests, any one of alpha, power, standardized effect size and sample size can be computed once the other three are known. This means that we can determine the [sample size](https://en.wikipedia.org/wiki/Sample_size_determination) required to perform a sound experiment before actually collecting any data. We'll cover this in more detail in the next chapters.
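
As a rough illustration of solving for the sample size, the sketch below uses the common normal approximation for a two-sided one-sample z-test, $n \approx \left((z_{1-\alpha/2} + z_{\text{power}}) / d\right)^2$, where $d$ is the standardized effect size. The function name and default values are assumptions for the example; other tests need their own formulas or a dedicated power-analysis library.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(effect_size, alpha=0.05, power=0.8):
    """Approximate n for a two-sided one-sample z-test.

    Ignores the negligible chance of rejecting in the wrong tail.
    """
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = norm.inv_cdf(power)          # quantile matching the target power
    return ceil(((z_alpha + z_power) / effect_size) ** 2)

for d in (0.8, 0.5, 0.2):
    print(f"effect size {d}: n >= {required_sample_size(d)}")
```

Halving the effect size roughly quadruples the required sample, which is why small effects are expensive to detect.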


## Large Samples

Tests become more sensitive with large sample sizes: even minuscule effects can become statistically significant. This is why it is very important to measure the effect size for large samples:

> The question is not whether differences are ‘significant’ (they nearly always are in large samples), but whether they are interesting. Forget statistical significance, what is the practical significance of the results?

More details can be found in this [academic paper](https://pdfs.semanticscholar.org/262b/854628d8e2b073816935d82b5095e1703977.pdf/).
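
The effect of sample size on significance can be seen without any data. The sketch below holds a tiny observed effect fixed (0.01 standard deviations, an arbitrary value) and computes the two-sided z-test p-value it would produce at different sample sizes; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p_value(effect_size, n):
    """p-value of a one-sample z-test for an observed standardized effect."""
    z = effect_size * sqrt(n)  # the same effect yields a larger z as n grows
    return 2 * (1 - NormalDist().cdf(abs(z)))

tiny_effect = 0.01  # observed mean shift, in standard deviation units
for n in (100, 10_000, 100_000, 1_000_000):
    p = two_sided_p_value(tiny_effect, n)
    verdict = "significant" if p <= 0.05 else "not significant"
    print(f"n = {n:>9,}: p-value = {p:.4f} ({verdict})")
```

The effect size never changes in this loop; only the sample size does. A significant result at the largest sizes therefore says nothing about whether a shift of 0.01 standard deviations matters in practice.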

More useful links:
+ https://statmodeling.stat.columbia.edu/2009/06/18/the_sample_size/
+ https://stats.stackexchange.com/questions/2516/are-large-data-sets-inappropriate-for-hypothesis-testing
