# MODULES


In [5]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

from sklearn import datasets
import pandas as pd
from statsmodels.formula.api import ols
import statsmodels.api as sm

from scipy.stats import chi2
from scipy.stats import chisquare


___

# STATISTICAL TESTS
## Choosing a Statistical Test
Choosing a [statistical test](http://www.biostathandbook.com/) depends on:
+ what hypothesis is tested.
+ the type of the variable of interest & its probability distribution.

In some simple cases, we do not have to explicitly model relationships and can use specific statistical tests instead; the most common example is the t-test to compare two samples. As we'll see in the next article, these statistical tests are particular cases of more general linear models. For example, t-tests are a linear model where X = "sample the observation belongs to". We try to assess if there is a relationship between X and Y, i.e. if the response depends on the sample.

More generally, choosing a [statistical test](http://www.biostathandbook.com/) depends on what we want to measure:

<br></br>
![png](../../img/stat_tests/stat_tests_overview.png)


___

# INFERENCE
## Overview

We can infer the value of the **population parameter** based on the sample statistics. Which parameter best represents the population depends on the probability distribution.

_Fisher & Chi-Squared tests measure the proportion of samples in different categories._

<br></br>
![png](../../img/stat_tests/stat_tests_inference.png)


___

# DIFFERENCE BETWEEN SAMPLES
## Overview

Comparing samples aims to determine if some characteristics of the population have an impact on the variable of interest. More specifically, we check if different values of some **categorical variable(s)** lead to **different probability distributions** for the variable of interest.

<br></br>
![png](../../img/stat_tests/stat_tests_diff_between_samples.png)


___

# CORRELATION BETWEEN VARIABLES
## Overview

Correlation is the measure of dependence between **two continuous or ordinal variables**; it typically indicates their linear relationship, but more broadly measures how in sync they vary. This is expressed by their **covariance**.

A more common measure is the **[Pearson product-moment correlation coefficient](https://en.wikipedia.org/wiki/Correlation_and_dependence#Pearson's_product-moment_coefficient)**, built on top of the covariance. It's akin to the standard deviation vs the variance for bivariate data and represents how close the relationship is to the line of best fit.

The correlation coefficient divides the covariance by the product of the standard deviations. This normalizes the covariance into a unit-less variable whose values are between -1 and +1.

The line of best fit has a slope equal to the Pearson coefficient multiplied by SDy / SDx.
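
The two relationships above (correlation as normalized covariance, and the slope of the line of best fit) can be checked numerically. The sketch below is only an illustration: the data is simulated and the variable names are assumptions, not part of the original notes.

```python
import numpy as np
from scipy import stats

# Simulated data (illustrative assumption): y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)

# Sample covariance and sample standard deviations (ddof=1 throughout).
cov_xy = np.cov(x, y, ddof=1)[0, 1]
sd_x, sd_y = x.std(ddof=1), y.std(ddof=1)

# Pearson coefficient = covariance / (SDx * SDy), a unit-less value in [-1, 1].
r_manual = cov_xy / (sd_x * sd_y)
r_scipy, _ = stats.pearsonr(x, y)

# Slope of the least-squares line = r * SDy / SDx.
slope_from_r = r_manual * sd_y / sd_x
slope_fit = stats.linregress(x, y).slope

print(r_manual, r_scipy)
print(slope_from_r, slope_fit)
```

Both pairs of printed values should agree up to floating-point error.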

<br></br>
![png](../../img/stat_tests/stat_tests_correlation.png)


___

# MODELING
## Overview

Linear Regression:
+ only include variables that are correlated to the outcome.
+ check for collinearity.

<br></br>
![png](../../img/stat_tests/stat_tests_modeling.png)


## TODO - Topics

+ linear regression & link with t-test
+ ANOVA
+ Chi-square
+ F-statistic

https://www.annualreviews.org/doi/pdf/10.1146/annurev.publhealth.23.100901.140546


___

# TESTING FOR NORMALITY
## Parametric Assumptions of Normality

T-tests and ANOVA assume that all the samples are normally distributed. Testing for normality matters for small samples but much less for large ones, as the sampling distribution of the mean is approximately normal for large samples (Central Limit Theorem). Testing for normality can be done with the Shapiro-Wilk test (or visually with Q-Q plots).

Non-parametric tests can be used when
+ the assumption of normality is not met. 
+ the **mean** is **not the most appropriate** parameter to describe the population.

They are generally less sensitive than parametric tests when the parametric assumptions hold: their chance of true positives is lower and their chance of false negatives is higher.

_Note: many statistical methods assume the data is roughly normal. This assumption must always be checked first: many things that you might assume are normally distributed are actually not. In particular, outliers are extremely unlikely for normally distributed data; if your data does have extreme values, the normal distribution might not be the best description._

_Note: We can compare the ECDF to the theoretical CDF of the normal distribution with same mean and standard deviation to assess if the data is normally distributed._
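
The ECDF comparison mentioned in the note can be sketched as follows. This is a minimal illustration on simulated samples; the `ecdf` and `max_gap_to_normal` helpers are hypothetical names introduced here, and the largest vertical gap is used as a rough summary rather than a formal test.

```python
import numpy as np
from scipy import stats

def ecdf(data):
    """Return sorted values and their empirical cumulative probabilities."""
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

def max_gap_to_normal(data):
    """Largest vertical gap between the ECDF and the CDF of a normal
    distribution with the same mean and standard deviation."""
    x, y = ecdf(data)
    y_norm = stats.norm.cdf(x, loc=np.mean(data), scale=np.std(data, ddof=1))
    return np.max(np.abs(y - y_norm))

# Simulated samples (illustrative assumption).
rng = np.random.default_rng(1)
normal_sample = rng.normal(loc=10.0, scale=2.0, size=500)
skewed_sample = rng.exponential(scale=1.0, size=500)

gap_normal = max_gap_to_normal(normal_sample)
gap_skewed = max_gap_to_normal(skewed_sample)
print(gap_normal, gap_skewed)
```

The gap should be noticeably larger for the skewed sample. Plotting both curves (for instance with `plt.step` for the ECDF) gives the visual version of the same check.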

References:
+ discussions around the importance of normality assumptions can be found [here](https://www.annualreviews.org/doi/pdf/10.1146/annurev.publhealth.23.100901.140546) and [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6676026/).
+ discussions around t-tests vs non-parametric tests can be found [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3445820/#B5) and [here](https://www.contemporaryclinicaltrials.com/article/S1551-7144(09)00109-8/fulltext).


## TODO

Can we use common distributions to approximate the sampling distribution of the test statistic, and under what conditions?

_Note: If the data distribution is far from normal, using a t-test that focuses on the mean [might not be the most relevant test](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6676026/)._

If the t-distribution cannot be used, it is possible to use a more robust procedure such as the one-sample [**Wilcoxon procedure**](https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test).


+ skewness / kurtosis
+ mean/ std vs boxplots & outliers


Sometimes, we need to summarize the data in one or two numbers. The most common ones for continuous data are the mean and the median.

+ mean, based on the values of the data, is heavily influenced by outliers.
+ median, based on the ranking of the values, is robust to outliers.

The spread of data is measured by the standard deviation from the mean.
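
A small sketch of how a single outlier affects these summaries. The numbers below are made up for illustration only.

```python
import numpy as np

# Made-up values (illustrative assumption), then the same values plus one outlier.
values = np.array([10.0, 11.0, 12.0, 13.0, 14.0])
with_outlier = np.append(values, 100.0)

mean_before, mean_after = values.mean(), with_outlier.mean()
median_before, median_after = np.median(values), np.median(with_outlier)
std_before, std_after = values.std(ddof=1), with_outlier.std(ddof=1)

print("mean:  ", mean_before, "->", mean_after)      # shifts strongly
print("median:", median_before, "->", median_after)  # barely moves
print("std:   ", std_before, "->", std_after)        # inflated by the outlier
```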


## Q-Q Plots

### Construction

Q–Q (quantile-quantile) plots compare two probability distributions by plotting their quantiles against each other. They are commonly used to compare a dataset to a theoretical model, providing a graphical assessment of "goodness of fit" rather than a numerical summary.

_Note: When the two datasets have the same size, the Q–Q plot orders each set in increasing order and pairs off the corresponding values. Otherwise, it is necessary to interpolate quantile estimates so that both datasets are compared at the same probabilities (typically, the quantiles of the smaller dataset are used as-is and those of the larger one are interpolated)._

_Note: in an ordered sample, the kth-smallest value is called its [kth-order statistic](https://en.wikipedia.org/wiki/Order_statistic). For the standard normal distribution, the expected values of the order statistics are called [rankits](https://en.wikipedia.org/wiki/Rankit)._


### Interpretation

If the two distributions being compared are similar, the points in the Q–Q plot will approximately lie on the line y = x. If the distributions are linearly related, the points in the Q–Q plot will approximately lie on a line, but not necessarily on the line y = x.

A non-linear pattern suggests the two datasets don't have the same probability distribution.
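
As a rough sketch of this interpretation, `scipy.stats.probplot` returns the theoretical and ordered sample quantiles together with a least-squares line through them. The samples below are simulated, and using the correlation of the fitted line as a summary of linearity is a simplification for illustration, not a formal test.

```python
import numpy as np
from scipy import stats

# Simulated samples (illustrative assumption).
rng = np.random.default_rng(2)
normal_sample = rng.normal(loc=5.0, scale=3.0, size=300)
skewed_sample = rng.exponential(scale=1.0, size=300)

# Compare each sample to the standard normal distribution.
# probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r)).
(theo_q, ordered), (slope, intercept, r_normal) = stats.probplot(normal_sample, dist="norm")
_, (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

# The normal sample is linearly related to the standard normal: the points lie
# near a line whose slope and intercept approximate its scale and location,
# not near y = x.
print(slope, intercept, r_normal)

# The skewed sample produces a curved pattern, hence a lower correlation.
print(r_skewed)
```

Passing `plot=plt` to `probplot` draws the corresponding Q–Q plot with matplotlib.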


### Examples

The examples below show PDFs, CDFs and QQ-plots for a few common distributions. Note that the PDF of actual samples can only be approximated by histograms and Kernel Density Estimates, so the CDF will be easier to analyze.

![qqplot](../../img/qq-plot.png)


## Shapiro-Wilk Test
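
A minimal usage sketch, assuming SciPy's `scipy.stats.shapiro` implementation and simulated samples (both choices are illustrative and not taken from the original notes). The null hypothesis is that the sample was drawn from a normal distribution, so a small p-value is evidence against normality.

```python
import numpy as np
from scipy import stats

# Simulated small samples (illustrative assumption).
rng = np.random.default_rng(3)
normal_sample = rng.normal(size=50)
skewed_sample = rng.exponential(size=50)

# shapiro returns the W statistic (close to 1 for normal-looking data) and a p-value.
w_normal, p_normal = stats.shapiro(normal_sample)
w_skewed, p_skewed = stats.shapiro(skewed_sample)

print(w_normal, p_normal)
print(w_skewed, p_skewed)
```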




## Jarque-Bera Test

The Jarque–Bera test is a goodness-of-fit test based on the sample skewness and kurtosis: its null hypothesis states that both skewness and excess kurtosis are zero. If the data comes from a normal distribution, the JB test statistic asymptotically follows a chi-squared distribution with two degrees of freedom.

The chi-squared approximation is overly sensitive for small samples, often rejecting the null hypothesis when it is true (large Type I error rate). This is why this test is only recommended for large samples (n > 2000).

_Note: some implementations interpolate p-values for small samples via Monte-Carlo simulations, in order to account for discrepancies between calculations and true alpha values._
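
The statistic described above can be sketched as $JB = \frac{n}{6}\left(S^2 + \frac{K^2}{4}\right)$, with $S$ the sample skewness and $K$ the sample excess kurtosis. The code below computes it by hand on simulated data and compares it with `scipy.stats.jarque_bera`; the sample and variable names are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

# Simulated large sample (illustrative assumption).
rng = np.random.default_rng(4)
x = rng.normal(size=5000)
n = len(x)

S = stats.skew(x)      # sample skewness (0 for a normal distribution)
K = stats.kurtosis(x)  # sample excess kurtosis (0 for a normal distribution)

# JB statistic and its p-value under the asymptotic chi-squared(2) distribution.
jb_manual = n / 6 * (S**2 + K**2 / 4)
p_manual = stats.chi2.sf(jb_manual, df=2)

jb_scipy, p_scipy = stats.jarque_bera(x)
print(jb_manual, jb_scipy)
print(p_manual, p_scipy)
```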


___

# TESTING THE NULL HYPOTHESIS

We can formulate three possible **alternate hypotheses**:

Test          | Left-tailed      | Two-sided          | Right-tailed        |
-------------:|-----------------:|:------------------:|:-------------------:|
One-sample    | $\mu - \mu_0 \lt 0$  | $\mu - \mu_0 \neq 0$   | $\mu - \mu_0 \gt 0$     |
Two-sample    | $\mu_1 - \mu_2 \lt 0$  | $\mu_1 - \mu_2 \neq 0$   | $\mu_1 - \mu_2 \gt 0$     |

We calculate the probability of observing a value at least as extreme as the sample statistic $t^*$ under $H_0$. Depending on the alternate hypothesis and significance level $\alpha$, we reject $H_0$ if $t^*$ is in the blue areas of the **sampling distribution of the sample statistic**:

<img class="center-block" src="https://sebastienplat.s3.amazonaws.com/21a0a7a855f51f6426dfbf6115b872161490032937519"/>

_Note: for two-tailed tests, we use $\alpha/2$ for each tail and twice the one-tail probability, $2 \cdot P(T \geq \lvert t^* \rvert)$, as the p-value. This ensures the total probability of extreme values is $\alpha$._
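
The three alternatives can be sketched with the t-distribution from SciPy. The observed statistic `t_star` and the degrees of freedom `df` below are hypothetical values chosen only to illustrate the calculation.

```python
from scipy import stats

# Hypothetical observed statistic and degrees of freedom (illustrative assumption).
t_star = 2.1
df = 24
alpha = 0.05

# Left-tailed: probability of a value at most t_star under H0.
p_left = stats.t.cdf(t_star, df)

# Right-tailed: probability of a value at least t_star under H0.
p_right = stats.t.sf(t_star, df)

# Two-sided: twice the one-tail probability beyond |t_star|.
p_two_sided = 2 * stats.t.sf(abs(t_star), df)

for name, p in [("left", p_left), ("two-sided", p_two_sided), ("right", p_right)]:
    print(name, p, "reject H0" if p <= alpha else "fail to reject H0")
```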


## TODO - Bonferroni Corrections

The chance of capturing a rare event increases when testing multiple hypotheses. It means the likelihood of incorrectly rejecting a null hypothesis (false positive) increases.

The Bonferroni correction rejects the null hypothesis for each $p_{i} \leq \frac {\alpha}{m}$. This ensures the [Family Wise Error Rate](https://en.wikipedia.org/wiki/Family-wise_error_rate) stays below the significance level $\alpha$. More information can be found [here](https://stats.stackexchange.com/questions/153122/bonferroni-correction-for-post-hoc-analysis-in-anova-regression).

It is useful for post-hoc tests after performing one-way ANOVA or Chi-Square tests (explained in the next chapters) that reject the null hypothesis. When comparing $N$ groups, we can either do:
+ pairwise testing. In that case, $m$ will be ${N \choose 2}$.
+ one vs the rest. In that case, $m$ will be $N$.
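
A minimal sketch of the pairwise case, assuming three simulated groups and Welch's t-test for each comparison (the groups, their names and the choice of test are assumptions made for illustration).

```python
from itertools import combinations
from math import comb

import numpy as np
from scipy import stats

# Simulated groups (illustrative assumption): "c" has a shifted mean.
rng = np.random.default_rng(5)
groups = {
    "a": rng.normal(0.0, 1.0, 40),
    "b": rng.normal(0.0, 1.0, 40),
    "c": rng.normal(1.5, 1.0, 40),
}

alpha = 0.05
m = comb(len(groups), 2)  # number of pairwise comparisons: N choose 2
threshold = alpha / m     # Bonferroni-corrected significance level

results = {}
for g1, g2 in combinations(groups, 2):
    p = stats.ttest_ind(groups[g1], groups[g2], equal_var=False).pvalue
    results[(g1, g2)] = (p, p <= threshold)

for pair, (p, reject) in results.items():
    print(pair, p, "reject H0" if reject else "fail to reject H0")
```

For the one-vs-rest approach, the same logic would apply with `m = len(groups)`.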

## T-Tests

In its most common form, a t-test **compares the means** of continuous variables. Both one-tailed and two-tailed alternate hypotheses are possible.

+ one-sample null hypothesis: the mean of a population has a specific value.
+ two-sample null hypothesis: the means of two populations are equal.
+ dependent two-sample null hypothesis: the mean of a sample is unchanged after some event.

Different versions of t-tests exist to handle different situations:
+ same sample size, equal variance.
+ different sample size, equal variance.
+ different variance (Welch's t-test).
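
These variants map onto SciPy functions roughly as sketched below. The samples are simulated and the mapping is an illustration under those assumptions, not a recommendation of which test to use.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# Two independent samples with different sizes and variances (illustrative assumption).
a = rng.normal(10.0, 1.0, 30)
b = rng.normal(10.5, 3.0, 50)

# Student's t-test: assumes equal variances (pooled standard deviation).
student = stats.ttest_ind(a, b, equal_var=True)

# Welch's t-test: does not assume equal variances.
welch = stats.ttest_ind(a, b, equal_var=False)

# Dependent (paired) samples: the same units measured before and after some event.
before = rng.normal(10.0, 1.0, 30)
after = before + rng.normal(0.3, 0.5, 30)
paired = stats.ttest_rel(before, after)

# A paired test is a one-sample test on the differences, with null mean 0.
one_sample = stats.ttest_1samp(before - after, popmean=0.0)

print(student.statistic, student.pvalue)
print(welch.statistic, welch.pvalue)
print(paired.statistic, one_sample.statistic)
```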


### Assumptions

T-tests make the following **assumptions**:
+ the sample **mean(s)** follow a **normal distribution** (this is approximately the case for large samples under the CLT).
+ the sample **variance(s)** follow a scaled **$\chi^2$ distribution** (this is the case for normally distributed data).

In practice, t-tests can be used when:
+ the sample size **is large** (30+ observations), OR
+ the **population** is roughly **normal** (very small samples - use normal probability plots to assess normality).

If the t-distribution cannot be used, it is possible to use a more robust procedure such as the one-sample [**Wilcoxon procedure**](https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test).

_Note: If the data distribution is far from normal, using a t-test that focuses on the mean [might not be the most relevant test](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6676026/)._


### T-statistic

A t-statistic will be larger (i.e. less likely to happen by chance) if:
+ the compared statistic values are very different.
+ the pooled standard deviation is small, i.e. the compared distributions do not overlap much.
+ the samples are large.
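
The three factors appear directly in the pooled two-sample statistic, $t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{1/n_1 + 1/n_2}}$. The sketch below computes it by hand on simulated data; the `pooled_t` helper is a hypothetical name, and the equal-variance form is assumed.

```python
import numpy as np
from scipy import stats

def pooled_t(a, b):
    """Two-sample t-statistic with a pooled standard deviation (equal variances assumed)."""
    n1, n2 = len(a), len(b)
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((n1 - 1) * np.var(a, ddof=1) + (n2 - 1) * np.var(b, ddof=1)) / (n1 + n2 - 2)
    # Difference in means, scaled by its estimated standard error.
    return (np.mean(a) - np.mean(b)) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

# Simulated samples (illustrative assumption).
rng = np.random.default_rng(7)
a = rng.normal(1.0, 1.0, 25)
b = rng.normal(0.0, 1.0, 25)

t_manual = pooled_t(a, b)
t_scipy = stats.ttest_ind(a, b, equal_var=True).statistic
print(t_manual, t_scipy)

# Artificially replicating the data (illustration only) keeps the means unchanged
# while increasing the sample size, which increases the magnitude of t.
t_replicated = pooled_t(np.tile(a, 4), np.tile(b, 4))
print(t_replicated)
```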


### One-Sample T-Test

A t-test is any statistical hypothesis test in which the test statistic follows a Student's t-distribution under the null hypothesis.

A t-test is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known. When the scaling term is unknown and is replaced by an estimate based on the data, the test statistic (under certain conditions) follows a Student's t-distribution.
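
For the one-sample case, the scaling term is the population standard deviation, estimated by the sample standard deviation $s$, which gives $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$ with $n - 1$ degrees of freedom. A minimal sketch on simulated data, where the hypothesized mean `mu_0` is an assumed value:

```python
import numpy as np
from scipy import stats

# Simulated sample and hypothesized mean (illustrative assumptions).
rng = np.random.default_rng(8)
sample = rng.normal(loc=5.3, scale=1.0, size=20)
mu_0 = 5.0

n = len(sample)
# Standard error of the mean, using the sample standard deviation as the scaling estimate.
standard_error = sample.std(ddof=1) / np.sqrt(n)
t_manual = (sample.mean() - mu_0) / standard_error

# Two-sided p-value from the t-distribution with n - 1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

result = stats.ttest_1samp(sample, popmean=mu_0)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```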


___

# USEFUL DISTRIBUTIONS
