# STATISTICAL TESTS
## Choosing a Statistical Test
Choosing a [statistical test](http://www.biostathandbook.com/) depends on:
+ what hypothesis is tested.
+ the type of the variable of interest & its probability distribution.

In some simple cases, we do not have to explicitly model relationships and can use specific statistical tests instead. The most common example is the t-test to compare two samples. As we'll see in the next article, these statistical tests are particular cases of more general linear models. For example, a t-test is a linear model where X = "sample the observation belongs to". We try to assess whether there is a relationship between X and Y, that is, whether the response depends on the sample.
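The link between the t-test and the linear model can be checked numerically. The sketch below is only an illustration: the two samples are made-up values, and it assumes the equal-variance (pooled) form of the two-sample t-test. It encodes X as 0 for the first sample and 1 for the second, fits a line by least squares, and compares the slope's t statistic with the classic t statistic.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical measurements for two samples (illustrative values only).
sample_a = [5.1, 4.9, 5.6, 5.8, 6.0]
sample_b = [6.2, 6.8, 7.1, 6.5, 7.0, 6.9]
n_a, n_b = len(sample_a), len(sample_b)

# Classic two-sample t statistic with pooled variance.
pooled_var = ((n_a - 1) * variance(sample_a)
              + (n_b - 1) * variance(sample_b)) / (n_a + n_b - 2)
t_test = (mean(sample_b) - mean(sample_a)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Same data as a linear model: Y = intercept + slope * X,
# with X = 0 for sample A and X = 1 for sample B.
x = [0] * n_a + [1] * n_b
y = sample_a + sample_b
x_bar, y_bar = mean(x), mean(y)
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# t statistic of the slope: the slope divided by its standard error.
residual_ss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
slope_se = sqrt(residual_ss / (len(y) - 2) / s_xx)
t_slope = slope / slope_se

print(f"difference in means: {mean(sample_b) - mean(sample_a):.4f}, slope: {slope:.4f}")
print(f"t (two-sample test): {t_test:.4f}, t (regression slope): {t_slope:.4f}")
```

The slope equals the difference between the two sample means, and both t statistics coincide, so testing "slope = 0" is the same as testing "the two means are equal".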

More generally, choosing a [statistical test](http://www.biostathandbook.com/) depends on what we want to measure:

<br></br>
![png](../../img/stat_tests/stat_tests_overview.png)


___

# DIFFERENCE BETWEEN SAMPLES
## Overview

Comparing samples aims to determine if some characteristics of the population have an impact on the variable of interest. More specifically, we check if different values of some **categorical variable(s)** lead to **different probability distributions** for the variable of interest.

<br></br>
![png](../../img/stat_tests/stat_tests_diff_between_samples.png)


___

# CORRELATION BETWEEN VARIABLES
## Overview

Correlation measures the dependence between **two continuous or ordinal variables**. It typically indicates their linear relationship, but more broadly measures how closely they vary together. This is expressed by their **covariance**.

A more common measure is the **[Pearson product-moment correlation coefficient](https://en.wikipedia.org/wiki/Correlation_and_dependence#Pearson's_product-moment_coefficient)**, built on top of the covariance. It is to the covariance what the standard deviation is to the variance, and indicates how closely the data follow the line of best fit.

The correlation coefficient divides the covariance by the product of the standard deviations. This normalizes the covariance into a unit-less variable whose values are between -1 and +1.

The line of best fit has a slope equal to the Pearson coefficient multiplied by SDy / SDx.
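The two relationships above (coefficient = covariance / product of standard deviations, slope = coefficient × SDy / SDx) can be verified on a small example. This is a minimal sketch with made-up paired observations; the variable names are illustrative.

```python
from statistics import mean, stdev

# Hypothetical paired observations (illustrative values only).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8]
n = len(x)
x_bar, y_bar = mean(x), mean(y)

# Sample covariance (n - 1 denominator, consistent with statistics.stdev).
covariance = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# Pearson coefficient: covariance normalized by the product of the standard deviations.
sd_x, sd_y = stdev(x), stdev(y)
pearson_r = covariance / (sd_x * sd_y)

# Slope of the line of best fit, computed two ways.
slope_from_r = pearson_r * sd_y / sd_x
slope_least_squares = covariance / sd_x ** 2

print(f"covariance: {covariance:.4f}, Pearson r: {pearson_r:.4f}")
print(f"slope from r: {slope_from_r:.4f}, least-squares slope: {slope_least_squares:.4f}")
```

Both slope computations agree, and the coefficient stays within [-1, +1] whatever the units of x and y.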

<br></br>
![png](../../img/stat_tests/stat_tests_correlation.png)


___

# MODELING
## Overview

Linear Regression:
+ only include variables that are correlated with the outcome.
+ check for collinearity.

<br></br>
![png](../../img/stat_tests/stat_tests_modeling.png)


## TODO - Topics

+ linear regression & link with t-test
+ ANOVA
+ Chi-square
+ F-statistic

https://www.annualreviews.org/doi/pdf/10.1146/annurev.publhealth.23.100901.140546


___

# TESTING FOR NORMALITY
## Parametric Assumptions of Normality

T-tests and ANOVA assume that the samples are drawn from normally distributed populations. Checking normality matters most for small samples; for large samples, the central limit theorem makes the sampling distribution of the mean approximately normal even when the data are not. Normality can be tested with the Shapiro-Wilk test or assessed visually with Q-Q plots.

Non-parametric tests can be used when
+ the assumption of normality is not met. 
+ the **mean** is **not the most appropriate** parameter to describe the population.

They are generally less powerful than parametric tests when the parametric assumptions hold: their chance of detecting a true effect (true positive) is lower, and their chance of missing one (false negative) is higher.

_Note: many statistical methods assume the data is roughly normal. This assumption must always be checked first: many things that you might assume are normally distributed are actually not. In particular, outliers are extremely unlikely for normally distributed data; if your data does have extreme values, the normal distribution might not be the best description._

_Note: We can compare the ECDF to the theoretical CDF of the normal distribution with the same mean and standard deviation to assess if the data is normally distributed._
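The ECDF comparison in the note above can be sketched as follows. This is an illustration rather than a formal test: the helper name `max_ecdf_gap`, the simulated samples and the seed are all assumptions made for the example. It reports the largest vertical gap between the ECDF and the CDF of the fitted normal distribution; a larger gap suggests a poorer fit.

```python
import random
from statistics import NormalDist, mean, stdev


def max_ecdf_gap(sample):
    """Largest vertical gap between the sample ECDF and the fitted normal CDF."""
    values = sorted(sample)
    n = len(values)
    # Normal distribution with the same mean and standard deviation as the sample.
    fitted = NormalDist(mean(values), stdev(values))
    gap = 0.0
    for i, value in enumerate(values):
        cdf = fitted.cdf(value)
        # The ECDF jumps from i/n to (i + 1)/n at this value: check both sides.
        gap = max(gap, abs(cdf - i / n), abs((i + 1) / n - cdf))
    return gap


random.seed(0)  # arbitrary seed, only for reproducibility
normal_sample = [random.gauss(10, 2) for _ in range(200)]
skewed_sample = [random.expovariate(1.0) for _ in range(200)]

print(f"max gap, normal sample: {max_ecdf_gap(normal_sample):.3f}")
print(f"max gap, skewed sample: {max_ecdf_gap(skewed_sample):.3f}")
```

The skewed (exponential) sample should show a noticeably larger gap than the normal one, which is what a plot of both curves would reveal visually.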

References:
+ discussions around the importance of normality assumptions can be found [here](https://www.annualreviews.org/doi/pdf/10.1146/annurev.publhealth.23.100901.140546) and [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6676026/).
+ discussions around t-tests vs non-parametric tests can be found [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3445820/#B5) and [here](https://www.contemporaryclinicaltrials.com/article/S1551-7144(09)00109-8/fulltext).


## Q-Q Plots

### Construction

Q–Q (quantile-quantile) plots compare two probability distributions by plotting their quantiles against each other. They are commonly used to compare a dataset to a theoretical model, providing a graphical assessment of "goodness of fit" rather than a numerical summary.

_Note: When the two datasets have the same size, the Q–Q plot sorts each set in increasing order and pairs off the corresponding values. Otherwise, the quantiles are usually taken from the smaller dataset, and the matching quantiles of the larger dataset are estimated by interpolation._

_Note: in an ordered sample, the kth-smallest value is called its [kth-order statistic](https://en.wikipedia.org/wiki/Order_statistic). For the standard normal distribution, the expected values of the order statistics are called [rankits](https://en.wikipedia.org/wiki/Rankit)._
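The construction of a two-sample Q–Q plot can be sketched as below. This is one possible convention, not the only one: it assumes that the sorted values of the smaller sample are kept as they are, that the k-th sorted value (0-based) of a sample of size m sits at probability (k + 0.5) / m, and that quantiles of the larger sample are obtained by linear interpolation. The function names `interpolated_quantile` and `qq_pairs` are hypothetical.

```python
def interpolated_quantile(sorted_values, p):
    """Quantile at probability p, interpolating linearly between sorted values.

    Assumes the k-th sorted value (0-based) sits at probability (k + 0.5) / m.
    """
    m = len(sorted_values)
    position = min(max(p * m - 0.5, 0.0), m - 1.0)  # fractional index, clamped
    lower = int(position)
    upper = min(lower + 1, m - 1)
    weight = position - lower
    return sorted_values[lower] + weight * (sorted_values[upper] - sorted_values[lower])


def qq_pairs(sample_a, sample_b):
    """Return the (a, b) quantile pairs of a two-sample Q-Q plot."""
    small, large = sorted(sample_a), sorted(sample_b)
    swapped = len(small) > len(large)
    if swapped:
        small, large = large, small
    n = len(small)
    pairs = [(value, interpolated_quantile(large, (i + 0.5) / n))
             for i, value in enumerate(small)]
    # Keep sample_a on the first axis regardless of which sample was smaller.
    return [(b, a) for a, b in pairs] if swapped else pairs


# Samples of different sizes: three points are plotted, one per value of the smaller sample.
for a, b in qq_pairs([30, 10, 20], [1, 2, 3, 4, 5, 6, 7, 8, 9]):
    print(f"({a:.1f}, {b:.1f})")
```

When both samples have the same size, the interpolation reduces to pairing the sorted values directly.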


### Interpretation

If the two distributions being compared are similar, the points in the Q–Q plot will approximately lie on the line y = x. If the distributions are linearly related, the points in the Q–Q plot will approximately lie on a line, but not necessarily on the line y = x.

A non-linear pattern suggests the two datasets don't have the same probability distribution.
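As an illustration of the linearly related case, the sketch below compares a simulated sample from a normal distribution with mean 10 and standard deviation 2 (arbitrary values, as is the seed) against the quantiles of the standard normal distribution. The plotting positions (i + 0.5) / n are one common choice among several. The Q–Q points should fall close to a line whose slope is near the standard deviation and whose intercept is near the mean, rather than on y = x.

```python
import random
from statistics import NormalDist, mean

random.seed(1)  # arbitrary seed, only for reproducibility
sample = sorted(random.gauss(10, 2) for _ in range(500))
n = len(sample)

# Theoretical quantiles of the standard normal at plotting positions (i + 0.5) / n.
standard_normal = NormalDist()
theoretical = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]

# Least-squares line through the Q-Q points (theoretical on x, sample on y).
t_bar, s_bar = mean(theoretical), mean(sample)
s_tt = sum((t - t_bar) ** 2 for t in theoretical)
s_ts = sum((t - t_bar) * (s - s_bar) for t, s in zip(theoretical, sample))
slope = s_ts / s_tt
intercept = s_bar - slope * t_bar

print(f"slope: {slope:.2f} (close to the standard deviation)")
print(f"intercept: {intercept:.2f} (close to the mean)")
```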


### Examples

The examples below show PDFs, CDFs and QQ-plots for a few common distributions. Note that the PDF of actual samples can only be approximated by histograms and Kernel Density Estimates, so the CDF will be easier to analyze.

![qqplot](../../img/qq-plot.png)


## Shapiro-Wilk Test

The Shapiro–Wilk test tests the null hypothesis that a sample $X_1, ..., X_n$ came from a normally distributed population. We reject the null hypothesis if the p-value is below $\alpha$, typically 0.05.

Below is the test performed on the petal lengths of three species of iris: all the p-values are above $\alpha$ = 0.05, so we fail to reject the null hypothesis.


```python
import pandas as pd
from scipy import stats
from sklearn import datasets

# load & format iris dataset
iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['species'] = [iris.target_names[i] for i in iris.target]

# Shapiro-Wilk p-value for the petal length of each iris species
for species in ['setosa', 'versicolor', 'virginica']:
    petal_length = iris_df.loc[iris_df['species'] == species, 'petal length (cm)'].to_numpy()
    _, sample_p = stats.shapiro(petal_length)
    print('p-value for {:}: {:.3f}'.format(species, sample_p))
```

```
p-value for setosa: 0.055
p-value for versicolor: 0.158
p-value for virginica: 0.110
```


## Jarque-Bera Test

The Jarque–Bera test is a goodness-of-fit test based on the sample skewness and kurtosis: its null hypothesis states that both skewness and excess kurtosis are zero. If the data comes from a normal distribution, the JB test statistic asymptotically follows a chi-squared distribution with two degrees of freedom.
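The usual form of the statistic is $JB = \frac{n}{6}\left(S^2 + \frac{(K - 3)^2}{4}\right)$, where $S$ is the sample skewness and $K$ the sample kurtosis. The sketch below is a minimal standard-library illustration, assuming biased (divide-by-n) moment estimates; the function name and the simulated samples are assumptions made for the example. For two degrees of freedom, the chi-squared survival function reduces to $e^{-x/2}$, which gives the asymptotic p-value directly. In practice, `scipy.stats.jarque_bera` can be used instead.

```python
import random
from math import exp
from statistics import mean


def jarque_bera(sample):
    """Return the JB statistic and its asymptotic chi-squared(2) p-value."""
    n = len(sample)
    center = mean(sample)
    # Biased central moments (divide by n).
    m2 = sum((v - center) ** 2 for v in sample) / n
    m3 = sum((v - center) ** 3 for v in sample) / n
    m4 = sum((v - center) ** 4 for v in sample) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    statistic = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    # Survival function of a chi-squared distribution with 2 degrees of freedom.
    p_value = exp(-statistic / 2)
    return statistic, p_value


random.seed(2)  # arbitrary seed, only for reproducibility
normal_sample = [random.gauss(0, 1) for _ in range(5000)]
skewed_sample = [random.expovariate(1.0) for _ in range(5000)]

for name, data in [("normal", normal_sample), ("exponential", skewed_sample)]:
    jb, p = jarque_bera(data)
    print(f"{name}: JB = {jb:.2f}, p-value = {p:.3g}")
```

The exponential sample is strongly skewed, so its statistic is very large and its p-value is essentially zero.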

The chi-squared approximation is overly sensitive for small samples, often rejecting the null hypothesis when it is true (large Type I error rate). This is why this test is only recommended for large samples (n > 2000).

_Note: some implementations interpolate p-values for small samples via Monte-Carlo simulations, in order to account for discrepancies between calculations and true alpha values._
