# Parametric tests


## Normal vs T-distribution

Let $X_{1},\ldots ,X_{n}$ be a sequence of $n$ independent and identically distributed (i.i.d.) random variables drawn from a distribution of expected value $\mu$ and finite variance $\sigma ^{2}$. Let $\bar {X}$ be the sample mean. If certain assumptions are met (see below), then:

$$\frac { \bar {X} - \mu }{\sigma /\sqrt {n}} \sim N(0, 1)$$

In practice, the population variance is rarely known. We can substitute the unbiased sample variance $\widehat {\sigma}^2$ instead. In this case:

$$\frac { \bar {X} - \mu }{\widehat {\sigma} /\sqrt {n}} \sim t_{n-1}$$

Small samples are more likely to underestimate $\sigma$ and have a mean that differs from $\mu$. The t-distribution accounts for this uncertainty with heavier tails compared to a Gaussian: the probability of extreme values becomes comparatively higher.

_Note: when the sample size is large (30+ observations is a common rule of thumb), Student's t-distribution becomes very close to the normal distribution._
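
The heavier tails can be checked numerically. The sketch below uses only the standard library and compares the density of the t-distribution with the standard normal density three units away from the centre. The helper `t_pdf` and the chosen degrees of freedom are illustrative assumptions, not part of the original notes.

```python
import math
from statistics import NormalDist


def t_pdf(x: float, df: float) -> float:
    """Density of Student's t-distribution with `df` degrees of freedom."""
    # lgamma avoids the overflow that math.gamma would hit for large df.
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


normal_pdf_at_3 = NormalDist().pdf(3.0)

# Ratio of the t density to the normal density, three units away from 0.
tail_ratios = {df: t_pdf(3.0, df) / normal_pdf_at_3 for df in (2, 5, 30, 1000)}

for df, ratio in tail_ratios.items():
    print(f"df={df:>4}: t density at 3 is {ratio:.2f}x the normal density")
```

The ratio stays above 1 and shrinks toward 1 as the degrees of freedom grow, which is the convergence toward the normal distribution described in the note.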

_Note: $\sigma /\sqrt {n}$ is the standard deviation of the sampling distribution of the sample mean. It's called the standard error of the mean._ 

## Assumptions of t-distribution

The [Central Limit Theorem](https://en.wikipedia.org/wiki/Central_limit_theorem) states that the sampling distribution of the sample mean tends toward a normal distribution as the sample size increases, even when the data itself is not normally distributed (provided its variance is finite).

It means that the assumptions underlying the t-distribution hold when:

+ the population is normally distributed, even for small samples ([mathematical proof](https://www.math.arizona.edu/~jwatkins/ttest.pdf)).
+ samples are large, regardless of the underlying distribution of data, thanks to the CLT.
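
The effect of the CLT can be illustrated by simulation. The sketch below draws samples from an exponential distribution (strongly right-skewed) and measures how skewed the sample means are for several sample sizes. The choice of distribution, the sample sizes, the seed and the helper names are assumptions made for illustration.

```python
import random
from statistics import fmean, pstdev


def skewness(values):
    """Sample skewness: the mean cubed z-score (0 for a symmetric distribution)."""
    m = fmean(values)
    s = pstdev(values, mu=m)
    return fmean(((v - m) / s) ** 3 for v in values)


def sample_means(n, n_samples=5000, seed=0):
    """Means of `n_samples` samples of size `n` from an exponential distribution."""
    rng = random.Random(seed)
    return [fmean(rng.expovariate(1.0) for _ in range(n)) for _ in range(n_samples)]


skew_by_n = {n: skewness(sample_means(n)) for n in (1, 10, 100)}
for n, skew in skew_by_n.items():
    print(f"n={n:>3}: skewness of the sample means = {skew:.2f}")
```

For this distribution the theoretical skewness of the sample mean is $2/\sqrt{n}$, so the simulated values should shrink toward 0 (the skewness of a normal distribution) as $n$ grows.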

_Note: If the data distribution is far from normal, using a t-test that focuses on the mean [might not be the most relevant test](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6676026/)._


## Confidence Intervals

The random variable
$T={\frac {{\bar {X}}-\mu }{{\widehat {\sigma }}/{\sqrt {n}}}}$
follows a $t$-distribution with $n-1$ degrees of freedom, which we can use to calculate confidence intervals for the population mean $\mu$. 

The sample mean $\bar {X}$ can happen to be unusually far from $\mu$, in which case $T$ falls in one of the tails of the $t$-distribution. Let $\alpha$ be the proportion of samples whose value of $T$ is so extreme that we are comfortable assuming our sample is not one of them. The $t$-distribution being symmetrical, we exclude a proportion $\alpha/2$ of samples in each tail.

Assuming our sample is indeed not among these extreme cases, the value of $T$ for our sample lies between two quantiles: $t_{\alpha/2,\,n-1} < T < t_{1 - \alpha/2, \,n-1}$. By symmetry, $t_{\alpha/2,\,n-1} = - t_{1 - \alpha/2, \,n-1}$. Replacing $T$ by its definition and solving for $\mu$, it follows that:

$$\mu \in \bar {X} \pm t_{1-\alpha/2, \,n-1} \frac{\widehat {\sigma }}{\sqrt {n}}$$

This is called the $1-\alpha$ confidence interval for the population mean: for a proportion $1-\alpha$ of the samples drawn from the population, the interval built this way contains $\mu$.
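
As a minimal sketch, the interval can be computed with `scipy.stats.t.ppf`, which returns quantiles of the $t$-distribution. The function name `mean_confidence_interval` and the data are hypothetical and only serve the illustration.

```python
import math
from statistics import fmean, stdev

from scipy import stats


def mean_confidence_interval(sample, alpha=0.05):
    """Two-sided (1 - alpha) confidence interval for the population mean."""
    n = len(sample)
    x_bar = fmean(sample)
    se = stdev(sample) / math.sqrt(n)  # stdev uses the unbiased (n - 1) variance
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    return x_bar - t_crit * se, x_bar + t_crit * se


# Hypothetical measurements, for illustration only.
sample = [4.8, 5.1, 5.3, 4.9, 5.0, 5.2, 5.4, 4.7, 5.1, 5.0]
low, high = mean_confidence_interval(sample, alpha=0.05)
print(f"95% CI for the mean: [{low:.3f}, {high:.3f}]")
```

The interval is centred on the sample mean, and its half-width is the critical value $t_{1-\alpha/2,\,n-1}$ times the estimated standard error.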


## Slope of regression line

Let's consider a model $Y=\alpha +\beta x+\epsilon$ where $\alpha$ and $\beta$ are unknown. Let $\hat{\alpha}$ and $\hat{\beta}$ be the least-squares estimates, and $SE_{\alpha}$ and $SE_{\beta}$ their standard errors.

We can construct confidence intervals for the linear regression coefficients under assumptions of normality, which hold when:
+ $\epsilon$ is a normally distributed random variable with mean 0 and unknown variance $\sigma^2$.
+ the number of observations $n$ is sufficiently large, in which case the estimator is approximately normally distributed thanks to the CLT.

In that case, $\hat{\beta}$ is normally distributed with mean $\beta$ and a standard error proportional to $\sigma$. In practice, $\sigma$ is unknown, so we estimate it from the sum of squared residuals (SSR): $\widehat{\sigma}^2 = SSR/(n-2)$. The resulting variable follows a t-distribution with $n-2$ degrees of freedom:

$$t_{slope} = \frac { \hat {\beta} - \beta }{SE_\beta} \sim t_{n-2}$$

It follows that the $1-\alpha$ confidence interval for the slope is:

$$\beta \in \hat{\beta} \pm t_{1-\alpha/2, \,n-2} \, {SE_\beta} $$


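The null hypothesis $H_0$ states that the slope $\beta$ is equal to 0 (i.e. that there is no linear relationship between $x$ and $Y$). We can reject $H_0$ if 0 is not in the confidence interval. Alternatively, under $H_0$ the statistic reduces to $t_{slope} = \hat{\beta} / SE_\beta$, and the two-tailed p-value is $2 \, P(T_{n-2} \geq |t_{slope}|)$, where $T_{n-2}$ follows a $t$-distribution with $n-2$ degrees of freedom.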
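
The steps of this section can be sketched with NumPy and SciPy. The data are hypothetical, and the 95% level (significance 0.05) is an arbitrary choice for the example. The formula used for the standard error of the slope, $\sqrt{\widehat{\sigma}^2 / \sum_i (x_i - \bar{x})^2}$ with $\widehat{\sigma}^2 = SSR/(n-2)$, is the usual one for simple linear regression.

```python
import numpy as np
from scipy import stats

# Hypothetical data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.2, 8.9])
n = len(x)

# Least-squares estimates of the slope and intercept.
s_xx = np.sum((x - x.mean()) ** 2)
beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / s_xx
alpha_hat = y.mean() - beta_hat * x.mean()

# Estimate sigma^2 from the sum of squared residuals, with n - 2 degrees of freedom.
residuals = y - (alpha_hat + beta_hat * x)
sigma2_hat = np.sum(residuals ** 2) / (n - 2)
se_beta = np.sqrt(sigma2_hat / s_xx)

# 95% confidence interval for the slope.
t_crit = stats.t.ppf(0.975, df=n - 2)
ci_low, ci_high = beta_hat - t_crit * se_beta, beta_hat + t_crit * se_beta

# Two-tailed test of H0: beta = 0.
t_slope = beta_hat / se_beta
p_value = 2 * stats.t.sf(abs(t_slope), df=n - 2)

print(f"slope = {beta_hat:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}], p = {p_value:.2e}")
```

`scipy.stats.linregress(x, y)` reports the same slope, standard error and two-sided p-value, so it can serve as a cross-check of the manual computation.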


## T-tests

A t-test is any statistical hypothesis test in which the test statistic follows a Student's t-distribution under the null hypothesis.

A t-test is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known. When the scaling term is unknown and is replaced by an estimate based on the data, the test statistic (under certain conditions) follows a Student's t-distribution.


## One-sample t-test

The null hypothesis $H_0$ states that the population mean is equal to a specified value $\mu_0$. 

Our test statistic is 
$t={\frac {{\bar {X}}-\mu_0 }{{\widehat {\sigma }}/{\sqrt {n}}}}$
where $\bar {X}$ is the sample mean, $\widehat {\sigma}^2$ the unbiased sample variance and $n$ the sample size.

Under the null hypothesis, the test statistic follows a $t$-distribution with $n-1$ degrees of freedom. 
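
A minimal sketch of the one-sample test, computed by hand and cross-checked against `scipy.stats.ttest_1samp`. The sample and the value of `mu_0` are hypothetical.

```python
import math
from statistics import fmean, stdev

from scipy import stats

# Hypothetical sample and hypothesized mean, for illustration only.
sample = [5.4, 5.1, 5.6, 5.0, 5.3, 5.7, 5.2, 5.5]
mu_0 = 5.0

n = len(sample)
# stdev uses the unbiased (n - 1) sample variance.
t_stat = (fmean(sample) - mu_0) / (stdev(sample) / math.sqrt(n))
p_two_sided = 2 * stats.t.sf(abs(t_stat), df=n - 1)

# scipy's built-in test should agree with the manual computation.
result = stats.ttest_1samp(sample, popmean=mu_0)
print(f"manual: t = {t_stat:.3f}, p = {p_two_sided:.4f}")
print(f"scipy : t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```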


## Independent two-samples t-test

The null hypothesis $H_0$ states that the means of the two populations the samples are drawn from are equal: $\mu_1 = \mu_2$. The test statistic is $t=(\bar {X}_1-\bar {X}_2) / s$, where $s$ estimates the standard error of the difference between the two sample means; its formula depends on whether the two samples have equal size and/or are assumed to have equal variance.

Under the null hypothesis, the test statistic follows a $t$-distribution (its degrees of freedom depend on the assumptions of equal or unequal variance). 

_Note: Student's t-test assumes equality of variances; Welch's unequal variances t-test, or Welch's t-test, makes no such assumption. The literature [does not recommend](https://onlinelibrary.wiley.com/doi/abs/10.1348/000711004849222) testing for equality of variances before choosing between the two tests. In general, Welch's t-test should be [preferred](https://www.rips-irsp.com/articles/10.5334/irsp.82/)._
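
The sketch below computes the unequal-variances (Welch) statistic by hand, using the Welch-Satterthwaite approximation for the degrees of freedom, and compares it with `scipy.stats.ttest_ind`. The two groups are hypothetical data chosen to have different sizes and spreads.

```python
import math
from statistics import fmean, variance

from scipy import stats

# Hypothetical samples with different sizes and spreads, for illustration only.
group_a = [20.1, 19.8, 21.3, 20.7, 19.5, 20.9, 21.0, 20.2]
group_b = [22.4, 18.9, 25.1, 23.7, 20.3, 26.0]

n_a, n_b = len(group_a), len(group_b)
var_a, var_b = variance(group_a), variance(group_b)  # unbiased sample variances

# Unequal-variances test: the standard error does not pool the two variances.
se_welch = math.sqrt(var_a / n_a + var_b / n_b)
t_welch = (fmean(group_a) - fmean(group_b)) / se_welch

# Welch-Satterthwaite approximation of the degrees of freedom.
df_welch = (var_a / n_a + var_b / n_b) ** 2 / (
    (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
)
p_welch = 2 * stats.t.sf(abs(t_welch), df=df_welch)

# equal_var=False selects the unequal-variances test, equal_var=True Student's test.
welch = stats.ttest_ind(group_a, group_b, equal_var=False)
student = stats.ttest_ind(group_a, group_b, equal_var=True)
print(f"manual : t = {t_welch:.3f}, df = {df_welch:.2f}, p = {p_welch:.4f}")
print(f"welch  : t = {welch.statistic:.3f}, p = {welch.pvalue:.4f}")
print(f"student: t = {student.statistic:.3f}, p = {student.pvalue:.4f}")
```

The approximate degrees of freedom always lie between the smaller of $n_1 - 1$ and $n_2 - 1$ and the pooled value $n_1 + n_2 - 2$ used by Student's test.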


## Testing the null hypothesis

We can formulate three possible **alternative hypotheses**:

Test          | Left-tailed      | Two-sided          | Right-tailed        |
-------------:|-----------------:|:------------------:|:-------------------:|
One-sample    | $\mu - \mu_0 \lt 0$  | $\mu - \mu_0 \neq 0$   | $\mu - \mu_0 \gt 0$     |
Two-sample    | $\mu_1 - \mu_2 \lt 0$  | $\mu_1 - \mu_2 \neq 0$   | $\mu_1 - \mu_2 \gt 0$     |

We calculate the probability, under $H_0$, of observing a test statistic at least as extreme as the observed value $t^*$ (the p-value). Depending on the alternative hypothesis and significance level $\alpha$, we reject $H_0$ if $t^*$ is in the blue areas of the **sampling distribution of the test statistic**:

<img class="center-block" src="https://sebastienplat.s3.amazonaws.com/21a0a7a855f51f6426dfbf6115b872161490032937519"/>

_Note: for two-tailed tests, we use $\alpha/2$ for each tail, and the p-value is twice the one-tailed probability: $2 \, P(T \geq |t^*|)$. This ensures the total probability of extreme values under $H_0$ is $\alpha$._
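
The three cases can be sketched as a small helper built on the CDF and survival function of `scipy.stats.t`. The function name `t_test_p_value`, the labels `"less"`, `"two-sided"` and `"greater"`, and the observed statistic are assumptions made for the example.

```python
from scipy import stats


def t_test_p_value(t_star: float, df: float, alternative: str = "two-sided") -> float:
    """P-value of an observed t statistic for the given alternative hypothesis."""
    if alternative == "less":  # left-tailed: P(T <= t*)
        return float(stats.t.cdf(t_star, df))
    if alternative == "greater":  # right-tailed: P(T >= t*)
        return float(stats.t.sf(t_star, df))
    if alternative == "two-sided":  # both tails: 2 * P(T >= |t*|)
        return float(2 * stats.t.sf(abs(t_star), df))
    raise ValueError(f"unknown alternative: {alternative!r}")


# Hypothetical observed statistic, degrees of freedom and significance level.
t_star, df, alpha = 2.1, 14, 0.05
for alternative in ("less", "two-sided", "greater"):
    p = t_test_p_value(t_star, df, alternative)
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{alternative:>9}: p = {p:.4f} -> {decision}")
```

With these hypothetical values the right-tailed test rejects $H_0$ at the 5% level while the two-tailed test does not, because the two-tailed p-value is twice as large.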


## Parametric vs non-parametric

Outliers?

References:
+ discussions around the importance of normality assumptions can be found [here](https://www.annualreviews.org/doi/pdf/10.1146/annurev.publhealth.23.100901.140546) and [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6676026/).
+ discussions around t-tests vs non-parametric tests can be found [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3445820/#B5) and [here](https://www.contemporaryclinicaltrials.com/article/S1551-7144(09)00109-8/fulltext).

