# INTRODUCTION

Statistical learning refers to a vast set of tools for understanding data:

+ supervised: building a statistical model for predicting, or estimating, an output based on one or more inputs.
+ unsupervised: learning relationships between inputs (no supervising output).

This series of articles will focus on supervised learning.


___

# MODELING
## Estimate function

Suppose that we observe a quantitative response $Y$ and $p$ different predictors $X = (X_1, X_2, \ldots, X_p)$. We assume that there is some relationship between them, which can be written in the very general form: $Y = f(X) + \epsilon$.

+ $f$ is some fixed but unknown function of $X$.
+ $\epsilon$ is a random error term, which is independent of $X$ and has a mean of zero.

We create an estimate $\hat{f}$ that predicts $Y$: $\hat{Y} = \hat{f}(X)$. The choice of $\hat{f}$ depends on the goal of the modeling.
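As a minimal sketch of this setup, the snippet below simulates data from a known $f$ and recovers an estimate $\hat{f}$ by least squares. The linear form of `f`, the noise level and the sample size are assumptions chosen for illustration; in practice $f$ is unknown.

```python
import random

random.seed(0)

def f(x):
    # "True" relationship: unknown in practice, assumed linear here for illustration.
    return 2.0 + 3.0 * x

# Simulate observations Y = f(X) + epsilon, with zero-mean noise independent of X.
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [f(x) + random.gauss(0, 1) for x in xs]

# Ordinary least-squares estimates of the slope and intercept.
x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)
slope = (
    sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    / sum((x - x_mean) ** 2 for x in xs)
)
intercept = y_mean - slope * x_mean

def f_hat(x):
    # Estimate of f built from the sample only.
    return intercept + slope * x

print(f"estimated intercept = {intercept:.2f}, estimated slope = {slope:.2f}")
print(f"prediction for x = 4: {f_hat(4.0):.2f} (noise-free value: {f(4.0):.2f})")
```

The estimated coefficients should land close to, but not exactly on, the values used to generate the data: the gap comes from the error term $\epsilon$.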

## Predictions vs Inference

When focusing on **prediction accuracy**, we are not overly concerned with the shape of $\hat{f}$, as long as it yields accurate predictions for $Y$: we treat it as a black box.

When focusing on **inference**, we want to understand the way that $Y$ is affected as $X$ changes, so we cannot treat $\hat{f}$ as a black box:

+ Which predictors are associated with the response? Which ones are the most important?
+ What is the relationship between the response and each predictor: positive or negative?
+ Can the relationship between $Y$ and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?


___

# SAMPLING

We almost never have access to the entire population of interest; we start with the subset of data that we were able to collect. Our goal is to make [generalizations](https://www.encyclopediaofmath.org/index.php/Statistical_inference) about unseen data based on this sample; we will build the estimate $\hat{f}$ from the sample data and apply it to new values of $X$.

The sample needs to represent the population well for our conclusions to be valid. To do this, the sample data needs to be **randomly generated** from the entire population.


## Sampling Methods

There are several ways to sample a population:

+ [Simple random sample](https://en.wikipedia.org/wiki/Simple_random_sample) – each subject in the population has an equal chance of being selected. Some demographics might be missed.
+ [Stratified random sample](https://en.wikipedia.org/wiki/Stratified_sampling) – the population is divided into groups based on some characteristic (e.g. sex, geographic region). Then simple random sampling is done for each group based on its size in the actual population.
+ [Cluster sample](https://en.wikipedia.org/wiki/Cluster_sampling) – a random cluster of subjects is selected from the population (e.g. certain neighborhoods instead of the entire city).
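The difference between the first two methods can be sketched as follows. The population (two regions `"A"` and `"B"` with a 90/10 split) and the helper names are hypothetical, chosen only to show how stratification preserves group proportions.

```python
import random
from collections import Counter

random.seed(1)

# Hypothetical population: 90% of subjects in region "A", 10% in region "B".
population = [("A", i) for i in range(900)] + [("B", i) for i in range(100)]

def simple_random_sample(pop, n):
    # Every subject has the same chance of being selected.
    return random.sample(pop, n)

def stratified_sample(pop, n):
    # Group subjects by stratum, then sample each stratum
    # in proportion to its share of the population.
    strata = {}
    for subject in pop:
        strata.setdefault(subject[0], []).append(subject)
    sample = []
    for members in strata.values():
        k = round(n * len(members) / len(pop))
        sample.extend(random.sample(members, k))
    return sample

srs = simple_random_sample(population, 50)
strat = stratified_sample(population, 50)

print("simple random:", Counter(group for group, _ in srs))
print("stratified:   ", Counter(group for group, _ in strat))
```

The stratified sample always contains 45 subjects from `"A"` and 5 from `"B"`, while the composition of the simple random sample varies from draw to draw.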


## Sampling bias

There are several forms of [sampling bias](https://en.wikipedia.org/wiki/Sampling_bias) that can lead to incorrect inference:
+ selection bias: the sample is not fully representative of the entire population.
    + surveys only capture the people who choose to answer them.
    + the sample comes from a specific segment of the population (e.g. polling about health at a fruit stand).
+ survivorship bias: only the subjects that made it through some selection process (e.g. survived) are observed.
    + recorded head injuries increased when metal helmets replaced cloth caps, because soldiers survived blows that would previously have been lethal.
    + damage on WWII planes: on the planes that came back, it was not uniformly distributed but concentrated in non-critical areas, because planes hit in critical areas did not return.

_Note: other [criteria](https://en.wikipedia.org/wiki/Selection_bias) can also impact the representativity of our sample._


## Limitations of Statistical Inference

Due to the random nature of sampling, some samples are **not representative** of the population and will produce incorrect inference. This uncertainty is reflected in the **confidence level** of statistical conclusions:
+ a small proportion of samples, typically noted $\alpha$, will produce incorrect inferences.
+ for the remaining proportion $1 - \alpha$ of all samples, the conclusions will be correct.
+ the confidence level is therefore expressed as $1 - \alpha$.

_Note: 0.01 and 0.05 are the most common values of $\alpha$. This translates to 99% and 95% confidence levels._


## TODO

There are two main methods to measure central tendency and dispersion. Which one is the most appropriate depends on the shape of the distribution:
+ mean and standard deviation: unimodal and symmetrical distribution.
+ median and interquartile range otherwise.


## TODO - Assumptions of Normality

+ skewness / kurtosis
+ mean/ std vs boxplots & outliers


Sometimes, we need to summarize the data in one or two numbers. The most common summaries for continuous data are the mean and the median.

+ mean, based on the values of the data, is heavily influenced by outliers.
+ median, based on the ranking of the values, is robust to outliers.

The spread of data is measured by the standard deviation from the mean.
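A small sketch of how a single outlier affects these summaries; the values are made up for illustration.

```python
import statistics

values = [10, 12, 11, 13, 12, 11, 10]
with_outlier = values + [200]  # one extreme observation

mean_before = statistics.mean(values)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(values)
median_after = statistics.median(with_outlier)
stdev_before = statistics.stdev(values)
stdev_after = statistics.stdev(with_outlier)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before:.2f} -> {median_after:.2f}")
print(f"stdev:  {stdev_before:.2f} -> {stdev_after:.2f}")
```

The mean and the standard deviation jump when the outlier is added, while the median barely moves.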


___

# INFERENCE PROBLEMS

## Types of Inference problems

+ Estimate the value of a parameter. The estimation can either be:
    + **point estimate**: a particular value that best approximates some parameter.
    + **interval estimate**: interval of plausible values for the parameter. This can also include prediction intervals for future observations.

+ Hypothesis Testing: yes/no answer as to whether the parameter lies in a specified region of values.


___

# ESTIMATION OF PARAMETER

The estimation can either be:
+ **point estimate**: a particular value that best approximates some parameter.
+ **interval estimate**: interval of plausible values for the parameter.

This can also include prediction intervals for future observations.


## Frequentist vs Bayesian Paradigms

Both paradigms are based on the likelihood but their frameworks are entirely different.

### Frequentist Paradigm

In the frequentist paradigm, the **parameter** is **set but unknown**. 

Due to the random nature of sampling, some samples are not representative of the population. It means that a small proportion of samples, typically noted $\alpha$, will produce incorrect inferences. This error rate can be controlled to build **$100(1 - \alpha)$% confidence intervals**.

This means that for a proportion $1 - \alpha$ of all samples, the calculated interval will actually include the parameter value.

_Note: $1 - \alpha$ is not the probability that a given interval contains the parameter. Once calculated, an interval either includes the parameter value or it doesn't; $1 - \alpha$ describes how often the procedure succeeds over repeated samples._
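The repeated-sampling interpretation can be checked with a short simulation. This is a sketch under simplifying assumptions: the population is normal, its standard deviation is treated as known, and all constants (`MU`, `SIGMA`, `N`, `TRIALS`) are arbitrary illustrative choices.

```python
import random
from statistics import NormalDist, fmean

random.seed(42)

MU, SIGMA = 5.0, 2.0               # population parameters, known only because we simulate
N, ALPHA, TRIALS = 30, 0.05, 2000

z = NormalDist().inv_cdf(1 - ALPHA / 2)   # two-sided critical value
half_width = z * SIGMA / N ** 0.5          # known-sigma interval, for simplicity

covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    m = fmean(sample)
    # Each interval either contains MU or it does not.
    if m - half_width <= MU <= m + half_width:
        covered += 1

coverage = covered / TRIALS
print(f"share of intervals containing the parameter: {coverage:.3f}")
```

The share of intervals that contain the parameter should be close to $1 - \alpha = 0.95$, even though any single interval is simply right or wrong.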

### Bayesian Paradigm

In the Bayesian paradigm, the **parameter** is a **random variable**. 

It is assigned a **prior distribution** based on already available (prior) data. This distribution is updated by the likelihood of the sample values to obtain its **posterior distribution**. From it, both **point estimate** and **region of highest posterior density** (or credible intervals) can be derived.
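As an illustration, here is a minimal sketch of this update for a probability of success, assuming a Beta prior (the Beta distribution is conjugate to Bernoulli data, so the posterior is again a Beta distribution). The prior parameters and the counts are made up.

```python
def update_beta(prior_a, prior_b, successes, failures):
    # Beta(a, b) prior + Bernoulli data -> Beta(a + successes, b + failures) posterior.
    return prior_a + successes, prior_b + failures

a, b = 2, 2                       # prior: weak belief centered on 0.5
a, b = update_beta(a, b, 7, 3)    # first batch of data
a, b = update_beta(a, b, 4, 6)    # second batch: the previous posterior acts as the prior

posterior_mean = a / (a + b)
print(f"posterior: Beta({a}, {b}), posterior mean = {posterior_mean:.3f}")
```

Updating in two batches gives the same posterior as updating once with all the data, which is what makes this approach convenient when data arrives continuously.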

### Considerations

A rigorous approach to frequentist statistics assumes that the conditions of the experiment are well-defined even before any data is actually collected.

Bayesian statistics, on the other hand, make no such assumptions. They are especially useful when new data is constantly collected: our beliefs are continually updated, the posterior obtained from older data being used as the prior for the new data that comes in.

We will cover Bayesian Statistics in-depth in another article.


## Point Estimate

It is often interesting to summarize the **probability distribution** with a single numerical feature of interest: the population **parameter**. We draw our conclusions about the parameter from the sample **statistic**. 

A few important limitations:
+ a sample is only part of the population; the numerical value of its statistic will not be the exact value of the parameter.
+ the observed value of the statistic depends on the selected sample.
+ some variability in the values of a statistic, over different samples, is unavoidable.


### Maximum Likelihood Estimate

The [**Maximum Likelihood Estimator**](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation) (MLE) is the value in the parameter space (i.e. the set of all values the parameter can take) that makes the **observed sample the most likely**. Under mild regularity conditions, the MLE converges towards the true value of the population parameter as the sample size increases.

+ for a Bernoulli (success / failure) distribution, the MLE of the probability of success is equal to successes / total trials.
+ for a normal distribution:
    + the MLE of the population mean is the sample mean.
    + the MLE of the population variance is the sample variance computed with a divisor of $n$.


_Note1: the MLE of the variance is biased; dividing by $n - 1$ instead of $n$ ([Bessel's correction](https://en.wikipedia.org/wiki/Bessel%27s_correction)) makes it [unbiased](https://dawenl.github.io/files/mle_biased.pdf)._

_Note2: in more complex problems, the MLE can only be found via numerical optimization._
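The closed-form cases above can be sketched with the standard library. The true parameter values (0.3, 10 and 2) and the sample size are assumptions used to simulate the data.

```python
import random
import statistics

random.seed(7)

# Bernoulli data: the MLE of the probability of success is successes / trials.
trials = [1 if random.random() < 0.3 else 0 for _ in range(1000)]
p_mle = sum(trials) / len(trials)

# Normal data: the MLE of the mean is the sample mean,
# and the MLE of the variance divides by n.
sample = [random.gauss(10.0, 2.0) for _ in range(1000)]
mu_mle = statistics.fmean(sample)
var_mle = statistics.pvariance(sample)      # divisor n (maximum likelihood, biased)
var_unbiased = statistics.variance(sample)  # divisor n - 1 (Bessel's correction)

print(f"p_mle = {p_mle:.3f}, mu_mle = {mu_mle:.3f}")
print(f"var_mle = {var_mle:.4f}, var_unbiased = {var_unbiased:.4f}")
```

With 1000 observations the two variance estimates are nearly identical; the correction matters mostly for small samples.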


### Maximum A Posteriori

The Maximum A Posteriori (MAP) estimate is [very similar to the MLE](https://wiseodd.github.io/techblog/2017/01/01/mle-vs-map/), but it maximizes the posterior distribution of the population parameter instead of the likelihood: the likelihood of the new sample data is weighted by the prior.

_Note1: for constant priors, the MAP is equal to the MLE._
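A sketch comparing the two estimators for a probability of success, assuming a Beta prior so that the posterior mode has a closed form. The function names and the numbers are illustrative.

```python
def bernoulli_mle(successes, trials):
    return successes / trials

def bernoulli_map(successes, trials, prior_a, prior_b):
    # Mode of the Beta(prior_a + successes, prior_b + failures) posterior.
    # Valid when both posterior parameters are greater than 1.
    a = prior_a + successes
    b = prior_b + (trials - successes)
    return (a - 1) / (a + b - 2)

mle = bernoulli_mle(3, 10)
map_flat = bernoulli_map(3, 10, prior_a=1, prior_b=1)            # Beta(1, 1) is a constant prior
map_informative = bernoulli_map(3, 10, prior_a=10, prior_b=10)   # prior centered on 0.5

print(f"MLE = {mle:.3f}")
print(f"MAP with constant prior = {map_flat:.3f}")
print(f"MAP with informative prior = {map_informative:.3f}")
```

With the constant prior the MAP coincides with the MLE; with the informative prior the estimate is pulled from the MLE towards the center of the prior.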


___

# HYPOTHESIS TESTING

> The goal of statistical inference is to make generalizations about the population when only a sample is available.

Understanding relationships between response and predictors is about testing two opposite hypotheses:
+ null hypothesis $H_0$: there is no relationship between a predictor and a response.
+ alternative hypothesis: there is a relationship.


## Experimental Design

Hypothesis testing is used to **make decisions about a population** using sample data. 

+ We start with a **null hypothesis $H_0$** that we assume to be true:
    + the population parameter is equal to a given value.
    + samples with different characteristics are drawn from the same population.
+ We run an **experiment** to test this hypothesis:
    + **collect data** from a sample of predetermined size _(see [Statistical Power](#Statistical-Power) below)_.
    + perform the appropriate **statistical test**.
+ Based on the experimental results, we can either **reject** or **fail to reject** this null hypothesis. 
+ If we reject it, we say that the data supports another, mutually exclusive, **alternate hypothesis**.


## Test Distribution

A statistical test calculates a test statistic from the sample data. Under the null hypothesis, the sampling distribution of this test statistic follows a specific probability distribution. This means we can calculate the probability that a sample would have a test statistic at least as extreme as the one observed if the null hypothesis were true: the p-value.
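When no textbook distribution is assumed, the distribution of the test statistic under the null hypothesis can be approximated by resampling. The sketch below uses a permutation test on the difference of means, which is a swapped-in technique used here for illustration rather than a method described in this article; the two groups are made-up data.

```python
import random
from statistics import fmean

random.seed(3)

group_a = [12.1, 11.8, 12.6, 13.0, 12.4, 11.9, 12.8, 12.2]
group_b = [11.2, 11.5, 10.9, 11.8, 11.1, 11.6, 11.0, 11.4]

# Test statistic: difference between the group means.
observed = fmean(group_a) - fmean(group_b)

# Under H0 the group labels are interchangeable, so shuffling them
# generates values of the test statistic from its null distribution.
pooled = group_a + group_b
n_a = len(group_a)
n_perm = 5000
extreme = 0
for _ in range(n_perm):
    random.shuffle(pooled)
    diff = fmean(pooled[:n_a]) - fmean(pooled[n_a:])
    if abs(diff) >= abs(observed):
        extreme += 1

# Two-sided p-value; the +1 terms avoid reporting a p-value of exactly zero.
p_value = (extreme + 1) / (n_perm + 1)
print(f"observed difference = {observed:.3f}, p-value = {p_value:.4f}")
```

Very few shuffles produce a difference as large as the observed one, so the p-value is small.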

## Metrics

A statistical test has several important parameters:
+ alpha (Type I error rate): probability of a false positive.
+ beta (Type II error rate): probability of a false negative.
+ power (1 - beta): probability of a true positive (correctly rejecting a false null hypothesis).
+ effect size: how large the difference in outcome due to the input variable is.
+ sample size.

It has a key metric: the p-value. It measures the probability of observing results at least as extreme as the experimental results if the null hypothesis is true.


### P-Value

We reject the null hypothesis if the p-value is very small. The cutoff probability is called the level of significance $\alpha$ and is typically 5%. 

More specifically, we measure the probability that our sample(s) produce such a test statistic or one more extreme under the $H_0$ probability distribution. A low p-value means that such data would be unlikely if $H_0$ actually described the population: we reject the null hypothesis.

+ $P\leq\alpha$: we reject the null hypothesis. The observed effect is statistically significant.
+ $P\gt\alpha$: we fail to reject the null hypothesis. The observed effect is not statistically significant.
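The decision rule can be sketched with a two-sided one-sample z-test. The test choice, the function name and the numbers (`mu0`, `sigma`, `n`) are assumptions made for illustration, with the population standard deviation treated as known.

```python
from statistics import NormalDist

def two_sided_z_test(sample_mean, mu0, sigma, n):
    # z statistic under H0: population mean = mu0, with known sigma.
    z = (sample_mean - mu0) / (sigma / n ** 0.5)
    # Probability of a statistic at least as extreme, in either direction.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

ALPHA = 0.05
z, p = two_sided_z_test(sample_mean=103.0, mu0=100.0, sigma=15.0, n=100)
decision = "reject H0" if p <= ALPHA else "fail to reject H0"

print(f"z = {z:.2f}, p-value = {p:.4f}: {decision}")
```

Here the p-value falls just below the 5% cutoff, so the null hypothesis is rejected; with `ALPHA = 0.01` the same data would fail to reject it.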


### Types of Errors

There are four possible outcomes for our hypothesis testing, with two [types of errors](https://en.wikipedia.org/wiki/Type_I_and_type_II_errors):

| Decision          | $H_0$ is True                      | $H_0$ is False                     |
|-------------------:|:---------------------------------:|:---------------------------------:|
| **Reject $H_0$** | **Type I error**: False Positive   | Correct inference: True Positive |
| **Fail to reject $H_0$** | Correct inference: True Negative | **Type II error**: False Negative |


#### Type I error

The Type I error rate is the probability of incorrectly rejecting a true null hypothesis: the sample does come from the population described by $H_0$, but happens to have extreme values. This probability is equal to the level of significance $\alpha$. A Type I error is also called a False Positive: falsely stating that the data supports the alternate hypothesis.


#### Type II error

The Type II error rate $\beta$ is the probability of incorrectly failing to reject a false null hypothesis; a Type II error is also called a False Negative.

_Note: The probabilities of making these two kinds of errors are related. For a given sample size and effect size, decreasing the probability of a Type I error increases the probability of a Type II error._
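Both error rates can be estimated by simulation. The sketch below assumes a two-sided z-test of $H_0$: mean = 0 with known standard deviation; the true mean of 0.5 used for the false-$H_0$ scenario and the other constants are arbitrary choices.

```python
import random
from statistics import NormalDist, fmean

random.seed(11)

ALPHA, SIGMA, N, TRIALS = 0.05, 1.0, 25, 2000
Z_CRIT = NormalDist().inv_cdf(1 - ALPHA / 2)

def rejects_h0(true_mean):
    """Draw one sample and run a two-sided z-test of H0: mean = 0."""
    sample = [random.gauss(true_mean, SIGMA) for _ in range(N)]
    z = fmean(sample) / (SIGMA / N ** 0.5)
    return abs(z) > Z_CRIT

# H0 is true (mean = 0): every rejection is a false positive.
type_1_rate = sum(rejects_h0(0.0) for _ in range(TRIALS)) / TRIALS

# H0 is false (mean = 0.5): every failure to reject is a false negative.
type_2_rate = sum(not rejects_h0(0.5) for _ in range(TRIALS)) / TRIALS

print(f"estimated Type I error rate:  {type_1_rate:.3f}")
print(f"estimated Type II error rate: {type_2_rate:.3f}")
```

The estimated Type I error rate should be close to `ALPHA`. Lowering `ALPHA` while keeping `N` fixed raises the Type II error rate.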


### Statistical Power

[Power](https://en.wikipedia.org/wiki/Statistical_power), also called the sensitivity, is the probability of correctly rejecting a false $H_0$; it is equal to $1 - \beta$.

Calculating the power helps assess how much weight to give to a failure to reject the null hypothesis (i.e. large p-values). A low power means that the effect might be too small to detect in the conditions of our test.

Two key things impact statistical power:
+ the effect size: a large difference between groups is easier to detect.
+ the sample size: a larger sample reduces the standard error, which directly impacts the test statistic and the p-value.

Given the standard deviation of the data $\sigma$ and the minimum difference to detect $\delta$, a typical formula to assess [sample size](https://en.wikipedia.org/wiki/Sample_size_determination) is:

$$N = (z_\alpha + z_\beta)^2 \times \frac{\sigma^2}{\delta^2}$$

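Where $z_\alpha$ and $z_\beta$ are the standard normal quantiles associated with $\alpha$ and $\beta$, i.e. the z-scores exceeded with probability $\alpha$ and $\beta$, respectively (use $\alpha / 2$ instead of $\alpha$ for a two-sided test).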

You need to collect data for the sample of size $N$ calculated above before being able to draw conclusions.
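A minimal sketch of this calculation. The function name, its default values (5% significance, 80% power) and the example inputs are assumptions; the formula is the one given above, which applies to a test on a single mean, so comparisons between two groups typically need an adjusted version.

```python
import math
from statistics import NormalDist

def required_sample_size(sigma, delta, alpha=0.05, beta=0.20, two_sided=True):
    # z-scores exceeded with probability alpha (or alpha / 2) and beta.
    tail = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)
    z_beta = NormalDist().inv_cdf(1 - beta)
    # N = (z_alpha + z_beta)^2 * sigma^2 / delta^2, rounded up.
    return math.ceil((z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)

n_two_sided = required_sample_size(sigma=15.0, delta=5.0)
n_one_sided = required_sample_size(sigma=15.0, delta=5.0, two_sided=False)
n_smaller_effect = required_sample_size(sigma=15.0, delta=2.5)

print(n_two_sided, n_one_sided, n_smaller_effect)
```

Halving the minimum difference to detect roughly quadruples the required sample size, since $N$ grows with $1 / \delta^2$.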


### Effect Size

+ continuous outcome in relation to discrete factors (t-tests & ANOVA): Cohen's d (difference of means divided by pooled standard deviation).
+ discrete variables: odds ratio (strength of association between events).
+ continuous variables: correlation coefficient.
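The first two measures can be sketched as follows; the helper names and the data are illustrative only.

```python
from statistics import fmean, variance

def cohens_d(a, b):
    # Difference of means divided by the pooled standard deviation.
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (fmean(a) - fmean(b)) / pooled_var ** 0.5

def odds_ratio(exposed_events, exposed_non_events, control_events, control_non_events):
    # Odds of the event in the exposed group divided by the odds in the control group.
    return (exposed_events / exposed_non_events) / (control_events / control_non_events)

d = cohens_d([5.1, 5.5, 6.0, 5.8, 5.4], [4.2, 4.9, 4.6, 5.0, 4.4])
ratio = odds_ratio(20, 80, 10, 90)

print(f"Cohen's d = {d:.2f}, odds ratio = {ratio:.2f}")
```

An odds ratio of 1 means no association; here the odds of the event are 2.25 times higher in the exposed group.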


### P-Value vs Errors

The p-value is linked to both error types:

+ alpha is the maximum p-value we consider low enough to safely reject $H_0$.
+ power is the probability that the p-value falls below alpha, so that we correctly reject $H_0$, when $H_0$ is false.


## Large samples

Tests become more sensitive with large sample sizes (i.e. they can capture smaller variations). Applying small-sample statistical inference to large samples means that even minuscule effects can become statistically significant.
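To illustrate, the sketch below computes the p-value of a two-sample z-test for the same tiny difference (0.02 standard deviations, an arbitrary choice) at two sample sizes. It assumes the observed difference equals the true one and that the standard deviation is known.

```python
from statistics import NormalDist

def two_sample_z_p_value(diff, sigma, n_per_group):
    # Standard error of the difference between two group means.
    se = sigma * (2 / n_per_group) ** 0.5
    z = diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

DIFF, SIGMA = 0.02, 1.0   # effect size of 0.02 standard deviations

p_small_sample = two_sample_z_p_value(DIFF, SIGMA, n_per_group=100)
p_large_sample = two_sample_z_p_value(DIFF, SIGMA, n_per_group=100_000)

print(f"n = 100 per group:     p-value = {p_small_sample:.4f}")
print(f"n = 100,000 per group: p-value = {p_large_sample:.2e}")
```

The effect is identical in both cases; only the sample size changes, yet the result goes from clearly non-significant to highly significant.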

> The question is not whether differences are ‘significant’ (they nearly always are in large samples), but whether they are interesting. Forget statistical significance, what is the practical significance of the results?

*More details can be found in this [academic paper](https://pdfs.semanticscholar.org/262b/854628d8e2b073816935d82b5095e1703977.pdf/).*

More useful links:
+ https://statmodeling.stat.columbia.edu/2009/06/18/the_sample_size/
+ https://stats.stackexchange.com/questions/2516/are-large-data-sets-inappropriate-for-hypothesis-testing


## Bonferroni Corrections

The chance of observing a rare event increases when testing multiple hypotheses. It means the likelihood of incorrectly rejecting at least one null hypothesis (false positive) increases.

The Bonferroni correction rejects the null hypothesis for each $p_{i} \leq \frac {\alpha}{m}$, where $m$ is the number of hypotheses tested. This ensures the [Family Wise Error Rate](https://en.wikipedia.org/wiki/Family-wise_error_rate) stays below the significance level $\alpha$. More information can be found [here](https://stats.stackexchange.com/questions/153122/bonferroni-correction-for-post-hoc-analysis-in-anova-regression).

It is useful for post-hoc tests after performing one-way ANOVA or Chi-Square tests that reject the null hypothesis. When comparing $N$ groups, we can either do:
+ pairwise testing. In that case, $m$ will be ${N \choose 2}$.
+ one vs the rest. In that case, $m$ will be $N$.
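A short sketch of the correction applied to pairwise tests between 4 groups; the p-values are made up for illustration.

```python
from math import comb

def bonferroni_reject(p_values, alpha=0.05):
    # Reject H0 for each test whose p-value is at most alpha / m.
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

n_groups = 4
m_pairwise = comb(n_groups, 2)   # number of pairwise comparisons
m_one_vs_rest = n_groups         # number of one-vs-rest comparisons

# One hypothetical p-value per pairwise comparison.
p_values = [0.001, 0.012, 0.030, 0.008, 0.200, 0.049]
decisions = bonferroni_reject(p_values)

print(f"m = {m_pairwise}, corrected threshold = {0.05 / m_pairwise:.4f}")
print(decisions)
```

Without the correction, five of the six tests would be declared significant at the 5% level; with it, only the two smallest p-values pass the corrected threshold.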


## Appendix - Further Reads

A few interesting Wikipedia articles:

Generalities
+ https://en.wikipedia.org/wiki/Sampling_distribution
+ https://en.wikipedia.org/wiki/Statistical_hypothesis_testing 

Probabilities
+ https://en.wikipedia.org/wiki/Probability_interpretations
+ https://en.wikipedia.org/wiki/Frequentist_probability
+ https://en.wikipedia.org/wiki/Bayesian_probability

Inference paradigms:
+ https://en.wikipedia.org/wiki/Frequentist_inference
+ https://en.wikipedia.org/wiki/Bayesian_inference
+ https://en.wikipedia.org/wiki/Lindley%27s_paradox
+ https://www.stat.berkeley.edu/~stark/Preprints/611.pdf

Parametric vs Ordinal
+ https://tech.snmjournals.org/content/46/3/318.2#:~:text=Currie%20writes%2C%20%E2%80%9CThe%20Likert%20scale,the%20data%20ordinal%20in%20nature.&text=Moreover%2C%20he%20concludes%20that%20parametric,distribution%20of%20data)%20are%20violated.
+ https://www.researchgate.net/post/What_is_the_most_suitable_statistical_test_for_ordinal_data_eg_Likert_scales


## TODO - Topics

+ What is statistical inference
+ Hypothesis Testing
+ Formulate H0 & alternate hypothesis
+ Select a test statistic that can answer the question
+ Sampling distribution of the test statistic under H0 (i.e. the distribution we would get from repeated sampling)
+ Can we use common distributions to approximate the sampling distribution of the test statistic & under what conditions
+ Calculate the probability of having values of the test statistic at least as extreme as the observed value under H0
+ Draw conclusions
+ Power & Significance & Effect size

+ t-test
+ linear regression & link with t-test
+ ANOVA
+ Chi-square
+ F-statistic

https://www.annualreviews.org/doi/pdf/10.1146/annurev.publhealth.23.100901.140546


### TODO

_Note: many statistical methods assume the data is roughly normal. This assumption must always be checked first: many things that you might assume are normally distributed are actually not. In particular, outliers are extremely unlikely for normally distributed data; if your data does have extreme values, the normal distribution might not be the best description._

_Note: We can compare the ECDF to the theoretical CDF of the normal distribution with same mean and standard deviation to assess if the data is normally distributed._
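A sketch of this comparison, summarized by the largest gap between the two curves. The helper name, the simulated datasets and the use of the maximum gap as a summary are assumptions; this is a quick diagnostic in the spirit of a Kolmogorov-Smirnov comparison, not a formal test.

```python
import random
from statistics import NormalDist, fmean, stdev

random.seed(5)

def max_ecdf_gap(data):
    """Largest gap between the ECDF and the CDF of a normal with the same mean and std."""
    xs = sorted(data)
    n = len(xs)
    reference = NormalDist(fmean(xs), stdev(xs))
    gap = 0.0
    for i, x in enumerate(xs):
        cdf = reference.cdf(x)
        # The ECDF jumps from i / n to (i + 1) / n at x: check both sides of the jump.
        gap = max(gap, abs((i + 1) / n - cdf), abs(i / n - cdf))
    return gap

normal_data = [random.gauss(0, 1) for _ in range(1000)]
skewed_data = [random.expovariate(1.0) for _ in range(1000)]

gap_normal = max_ecdf_gap(normal_data)
gap_skewed = max_ecdf_gap(skewed_data)

print(f"max gap, normal data: {gap_normal:.3f}")
print(f"max gap, skewed data: {gap_skewed:.3f}")
```

The gap stays small for the normally distributed sample and is several times larger for the skewed one; plotting both curves gives the same information visually.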
