## Introduction

Statistical learning refers to a vast set of tools for understanding data:

+ supervised: building a statistical model for predicting, or estimating, an output based on one or more inputs.
+ unsupervised: learning relationships between inputs (no supervising output).


_TODO: add examples._

This series of articles will focus on supervised learning.

## Modelling

### Estimate function

Suppose that we observe a quantitative response $Y$ and $p$ different predictors $X = (X_1, X_2, \dots, X_p)$. We assume that there is some relationship between them, which can be written in the very general form: $Y = f(X) + \epsilon$.

+ $f$ is some fixed but unknown function of $X$.
+ $\epsilon$ is a random error term, which is independent of $X$ and has a mean of zero.

We create an estimate $\hat{f}$ that predicts $Y$: $\hat{Y} = \hat{f}(X)$. The choice of $\hat{f}$ depends on the goal of the model: prediction or inference.
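As a rough illustration of $Y = f(X) + \epsilon$ and its estimate $\hat{f}$, here is a minimal sketch using a single predictor. The "true" function `f`, the noise level, and the choice of a least-squares line for `f_hat` are all assumptions made for the example; in practice $f$ is unknown and other estimators may suit better.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def f(x):
    # Hypothetical "true" function; in a real problem it is unknown.
    return 2.0 + 3.0 * x


# Simulate n observations of Y = f(X) + epsilon, with epsilon ~ N(0, 1).
n = 200
xs = [random.uniform(0, 10) for _ in range(n)]
ys = [f(x) + random.gauss(0, 1) for x in xs]

# Least-squares estimates of the intercept and slope of a straight line.
x_mean = sum(xs) / n
y_mean = sum(ys) / n
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
s_xx = sum((x - x_mean) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = y_mean - slope * x_mean


def f_hat(x):
    # Estimate of f built from the sample only.
    return intercept + slope * x


print(f"f_hat(x) = {intercept:.2f} + {slope:.2f} * x")
print(f"prediction at x = 4: {f_hat(4.0):.2f} (true f(4) = {f(4.0):.2f})")
```

The estimated coefficients should land close to the assumed values 2 and 3, but not exactly on them, because of the error term.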

### Prediction vs Inference

When focusing on **prediction accuracy**, we are not overly concerned with the shape of $\hat{f}$, as long as it yields accurate predictions for $Y$: we treat it as a black box.

When focusing on **inference**, we want to understand the way that $Y$ is affected as $X$ changes, so we cannot treat $\hat{f}$ as a black box:

+ Which predictors are associated with the response? Which ones are the most important?
+ What is the relationship between the response and each predictor: positive or negative? Is there covariance?
+ Can the relationship between $Y$ and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?
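The first two questions can be explored with a quick look at the sign and strength of the association between each predictor and the response. The sketch below uses simulated data in which, by assumption, `x1` affects the response positively, `x2` negatively, and `x3` not at all. The Pearson correlation is only one possible first check: it measures each predictor on its own and only captures linear association.

```python
import math
import random

random.seed(1)  # fixed seed so the sketch is reproducible


def pearson(a, b):
    # Sample Pearson correlation coefficient between two equal-length lists.
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical data: Y = 2*X1 - X2 + epsilon, and X3 plays no role.
n = 500
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [random.gauss(0, 1) for _ in range(n)]
y = [2 * a - b + random.gauss(0, 1) for a, b in zip(x1, x2)]

r1, r2, r3 = pearson(x1, y), pearson(x2, y), pearson(x3, y)
for name, r in [("x1", r1), ("x2", r2), ("x3", r3)]:
    print(f"correlation between {name} and y: {r:+.2f}")
```

Here `x1` should show a strong positive association, `x2` a weaker negative one, and `x3` a value close to zero.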

_TODO: add examples._

## Sampling

We usually do not have access to an entire population, but only to a subset of its members that time and resources allow us to measure. We use this sample data to draw probabilistic conclusions about the population; this process is called [statistical inference](https://www.encyclopediaofmath.org/index.php/Statistical_inference).

There are several ways to sample a population:

+ [Simple random sample](https://en.wikipedia.org/wiki/Simple_random_sample) – each subject in the population has an equal chance of being selected. By chance, small demographic groups might be under-represented or missed entirely.
+ [Stratified random sample](https://en.wikipedia.org/wiki/Stratified_sampling) – the population is divided into groups based on some characteristic (e.g. sex, geographic region). Simple random sampling is then done within each group, with a sample size proportional to the group's share of the actual population.
+ [Cluster sample](https://en.wikipedia.org/wiki/Cluster_sampling) – the population is divided into clusters, and a random subset of clusters is selected (e.g. certain neighborhoods instead of the entire city). Only subjects from the selected clusters are measured.
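The three sampling schemes above can be sketched in a few lines. The population, its `region` and `block` fields, and the function names are all hypothetical, chosen only to make the differences visible; the stratified version assumes proportional allocation, and rounding may make the total differ slightly from `k` for other group sizes.

```python
import random
from collections import Counter, defaultdict

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1,000 people with a region and a block of 50 people.
regions = ["north"] * 600 + ["south"] * 300 + ["east"] * 100
population = [{"id": i, "region": r, "block": i // 50} for i, r in enumerate(regions)]


def group_by(pop, key):
    groups = defaultdict(list)
    for person in pop:
        groups[person[key]].append(person)
    return groups


def simple_random_sample(pop, k):
    # Every subject has the same chance of being selected.
    return random.sample(pop, k)


def stratified_sample(pop, k, key):
    # Sample within each stratum, proportionally to its share of the population.
    sample = []
    for members in group_by(pop, key).values():
        share = round(k * len(members) / len(pop))
        sample.extend(random.sample(members, share))
    return sample


def cluster_sample(pop, n_clusters, key):
    # Pick whole clusters at random and keep everyone inside them.
    clusters = group_by(pop, key)
    chosen = random.sample(sorted(clusters), n_clusters)
    return [person for c in chosen for person in clusters[c]]


srs = simple_random_sample(population, 100)
strat = stratified_sample(population, 100, "region")
clus = cluster_sample(population, 4, "block")

print("simple random:", Counter(p["region"] for p in srs))
print("stratified:   ", Counter(p["region"] for p in strat))
print("cluster:      ", Counter(p["region"] for p in clus))
```

The stratified sample reproduces the 60/30/10 regional split exactly, the simple random sample only approximately, and the cluster sample can miss entire regions depending on which blocks are drawn.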


Sampling must be probabilistic in order to make inferences about the whole population. Otherwise, inferences can only be made about the sample itself.

There are several forms of [sampling bias](https://en.wikipedia.org/wiki/Sampling_bias):

+ selection bias: the sample is not fully representative of the entire population.
    + surveys only capture the people who choose to answer them.
    + people from a specific segment of the population (e.g. polling about health at a fruit stand).
+ survivorship bias: only the members that made it through some selection process (e.g. survived) are observed, so the population appears to improve over time.
    + reported head injuries increased when metal helmets replaced cloth caps, because blows that used to be lethal became survivable injuries.
    + damage on WWII planes that came back was not uniformly distributed but concentrated in non-critical areas, because planes hit in critical areas did not return.
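The WWII planes example can be reproduced with a small simulation. The loss probabilities below are invented for illustration and are not historical figures; the sketch also assumes one hit per plane, landing uniformly at random on one of four areas.

```python
import random
from collections import Counter

random.seed(7)  # fixed seed so the sketch is reproducible

# Hypothetical probability that a plane is lost, given where it was hit.
loss_probability = {"engine": 0.8, "cockpit": 0.6, "fuselage": 0.1, "wings": 0.1}
areas = list(loss_probability)

all_hits = Counter()       # hits on every plane, including those that were lost
returned_hits = Counter()  # hits visible on the planes that came back

n_planes = 10_000
for _ in range(n_planes):
    area = random.choice(areas)  # hits are uniformly distributed across areas
    all_hits[area] += 1
    if random.random() >= loss_probability[area]:
        returned_hits[area] += 1  # the plane survived and can be inspected

for area in areas:
    print(f"{area:9s} all planes: {all_hits[area]:5d}   returned planes: {returned_hits[area]:5d}")
```

Although hits are spread evenly over all planes, the returning planes show far fewer engine and cockpit hits. Looking only at the survivors would wrongly suggest that those areas are rarely hit.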
    
_Note: other [criteria](https://en.wikipedia.org/wiki/Selection_bias) can also impact the representativeness of our sample._
