## Introduction

Suppose that we observe a quantitative response $Y$ and $p$ different predictors $X = (X_1, X_2, ..., X_p)$. We assume that there is some relationship between them, which can be written in the very general form:

$$Y = f(X) + \epsilon$$

+ $f$ is some fixed but unknown function of $X$.
+ $\epsilon$ is a random error term, which is independent of $X$ and has a mean of zero.

### Estimate function

We create an estimate $\hat{f}$ that predicts $Y$: $\hat{Y} = \hat{f}(X)$. For a fixed $\hat{f}$ and $X$, the expected squared error always has two components:

$$E (Y - \hat{Y})^2 = [f(X) - \hat{f}(X)]^{2} + Var(\epsilon)$$

Where:
+ $E (Y - \hat{Y})^2$ is the average squared error of predictions.
+ $[f(X) - \hat{f}(X)]^{2} $ is the reducible error. Our aim is to reduce this error.
+ $Var(\epsilon)$ is the irreducible error, that cannot be predicted using $X$.
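
The decomposition can be checked numerically. The sketch below is only an illustration: the "true" function `f`, the imperfect estimate `f_hat`, the point `x0` and the noise level `sigma` are all made-up assumptions, not part of the notes above.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # "True" function, assumed for illustration only
    return 2.0 + 3.0 * x

def f_hat(x):
    # A deliberately imperfect estimate of f
    return 2.5 + 2.0 * x

x0 = 1.0       # fixed value of the predictor
sigma = 0.5    # standard deviation of the error term epsilon

# Many draws of Y at X = x0
eps = rng.normal(0.0, sigma, size=200_000)
y = f(x0) + eps

mse = np.mean((y - f_hat(x0)) ** 2)     # empirical E(Y - Y_hat)^2
reducible = (f(x0) - f_hat(x0)) ** 2    # [f(X) - f_hat(X)]^2
irreducible = sigma ** 2                # Var(epsilon)

print(f"empirical MSE:            {mse:.4f}")
print(f"reducible + irreducible:  {reducible + irreducible:.4f}")
```

Improving `f_hat` can only shrink the first term; the second term stays at `sigma ** 2` no matter how good the estimate is.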

### Predictions vs Inference

When focusing on **prediction accuracy**, we are not overly concerned with the shape of $\hat{f}$, as long as it yields accurate predictions for $Y$: we treat it as a black box.

When focusing on **inference**, we want to understand the way that $Y$ is affected as $X$ changes, so we cannot treat $\hat{f}$ as a black box:

+ Which predictors are associated with the response? Which ones are the most important?
+ What is the relationship between the response and each predictor: positive or negative? Is there covariance?
+ Can the relationship between $Y$ and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?

### Quality of Fit - Bias vs Variance

A good model **accurately predicts** the desired target value for **new data**. It will have:
+ low **bias**: on average, the model closely approximates the true relationship in the data.
+ low **variance**: the model's predictions stay stable when it is trained on different training examples.

We can link under- vs overfitting to bias and variance:
+ the **underfitting** model does not capture the relevant relations between features and outputs: it suffers from **high bias**.
+ the **overfitting** model captures the underlying noise in the training set, so changing the training set will lead to vastly different predictions: it suffers from **high variance**.
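
A small simulation can make the link concrete. This is a sketch under assumptions: the true function (a sine wave), the noise level, the polynomial degrees (1 and 7) and the evaluation point `x0` are arbitrary choices. Each model is refitted on many simulated training sets, and we look at the spread of its predictions at `x0`.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Assumed "true" relationship, for illustration only
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0.0, 1.0, 30)   # fixed inputs, only the noise changes
x0 = 0.25                             # point where predictions are compared
noise_sd = 0.3

def predictions_at_x0(degree, n_datasets=500):
    """Fit a polynomial of the given degree on many noisy training sets."""
    preds = np.empty(n_datasets)
    for i in range(n_datasets):
        y_train = true_f(x_train) + rng.normal(0.0, noise_sd, x_train.size)
        coefs = np.polyfit(x_train, y_train, degree)
        preds[i] = np.polyval(coefs, x0)
    return preds

results = {}
for degree in (1, 7):
    preds = predictions_at_x0(degree)
    bias_sq = (preds.mean() - true_f(x0)) ** 2   # squared bias at x0
    variance = preds.var()                       # variance of predictions at x0
    results[degree] = (bias_sq, variance)
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

The straight line (degree 1) misses the curve in the same way on every training set (high bias, low variance), while the degree-7 polynomial is right on average but moves more from one training set to the next.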

The figure below illustrates the range of predictions for a given input by a model trained with different datasets, depending on its bias and variance *([source](http://scott.fortmann-roe.com/docs/BiasVariance.html))*:

<img src="https://sebastienplat.s3.amazonaws.com/a9a3a238b8b5a0bfe07d83b1f07c85bd1472143621831" align=left>

A more detailed article about the [bias-variance tradeoff](https://sebastienplat.github.io/blog/bias-variance-tradeoff) and how to identify under- and overfitting models is available on this blog.

___

## Parametric vs Non-Parametric Methods

### Parametric Models

1. We make an **assumption about the functional form**, or shape, of $f$, the simplest of which is that it is linear.
2. We fit the model to a training set: we find the values of the function's parameters that match $Y_{train}$ most closely.

Example for a linear model:
1. $f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p$.
2. Find values of $\beta_0, ..., \beta_p$ that minimize the gaps between $\hat{Y}_{train}$ and $Y_{train}$.
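
The two steps above can be sketched with ordinary least squares, which minimizes the sum of squared gaps. The data and the "true" coefficients below are simulated assumptions; `np.linalg.lstsq` is one of several ways to solve the problem.

```python
import numpy as np

rng = np.random.default_rng(42)

n, p = 200, 3
X = rng.normal(size=(n, p))

# Made-up coefficients beta_0 .. beta_3 used to simulate the response
true_beta = np.array([1.0, 2.0, -1.0, 0.5])
y = true_beta[0] + X @ true_beta[1:] + rng.normal(0.0, 0.1, n)

# Add a column of ones so that beta_0 (the intercept) is estimated too
X_design = np.column_stack([np.ones(n), X])

# Least squares: minimizes the sum of squared differences between y and X_design @ beta
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("estimated coefficients:", np.round(beta_hat, 2))
```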

The potential disadvantage of a parametric approach is that:
+ the model we choose will usually **not match the true unknown form of $f$**. If the chosen model is too far from the true $f$, then our estimate will be poor **(underfitting)**. 


Simple parametric models will **not work** if the number of features is **close to or higher than** the number of observations (more details [here](https://medium.com/@jennifer.zzz/more-features-than-data-points-in-linear-regression-5bcabba6883e)). These cases will require a different approach.
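
To see why, here is a sketch with more features than observations, where the response is pure noise (all sizes and data are arbitrary assumptions). Least squares still reproduces the training set perfectly, which says nothing about new data.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 20, 30                      # more features than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)             # pure noise: no real relationship with X

# With p > n, lstsq returns the minimum-norm solution among infinitely many
beta_hat, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
train_rss = np.sum((y - X @ beta_hat) ** 2)

# Fresh data drawn the same way
X_new = rng.normal(size=(1000, p))
y_new = rng.normal(size=1000)
test_mse = np.mean((y_new - X_new @ beta_hat) ** 2)

print(f"rank of X: {rank}")
print(f"training RSS: {train_rss:.2e}")
print(f"test MSE: {test_mse:.2f}")
```

The training error is essentially zero even though there is nothing to learn, and the test error is worse than simply predicting zero for every observation.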

### Non-Parametric Models

Non-parametric methods **do not make explicit assumptions about the functional form** of $f$. Instead they seek an estimate of $f$ that gets as close to the data points as possible, without being too rough.

While non-parametric approaches avoid the issues of parametric assumptions, they suffer from a major disadvantage: since they do not reduce the problem of estimating $f$ to a small number of parameters, a **very large number of observations** (far more than is typically needed for a parametric approach) is required in order to obtain an accurate estimate for $f$. They can also **follow the noise** too closely **(overfitting)**.
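
K-nearest-neighbors regression is one example of a non-parametric method. The helper `knn_predict` below is a hypothetical, minimal implementation for a single predictor, and the data is simulated; it is only meant to show the roughness trade-off.

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k=5):
    """Predict each query point as the mean response of its k nearest training points."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.atleast_1d(np.asarray(x_query, dtype=float))
    # Distance between every query point and every training point (1-D predictor)
    dist = np.abs(x_query[:, None] - x_train[None, :])
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

rng = np.random.default_rng(1)
x_train = rng.uniform(0.0, 1.0, 300)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, 300)

# k = 15: averages out part of the noise
smooth = knn_predict(x_train, y_train, np.array([0.25, 0.75]), k=15)

# k = 1: each training point is its own nearest neighbor, so the noise is copied
rough = knn_predict(x_train, y_train, x_train, k=1)

print("k=15 predictions at 0.25 and 0.75:", np.round(smooth, 2))
print("k=1 reproduces the training responses:", np.allclose(rough, y_train))
```

No functional form is assumed anywhere; the number of neighbors `k` controls how rough the estimate is allowed to be.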

### Trade-off

+ **Linear models** allow for relatively **simple and interpretable** inference, but may not yield as accurate predictions as some other approaches. 
+ Highly **non-linear** approaches may provide predictions that are **more accurate**, but this comes at the expense of **less interpretability**.

### Categorical Predictors

Using categorical predictors requires some preparation. The most typical method is called one-hot encoding: it creates one dummy boolean variable per category, equal to 1 if the observation has this category and 0 otherwise. For linear models, one category is usually left out (dummy encoding) to avoid redundant columns; this remaining category is called the baseline.
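
A minimal sketch of the encoding with a baseline category. The helper `dummy_encode` is hypothetical and written in plain Python for clarity; in practice a library function such as `pandas.get_dummies(..., drop_first=True)` does the same job.

```python
def dummy_encode(values, baseline=None):
    """Return the kept categories and one 0/1 row per observation.

    The baseline category gets no column: it is encoded as all zeros.
    """
    categories = sorted(set(values))
    if baseline is None:
        baseline = categories[0]
    kept = [c for c in categories if c != baseline]
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows

colors = ["red", "green", "blue", "green"]
columns, encoded = dummy_encode(colors, baseline="blue")

print(columns)   # ['green', 'red']
print(encoded)   # [[0, 1], [1, 0], [0, 0], [1, 0]]
```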


___

## Training and Testing

A successful model has low bias and low variance, which means it will perform well on out-of-sample data.

There are two main approaches to train and test a model:
+ splitting the dataset into **three subsets**: one for training, one for validation and one for final testing.
+ splitting the dataset into multiple train/test sets: **cross-validation**.

### Train / Validation / Test Sets

We prepare the dataset for training the model:

+ randomly shuffle data, in order to remove any bias that could stem from the original ordering of the dataset.
+ split the dataset into training and testing subsets. Using a `random_state` ensures the split is always the same.
+ a typical split is 80% of the observations for the training set and 20% for the test set.

This method risks overfitting the test set when the model is tuned over [multiple iterations](https://glassboxmedicine.com/2019/09/15/best-use-of-train-val-test-splits-with-tips-for-medical-data/) (bleeding). A more robust method is to keep a **holdout test set** completely untouched during the entire iteration process. This holdout set will be used only at the very last stage of the process, to assess the accuracy of the final model on completely unseen data.

A typical split is 70% of the observations for the training set, 15% for the validation set and 15% for the holdout set. The drawback is that more data is required.
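
A sketch of the 70 / 15 / 15 split using scikit-learn's `train_test_split`, called twice. The toy arrays are placeholders; the choice to carve out the holdout set first is an assumption about workflow, not a requirement.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 100 observations, 2 features
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Step 1: set aside 15% as the holdout set (shuffled by default)
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.15, random_state=0
)

# Step 2: take a validation set of the same size from the remaining data
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=len(y_holdout), random_state=0
)

print(len(X_train), len(X_val), len(X_holdout))
```

Fixing `random_state` makes both splits reproducible, so the holdout set stays the same across iterations.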


### K-Fold Cross-Validation

Another method can be used to assess both bias and variance of a given model: **K-fold cross-validation**.
+ the dataset is divided into K subsets of equal size called "folds".
+ the model is trained K times, each run holding out a different fold as the test set.
+ the average testing score is used as an estimate of out-of-sample performance.
    
This is especially useful for hyperparameter tuning, as it helps avoid combinations that overfit a single test set.

<img src="https://scikit-learn.org/stable/_images/grid_search_cross_validation.png" width=550px align=left>

More information about Cross-Validation can be found [here](https://scikit-learn.org/stable/modules/cross_validation.html).
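
A minimal sketch of 5-fold cross-validation with scikit-learn. The simulated data, the choice of `LinearRegression` and the $R^2$ scoring are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0.0, 0.5, 200)

# 5 folds; shuffling removes any effect of the original ordering
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# One score per fold: the model is trained on 4 folds and scored on the 5th
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print("score per fold:", np.round(scores, 3))
print("mean score:", round(scores.mean(), 3))
```

The mean is the estimate of out-of-sample performance; the spread across folds gives a rough idea of how stable the model is.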

___ 

## Linear Models

### Use cases

Linear models are great for inference. They help explain:
+ if there is a linear relationship between variables.
+ how strong the relationship between variables is.
+ which variable(s) have a stronger impact on the outcome.


### Assumptions

The two main assumptions of linear models describe the relationship between predictors and response.
+ **additive**: the effect of changes in a predictor on the response is independent from the values of the others.
+ **linear**: one unit change of a given predictor leads to a constant change in the response. 

These highly restrictive assumptions are often violated in practice, in which case more flexible models are required.

### Relationships with Outcome

**Coefficients t-statistics**

Assuming that the errors are Gaussian, we can build a **confidence interval** for each coefficient of our model (in a way that is very similar to the confidence interval of the sample mean). 

This allows us to test the **null hypothesis** that the true value of each coefficient is zero, i.e. that there is **no relationship** between a given variable and the outcome, given the estimated coefficient value and the resulting standard error. 

We can calculate the **p-value** of the related **t-statistic**: the probability of observing such a substantial association between the predictor and the response due to chance alone. An estimated value much larger (in absolute value) than its standard error means that this probability is very small.
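
The computation can be sketched as follows. The data is simulated so that only the first predictor matters. The p-values use a normal approximation to the t-distribution, which is an assumption that is reasonable here because the sample is large; a statistics library would use the exact t-distribution.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

n = 500
X = rng.normal(size=(n, 2))
# Made-up data: only the first predictor truly affects y
y = 1.0 + 2.0 * X[:, 0] + rng.normal(0.0, 1.0, n)

X_design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Estimate the error variance from the residuals
residuals = y - X_design @ beta_hat
dof = n - X_design.shape[1]
sigma2_hat = residuals @ residuals / dof

# Standard error of each coefficient
cov_beta = sigma2_hat * np.linalg.inv(X_design.T @ X_design)
std_err = np.sqrt(np.diag(cov_beta))

# t-statistic: estimated coefficient divided by its standard error
t_stats = beta_hat / std_err

# Two-sided p-values (normal approximation, large n)
p_values = [math.erfc(abs(t) / math.sqrt(2)) for t in t_stats]

for name, t, pv in zip(["intercept", "x1", "x2"], t_stats, p_values):
    print(f"{name}: t = {t:.2f}, p = {pv:.3g}")
```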


**Model's F-statistic**

We can also test the **null hypothesis** that the true value of all coefficients is zero, i.e. that there is **no relationship** between the predictors and the outcome. This hypothesis can be assessed by the **p-value** of the model's **F-statistic**.

_Note: the squared t-statistic of each coefficient equals the F-statistic that compares the full model to the model omitting that variable. So it reports the partial effect of adding that variable to the model._
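
Both statistics can be computed from residual sums of squares. The sketch below uses simulated data and a hypothetical helper `fit_rss`; it also checks the relation between a coefficient's t-statistic and the F-statistic obtained by dropping that variable.

```python
import numpy as np

def fit_rss(X_design, y):
    """Least-squares fit; returns the residual sum of squares and the coefficients."""
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    resid = y - X_design @ beta
    return resid @ resid, beta

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0.0, 1.0, n)

X_full = np.column_stack([np.ones(n), X])
rss_full, beta_hat = fit_rss(X_full, y)
tss = np.sum((y - y.mean()) ** 2)
dof = n - p - 1

# Overall F-statistic: all p coefficients are zero under the null hypothesis
f_overall = ((tss - rss_full) / p) / (rss_full / dof)

# Partial F-statistic: full model vs the model without the last predictor
rss_reduced, _ = fit_rss(X_full[:, :-1], y)
f_partial = (rss_reduced - rss_full) / (rss_full / dof)

# t-statistic of the last predictor in the full model
sigma2_hat = rss_full / dof
std_err_last = np.sqrt(sigma2_hat * np.linalg.inv(X_full.T @ X_full)[-1, -1])
t_last = beta_hat[-1] / std_err_last

print(f"overall F: {f_overall:.1f}")
print(f"t^2 = {t_last ** 2:.4f}, partial F = {f_partial:.4f}")
```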

**Individual p-values vs Model p-value**

Some p-values will be below the significance level by chance. This means that using individual t-statistics and associated p-values will probably lead to the incorrect conclusion that there is a relationship, especially if the number of variables is high.

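The F-statistic, on the other hand, adjusts for the number of predictors. At a 5% significance level, it means that there is only a 5% chance of a Type I error when no predictor is related to the outcome, regardless of the number of predictors.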
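
A quick calculation shows how fast the problem grows. It assumes that no predictor is related to the outcome and that the individual tests are independent, which is a simplification; the function name is made up for the example.

```python
def prob_any_false_positive(n_predictors, alpha=0.05):
    """Chance that at least one of n independent null tests falls below alpha."""
    return 1 - (1 - alpha) ** n_predictors

false_positive_risk = {n: prob_any_false_positive(n) for n in (1, 10, 100)}

for n, risk in false_positive_risk.items():
    print(f"{n:>3} predictors: {risk:.1%} chance of at least one p-value below 0.05")
```

With 100 unrelated predictors, at least one "significant" individual p-value is almost guaranteed.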


### Variables Selection

If the p-value of the F-statistic shows that some of our model's variables are related to the outcome, the next steps will be to identify which ones are important. There are a few classical methods to do it, if $p$ is the total number of predictors:

+ **Forward selection**: start with the null model (no predictors). Fit $p$ simple linear models and keep the one with the lowest residual sum of squares (RSS). Keep adding variables one by one, each time the one that lowers the RSS the most, until some condition is met.
+ **Backward selection**: start with all predictors. Remove the one with the highest p-value. Keep removing variables until some condition is met (all remaining p-values under some threshold).
+ **Mixed selection**: start with forward selection until the highest p-value crosses some threshold, then remove the predictor. Iterate until all p-values are below some threshold and adding any other predictors would lead to high p-values.

_Note: Backward selection cannot be used if p>n, while forward selection can always be used. Forward selection is a greedy approach, and might include variables early that later become redundant. Mixed selection can remedy this._
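
A minimal sketch of forward selection. The function names are hypothetical and the stopping condition (a fixed number of features, `max_features`) is an assumption chosen for simplicity; a p-value or information-criterion threshold could be used instead.

```python
import numpy as np

def rss_of(X_design, y):
    """Residual sum of squares of a least-squares fit."""
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    resid = y - X_design @ beta
    return float(resid @ resid)

def forward_selection(X, y, max_features):
    """Greedily add the predictor that lowers the RSS the most."""
    n, p = X.shape
    selected = []
    remaining = list(range(p))
    while remaining and len(selected) < max_features:
        scores = {}
        for j in remaining:
            # Intercept + already selected predictors + candidate j
            design = np.column_stack([np.ones(n), X[:, selected + [j]]])
            scores[j] = rss_of(design, y)
        best = min(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
# Made-up data: only columns 2 and 4 are related to the outcome
y = 3.0 * X[:, 2] - 2.0 * X[:, 4] + rng.normal(0.0, 0.5, 300)

selected = forward_selection(X, y, max_features=2)
print("selected columns, in order:", selected)
```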

### Model Fit

The accuracy of a linear model is usually assessed using two related quantities:
+ the residual standard error (RSE).
+ the $R^2$ statistic.

Roughly speaking, the RSE is the average amount that the response will deviate from the true regression line. It is considered a measure of the lack of fit of the model.

$R^2$ is the proportion of variance in outcomes explained by the model. It is always between 0 and 1; the closer it is to 1, the better the fit. When the model only includes one variable, $R^2$ is equal to the squared correlation coefficient $r^2$.
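
Both quantities follow directly from the residuals. The sketch below uses simulated data with a single predictor (an assumption), so it can also check that $R^2$ matches the squared correlation coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 1
x = rng.uniform(0.0, 10.0, n)
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, n)   # noise standard deviation is 1

X_design = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
y_hat = X_design @ beta_hat

rss = np.sum((y - y_hat) ** 2)       # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)    # total sum of squares

rse = np.sqrt(rss / (n - p - 1))     # residual standard error
r2 = 1 - rss / tss                   # proportion of variance explained
r = np.corrcoef(x, y)[0, 1]          # correlation between x and y

print(f"RSE = {rse:.3f}")
print(f"R^2 = {r2:.3f}, r^2 = {r ** 2:.3f}")
```

The RSE is expressed in the units of the response and should land near the noise standard deviation used in the simulation.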

_Note: if the data is inherently noisy, or outcomes are influenced by unmeasured factors, $R^2$ might be very small._

_Note: Other measures of fit include Mallows's $C_p$, Akaike information criterion (AIC), Bayesian information criterion (BIC), and adjusted $R^2$._

**Plotting residuals** can help identify patterns not captured by the model, like **interactions between predictors**.

### Predictions

Predictions are associated with three categories of uncertainty:
+ coefficient estimates.
+ model bias due to distance of reality from linear assumptions.
+ random errors due to noise.

These uncertainties can be quantified to provide:
+ confidence intervals: average outcome given specific values of the predictors.
+ prediction intervals: outcome for a given observation given specific values of the predictors.

Prediction intervals tend to be much wider than confidence intervals.
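
The difference comes from the extra noise term in the prediction interval. The sketch below uses simulated data and the approximate critical value 1.96 (normal approximation for a 95% interval, an assumption; the exact value comes from the t-distribution).

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100
x = rng.uniform(0.0, 10.0, n)
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, n)

X_design = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
resid = y - X_design @ beta_hat
sigma2_hat = resid @ resid / (n - 2)
xtx_inv = np.linalg.inv(X_design.T @ X_design)

# Predict at x = 5 (the leading 1 is for the intercept)
x0 = np.array([1.0, 5.0])
y0_hat = x0 @ beta_hat

# Uncertainty on the average outcome: coefficient estimates only
se_mean = np.sqrt(sigma2_hat * (x0 @ xtx_inv @ x0))
# Uncertainty on a single outcome: coefficient estimates + random noise
se_pred = np.sqrt(sigma2_hat * (1.0 + x0 @ xtx_inv @ x0))

z = 1.96  # approximate 95% critical value
conf_int = (y0_hat - z * se_mean, y0_hat + z * se_mean)
pred_int = (y0_hat - z * se_pred, y0_hat + z * se_pred)

print(f"confidence interval: [{conf_int[0]:.2f}, {conf_int[1]:.2f}]")
print(f"prediction interval: [{pred_int[0]:.2f}, {pred_int[1]:.2f}]")
```

Neither interval accounts for model bias, the second source of uncertainty listed above.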

### Multicollinearity

___

## Improving Models

### General Strategy

+ large sample size, few features: a flexible model would fit the data better; the large sample size will limit the overfitting.
+ small sample size, large number of features: a flexible model would probably overfit the training set.
+ large variance of the error term: a flexible model would probably capture the noise and generalize poorly.

### High Bias

Training error will also be high. Potential solutions:

+ Add new features.
+ Add more complexity by introducing polynomial features.
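
A sketch of the second option on simulated data where the true relationship is quadratic (an assumption): the linear model underfits, and adding a squared feature removes most of the training error.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.uniform(-3.0, 3.0, 200)
y = 1.0 + x ** 2 + rng.normal(0.0, 0.5, 200)   # made-up quadratic relationship

def train_mse(design):
    """Mean squared error on the training set of a least-squares fit."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return np.mean((y - design @ beta) ** 2)

linear = np.column_stack([np.ones_like(x), x])
quadratic = np.column_stack([np.ones_like(x), x, x ** 2])   # adds a polynomial feature

mse_linear = train_mse(linear)
mse_quadratic = train_mse(quadratic)

print(f"training MSE, linear features:    {mse_linear:.2f}")
print(f"training MSE, with squared term:  {mse_quadratic:.2f}")
```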

### High Variance

Training error will be much lower than test error. Potential solutions:

+ Increase training size.
+ Reduce number of features, especially those with weak signal to noise ratio.
+ Increase the regularization strength.

___

## Regularization Terms

Regularization aims to prevent overfitting by penalizing large weights when training the model. It adds a regularization term to the loss function, with a regularization parameter called $\lambda$.

### L1 Regularization - LASSO

+ L1 regularization penalizes the absolute value of the weights. 
+ It can do feature selection: insignificant input features are assigned a weight of zero.
+ The resulting models are simple and interpretable, but cannot learn complex patterns.

### L2 Regularization - Ridge Regularization

+ L2 regularization penalizes the square of the weights. 
+ It forces the weights to be small but not zero.
+ Taking squares into account makes it sensitive to outliers.
+ It is able to learn complex data patterns.
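
The difference between the two penalties shows up in the fitted weights. This sketch uses scikit-learn's `Lasso` and `Ridge` on simulated data where only two features matter; note that scikit-learn names the regularization parameter `alpha`, and the values used here are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

X = rng.normal(size=(200, 10))
# Made-up data: only the first two features carry signal
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 0.5, 200)

lasso = Lasso(alpha=0.5).fit(X, y)    # L1 penalty
ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("lasso weights:", np.round(lasso.coef_, 2))
print("ridge weights:", np.round(ridge.coef_, 2))
print(f"weights set exactly to zero: lasso = {n_zero_lasso}, ridge = {n_zero_ridge}")
```

The L1 penalty drops the noise features entirely, while the L2 penalty only shrinks their weights towards zero.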

*See also [this link](https://medium.com/datadriveninvestor/l1-l2-regularization-7f1b4fe948f2) for more information.*

___

## Features Selection

A few methods can be [used](https://scikit-learn.org/stable/modules/feature_selection.html) to reduce the number of features and decrease the risks of overfitting:

+ Remove features with low variance.
+ Univariate selection: only keep features that correlate highly with the outcome feature.
+ Recursive feature elimination: only keep the features that lead to the most accurate model in CV.
+ LASSO: only keep features with non-zero weights.
+ Tree-based features importance.
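
Two of these methods can be sketched with scikit-learn. The data is simulated, and the example uses plain `RFE` with a fixed number of features to keep (an assumption made for brevity); `RFECV` is the variant that picks the number of features by cross-validation.

```python
import numpy as np
from sklearn.feature_selection import RFE, VarianceThreshold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

X = rng.normal(size=(200, 5))
X[:, 4] = 1.0   # constant feature: zero variance, carries no information
# Made-up data: only the first two features are related to the outcome
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 0.5, 200)

# Step 1: remove features whose variance is zero
vt = VarianceThreshold(threshold=0.0)
X_reduced = vt.fit_transform(X)
kept_after_variance = vt.get_support(indices=True)

# Step 2: recursively drop the feature with the smallest weight
rfe = RFE(LinearRegression(), n_features_to_select=2).fit(X_reduced, y)
selected = kept_after_variance[rfe.get_support()]

print("kept after variance filter:", kept_after_variance.tolist())
print("kept after RFE:", selected.tolist())
```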

___

## Hyperparameters Tuning

More information on hyperparameters tuning can be found [here](https://scikit-learn.org/stable/modules/grid_search.html).

In the **[GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)** technique, we define a range of values for every parameter we want to tune. The Grid Search trains models for each combination of values using K-fold CV, then reports their compared performances.

This technique can become VERY resource-intensive for large datasets. It might be better to use **[RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)** in those instances.
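
A minimal grid search sketch. The model (`Ridge`), the grid of `alpha` values, the 5 folds and the scoring metric are all assumptions chosen for the example, and the data is simulated.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 0.5, 200)

# Values to try for each hyperparameter (here, a single one)
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

# Each candidate is scored with 5-fold cross-validation
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best CV mean squared error:", round(-search.best_score_, 3))
```

With several hyperparameters the number of candidates is the product of the list lengths, each fitted once per fold, which is why the cost grows so quickly.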

___

## Appendix - Further Reads

+ [Introduction to Statistical Learning](https://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) _(link to download the .pdf version)_
+ [Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/) _(github.io based e-book)_

+ bias / variance: what / impact / how to measure?
+ loss function: how to define?
+ cross-validation: when / purpose?
+ R² vs adjusted-R²: when / good choice?
+ R² vs adjusted-R² on random noise

+ feature importance: train or test set?

Links to double-check:

+ https://en.m.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
+ http://scott.fortmann-roe.com/docs/MeasuringError.html
+ http://cs229.stanford.edu/materials/ML-advice.pdf