# Model optimism and information criteria

## Defining model optimism

Every machine learning model has some amount of error in its live predictions. That error stems from two different sources: **bias**, the tendency of a model to underfit, and **variance**, the tendency of a model to overfit. The fundamental relationship between these two sources of error is known as the [bias-variance tradeoff](https://www.kaggle.com/residentmario/bias-variance-tradeoff).

The training error of a model is a poor predictor of its real-life performance because the training observations have already been "seen" by the model. In other words, it is not a good idea to measure the performance of a model on the data it was built with. It's a much better idea to measure performance using data that the model has never seen before. The performance on this "holdout set" will be a far more honest estimate of your model's performance in the real world. This is, fundamentally, the basis of [cross-validation](https://www.kaggle.com/residentmario/cross-validation-schemes-with-food-consumption/).
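
To make the gap concrete, here is a minimal sketch on synthetic data. The target function, noise level, sample sizes, and polynomial degree are all made up for illustration. It repeatedly fits a polynomial to a small training set and compares the error on that set with the error on a fresh holdout set of the same size.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Draw n noisy observations of a hypothetical smooth target."""
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + 0.5 * rng.normal(size=n)
    return x, y

train_mse, holdout_mse = [], []
for _ in range(200):
    x_tr, y_tr = make_data(30)   # data the model is built with
    x_ho, y_ho = make_data(30)   # data the model has never seen
    coefs = np.polyfit(x_tr, y_tr, deg=6)
    train_mse.append(np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2))
    holdout_mse.append(np.mean((y_ho - np.polyval(coefs, x_ho)) ** 2))

mean_train = np.mean(train_mse)
mean_holdout = np.mean(holdout_mse)
print(f"mean training MSE: {mean_train:.3f}")
print(f"mean holdout MSE:  {mean_holdout:.3f}")
```

Averaged over the repetitions, the training error should come out noticeably lower than the holdout error, even though both sets are drawn from the same process.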

In general, training error is an optimistic assessment of model performance: it is almost always lower than the error the same model makes on observations it has not seen. Hence we can define the gap between the two as the **model optimism**.

## Some math

Let $\overline{\text{err}}$ be the training error: the average loss over the observations the model was fit on. Let $\text{Err}_{in}$ be the error the fitted model would incur if fresh responses were observed at the same training inputs. *Elements of Statistical Learning* calls $\text{Err}_{in}$ the "in-sample error", and it stands in for test error in what follows. Either may use any metric (aka "loss function") you want (cf. [Model Fit Metrics](https://www.kaggle.com/residentmario/model-fit-metrics/)). Then the measured optimism for any particular model will be:

$$\text{op} = \text{Err}_{in} - \overline{\text{err}}$$

Define the expectation of model optimism as:

$$\omega = E_{y}[\text{op}]$$

Then we can estimate $\text{Err}_{in}$ by adding an estimate of the optimism to $\overline{\text{err}}$:

$$\widehat{\text{Err}}_{\text{in}} = \overline{\text{err}} + \hat{\omega}$$

Where $\hat{a}$ is an estimator for the underlying value $a$.

This just restates what we said earlier: $\text{Err}_{in}$ is $\overline{\text{err}}$ plus model optimism. Now things get interesting. It turns out that when we use mean squared error (MSE) as our error metric, the following relationship holds:

$$\omega = \frac{2}{N}\sum_{i=1}^N \text{Cov}(\hat{y}_i, y_i)$$

Thus the amount by which $\overline{\text{err}}$ underestimates the true error depends on how strongly $y_i$ affects its own prediction. The harder we fit the data, the greater $\text{Cov}(\hat{y}_i, y_i)$ will be, therefore the higher model optimism will be.

Though this equation is specific to the MSE case, a similar relationship holds for a broad range of different model metrics. In general, model optimism is a function of average truth-prediction covariance!

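If we use a linear model, this equation simplifies even further. A linear predictor $\hat{y}_i$ obtained with $d$ inputs or basis functions has the following property:

$$\sum_{i=1}^N \text{Cov}(\hat{y}_i, y_i) = d\sigma_\epsilon^2$$

Where $\sigma_\epsilon^2$ is the variance of the noise term $\epsilon$ in the additive model $y = f(x) + \epsilon$. So for least-squares fits that are linear in their parameters (linear regression, polynomial regression, etc.) we may write the following; for regularized fits such as ridge and lasso, $d$ is replaced by an effective number of parameters: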

$$E_y(\text{Err}_{\text{in}}) = E_y(\overline{\text{err}}) + 2\frac{d}{N}\sigma^2_\epsilon$$
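
The identity above can be checked numerically. The following is a rough simulation sketch, not a proof: the design matrix, coefficients, and noise level are arbitrary, and $\text{Err}_{in}$ is taken to be the error on fresh responses drawn at the same training inputs. It fits ordinary least squares many times and compares the average measured optimism, and the covariance formula, against $2\frac{d}{N}\sigma_\epsilon^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma, reps = 50, 5, 1.0, 20000

X = rng.normal(size=(N, d))             # fixed design, reused in every replication
beta = rng.normal(size=d)               # hypothetical true coefficients
f = X @ beta                            # noiseless signal at the training inputs
H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix: y_hat = H @ y

Y = f + sigma * rng.normal(size=(reps, N))      # training responses, one row per replication
Y_new = f + sigma * rng.normal(size=(reps, N))  # fresh responses at the same inputs
Y_hat = Y @ H                                   # H is symmetric, so each row is an OLS fit

err_bar = ((Y - Y_hat) ** 2).mean(axis=1)       # training error per replication
err_in = ((Y_new - Y_hat) ** 2).mean(axis=1)    # error on the fresh responses
omega_sim = (err_in - err_bar).mean()           # average measured optimism

# (2 / N) * sum_i Cov(y_hat_i, y_i), with each covariance estimated across replications
cov_sum = ((Y_hat - Y_hat.mean(axis=0)) * (Y - Y.mean(axis=0))).mean(axis=0).sum()
omega_cov = 2 / N * cov_sum

omega_theory = 2 * d / N * sigma ** 2
print(omega_sim, omega_cov, omega_theory)
```

All three numbers should land close to one another, with the simulated values fluctuating slightly around the theoretical one.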

## More discussion

Although it is not the number you ultimately care about (future inputs rarely coincide with the training inputs), an estimate of $\text{Err}_{in}$ can be useful.

For one thing, it can be used during model selection. When comparing several different models to see which one is most performant (as in [hyperparameter search](https://www.kaggle.com/residentmario/gaming-cross-validation-and-hyperparameter-search/)), picking the one with the lowest training error plus estimated optimism is about the cheapest possible approach (the training errors were already computed as part of the training process, so almost no additional computation is required) and often works surprisingly well. This is because in model selection it is the *relative* sizes of the errors that matter, not their *absolute* sizes. Raw training error alone will not do, though: it keeps falling as a model is fit harder, so it would always favor the most flexible candidate.

This basic idea of estimating a model's test performance using a function of its training error underlies a set of metrics known as **information criteria**. Information criteria "estimate the relative information lost when a given model is used to represent the process that generated the data". True to their name, information criteria come from, and are based on concepts in, information theory. The two criteria most often used are the *Akaike information criterion* (AIC) and the *Bayesian information criterion* (BIC).

AIC and BIC achieve their ends by estimating (a function of) model optimism, $\hat{\omega}$, in a generalized way. Though the absolute value of such an estimate is not very meaningful on its own, its relative value is, because it allows us to compare model performance per our discussion above.

Hence AIC and BIC are often used to perform model selection.

## More math

The math behind AIC and BIC is complicated. AIC relies on the following relationship, which holds asymptotically as $N \to \infty$ when $\hat{\theta}$ is the maximum-likelihood estimate of the model's parameters:

$$-2 E[\log{P_{\hat{\theta}}(Y)}] \approx -\frac{2}{N} \times E\left[\sum_{i=1}^N \log{P_{\hat{\theta}}(y_i)}\right] + 2 \times \frac{d}{N}$$

Given a set of models indexed by a tuning parameter $\alpha$, AIC is defined as:

$$\text{AIC}(\alpha) = \overline{\text{err}}(\alpha) + 2 \frac{d(\alpha)}{N}\hat{\sigma}_\epsilon^2$$

Where $d(\alpha)$ is the number of [degrees of freedom](https://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)) of the model.

BIC rests on a similarly involved derivation.
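
For reference, the generic form of BIC as given in *Elements of Statistical Learning* is shown below, where $\text{loglik}$ denotes the maximized log-likelihood and $d$ the number of parameters:

$$\text{BIC} = -2 \cdot \text{loglik} + (\log N) \cdot d$$

Under a Gaussian model with known noise variance $\sigma_\epsilon^2$, $-2 \cdot \text{loglik}$ equals $N \cdot \overline{\text{err}} / \sigma_\epsilon^2$ up to a constant, so this can be rewritten as:

$$\text{BIC} = \frac{N}{\sigma_\epsilon^2}\left[\overline{\text{err}} + (\log N) \frac{d}{N}\sigma_\epsilon^2\right]$$

In this form BIC is proportional to AIC with the factor $2$ replaced by $\log N$, so it penalizes complex models more heavily whenever $N > e^2 \approx 7.4$.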

To perform model selection, we calculate the AIC or BIC for a range of models across values of $\alpha$, the hyperparameter we are tuning. Whichever model has the lowest AIC or BIC "wins": information theory says that it is the "best" model.
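
As a rough sketch of that procedure, the snippet below scores polynomial fits of increasing degree on synthetic data. Here the polynomial degree plays the role of the tuning parameter. The cubic target, the noise level, and the choice to estimate $\hat{\sigma}_\epsilon^2$ from the most flexible candidate are all assumptions made for illustration. The BIC line uses the squared-error form with $\log N$ in place of $2$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma = 200, 0.5
x = rng.uniform(-1, 1, size=N)
y = 1 - 2 * x + 3 * x ** 3 + sigma * rng.normal(size=N)  # hypothetical cubic truth

degrees = np.arange(1, 9)
err_bar = np.empty(len(degrees))
for k, deg in enumerate(degrees):
    B = np.vander(x, deg + 1)                     # polynomial basis with deg + 1 columns
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)  # least-squares fit
    err_bar[k] = np.mean((y - B @ coef) ** 2)     # training MSE

d = degrees + 1                                   # number of fitted parameters per model
# Noise variance estimated from the most flexible (lowest-bias) candidate.
sigma2_hat = err_bar[-1] * N / (N - d[-1])

aic = err_bar + 2 * d / N * sigma2_hat
bic = err_bar + np.log(N) * d / N * sigma2_hat

best_aic = degrees[np.argmin(aic)]
best_bic = degrees[np.argmin(bic)]
print("training error:", np.round(err_bar, 3))
print("degree chosen by AIC:", best_aic)
print("degree chosen by BIC:", best_bic)
```

Training error can only go down as the degree grows, so on its own it would always pick the largest model; the penalty term is what lets either criterion stop earlier.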

## More discussion

AIC and BIC are implemented in `scikit-learn` as variants of the standard estimators. For example, the lasso regression algorithm (`Lasso`) has an information-criterion variant (`LassoLarsIC`). `LassoLarsIC` differs from `Lasso` in that you do not specify an `alpha` hyperparameter (which controls the strength of the L1 penalty; see [Lasso Regression with Tennis Odds](https://www.kaggle.com/residentmario/lasso-regression-with-tennis-odds/) for more on lasso). Instead $\alpha$ is determined for you automatically, by computing the information criterion for a *bunch* of lasso fits along the regularization path and keeping the best one.

There is no "bare" AIC or BIC calculation in `scikit-learn`. This is mainly because the precise details of how they are (or should be) calculated depend on the model being used.

This design means that there's no new API surface to learn. To use information criteria for model selection, just import the desired model variant, fit, and predict, as you would with a standard `sklearn` model.
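
A minimal sketch of that workflow on synthetic stand-in data (generated with `make_regression`, not the dataset used later) might look like the following. The `criterion` argument switches between AIC and BIC, and the fitted `alpha_` attribute holds the penalty strength that was selected.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsIC

# Synthetic stand-in data: 10 features, only 3 of which carry signal.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

for criterion in ("aic", "bic"):
    model = LassoLarsIC(criterion=criterion).fit(X, y)
    n_kept = np.count_nonzero(model.coef_)  # features left with a nonzero weight
    print(f"{criterion}: alpha_={model.alpha_:.4f}, nonzero coefficients={n_kept}")

y_hat = model.predict(X)  # predictions from the last (BIC) model
```

Nothing here is specific to information criteria beyond the class name and the `criterion` argument, which is the point of the design.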

## Application

Here's a very quick application to the JCPenney dataset. We'll predict the sale price of an item from its list price.

In [None]:
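import pandas as pd
import numpy as np

df = pd.read_csv("../input/jcpenney_com-ecommerce_sample.csv")
# Keep only the first four characters of each price (a crude way to reduce
# every entry to a single number), then drop rows that are not numeric.
df = (df[['list_price', 'sale_price']]
        .astype(str)
        .apply(lambda col: pd.to_numeric(col.str[:4], errors='coerce'))
        .dropna())
df.head()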

In [None]:
X = df.iloc[:, 0].values[:, np.newaxis]
y = df.iloc[:, 1].values

In [None]:
from sklearn.linear_model import LassoLarsIC

clf = LassoLarsIC()
clf.fit(X, y)
y_hat = clf.predict(X)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# Pass the data by keyword; newer seaborn versions reject positional x and y.
sns.jointplot(x=y, y=y_hat)
plt.gcf().suptitle('JCPenney Sale Price Predicted via List Price')
pass

As you can see below, in this case no regularization was used, i.e. the fitted model is a simple linear regression (which was bound to happen with only one feature to choose from):

In [None]:
clf.alpha_

## Conclusion

Why use AIC or BIC to perform model selection? Primarily because information criteria selection is very fast. They also simplify to fairly usable formulas for the linear regression case. AIC and BIC are popular "display elements" for old-school data mining software for these reasons.

They are still used to perform model selection today. However, I don't think they're as much in vogue as they used to be. `sklearn` explains why:

> Information-criterion based model selection is very fast, but it relies on a proper estimation of degrees of freedom, are derived for large samples (asymptotic results) and assume the model is correct, i.e. that the data are actually generated by this model. They also tend to break when the problem is badly conditioned (more features than samples).

For slightly more detail on AIC and BIC, check out [this demo](http://scikit-learn.org/stable/auto_examples/linear_model/plot_lasso_model_selection.html) in the `sklearn` documentation. Most of this notebook was sourced from sections 7.1 through 7.9 of "Elements of Statistical Learning"; for far more mathematical background, refer there.