$$\renewcommand{\kld}{D_{KL}}$$

### Notation
Throughout, think of the distribution $P$ as the true distribution of the data, from which we've obtained a dataset/sample.

$$ \newcommand{\kld}{\mathcal{D}_{KL}}$$

## From Divergence to Deviance

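The **deviance** of a distribution $Q$ is twice $Q$'s total surprise at the data: $-2$ times the sum of the log probabilities $Q$ assigns to the observations. Dividing it by $2N$ gives the sample estimate of the cross entropy $H(P, Q)$, calculated on the available dataset drawn from $P$.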

$$D(q) = -2 \sum_i \log(q_i)$$

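Why the extra factor of 2? It is a convention: with it, the difference in deviance between two nested models is approximately chi-squared distributed, which is what likelihood ratio tests rely on. Either way, deviance is (for data drawn independently) twice the negative log likelihood of the dataset.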

So deviance is an in-sample estimate of quite a few quantities we might care about. **Importantly**, deviance doesn't require knowing the distribution P: it's based entirely on our samples from P and the probabilities specified by our alternative distribution Q.
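As a minimal sketch of that point, the snippet below scores two candidate distributions on a sample using only the sample and the candidates' own densities. The Normal(0, 1) truth, the two candidates, and the helper names `normal_logpdf` and `deviance` are assumptions made for illustration.

```python
import math
import random

def normal_logpdf(x, mu, sigma):
    """Log density of a Normal(mu, sigma) distribution at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def deviance(sample, logpdf):
    """D(q) = -2 * sum_i log q_i, summed over the observed sample."""
    return -2.0 * sum(logpdf(x) for x in sample)

# Assumed setup: the "true" P is Normal(0, 1). It is used only to draw the sample.
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(500)]

# Two hypothetical candidates, scored without ever evaluating P's density.
dev_q = deviance(sample, lambda x: normal_logpdf(x, 0.0, 1.0))
dev_r = deviance(sample, lambda x: normal_logpdf(x, 1.0, 1.0))
print(f"D(q) = {dev_q:.1f}, D(r) = {dev_r:.1f}")
```

The candidate centered where the data actually sit should come out with the lower deviance.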

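If we want, we can write the difference in KL divergence in terms of the deviance. The entropy of $P$ cancels in the difference, leaving a difference of cross entropies, which we estimate from the sample:

$$\kld(p, q) -\kld(p, r) \approx \frac{1}{2N} (D(q) - D(r))$$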

So, if we want to pick the model whose KL divergence from the true data-generating process is smaller, we can just pick the model with lower deviance (assuming the sample we have mirrors what we'd see across the whole data-generating process).
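A quick numerical check of this relationship, as a sketch under assumed conditions: the truth is taken to be Normal(0, 1) and the two candidates are unit-variance Gaussians, for which the KL divergence has the closed form $\mu^2/2$. Since each deviance is $-2$ times a summed log probability, the average log-probability gap is the deviance difference divided by $2N$. The function name `gaussian_deviance` and all parameter values are illustrative.

```python
import math
import random

def gaussian_deviance(sample, mu):
    """Deviance of a Normal(mu, 1) candidate: -2 * sum of log densities."""
    return sum((x - mu) ** 2 + math.log(2.0 * math.pi) for x in sample)

rng = random.Random(1)
n = 5000
sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # assumed truth: P = Normal(0, 1)

mu_q, mu_r = 0.5, 1.0  # two hypothetical candidates Q and R

# Closed form when P = Normal(0, 1): KL(P, Normal(mu, 1)) = mu^2 / 2.
kl_gap_exact = mu_q ** 2 / 2.0 - mu_r ** 2 / 2.0

# Sample-based estimate that never touches P's density.
kl_gap_estimate = (gaussian_deviance(sample, mu_q) - gaussian_deviance(sample, mu_r)) / (2.0 * n)

print(f"exact: {kl_gap_exact:.3f}, from deviance: {kl_gap_estimate:.3f}")
```

The two numbers agree only up to sampling noise, which is the caveat about the sample mirroring the data-generating process.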

### But we are still in-sample and prone to overfitting
But a model's deviance is tied to the negative log likelihood, and that is often exactly what we minimize when fitting. Shouldn't the in-sample (training set) deviance underestimate the deviance we'd see on a fresh validation or test sample?

Yes, it does underestimate, and it shows the classic overfitting problem: with more complex models the training set deviance looks really good while the test set deviance does not. McElreath has a plot of in-sample and out-of-sample deviance for many models fitted to data generated from a Gaussian with standard deviation 1 and means:

$$\mu_i = 0.15 x_{1,i} - 0.4 x_{2,i}$$

The in-sample and out-of-sample deviances, from 10,000 simulations for each model type, are shown below for two sample sizes.

![](images/inoutdeviance.png)

Sidebar:

- The best-fitting model may not be the original generating model. The best choice depends on how much data you have: the less data, the fewer parameters you should use.
- On average, out-of-sample deviance must be larger than in-sample deviance, though an individual pair may have that order reversed because of sample peculiarities.
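The second point can be illustrated with a small simulation. This is a deliberately simpler stand-in for McElreath's regression experiment, not a reproduction of it: the model here just estimates the mean of a Gaussian with known standard deviation 1, and the sample size, seed, and number of simulations are arbitrary choices.

```python
import math
import random

def deviance_known_sigma(sample, mu):
    """Deviance of Normal(mu, 1) on a sample."""
    return sum((y - mu) ** 2 + math.log(2.0 * math.pi) for y in sample)

def one_pair(rng, n):
    """Fit the mean on a training sample, then score it on train and on a fresh test sample."""
    train = [rng.gauss(0.0, 1.0) for _ in range(n)]
    test = [rng.gauss(0.0, 1.0) for _ in range(n)]
    mu_hat = sum(train) / n  # maximum likelihood estimate of the mean
    return deviance_known_sigma(train, mu_hat), deviance_known_sigma(test, mu_hat)

rng = random.Random(2)
pairs = [one_pair(rng, n=20) for _ in range(4000)]

mean_gap = sum(te - tr for tr, te in pairs) / len(pairs)
frac_reversed = sum(te < tr for tr, te in pairs) / len(pairs)
print(f"mean(test - train) = {mean_gap:.2f}, pairs with test < train: {frac_reversed:.0%}")
```

The average gap is positive, yet a sizeable minority of individual train/test pairs come out in the opposite order.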

### AIC: Out-of-sample Deviance can be estimated from within the sample!
When one plots the mean deviances together, we see an interesting phenomenon:

![](images/devianceaic.png)

The test set deviances are roughly $2p$ above the training set deviances, where $p$ is the number of parameters in the model.
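The $2p$ gap can be checked in a toy setting. The sketch below assumes a model with $p$ free group means and known standard deviation 1, which is a simplification chosen so that everything runs with the standard library; it is not the regression setup behind the plot. The function name `simulate_gap`, the group size, and the number of simulations are illustrative, and the total sample size grows with $p$ in this design.

```python
import random

def simulate_gap(rng, p, per_group, n_sims):
    """Average (test - train) deviance for a model with p fitted group means, sigma fixed at 1."""
    total = 0.0
    for _ in range(n_sims):
        gap = 0.0
        for _ in range(p):
            train = [rng.gauss(0.0, 1.0) for _ in range(per_group)]
            test = [rng.gauss(0.0, 1.0) for _ in range(per_group)]
            mu_hat = sum(train) / per_group  # one fitted parameter per group
            # The log(2*pi) terms are identical in train and test, so they cancel in the gap.
            gap += sum((y - mu_hat) ** 2 for y in test) - sum((y - mu_hat) ** 2 for y in train)
        total += gap
    return total / n_sims

rng = random.Random(3)
gaps = {p: simulate_gap(rng, p, per_group=10, n_sims=4000) for p in (1, 2, 3, 4, 5)}
for p, gap in gaps.items():
    print(f"p = {p}: mean gap = {gap:.2f} (2p = {2 * p})")
```

Each simulated gap is noisy, so the match to $2p$ shows up only in the averages.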

This observation leads to an estimate of the out-of-sample deviance by what is called an **information criterion**, the Akaike Information Criterion, or AIC:

$$AIC = D_{train} + 2p$$

Since deviance is tied to KL divergence, maximum likelihood, and more, AIC is a means of correcting for the over-optimism of those measures when they are calculated on the same data that was used to train the model.
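In code, the correction is a one-liner. The sketch below applies it to two hypothetical models for the same simulated data: one with the mean fixed at 0 (no fitted parameters) and one with the mean estimated (one parameter). The data-generating values and the names `gaussian_deviance` and `aic` are assumptions for illustration.

```python
import math
import random

def gaussian_deviance(sample, mu):
    """Deviance of Normal(mu, 1) on a sample."""
    return sum((y - mu) ** 2 + math.log(2.0 * math.pi) for y in sample)

def aic(train_deviance, n_params):
    """AIC = training deviance + 2 * number of fitted parameters."""
    return train_deviance + 2 * n_params

rng = random.Random(4)
train = [rng.gauss(0.3, 1.0) for _ in range(50)]  # hypothetical training data

# Model A: mean fixed at 0, nothing estimated (p = 0).
dev_a = gaussian_deviance(train, 0.0)

# Model B: mean estimated from the data (p = 1).
mu_hat = sum(train) / len(train)
dev_b = gaussian_deviance(train, mu_hat)

aic_a, aic_b = aic(dev_a, 0), aic(dev_b, 1)
print(f"A: deviance {dev_a:.2f}, AIC {aic_a:.2f}")
print(f"B: deviance {dev_b:.2f}, AIC {aic_b:.2f}")
```

Model B can never have the higher training deviance, since its mean is chosen to minimize it; the $2p$ penalty is what gives model A a chance to win.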

Caveats: 

- The $2p$ pattern holds only when the likelihood is approximately multivariate Gaussian. However, likelihood functions often do look multivariate Gaussian near their peak.
- The AIC is a little ad hoc. It uses log base e in its definition, and changing the base can cause the criterion to flip-flop on which of two models is better, because the deviance rescales while the $2p$ penalty does not. Stick with base e!
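The base-dependence can be seen with a small arithmetic example. The two training deviances below are made-up numbers for two hypothetical models, chosen only to show that a flip is possible; `aic_in_base` is an illustrative helper, not a standard function.

```python
import math

def aic_in_base(deviance_nats, n_params, base):
    """AIC-style score with the deviance re-expressed in another log base; the 2p penalty is unchanged."""
    return deviance_nats / math.log(base) + 2 * n_params

def preferred(base):
    """Return which of the two hypothetical models has the lower score in the given log base."""
    score_a = aic_in_base(103.0, 1, base)  # model A: made-up deviance 103 nats, p = 1
    score_b = aic_in_base(100.0, 3, base)  # model B: made-up deviance 100 nats, p = 3
    return "A" if score_a < score_b else "B"

print("base e:", preferred(math.e))
print("base 2:", preferred(2.0))
```

Switching bases stretches the deviance gap between the models but leaves the penalty gap alone, so a close call in base e can tip the other way.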

AIC is fairly simple: it is just a penalized log-likelihood. Models are rewarded for higher likelihood and penalized for having more parameters. In a sense, this penalization is a simple form of regularization on our model.

We won't derive the AIC here, but if you are interested, see http://www.stat.cmu.edu/~larry/=stat705/Lecture16.pdf

#### Why not stick with cross validation?
AIC is a means of selecting a model, and so is cross validation. Which is better?

Neither one. Cross validation can be computationally expensive, especially with multiple hyper-parameters, but can be very reliable. AIC skips that computational cost: we just total up log probability on the training data and add the penalty.
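To make the cost difference concrete, here is a sketch comparing AIC with leave-one-out cross validation on a toy model (estimating the mean of a Gaussian with known standard deviation 1). The setup and names are assumptions for illustration. In this particular case each refit is cheap, but in general leave-one-out needs one model fit per data point while AIC needs a single fit.

```python
import math
import random

def neg2_logpdf(y, mu):
    """-2 * log density of Normal(mu, 1) at y."""
    return (y - mu) ** 2 + math.log(2.0 * math.pi)

rng = random.Random(5)
n = 200
data = [rng.gauss(0.0, 1.0) for _ in range(n)]

# AIC: one fit, then a fixed penalty of 2p with p = 1.
mu_hat = sum(data) / n
aic = sum(neg2_logpdf(y, mu_hat) for y in data) + 2 * 1

# Leave-one-out CV: n refits, each scored on the single held-out point.
total = sum(data)
loo_deviance = 0.0
for y in data:
    mu_without_y = (total - y) / (n - 1)  # mean refitted without the held-out point
    loo_deviance += neg2_logpdf(y, mu_without_y)

print(f"AIC = {aic:.2f}, leave-one-out deviance = {loo_deviance:.2f}")
```

For this simple model the two estimates of out-of-sample deviance land close together.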

We will have more to say about information criteria when we figure out how to do model selection in the Bayesian context.