## Topics in Model Performance

### Underfitting vs. Overfitting

Most folks have by now heard of underfitting and overfitting a model. Simpler models should be preferred, but not at the cost of accuracy: an underfit model is too simple to capture the structure in the data. An overfit model, on the other hand, may not generalize well to new data. One way to measure how well a model fits the data is the $R^2$ metric, which measures the proportion of variance in the target that the model explains.

If we use the example of linear regression and start with a first-order regression to explain the data, we may find that it does not adequately capture the data. We can then incrementally increase the complexity of the model by increasing the order of the polynomial. Past a certain point, however, the model starts overfitting the data. What this means is that the model has used its representational power to memorize the training data, noise included, and will perform poorly on new data that is fed into the model.
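The progression described above can be sketched with NumPy. This is only an illustration under stated assumptions: the cubic ground truth, the noise level, the sample sizes, and the helper names `r_squared` and `make_data` are all hypothetical choices, not part of the original material. It fits polynomials of increasing order to a small training set and reports $R^2$ on both the training data and a separate test set.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Proportion of the variance in y_true that is explained by y_pred."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def make_data(n):
    # Assumed ground truth: a cubic trend plus Gaussian noise
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 2.0 * x ** 3 - x + rng.normal(scale=0.1, size=n)
    return x, y

x_train, y_train = make_data(20)    # small training set
x_test, y_test = make_data(200)     # unseen data from the same process

train_r2, test_r2 = {}, {}
for degree in (1, 3, 10):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_r2[degree] = r_squared(y_train, np.polyval(coeffs, x_train))
    test_r2[degree] = r_squared(y_test, np.polyval(coeffs, x_test))
    print(f"degree={degree:2d}  train R^2={train_r2[degree]:.3f}  "
          f"test R^2={test_r2[degree]:.3f}")
```

Training $R^2$ can only go up as the order increases, because each higher-order model contains the lower-order ones. The test $R^2$ is the number to watch: the first-order fit should score poorly on both sets (underfitting), and a high-order fit will typically score better on the training set than on the test set (overfitting).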

We want a model that strikes a balance between being underfit and overfit. This is often referred to as the bias-variance trade-off. Bias is the error in the model resulting from its inability to accommodate the data: the model does not have the representational power to capture all the variations and patterns in the data. Variance is the error resulting from the sensitivity of the model to the particular data it was trained on, which usually results from an overly complex model. Regularization is often used for this reason: it reduces the effective complexity of a regression (or neural network) by penalizing large coefficients, which shrinks them toward zero and, with an L1 penalty, can eliminate some of them entirely.
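As a rough illustration of how regularization reins in a flexible model, here is a minimal ridge (L2-penalized) regression sketch using the closed-form solution. The data, the degree-9 polynomial features, the penalty values, and the helper name `ridge_fit` are assumptions made for this example. No intercept is fitted, which is a simplification that relies on the synthetic data being roughly centered.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: minimize ||y - Xw||^2 + lam * ||w||^2."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=20)
y = 2.0 * x ** 3 - x + rng.normal(scale=0.1, size=20)  # assumed toy data

# Deliberately flexible model: polynomial features x^1 ... x^9
X = np.column_stack([x ** p for p in range(1, 10)])

coef_norms = {}
for lam in (1e-6, 1e-2, 1.0):
    w = ridge_fit(X, y, lam)
    coef_norms[lam] = float(np.linalg.norm(w))
    print(f"lambda={lam:g}  ||w||={coef_norms[lam]:.3f}")
```

A larger penalty `lam` yields a smaller coefficient norm, trading a little bias for lower variance. An L2 penalty shrinks coefficients without setting them exactly to zero; an L1 penalty (lasso) is the variant that can remove coefficients altogether.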

### Measures for Predictive Performance

The predictive accuracy of a model can be measured using:

#### 1. Cross-validation 

Here we divide the data into non-overlapping subsets and perform training and validation on the different subsets. Depending on how we perform this cross-validation, it can be called K-fold cross-validation or leave-one-out cross-validation (LOOCV). In K-fold cross-validation we divide the data into K folds (subsets), train the model on K-1 folds, and assess its performance on the one fold that was held out. We repeat this K times so that each fold serves as the test fold exactly once while the others serve as the training folds, and then average the K scores.



![Image from the scikit-learn page for K-fold cross validation](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png) 
*K-fold cross-validation, from the scikit-learn page*

If the number of folds is equal to the number of data points, we have leave-one-out cross-validation.
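The procedure above can be sketched in a few lines of NumPy. This is a minimal hand-rolled version for illustration, not the scikit-learn implementation; the helper names `k_fold_indices` and `cross_val_mse`, the linear toy data, and the choice of MSE as the score are assumptions made for this example.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into k disjoint folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def cross_val_mse(x, y, k, degree=1):
    """Average held-out MSE of a polynomial fit, using each fold once as the test set."""
    folds = k_fold_indices(len(x), k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th one
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], deg=degree)
        residuals = y[test_idx] - np.polyval(coeffs, x[test_idx])
        scores.append(np.mean(residuals ** 2))
    return float(np.mean(scores))

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=30)
y = 3.0 * x + 1.0 + rng.normal(scale=0.2, size=30)  # assumed toy data

mse_5fold = cross_val_mse(x, y, k=5)
mse_loocv = cross_val_mse(x, y, k=len(x))  # one data point per fold = LOOCV
print(f"5-fold CV MSE: {mse_5fold:.4f}")
print(f"LOOCV MSE:     {mse_loocv:.4f}")
```

Setting `k` to the number of data points turns the same function into leave-one-out cross-validation, at the cost of fitting the model once per data point.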

#### 2. Information criteria

*Reference* [Predictive metrics presentation from Liberty Mutual](https://www.casact.org/education/rpm/2016/presentations/PM-LM-4-Tevet.pdf)

A number of ideas that are firmly rooted in information theory help us quantify how well a model performs:

1. Log-likelihood and deviance

2. Akaike Information Criterion (AIC)

3. Widely Applicable Information Criterion (WAIC)

4. Bayesian Information Criterion (BIC)

#### Log-likelihood and Deviance

These terms are used to measure the error in our model with regard to the data that the model is trying to fit. Most folks are familiar with the Mean Squared Error (MSE), given by

MSE = $\frac{1}{n} \sum_{i=1}^n (y_{true,i} - y_{predicted,i})^2$

While this is a perfectly acceptable way of measuring error, another way to measure the performance of a model is to use the log-likelihood function:

Log Likelihood = $\sum_{i=1}^n \log p(y_i | \theta)$

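Note that each term $p(y_i | \theta)$ is the likelihood of a single observation. For a discrete outcome it is a probability between 0 and 1, so its logarithm ranges from $-\infty$ (no fit) to 0 (a perfect fit); for a continuous outcome it is a density, which can exceed 1. In either case, a higher log-likelihood indicates a better fit.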

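If the likelihood function is a Normal with fixed variance $\sigma^2$, the log-likelihood equals a constant minus $n \cdot \text{MSE} / (2\sigma^2)$, so maximizing the log-likelihood is equivalent to minimizing the MSE. Deviance is simply -2 times the log-likelihood: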

Deviance = $-2 \sum_{i=1}^n \log p(y_i | \theta)$
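To make the link between MSE, log-likelihood, and deviance concrete, here is a small sketch for a Normal likelihood. It assumes a known, fixed standard deviation `sigma` and uses made-up observations and predictions; the function names are illustrative only.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error between observations and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def gaussian_log_likelihood(y_true, y_pred, sigma):
    """Sum over observations of log N(y_i | mean=y_pred_i, std=sigma)."""
    const = -0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(const - (t - p) ** 2 / (2.0 * sigma ** 2)
               for t, p in zip(y_true, y_pred))

def deviance(y_true, y_pred, sigma):
    """Deviance is -2 times the log-likelihood."""
    return -2.0 * gaussian_log_likelihood(y_true, y_pred, sigma)

# Made-up data: one close set of predictions and one poor set
y_true = [1.0, 2.0, 3.0, 4.0]
y_good = [1.1, 1.9, 3.2, 3.8]
y_poor = [2.0, 2.0, 2.0, 2.0]
sigma = 0.5  # assumed known noise standard deviation

for name, y_pred in (("good", y_good), ("poor", y_poor)):
    print(f"{name}: MSE={mse(y_true, y_pred):.3f}  "
          f"logL={gaussian_log_likelihood(y_true, y_pred, sigma):.3f}  "
          f"deviance={deviance(y_true, y_pred, sigma):.3f}")
```

With `sigma` held fixed, the predictions with the lower MSE also have the higher log-likelihood and the lower deviance, so the three measures rank the two fits identically.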


### Entropy and KL Divergence

### Model Averaging

### Ergodicity

### EVALUATION

1. Underfitting is bad because

    a. It cannot capture complex behavior and will have inherent error (C)

    b. The predicted value is always less than the true value

2. Overfitting is bad because 

    a. The model that is overfit will learn noise (C)

    b. The model is too big

3. Variance of a model is related to 

    a. A model's ability to adapt its parameters to training data
    
    b. The sensitivity of the model to the inputs
