# Model selection with information criteria and cross-validation

## Lecture 5

### GRA 4160
### Predictive modelling with machine learning

#### Lecturer: Vegard H. Larsen

## Lasso model selection: Diabetes dataset from Scikit-learn
These notes are based on: https://scikit-learn.org/stable/auto_examples/linear_model/plot_lasso_model_selection.html

The diabetes dataset shipped with Scikit-learn is a pre-processed version of the data used by Efron, Hastie, Johnstone and Tibshirani (2004) in "Least Angle Regression".
It consists of 442 samples, where each sample represents a patient with diabetes.
The dataset has ten numeric baseline features that describe each patient (note that sex is a binary variable, stored as a number):

- age
- sex
- body mass index (BMI)
- average blood pressure (BP)
- six blood serum measurements (S1, S2, S3, S4, S5, S6)

The target variable is a quantitative measure of disease progression one year after baseline. It takes values ranging from 25 to 346.

The features shipped with Scikit-learn are already mean-centered and scaled so that the sum of squares of each column equals 1. Below we divide each column by its standard deviation, so that each feature has zero mean and unit variance across the dataset.

In [None]:
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True, as_frame=True)
# Rescale each (already mean-centered) column to unit variance
X = X / X.std()

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


# Append pure-noise features so we can later check whether the Lasso discards them
rng = np.random.RandomState(10)
n_random_features = 8
X_random = pd.DataFrame(
    rng.randn(X.shape[0], n_random_features),
    columns=[f"random_{i:02d}" for i in range(n_random_features)])
X = pd.concat([X, X_random], axis=1)

X

In [None]:
X.describe()

## Model selection with information criteria

Information criteria (IC) are tools for model selection.
They are formulas that score a statistical model so that different models fitted to the same data can be compared.

They help to address the trade-off between model complexity and goodness of fit by penalizing overly complex models.
Information criteria can be used for model selection, regularization, and feature selection.

There are several commonly used information criteria:

**Akaike Information Criterion (AIC)**:
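AIC measures the goodness of fit of the model while taking into account the number of parameters in the model. It penalizes models with more parameters, so it is useful in avoiding overfitting.

$$AIC = 2k - 2\log(L)$$

Here $k$ is the number of estimated parameters and $L$ is the maximized value of the likelihood function. Lower values of the criterion indicate a better model.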

**Bayesian Information Criterion (BIC)**:
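BIC is similar to AIC, but it places a stronger penalty on models with a large number of parameters, making it less likely to choose overly complex models.

$$BIC = k\log(n) - 2\log(L)$$

Here $n$ is the number of observations, $k$ the number of estimated parameters, and $L$ the maximized likelihood. Since $\log(n) > 2$ once $n \geq 8$, the BIC penalty per parameter exceeds the AIC penalty for all but the smallest samples.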
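To make the two formulas concrete, the following is a minimal sketch that evaluates them for an ordinary least squares fit on synthetic data. It assumes Gaussian errors, so that the maximized log-likelihood is $-\frac{n}{2}\left(\log(2\pi\hat\sigma^2) + 1\right)$ with $\hat\sigma^2 = RSS/n$, and it counts the noise variance as one of the $k$ parameters. The helper `gaussian_aic_bic` and the simulated data are illustrative and not part of the original notebook.

```python
import numpy as np


def gaussian_aic_bic(X, y):
    """Return (AIC, BIC) of an OLS fit with intercept, assuming Gaussian errors."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])  # design matrix with an intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta) ** 2)
    sigma2 = rss / n  # maximum-likelihood estimate of the noise variance
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = A.shape[1] + 1  # slopes + intercept + noise variance
    aic = 2 * k - 2 * log_lik
    bic = k * np.log(n) - 2 * log_lik
    return aic, bic


rng = np.random.default_rng(0)
n = 200
X_signal = rng.normal(size=(n, 2))  # two informative features
y = 3.0 * X_signal[:, 0] - 2.0 * X_signal[:, 1] + rng.normal(size=n)
X_noise = rng.normal(size=(n, 8))  # eight features unrelated to y
X_full = np.column_stack([X_signal, X_noise])

aic_small, bic_small = gaussian_aic_bic(X_signal, y)
aic_full, bic_full = gaussian_aic_bic(X_full, y)

print(f"small model: AIC={aic_small:.1f}, BIC={bic_small:.1f}")
print(f"full model:  AIC={aic_full:.1f}, BIC={bic_full:.1f}")
```

Both criteria share the same likelihood term, so the gap between them comes only from the penalty: for each model, $BIC - AIC = k(\log(n) - 2)$.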

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoLarsIC
from sklearn.pipeline import make_pipeline

# import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)


lasso_lars_ic = make_pipeline(StandardScaler(), LassoLarsIC(criterion="aic")).fit(X, y)

# Create a Pandas DataFrame to store the results
# It is indexed by the alpha values and holds the AIC values; a BIC column is added further down
results = pd.DataFrame({"alphas": lasso_lars_ic[-1].alphas_,
                        "AIC criterion": lasso_lars_ic[-1].criterion_}).set_index("alphas")

# Get the alpha value that minimizes the AIC criterion
alpha_aic = lasso_lars_ic[-1].alpha_

# Get the coefficients of the Lasso model with the best alpha value
coefs = lasso_lars_ic[-1].coef_

# Print the names of the features that have zero coefficients
zero_coefs = X.columns[coefs == 0]
if zero_coefs.size == 0:
    print("No features have been set to zero by Lasso")
else:
    print("The following features have been set to zero by Lasso under AIC:")
    print(list(zero_coefs))

# Update the LassoLarsIC estimator to use the BIC criterion
# This is done by setting the criterion parameter of the LassoLarsIC estimator to "bic"
lasso_lars_ic.set_params(lassolarsic__criterion="bic").fit(X, y)

# Add the BIC criterion to the DataFrame
results["BIC criterion"] = lasso_lars_ic[-1].criterion_

# Get the alpha value that minimizes the BIC criterion
alpha_bic = lasso_lars_ic[-1].alpha_

# Get the coefficients of the Lasso model with the best alpha value
coefs = lasso_lars_ic[-1].coef_

# Print the names of the features that have zero coefficients
zero_coefs = X.columns[coefs == 0]
if zero_coefs.size == 0:
    print("No features have been set to zero by Lasso")
else:
    print("The following features have been set to zero by Lasso under BIC:")
    print(list(zero_coefs))

In [None]:
# Print the results
print(results)

In [None]:
# Plot both criteria against alpha and mark the alpha selected by each
ax = results.plot()
ax.vlines(alpha_aic, results["AIC criterion"].min(), results["AIC criterion"].max(),
    label="alpha: AIC estimate", linestyles="--", color="tab:blue")
ax.vlines(alpha_bic, results["BIC criterion"].min(), results["BIC criterion"].max(),
    label="alpha: BIC estimate", linestyles="--", color="tab:orange")

ax.set_xlabel(r"$\alpha$")
ax.set_ylabel("criterion")
ax.set_xscale("log")
ax.legend()
ax.set_title("Information criteria for model selection")
plt.show()

## Model selection with cross-validation

Cross-validation is a technique used to evaluate the performance of a machine learning model and to tune its hyperparameters.
It's called "cross-validation" because it involves dividing the dataset into multiple "folds" and training the model on different subsets of the data, while evaluating its performance on the remaining part of the data.

The idea is to use a different subset of the data as the validation set in each iteration, so that the model is trained and evaluated on different portions of the data.

There are several types of cross-validation, including $k$-fold cross-validation, stratified $k$-fold cross-validation, and leave-one-out cross-validation.
The specific approach depends on the nature of the data and the goals of the analysis.

The purpose of cross-validation is to get an estimate of the model's out-of-sample performance that is more robust than evaluating the model on a single train/test split of the data, because it does not depend on how one particular split happened to fall.
By training and evaluating the model on different parts of the data, we get a better sense of how well the model will generalize to new, unseen data.
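The fold mechanics described above can be sketched in a few lines of NumPy. This is an illustrative example only: the helper `k_fold_mse` and the synthetic data are assumptions, and ordinary least squares stands in for the model to keep the code short. In practice Scikit-learn's own cross-validation utilities would normally be used.

```python
import numpy as np


def k_fold_mse(X, y, k=5, seed=0):
    """Return the validation MSE of an OLS fit on each of k folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))  # shuffle before splitting
    folds = np.array_split(indices, k)  # k roughly equal index blocks
    scores = []
    for i in range(k):
        val_idx = folds[i]  # fold i is held out for validation
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_val = np.column_stack([np.ones(len(val_idx)), X[val_idx]])
        # Fit on the k-1 training folds only
        beta, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Evaluate on the held-out fold
        scores.append(np.mean((y[val_idx] - A_val @ beta) ** 2))
    return np.array(scores)


rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=100)

fold_mse = k_fold_mse(X_demo, y_demo, k=5)
print("MSE per fold:", fold_mse)
print("Average MSE: ", fold_mse.mean())
```

Each observation is used for validation exactly once, and the average over the folds is the cross-validation estimate of the prediction error.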

## Implementation in Scikit-learn

In Scikit-learn the Lasso can be fitted with two different solvers: coordinate descent and least angle regression (LARS).
They differ in execution speed and in their sources of numerical error.

`LassoCV` and `LassoLarsCV` select $\alpha$ by cross-validation and solve the problem with coordinate descent and least angle regression, respectively.
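To give a feel for what the coordinate descent solver does, here is a simplified sketch, not Scikit-learn's actual implementation. It minimizes $\frac{1}{2n}\lVert y - Xw\rVert_2^2 + \alpha\lVert w\rVert_1$ by cycling through the coefficients and updating each one with a soft-thresholding step. The function names, the fixed number of sweeps, and the synthetic data are assumptions made for illustration; the features are standardized and the target is centered so that no intercept is needed.

```python
import numpy as np


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, returning 0 if it is within the threshold."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimize (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 by cyclic coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    residual = y - X @ w
    col_sq = (X ** 2).sum(axis=0) / n  # per-feature scaling (1/n) * x_j' x_j
    for _ in range(n_iter):
        for j in range(p):
            residual += X[:, j] * w[j]  # remove feature j's current contribution
            rho = X[:, j] @ residual / n  # correlation of feature j with the partial residual
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
            residual -= X[:, j] * w[j]  # add the updated contribution back
    return w


rng = np.random.default_rng(1)
X_raw = rng.normal(size=(200, 5))
y_raw = 3.0 * X_raw[:, 0] - 2.0 * X_raw[:, 1] + rng.normal(size=200)

# Standardize the features and center the target
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
y_centered = y_raw - y_raw.mean()

w_sparse = lasso_coordinate_descent(X_std, y_centered, alpha=0.5)

# Any alpha above max |x_j' y| / n shrinks every coefficient to zero
alpha_max = np.max(np.abs(X_std.T @ y_centered)) / len(y_centered)
w_zero = lasso_coordinate_descent(X_std, y_centered, alpha=1.01 * alpha_max)

print("alpha = 0.5:         ", w_sparse)
print("alpha > alpha_max:   ", w_zero)
```

The soft-thresholding step is what produces exact zeros: a coefficient is set to zero whenever its feature's correlation with the partial residual is no larger than $\alpha$ in absolute value.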

In [None]:
# Import the LassoCV class from the linear_model module of the scikit-learn library
from sklearn.linear_model import LassoCV

# Create a pipeline that standardizes the features and applies LassoCV with 10-fold cross-validation
model = make_pipeline(StandardScaler(), LassoCV(cv=10)).fit(X, y)

# Set the y-axis limits of the plot
ymin, ymax = 2300, 3800

# Extract the LassoCV model from the pipeline
lasso = model[-1]

# Plot the MSE path of the LassoCV model for different values of alpha on a logarithmic x-axis
plt.semilogx(lasso.alphas_, lasso.mse_path_, linestyle=":")

# Plot the average MSE across the folds for each value of alpha as a black line
plt.plot(lasso.alphas_, lasso.mse_path_.mean(axis=-1),
    color="black", label="Average across the folds", linewidth=2)

# Add a vertical line to show the estimated alpha value selected by cross-validation
plt.axvline(lasso.alpha_, linestyle="--", color="black", label="alpha: CV estimate")
plt.ylim(ymin, ymax)
plt.xlabel(r"$\alpha$")
plt.ylabel("Mean square error")
plt.legend()
plt.title("Mean square error on each fold: coordinate descent")
plt.show()

In [None]:
# Import the LassoLarsCV class from the linear_model module of the scikit-learn library
from sklearn.linear_model import LassoLarsCV

# Create a pipeline that standardizes the features and applies LassoLarsCV with 20-fold cross-validation
model = make_pipeline(StandardScaler(), LassoLarsCV(cv=20)).fit(X, y)

# Extract the LassoLarsCV model from the pipeline
lasso = model[-1]

# Plot the MSE path of the LassoLarsCV model for different values of alpha on a logarithmic x-axis
plt.semilogx(lasso.cv_alphas_, lasso.mse_path_, ":")

# Plot the average MSE across the folds for each value of alpha as a black line
plt.semilogx(lasso.cv_alphas_, lasso.mse_path_.mean(axis=-1),
    color="black", label="Average across the folds", linewidth=2)

# Add a vertical line to show the estimated alpha value selected by cross-validation
plt.axvline(lasso.alpha_, linestyle="--", color="black", label="alpha CV")
plt.ylim(ymin, ymax)
plt.xlabel(r"$\alpha$")
plt.ylabel("Mean square error")
plt.legend()
plt.title("Mean square error on each fold: Lars")
plt.show()
