
Model Complexity

Rafael Garcia Leiva edited this page Jan 3, 2020 · 4 revisions


Surfeit measures how unnecessarily complex a model is. It quantifies how much longer the current model is than the optimal, shortest possible, model. Unlike other measures of model complexity, surfeit does not penalize a very long model if that model is the shortest possible one for the dataset.
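The idea of "excess length with respect to the shortest model" can be illustrated with a rough sketch. It rests on two assumptions that are not taken from this page: the shortest possible description is approximated with zlib compression, and the excess is normalized by the current length. The name surfeit_proxy is hypothetical and is not part of fastautoml; the library's actual computation may differ.

```python
import zlib


def surfeit_proxy(description: bytes) -> float:
    """Relative excess length of a description over its compressed form.

    Hypothetical helper, not part of fastautoml: zlib stands in for the
    (uncomputable) shortest possible description.
    """
    if not description:
        return 0.0
    current_length = len(description)
    shortest_length = len(zlib.compress(description, 9))
    # Clamp at zero: compression overhead can exceed the original length.
    return max(0.0, (current_length - shortest_length) / current_length)


# A textual "model" padded with many useless terms is highly compressible,
# so its proxy surfeit is close to 1.
redundant_model = b"x + 0*x**2 + 0*x**3 + " * 50
print(surfeit_proxy(redundant_model))
```

A description that is already close to incompressible gets a value near 0, however long it is, which mirrors the point that a long model is not penalized when no shorter one exists.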

The class fastautoml.Surfeit allows us to compute the surfeit of a model that belongs to one of the supported families of models. Please refer to the Reference API (TBD) for the list of supported families.

Avoiding Overfitting

Surfeit allows us to estimate the risk that a model is overfitting a dataset.

import numpy as np

n_samples = 900  # number of samples in the dataset
X = np.sort(np.random.rand(n_samples) * 3)
y = np.cos(1.5 * np.pi * X)

For this example we have generated a dataset composed of 900 samples of a sinusoidal curve.

[Figure: Sinusoidal Curve]

We will fit the dataset using a family of polynomial models, with degrees ranging from 1 to 15.

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

from fastautoml.fastautoml import Surfeit
from fastautoml.fastautoml import Inaccuracy

# Metric objects (default constructors assumed)
inacc = Inaccuracy()
sft   = Surfeit()

linacc   = list()
lsurfeit = list()

# Degrees 1 to 15, inclusive
for i in range(1, 16):

    # Expand the single feature into polynomial terms of degree i
    poly = PolynomialFeatures(degree=i, include_bias=False)
    newX = poly.fit_transform(X[:, np.newaxis])

    linear_regression = LinearRegression()
    linear_regression.fit(newX, y)

    # Inaccuracy of the fitted model on the expanded dataset
    inacc.fit(newX, y)
    inaccuracy = inacc.inaccuracy_model(linear_regression)

    # Surfeit of the fitted model on the expanded dataset
    sft.fit(newX, y)
    surfeit = sft.surfeit_model(linear_regression)

    linacc.append(inaccuracy)
    lsurfeit.append(surfeit)

The next figure shows how the inaccuracy and the surfeit of the models change as we increase the degree of the polynomial.

[Figure: Inaccuracy vs. Surfeit]

As expected, the higher the degree of the polynomial, the smaller the error of the model. At the same time, however, the higher the degree, the higher the surfeit of the model and the higher the risk of overfitting.
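The first half of that observation, that the training error shrinks as the degree grows, can be checked without fastautoml. The following is a minimal sketch using only NumPy; it reports the mean squared error on the training data, which is an assumption made for illustration and is not the inaccuracy metric computed by the library. The fixed seed and the names rng and training_mse are likewise illustrative.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
X = np.sort(rng.random(900) * 3)
y = np.cos(1.5 * np.pi * X)

training_mse = {}
for degree in range(1, 16):
    # Polynomial.fit rescales X internally, which keeps high degrees stable.
    model = Polynomial.fit(X, y, degree)
    training_mse[degree] = float(np.mean((model(X) - y) ** 2))

# Print a few representative degrees.
for degree in (1, 5, 10, 15):
    print(degree, training_mse[degree])
```

The training error alone keeps rewarding higher degrees, which is why a second quantity such as surfeit is needed to flag models that are longer than necessary.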

It is worth mentioning that the three metrics proposed by the minimum nescience principle (miscoding, inaccuracy and surfeit) are commensurable: they have the same scale and the same units, so they can be compared directly.

For more information about how to compute the surfeit of a model using the Surfeit class, see the following blog entries:

  • Surfeit of different families of models (TBD)

Mathematical Formulation

Let X be a dataset, y the target variable, and m a model. We define the surfeit of the model m for the target values y as:

Feature Miscoding

If x is a qualitative vector (either a feature or the target variable) taking values from a set of labels G = {g1, ..., gl}, the Kolmogorov complexity of x can be approximated by:

Kolmogorov Compression
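As an illustration, one common way to approximate the complexity of a qualitative vector is the length of an optimal code built from the empirical label frequencies, where each occurrence of a label g costs -log2(count(g) / n) bits. This choice is an assumption made for the sketch and is not confirmed by this page; the name code_length is hypothetical and not part of fastautoml.

```python
from collections import Counter
from math import log2


def code_length(x):
    """Length in bits of x under an optimal code built from label frequencies.

    Assumed approximation: sum over labels g of -count(g) * log2(count(g) / n).
    Hypothetical helper, not part of fastautoml.
    """
    n = len(x)
    counts = Counter(x)
    return sum(-c * log2(c / n) for c in counts.values())


# Two equally likely labels cost one bit per element.
labels = ["a", "a", "b", "b"]
print(code_length(labels))  # 4.0
```

Under this approximation a constant vector has length 0, and the length grows as the labels become more evenly spread.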

If x is a quantitative vector, it has to be discretized first.
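A minimal sketch of such a discretization step is shown below. It uses equal-width binning, which is an assumption made for illustration; the strategy and the number of bins used by fastautoml may be different, and the name discretize is hypothetical.

```python
def discretize(x, n_bins=4):
    """Map each value of x to an equal-width bin index in [0, n_bins - 1].

    Hypothetical helper: equal-width binning is an assumption, the
    discretization used by fastautoml may be different.
    """
    low, high = min(x), max(x)
    if high == low:
        return [0] * len(x)  # constant vector: a single bin
    width = (high - low) / n_bins
    # The maximum value would fall in bin n_bins, so cap it at the last bin.
    return [min(int((v - low) / width), n_bins - 1) for v in x]


values = [0.0, 0.1, 0.5, 0.9, 1.0]
print(discretize(values, n_bins=2))  # [0, 0, 1, 1, 1]
```

The resulting bin indices form a qualitative vector, so the approximation for qualitative vectors can then be applied to them.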
