
Model Complexity

Rafael Garcia Leiva edited this page Dec 24, 2019 · 4 revisions


Surfeit measures how unnecessarily complex a model is. It computes how much longer the current model is than the optimal, shortest possible model. Unlike other measures of model complexity, surfeit does not penalize a very long model if that model is the shortest possible one for the dataset.
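To make the idea concrete, here is a rough sketch of a surfeit-like quantity. It assumes that the length of the shortest possible model can be approximated by the compressed length of a textual description of the model, and that the excess length is normalized by the length of the current description. The function name `surfeit_estimate`, the use of `zlib`, and the normalization are illustrative assumptions, not the fastautoml implementation.

```python
import zlib


def surfeit_estimate(description: str) -> float:
    """Illustrative (hypothetical) surfeit estimate for a model description.

    Approximates the shortest possible description by its zlib-compressed
    length and returns the fraction of the current description that is
    "unnecessary". This is an assumption made for illustration only.
    """
    raw = description.encode("utf-8")
    if not raw:
        raise ValueError("description must not be empty")
    compressed = zlib.compress(raw, 9)
    # Compression overhead can make very short inputs grow, so clamp at 0.
    return max(0.0, 1.0 - len(compressed) / len(raw))


# A highly repetitive description carries a lot of unnecessary length.
redundant = surfeit_estimate("0.5 * x + " * 200)
# A one-character description cannot be shortened any further.
minimal = surfeit_estimate("a")
print(redundant, minimal)
```

Under these assumptions, a long model is only penalized when its description can be compressed substantially, which matches the intuition described above.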

The class fastautoml.Surfeit allows us to compute the surfeit of a model that belongs to one of the supported families of models. Please refer to the Reference API (TBD) for the list of supported families.

Avoiding Overfitting

Surfeit allows us to estimate the risk that a model is overfitting a dataset.

import numpy as np

n_samples = 900
X = np.sort(np.random.rand(n_samples) * 3)
y = np.cos(1.5 * np.pi * X)

For this example we have generated a dataset composed of 900 samples of a sinusoidal curve.

Sinusoidal Curve

We will fit the dataset using a family of polynomial models, where the degree of the polynomials goes from 1 to 15.

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

from fastautoml.fastautoml import Surfeit
from fastautoml.fastautoml import Inaccuracy

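# Assumes both classes can be created with their default constructors
inacc = Inaccuracy()
sft   = Surfeit()

linacc   = list()
lsurfeit = list()

# Degrees 1 to 15 inclusive
for i in np.arange(1, 16):

    # Expand X into the columns X, X^2, ..., X^i
    poly = PolynomialFeatures(degree=i, include_bias=False)
    newX = poly.fit_transform(X[:, np.newaxis])

    linear_regression = LinearRegression()
    linear_regression.fit(newX, y)

    # Inaccuracy of the fitted model on the expanded dataset
    inacc.fit(newX, y)
    inaccuracy = inacc.inaccuracy_model(linear_regression)

    # Surfeit of the fitted model on the expanded dataset
    sft.fit(newX, y)
    surfeit = sft.surfeit_model(linear_regression)

    linacc.append(inaccuracy)
    lsurfeit.append(surfeit)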
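For a single input column, the polynomial expansion used in the loop above (with `include_bias=False`) amounts to stacking the powers of `X` from 1 up to the chosen degree. The following sketch reproduces that expansion with NumPy only; the helper name `polynomial_features` is hypothetical and is not part of scikit-learn or fastautoml.

```python
import numpy as np


def polynomial_features(x, degree):
    """Return the columns x, x**2, ..., x**degree (no constant column)."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** k for k in range(1, degree + 1)])


# Three samples expanded to degree 3: one row per sample, one column per power.
features = polynomial_features([1.0, 2.0, 3.0], degree=3)
print(features)
```

Each extra degree adds one column, and therefore one more coefficient for the linear regression to fit.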

The next figure shows how the inaccuracy and the surfeit of the models change as we increase the degree of the polynomial.

Inaccuracy vs. Surfeit

As expected, the higher the degree of the polynomial, the smaller the error of the model. However, the higher the degree, the higher the surfeit of the model, and therefore the higher the risk of overfitting.
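The first half of this observation can be checked without fastautoml. The sketch below fits least-squares polynomials of degree 1 to 15 to the same kind of dataset and records the training error. It uses `numpy.polynomial.Polynomial.fit` and mean squared error as stand-ins for the linear regression and the inaccuracy metric used above, so the numbers are not comparable with the library's output; the fixed random seed is also an assumption added for reproducibility.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Same kind of dataset as above, with a fixed seed (an assumption).
rng = np.random.default_rng(0)
n_samples = 900
X = np.sort(rng.random(n_samples) * 3)
y = np.cos(1.5 * np.pi * X)

degrees = list(range(1, 16))
train_mse = []
for degree in degrees:
    # Least-squares polynomial fit; the domain is rescaled internally.
    model = Polynomial.fit(X, y, deg=degree)
    residuals = y - model(X)
    train_mse.append(float(np.mean(residuals ** 2)))

for degree, mse in zip(degrees, train_mse):
    print(f"degree={degree:2d}  training MSE={mse:.3e}")
```

Because each polynomial family contains all the lower-degree ones, the training error cannot increase with the degree. That is why the error alone says nothing about overfitting, and a complexity measure such as surfeit is needed alongside it.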

It is worth mentioning that the three metrics proposed by the minimum nescience principle (miscoding, inaccuracy and surfeit) are commensurable: they have the same scale and the same units, so we can compare them directly.

For more information about how to compute the surfeit of a model using the `Surfeit` class, see the following blog entries:

  • Surfeit of different families of models (TBD)

Mathematical Formulation

(TBD)
