Model Complexity
Surfeit measures how unnecessarily complex a model is: it quantifies how much longer the current model is than the optimal (shortest possible) model. Unlike other measures of model complexity, surfeit does not penalize a very long model if that model is the shortest possible one for the dataset.
The class fastautoml.Surfeit allows us to compute the surfeit of a model that belongs to one of the supported families of models. Please refer to the Reference API (TBD) for the list of supported families.
Surfeit allows us to estimate the risk that a model is overfitting a dataset.
import numpy as np

n_samples = 900
X = np.sort(np.random.rand(n_samples) * 3)
y = np.cos(1.5 * np.pi * X)

For this example we have generated a dataset composed of 900 samples of a sinusoidal curve.

We will fit the dataset using a family of polynomial models, with the degree of the polynomial ranging from 1 to 15.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from fastautoml.fastautoml import Surfeit
from fastautoml.fastautoml import Inaccuracy

# Assumption: both classes can be instantiated with their default arguments
inacc = Inaccuracy()
sft = Surfeit()

linacc = list()
lsurfeit = list()

# Degrees 1 to 15 inclusive (np.arange excludes the upper bound)
for i in np.arange(1, 16):
    # Expand X into polynomial features of degree i
    poly = PolynomialFeatures(degree=i, include_bias=False)
    newX = poly.fit_transform(X[:, np.newaxis])
    linear_regression = LinearRegression()
    linear_regression.fit(newX, y)
    # Inaccuracy of the fitted model on the expanded dataset
    inacc.fit(newX, y)
    inaccuracy = inacc.inaccuracy_model(linear_regression)
    # Surfeit of the fitted model on the expanded dataset
    sft.fit(newX, y)
    surfeit = sft.surfeit_model(linear_regression)
    linacc.append(inaccuracy)
    lsurfeit.append(surfeit)

The next figure shows how the inaccuracy and the surfeit of the models change as we increase the degree of the polynomial.

As expected, the higher the degree of the polynomial, the smaller the error of the model. At the same time, however, the higher the degree, the higher the surfeit of the model and the higher the risk of overfitting.
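The first half of this observation can be checked without fastautoml. The sketch below is only an illustration: it uses the plain training mean squared error as a stand-in for the library's inaccuracy metric, fits the polynomials with NumPy's `Polynomial.fit` instead of scikit-learn, and fixes a random seed. All three choices are assumptions made for the example.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Same kind of dataset as above; the fixed seed is an assumption for reproducibility
rng = np.random.default_rng(0)
n_samples = 900
X = np.sort(rng.random(n_samples) * 3)
y = np.cos(1.5 * np.pi * X)

train_mse = []
for degree in range(1, 16):
    # Polynomial.fit rescales X internally, which keeps high degrees well conditioned
    p = Polynomial.fit(X, y, deg=degree)
    # Training MSE is used here in place of fastautoml's inaccuracy
    train_mse.append(float(np.mean((p(X) - y) ** 2)))

for degree, mse in zip(range(1, 16), train_mse):
    print(f"degree={degree:2d}  training MSE={mse:.3e}")
```

The training error alone keeps rewarding higher degrees, which is why a separate complexity measure such as surfeit is needed to flag unnecessarily long models.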
It is worth mentioning that the three metrics proposed by the minimum nescience principle (miscoding, inaccuracy and surfeit) are commensurable: they have the same scale and the same units, so we can compare them directly.
For more information about how to compute the surfeit of a model using the `Surfeit` class, see the following blog entries:
- Surfeit of different families of models (TBD)
Let X be a dataset, y the target variable, and m a model. We define the surfeit of the model m for the target values y as:
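To make the idea concrete, here is a minimal sketch that assumes the surfeit is the relative excess length (l(m) - K(m)) / l(m), where l(m) is the length of a textual description of the model and K(m) is approximated by the length of that description after compression. Both the formula and the use of `zlib` are assumptions for illustration, not the definition used by fastautoml, and `approx_surfeit` is a hypothetical helper.

```python
import zlib

def approx_surfeit(model_text: str) -> float:
    """Relative excess length of a model description over its compressed length.

    Hypothetical helper: the compressed length stands in for the
    (uncomputable) length of the shortest possible model.
    """
    raw = model_text.encode("utf-8")
    l_m = len(raw)
    if l_m == 0:
        raise ValueError("model description must not be empty")
    k_m = len(zlib.compress(raw, 9))
    # Compressors add a fixed overhead, so clamp at 0 for very short strings
    return max(0.0, 1.0 - k_m / l_m)

# A short model, and an equivalent one padded with terms that contribute nothing
concise = "y = 0.5*x + 1.25"
redundant = "y = " + " + ".join("0.0*x**%d" % i for i in range(2, 40)) + " + 0.5*x + 1.25"

print(approx_surfeit(concise))
print(approx_surfeit(redundant))
```

Under these assumptions the padded model scores higher, because most of its description can be compressed away.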

If x is a qualitative vector (either a feature or the target variable) taking values from a set of labels G = {g1, ..., gl}, the Kolmogorov complexity of x can be approximated by:

If x is a quantitative vector, it has to be discretized first.
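As a rough illustration of these two steps, the sketch below approximates the complexity of a qualitative vector by the length in bits of an optimal code built from the empirical label frequencies, and discretizes a quantitative vector with equal-width bins before doing the same. Both choices, as well as the helper names `approx_complexity` and `discretize` and the default of 4 bins, are assumptions for the example; the library may use a different approximation and a different discretization scheme.

```python
import math
from collections import Counter

def approx_complexity(labels):
    """Code length in bits under the empirical label frequencies (assumed stand-in)."""
    n = len(labels)
    counts = Counter(labels)
    return -sum(c * math.log2(c / n) for c in counts.values())

def discretize(values, n_bins=4):
    """Equal-width binning (hypothetical helper); returns one bin index per value."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant vector falls into a single bin
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

qualitative = ["a", "a", "b", "b"]
print(approx_complexity(qualitative))  # 4.0 (two equally likely labels, 1 bit each)

quantitative = [0.0, 1.0, 2.0, 3.0, 4.0]
bins = discretize(quantitative)
print(bins)                            # [0, 1, 2, 3, 3]
print(approx_complexity(bins))
```

A vector with a single repeated label costs 0 bits under this approximation, while more evenly spread labels cost more.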