Model Inaccuracy

Inaccuracy

Given a dataset X, a target variable y, and a model m, the inaccuracy of m measures the distance between the real and the predicted values, that is, how difficult it is to reconstruct y given the predictions of m, and the other way around. Another interpretation of inaccuracy is the effort (measured as the length of a computer program) required to fix the errors made by the model m.

The nescience.inaccuracy class allows us to compute the inaccuracy of a model given a dataset and a target variable.

Model Evaluation

For this example, we are going to train a DecisionTreeClassifier to classify the digits dataset, using a min_samples_leaf of 5 to avoid overfitting.

>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.datasets import load_digits
>>> X, y = load_digits(return_X_y=True)
>>> tree = DecisionTreeClassifier(min_samples_leaf=5)
>>> tree.fit(X, y)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=5, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

With the following code we can compute the inaccuracy of this model, that is, how far its predictions are from the real values (the lower the value, the better the predictions).

>>> from nescience.inaccuracy import Inaccuracy
>>> inacc = Inaccuracy(y_type='categorical')
>>> inacc.fit(X, y)
Inaccuracy()
>>> inacc.inaccuracy_model(tree)
0.17094976238811344

For more information about the Inaccuracy class, see the following blog entries:

Score vs. Inaccuracy (TBD)
Influence of the Shape of Errors (TBD)
Dealing with Imbalanced Datasets (TBD)

Inaccuracy of Predictions

The Inaccuracy class can also work directly with a set of predictions, instead of using a model. This allows us to compute the inaccuracy of models that are not implemented in scikit-learn.

>>> pred = tree.predict(X)
>>> inacc.inaccuracy_predictions(pred)
0.17094976238811344

As expected, both methods provide the same result.

Mathematical Formulation

Let X be a dataset, y the target variable, m a model, and ŷ = m(X) the predicted values of m given X. We define the inaccuracy of the model m for the target values y as:

[Equation: Feature Miscoding]
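As a rough illustration of the idea behind this definition (how hard it is to reconstruct y from ŷ, and the other way around), the sketch below computes a normalized compression distance between the real and the predicted labels. Everything in it is an assumption made for illustration: the names compressed_length and ncd_inaccuracy are hypothetical, zlib is only a practical stand-in for Kolmogorov complexity, and the values it returns should not be expected to match those of nescience.inaccuracy.

```python
import random
import zlib


def compressed_length(labels):
    """Bytes needed by zlib to store a comma-separated sequence of labels."""
    data = ",".join(str(v) for v in labels).encode("utf-8")
    return len(zlib.compress(data, 9))


def ncd_inaccuracy(y_true, y_pred):
    """Normalized compression distance between real and predicted labels.

    Illustrative stand-in only; not the formula used by nescience.
    """
    len_true = compressed_length(y_true)
    len_pred = compressed_length(y_pred)
    # Compress both sequences together: whatever in y_pred repeats y_true
    # adds very little to the joint length.
    len_joint = compressed_length(list(y_true) + list(y_pred))
    return (len_joint - min(len_true, len_pred)) / max(len_true, len_pred)


if __name__ == "__main__":
    rng = random.Random(0)
    y_true = [rng.randrange(10) for _ in range(300)]
    y_random = [rng.randrange(10) for _ in range(300)]
    print(ncd_inaccuracy(y_true, y_true))    # perfect predictions
    print(ncd_inaccuracy(y_true, y_random))  # unrelated predictions
```

In this sketch, perfect predictions should give a smaller distance than unrelated ones, because the second copy of the labels adds almost nothing to the joint compressed length.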

If x is a qualitative vector (either a feature or the target variable) taking values from a set of labels G = {g₁, ..., g_l}, the Kolmogorov complexity of x can be approximated by:

[Equation: Kolmogorov Compression]
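One common way to approximate the complexity of a vector of labels is the length, in bits, that the vector would take under an optimal code built from the label frequencies. The sketch below follows that approach. It is an assumption about the kind of approximation meant here, not a transcription of the formula, and the function name code_length_bits is hypothetical.

```python
import math
from collections import Counter


def code_length_bits(x):
    """Length in bits of x under an optimal code for its label frequencies.

    Each label g seen count(g) times out of n costs -log2(count(g) / n)
    bits per occurrence. Illustrative assumption, not the nescience formula.
    """
    n = len(x)
    counts = Counter(x)
    return sum(-count * math.log2(count / n) for count in counts.values())


if __name__ == "__main__":
    # Two equally frequent labels cost 1 bit each, so 4 values take 4 bits.
    print(code_length_bits(["a", "a", "b", "b"]))  # 4.0
    # A constant vector carries no information under this approximation.
    print(code_length_bits(["a", "a", "a", "a"]) == 0)  # True
```

Under this approximation, vectors dominated by a single label are cheap to describe, while vectors with many equally likely labels are expensive.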

If x is a quantitative vector, it has to be discretized first.
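A minimal sketch of such a discretization step is shown below, using equal-width bins. The binning strategy, the default number of bins, and the function name discretize_equal_width are assumptions made for illustration. The text above does not say which discretization nescience applies.

```python
def discretize_equal_width(x, n_bins=4):
    """Map each numeric value in x to a bin index in 0 .. n_bins - 1."""
    lo, hi = min(x), max(x)
    if hi == lo:
        # All values are identical, so a single bin is enough.
        return [0] * len(x)
    width = (hi - lo) / n_bins
    # min(...) keeps the maximum value in the last bin instead of a new one.
    return [min(int((v - lo) / width), n_bins - 1) for v in x]


if __name__ == "__main__":
    print(discretize_equal_width([0.0, 1.0, 2.0, 3.0, 4.0]))  # [0, 1, 2, 3, 3]
```

The resulting bin indices can then be treated as labels of a qualitative vector.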
