Model Inaccuracy
Given a dataset X, a target variable y, and a model m, the inaccuracy of m measures the distance between the real and the predicted values, that is, how difficult it is to reconstruct y given the predictions of m, and the other way around. Another interpretation of inaccuracy is the effort (measured as the length of a computer program) required to fix the errors made by the model m.
The Inaccuracy class in nescience.inaccuracy allows us to compute the inaccuracy of a model given a dataset and a target variable.
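The "distance between real and predicted values" can be illustrated with an off-the-shelf compressor. The sketch below is not how nescience computes inaccuracy. It assumes zlib's compressed size as a rough stand-in for program length and uses the normalized compression distance; the function names and the synthetic labels are hypothetical.

```python
import random
import zlib

def compressed_size(labels):
    """Bytes needed by zlib to store the label sequence (labels must be in 0..255)."""
    return len(zlib.compress(bytes(labels), 9))

def compression_distance(y, y_hat):
    """Normalized compression distance between true and predicted labels."""
    c_y = compressed_size(y)
    c_hat = compressed_size(y_hat)
    # Compressing the concatenation shows how much one sequence helps describe the other.
    c_joint = compressed_size(list(y) + list(y_hat))
    return (c_joint - min(c_y, c_hat)) / max(c_y, c_hat)

rng = random.Random(0)
y = [rng.randrange(10) for _ in range(2000)]
# Hypothetical predictions: roughly 30% of the labels are resampled at random.
y_noisy = [rng.randrange(10) if rng.random() < 0.3 else label for label in y]

perfect = compression_distance(y, y)       # predictions identical to the targets
noisy = compression_distance(y, y_noisy)   # predictions with many errors
print(perfect, noisy)
```

Perfect predictions add almost nothing to the description of y, so their distance stays close to zero, while noisy predictions need many extra bytes to describe the corrections.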
For this example, we are going to train a DecisionTreeClassifier to classify the digits dataset, using a min_samples_leaf of 5 to avoid overfitting.
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.datasets import load_digits
>>> X, y = load_digits(return_X_y=True)
>>> tree = DecisionTreeClassifier(min_samples_leaf=5)
>>> tree.fit(X, y)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=5, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False, random_state=None,
                       splitter='best')
With the following code we can compute the inaccuracy of this model, that is, the quality of the predictions made by the model.
>>> from nescience.inaccuracy import Inaccuracy
>>> inacc = Inaccuracy(y_type='categorical')
>>> inacc.fit(X, y)
Inaccuracy()
>>> inacc.inaccuracy_model(tree)
0.17094976238811344
For more information about the Inaccuracy class, see the following blog entries:
- Score vs. Inaccuracy (TBD)
- Influence of the Shape of Errors (TBD)
- Dealing with Imbalanced Datasets (TBD)
The Inaccuracy class can also work directly with a set of predictions instead of a model. This allows us to compute the inaccuracy of models not implemented by scikit-learn.
>>> pred = tree.predict(X)
>>> inacc.inaccuracy_predictions(pred)
0.17094976238811344
As expected, both methods give the same result.
Let X be a dataset, y the target variable, m a model, and ŷ = m(X) the values predicted by m given X. We define the inaccuracy of the model m for the target values y as:

If x is a qualitative vector (either a feature or the target variable) taking values from a set of labels G = {g1, ..., gl}, the Kolmogorov complexity of x can be approximated by:

If x is a quantitative vector, it has to be discretized first.
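The two steps above (discretize a quantitative vector, then estimate the complexity of the resulting labels) can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: it assumes equal-width binning and uses the empirical Shannon code length as the complexity estimate. The names discretize, code_length, and n_bins are hypothetical.

```python
import math
from collections import Counter

def discretize(values, n_bins=10):
    """Map each numeric value to an equal-width bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant vector has a single bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum falls exactly on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def code_length(labels):
    """Empirical Shannon code length, in bits, of a sequence of labels (assumed estimator)."""
    n = len(labels)
    counts = Counter(labels)
    return -sum(c * math.log2(c / n) for c in counts.values())

# Qualitative vector: labels are used directly.
print(code_length(["a", "a", "b", "b"]))

# Quantitative vector: discretize first, then estimate.
x = [0.1, 0.4, 0.35, 0.8, 0.95, 0.5]
print(code_length(discretize(x, n_bins=3)))
```

The number of bins controls how much detail of a quantitative vector survives discretization, so the estimate depends on that choice.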