# Evaluation Metrics
Classification Metrics:
* Classification Accuracy
* Logarithmic Loss
* Area Under ROC Curve
* Confusion Matrix
* Classification Report

Regression Metrics:
* Mean Absolute Error
* Mean Squared Error
* R Squared ($R^2$)

In [1]:
import pandas
import math
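# NOTE: sklearn.cross_validation was deprecated in scikit-learn 0.18 and removed in 0.20;
# on newer releases the same helpers live in sklearn.model_selection.
from sklearn import cross_validation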
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
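
For readers on a scikit-learn release that no longer ships `cross_validation`, the following is a rough sketch of the same k-fold scoring pattern using `sklearn.model_selection`. The synthetic data from `make_classification`, the `_demo` names, `shuffle=True` and `max_iter=1000` are assumptions made for illustration and are not part of the original notebook, so the score will not match the outputs shown in this document.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for a binary classification dataset (assumption, not the Pima data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=8)

# In model_selection, KFold takes n_splits; random_state only matters when shuffle=True.
kfold_demo = KFold(n_splits=10, shuffle=True, random_state=8)
model_demo = LogisticRegression(max_iter=1000)

results_demo = cross_val_score(model_demo, X_demo, y_demo, cv=kfold_demo, scoring='accuracy')
print("Accuracy: %.3f (%.3f)" % (results_demo.mean(), results_demo.std()))
```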



## Classification Metrics
This example uses the Pima Indians diabetes dataset as this is a binary classification problem.

In [2]:
# these examples use the Pima Indian diabetes dataset
url = "pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values

In [3]:
# separate array into features (X) and label (y) parts
X = array[:,0:8]
y = array[:,8]

### Classification Accuracy
The proportion of correct predictions out of all predictions made. This is the most common evaluation metric for classification problems, although it can be misleading when the classes are imbalanced.

In [4]:
num_folds = 10
num_instances = len(X)
seed = 8

kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
model = LogisticRegression()

scoring = 'accuracy'
results = cross_validation.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
print("Accuracy: %.3f (%.3f)" % (results.mean(), results.std()))

Accuracy: 0.770 (0.048)
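
To make the definition concrete, here is a minimal plain-Python sketch of accuracy. The helper name `accuracy_by_hand` and the label lists are made up for illustration; this is not how scikit-learn implements the metric internally.

```python
def accuracy_by_hand(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels, made up for illustration.
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

acc_demo = accuracy_by_hand(y_true_demo, y_pred_demo)
print("Accuracy: %.3f" % acc_demo)  # 6 of 8 predictions are correct
```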


### Logarithmic Loss
Also known as logloss. Each prediction is given as a value between 0 and 1 (the probability of belonging to a particular class). The metric rewards or penalizes each prediction in proportion to its confidence (as implied by the probability), so a confident wrong prediction is penalized most heavily. The smaller the logloss, the better (zero is a perfect logloss).

In [5]:
num_folds = 10
num_instances = len(X)
seed = 8

kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
model = LogisticRegression()

scoring = 'neg_log_loss'
results = cross_validation.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
print("Logloss: %.3f (%.3f)" % (results.mean(), results.std()))

Logloss: -0.493 (0.047)
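
The way logloss scales with confidence can be sketched in plain Python for the binary case. The helper `binary_log_loss`, the clipping constant `eps` and the example probabilities are assumptions for illustration, not scikit-learn's own code.

```python
import math

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) is never evaluated
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical labels and predicted probabilities of class 1.
labels_demo = [1, 0, 1]
confident_right = [0.9, 0.1, 0.9]
hedged_right = [0.6, 0.4, 0.6]
confident_wrong = [0.1, 0.9, 0.1]

for name, probs in [("confident and right", confident_right),
                    ("hedged and right", hedged_right),
                    ("confident and wrong", confident_wrong)]:
    print("%-20s logloss = %.3f" % (name, binary_log_loss(labels_demo, probs)))
```

The three cases should come out in increasing order of loss, with the confident wrong predictions costing far more than the hedged ones.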


For metrics where a smaller score is better, the `cross_val_score` function reports the negated value, so that for every scoring option a larger score is always better. This is why the logloss above is printed as a negative number.
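
A small sketch of what the sign convention means in practice: flipping the sign recovers the usual positive loss, and the largest negated score is the smallest loss. The per-fold values below are hypothetical.

```python
# Hypothetical per-fold scores, shaped like the output of scoring='neg_log_loss'.
neg_scores_demo = [-0.52, -0.47, -0.49]

# Flip the sign to recover the conventional (positive) loss per fold.
losses_demo = [-s for s in neg_scores_demo]
mean_loss_demo = sum(losses_demo) / len(losses_demo)

# "Larger is better" on the negated scores picks the fold with the smallest loss.
best_fold_demo = max(range(len(neg_scores_demo)), key=lambda i: neg_scores_demo[i])

print("Mean logloss: %.3f" % mean_loss_demo)
print("Best fold: %d (logloss %.2f)" % (best_fold_demo, losses_demo[best_fold_demo]))
```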

### Area Under ROC Curve (AUC)
Represents a model's ability to discriminate between positive and negative classes. An area of 1.0 represents a perfect model (every positive instance is ranked above every negative one), whereas an area of 0.5 represents a worthless model (as good as a coin toss). Rough guide:
* Area of 0.9 - 1.0 = excellent
* Area of 0.8 - 0.9 = good
* Area of 0.7 - 0.8 = fair
* Area of 0.6 - 0.7 = poor
* Area of 0.5 - 0.6 = fail

A Receiver Operating Characteristic (ROC) curve plots the true positive rate (i.e. sensitivity or "recall") against the false positive rate at various threshold settings. See http://gim.unmc.edu/dxtests/roc3.htm

In [6]:
num_folds = 10
num_instances = len(X)
seed = 8

kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
model = LogisticRegression()

scoring = 'roc_auc'
results = cross_validation.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
print("AUC: %.3f (%.3f)" % (results.mean(), results.std()))

AUC: 0.824 (0.041)
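
The two ideas above (a ROC point at one threshold, and the area as a measure of ranking quality) can be sketched in plain Python. The sketch relies on the standard equivalence between AUC and the fraction of positive/negative pairs that are ranked correctly, with ties counted as half. The helper names and the scores are hypothetical, and this is not scikit-learn's implementation.

```python
def roc_point(y_true, scores, threshold):
    """Return (false positive rate, true positive rate) at a single threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return fp / n_neg, tp / n_pos

def auc_by_pairs(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical true labels and predicted scores for class 1.
y_true_auc = [0, 0, 1, 1, 0, 1]
scores_auc = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

fpr_demo, tpr_demo = roc_point(y_true_auc, scores_auc, threshold=0.5)
auc_demo = auc_by_pairs(y_true_auc, scores_auc)
print("FPR=%.3f TPR=%.3f at threshold 0.5" % (fpr_demo, tpr_demo))
print("AUC: %.3f" % auc_demo)
```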


### Confusion Matrix
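Tabulates the number of true negatives, false positives (Type I errors), false negatives (Type II errors) and true positives. In scikit-learn's output, rows correspond to the actual classes and columns to the predicted classes. See https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/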

In [7]:
test_size = 0.3
seed = 8

X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=test_size, random_state=seed)
model = LogisticRegression()
model.fit(X_train, y_train)
predicted = model.predict(X_test)

matrix = confusion_matrix(y_test, predicted)
print(matrix)

[[134  13]
 [ 38  46]]
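
As a reading aid, the sketch below unpacks the matrix printed above, assuming scikit-learn's layout for binary labels 0 and 1 (rows are actual classes, columns are predicted classes), and derives a few summary rates from the four counts. The variable names are chosen for illustration.

```python
# Counts copied from the confusion matrix printed above.
matrix_demo = [[134, 13],
               [38, 46]]

# Row 0 = actual class 0, row 1 = actual class 1; column order is the same for predictions.
(tn, fp), (fn, tp) = matrix_demo

total = tn + fp + fn + tp
accuracy_cm = (tn + tp) / total   # share of all predictions that are correct
precision_cm = tp / (tp + fp)     # of the predicted 1s, how many are really 1
recall_cm = tp / (tp + fn)        # of the actual 1s, how many were found
specificity_cm = tn / (tn + fp)   # of the actual 0s, how many were found

print("TN=%d FP=%d FN=%d TP=%d" % (tn, fp, fn, tp))
print("Accuracy: %.3f" % accuracy_cm)
print("Precision: %.3f  Recall: %.3f  Specificity: %.3f" % (precision_cm, recall_cm, specificity_cm))
```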


### Classification Report
The scikit-learn library provides a report that gives a quick idea of the model's performance using several measures: precision, recall, F1-score and support (the number of actual instances) for each class.

In [8]:
test_size = 0.3
seed = 8

X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=test_size, random_state=seed)
model = LogisticRegression()
model.fit(X_train, y_train)
predicted = model.predict(X_test)

report = classification_report(y_test, predicted)
print(report)

             precision    recall  f1-score   support

        0.0       0.78      0.91      0.84       147
        1.0       0.78      0.55      0.64        84

avg / total       0.78      0.78      0.77       231



For class 1.0, the precision is 78% and the recall is 55%: 78% of the instances predicted as 1.0 really are 1.0, but only 55% of the actual 1.0 instances are identified.
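
The figures in the report can be reproduced by hand from the counts in the confusion matrix shown earlier (134, 13, 38, 46). The sketch below is plain Python with a hypothetical helper `precision_recall_f1`; it treats the last row of the report as a support-weighted average of the per-class values.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 (their harmonic mean) for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Treating class 1.0 as the class of interest: TP=46, FP=13, FN=38.
p1, r1, f1_1 = precision_recall_f1(tp=46, fp=13, fn=38)
# Treating class 0.0 as the class of interest, the roles of the off-diagonal counts swap.
p0, r0, f1_0 = precision_recall_f1(tp=134, fp=38, fn=13)

# Support is the number of actual instances of each class.
support0, support1 = 147, 84
weighted_f1 = (f1_0 * support0 + f1_1 * support1) / (support0 + support1)

print("class 0.0: precision=%.2f recall=%.2f f1=%.2f" % (p0, r0, f1_0))
print("class 1.0: precision=%.2f recall=%.2f f1=%.2f" % (p1, r1, f1_1))
print("support-weighted f1=%.2f" % weighted_f1)
```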

## Regression Metrics
This example uses the Boston Home Price dataset as this is a regression problem (the target is a numeric value, and all input variables are numeric).

In [9]:
# these examples use the Boston Home Price dataset
url = "housing.data"
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO',
'B', 'LSTAT', 'MEDV']
dataframe = pandas.read_csv(url, delim_whitespace=True, names=names)
array = dataframe.values

In [10]:
# separate array into features (X) and label (y) parts
X = array[:,0:13]
y = array[:,13]

### Mean Absolute Error (MAE)
The average of the absolute differences between predictions and actual values. Gives an idea of the magnitude of the error but not its direction (i.e. over- or under-predicting). A value of zero indicates no error (i.e. perfect predictions). The lower the score, the better. The `cross_val_score()` function negates this metric, which is why the value reported below is negative.

In [11]:
num_folds = 10
num_instances = len(X)
seed = 8

kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
model = LinearRegression()

scoring = 'neg_mean_absolute_error'
results = cross_validation.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
print("MAE: %.3f (%.3f)" % (results.mean(), results.std()))

MAE: -4.005 (2.084)
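
A minimal plain-Python sketch of MAE, showing that over- and under-predictions do not cancel out the way they do in a signed average. The helper `mae_by_hand` and the values are made up for illustration.

```python
def mae_by_hand(y_true, y_pred):
    """Average absolute difference between predictions and actual values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical actual values and predictions.
actual_mae = [3.0, 5.0, 2.5, 7.0]
predicted_mae = [2.5, 5.5, 4.0, 5.5]

# The signed errors are -0.5, +0.5, +1.5, -1.5, so their plain average is zero.
signed_mean = sum(p - t for t, p in zip(actual_mae, predicted_mae)) / len(actual_mae)
mae_demo = mae_by_hand(actual_mae, predicted_mae)

print("Mean signed error: %.3f" % signed_mean)
print("MAE: %.3f" % mae_demo)
```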


### Mean Squared Error (MSE) and Root Mean Squared Error (RMSE)
Mean Squared Error is the average of the squared differences between predictions and actual values. Taking the square root converts the units back to the original units of the output variable (i.e. the Root Mean Squared Error). As with MAE, the lower the score, the better.

In [12]:
num_folds = 10
num_instances = len(X)
seed = 8

kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
model = LinearRegression()

# MSE
scoring = 'neg_mean_squared_error'
results = cross_validation.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
print("MSE: %.3f (%.3f)" % (abs(results.mean()), abs(results.std())))

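# RMSE
# NOTE: the value in parentheses is the square root of the MSE standard deviation,
# not the standard deviation of the per-fold RMSE.
print("RMSE: %.3f (%.3f)" % (math.sqrt(abs(results.mean())), math.sqrt(abs(results.std()))))

MSE: 34.705 (45.574)
RMSE: 5.891 (6.751)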
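
The relationship between MSE and RMSE can be sketched in plain Python. The second half illustrates a subtlety when aggregating over folds: the square root of the mean MSE is generally not the same as the mean of the per-fold RMSE values. All helper names and numbers here are hypothetical.

```python
import math

def mse_by_hand(y_true, y_pred):
    """Average squared difference between predictions and actual values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical actual values and predictions.
actual_mse = [3.0, 5.0, 2.5, 7.0]
predicted_mse = [2.5, 5.5, 4.0, 5.5]

mse_demo = mse_by_hand(actual_mse, predicted_mse)
rmse_demo = math.sqrt(mse_demo)  # back in the units of the output variable
print("MSE: %.3f  RMSE: %.3f" % (mse_demo, rmse_demo))

# Hypothetical per-fold MSE values, to compare two ways of aggregating.
fold_mse = [4.0, 16.0]
rmse_of_mean = math.sqrt(sum(fold_mse) / len(fold_mse))
mean_of_rmse = sum(math.sqrt(m) for m in fold_mse) / len(fold_mse)
print("sqrt(mean MSE): %.3f  mean(per-fold RMSE): %.3f" % (rmse_of_mean, mean_of_rmse))
```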


### R Squared ($R^2$)
Gives an indication of the goodness of fit ("coefficient of determination") of a set of predictions to the actual values. It's a statistical measure of how close the data are to the fitted regression line. The result is at most 1 (indicating a perfect fit); 0 indicates a model that does no better than always predicting the mean of the actual values, and the score can be negative for a model that does worse than that. More info: http://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit

In [13]:
num_folds = 10
num_instances = len(X)
seed = 8

kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
model = LinearRegression()

scoring = 'r2'
results = cross_validation.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
print("R^2: %.3f (%.3f)" % (results.mean(), results.std()))

R^2: 0.203 (0.595)
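
A plain-Python sketch of the usual definition, $R^2 = 1 - SS_{res} / SS_{tot}$, may help when interpreting the score. The helper `r_squared_by_hand` and the values are made up for illustration; the three cases show a perfect fit, a model that always predicts the mean, and a model that does worse than the mean.

```python
def r_squared_by_hand(y_true, y_pred):
    """1 minus the ratio of residual sum of squares to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical actual values and three sets of predictions.
actual_r2 = [1.0, 2.0, 3.0, 4.0]
perfect_pred = [1.0, 2.0, 3.0, 4.0]
mean_pred = [2.5, 2.5, 2.5, 2.5]
reversed_pred = [4.0, 3.0, 2.0, 1.0]

print("perfect fit:     %.3f" % r_squared_by_hand(actual_r2, perfect_pred))
print("always the mean: %.3f" % r_squared_by_hand(actual_r2, mean_pred))
print("worse than mean: %.3f" % r_squared_by_hand(actual_r2, reversed_pred))
```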
