In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Evaluating a classification model

sklearn reference [page](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter)

- We need a way to choose between models: different model types (linear regression, logistic regression, decision trees, SVM, etc.), tuning parameters, and features.
- Use a **model evaluation procedure** to estimate how well a model will generalize to out-of-sample data.
- This requires a **model evaluation metric** to quantify the model's performance.

## Loading the dataset

In [16]:
# Pima indian diabetes dataset
pima = pd.read_csv('Data/pima.csv')


In [19]:
# show first 5 rows
pima.head()

Unnamed: 0,pregnant,glucose,bp,skin,insulin,bmi,pedigree,age,label
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Each row represents one patient. The label column is 1 if the patient has diabetes and 0 if the patient does not.

In [17]:
pima.label.value_counts()  # class distribution of the label

0    500
1    268
Name: label, dtype: int64

In [18]:
pima.columns

Index(['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree',
       'age', 'label'],
      dtype='object')

Columns:

- **pregnant**: number of times pregnant
- **glucose**: plasma glucose concentration at 2 hours in an oral glucose tolerance test
- **bp**: diastolic blood pressure (mm Hg)
- **skin**: triceps skin fold thickness (mm)
- **insulin**: 2-hour serum insulin (mu U/ml)
- **bmi**: body mass index (weight in kg/(height in m)^2)
- **pedigree**: diabetes pedigree function (a synthesis of the diabetes history in relatives and the genetic relationship of those relatives to the subject)
- **age**: age (years)

**Goal:** to predict the diabetes status of a patient given their health measurements.

In [40]:
# define X and y
feature_cols = ['pregnant','glucose','bp','skin','insulin','bmi','age']
X = pima[feature_cols]
y = pima.label

In [41]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [42]:
X_train.shape, X_test.shape

((576, 7), (192, 7))

In [43]:
# train a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [44]:
# make class predictions for the testing set
y_test_pred = logreg.predict(X_test)

### Classification accuracy

The **classification accuracy** is the percentage of predictions that are correct.

In [45]:
from sklearn import metrics
print(100*metrics.accuracy_score(y_test, y_test_pred))

76.04166666666666


Comparing the **true** and **predicted** responses

In [46]:
# print the first 25 true and predicted responses
print('True:',y_test.to_numpy()[0:25])
print('Pred:',y_test_pred[0:25])

True: [1 0 0 0 0 0 1 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0 0 1 0]
Pred: [1 0 0 0 0 0 1 0 1 1 0 0 1 1 0 1 0 0 0 0 0 0 1 0 0]
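As a small check on the definition, the accuracy for just these 25 responses can be computed by hand. This is a minimal sketch: the two arrays are copied from the printed output above, and the names `y_true_25`, `y_pred_25`, `manual_accuracy`, and `sklearn_accuracy` are introduced here only for illustration.

```python
import numpy as np
from sklearn import metrics

# First 25 true and predicted labels, copied from the printed output
y_true_25 = np.array([1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1,
                      1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0])
y_pred_25 = np.array([1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1,
                      1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0])

# Accuracy is the fraction of positions where prediction equals truth
manual_accuracy = (y_true_25 == y_pred_25).mean()

# accuracy_score computes the same fraction
sklearn_accuracy = metrics.accuracy_score(y_true_25, y_pred_25)

print(manual_accuracy, sklearn_accuracy)
```

Both values should agree, since `accuracy_score` is the share of matching labels.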


**Null accuracy:** the accuracy that could be achieved by always predicting the most frequent class

In [47]:
y_test.value_counts()

0    135
1     57
Name: label, dtype: int64

In [48]:
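# null accuracy: percentage of the most frequent class in y_test
100*135/(135+57)

70.3125

The model's accuracy of about 76.0% is therefore only about 6 percentage points above the null accuracy.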
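The null accuracy can also be obtained from a baseline model rather than by hand. The sketch below uses scikit-learn's `DummyClassifier` with `strategy="most_frequent"`. The arrays `y_demo` and `X_demo` are stand-ins built from the class counts that `y_test.value_counts()` printed (135 zeros, 57 ones); they are not the notebook's actual data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Stand-in labels with the same class counts as y_test (assumption for illustration)
y_demo = np.array([0] * 135 + [1] * 57)
# DummyClassifier ignores the features, so a placeholder column is enough
X_demo = np.zeros((len(y_demo), 1))

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

# score() returns accuracy as a fraction; scale to a percentage
null_accuracy = 100 * baseline.score(X_demo, y_demo)
print(null_accuracy)
```

In a real workflow the baseline would be fit on `X_train, y_train` and scored on `X_test, y_test`, so the "most frequent class" comes from the training set.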

**Conclusion:**
- Classification accuracy is the **easiest classification metric to understand**
- But it does not tell you what **types of errors** your classifier is making

### Confusion matrix

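A **confusion matrix** is a table that describes the performance of a classification model by counting the predictions for each combination of actual and predicted class. In scikit-learn's output, rows are the actual classes and columns are the predicted classes.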

In [49]:
print(metrics.confusion_matrix(y_test,y_test_pred))

[[113  22]
 [ 24  33]]


**Basic terminology**

- **True Positives (TP)**: we correctly predicted that the patient has diabetes
- **True Negatives (TN)**: we correctly predicted that the patient does not have diabetes
- **False Positives (FP)**: we incorrectly predicted that the patient has diabetes
- **False Negatives (FN)**: we incorrectly predicted that the patient does not have diabetes

In [50]:
# save confusion matrix and slice into four pieces
confusion = metrics.confusion_matrix(y_test, y_test_pred)
TP = confusion[1, 1]
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]
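For a binary problem, the four counts can also be unpacked in one step with `ravel()`, which flattens the 2x2 matrix row by row into the order TN, FP, FN, TP. The sketch below rebuilds labels that reproduce the confusion matrix printed earlier; `y_true_demo` and `y_pred_demo` are illustrative stand-ins, not the notebook's `y_test` and `y_test_pred`.

```python
import numpy as np
from sklearn import metrics

# Stand-in labels that reproduce the matrix [[113, 22], [24, 33]]
y_true_demo = np.array([0] * 135 + [1] * 57)
y_pred_demo = np.array([0] * 113 + [1] * 22 + [0] * 24 + [1] * 33)

confusion_demo = metrics.confusion_matrix(y_true_demo, y_pred_demo)

# Row-major flattening: [0,0]=TN, [0,1]=FP, [1,0]=FN, [1,1]=TP
tn, fp, fn, tp = confusion_demo.ravel()
print(tn, fp, fn, tp)
```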

### Metrics computed from a confusion matrix

**Sensitivity (aka recall):** When the actual value is positive (1), how often is the prediction correct?

In [51]:
TP/(TP+FN)

0.5789473684210527

In [52]:
metrics.recall_score(y_test,y_test_pred)

0.5789473684210527

**Specificity:** When the actual value is negative (0), how often is the prediction correct?

In [53]:
TN/(TN+FP)

0.837037037037037
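scikit-learn's `metrics` module has no dedicated specificity function, but specificity is the recall of the negative class, so `recall_score` with `pos_label=0` gives the same number. A minimal sketch, using stand-in labels (`y_true_demo`, `y_pred_demo`) built to match the confusion matrix above:

```python
import numpy as np
from sklearn import metrics

# Stand-in labels that reproduce the matrix [[113, 22], [24, 33]]
y_true_demo = np.array([0] * 135 + [1] * 57)
y_pred_demo = np.array([0] * 113 + [1] * 22 + [0] * 24 + [1] * 33)

# Treat class 0 as the "positive" label, so recall becomes TN / (TN + FP)
specificity = metrics.recall_score(y_true_demo, y_pred_demo, pos_label=0)
print(specificity)
```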

**Precision:** When a positive value is predicted, how often is the prediction correct?

In [54]:
TP/(TP+FP)

0.6

In [55]:
metrics.precision_score(y_test,y_test_pred)

0.6

**F1 score:** The harmonic mean of precision and recall, which is high only when both are high.

In [56]:
metrics.f1_score(y_test,y_test_pred)

0.5892857142857143
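The F1 score can be reproduced from the confusion-matrix counts, which makes the harmonic-mean formula explicit. This sketch hard-codes the counts from the matrix printed earlier (TP=33, FP=22, FN=24); the variable names are chosen here for illustration.

```python
# Counts taken from the confusion matrix [[113, 22], [24, 33]]
tp, fp, fn = 33, 22, 24

precision = tp / (tp + fp)   # 33 / 55
recall = tp / (tp + fn)      # 33 / 57

# Harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)

# Equivalent form written directly in terms of the counts
f1_counts = 2 * tp / (2 * tp + fp + fn)

print(f1_manual, f1_counts)
```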

**Balanced accuracy:** The average of sensitivity and specificity, which is useful when the classes are imbalanced.

In [57]:
metrics.balanced_accuracy_score(y_test,y_test_pred)

0.7079922027290448
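For a binary problem, this value is the mean of the per-class recalls, so it can be checked against the sensitivity and specificity computed earlier, or against `recall_score` with `average="macro"`. A sketch under the same stand-in labels used to reproduce the confusion matrix (not the notebook's actual arrays):

```python
import numpy as np
from sklearn import metrics

# Stand-in labels that reproduce the matrix [[113, 22], [24, 33]]
y_true_demo = np.array([0] * 135 + [1] * 57)
y_pred_demo = np.array([0] * 113 + [1] * 22 + [0] * 24 + [1] * 33)

sensitivity = 33 / (33 + 24)    # TP / (TP + FN)
specificity = 113 / (113 + 22)  # TN / (TN + FP)

# Mean of the two per-class recalls
balanced_manual = (sensitivity + specificity) / 2

# Macro-averaged recall gives the same quantity
balanced_macro = metrics.recall_score(y_true_demo, y_pred_demo, average="macro")
balanced_sklearn = metrics.balanced_accuracy_score(y_true_demo, y_pred_demo)

print(balanced_manual, balanced_macro, balanced_sklearn)
```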

**Which metrics should you focus on?**

- Choice of metric depends on your objective
- **Spam filter** (positive class is "spam"): Optimize for **precision or specificity** because false negatives (spam goes to the inbox) are more acceptable than false positives (non-spam is caught by the spam filter)
- **Fraudulent transaction detector** (positive class is "fraud"): Optimize for **sensitivity** because false positives (normal transactions that are flagged as possible fraud) are more acceptable than false negatives (fraudulent transactions that are not detected)
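One way to act on this choice in scikit-learn is the `scoring` parameter described on the reference page linked at the top, which lets an evaluation routine report the metric you care about. The sketch below is illustrative only: it uses synthetic data from `make_classification` rather than the Pima dataset, and the names `X_demo`, `y_demo`, `model`, and `scores` are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, mildly imbalanced binary data (stand-in for a real dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, weights=[0.65, 0.35], random_state=0
)

model = LogisticRegression(max_iter=1000)

# Evaluate the same model under two different metrics via `scoring`
scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring=name).mean()
    for name in ("precision", "recall")
}
print(scores)
```

A spam filter would compare models on the `"precision"` score, while a fraud detector would compare them on `"recall"` (sensitivity).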