
Accuracy is not always a useful metric

### Class imbalance examples: Emails
- Spam classification
 - 99% of emails are real; 1% of emails are spam
- Could build a classifier that predicts ALL emails as real
 - This model would be correct 99% of the time
 - But horrible at actually classifying spam
 - Fails at its original purpose
- When one class is more frequent -> imbalance
- Need more nuanced metrics
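
The 99% figure can be checked with a small pure-Python sketch. The counts (990 real emails, 10 spam) are made up to match the 99% / 1% split above, and `predict_all_real` is a hypothetical stand-in for the trivial classifier, not part of any library.

```python
# Hypothetical inbox matching the 99% / 1% split: 1 = spam, 0 = real.
y_true = [1] * 10 + [0] * 990


def predict_all_real(emails):
    """Trivial classifier (illustrative): label every email as real (0)."""
    return [0 for _ in emails]


y_pred = predict_all_real(y_true)

# Accuracy: fraction of all emails labeled correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Fraction of the spam emails that were actually flagged as spam.
spam_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
spam_caught_rate = spam_caught / sum(y_true)

print(f"accuracy: {accuracy:.2f}")             # accuracy: 0.99
print(f"spam caught: {spam_caught_rate:.2f}")  # spam caught: 0.00
```

The accuracy looks excellent, yet not a single spam email is caught, which is why accuracy alone is misleading on imbalanced data.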
 
## Diagnosing classification predictions

For a binary classification problem, we can draw up a confusion matrix.

### Confusion Matrix

| | Predicted:<br>Spam Email | Predicted:<br>Real Email |    
|---|---|---|
| **Actual: Spam Email** | True Positive | False Negative |
| **Actual: Real Email** | False Positive | True Negative |

 - Spam email correctly labeled as spam = True Positive
 - Real email correctly labeled as real = True Negative
 - Spam email incorrectly labeled as real = False Negative
 - Real email incorrectly labeled as spam = False Positive

The **class of interest** (here, spam) is called the **positive** class.

- **Accuracy** = $\frac{tp+tn}{tp + tn + fp + fn}$
- **Precision** = $\frac{tp}{tp + fp}$<br>
aka Positive Predictive Value (PPV)
- **Recall** = $\frac{tp}{tp+fn}$<br>
aka Sensitivity, hit rate, or True Positive Rate
- **F1 score** = $2 \cdot \frac{precision \cdot recall}{precision + recall}$

High precision -> few real emails are flagged as spam (few false positives among the predicted positives)<br>
High recall -> the classifier catches most of the spam emails (few false negatives)
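
The four formulas above can be sketched as one small helper. The function name `binary_metrics`, the example counts, and the choice to return 0.0 when a denominator is zero are all assumptions made for illustration.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumed convention: report 0.0 when a denominator is zero
    # (nothing predicted positive, or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical spam filter evaluated on 1000 emails (10 spam, 990 real).
metrics = binary_metrics(tp=6, fp=2, fn=4, tn=988)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.994
# precision: 0.750
# recall: 0.600
# f1: 0.667
```

Accuracy stays near 1.0 because of the large true-negative count, while precision and recall expose how the filter actually handles spam.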

# Diabetes dataset
This section uses the Pima Indians diabetes dataset from the UCI Machine Learning Repository. The goal is to predict whether a given female patient will develop diabetes, based on features such as BMI, age, and number of pregnancies, which makes it a binary classification problem.

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [5]:
df = pd.read_csv('datasets/diabetes.csv')
df.head()

Unnamed: 0,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age,diabetes
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [21]:
X = df.drop('diabetes', axis=1)
y = df['diabetes']

In [23]:
# Create training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.4, random_state=42)

knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

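Note the argument order: both `confusion_matrix` and `classification_report` take the true test labels (`y_test`) first and the predictions (`y_pred`) second.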

In [24]:
print(confusion_matrix(y_test, y_pred))

[[176  30]
 [ 56  46]]


In [25]:
print(classification_report(y_test, y_pred))

             precision    recall  f1-score   support

          0       0.76      0.85      0.80       206
          1       0.61      0.45      0.52       102

avg / total       0.71      0.72      0.71       308
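
The class 1 row of this report can be recomputed by hand from the confusion matrix printed just before it. scikit-learn puts the actual labels on the rows and the predicted labels on the columns, in sorted label order, so for labels 0 and 1 the matrix reads `[[TN, FP], [FN, TP]]`. That is the reverse of the table in the Confusion Matrix section, which lists the positive class first. The sketch below hard-codes the printed counts rather than rerunning the model.

```python
# Counts copied from the printed k-NN confusion matrix.
# scikit-learn layout for labels [0, 1]: rows = actual, columns = predicted.
cm = [[176, 30],
      [56, 46]]
(tn, fp), (fn, tp) = cm

precision = tp / (tp + fp)   # of the patients predicted diabetic, how many are
recall = tp / (tp + fn)      # of the diabetic patients, how many were found
f1 = 2 * precision * recall / (precision + recall)
support = tp + fn            # number of actual class 1 samples
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"class 1: precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} support={support}")
print(f"accuracy={accuracy:.2f}")
# class 1: precision=0.61 recall=0.45 f1=0.52 support=102
# accuracy=0.72
```

The recomputed values match the `1` row of the report, and the low recall shows that more than half of the diabetic patients in the test set were missed.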



#### Comparing k-NN and Logistic Regression

In [49]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred2 = logreg.predict(X_test)
print(confusion_matrix(y_test, y_pred2))

[[174  32]
 [ 36  66]]


In [50]:
print(classification_report(y_test, y_pred2))

             precision    recall  f1-score   support

          0       0.83      0.84      0.84       206
          1       0.67      0.65      0.66       102

avg / total       0.78      0.78      0.78       308



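Logistic regression looks better than k-NN on this split: for the diabetes class (1), precision rises from 0.61 to 0.67, recall from 0.45 to 0.65, and F1 from 0.52 to 0.66. The avg / total F1 rises from 0.71 to 0.78.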

## Terminology and derivations from a confusion matrix
- **(number of) actually positive samples (P)**
- **(number of) actually negative samples (N)**
- **(number of) true positives (TP)**<br>
eqv. with hit; correctly classified as positive
- **(number of) true negatives (TN)**<br>
eqv. with correct rejection; correctly classified as negative
- **(number of) false positives (FP)**<br>
eqv. with false alarm, **Type I error**; wrongly classified as positive
- **(number of) false negatives (FN)**<br>
eqv. with miss, **Type II error**; wrongly classified as negative
---
- **sensitivity** or **true positive rate (TPR)**<br>
eqv. with **hit rate**, **recall**<br>
$TPR = TP/P = TP/(TP + FN)$
- **specificity (SPC)** or **true negative rate**<br>
$SPC = TN/N = TN/(TN + FP)$
- **precision** or **positive predictive value (PPV)**<br>
$PPV = TP/(TP + FP)$
- **negative predictive value (NPV)**<br>
$NPV = TN/(TN + FN)$
- **fall-out** or **false positive rate (FPR)**<br>
$FPR = FP/N = FP/(FP + TN) = 1-SPC$
- **false negative rate (FNR)**<br>
$FNR = FN/P = FN/(TP + FN) = 1 - TPR$
- **false discovery rate (FDR)**<br>
$FDR = FP/(TP + FP) = 1-PPV$
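
As a rough illustration, the derived rates above can be computed from the four counts in a few lines. The helper name `derived_rates` is hypothetical, and the sketch assumes every denominator is non-zero. The counts are read from the logistic regression confusion matrix printed earlier, using scikit-learn's `[[TN, FP], [FN, TP]]` layout.

```python
def derived_rates(tp, fn, fp, tn):
    """Compute the rates listed above; assumes all denominators are non-zero."""
    p = tp + fn  # actually positive samples (P)
    n = fp + tn  # actually negative samples (N)
    return {
        "TPR": tp / p,          # sensitivity, recall, hit rate
        "SPC": tn / n,          # specificity, true negative rate
        "PPV": tp / (tp + fp),  # precision
        "NPV": tn / (tn + fn),  # negative predictive value
        "FPR": fp / n,          # fall-out
        "FNR": fn / p,          # miss rate
        "FDR": fp / (tp + fp),  # false discovery rate
    }


# Counts from the logistic regression matrix [[174, 32], [36, 66]].
rates = derived_rates(tp=66, fn=36, fp=32, tn=174)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")

# The complement identities from the list hold by construction.
assert abs(rates["FPR"] - (1 - rates["SPC"])) < 1e-12
assert abs(rates["FNR"] - (1 - rates["TPR"])) < 1e-12
assert abs(rates["FDR"] - (1 - rates["PPV"])) < 1e-12
```

Working from the raw counts keeps each rate tied to its definition, so the identities `FPR = 1 - SPC`, `FNR = 1 - TPR`, and `FDR = 1 - PPV` can be checked directly.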