# Metrics for Classification (in progress)

Accuracy is not always an informative metric, especially when one class is much more common than the other. In this exercise, you will evaluate a binary classifier in more depth by computing a confusion matrix and generating a classification report.
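To see why accuracy alone can mislead, here is a minimal sketch on synthetic labels. The data is hypothetical and not taken from the diabetes dataset: 990 negatives, 10 positives, and a "classifier" that always predicts the majority class.

```python
import numpy as np

# Hypothetical, synthetic labels: 990 negatives and 10 positives.
y_true = np.array([0] * 990 + [1] * 10)

# A "classifier" that ignores its input and always predicts class 0.
y_pred = np.zeros_like(y_true)

# Accuracy: fraction of predictions that match the true label.
accuracy = np.mean(y_pred == y_true)

# Recall for class 1: fraction of actual positives that were predicted positive.
recall_pos = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)

print(f"accuracy: {accuracy:.2f}")
print(f"recall (class 1): {recall_pos:.2f}")
```

The accuracy is 0.99 even though no positive case is ever detected, which is the kind of failure a confusion matrix makes visible.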

We will work with the Pima Indians diabetes dataset obtained from the UCI Machine Learning Repository. The goal is to predict whether or not a given female patient will contract diabetes based on features such as BMI, age, and number of pregnancies, which makes this a binary classification problem. A target value of 0 indicates that the patient does not have diabetes, while a value of 1 indicates that the patient does have diabetes.

We will fit a k-NN classifier to the training data and evaluate its performance on a held-out test set by generating a confusion matrix and a classification report.

In [12]:
# Import necessary modules
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Load the data and treat zeros in these columns as missing values
diabetes = pd.read_csv('./Datasets/diabetes.csv')
zero_as_missing = ['insulin', 'triceps', 'bmi']
diabetes[zero_as_missing] = diabetes[zero_as_missing].replace(0, np.nan)

# Replace missing values with the mean of each column
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
df = pd.DataFrame(imp.fit_transform(diabetes), columns=diabetes.columns)

# Separate the target from the features
y = df['diabetes'].values
X = df.drop('diabetes', axis=1)

# Create training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Instantiate a k-NN classifier: knn
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Predict the labels of the test data: y_pred
y_pred = knn.predict(X_test)

# Generate the confusion matrix and classification report
print("Confusion matrix: \n", confusion_matrix(y_test, y_pred))
print("\nClassification report: \n", classification_report(y_test, y_pred))

Confusion matrix: 
 [[176  30]
 [ 52  50]]

Classification report: 
              precision    recall  f1-score   support

        0.0       0.77      0.85      0.81       206
        1.0       0.62      0.49      0.55       102

avg / total       0.72      0.73      0.72       308
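
The class 1 row of the report can be recomputed by hand from the confusion matrix. The sketch below copies the matrix from the output above and assumes scikit-learn's convention that rows are true labels and columns are predicted labels. The variable names are illustrative only.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels.
cm = np.array([[176, 30],
               [52, 50]])

# For binary labels {0, 1}, the flattened order is TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

# Precision: of the patients predicted to have diabetes, how many do.
precision = tp / (tp + fp)

# Recall: of the patients who have diabetes, how many were found.
recall = tp / (tp + fn)

# F1 score: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Accuracy: fraction of all test samples classified correctly.
accuracy = (tp + tn) / cm.sum()

print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
print(f"f1-score:  {f1:.3f}")
print(f"accuracy:  {accuracy:.3f}")
```

These values agree with the class 1 row of the report to two decimal places. The support column is the row sum of the matrix: 176 + 30 = 206 for class 0 and 52 + 50 = 102 for class 1. Note that the classifier finds only about half of the patients who have diabetes, even though its overall accuracy is roughly 73%.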

