I will use the Diabetes dataset to demonstrate how to build a classifier that predicts whether a patient has diabetes from the other information in the dataset.

First, load the necessary libraries:

In [4]:
import numpy as np
import pandas as pd
import seaborn as sns
import itertools

import matplotlib.pyplot as plt
%matplotlib inline

from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_validate, cross_val_score
from sklearn.metrics import confusion_matrix, classification_report, precision_score, accuracy_score, recall_score, f1_score, roc_curve, roc_auc_score
from sklearn import preprocessing

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [5]:
dfDiabetes = pd.read_csv('diabetes.csv')
dfDiabetes.head()

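|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |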


This dataset contains various measurements for a set of patients, along with each patient's class label: **diabetic or not-diabetic**.

  * **2 Classes**: diabetic or not-diabetic
  * **768 instances** (i.e., observations), where each instance represents one patient
  * **9 columns**: 8 features (attributes) plus the `Outcome` class label
  * Missing Attributes: none recorded as NaN

**Our Goal**: Using this dataset, we will build a model that uses the features to predict whether a patient is diabetic or not-diabetic.

In [6]:
type(dfDiabetes)

pandas.core.frame.DataFrame

Checking for Missing Values:

In [7]:
dfDiabetes.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
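
`isna()` only counts values stored as NaN. In the preview printed earlier, `Insulin` and `SkinThickness` contain zeros, which may be placeholders for measurements that were never recorded rather than real readings. The following is a minimal sketch of how one might count such zeros; it re-types the five preview rows so it runs on its own, and the column list `suspect_cols` is an assumption made for illustration, not something the dataset documents here.

```python
import pandas as pd

# The five rows shown by dfDiabetes.head(), re-typed so the example is self-contained.
preview = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

# Assumption: a value of zero is not a plausible measurement in these columns.
suspect_cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count zeros per column; isna() would report none of these.
zero_counts = (preview[suspect_cols] == 0).sum()
print(zero_counts)
```

Running the same count on the full `dfDiabetes` frame would show how many rows are affected before deciding whether to impute or drop them.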

In [8]:
dfDiabetes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [9]:
dfDiabetes.shape

(768, 9)

Creating Training and Test Datasets

In [10]:
X = dfDiabetes.iloc[:, 0:8].values  # columns 0-7 are the features; column 8 (Outcome) is the target and must stay out of X

In [11]:
print(X.shape)
print(type(X))

(768, 8)
<class 'numpy.ndarray'>


In [12]:
Y = dfDiabetes.iloc[:, 8].values  # Outcome column (class label)

In [13]:
print(Y.shape)
print(type(Y))

(768,)
<class 'numpy.ndarray'>


In [14]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5, random_state=42)
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)

(384, 8) (384,)
(384, 8) (384,)


In [15]:
unique_elements, counts_elements = np.unique(Y_train, return_counts=True)
print(unique_elements)
print(counts_elements)

[0 1]
[246 138]
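
The counts above show that the training labels are imbalanced, so accuracy should be read against a majority-class baseline. The sketch below works out that baseline from the printed counts and then illustrates how the `stratify` argument of `train_test_split` keeps the class ratio equal in both halves. The arrays `X_demo` and `y_demo` are hypothetical stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Class counts printed above for Y_train: 246 non-diabetic (0) and 138 diabetic (1).
train_counts = np.array([246, 138])

# Accuracy of a model that always predicts the majority class.
majority_baseline = train_counts.max() / train_counts.sum()
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")  # 0.641

# Hypothetical stand-in labels (not the real data) with a similar imbalance.
y_demo = np.array([0] * 500 + [1] * 268)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo keeps the class ratio the same in the train and test halves.
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=42, stratify=y_demo
)
print(np.bincount(y_tr), np.bincount(y_te))
```

Any classifier trained on this split needs to beat roughly 64% accuracy to be doing better than always guessing "not diabetic".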


Feature Scaling

In [16]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
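
The scaler is fitted on the training set only, and the test set is then transformed with the training mean and standard deviation, so no information from the test set leaks into preprocessing. As a rough illustration of what `StandardScaler` computes, the sketch below reproduces the transform by hand on synthetic arrays (`X_tr` and `X_te` are made-up data, not the diabetes features).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 50 training rows and 20 test rows with 3 features.
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=100.0, scale=15.0, size=(50, 3))
X_te = rng.normal(loc=100.0, scale=15.0, size=(20, 3))

# Learn the per-column mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

# Manual equivalent: subtract the training mean, divide by the training
# standard deviation (population form, ddof=0, which is NumPy's default).
manual = (X_te - X_tr.mean(axis=0)) / X_tr.std(axis=0)

print(np.allclose(X_te_scaled, manual))
print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```

Scaling matters for K-Nearest Neighbors in particular, because the distance computation would otherwise be dominated by the columns with the largest numeric range.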

## Classification Algorithms

K-Nearest Neighbors

In [15]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1, weights='uniform')  # initialize a KNN classifier
knn.fit(X_train, Y_train)  # train the classifier on the training set

In [16]:
Y_train_predicted = knn.predict(X_train)

In [17]:
print("Training Classification accuracy:", knn.score(X_train, Y_train))
print("\n")
print("Training Confusion matrix: \n" + str(confusion_matrix(Y_train, Y_train_predicted)))
print("\n")
print("Training Classification Report:\n", classification_report(Y_train, Y_train_predicted))

Training Classification accuracy: 1.0


Training Confusion matrix: 
[[246   0]
 [  0 138]]


Training Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       246
           1       1.00      1.00      1.00       138

    accuracy                           1.00       384
   macro avg       1.00      1.00      1.00       384
weighted avg       1.00      1.00      1.00       384
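
A perfect training score is expected with `n_neighbors=1`: each training point is its own nearest neighbor, so the model simply looks up the label it memorized. It says little about how the classifier behaves on unseen patients. The sketch below shows the effect on synthetic data generated with `make_classification` (an assumption made so the example is self-contained; the numbers it prints are not results for the diabetes dataset).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data with some label noise, standing in for real data.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=42
)

# Compare training and test accuracy for a few neighborhood sizes.
scores = {}
for k in (1, 5, 15):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"k={k:2d}  train={scores[k][0]:.3f}  test={scores[k][1]:.3f}")
```

For the notebook's own model, the comparable check would be `knn.score(X_test, Y_test)` on the held-out half.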

