### 📊 Case Study: Predicting Diabetes Using K-Nearest Neighbors (KNN)

### 🧩 1. Business Problem

Diabetes is a growing health concern worldwide. Early detection enables timely medical intervention and lifestyle changes.

### 💡 Objective:

Use patient health data to predict whether a person is likely to be diabetic.

We'll use KNN as our model: a simple, interpretable algorithm that classifies a new patient by majority vote among the most similar past patients.



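### 📁 2. Dataset Overview

We are using the Pima Indians Diabetes Dataset, originally distributed through the UCI Machine Learning Repository and loaded here from a local CSV file. (It is not bundled with `sklearn.datasets`; `load_diabetes` there is a different, regression dataset.)

This dataset was collected by the National Institute of Diabetes and Digestive and Kidney Diseases.

📦 768 samples

🎯 Binary target: Diabetic (1) or Not (0)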



### 🔢 3. Feature Descriptions

| Feature | Description |
|---|---|
| Pregnancies | Number of times pregnant |
| Glucose | Plasma glucose concentration |
| BloodPressure | Diastolic blood pressure (mm Hg) |
| SkinThickness | Triceps skin fold thickness (mm) |
| Insulin | 2-Hour serum insulin (mu U/ml) |
| BMI | Body Mass Index |
| DiabetesPedigree | Diabetes pedigree function |
| Age | Age (in years) |
| Outcome | 1 = Diabetic, 0 = Non-diabetic |


In [1]:
# 📌 Import libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from collections import Counter
from sklearn.preprocessing import StandardScaler


In [23]:
# 📌 Load the dataset

url = "pima-indians-diabetes.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
                "BMI", "DiabetesPedigree", "Age", "Outcome"]
df = pd.read_csv(url, names=column_names)
df.head()

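|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigree | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |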


In [24]:
df.isnull().sum()

Pregnancies         0
Glucose             0
BloodPressure       0
SkinThickness       0
Insulin             0
BMI                 0
DiabetesPedigree    0
Age                 0
Outcome             0
dtype: int64
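
The null check reports no missing values, but the first rows already show `0` for `Insulin` and `SkinThickness`, which is not physiologically plausible. In this dataset such zeros are commonly treated as unrecorded measurements. The following is a minimal sketch of counting them, rebuilt from only the five rows displayed earlier; the list `zero_as_missing` is an assumption about which columns to check, not something the notebook defines.

```python
import numpy as np
import pandas as pd

# The five rows shown by df.head(), rebuilt by hand for illustration
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: a zero in these columns means "not measured"
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count zeros per column
zero_counts = (sample[zero_as_missing] == 0).sum()
print(zero_counts)

# Mark the zeros as NaN so that isnull() reports them
cleaned = sample.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)
print(cleaned.isnull().sum())
```

Whether to impute or drop these values afterwards is a modeling decision that this notebook does not make; the model results recorded here use the zeros as they are.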

In [25]:

# 📌 Split into features and target
X = df.drop("Outcome", axis=1).values
y = df["Outcome"].values


In [27]:
y

array([1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1,

In [29]:
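# ⚖️ Feature scaling (important for distance-based algorithms)
# Note: fitting the scaler on all rows before the train/test split lets
# test-set statistics influence the scaling; fitting on the training split
# only would avoid this leakage.
scaler = StandardScaler()
X = scaler.fit_transform(X)
X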

array([[ 0.63994726,  0.84832379,  0.14964075, ...,  0.20401277,
         0.46849198,  1.4259954 ],
       [-0.84488505, -1.12339636, -0.16054575, ..., -0.68442195,
        -0.36506078, -0.19067191],
       [ 1.23388019,  1.94372388, -0.26394125, ..., -1.10325546,
         0.60439732, -0.10558415],
       ...,
       [ 0.3429808 ,  0.00330087,  0.14964075, ..., -0.73518964,
        -0.68519336, -0.27575966],
       [-0.84488505,  0.1597866 , -0.47073225, ..., -0.24020459,
        -0.37110101,  1.17073215],
       [-0.84488505, -0.8730192 ,  0.04624525, ..., -0.20212881,
        -0.47378505, -0.87137393]])
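
Standardization replaces each feature value with `z = (x - mean) / std`, so features measured on large scales do not dominate the distance. The sketch below checks this by hand on a made-up two-feature matrix and fits the scaler on the training rows only; the `*_toy` names are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up): two features on very different scales
X_train_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test_toy = np.array([[2.0, 400.0]])

scaler_toy = StandardScaler()
X_train_z = scaler_toy.fit_transform(X_train_toy)  # learn mean/std from training rows
X_test_z = scaler_toy.transform(X_test_toy)        # reuse them for the test rows

# Same result by hand: z = (x - mean) / std, with the population std (ddof=0)
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)
manual = (X_train_toy - mean) / std

print(np.allclose(X_train_z, manual))
print(X_test_z)
```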

In [31]:
# 📌 Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, y_train.shape

((614, 8), (614,))

In [6]:
def euclidean_distance(x1, x2):
    return np.sqrt(np.sum((x1 - x2)**2))
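
As a quick check of the distance function, the points (0, 0) and (3, 4) should be 5 apart. The sketch below also shows a vectorized variant based on `np.linalg.norm` that computes the distance from one point to every row of a matrix at once; the sample points are made up.

```python
import numpy as np

def euclidean_distance(x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
d = euclidean_distance(a, b)  # sqrt(3^2 + 4^2)

# Vectorized form: distances from `a` to every row of `points`
points = np.array([[3.0, 4.0], [6.0, 8.0], [0.0, 1.0]])
all_d = np.linalg.norm(points - a, axis=1)

print(d, all_d)
```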


In [36]:
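class KNN:
    """Minimal K-Nearest Neighbors classifier using Euclidean distance."""

    def __init__(self, k=5):
        self.k = k  # number of neighbors that vote

    def fit(self, X, y):
        # KNN is a lazy learner: "training" only stores the data
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        return np.array([self._predict(x) for x in X])

    def _predict(self, x):
        # 📌 Calculate distances from x to all training samples
        distances = [euclidean_distance(x, x_train) for x_train in self.X_train]

        # Indices of the k closest training samples
        k_indices = np.argsort(distances)[:self.k]

        # Labels of those neighbors
        k_labels = [self.y_train[i] for i in k_indices]

        # Majority vote decides the predicted class
        most_common = Counter(k_labels).most_common(1)
        return most_common[0][0]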
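
The voting logic can be sanity-checked on a few made-up 2-D points before running it on real data. The function `knn_predict` below is a hypothetical, vectorized stand-in that mirrors what `_predict` does for a single sample; it is a sketch, not a replacement for the class.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k training points closest to x."""
    distances = np.linalg.norm(X_train - x, axis=1)
    k_indices = np.argsort(distances)[:k]
    return Counter(y_train[k_indices].tolist()).most_common(1)[0][0]

# Two made-up clusters: class 0 near the origin, class 1 near (5, 5)
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_low = knn_predict(X_toy, y_toy, np.array([0.5, 0.5]))
pred_high = knn_predict(X_toy, y_toy, np.array([5.5, 5.5]))
print(pred_low, pred_high)
```

With two classes, an odd `k` avoids tied votes.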


In [37]:
# -------------------------------------
# 📌 Train and Evaluate
# -------------------------------------
model = KNN(k=7)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)



In [38]:
# 🎯 Evaluate accuracy
accuracy = np.mean(y_pred == y_test)
print(f"✅ Accuracy of KNN on Pima Diabetes dataset: {accuracy:.2f}")


✅ Accuracy of KNN on Pima Diabetes dataset: 0.68
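
An accuracy figure is easier to judge against a trivial baseline. The classification report later in this notebook lists 99 non-diabetic and 55 diabetic patients in the 154-row test split; assuming both splits select the same rows (same `random_state` and `test_size`), a model that always predicts the majority class would score as computed below.

```python
import numpy as np

# Test-set class counts taken from the classification report in this notebook
y_test_counts = np.array([0] * 99 + [1] * 55)

# Baseline: always predict the most frequent class
majority_class = np.bincount(y_test_counts).argmax()
baseline_pred = np.full_like(y_test_counts, majority_class)
baseline_accuracy = np.mean(baseline_pred == y_test_counts)

print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```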


In [39]:
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [40]:
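# 📌 Step 4: Standardize features (very important for KNN)
# Note: X was already standardized before the split, so this second pass
# mainly re-centers the features on the training rows.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)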


In [41]:
# 📌 Step 5: Tune K value using GridSearchCV
param_grid = {"n_neighbors": range(3, 21)}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train_scaled, y_train)


# cv=5 in GridSearchCV means 5-fold cross-validation:
# - The training data (X_train_scaled, y_train) is split into 5 roughly equal parts (folds).
# - The model is trained on 4 folds and validated on the remaining fold.
# - This is repeated 5 times, each time with a different fold as the validation set.
# - The n_neighbors value with the best mean validation score is selected.
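
The fold mechanics can be made concrete with a placeholder array that has as many rows as the training split (614). This sketch uses plain `KFold` to show the split sizes; for a classifier with an integer `cv`, `GridSearchCV` uses `StratifiedKFold`, which additionally keeps the class ratio similar in each fold. The array `X_dummy` is illustrative only.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder array with as many rows as the training split (614)
X_dummy = np.zeros((614, 8))

kf = KFold(n_splits=5)
fold_sizes = [(len(train_idx), len(val_idx)) for train_idx, val_idx in kf.split(X_dummy)]

for i, (n_train, n_val) in enumerate(fold_sizes, start=1):
    print(f"Fold {i}: train on {n_train} rows, validate on {n_val} rows")
```

Because 614 is not divisible by 5, the folds are only approximately equal in size.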


In [42]:

# 📌 Step 6: Best K and evaluation
best_k = grid.best_params_['n_neighbors']
best_knn = grid.best_estimator_

# Predict and evaluate
y_pred = best_knn.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"✅ Improved KNN Accuracy: {accuracy:.2f} with k = {best_k}")

✅ Improved KNN Accuracy: 0.71 with k = 11


### Exploring the Same Dataset with Logistic Regression

In [43]:
# ✅ Step 1: Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


In [44]:
# ✅ Step 2: Load the Pima Diabetes dataset
url = "pima-indians-diabetes.csv"

column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]


In [45]:
df = pd.read_csv(url, header=None, names=column_names)
df.head()



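|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |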


In [46]:
# ✅ Step 3: Separate features (X) and target (y)
X = df.drop("Outcome", axis=1)
y = df["Outcome"]


In [47]:
# ✅ Step 4: Train-Test Split (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



In [48]:
# ✅ Step 5: Feature Scaling (Logistic Regression is sensitive to feature scale)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [49]:
# ✅ Step 6: Initialize and train Logistic Regression model
log_model = LogisticRegression()
log_model.fit(X_train_scaled, y_train)


In [50]:
# ✅ Step 7: Make predictions
y_pred = log_model.predict(X_test_scaled)
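
For a binary problem, `predict` amounts to thresholding the predicted probability of class 1 at 0.5, and that probability is the sigmoid of a linear score. The sketch below illustrates this on synthetic stand-in data from `make_classification` (not the Pima dataset); the names `X_syn`, `y_syn` and `clf` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the Pima dataset)
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)
clf = LogisticRegression().fit(X_syn, y_syn)

# Linear score z = w.x + b, squashed by the sigmoid into a probability
z = X_syn @ clf.coef_.ravel() + clf.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

proba = clf.predict_proba(X_syn)[:, 1]  # P(class 1)
labels = clf.predict(X_syn)             # class with the higher probability

print(np.allclose(manual_proba, proba))
print(np.array_equal(labels, (proba > 0.5).astype(int)))
```

In a screening setting, lowering the threshold below 0.5 trades more false positives for fewer missed cases; the notebook itself keeps the default.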

In [51]:
# ✅ Step 8: Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)


In [21]:
# ✅ Step 9: Print results
print("✅ Accuracy of Logistic Regression:", round(accuracy, 2))
print("\n📊 Confusion Matrix:\n", cm)
print("\n📋 Classification Report:\n", report)


✅ Accuracy of Logistic Regression: 0.75

📊 Confusion Matrix:
 [[79 20]
 [18 37]]

📋 Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.80      0.81        99
           1       0.65      0.67      0.66        55

    accuracy                           0.75       154
   macro avg       0.73      0.74      0.73       154
weighted avg       0.76      0.75      0.75       154
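
The figures in the report follow directly from the confusion matrix. As a worked check, the sketch below recomputes accuracy and the class-1 precision, recall and F1 from the four counts printed above (in scikit-learn's layout, rows are true classes and columns are predicted classes).

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class
cm = np.array([[79, 20],
               [18, 37]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # correct predictions / all predictions
precision_1 = tp / (tp + fp)      # of those predicted diabetic, how many are
recall_1 = tp / (tp + fn)         # of the truly diabetic, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.2f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f}")
```

The 18 false negatives are diabetic patients the model missed, which is usually the costlier error in a screening context.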



### Summary

Unlike KNN, which is non-parametric and relies on distance metrics, Logistic Regression is a parametric, linear model, which makes it highly interpretable.

On this dataset, Logistic Regression performs better, with 75% test accuracy compared with 68% for the from-scratch KNN (k = 7) and 71% for the tuned KNN (k = 11).

It's also computationally efficient, which is crucial when dealing with large datasets.

This demonstrates why Logistic Regression remains widely used in domains like healthcare, especially for binary classification problems like diabetes prediction.