<div align="left" style="background-color: #008080; padding: 20px 10px;">
<h3><b>IDEAS - Institute of Data Engineering, Analytics and Science Foundation</b></h3>
<p>Spring Internship Program 2026</p>
<hr style="width:100%;">
<h3><b>Project Title:</b> Diabetes Prediction: Classification Comparison + Metrics + Evaluation</h3>
<h4>Project Notebook</h4>

<blockquote style="border-left: 4px solid #4285F4; padding-left: 15px;">
  <strong>Created by:</strong> Rounak Biswas<br>
  <strong>Designation:</strong> Software Engineer
</blockquote>
<hr style="width:100%;">
</div>

## Problem Statement

You are tasked with building a classification model to predict whether a patient has diabetes based on diagnostic measurements.

- Use the **Pima Indian Diabetes Dataset**.
- Compare multiple classification models.
- Evaluate them using accuracy, precision, recall, F1-score, and ROC-AUC.


---


## Dataset Introduction

The dataset contains eight medical predictor variables and one target variable (`Outcome`):

- Pregnancies
- Glucose
- BloodPressure
- SkinThickness
- Insulin
- BMI
- DiabetesPedigreeFunction
- Age
- Outcome (0 = No Diabetes, 1 = Diabetes)

### Question 1: Import Libraries and Load Data (5 Marks)

Import `pandas` as `pd` and `numpy` as `np`. Load the Pima Indian Diabetes Dataset from the specified URL into a pandas DataFrame called `df`. Display the first 5 rows of the DataFrame.

**Hint:** Use `pd.read_csv(url)` to load the data. Use the `.head()` method to display the initial rows.

**Expected Output:** A table showing the first 5 rows of the diabetes dataset.

In [11]:
import pandas as pd
import numpy as np

url = "https://github.com/npradaschnor/Pima-Indians-Diabetes-Dataset/raw/refs/heads/master/diabetes.csv"
df = pd.read_csv(url)

df.head()


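|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |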


### Question 2: Data Inspection (5 Marks)

Get a quick overview of the dataset. Print the shape of the DataFrame and then check for the total number of missing (null) values across the entire DataFrame.

**Hint:** Use `df.shape` to get the dimensions and `df.isnull().sum().sum()` to count all missing values.

**Expected Output:** The shape of the DataFrame (e.g., `(768, 9)`) and the total count of null values.

In [12]:
print("Shape of dataset:", df.shape)
print("Total missing values:", df.isnull().sum().sum())

Shape of dataset: (768, 9)
Total missing values: 0


### Question 3: Descriptive Statistics (5 Marks)

Generate the descriptive statistics for the `df` DataFrame to understand the central tendency, dispersion, and shape of the dataset's distribution.

**Hint:** Use the `.describe()` method on the DataFrame.

**Expected Output:** A table showing statistical details like mean, std, min, max, etc., for each numerical column.

In [13]:
df.describe()

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |
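
The `min` row above shows 0.0 for `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, and `BMI`. A zero is not physiologically plausible for these measurements, so these zeros likely stand in for readings that were not recorded, even though `df.isnull()` reports no missing values. The sketch below counts such zeros using only the five rows shown in Question 1 and the standard-library `csv` module. The `ZERO_SUSPECT` list is an assumption made for illustration and is not part of the assignment.

```python
import csv
import io

# First five rows of the dataset, copied from the df.head() output in Question 1.
SAMPLE = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
"""

# Assumption: columns where a reading of zero is treated as "not recorded".
ZERO_SUSPECT = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Count zero entries per suspect column.
zero_counts = {
    col: sum(1 for row in rows if float(row[col]) == 0.0)
    for col in ZERO_SUSPECT
}
print(zero_counts)
```

On the full DataFrame, the equivalent count would be `(df[ZERO_SUSPECT] == 0).sum()`.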


### Question 4: Define Feature Matrix and Target Vector (10 Marks)

Prepare the data for modeling by separating it into features (`X`) and the target (`y`). `X` should contain all columns except 'Outcome', and `y` should be the 'Outcome' column. Print the shapes of both `X` and `y`.

**Hint:** Use `df.drop('Outcome', axis=1)` to create `X`. For `y`, you can select the column using `df['Outcome']`. Use the `.shape` attribute for dimensions.

**Expected Output:** The shapes of `X` and `y`, which should be `(768, 8)` and `(768,)`.

In [14]:
X = df.drop('Outcome', axis=1)
y = df['Outcome']

print("Shape of X:", X.shape)
print("Shape of y:", y.shape)


Shape of X: (768, 8)
Shape of y: (768,)


### Question 5: Split Data into Training and Testing Sets (10 Marks)

Import `train_test_split` from `sklearn.model_selection`. Split the `X` and `y` data into training and testing sets (`X_train`, `X_test`, `y_train`, `y_test`). Use `test_size=0.2`, `random_state=42`, and enable stratification on `y`.

**Hint:** The function call will be `train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)`. Stratification keeps the proportion of each `Outcome` class approximately the same in the training and testing sets.

**Expected Output:** No direct output. The variables will be ready for the next steps.

In [15]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
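
To see what stratification preserves, the class counts can be checked by hand. The sketch below assumes 268 positive cases out of 768 (the `Outcome` mean of 0.348958 from Question 3 multiplied by the row count) and a test set of 154 rows with 54 positives (the `support` column of the Question 7 report). The training counts are derived by subtraction rather than read from the notebook.

```python
# Class counts taken from outputs elsewhere in this notebook (see the lead-in).
total_rows, total_pos = 768, 268   # 0.348958 * 768 rounds to 268
test_rows, test_pos = 154, 54      # support values in the classification report

# The training split is whatever remains after the test rows are removed.
train_rows = total_rows - test_rows
train_pos = total_pos - test_pos

# Print the share of positive cases in each split.
for name, pos, n in [
    ("full", total_pos, total_rows),
    ("train", train_pos, train_rows),
    ("test", test_pos, test_rows),
]:
    print(f"{name:>5}: {pos}/{n} positive = {pos / n:.3f}")
```

All three shares should land close to 35%, which is the property `stratify=y` is meant to provide.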

### Question 6: Scale the Feature Data (10 Marks)

Import `StandardScaler` from `sklearn.preprocessing`. Create an instance of the scaler, fit it on the training data (`X_train`), and then transform both `X_train` and `X_test`. Store the results in `X_train_scaled` and `X_test_scaled`.

**Hint:** First, create the scaler `scaler = StandardScaler()`. Then, use `scaler.fit_transform(X_train)` for the training data and `scaler.transform(X_test)` for the test data.

**Expected Output:** No direct output. The scaled data will be stored in new variables.

In [16]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
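
The scaler is fitted on `X_train` only so that the test set does not influence the mean and standard deviation used for scaling. A minimal pure-Python sketch of the same idea on a single column follows. It uses the five `Glucose` values from Question 1 with a hypothetical split (first four values as training, the last as test).

```python
from statistics import fmean, pstdev

# Hypothetical split of the five Glucose values shown in df.head().
train = [148, 85, 183, 89]
test = [137]

# "Fit": learn the statistics from the training values only.
# StandardScaler uses the population standard deviation (ddof=0), hence pstdev.
mu = fmean(train)
sigma = pstdev(train)

# "Transform": apply the training statistics to both splits.
train_scaled = [(x - mu) / sigma for x in train]
test_scaled = [(x - mu) / sigma for x in test]

print("train mean/std after scaling:", fmean(train_scaled), pstdev(train_scaled))
print("test value after scaling:", test_scaled[0])
```

The scaled training values have mean 0 and standard deviation 1 by construction. The test value is simply expressed in the training set's units, so a scaled test set will generally not have exactly those statistics.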


### Question 7: Train and Evaluate a Logistic Regression Model (10 Marks)

Import `LogisticRegression`. Create an instance, train it on the **scaled** training data (`X_train_scaled`, `y_train`), make predictions on the scaled test data, and print the `classification_report`.

**Hint:** Import `classification_report` from `sklearn.metrics`. The steps are: instantiate, `.fit()`, `.predict()`, and then print the report.

**Expected Output:** A text-based classification report showing precision, recall, and f1-score for classes 0 and 1.

In [17]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_scaled, y_train)

y_pred_log = log_reg.predict(X_test_scaled)

print(classification_report(y_test, y_pred_log))


              precision    recall  f1-score   support

           0       0.76      0.82      0.79       100
           1       0.61      0.52      0.56        54

    accuracy                           0.71       154
   macro avg       0.68      0.67      0.67       154
weighted avg       0.71      0.71      0.71       154
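
Each number in the report follows from four confusion-matrix counts. The counts below are not printed by the notebook. They are inferred from the report (82 of 100 negatives and 28 of 54 positives classified correctly) and should be read as an assumption; `confusion_matrix(y_test, y_pred_log)` from `sklearn.metrics` would confirm them.

```python
# Confusion-matrix counts inferred from the report above (an assumption,
# not an output of the notebook).
tn, fp = 82, 18   # the 100 actual negatives
fn, tp = 26, 28   # the 54 actual positives

# Metrics for the positive class (1).
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Accuracy counts correct predictions over both classes.
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
```

The low recall for class 1 means that, under these counts, roughly half of the patients with diabetes in the test set are missed.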



### Question 8: Train and Evaluate a K-Nearest Neighbors (KNN) Model (15 Marks)

Import `KNeighborsClassifier`. Create an instance with `n_neighbors=5`, train it on the scaled training data, make predictions on the scaled test data, and print the `accuracy_score`.

**Hint:** Import `accuracy_score` from `sklearn.metrics`. The process is very similar to the Logistic Regression step.

**Expected Output:** A single decimal number representing the accuracy of the KNN model.

In [18]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

y_pred_knn = knn.predict(X_test_scaled)

print("KNN Accuracy:", accuracy_score(y_test, y_pred_knn))


KNN Accuracy: 0.7012987012987013
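
For intuition, the rule that `KNeighborsClassifier` applies with its default settings can be sketched in a few lines: find the `k` training points closest in Euclidean distance and take a majority vote. The toy points and the `knn_predict` helper below are hypothetical and only illustrate the idea. The library itself uses optimized neighbor search and its own tie-breaking.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=5):
    """Predict a label by majority vote among the k nearest training points."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_points)),
                   key=lambda i: dist(train_points[i], query))
    # Vote among the labels of the k closest points.
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical, already-scaled 2-D training data (not from the dataset).
points = [(-1.0, -0.8), (-0.9, -1.1), (-0.7, -0.6), (-1.2, -0.9),
          (0.9, 1.0), (1.1, 0.8), (0.8, 1.2), (1.0, 1.1)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

print(knn_predict(points, labels, query=(0.7, 0.9)))    # near the class-1 cluster
print(knn_predict(points, labels, query=(-0.8, -1.0)))  # near the class-0 cluster
```

Because the vote depends on distances, features on larger scales would dominate the result, which is why the model is trained on the scaled data.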


### Question 9: Train and Evaluate a Support Vector Machine (SVM) Model (15 Marks)

Import `SVC` (Support Vector Classifier). Create an instance with `kernel='linear'` and `random_state=42`. Train it on the scaled training data, predict on the scaled test data, and print the `f1_score`.

**Hint:** Import `f1_score` from `sklearn.metrics`. This metric is useful for evaluating models on imbalanced datasets.

**Expected Output:** A single decimal number representing the F1-score of the SVM model.

In [19]:
from sklearn.svm import SVC
from sklearn.metrics import f1_score

svm = SVC(kernel='linear', random_state=42)
svm.fit(X_train_scaled, y_train)

y_pred_svm = svm.predict(X_test_scaled)

print("SVM F1 Score:", f1_score(y_test, y_pred_svm))

SVM F1 Score: 0.5656565656565656
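
The hint's point about imbalance can be made concrete with a baseline that always predicts class 0. On this test split (100 negatives and 54 positives, per the `support` column in Question 7), such a baseline reaches a seemingly reasonable accuracy but an F1 of zero for the positive class. A minimal sketch follows; the helper `f1_from_counts` is hypothetical and defines F1 as 0 when there are no true positives.

```python
def f1_from_counts(tp, fp, fn):
    """F1 for the positive class; defined here as 0.0 when tp is zero."""
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)

n_neg, n_pos = 100, 54  # test-set class sizes from the classification report

# Baseline: predict "no diabetes" for everyone.
tp, fp, fn, tn = 0, 0, n_pos, n_neg
baseline_accuracy = (tp + tn) / (n_neg + n_pos)
baseline_f1 = f1_from_counts(tp, fp, fn)

print(f"baseline accuracy={baseline_accuracy:.3f}, F1={baseline_f1:.3f}")
```

Against that baseline, the SVM's F1 of about 0.57 indicates that the model identifies a meaningful share of the positive cases, which accuracy alone would not show.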


### Question 10: Model Comparison with ROC-AUC Score (15 Marks)

Calculate the `roc_auc_score` for all three models (Logistic Regression, KNN, and SVM) using their predictions on the scaled test data. Store the results in a dictionary called `model_scores` where keys are model names ('LogisticRegression', 'KNN', 'SVM') and values are their scores. Print the dictionary. Which model performed best?

**Hint:** Import `roc_auc_score` from `sklearn.metrics`. This score measures a model's ability to distinguish between classes. You will need the predictions you generated in the previous steps.

**Expected Output:** A dictionary mapping each model name to its ROC-AUC score, followed by a brief conclusion naming the best-performing model.

In [20]:
from sklearn.metrics import roc_auc_score

model_scores = {
    "LogisticRegression": roc_auc_score(y_test, y_pred_log),
    "KNN": roc_auc_score(y_test, y_pred_knn),
    "SVM": roc_auc_score(y_test, y_pred_svm)
}

print("ROC-AUC Scores:", model_scores)

best_model = max(model_scores, key=model_scores.get)
print("Best model based on ROC-AUC:", best_model)


ROC-AUC Scores: {'LogisticRegression': np.float64(0.6692592592592593), 'KNN': np.float64(0.6592592592592593), 'SVM': np.float64(0.6742592592592592)}
Best model based on ROC-AUC: SVM
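
A note on interpreting these numbers: because `roc_auc_score` is given hard 0/1 predictions here, the ROC curve has a single interior point and the score reduces to the average of sensitivity and specificity (balanced accuracy). The sketch below checks this with a pairwise definition of AUC. The confusion counts are inferred from the metrics printed in Questions 7 to 9 and are an assumption; `pairwise_auc` and `hard_labels` are hypothetical helpers.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def hard_labels(tn, fp, fn, tp):
    """Rebuild y_true and 0/1 predictions from confusion-matrix counts."""
    y_true = [0] * (tn + fp) + [1] * (fn + tp)
    y_pred = [0] * tn + [1] * fp + [0] * fn + [1] * tp
    return y_true, y_pred

# Counts (tn, fp, fn, tp) inferred from the printed metrics; an assumption.
inferred = {
    "LogisticRegression": (82, 18, 26, 28),
    "KNN": (80, 20, 26, 28),
    "SVM": (83, 17, 26, 28),
}

for name, (tn, fp, fn, tp) in inferred.items():
    y_true, y_pred = hard_labels(tn, fp, fn, tp)
    auc = pairwise_auc(y_true, y_pred)
    balanced = 0.5 * (tp / (tp + fn) + tn / (tn + fp))
    print(f"{name}: AUC={auc:.4f}, balanced accuracy={balanced:.4f}")
```

To score how well each model ranks patients rather than how it performs at one threshold, `roc_auc_score` is usually given continuous scores, for example `log_reg.predict_proba(X_test_scaled)[:, 1]`, `knn.predict_proba(X_test_scaled)[:, 1]`, or `svm.decision_function(X_test_scaled)`. Doing so would change the values and could change the ranking reported above, where the three models differ by only one or two test-set predictions.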
