<div align="left" style="background-color: #008080; padding: 20px 10px;">
<h3><b>IDEAS - Institute of Data Engineering, Analytics and Science Foundation</b></h3>
<p>Spring Internship Program 2026</p>
<hr style="width:100%;">
<h3><b>Project Title:</b> Diabetes Prediction: Classification Comparison + Metrics + Evaluation</h3>
<h4>Project Notebook</h4>

<blockquote style="border-left: 4px solid #4285F4; padding-left: 15px;">
  <strong>Created by:</strong> Rounak Biswas<br>
  <strong>Designation:</strong> Software Engineer
</blockquote>
<hr style="width:100%;">
</div>

## Problem Statement

You are tasked with building a classification model to predict whether a patient has diabetes based on diagnostic measurements.

- Use the **Pima Indian Diabetes Dataset**.
- Compare multiple classification models.
- Evaluate them using accuracy, precision, recall, F1-score, and ROC-AUC.


---


## Dataset Introduction

The dataset contains medical predictor variables and one target variable (`Outcome`).

- Pregnancies
- Glucose
- BloodPressure
- SkinThickness
- Insulin
- BMI
- DiabetesPedigreeFunction
- Age
- Outcome (0 = No Diabetes, 1 = Diabetes)

### Question 1: Import Libraries and Load Data (5 Marks)

Import `pandas` as `pd` and `numpy` as `np`. Load the Pima Indian Diabetes Dataset from the specified URL into a pandas DataFrame called `df`. Display the first 5 rows of the DataFrame.

**Hint:** Use `pd.read_csv(url)` to load the data. Use the `.head()` method to display the initial rows.

**Expected Output:** A table showing the first 5 rows of the diabetes dataset.

In [None]:
# Write your answer here

### Question 2: Data Inspection (5 Marks)

Get a quick overview of the dataset. Print the shape of the DataFrame and then check for the total number of missing (null) values across the entire DataFrame.

**Hint:** Use `df.shape` to get the dimensions and `df.isnull().sum().sum()` to count all missing values.

**Expected Output:** The shape of the DataFrame (e.g., `(768, 9)`) and the total count of null values.
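
The two calls in the hint can be tried on a small example first. The sketch below uses a hypothetical three-row frame (`toy_df`, not the assignment data) with two deliberately missing entries, to show how the per-column counts collapse into a single total.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the assignment data) with two missing entries.
toy_df = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0],
    "BMI": [33.6, 26.6, np.nan],
    "Outcome": [1, 0, 1],
})

print(toy_df.shape)                    # (rows, columns)

per_column = toy_df.isnull().sum()     # missing values counted per column
total_missing = int(per_column.sum())  # second .sum() collapses to one number

print(per_column)
print(total_missing)
```

Note that `isnull()` only flags true missing markers such as `NaN`; a placeholder value stored as an ordinary number is not counted.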

In [None]:
# Write your answer here

### Question 3: Descriptive Statistics (5 Marks)

Generate the descriptive statistics for the `df` DataFrame to understand the central tendency, dispersion, and shape of the dataset's distribution.

**Hint:** Use the `.describe()` method on the DataFrame.

**Expected Output:** A table showing statistical details like mean, std, min, max, etc., for each numerical column.

In [None]:
# Write your answer here

### Question 4: Define Feature Matrix and Target Vector (10 Marks)

Prepare the data for modeling by separating it into features (`X`) and the target (`y`). `X` should contain all columns except 'Outcome', and `y` should be the 'Outcome' column. Print the shapes of both `X` and `y`.

**Hint:** Use `df.drop('Outcome', axis=1)` to create `X`. For `y`, you can select the column using `df['Outcome']`. Use the `.shape` attribute for dimensions.

**Expected Output:** The shapes of `X` and `y`, which should be `(768, 8)` and `(768,)`.
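
As a minimal sketch of the same split on a hypothetical four-row frame (`toy_df`, `X_toy`, and `y_toy` are illustrative names, not part of the assignment), note that `drop` returns a new DataFrame and leaves the original untouched:

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy_df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89],
    "BMI": [33.6, 26.6, 23.3, 28.1],
    "Outcome": [1, 0, 1, 0],
})

X_toy = toy_df.drop("Outcome", axis=1)  # 2-D feature matrix (DataFrame)
y_toy = toy_df["Outcome"]               # 1-D target vector (Series)

print(X_toy.shape, y_toy.shape)
```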

In [None]:
# Write your answer here

### Question 5: Split Data into Training and Testing Sets (10 Marks)

Import `train_test_split` from `sklearn.model_selection`. Split the `X` and `y` data into training and testing sets (`X_train`, `X_test`, `y_train`, `y_test`). Use `test_size=0.2`, `random_state=42`, and enable stratification on `y`.

**Hint:** The function call will be `train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)`. Stratification keeps the class proportions of `y` roughly the same in the training and testing sets.

**Expected Output:** No direct output. The variables will be ready for the next steps.
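
To see what stratification does, the sketch below splits a hypothetical set of 100 labels (65 negatives, 35 positives) and compares the positive rate in each part. The arrays and the names `X_toy`, `y_toy`, `X_tr`, `X_te`, `y_tr`, and `y_te` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 65 negatives and 35 positives.
y_toy = np.array([0] * 65 + [1] * 35)
X_toy = np.arange(100).reshape(-1, 1)  # placeholder single feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify=y_toy, both parts keep the original 35% positive rate.
print(len(y_tr), y_tr.mean())
print(len(y_te), y_te.mean())
```

Without `stratify`, the positive rate in a small test set can drift away from the rate in the full data, which makes metrics harder to compare.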

In [None]:
# Write your answer here

### Question 6: Scale the Feature Data (10 Marks)

Import `StandardScaler` from `sklearn.preprocessing`. Create an instance of the scaler, fit it on the training data (`X_train`), and then transform both `X_train` and `X_test`. Store the results in `X_train_scaled` and `X_test_scaled`.

**Hint:** First, create the scaler with `scaler = StandardScaler()`. Then, use `scaler.fit_transform(X_train)` for the training data and `scaler.transform(X_test)` for the test data, so that the test set is scaled with statistics learned from the training set only.

**Expected Output:** No direct output. The scaled data will be stored in new variables.
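
The fit-then-transform pattern can be checked on a few hand-written numbers. This is a minimal sketch with hypothetical arrays (`X_tr`, `X_te`); it shows that the test row is scaled with the mean and standard deviation learned from the training rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: three training rows, one test row.
X_tr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_te = np.array([[2.0, 40.0]])

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learns mean/std from training rows
X_te_scaled = scaler.transform(X_te)      # reuses the training mean/std

print(scaler.mean_)               # per-column training means
print(X_tr_scaled.mean(axis=0))   # approximately 0 for each column
print(X_te_scaled)                # test row expressed in training units
```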

In [None]:
# Write your answer here

### Question 7: Train and Evaluate a Logistic Regression Model (10 Marks)

Import `LogisticRegression` from `sklearn.linear_model`. Create an instance, train it on the **scaled** training data (`X_train_scaled`, `y_train`), make predictions on the scaled test data, and print the `classification_report`.

**Hint:** Import `classification_report` from `sklearn.metrics`. The steps are: instantiate, `.fit()`, `.predict()`, and then print the report.

**Expected Output:** A text-based classification report showing precision, recall, and f1-score for classes 0 and 1.
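
The instantiate, fit, predict, report sequence can be sketched end to end on synthetic data. Everything below is an assumption for illustration: the data comes from `make_classification`, and the names `X_syn`, `y_syn`, `X_tr_s`, `X_te_s`, and `clf` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem with 8 features (not the diabetes data).
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0, stratify=y_syn
)

scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)

clf = LogisticRegression()
clf.fit(X_tr_s, y_tr)          # learn coefficients from the training split
y_pred = clf.predict(X_te_s)   # hard 0/1 predictions for the test split

report = classification_report(y_te, y_pred)
print(report)
```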

In [None]:
# Write your answer here

### Question 8: Train and Evaluate a K-Nearest Neighbors (KNN) Model (15 Marks)

Import `KNeighborsClassifier` from `sklearn.neighbors`. Create an instance with `n_neighbors=5`, train it on the scaled training data, make predictions on the scaled test data, and print the `accuracy_score`.

**Hint:** Import `accuracy_score` from `sklearn.metrics`. The process is very similar to the Logistic Regression step.

**Expected Output:** A single decimal number representing the accuracy of the KNN model.
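
To make the voting behind KNN concrete, here is a small sketch on hypothetical one-dimensional points (unscaled, chosen by hand). With `n_neighbors=5`, each prediction is the majority label among the five closest training points, and `accuracy_score` is the fraction of predictions that match the true labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two well-separated clusters on a line.
X_tr = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_tr = np.array([0, 0, 0, 1, 1, 1])

X_te = np.array([[1.5], [5.0], [10.5]])
y_te = np.array([0, 1, 1])  # the middle point is labelled 1 on purpose

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_tr, y_tr)
y_pred = knn.predict(X_te)

# The point at 5.0 has three class-0 and two class-1 neighbours, so it is
# predicted as 0 and counted as an error.
acc = accuracy_score(y_te, y_pred)
print(y_pred, acc)
```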

In [None]:
# Write your answer here

### Question 9: Train and Evaluate a Support Vector Machine (SVM) Model (15 Marks)

Import `SVC` (Support Vector Classifier) from `sklearn.svm`. Create an instance with `kernel='linear'` and `random_state=42`. Train it on the scaled training data, predict on the scaled test data, and print the `f1_score`.

**Hint:** Import `f1_score` from `sklearn.metrics`. This metric is useful for evaluating models on imbalanced datasets.

**Expected Output:** A single decimal number representing the F1-score of the SVM model.
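
The remark about imbalanced data can be illustrated with hand-written labels. In this sketch (the arrays are hypothetical), a classifier that misses two of three positives still reaches a moderate accuracy, while the F1-score, the harmonic mean of precision and recall for the positive class, is noticeably lower.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 3 positives, 5 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

# Counts for the positive class: TP = 1, FN = 2, FP = 1, TN = 4.
acc = accuracy_score(y_true, y_pred)  # (TP + TN) / total = 5 / 8
f1 = f1_score(y_true, y_pred)         # 2*TP / (2*TP + FP + FN) = 2 / 5

print(acc, f1)
```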

In [None]:
# Write your answer here

### Question 10: Model Comparison with ROC-AUC Score (15 Marks)

Calculate the `roc_auc_score` for all three models (Logistic Regression, KNN, and SVM) using their predictions on the scaled test data. Store the results in a dictionary called `model_scores` where keys are model names ('LogisticRegression', 'KNN', 'SVM') and values are their scores. Print the dictionary. Which model performed best?

**Hint:** Import `roc_auc_score` from `sklearn.metrics`. This score measures a model's ability to distinguish between classes. You will need the predictions you generated in the previous steps.

**Expected Output:** A dictionary showing the name of each model and its corresponding ROC-AUC score, followed by a brief conclusion stating which model performed best.
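
The dictionary pattern can be sketched with hand-written predictions. The labels, the model names `ModelA` and `ModelB`, and the `scores` list are all hypothetical. The last lines are an aside rather than part of the task: `roc_auc_score` also accepts continuous scores (for example, predicted probabilities), and the value can differ from the one computed on hard 0/1 predictions.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and hard predictions from two made-up models.
y_true = [0, 0, 0, 1, 1, 1]
predictions = {
    "ModelA": [0, 0, 0, 0, 1, 1],
    "ModelB": [0, 1, 1, 0, 1, 1],
}

# One ROC-AUC value per model, keyed by model name.
model_scores = {
    name: roc_auc_score(y_true, y_pred) for name, y_pred in predictions.items()
}
print(model_scores)

# Name of the entry with the highest score.
best = max(model_scores, key=model_scores.get)
print(best)

# Aside: the same metric computed from hypothetical continuous scores.
scores = [0.1, 0.2, 0.45, 0.3, 0.6, 0.9]
auc_from_scores = roc_auc_score(y_true, scores)
print(auc_from_scores)
```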

In [None]:
# Write your answer here