Machine Learning - Assignment 2
BITS ID: 2025AA05952

Name: SONU KUMAR BHAGAT

Email: 2025aa05952@wilp.bits-pilani.ac.in

This notebook compares the following six models for a classification task:

Logistic Regression
Decision Tree Classifier
K-Nearest Neighbor Classifier
Naive Bayes Classifier
Random Forest (Ensemble)
XGBoost (Ensemble)

Dataset:
"Early Stage Diabetes Risk Prediction" dataset from the UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/529/early+stage+diabetes+risk+prediction+dataset

In [1]:
%pip install ucimlrepo



In [2]:
import pandas as pd
import numpy as np
import pickle
import os
from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from imblearn.over_sampling import SMOTE

Fetch Dataset

In [3]:
print("Fetching dataset from UCI...")
dataset = fetch_ucirepo(id=529)
X_raw = dataset.data.features.copy()
y_raw = dataset.data.targets.copy()

Fetching dataset from UCI...


Data Exploration

In [4]:
print("--- Dataset Information ---")

# 1. Shape and Size
print(f"Dataset Shape: {X_raw.shape}")
print(f"Total Instances: {len(X_raw)}")
print(f"Total Features: {X_raw.shape[1]}")

# 2. Column Names and Types
print("\n--- Column List and Types ---")
print(X_raw.dtypes)

# 3. Check for Null Values
print("\n--- Null Value Check ---")
null_counts = X_raw.isnull().sum()
if null_counts.sum() == 0:
    print("No missing values found in the dataset.")
else:
    print(null_counts[null_counts > 0])


--- Dataset Information ---
Dataset Shape: (520, 16)
Total Instances: 520
Total Features: 16

--- Column List and Types ---
age                    int64
gender                object
polyuria              object
polydipsia            object
sudden_weight_loss    object
weakness              object
polyphagia            object
genital_thrush        object
visual_blurring       object
itching               object
irritability          object
delayed_healing       object
partial_paresis       object
muscle_stiffness      object
alopecia              object
obesity               object
dtype: object

--- Null Value Check ---
No missing values found in the dataset.
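
The exploration above covers only the feature matrix. Since a later cell applies SMOTE, it may also help to look at the class balance of the target before resampling. A minimal sketch, assuming a target column named `class` with `Positive`/`Negative` labels; the small Series below is made-up stand-in data, not the real UCI counts:

```python
import pandas as pd

# Hypothetical stand-in for y_raw; the real target comes from fetch_ucirepo(id=529).
y_demo = pd.Series(["Positive"] * 6 + ["Negative"] * 4, name="class")

# Absolute counts and relative shares per class
class_counts = y_demo.value_counts()
class_shares = y_demo.value_counts(normalize=True)

print(class_counts)
print(class_shares.round(2))

# Ratio of the largest to the smallest class; 1.0 means perfectly balanced
imbalance_ratio = class_counts.max() / class_counts.min()
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

In the notebook itself the same two `value_counts` calls could be run on the single column of `y_raw`.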


Pre-processing

In [5]:
os.makedirs('model', exist_ok=True)
encoders = {}

# Process Features (Converting Yes/No, Male/Female to 0/1)
X = X_raw.copy()
for col in X.columns:
    if X[col].dtype == 'object':
        le = LabelEncoder()
        X[col] = le.fit_transform(X[col])
        encoders[col] = le

# Process Target (Positive/Negative to 1/0)
target_le = LabelEncoder()
y = target_le.fit_transform(y_raw.values.ravel())
encoders['target'] = target_le

# Scale age (the only numeric feature; matters for kNN and Logistic Regression)
scaler = StandardScaler()
X['age'] = scaler.fit_transform(X[['age']])

# Save all encoders and scaler for the Streamlit App to use
with open('model/encoders.pkl', 'wb') as f:
    pickle.dump(encoders, f)
with open('model/scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

# Handle class imbalance with SMOTE, then split into train and test sets
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.2, random_state=42)
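
The cell above fits the scaler and runs SMOTE on the full dataset before splitting, so the test rows influence both the scaling statistics and the synthetic samples. That can make the reported scores optimistic. One possible alternative ordering is sketched below: split first, then fit preprocessing on the training part only. The data is synthetic and the names `X_demo`, `y_demo`, and `scaler_demo` are hypothetical; this is not the ordering the notebook used to produce its results.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: one numeric "age" column plus one encoded Yes/No column
rng = np.random.default_rng(42)
X_demo = np.column_stack([
    rng.integers(20, 80, size=100).astype(float),  # age
    rng.integers(0, 2, size=100).astype(float),    # encoded symptom flag
])
y_demo = np.array([1] * 60 + [0] * 40)

# 1. Split first, stratifying so both parts keep the 60/40 class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# 2. Fit the scaler on training rows only, then reuse it on the test rows
scaler_demo = StandardScaler()
X_tr[:, [0]] = scaler_demo.fit_transform(X_tr[:, [0]])
X_te[:, [0]] = scaler_demo.transform(X_te[:, [0]])

# 3. Any resampling (for example SMOTE's fit_resample) would be applied here,
#    to X_tr and y_tr only, so the test set contains real rows exclusively.

print(X_tr.shape, X_te.shape)
print(f"Train age mean after scaling: {X_tr[:, 0].mean():.3f}")
```

With this ordering the test set keeps the original class ratio, so precision and recall are measured on data that resembles what the deployed model would see.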

Model dictionary

In [6]:
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "kNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(),
    "XGBoost": XGBClassifier()
}

Train and Evaluate

In [7]:
os.makedirs('model', exist_ok=True)
results = []

# Metrics used in the evaluation loop below
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, precision_score, recall_score, matthews_corrcoef

print("Training models...")
for name, model in models.items():
    model.fit(X_train, y_train)
    with open(f'model/{name.replace(" ", "_").lower()}.pkl', 'wb') as f:
        pickle.dump(model, f)

    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1] # Get probabilities for the positive class

    metrics = {
        "ML Model Name": name,
        "Accuracy": accuracy_score(y_test, y_pred),
        "AUC": roc_auc_score(y_test, y_prob),
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "F1": f1_score(y_test, y_pred),
        "MCC": matthews_corrcoef(y_test, y_pred)
    }
    results.append(metrics)

Training models...
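
The loop above derives each pickle filename from the model's display name. The sketch below mirrors that naming rule and shows a save/load round trip of the kind the Streamlit app would need. The helper `model_filename` and the placeholder object are hypothetical; the notebook pickles fitted estimators into the `model` folder instead.

```python
import os
import pickle
import tempfile


def model_filename(name: str) -> str:
    """Mirror the naming rule used in the training loop."""
    return f"{name.replace(' ', '_').lower()}.pkl"


# Hypothetical stand-in object; the notebook pickles fitted estimators instead.
stand_in_model = {"kind": "placeholder", "params": {"max_iter": 1000}}

with tempfile.TemporaryDirectory() as model_dir:
    path = os.path.join(model_dir, model_filename("Logistic Regression"))

    # Save, as the training loop does
    with open(path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Load, as a consumer of the saved model would
    with open(path, "rb") as f:
        restored = pickle.load(f)

saved_name = os.path.basename(path)
print(saved_name)
print(restored == stand_in_model)
```

Pickled estimators are generally only safe to load from trusted sources and with library versions compatible with those used for training.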


Display comparison table

In [8]:
df_results = pd.DataFrame(results)
print(df_results.to_markdown(index=False))

| ML Model Name       |   Accuracy |      AUC |   Precision |   Recall |       F1 |      MCC |
|:--------------------|-----------:|---------:|------------:|---------:|---------:|---------:|
| Logistic Regression |   0.953125 | 0.995357 |    0.983871 | 0.924242 | 0.953125 | 0.908113 |
| Decision Tree       |   0.984375 | 0.98436  |    0.984848 | 0.984848 | 0.984848 | 0.968719 |
| kNN                 |   0.914062 | 0.97544  |    1        | 0.833333 | 0.909091 | 0.841286 |
| Naive Bayes         |   0.898438 | 0.974096 |    0.895522 | 0.909091 | 0.902256 | 0.796675 |
| Random Forest       |   1        | 1        |    1        | 1        | 1        | 1        |
| XGBoost             |   0.96875  | 0.998534 |    0.984375 | 0.954545 | 0.969231 | 0.937958 |
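
All columns in the table except AUC follow from the four confusion-matrix counts, which makes individual rows easy to sanity-check by hand. The sketch below computes them in plain Python. The counts are illustrative: they were chosen to be consistent with the kNN row (perfect precision, recall of about 0.83), not read from the notebook, and `binary_metrics` is a hypothetical helper rather than a scikit-learn function.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute threshold-based metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (both classes are present
    and at least one positive prediction was made).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


# Illustrative counts, assumed rather than taken from the notebook's test set
knn_like = binary_metrics(tp=55, fp=0, tn=62, fn=11)
for metric, value in knn_like.items():
    print(f"{metric:>9}: {value:.6f}")
```

AUC is left out because it depends on the ranking of predicted probabilities, not on a single set of thresholded predictions.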


Save test_sample.csv for Streamlit

In [9]:
df_full_raw = pd.concat([X_raw, y_raw], axis=1)
target_col = y_raw.columns[0]
df_pos = df_full_raw[df_full_raw[target_col] == 'Positive'].head(25)
df_neg = df_full_raw[df_full_raw[target_col] == 'Negative'].head(25)
raw_test_sample = pd.concat([df_pos, df_neg]).sample(frac=1, random_state=42)
raw_test_sample.to_csv('test_sample.csv', index=False)

print("\n--- Process Complete ---")
print("✅ 6 Models saved in /model/ folder")
print("✅ Preprocessing artifacts (scaler/encoders) saved")
print("✅ Balanced 'test_sample.csv' created for app testing")


--- Process Complete ---
✅ 6 Models saved in /model/ folder
✅ Preprocessing artifacts (scaler/encoders) saved
✅ Balanced 'test_sample.csv' created for app testing
