Machine Learning - Assignment 2
BITS ID: 2025AA05952

Name: SONU KUMAR BHAGAT

Email: 2025aa05952@wilp.bits-pilani.ac.in

This notebook compares the following six models for a classification task:

Logistic Regression
Decision Tree Classifier
K-Nearest Neighbor Classifier
Naive Bayes Classifier
Random Forest (Ensemble)
XGBoost (Ensemble)
Dataset:
"Early Stage Diabetes Risk Prediction" dataset from the UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/529/early+stage+diabetes+risk+prediction+dataset

In [1]:
pip install ucimlrepo



In [2]:
import numpy as np
import pickle
import os
from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
import sklearn.naive_bayes as class_nb  # module alias; used later as class_nb.GaussianNB()
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, matthews_corrcoef

Fetch Dataset

In [3]:
dataset = fetch_ucirepo(id=529)
X = dataset.data.features.copy()
y = dataset.data.targets.copy()

Data Exploration

In [4]:
print("--- Dataset Information ---")

# 1. Shape and Size
print(f"Dataset Shape: {X.shape}")
print(f"Total Instances: {len(X)}")
print(f"Total Features: {X.shape[1]}")

# 2. Column Names and Types
print("\n--- Column List and Types ---")
print(X.dtypes)

# 3. Check for Null Values
print("\n--- Null Value Check ---")
null_counts = X.isnull().sum()
if null_counts.sum() == 0:
    print("No missing values found in the dataset.")
else:
    print(null_counts[null_counts > 0])


--- Dataset Information ---
Dataset Shape: (520, 16)
Total Instances: 520
Total Features: 16

--- Column List and Types ---
age                    int64
gender                object
polyuria              object
polydipsia            object
sudden_weight_loss    object
weakness              object
polyphagia            object
genital_thrush        object
visual_blurring       object
itching               object
irritability          object
delayed_healing       object
partial_paresis       object
muscle_stiffness      object
alopecia              object
obesity               object
dtype: object

--- Null Value Check ---
No missing values found in the dataset.
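
The exploration above covers only the feature columns. A class-balance check on the target is a natural companion, since it is what motivates any later resampling step. The sketch below is a minimal illustration under stated assumptions: `toy_target` and its values are hypothetical stand-ins, not the real labels held in `y`.

```python
import pandas as pd

# Hypothetical stand-in for the target column; the real labels live in `y`.
toy_target = pd.Series(
    ["Positive", "Negative", "Positive", "Positive", "Negative"], name="class"
)

# Absolute count and relative share of each class.
counts = toy_target.value_counts()
shares = toy_target.value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

On the real data the same two calls could be applied to the target column returned by `fetch_ucirepo`.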


Pre-processing

In [5]:
from sklearn.preprocessing import StandardScaler  # LabelEncoder is already imported above
from imblearn.over_sampling import SMOTE  # oversamples the minority class

# --- ENHANCED PREPROCESSING ---

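# 1. Feature Encoding
# The categorical columns hold 'Yes'/'No' and 'Male'/'Female' strings; convert them to 0/1.
# LabelEncoder assigns codes in sorted order, so 'No' -> 0, 'Yes' -> 1 and 'Female' -> 0, 'Male' -> 1.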
le = LabelEncoder()
for col in X.columns:
    if X[col].dtype == 'object':
        X[col] = le.fit_transform(X[col])

# Encode the Target (Positive/Negative)
y = le.fit_transform(y.values.ravel())

# 2. Feature Scaling
# 'age' spans roughly 16-90, while every other feature is 0/1.
# Standardizing it keeps age from dominating kNN distances and helps Logistic Regression converge.
scaler = StandardScaler()
X['age'] = scaler.fit_transform(X[['age']])

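# 3. Handling Class Imbalance (SMOTE)
# When one class (Positive/Negative) is much smaller, models tend to favor the majority class.
# SMOTE adds synthetic minority-class samples until both classes are the same size.
# Note: resampling runs before the split here, so the test set also contains synthetic rows.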
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

print(f"Original dataset shape: {X.shape}")
print(f"Resampled dataset shape: {X_resampled.shape}")

# Split the resampled data
X_train, X_test, y_train, y_test = train_test_split(
    X_resampled, y_resampled, test_size=0.2, random_state=42
)

Original dataset shape: (520, 16)
Resampled dataset shape: (640, 16)
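
The cell above fits the scaler and SMOTE on the full dataset and splits afterwards, so statistics and synthetic samples are influenced by rows that later land in the test set. A possible alternative ordering is sketched below: split first, then fit preprocessing on the training part only. This is a sketch under stated assumptions, not the notebook's method. The data and every variable name in it are hypothetical, and the resampling step is left out so the example depends only on scikit-learn; in this ordering, resampling would be applied to the training part alone.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical data: a single numeric "age" column and a binary label.
age = rng.integers(16, 91, size=(100, 1)).astype(float)
labels = rng.integers(0, 2, size=100)

# Split first so the test rows stay unseen by every fitted step.
age_train, age_test, labels_train, labels_test = train_test_split(
    age, labels, test_size=0.2, random_state=42, stratify=labels
)

# Fit the scaler on the training part only, then reuse it on the test part.
age_scaler = StandardScaler().fit(age_train)
age_train_scaled = age_scaler.transform(age_train)
age_test_scaled = age_scaler.transform(age_test)

print(age_train_scaled.shape, age_test_scaled.shape)
```

With this ordering the test metrics describe rows the pipeline has never seen, at the cost of a smaller, unbalanced test set.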


Model dictionary

In [6]:
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "kNN": KNeighborsClassifier(),
    "Naive Bayes": class_nb.GaussianNB(),
    "Random Forest": RandomForestClassifier(),
    "XGBoost": XGBClassifier()
}

Train and Evaluate

In [7]:
os.makedirs('model', exist_ok=True)
results = []

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1] if hasattr(model, "predict_proba") else y_pred

    metrics = {
        "ML Model Name": name,
        "Accuracy": accuracy_score(y_test, y_pred),
        "AUC": roc_auc_score(y_test, y_prob),
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "F1": f1_score(y_test, y_pred),
        "MCC": matthews_corrcoef(y_test, y_pred)
    }
    results.append(metrics)
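
All of the threshold-based metrics collected in the loop above (accuracy, precision, recall, F1, MCC) derive from the four cells of the confusion matrix. The sketch below recomputes three of them by hand and compares them with scikit-learn. The labels `y_true_toy` and `y_pred_toy` are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math

from sklearn.metrics import (
    confusion_matrix,
    matthews_corrcoef,
    precision_score,
    recall_score,
)

# Hypothetical labels: one false negative and one false positive.
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 0, 0, 0, 1]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)
mcc_manual = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(precision_manual, precision_score(y_true_toy, y_pred_toy))
print(recall_manual, recall_score(y_true_toy, y_pred_toy))
print(mcc_manual, matthews_corrcoef(y_true_toy, y_pred_toy))
```

Unlike accuracy, MCC uses all four cells, which is why it is often reported alongside F1 for binary problems.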

Save models

In [8]:
import pickle
import os

# Create the output folder for the serialized models (no-op if it already exists)
model_dir = 'model'
os.makedirs(model_dir, exist_ok=True)

# `models` holds the estimators fitted in the training cell above
for name, model in models.items():
    # Create a clean filename (e.g., "Logistic Regression" -> "logistic_regression.pkl")
    filename = f"{name.replace(' ', '_').lower()}.pkl"
    file_path = os.path.join(model_dir, filename)

    # Save the model
    with open(file_path, 'wb') as f:
        pickle.dump(model, f)

    print(f"✅ Saved: {file_path}")

✅ Saved: model/logistic_regression.pkl
✅ Saved: model/decision_tree.pkl
✅ Saved: model/knn.pkl
✅ Saved: model/naive_bayes.pkl
✅ Saved: model/random_forest.pkl
✅ Saved: model/xgboost.pkl
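
A saved file can later be loaded back with `pickle.load`, for example from a separate app. The sketch below shows the round trip under stated assumptions: it trains a small hypothetical model on toy data and writes to a temporary directory so that it runs on its own, and only the file name mirrors the naming scheme used above. Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data and model, standing in for a trained notebook model.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
toy_model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "logistic_regression.pkl")

    # Serialize the fitted estimator.
    with open(toy_path, "wb") as fh:
        pickle.dump(toy_model, fh)

    # Load it back as a new object.
    with open(toy_path, "rb") as fh:
        restored_model = pickle.load(fh)

# The restored estimator should reproduce the original predictions.
same_predictions = bool(
    (restored_model.predict(X_toy) == toy_model.predict(X_toy)).all()
)
print(same_predictions)
```

Any preprocessing fitted during training, such as the age scaler, would need to be saved and reapplied in the same way before calling `predict` on raw inputs.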


Display comparison table

In [9]:
import pandas as pd
df_results = pd.DataFrame(results)
print(df_results.to_markdown(index=False))

| ML Model Name       |   Accuracy |      AUC |   Precision |   Recall |       F1 |      MCC |
|:--------------------|-----------:|---------:|------------:|---------:|---------:|---------:|
| Logistic Regression |   0.953125 | 0.995357 |    0.983871 | 0.924242 | 0.953125 | 0.908113 |
| Decision Tree       |   0.984375 | 0.98436  |    0.984848 | 0.984848 | 0.984848 | 0.968719 |
| kNN                 |   0.914062 | 0.97544  |    1        | 0.833333 | 0.909091 | 0.841286 |
| Naive Bayes         |   0.898438 | 0.974096 |    0.895522 | 0.909091 | 0.902256 | 0.796675 |
| Random Forest       |   1        | 1        |    1        | 1        | 1        | 1        |
| XGBoost             |   0.96875  | 0.998534 |    0.984375 | 0.954545 | 0.969231 | 0.937958 |
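
To read the table as a ranking, the results can be sorted by a single metric. The sketch below does this for MCC. The frame `toy_results` is a hypothetical name and holds only the F1 and MCC columns, with values copied from the table above; in the notebook the same call could be applied to `df_results` directly.

```python
import pandas as pd

# F1 and MCC values copied from the comparison table above.
toy_results = pd.DataFrame(
    {
        "ML Model Name": [
            "Logistic Regression",
            "Decision Tree",
            "kNN",
            "Naive Bayes",
            "Random Forest",
            "XGBoost",
        ],
        "F1": [0.953125, 0.984848, 0.909091, 0.902256, 1.0, 0.969231],
        "MCC": [0.908113, 0.968719, 0.841286, 0.796675, 1.0, 0.937958],
    }
)

# Best model first, ranked by MCC.
ranked = toy_results.sort_values("MCC", ascending=False).reset_index(drop=True)
print(ranked)
```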


Save sample test data (test_sample.csv) for Streamlit

In [10]:
test_data = X_test.copy()
test_data['target'] = y_test
test_data.to_csv('test_sample.csv', index=False)
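
A consumer of this file would need to separate the `target` column from the features again. The sketch below shows one way to do that. It reads from an in-memory string instead of the real file, and the two rows and three feature columns in `csv_text` are hypothetical values in the same layout, not actual records.

```python
import io

import pandas as pd

# Hypothetical two-row excerpt in the layout written above
# (already encoded and scaled features, plus a "target" column).
csv_text = "age,gender,polyuria,target\n-0.5,1,0,0\n1.2,0,1,1\n"

loaded = pd.read_csv(io.StringIO(csv_text))

# Separate the features from the label column.
loaded_features = loaded.drop(columns=["target"])
loaded_labels = loaded["target"]

print(loaded_features.shape, loaded_labels.tolist())
```

Because the saved features are already encoded and scaled, they can be passed to a loaded model without repeating the preprocessing.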