# Classification Models Implementation

**Dataset:** Demographics.csv (Adult Income Dataset)

**Objective:** Implement and compare six classification models that predict whether a person's income is <=50K or >50K

## Models Implemented:
1. Logistic Regression
2. Decision Tree Classifier
3. K-Nearest Neighbors Classifier
4. Naive Bayes Classifier (Gaussian)
5. Ensemble Model - Random Forest
6. Ensemble Model - XGBoost

## 1. Import Required Libraries

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    matthews_corrcoef, roc_auc_score, classification_report, confusion_matrix,
)
import warnings
warnings.filterwarnings('ignore')

# Import models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

print("All libraries imported successfully!")

All libraries imported successfully!


## 2. Load the Dataset

In [4]:
from google.colab import drive
drive.mount('/content/drive')

# Dataset path
dataset_path = "/content/drive/MyDrive/Sem 1/Machine Learning/Assignment 2/dataset/demographics.csv"

print("Loading dataset...")
column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
                'marital-status', 'occupation', 'relationship', 'race', 'sex',
                'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']

df = pd.read_csv(dataset_path, names=column_names, skipinitialspace=True)

print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
print(df.head())

print("\nDataset info:")
print(df.info())

# Replace '?' with NaN to identify missing values
print("\nReplacing '?' with NaN to identify missing values...")
df = df.replace('?', np.nan)

print("\nMissing values by column:")
missing_counts = df.isnull().sum()
print(missing_counts[missing_counts > 0])
print(f"\nTotal rows with missing values: {df.isnull().any(axis=1).sum()}")

print("\nTarget variable distribution:")
print(df['income'].value_counts())

Mounted at /content/drive
Loading dataset...
Dataset shape: (32561, 15)

First few rows:
   age         workclass  fnlwgt  education  education-num  \
0   39         State-gov   77516  Bachelors             13   
1   50  Self-emp-not-inc   83311  Bachelors             13   
2   38           Private  215646    HS-grad              9   
3   53           Private  234721       11th              7   
4   28           Private  338409  Bachelors             13   

       marital-status         occupation   relationship   race     sex  \
0       Never-married       Adm-clerical  Not-in-family  White    Male   
1  Married-civ-spouse    Exec-managerial        Husband  White    Male   
2            Divorced  Handlers-cleaners  Not-in-family  White    Male   
3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male   
4  Married-civ-spouse     Prof-specialty           Wife  Black  Female   

   capital-gain  capital-loss  hours-per-week native-country income  
0          2174        

## 3. Data Preprocessing

This section handles:
- Missing value treatment (dropping rows that contain NaN, which replaced the '?' placeholders in Section 2)
- Target variable encoding
- Categorical variable encoding
- Feature identification
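
The cleaning and encoding steps listed above can be sketched on a toy frame. The rows below are made up for illustration and are not taken from Demographics.csv; only `replace`, `dropna`, and `LabelEncoder` mirror the calls used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy rows, not taken from Demographics.csv.
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov", "Private"],
    "age": [39, 50, 38, 53],
    "income": ["<=50K", ">50K", "<=50K", ">50K"],
})

# '?' marks a missing value; turn it into NaN and drop the affected rows.
toy = toy.replace("?", np.nan).dropna()

# LabelEncoder assigns integer codes in sorted order of the class labels.
encoder = LabelEncoder()
codes = encoder.fit_transform(toy["workclass"])

print(len(toy))                 # rows left after dropping
print(list(encoder.classes_))   # sorted categories
print(codes.tolist())           # integer code for each remaining row
```

The integer codes imply an ordering that nominal categories such as `workclass` do not have. Tree-based models usually tolerate this, while linear and distance-based models may be more sensitive to it.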

In [5]:
print("="*80)
print("DATA PREPROCESSING")
print("="*80)

# Handle missing values - drop rows with missing values
print(f"\nRows before dropping missing values: {len(df)}")
df = df.dropna()
print(f"Rows after dropping missing values: {len(df)}")

# Separate features and target
X = df.drop('income', axis=1)
y = df['income']

# Encode target variable
print("\nEncoding target variable...")
label_encoder_y = LabelEncoder()
y = label_encoder_y.fit_transform(y)
print(f"Classes: {label_encoder_y.classes_}")

# Identify categorical and numerical columns
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()

print(f"\nCategorical columns: {categorical_cols}")
print(f"Numerical columns: {numerical_cols}")

# Encode categorical variables as integers (codes follow the sorted label order
# and carry no ordinal meaning)
print("\nEncoding categorical variables...")
label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col])
    label_encoders[col] = le

print("Preprocessing complete!")

DATA PREPROCESSING

Rows before dropping missing values: 32561
Rows after dropping missing values: 30162

Encoding target variable...
Classes: ['<=50K' '>50K']

Categorical columns: ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
Numerical columns: ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']

Encoding categorical variables...
Preprocessing complete!


## 4. Train-Test Split and Feature Scaling

Split the data into training (80%) and test (20%) sets, stratified by the target, then standardize the numerical features using statistics fitted on the training set only.
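
As a minimal sketch of why the split is stratified and why the scaler is fitted on the training rows only, the example below uses synthetic data. The array shapes, class ratio, and the `*_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 100 rows, 25% positives.
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
y_demo = np.array([0] * 75 + [1] * 25)

# stratify=y_demo keeps the class ratio the same in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit the scaler on the training rows only, then reuse its statistics
# for the test rows so that no test information leaks into training.
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(y_tr.mean(), y_te.mean())    # positive rate in each partition
print(X_tr_scaled.mean(axis=0))    # close to 0 by construction
```

The test partition is transformed with the training mean and standard deviation, so its own mean is generally close to, but not exactly, zero.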

In [6]:
# Split the data
print("Splitting data into train and test sets (80-20 split)...")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")

# Standardize numerical features
print("\nStandardizing numerical features...")
scaler = StandardScaler()
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()
X_train_scaled[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test_scaled[numerical_cols] = scaler.transform(X_test[numerical_cols])

print("Data preparation complete!")

Splitting data into train and test sets (80-20 split)...
Training set size: (24129, 14)
Test set size: (6033, 14)

Standardizing numerical features...
Data preparation complete!


## 5. Model Evaluation Function

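Define a reusable function that trains a model and reports training and test accuracy, AUC, weighted precision, recall, and F1, and MCC, followed by the classification report and confusion matrix.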
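
Two of these metrics are worth working through by hand. The sketch below uses hypothetical confusion-matrix counts (not results from this notebook) to show that support-weighted recall reduces to accuracy and how MCC is computed from the four counts.

```python
import math

# Hypothetical confusion-matrix counts for a binary problem.
tn, fp, fn, tp = 90, 10, 20, 30
total = tn + fp + fn + tp

accuracy = (tp + tn) / total

# Per-class recall: class 0 counts TN as its hits, class 1 counts TP.
recall_neg = tn / (tn + fp)
recall_pos = tp / (tp + fn)

# Support-weighted recall, the quantity behind average='weighted'.
support_neg, support_pos = tn + fp, tp + fn
weighted_recall = (support_neg * recall_neg + support_pos * recall_pos) / total

# Matthews correlation coefficient from the four counts.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"accuracy={accuracy:.4f} weighted_recall={weighted_recall:.4f} mcc={mcc:.4f}")
```

Because the weights cancel the per-class denominators, weighted recall always equals accuracy. This is why the Recall and Test Accuracy values coincide in the tables that follow.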

In [7]:
# Function to evaluate models
def evaluate_model(model, model_name, X_train, X_test, y_train, y_test):
    """
    Train and evaluate a classification model
    """
    print("\n" + "="*80)
    print(f"{model_name}")
    print("="*80)

    # Train the model
    print(f"\nTraining {model_name}...")
    model.fit(X_train, y_train)

    # Make predictions
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    # Get positive-class probabilities for AUC; models without predict_proba
    # fall back to hard labels, which gives a coarser AUC estimate
    if hasattr(model, 'predict_proba'):
        y_pred_proba = model.predict_proba(X_test)[:, 1]
    else:
        y_pred_proba = y_pred_test

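    # Calculate metrics. Precision, recall, and F1 are support-weighted averages
    # over both classes, so the weighted recall equals the test accuracy.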
    train_accuracy = accuracy_score(y_train, y_pred_train)
    test_accuracy = accuracy_score(y_test, y_pred_test)
    auc = roc_auc_score(y_test, y_pred_proba)
    precision = precision_score(y_test, y_pred_test, average='weighted')
    recall = recall_score(y_test, y_pred_test, average='weighted')
    f1 = f1_score(y_test, y_pred_test, average='weighted')
    mcc = matthews_corrcoef(y_test, y_pred_test)

    print("-" * 80)

    # Display metrics in tabular format
    print("\nPerformance Metrics Table:")
    print("-" * 50)
    print(f"{'Metric':<25} {'Value':>20}")
    print("-" * 50)
    print(f"{'Training Accuracy':<25} {train_accuracy:>20.4f}")
    print(f"{'Test Accuracy':<25} {test_accuracy:>20.4f}")
    print(f"{'AUC':<25} {auc:>20.4f}")
    print(f"{'Precision':<25} {precision:>20.4f}")
    print(f"{'Recall':<25} {recall:>20.4f}")
    print(f"{'F1-Score':<25} {f1:>20.4f}")
    print(f"{'MCC Score':<25} {mcc:>20.4f}")
    print("-" * 50)

    print("\nClassification Report:")
    print(classification_report(y_test, y_pred_test, target_names=label_encoder_y.classes_))

    print("\nConfusion Matrix:")
    cm = confusion_matrix(y_test, y_pred_test)
    cm_df = pd.DataFrame(cm,
                         index=[f'Actual {label_encoder_y.classes_[0]}', f'Actual {label_encoder_y.classes_[1]}'],
                         columns=[f'Predicted {label_encoder_y.classes_[0]}', f'Predicted {label_encoder_y.classes_[1]}'])
    print(cm_df)

    return {
        'ML Model Name': model_name,
        'Accuracy': test_accuracy,
        'AUC': auc,
        'Precision': precision,
        'Recall': recall,
        'F1': f1,
        'MCC': mcc
    }

# Store results
results = []

print("Evaluation function defined successfully!")

Evaluation function defined successfully!


## 6. Model 1: Logistic Regression

Logistic Regression is a linear model for binary classification that uses the logistic function to predict probabilities.
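
A minimal sketch of the prediction step, assuming hypothetical coefficients (`demo_weights`, `demo_bias`) rather than values learned in this notebook:

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability of the positive class for one sample."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# Hypothetical coefficients for two standardized features.
demo_weights = [0.8, -0.5]
demo_bias = -1.0

p = predict_proba(demo_weights, demo_bias, [1.5, 0.2])
label = int(p >= 0.5)   # decision threshold of 0.5
print(round(p, 4), label)
```

The linear score here is 0.1, so the predicted probability sits just above 0.5 and the sample is assigned to the positive class.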

In [8]:
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_results = evaluate_model(lr_model, "LOGISTIC REGRESSION",
                            X_train_scaled, X_test_scaled, y_train, y_test)
results.append(lr_results)


LOGISTIC REGRESSION

Training LOGISTIC REGRESSION...
--------------------------------------------------------------------------------

Performance Metrics Table:
--------------------------------------------------
Metric                                   Value
--------------------------------------------------
Training Accuracy                       0.8201
Test Accuracy                           0.8177
AUC                                     0.8500
Precision                               0.8062
Recall                                  0.8177
F1-Score                                0.8019
MCC Score                               0.4617
--------------------------------------------------

Classification Report:
              precision    recall  f1-score   support

       <=50K       0.84      0.94      0.89      4531
        >50K       0.71      0.45      0.55      1502

    accuracy                           0.82      6033
   macro avg       0.78      0.69      0.72      6033
weighted avg

## 7. Model 2: Decision Tree Classifier

Decision Tree is a tree-structured model that makes decisions by splitting data based on feature values.
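
The splitting idea can be sketched with Gini impurity, which is the default criterion of `DecisionTreeClassifier`. The data is a hypothetical toy example, and testing observed values as thresholds is a simplification of what a library implementation does.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best 'value <= t' split."""
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:   # the largest value cannot split
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical toy data: hours worked per week vs. income class (0 or 1).
hours = [20, 25, 30, 40, 45, 50, 60]
income = [0, 0, 0, 0, 1, 1, 1]

threshold, impurity = best_threshold(hours, income)
print(threshold, impurity)
```

A full tree repeats this search over every feature and then recurses into each child node until a stopping rule is met.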

In [9]:
dt_model = DecisionTreeClassifier(random_state=42)
dt_results = evaluate_model(dt_model, "DECISION TREE CLASSIFIER",
                            X_train, X_test, y_train, y_test)
results.append(dt_results)


DECISION TREE CLASSIFIER

Training DECISION TREE CLASSIFIER...
--------------------------------------------------------------------------------

Performance Metrics Table:
--------------------------------------------------
Metric                                   Value
--------------------------------------------------
Training Accuracy                       1.0000
Test Accuracy                           0.8066
AUC                                     0.7404
Precision                               0.8062
Recall                                  0.8066
F1-Score                                0.8064
MCC Score                               0.4817
--------------------------------------------------

Classification Report:
              precision    recall  f1-score   support

       <=50K       0.87      0.87      0.87      4531
        >50K       0.61      0.61      0.61      1502

    accuracy                           0.81      6033
   macro avg       0.74      0.74      0.74      6033
we

## 8. Model 3: K-Nearest Neighbors Classifier

KNN classifies samples based on the majority class of their k nearest neighbors in the feature space.
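
A minimal sketch of the voting rule with Euclidean distance, using hypothetical standardized points:

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Predict the majority label among the k training points nearest to query."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical standardized 2-D points.
points = [(-1.0, -1.0), (-0.8, -1.2), (-1.1, -0.7), (1.0, 1.0), (0.9, 1.2), (1.2, 0.8)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (0.7, 0.9), k=3))
print(knn_predict(points, labels, (-0.9, -0.9), k=3))
```

Because the distance sums over all features, a feature with a large numeric range would dominate the neighbor search. This is the usual reason for feeding KNN standardized inputs, as the cell below does with `X_train_scaled`.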

In [10]:
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_results = evaluate_model(knn_model, "K-NEAREST NEIGHBORS CLASSIFIER",
                             X_train_scaled, X_test_scaled, y_train, y_test)
results.append(knn_results)


K-NEAREST NEIGHBORS CLASSIFIER

Training K-NEAREST NEIGHBORS CLASSIFIER...
--------------------------------------------------------------------------------

Performance Metrics Table:
--------------------------------------------------
Metric                                   Value
--------------------------------------------------
Training Accuracy                       0.8720
Test Accuracy                           0.8253
AUC                                     0.8433
Precision                               0.8191
Recall                                  0.8253
F1-Score                                0.8212
MCC Score                               0.5142
--------------------------------------------------

Classification Report:
              precision    recall  f1-score   support

       <=50K       0.87      0.90      0.89      4531
        >50K       0.67      0.59      0.63      1502

    accuracy                           0.83      6033
   macro avg       0.77      0.75      0.76 

## 9. Model 4: Naive Bayes Classifier (Gaussian)

Gaussian Naive Bayes assumes features follow a normal distribution and applies Bayes' theorem with strong independence assumptions.
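
The decision rule can be sketched in plain Python: estimate a mean and variance per class and feature, then pick the class with the highest log prior plus summed log likelihoods. The samples and the `1e-9` variance floor are assumptions for illustration, not the library's exact smoothing.

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def fit_gaussian_nb(rows, labels):
    """Estimate the prior, feature means, and feature variances per class."""
    params = {}
    for cls in sorted(set(labels)):
        cls_rows = [r for r, y in zip(rows, labels) if y == cls]
        n = len(cls_rows)
        means = [sum(col) / n for col in zip(*cls_rows)]
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9   # floor avoids zero variance
            for col, m in zip(zip(*cls_rows), means)
        ]
        params[cls] = (n / len(rows), means, variances)
    return params

def predict_gaussian_nb(params, row):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    scores = {
        cls: math.log(prior)
        + sum(gaussian_log_pdf(x, m, v) for x, m, v in zip(row, means, variances))
        for cls, (prior, means, variances) in params.items()
    }
    return max(scores, key=scores.get)

# Hypothetical two-feature samples.
rows = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (3.0, 0.5), (3.2, 0.7), (2.8, 0.3)]
labels = [0, 0, 0, 1, 1, 1]

nb_params = fit_gaussian_nb(rows, labels)
print(predict_gaussian_nb(nb_params, (1.1, 2.1)))
print(predict_gaussian_nb(nb_params, (2.9, 0.6)))
```

Summing the per-feature log likelihoods is where the independence assumption enters: each feature contributes on its own, with no interaction terms.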

In [11]:
nb_model = GaussianNB()
nb_results = evaluate_model(nb_model, "NAIVE BAYES CLASSIFIER (GAUSSIAN)",
                            X_train_scaled, X_test_scaled, y_train, y_test)
results.append(nb_results)


NAIVE BAYES CLASSIFIER (GAUSSIAN)

Training NAIVE BAYES CLASSIFIER (GAUSSIAN)...
--------------------------------------------------------------------------------

Performance Metrics Table:
--------------------------------------------------
Metric                                   Value
--------------------------------------------------
Training Accuracy                       0.7960
Test Accuracy                           0.7978
AUC                                     0.8498
Precision                               0.7830
Recall                                  0.7978
F1-Score                                0.7697
MCC Score                               0.3798
--------------------------------------------------

Classification Report:
              precision    recall  f1-score   support

       <=50K       0.81      0.95      0.88      4531
        >50K       0.70      0.33      0.45      1502

    accuracy                           0.80      6033
   macro avg       0.75      0.64     

## 10. Model 5: Random Forest Classifier (Ensemble)

Random Forest is an ensemble method that builds multiple decision trees and combines their predictions through voting.
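
The two ingredients, bootstrap sampling and vote aggregation, can be sketched as follows. The row indices and per-tree votes are hypothetical, and no actual trees are trained here.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as bagging does for each tree."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Combine per-tree class predictions into one label."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
data = list(range(10))              # stand-in for training row indices
sample = bootstrap_sample(data, rng)

# Hypothetical votes from five trees for a single test row.
tree_votes = [1, 0, 1, 1, 0]

print(sample)                       # some indices repeat, others are left out
print(majority_vote(tree_votes))
```

scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities and picks the class with the highest mean, which acts as a soft version of the hard vote shown here.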

In [12]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_results = evaluate_model(rf_model, "RANDOM FOREST CLASSIFIER",
                            X_train, X_test, y_train, y_test)
results.append(rf_results)


RANDOM FOREST CLASSIFIER

Training RANDOM FOREST CLASSIFIER...
--------------------------------------------------------------------------------

Performance Metrics Table:
--------------------------------------------------
Metric                                   Value
--------------------------------------------------
Training Accuracy                       1.0000
Test Accuracy                           0.8541
AUC                                     0.9027
Precision                               0.8489
Recall                                  0.8541
F1-Score                                0.8500
MCC Score                               0.5927
--------------------------------------------------

Classification Report:
              precision    recall  f1-score   support

       <=50K       0.88      0.93      0.91      4531
        >50K       0.74      0.63      0.68      1502

    accuracy                           0.85      6033
   macro avg       0.81      0.78      0.79      6033
we

## 11. Model 6: XGBoost Classifier (Ensemble)

XGBoost is a gradient boosting ensemble method that builds trees sequentially, each correcting errors from previous trees.
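
The sequential idea can be sketched with plain squared-error gradient boosting on one feature. This is not XGBoost's algorithm, which adds regularization and second-order gradient information; the data, the stump learner, and the helper names are assumptions for illustration.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on xs that best fits residuals in squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to the residuals left by the previous rounds."""
    base = sum(ys) / len(ys)
    stumps, history = [], []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, history

def predict(base, stumps, x, learning_rate=0.5):
    """Sum the base value and the scaled contribution of every stump."""
    return base + learning_rate * sum(stump(x) for stump in stumps)

# Hypothetical 1-D regression data with three plateaus.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

base, stumps, mse_history = boost(xs, ys)
print(mse_history[0], mse_history[-1])
```

The training error shrinks round by round because each new stump targets what the current ensemble still gets wrong.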

In [13]:
xgb_model = XGBClassifier(random_state=42, eval_metric='logloss')
xgb_results = evaluate_model(xgb_model, "XGBOOST CLASSIFIER",
                             X_train, X_test, y_train, y_test)
results.append(xgb_results)


XGBOOST CLASSIFIER

Training XGBOOST CLASSIFIER...
--------------------------------------------------------------------------------

Performance Metrics Table:
--------------------------------------------------
Metric                                   Value
--------------------------------------------------
Training Accuracy                       0.9113
Test Accuracy                           0.8616
AUC                                     0.9204
Precision                               0.8567
Recall                                  0.8616
F1-Score                                0.8574
MCC Score                               0.6131
--------------------------------------------------

Classification Report:
              precision    recall  f1-score   support

       <=50K       0.89      0.93      0.91      4531
        >50K       0.76      0.64      0.70      1502

    accuracy                           0.86      6033
   macro avg       0.83      0.79      0.80      6033
weighted avg  

## 12. Model Comparison Summary

Compare all models on their test-set metrics and select the one with the highest test accuracy.
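
The selection rule can be checked by hand in plain Python. The accuracy and AUC values below are the ones reported in the per-model sections above; the `reported` list is a stand-in for the notebook's `results` and is introduced only for this sketch.

```python
# Test accuracy and AUC as reported in the per-model sections of this notebook.
reported = [
    {"ML Model Name": "LOGISTIC REGRESSION", "Accuracy": 0.8177, "AUC": 0.8500},
    {"ML Model Name": "DECISION TREE CLASSIFIER", "Accuracy": 0.8066, "AUC": 0.7404},
    {"ML Model Name": "K-NEAREST NEIGHBORS CLASSIFIER", "Accuracy": 0.8253, "AUC": 0.8433},
    {"ML Model Name": "NAIVE BAYES CLASSIFIER (GAUSSIAN)", "Accuracy": 0.7978, "AUC": 0.8498},
    {"ML Model Name": "RANDOM FOREST CLASSIFIER", "Accuracy": 0.8541, "AUC": 0.9027},
    {"ML Model Name": "XGBOOST CLASSIFIER", "Accuracy": 0.8616, "AUC": 0.9204},
]

# Same idea as idxmax() on a DataFrame column: keep the row with the largest value.
best_by_accuracy = max(reported, key=lambda row: row["Accuracy"])
best_by_auc = max(reported, key=lambda row: row["AUC"])

# Rank all models from highest to lowest test accuracy.
ranking = [
    row["ML Model Name"]
    for row in sorted(reported, key=lambda row: row["Accuracy"], reverse=True)
]

print(best_by_accuracy["ML Model Name"])
print(best_by_auc["ML Model Name"])
print(ranking)
```

Accuracy and AUC agree on the top model here, but they rank the weaker models differently: Naive Bayes has the lowest accuracy yet the third-highest AUC.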

In [16]:
print("\n" + "="*80)
print("SUMMARY - MODEL COMPARISON")
print("="*80)

# Create results dataframe
results_df = pd.DataFrame(results)

# Display the table
print("-" * 120)
print(f"{'ML Model Name':<40} {'Accuracy':>10} {'AUC':>10} {'Precision':>10} {'Recall':>10} {'F1':>10} {'MCC':>10}")
print("-" * 120)
for _, row in results_df.iterrows():
    print(f"{row['ML Model Name']:<40} {row['Accuracy']:>10.4f} {row['AUC']:>10.4f} {row['Precision']:>10.4f} {row['Recall']:>10.4f} {row['F1']:>10.4f} {row['MCC']:>10.4f}")
print("-" * 120)

# Find best model based on test accuracy
best_model_idx = results_df['Accuracy'].idxmax()
best_model = results_df.loc[best_model_idx, 'ML Model Name']
best_accuracy = results_df.loc[best_model_idx, 'Accuracy']
best_auc = results_df.loc[best_model_idx, 'AUC']

print(f"\n{'='*80}")
print(f"Best Model: {best_model}")
print(f"Test Accuracy: {best_accuracy:.4f}")
print(f"AUC Score: {best_auc:.4f}")
print(f"{'='*80}")


SUMMARY - MODEL COMPARISON
------------------------------------------------------------------------------------------------------------------------
ML Model Name                              Accuracy        AUC  Precision     Recall         F1        MCC
------------------------------------------------------------------------------------------------------------------------
LOGISTIC REGRESSION                          0.8177     0.8500     0.8062     0.8177     0.8019     0.4617
DECISION TREE CLASSIFIER                     0.8066     0.7404     0.8062     0.8066     0.8064     0.4817
K-NEAREST NEIGHBORS CLASSIFIER               0.8253     0.8433     0.8191     0.8253     0.8212     0.5142
NAIVE BAYES CLASSIFIER (GAUSSIAN)            0.7978     0.8498     0.7830     0.7978     0.7697     0.3798
RANDOM FOREST CLASSIFIER                     0.8541     0.9027     0.8489     0.8541     0.8500     0.5927
XGBOOST CLASSIFIER                           0.8616     0.9204     0.8567     0.8616     0.8574     0.6131

## 13. Model Performance Observations

Detailed observations on the performance of each model on the Demographics dataset.

In [17]:
print("="*80)
print("MODEL PERFORMANCE OBSERVATIONS")
print("="*80)

# Sort models by accuracy for better comparison
results_sorted = results_df.sort_values('Accuracy', ascending=False)

print("\n1. OVERALL PERFORMANCE RANKING (by Accuracy):")
print("-" * 80)
for idx, (_, row) in enumerate(results_sorted.iterrows(), 1):
    print(f"{idx}. {row['ML Model Name']}: {row['Accuracy']:.4f}")

print("\n2. DETAILED OBSERVATIONS:")
print("-" * 80)

for _, row in results_df.iterrows():
    model_name = row['ML Model Name']
    accuracy = row['Accuracy']
    auc = row['AUC']
    precision = row['Precision']
    recall = row['Recall']
    f1 = row['F1']
    mcc = row['MCC']

    print(f"\n{model_name}:")

    # Performance level
    if accuracy >= 0.85:
        print(f"  ✓ Excellent performance with {accuracy:.2%} accuracy")
    elif accuracy >= 0.80:
        print(f"  ✓ Good performance with {accuracy:.2%} accuracy")
    elif accuracy >= 0.75:
        print(f"  ○ Moderate performance with {accuracy:.2%} accuracy")
    else:
        print(f"  ✗ Lower performance with {accuracy:.2%} accuracy")

    # AUC interpretation
    if auc >= 0.90:
        print(f"  ✓ Excellent discrimination capability (AUC: {auc:.4f})")
    elif auc >= 0.80:
        print(f"  ✓ Good discrimination capability (AUC: {auc:.4f})")
    else:
        print(f"  ○ Moderate discrimination capability (AUC: {auc:.4f})")

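    # Precision-Recall balance (both values are support-weighted averages, so this
    # compares overall averages rather than per-class false positives and negatives)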
    if abs(precision - recall) < 0.02:
        print(f"  ✓ Well-balanced precision ({precision:.4f}) and recall ({recall:.4f})")
    elif precision > recall:
        print(f"  ○ Higher precision ({precision:.4f}) than recall ({recall:.4f}) - fewer false positives")
    else:
        print(f"  ○ Higher recall ({recall:.4f}) than precision ({precision:.4f}) - fewer false negatives")

    # F1 Score
    print(f"  • F1-Score: {f1:.4f} - Overall balance between precision and recall")

    # MCC interpretation
    if mcc >= 0.50:
        print(f"  ✓ Strong correlation (MCC: {mcc:.4f})")
    elif mcc >= 0.30:
        print(f"  ○ Moderate correlation (MCC: {mcc:.4f})")
    else:
        print(f"  ✗ Weak correlation (MCC: {mcc:.4f})")

print("\n3. KEY INSIGHTS:")
print("-" * 80)

# Best overall model
best_model_name = results_sorted.iloc[0]['ML Model Name']
best_accuracy = results_sorted.iloc[0]['Accuracy']
best_f1 = results_sorted.iloc[0]['F1']

print(f"\n• Best Overall Model: {best_model_name}")
print(f"  - Achieved highest accuracy of {best_accuracy:.2%}")
print(f"  - F1-Score: {best_f1:.4f}")

# Best AUC
best_auc_idx = results_df['AUC'].idxmax()
best_auc_model = results_df.loc[best_auc_idx, 'ML Model Name']
best_auc_score = results_df.loc[best_auc_idx, 'AUC']
print(f"\n• Best AUC Score: {best_auc_model} ({best_auc_score:.4f})")

# Best MCC
best_mcc_idx = results_df['MCC'].idxmax()
best_mcc_model = results_df.loc[best_mcc_idx, 'ML Model Name']
best_mcc_score = results_df.loc[best_mcc_idx, 'MCC']
print(f"\n• Best MCC Score: {best_mcc_model} ({best_mcc_score:.4f})")

# Ensemble vs Traditional comparison
ensemble_models = results_df[results_df['ML Model Name'].str.contains('FOREST|XGBOOST')]
traditional_models = results_df[~results_df['ML Model Name'].str.contains('FOREST|XGBOOST')]

if len(ensemble_models) > 0 and len(traditional_models) > 0:
    avg_ensemble_acc = ensemble_models['Accuracy'].mean()
    avg_traditional_acc = traditional_models['Accuracy'].mean()
    print(f"\n• Ensemble Models Average Accuracy: {avg_ensemble_acc:.2%}")
    print(f"• Traditional Models Average Accuracy: {avg_traditional_acc:.2%}")

    if avg_ensemble_acc > avg_traditional_acc:
        # Relative gain, expressed as a percentage of the traditional-model average
        improvement = ((avg_ensemble_acc - avg_traditional_acc) / avg_traditional_acc) * 100
        print(f"  → Ensemble models outperform traditional models by {improvement:.1f}% (relative)")

# Worst performing model
worst_model_name = results_sorted.iloc[-1]['ML Model Name']
worst_accuracy = results_sorted.iloc[-1]['Accuracy']
print(f"\n• Lowest Performing Model: {worst_model_name} ({worst_accuracy:.2%})")

print("\n4. RECOMMENDATIONS:")
print("-" * 80)
print(f"\n• For deployment: Use {best_model_name} for best overall performance")
print("• For interpretability: Consider DECISION TREE or LOGISTIC REGRESSION")
print("• For handling imbalanced data: Consider models with high MCC scores")
if best_auc_score >= 0.90:  # same cutoff used for "excellent" AUC above
    print(f"• The {best_auc_model} shows excellent ability to distinguish between classes")

print("\n" + "="*80)

MODEL PERFORMANCE OBSERVATIONS

1. OVERALL PERFORMANCE RANKING (by Accuracy):
--------------------------------------------------------------------------------
1. XGBOOST CLASSIFIER: 0.8616
2. RANDOM FOREST CLASSIFIER: 0.8541
3. K-NEAREST NEIGHBORS CLASSIFIER: 0.8253
4. LOGISTIC REGRESSION: 0.8177
5. DECISION TREE CLASSIFIER: 0.8066
6. NAIVE BAYES CLASSIFIER (GAUSSIAN): 0.7978

2. DETAILED OBSERVATIONS:
--------------------------------------------------------------------------------

LOGISTIC REGRESSION:
  ✓ Good performance with 81.77% accuracy
  ✓ Good discrimination capability (AUC: 0.8500)
  ✓ Well-balanced precision (0.8062) and recall (0.8177)
  • F1-Score: 0.8019 - Overall balance between precision and recall
  ○ Moderate correlation (MCC: 0.4617)

DECISION TREE CLASSIFIER:
  ✓ Good performance with 80.66% accuracy
  ○ Moderate discrimination capability (AUC: 0.7404)
  ✓ Well-balanced precision (0.8062) and recall (0.8066)
  • F1-Score: 0.8064 - Overall balance between precision 