# Adult Dataset: Data Mining Project

### Part 1: Data Acquisition and Preparation
The Adult dataset (also known as the "Census Income" dataset) is a classic benchmark for binary classification: predicting whether an individual earns more than $50,000 per year.

Preprocessing Goals for Task 1:

Handle Missing Values: Records with unknown values (labeled as "?") must be removed.


Feature Selection: For Task 1, we must remove all continuous attributes and keep only categorical ones.

One-Hot Encoding: This turns each value of a categorical attribute into its own binary (0 or 1) column, which is needed because scikit-learn estimators expect numeric input.
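
As a rough illustration of the encoding step, the sketch below one-hot encodes two toy frames and aligns the test columns to the training columns. The frames `toy_train` and `toy_test` and their values are made up for the example; only `pd.get_dummies` and `DataFrame.align` mirror what the notebook uses.

```python
import pandas as pd

# Hypothetical toy frames standing in for the train/test splits.
toy_train = pd.DataFrame({"sex": ["Male", "Female", "Male"],
                          "race": ["White", "Black", "White"]})
toy_test = pd.DataFrame({"sex": ["Female", "Male"],
                         "race": ["White", "Other"]})  # "Other" never occurs in training

# Each category value becomes its own indicator column.
enc_train = pd.get_dummies(toy_train)
enc_test = pd.get_dummies(toy_test)

# Keep exactly the training columns: columns missing from the test frame
# are filled with 0, and test-only columns (race_Other) are dropped.
enc_train, enc_test = enc_train.align(enc_test, join="left", axis=1, fill_value=0)

print(list(enc_train.columns))
print(list(enc_test.columns))
```

Aligning with `join="left"` means a model fitted on the training columns sees the same feature layout at prediction time, at the cost of ignoring categories that appear only in the test data.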

In [1]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report, confusion_matrix

# Column names, in the order listed in the adult.names documentation
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
    "hours-per-week", "native-country", "income"
]

# Load data - handling the whitespace after commas
train_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
test_url  = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test"


train_df = pd.read_csv(train_url, names=columns, sep=r',\s*', engine='python')
test_df = pd.read_csv(test_url, names=columns, sep=r',\s*', engine='python', skiprows=1)

# Step 1: Remove unknown values '?'
train_df.replace('?', np.nan, inplace=True)
test_df.replace('?', np.nan, inplace=True)
train_df.dropna(inplace=True)
test_df.dropna(inplace=True)

# Step 2: Clean target label in test (remove trailing period)
test_df['income'] = test_df['income'].str.rstrip('.')

# Step 3: Remove continuous attributes for Task 1
continuous_cols = ["age", "fnlwgt", "education-num", "capital-gain", "capital-loss", "hours-per-week"]
train_task1 = train_df.drop(columns=continuous_cols)
test_task1 = test_df.drop(columns=continuous_cols)

# Step 4: One-Hot Encoding
X_train_t1 = pd.get_dummies(train_task1.drop("income", axis=1))
X_test_t1 = pd.get_dummies(test_task1.drop("income", axis=1))
y_train_t1 = train_task1["income"]
y_test_t1 = test_task1["income"]

# Align columns to ensure both have same features
X_train_t1, X_test_t1 = X_train_t1.align(X_test_t1, join='left', axis=1, fill_value=0)

print(f"Cleaned Training size: {len(X_train_t1)}")
print(f"Cleaned Testing size: {len(X_test_t1)}")

Cleaned Training size: 30162
Cleaned Testing size: 15060


### Part 2: Classification Models (Decision Tree & Naive Bayes)

Algorithm Development: Task 1
In this step, we evaluate two different classification strategies on the categorical-only data:

Decision Tree: Splits the data at each node on the feature that gives the highest information gain (the "entropy" criterion).

Naive Bayes: Applies Bayes' Theorem under the assumption that features are conditionally independent given the class.
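
To make the entropy criterion concrete, here is a minimal sketch that computes Shannon entropy and the information gain of a single binary feature. The helper names `entropy` and `information_gain` and the toy lists `married` and `income` are hypothetical; this is not how scikit-learn implements the split search internally, only the quantity it optimizes.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy reduction obtained by splitting `labels` on the values of `feature`."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy data: one binary indicator and the income label.
married = [1, 1, 1, 0, 0, 0, 0, 0]
income = [">50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K", "<=50K", "<=50K"]

print(f"Entropy of labels: {entropy(income):.3f} bits")
print(f"Information gain of 'married': {information_gain(married, income):.3f} bits")
```

A decision tree with `criterion='entropy'` evaluates candidate splits by this kind of gain and picks the largest at each node.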

In [2]:
def evaluate_model(model, X_train, y_train, X_test, y_test, name):
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    
    print(f"\n=== {name} Classification Report ===")
    print(classification_report(y_test, preds))
    
    # Confusion matrix with a fixed label order so that ">50K" is the positive class:
    # cm[0,0] = TN, cm[0,1] = FP, cm[1,0] = FN, cm[1,1] = TP
    cm = confusion_matrix(y_test, preds, labels=["<=50K", ">50K"])
    tn, fp, fn, tp = cm.ravel()
    fp_rate = fp / (fp + tn)  # FP / (FP + TN)
    tp_rate = tp / (tp + fn)  # TP / (TP + FN), same as recall for the positive class
    
    print(f"TP Rate: {tp_rate:.4f}")
    print(f"FP Rate: {fp_rate:.4f}")

# Initialize and run models
dt = DecisionTreeClassifier(criterion='entropy', random_state=42)
nb = BernoulliNB()

evaluate_model(dt, X_train_t1, y_train_t1, X_test_t1, y_test_t1, "Decision Tree")
evaluate_model(nb, X_train_t1, y_train_t1, X_test_t1, y_test_t1, "Naive Bayes")


=== Decision Tree Classification Report ===
              precision    recall  f1-score   support

       <=50K       0.86      0.89      0.88     11360
        >50K       0.63      0.56      0.60      3700

    accuracy                           0.81     15060
   macro avg       0.75      0.73      0.74     15060
weighted avg       0.81      0.81      0.81     15060

TP Rate: 0.5627
FP Rate: 0.1059

=== Naive Bayes Classification Report ===
              precision    recall  f1-score   support

       <=50K       0.92      0.74      0.82     11360
        >50K       0.50      0.79      0.61      3700

    accuracy                           0.75     15060
   macro avg       0.71      0.77      0.72     15060
weighted avg       0.81      0.75      0.77     15060

TP Rate: 0.7927
FP Rate: 0.2605
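
The TP and FP rates reported above come straight from the four cells of the confusion matrix. The sketch below works through the arithmetic on eight made-up records (`y_true` and `y_pred` are hypothetical); passing `labels=` explicitly is an assumption added here so the positive class does not depend on label sort order.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for eight records.
y_true = ["<=50K", "<=50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"]
y_pred = ["<=50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K", "<=50K"]

# Rows are true classes, columns are predicted classes, in the order given by `labels`.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["<=50K", ">50K"]).ravel()

tp_rate = tp / (tp + fn)  # share of true ">50K" records that were found
fp_rate = fp / (fp + tn)  # share of true "<=50K" records flagged as ">50K"

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"TP Rate: {tp_rate:.4f}")
print(f"FP Rate: {fp_rate:.4f}")
```

Read this way, the two reports show a trade-off: Naive Bayes finds more of the ">50K" records (higher TP rate) but also flags more "<=50K" records incorrectly (higher FP rate) than the decision tree.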


### Part 3: Clustering and kNN (Binarized Numerical Data)

Task 2: Binarization and Clustering
For this task, we include numerical data but transform it:

Numerical to Binary: Any value ≥ the mean of that column becomes 1, otherwise 0.

K-Means: We look for patterns by grouping similar records into k clusters.

kNN: We predict the class of the last 10 records of the test set by looking at their k nearest neighbors in the training set.
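
The binarization rule can be sketched on a single toy column. The frames and the `age_bin` column name are hypothetical; the point of the example is that the threshold is computed on the training data only and then reused for the test data, so no test information leaks into preprocessing.

```python
import pandas as pd

# Hypothetical toy columns; the training mean is 35.
toy_train = pd.DataFrame({"age": [20, 30, 40, 50]})
toy_test = pd.DataFrame({"age": [25, 35, 60]})

# Threshold comes from the training split only.
threshold = toy_train["age"].mean()

# Values greater than or equal to the threshold become 1, all others 0.
toy_train["age_bin"] = (toy_train["age"] >= threshold).astype(int)
toy_test["age_bin"] = (toy_test["age"] >= threshold).astype(int)

print(toy_train)
print(toy_test)
```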

In [3]:
# Create copy for Task 2
train_t2 = train_df.copy()
test_t2 = test_df.copy()

# Binarize numerical columns using the training mean
num_cols = ["age", "fnlwgt", "education-num", "capital-gain", "capital-loss", "hours-per-week"]
for col in num_cols:
    mean_val = train_t2[col].astype(float).mean()
    train_t2[col] = (train_t2[col].astype(float) >= mean_val).astype(int)
    test_t2[col] = (test_t2[col].astype(float) >= mean_val).astype(int)

# One-hot encode remaining categorical columns
X_train_t2 = pd.get_dummies(train_t2.drop("income", axis=1))
X_test_t2 = pd.get_dummies(test_t2.drop("income", axis=1))
y_train_t2 = train_t2["income"]
y_test_t2 = test_t2["income"]

X_train_t2, X_test_t2 = X_train_t2.align(X_test_t2, join='left', axis=1, fill_value=0)

# K-Means Clustering
from sklearn.cluster import KMeans
for k in [3, 5, 10]:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_train_t2)
    print(f"\nK-Means Centroids for k={k}:")
    print(km.cluster_centers_)

# kNN for last 10 records
from sklearn.neighbors import KNeighborsClassifier
X_last_10 = X_test_t2.tail(10)
y_last_10 = y_test_t2.tail(10)

for k in [3, 5, 10]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_t2, y_train_t2)
    acc = knn.score(X_last_10, y_last_10)
    print(f"kNN Accuracy (k={k}) on last 10 records: {acc:.2f}")


K-Means Centroids for k=3:
[[ 4.20363934e-01  4.21386220e-01  3.11694950e-01  5.37722347e-02
   3.44510325e-02  1.69699448e-01  3.15886322e-02  8.42363525e-02
   7.81230832e-01  1.28808015e-02  4.00736046e-02  4.94786342e-02
   5.11142916e-04  2.55571458e-02  3.79268043e-02  1.24718871e-02
   4.39582907e-03  7.05377223e-03  1.34941730e-02  1.21652014e-02
   4.03802903e-02  4.65140053e-02  1.55591903e-01  8.28051523e-03
   3.17521979e-01  5.20343488e-02  1.43120016e-03  8.89388673e-03
   2.56287058e-01  2.58536087e-01  1.22674300e-03  1.51298303e-01
   1.93212022e-02  4.40809650e-01  5.86792067e-02  7.01288080e-02
   2.56798201e-01  2.71050543e-19  2.20813740e-02  1.16847270e-01
   6.64485790e-03  1.67654876e-02  5.55101206e-02  1.79717849e-01
   1.38008587e-02  1.52422817e-01  7.76937232e-03  1.27581272e-01
   3.48599468e-02  9.20057248e-03  1.02228583e-04  3.64547127e-01
   3.94602331e-02  2.00470251e-01  2.51789000e-01  1.43631159e-01
   1.09384584e-02  3.00552034e-02  1.43017788e-0
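
Raw centroid arrays like the one above are hard to read without feature names. One possible way to inspect them, sketched here on a made-up binary frame (`toy` and its column names are hypothetical), is to wrap `cluster_centers_` in a DataFrame that reuses the feature columns and to add the cluster sizes.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical binary feature matrix in the same style as the one-hot data.
toy = pd.DataFrame({
    "age": [1, 1, 0, 0, 1, 0],
    "sex_Male": [1, 1, 0, 0, 1, 0],
    "workclass_Private": [0, 1, 1, 1, 0, 1],
})

km = KMeans(n_clusters=2, random_state=42, n_init=10).fit(toy)

# One row per cluster, one labeled column per feature.
centroids = pd.DataFrame(km.cluster_centers_, columns=toy.columns).round(2)

# Number of records assigned to each cluster.
centroids["size"] = np.bincount(km.labels_, minlength=2)

print(centroids)
```

Because every feature is 0 or 1, each centroid coordinate is the fraction of records in that cluster for which the feature equals 1, which makes the labeled table easier to interpret than the bare array.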

### Part 4: Advanced Classifiers (SVM & Neural Networks)

Tasks 3 & 4: Support Vector Machines and Neural Networks
Using the binarized training data from Task 2, we now apply:

SVM (Support Vector Machine): Finds the hyperplane that separates the two classes with the largest margin.

Neural Network (MLP): A multi-layer perceptron, loosely inspired by biological neurons, that stacks layers of weighted sums and non-linear activations to model complex patterns.
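
To show what "separating hyperplane" means for a linear SVM, the sketch below fits `SVC(kernel="linear")` on a tiny made-up dataset and reproduces its predictions from the learned weights. The arrays `X` and `y` are hypothetical, and `C=10.0` is an assumption chosen so the toy data is separated cleanly; it is not a tuned value.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical binary features; only records with both features set are ">50K".
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [1, 1], [0, 0]])
y = np.array(["<=50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K"])

# C=10.0 is an illustrative choice for this toy data.
svm = SVC(kernel="linear", C=10.0).fit(X, y)

# For a linear kernel the hyperplane is w . x + b = 0.
w = svm.coef_[0]
b = svm.intercept_[0]

# A positive score falls on the side of classes_[1], a negative one on classes_[0].
manual_scores = X @ w + b
manual_pred = np.where(manual_scores > 0, svm.classes_[1], svm.classes_[0])

print("weights:", w, "bias:", b)
print("manual predictions:", manual_pred)
print("SVC predictions:   ", svm.predict(X))
```

With a linear kernel the model reduces to one weight per feature plus a bias, so the weights give a rough indication of which binarized features push a record toward the ">50K" side.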

In [4]:
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# SVM Accuracy
svm = SVC(kernel='linear', random_state=42)
svm.fit(X_train_t2, y_train_t2)
print(f"SVM Accuracy: {svm.score(X_test_t2, y_test_t2):.4f}")

# Neural Network Accuracy
mlp = MLPClassifier(hidden_layer_sizes=(50, 25), max_iter=1000, random_state=42)
mlp.fit(X_train_t2, y_train_t2)
print(f"Neural Network Accuracy: {mlp.score(X_test_t2, y_test_t2):.4f}")

SVM Accuracy: 0.8434
Neural Network Accuracy: 0.8164
