# Adult Dataset: Data Mining Project

### Part 1: Data Acquisition and Preparation
The Adult dataset (also known as the "Census Income" dataset) is a classic benchmark for predicting whether an individual earns more than $50,000 per year.

Preprocessing goals for Task 1:

- Handle missing values: records with unknown values (encoded as "?") are removed.
- Feature selection: all continuous attributes are dropped, keeping only the categorical ones.
- One-hot encoding: each categorical value becomes its own binary (0 or 1) column, because scikit-learn estimators expect numeric input.
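
As a rough illustration of the encoding goal, the sketch below one-hot encodes two tiny frames and then aligns the test columns to the training columns. The frames `toy_train` and `toy_test` are hypothetical and are not rows from the Adult dataset; the point is only to show what happens when a category appears in one split but not the other.

```python
import pandas as pd

# Hypothetical toy frames (not real Adult rows), categorical columns only.
toy_train = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private"],
    "sex": ["Male", "Female", "Female"],
})
toy_test = pd.DataFrame({
    "workclass": ["Private", "Never-worked"],
    "sex": ["Male", "Male"],
})

# One binary column per (attribute, value) pair.
enc_train = pd.get_dummies(toy_train, dtype=int)
enc_test = pd.get_dummies(toy_test, dtype=int)

# join="left" keeps exactly the training columns: categories missing from
# the test frame become all-zero columns, test-only categories are dropped.
enc_train, enc_test = enc_train.align(enc_test, join="left", axis=1, fill_value=0)

print(list(enc_train.columns))
print(enc_test)
```

After alignment both frames share the same column order, which is what a fitted scikit-learn model expects at prediction time.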

In [1]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report, confusion_matrix

# Column names, as documented in adult.names
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
    "hours-per-week", "native-country", "income"
]

# Load data. The regex separator absorbs the whitespace after each comma
# (regex separators require the Python parsing engine).
train_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
test_url  = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test"

# adult.test begins with a non-data line, hence skiprows=1
train_df = pd.read_csv(train_url, names=columns, sep=r',\s*', engine='python')
test_df = pd.read_csv(test_url, names=columns, sep=r',\s*', engine='python', skiprows=1)

# Step 1: Remove unknown values '?'
train_df.replace('?', np.nan, inplace=True)
test_df.replace('?', np.nan, inplace=True)
train_df.dropna(inplace=True)
test_df.dropna(inplace=True)

# Step 2: Clean target label in test (remove trailing period)
test_df['income'] = test_df['income'].str.rstrip('.')

# Step 3: Remove continuous attributes for Task 1
continuous_cols = ["age", "fnlwgt", "education-num", "capital-gain", "capital-loss", "hours-per-week"]
train_task1 = train_df.drop(columns=continuous_cols)
test_task1 = test_df.drop(columns=continuous_cols)

# Step 4: One-Hot Encoding
X_train_t1 = pd.get_dummies(train_task1.drop("income", axis=1))
X_test_t1 = pd.get_dummies(test_task1.drop("income", axis=1))
y_train_t1 = train_task1["income"]
y_test_t1 = test_task1["income"]

# Align test columns to the training columns (join='left'):
# categories absent from the test set are zero-filled, test-only categories are dropped
X_train_t1, X_test_t1 = X_train_t1.align(X_test_t1, join='left', axis=1, fill_value=0)

print(f"Cleaned Training size: {len(X_train_t1)}")
print(f"Cleaned Testing size: {len(X_test_t1)}")

Cleaned Training size: 30162
Cleaned Testing size: 15060
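
The cleaning steps can also be checked offline on a small inline sample. The sketch below is an alternative to the regex separator used above, not the notebook's own method: it relies on the `skipinitialspace` and `na_values` options of `pd.read_csv`. The reduced column list and the three rows are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical, shortened column set and made-up rows in the adult.data style
# (comma followed by a space, "?" for unknown values).
cols = ["age", "workclass", "occupation", "income"]
raw = io.StringIO(
    "39, State-gov, Adm-clerical, <=50K\n"
    "54, ?, ?, >50K\n"
    "28, Private, Sales, >50K\n"
)

# skipinitialspace drops the blank after each comma;
# na_values turns "?" into NaN while parsing.
df = pd.read_csv(raw, names=cols, skipinitialspace=True, na_values="?")
clean = df.dropna()

# Test-set labels carry a trailing period, which rstrip removes.
labels = pd.Series([">50K.", "<=50K."]).str.rstrip(".")

print(len(df), len(clean))
print(labels.tolist())
```

Reading "?" as NaN at load time makes the separate `replace` step unnecessary, at the cost of hiding the original marker from later inspection.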


### Part 2: Classification Models (Decision Tree & Naive Bayes)

Algorithm Development: Task 1
In this step, we evaluate two different classification strategies on the categorical-only data:

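Decision Tree: Chooses each split by maximizing information gain, i.e. the reduction in entropy (`criterion='entropy'`).

Naive Bayes: Applies Bayes' theorem under the assumption that features are conditionally independent given the class. The Bernoulli variant models binary features, which matches the one-hot encoded data.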
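
To make the entropy criterion concrete, here is a minimal sketch of entropy and information gain for a categorical feature. It is not scikit-learn's internal implementation; the helper names `entropy` and `information_gain` and the toy lists are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy reduction obtained by partitioning labels by feature value."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the partitions.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy data (made up): one feature separates the classes perfectly,
# the other carries no information about the class.
income = [">50K", ">50K", "<=50K", "<=50K"]
informative = ["A", "A", "B", "B"]
uninformative = ["A", "B", "A", "B"]

print(entropy(income))
print(information_gain(informative, income))
print(information_gain(uninformative, income))
```

A tree built with the entropy criterion prefers the split with the larger gain, so the informative feature would be chosen first here.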

In [2]:
def evaluate_model(model, X_train, y_train, X_test, y_test, name):
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    
    print(f"\n=== {name} Classification Report ===")
    print(classification_report(y_test, preds))
    
    # confusion_matrix sorts the labels, so row/column 0 is "<=50K" (negative)
    # and row/column 1 is ">50K" (positive); ravel() yields TN, FP, FN, TP.
    cm = confusion_matrix(y_test, preds)
    tn, fp, fn, tp = cm.ravel()
    fp_rate = fp / (fp + tn)  # FP / (FP + TN)
    tp_rate = tp / (tp + fn)  # TP / (TP + FN), the recall of the positive class
    
    print(f"TP Rate: {tp_rate:.4f}")
    print(f"FP Rate: {fp_rate:.4f}")

# Initialize and run models
dt = DecisionTreeClassifier(criterion='entropy', random_state=42)
nb = BernoulliNB()

evaluate_model(dt, X_train_t1, y_train_t1, X_test_t1, y_test_t1, "Decision Tree")
evaluate_model(nb, X_train_t1, y_train_t1, X_test_t1, y_test_t1, "Naive Bayes")


=== Decision Tree Classification Report ===
              precision    recall  f1-score   support

       <=50K       0.86      0.89      0.88     11360
        >50K       0.63      0.56      0.60      3700

    accuracy                           0.81     15060
   macro avg       0.75      0.73      0.74     15060
weighted avg       0.81      0.81      0.81     15060

TP Rate: 0.5627
FP Rate: 0.1059

=== Naive Bayes Classification Report ===
              precision    recall  f1-score   support

       <=50K       0.92      0.74      0.82     11360
        >50K       0.50      0.79      0.61      3700

    accuracy                           0.75     15060
   macro avg       0.71      0.77      0.72     15060
weighted avg       0.81      0.75      0.77     15060

TP Rate: 0.7927
FP Rate: 0.2605
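
The TP and FP rates above depend on which label scikit-learn treats as the positive class. The sketch below uses five made-up predictions and passes `labels` explicitly so the ordering is visible. The explicit `labels` argument is an addition for clarity and is not used in the notebook's `evaluate_model`.

```python
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions, for illustration only.
y_true = ["<=50K", "<=50K", "<=50K", ">50K", ">50K"]
y_pred = ["<=50K", ">50K", "<=50K", ">50K", "<=50K"]

# With this label order, row/column 0 is the negative class ("<=50K")
# and row/column 1 is the positive class (">50K").
cm = confusion_matrix(y_true, y_pred, labels=["<=50K", ">50K"])
tn, fp, fn, tp = cm.ravel()

tp_rate = tp / (tp + fn)  # recall of ">50K"
fp_rate = fp / (fp + tn)  # equals 1 - recall of "<=50K"

print(cm)
print(f"TP Rate: {tp_rate:.4f}")
print(f"FP Rate: {fp_rate:.4f}")
```

The same identity can be read off the reports: for the decision tree, the `<=50K` recall of 0.89 corresponds to an FP rate of about 0.11, consistent with the printed 0.1059.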
