# 💾 UCI Adult Dataset - Machine Learning Project


## Introduction

This machine learning project, conducted under the guidance of Professor Matteo Zignani from Università degli Studi di Milano, focuses on a supervised binary classification task using the Adult dataset. The dataset, originally from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/2/adult), is typically provided pre-split into training and test sets. To have greater control over the data splitting process, we used a [version of the dataset available on Kaggle](https://www.kaggle.com/datasets/wenruliu/adult-income-dataset), which provides the data in a single, unsplit format. This approach allowed us to perform our own train-test split.

The goal of the project is to classify whether an individual earns more than $50,000 per year by leveraging preprocessing techniques and classification models to uncover patterns and relationships within the data, ultimately building an accurate predictive model.

#### Imports

In [1]:
# Base
import pandas as pd
import numpy as np

# Preprocessing
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler, LabelBinarizer
from sklearn.compose import ColumnTransformer

# Model selection
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_validate
from sklearn.decomposition import PCA
from sklearn.linear_model import Perceptron
from imblearn.pipeline import Pipeline as IMBPipeline
from imblearn.over_sampling import SMOTE
from sklearn.metrics import f1_score

# Custom utils
from utils.constants import country_mapping
from utils.configs import sampler_configs, dim_reduction_configs, classifier_configs

## Analyse the data

Let's load the full dataset into a Pandas DataFrame and take a closer look at its columns.

In [2]:
df = pd.read_csv("../data/adult.csv", header=0, skipinitialspace=True, na_values="?")
df.shape

(48842, 15)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              48842 non-null  int64 
 1   workclass        46043 non-null  object
 2   fnlwgt           48842 non-null  int64 
 3   education        48842 non-null  object
 4   educational-num  48842 non-null  int64 
 5   marital-status   48842 non-null  object
 6   occupation       46033 non-null  object
 7   relationship     48842 non-null  object
 8   race             48842 non-null  object
 9   gender           48842 non-null  object
 10  capital-gain     48842 non-null  int64 
 11  capital-loss     48842 non-null  int64 
 12  hours-per-week   48842 non-null  int64 
 13  native-country   47985 non-null  object
 14  income           48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


Some columns contain null values. Let's check which ones and how many.

In [4]:
df.isnull().sum()

age                   0
workclass          2799
fnlwgt                0
education             0
educational-num       0
marital-status        0
occupation         2809
relationship          0
race                  0
gender                0
capital-gain          0
capital-loss          0
hours-per-week        0
native-country      857
income                0
dtype: int64

We will need to deal with missing values for the columns `workclass`, `occupation` and `native-country`.
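As a minimal sketch of how these gaps can be handled, the snippet below measures the share of missing values per column and fills them with the most frequent value, which is the strategy the pipelines later in this notebook use. The `toy` frame and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for the real data (illustrative values only)
toy = pd.DataFrame(
    {
        "workclass": ["Private", np.nan, "Private", "State-gov"],
        "occupation": ["Sales", "Sales", np.nan, "Tech-support"],
    },
    dtype=object,
)

# Fraction of missing entries in each column
missing_share = toy.isnull().mean()

# Replace each missing entry with the most frequent value of its column
imputer = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(missing_share)
print(filled)
```

Most-frequent imputation keeps the columns categorical, at the cost of slightly inflating the dominant category.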


In [5]:
# Look at some example rows
df.head()

| | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K |
| 4 | 18 | NaN | 103497 | Some-college | 10 | Never-married | NaN | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |


In [6]:
# Change 'workclass' to any other feature you'd like to inspect.
print(df['workclass'].value_counts())

workclass
Private             33906
Self-emp-not-inc     3862
Local-gov            3136
State-gov            1981
Self-emp-inc         1695
Federal-gov          1432
Without-pay            21
Never-worked           10
Name: count, dtype: int64


## Definition of the Data Transformation Pipeline

The schema below shows the data transformation pipeline we want to build. Columns that do not appear in the schema are dropped.

![Data transformation Pipeline](./images/data-transformation-pipeline.jpg)

In this project, we use four custom transformers to handle specific feature engineering tasks. These transformers keep our preprocessing pipeline clean, modular, and reusable.

Below is an explanation of each custom transformer and its purpose.

### Custom Transformers in the Project

#### 1. `MaritalStatusBinarizer`

Converts the **marital-status** feature into a binary feature (1 for married, 0 for not married) to simplify analysis and modeling.

In [7]:
def binarize_marital_status(X):
    mapping = {
        'Married-civ-spouse': 1,
        'Married-spouse-absent': 1,
        'Married-AF-spouse': 1,
        'Never-married': 0,
        'Divorced': 0,
        'Separated': 0,
        'Widowed': 0
    }

    # Flatten the single-column DataFrame into a 1D NumPy array
    X = X.values.ravel()

    # Unknown or missing values fall back to 0 (not married)
    return np.array([[mapping.get(value, 0)] for value in X])

# Pipeline for Marital Status
pipeline_marital_status = Pipeline([
    ('marital-status-binarizer', FunctionTransformer(binarize_marital_status, validate=False))
])
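The heading names a `MaritalStatusBinarizer`, while the cell above wraps a plain function in a `FunctionTransformer`. For comparison, here is a sketch of what a class-based equivalent could look like. This class is hypothetical and not part of the project code; it only mirrors the mapping above on a small made-up frame.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class MaritalStatusBinarizer(BaseEstimator, TransformerMixin):
    """Hypothetical class-based variant of binarize_marital_status."""

    MARRIED = ("Married-civ-spouse", "Married-spouse-absent", "Married-AF-spouse")

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn from the data
        return self

    def transform(self, X):
        values = np.asarray(X).ravel()
        # 1 for the married categories, 0 for everything else (including unseen values)
        return np.array([[1 if value in self.MARRIED else 0] for value in values])


toy = pd.DataFrame(
    {"marital-status": ["Married-civ-spouse", "Divorced", "Widowed", "Married-AF-spouse"]}
)
encoded = MaritalStatusBinarizer().fit_transform(toy)
print(encoded.ravel())
```

Either form works inside a `Pipeline`; the function-based version is shorter, while a class makes it easier to add parameters later.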


#### 2. `CombineCapitalFeatures`

Combines the **capital-gain** and **capital-loss** features into a single **net-capital** feature.

In [8]:
class CombineCapitalFeatures(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Convert X into a NumPy array
        X = np.asarray(X)
        
        # Compute net-capital (difference between gain and loss)
        net_capital = X[:, 0] - X[:, 1]
        
        # Return as a 2D array (Scikit-learn expects this format)
        return net_capital.reshape(-1, 1)

# Pipeline for combining and scaling capital features
pipeline_capital_features = Pipeline([
    ('combine_capital', CombineCapitalFeatures()),
    ('scaler', StandardScaler()),
])
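To make the arithmetic concrete, the sketch below computes the net capital for a few made-up rows and then standardizes it, mirroring the two steps of `pipeline_capital_features`. The input values are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: capital-gain, capital-loss (illustrative rows)
capital = np.array([
    [7688, 0],
    [0, 0],
    [0, 1902],
    [0, 0],
])

# Net capital = gain - loss, kept as a single 2D column
net_capital = (capital[:, 0] - capital[:, 1]).reshape(-1, 1)

# Standardize to zero mean and unit variance
scaled = StandardScaler().fit_transform(net_capital)

print(net_capital.ravel())
print(scaled.ravel())
```

A loss shows up as a negative net value, so a single column carries the information of both original features.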

#### 3. `HoursBinner`

Bins the **hours-per-week** feature into three categories: Part-time (fewer than 30 hours), Full-time (30 to 49 hours), and Overtime (50 hours or more).

In [9]:
class HoursBinner(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = pd.Series(X.values.ravel())
        
        # Define bins and labels
        bins = [0, 30, 50, float('inf')]  # With right=False: [0, 30), [30, 50), [50, inf)
        labels = ['Part-time', 'Full-time', 'Overtime']
        
        # Apply binning
        binned = pd.cut(X, bins=bins, labels=labels, right=False)
        
        # Return as a 2D array
        return np.array(binned).reshape(-1, 1)

# Pipeline for Hours Binning
pipeline_hours_binner = Pipeline([
    ('binning', HoursBinner()),
    ('onehot', OneHotEncoder())
])
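Because `pd.cut` is called with `right=False`, each bin includes its left edge and excludes its right edge. The short check below, run on made-up hour values, shows on which side the boundary values 30 and 50 land.

```python
import pandas as pd

# Illustrative values, including the two boundaries
hours = pd.Series([10, 29, 30, 40, 49, 50, 99])

binned = pd.cut(
    hours,
    bins=[0, 30, 50, float("inf")],
    labels=["Part-time", "Full-time", "Overtime"],
    right=False,  # intervals are closed on the left: [0, 30), [30, 50), [50, inf)
)

for value, label in zip(hours, binned):
    print(value, label)
```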


#### 4. `CountryGrouper`

Groups the **native-country** feature into broader regions (e.g., North America, Europe) to reduce the number of categories and simplify the model.

In [10]:
class CountryGrouper(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Flatten X to a 1D array for processing
        X_flat = np.ravel(X)

        # Apply mapping with a default value of "Other" for unknown countries
        grouped = [country_mapping.get(country, 'Other') for country in X_flat]

        # Return as a 2D array
        return np.array(grouped).reshape(-1, 1)

# Pipeline for Country Grouping and Encoding
pipeline_country = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('group', CountryGrouper()),
    ('onehot', OneHotEncoder(drop='first', handle_unknown='ignore'))
])
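The actual `country_mapping` lives in `utils/constants.py` and is not shown here. The sketch below uses a small hypothetical mapping (`example_mapping`) to illustrate the grouping and the effect of `drop='first'` in the encoder; the real region names and assignments may differ.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical mapping, for illustration only
example_mapping = {
    "United-States": "North America",
    "Canada": "North America",
    "Germany": "Europe",
    "Italy": "Europe",
}

countries = ["United-States", "Italy", "Canada", "Atlantis"]

# Countries missing from the mapping fall back to "Other"
grouped = np.array([example_mapping.get(c, "Other") for c in countries]).reshape(-1, 1)

# drop="first" removes the first category (alphabetically) to avoid a redundant column
encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(grouped).toarray()

print(encoder.categories_[0])
print(encoded)
```

With three regions and the first one dropped, each row is encoded with two columns; a row of zeros stands for the dropped category.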

### Column Transformer

We finish by creating the pipelines for the remaining features that have more than one transformation and defining the Column Transformer for **all** the features.

In [11]:
# Pipelines for the remaining features
pipeline_workclass_occupation = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine all pipelines in a ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('scaler', StandardScaler(), ['age', 'educational-num']),
        ('workclass_occupation', pipeline_workclass_occupation, ['workclass', 'occupation']),
        ('marital_status', pipeline_marital_status, ['marital-status']),
        ('onehot', OneHotEncoder(handle_unknown='ignore'), ['relationship', 'race', 'gender']),
        ('capital', pipeline_capital_features, ['capital-gain', 'capital-loss']),
        ('hours_per_week', pipeline_hours_binner, ['hours-per-week']),
        ('native_country', pipeline_country, ['native-country']),
    ],
    remainder='drop',  # Columns not listed above (fnlwgt, education) are dropped
    verbose_feature_names_out=False,
    sparse_threshold=0,  # Dense matrices instead of sparse
)

## Training and test split

Now we can split the entire dataset into training and test sets, holding out 20% of the instances for testing. First, we binarize the income target into 0/1 values for readability, with `>50K` as the positive class (1).

In [12]:
lb = LabelBinarizer()
df['income'] = lb.fit_transform(df['income']).ravel()  # Flatten the (n, 1) output to 1D

In [13]:
X = df.drop(columns=['income'])
y = df['income']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
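The split above is purely random. Since the target classes are imbalanced, one possible variant is to pass `stratify=y` so that both subsets keep the same class proportions. This is an optional alternative, not what the notebook does; the sketch below demonstrates the effect on synthetic labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 25 positives out of 100 (illustrative)
y_toy = np.array([1] * 25 + [0] * 75)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both subsets keep the 25% positive rate
print(y_tr.mean(), y_te.mean())
```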

## Model Selection


In [18]:
# Define the base pipeline
model_pipeline = IMBPipeline([  # imbalanced-learn pipeline: allows a sampler step such as SMOTE
    ('preprocessing', preprocessor),
    ('sampler', SMOTE()),
    ('dim_reduction', PCA(n_components=0.8)),
    ('classifier', Perceptron())
])
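Passing a float between 0 and 1 as `n_components` tells `PCA` to keep the smallest number of components whose cumulative explained variance reaches that fraction. The sketch below illustrates this on synthetic data with deliberately uneven feature variances; the data and variable names are assumptions for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# 200 samples, 5 features with strongly decreasing spread (illustrative)
X_toy = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

pca = PCA(n_components=0.8)  # keep enough components to explain at least 80% of the variance
X_reduced = pca.fit_transform(X_toy)
explained = pca.explained_variance_ratio_.sum()

print("Components kept:", pca.n_components_)
print("Explained variance:", round(explained, 3))
```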

In [19]:
# Combine all configurations
import itertools

all_configs = [
    dict(itertools.chain(*(config.items() for config in combination)))
    for combination in itertools.product(sampler_configs, dim_reduction_configs, classifier_configs)
]

print(f"Number of configurations: {len(all_configs)}")


Number of configurations: 30
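The contents of `sampler_configs`, `dim_reduction_configs` and `classifier_configs` are defined in `utils/configs.py` and are not shown here. The sketch below uses small hypothetical lists, with strings standing in for estimator instances, to illustrate how the Cartesian product merges one dictionary from each list into a single search-space entry.

```python
import itertools

# Hypothetical stand-ins: in the real configs the values are estimator objects
sampler_cfgs = [
    {"sampler": [None]},
    {"sampler": ["smote"], "sampler__k_neighbors": [3, 5]},
]
dim_cfgs = [
    {"dim_reduction": [None]},
    {"dim_reduction": ["pca"], "dim_reduction__n_components": [0.8, 0.9]},
]
clf_cfgs = [
    {"classifier": ["perceptron"]},
    {"classifier": ["logreg"]},
    {"classifier": ["forest"]},
]

# One merged dict per (sampler, dim_reduction, classifier) combination
combined = [
    dict(itertools.chain(*(cfg.items() for cfg in combo)))
    for combo in itertools.product(sampler_cfgs, dim_cfgs, clf_cfgs)
]

print(len(combined))  # 2 * 2 * 3 combinations
print(combined[0])
```

The number of combinations is the product of the three list lengths, so the count grows quickly as options are added.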


In [None]:
# RandomizedSearchCV
rs = RandomizedSearchCV(
    model_pipeline,
    param_distributions=all_configs,
    n_iter=50,  # Adjust the number of iterations
    scoring='f1',
    n_jobs=-1,
    cv=2,
    random_state=42
)

# Fit the RandomizedSearchCV
rs.fit(X_train, y_train)

# Inspect the best configuration
print("Best Params:", rs.best_params_)
print("Best F1 Score:", rs.best_score_)
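`RandomizedSearchCV` accepts a list of dictionaries as `param_distributions`, which makes it possible to swap an entire pipeline step and tune step-specific parameters in the same search. The following is a reduced sketch on synthetic data, using only scikit-learn and a two-entry search space; the estimators and parameter values are illustrative and are not the project's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced data (illustrative)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.75, 0.25], random_state=42
)

toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", Perceptron()),  # placeholder, replaced by the search space
])

# Each dict swaps the classifier step and lists its own hyperparameters
toy_space = [
    {"classifier": [Perceptron(random_state=42)], "classifier__eta0": [0.01, 0.1, 1.0]},
    {"classifier": [LogisticRegression(max_iter=1000)], "classifier__C": [0.1, 1.0, 10.0]},
]

toy_search = RandomizedSearchCV(
    toy_pipeline,
    param_distributions=toy_space,
    n_iter=4,
    scoring="f1",
    cv=3,
    random_state=42,
)
toy_search.fit(X_toy, y_toy)

best_step = type(toy_search.best_estimator_.named_steps["classifier"]).__name__
print("Best classifier:", best_step)
print("Best F1:", toy_search.best_score_)
```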


In [None]:

# Outer loop with cross-validation
scores = cross_validate(
    rs,
    X_train,
    y_train,
    cv=5,
    scoring='f1',
    return_estimator=True
)

# Inspect results
for index, estimator in enumerate(scores['estimator']):
    print(f"Fold {index + 1}:")
    print("Sampler:", estimator.best_estimator_.get_params()['sampler'])
    print("Dim Reduction:", estimator.best_estimator_.get_params()['dim_reduction'])
    print("Classifier:", estimator.best_estimator_.get_params()['classifier'])
    print("Validation F1:", scores['test_score'][index])
    print('-' * 10)
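Passing a search object to `cross_validate` gives nested cross-validation: each outer fold runs its own inner search on the outer training portion, and the selected model is then scored on the outer validation portion. The sketch below shows the mechanism on synthetic data with a deliberately small search; all names and values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, cross_validate

X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.75, 0.25], random_state=42
)

# Inner loop: hyperparameter search
inner_search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": [0.01, 0.1, 1.0, 10.0]},
    n_iter=3,
    scoring="f1",
    cv=2,
    random_state=42,
)

# Outer loop: estimate how well the whole selection procedure generalizes
outer = cross_validate(
    inner_search, X_toy, y_toy, cv=5, scoring="f1", return_estimator=True
)

# Each outer fold may select a different configuration
chosen_C = [est.best_params_["C"] for est in outer["estimator"]]
print("Selected C per fold:", chosen_C)
print("Outer F1 per fold:", outer["test_score"])
```

If the folds select similar configurations and reach similar scores, the selection procedure is reasonably stable.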

In [2257]:

for estimator in scores['estimator']:
    best_model = estimator.best_estimator_
    best_model.fit(X_train, y_train)
    
    # Predictions
    train_pred = best_model.predict(X_train)
    test_pred = best_model.predict(X_test)
    
    # F1 Scores
    f1_train = f1_score(y_train, train_pred)
    f1_test = f1_score(y_test, test_pred)
    
    print(f"F1 on training set: {f1_train:.3f}")
    print(f"F1 on test set: {f1_test:.3f}")
    print('-' * 10)


F1 on training set: 0.718
F1 on test set: 0.676
----------
F1 on training set: 0.718
F1 on test set: 0.679
----------
F1 on training set: 0.716
F1 on test set: 0.677
----------
F1 on training set: 0.717
F1 on test set: 0.681
----------
F1 on training set: 0.716
F1 on test set: 0.676
----------
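For reference, the F1 score reported above is the harmonic mean of precision and recall on the positive class. The short example below recomputes it by hand on made-up labels and compares the result with `f1_score`.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Illustrative labels: 3 true positives, 1 false negative, 1 false positive
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

# Harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)
f1_sklearn = f1_score(y_true, y_pred)

print(precision, recall, f1_manual, f1_sklearn)
```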
