In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler


In [2]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
    "age","workclass","fnlwgt","education","education-num",
    "marital-status","occupation","relationship","race","sex",
    "capital-gain","capital-loss","hours-per-week","native-country","income"
]
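# Note: skipinitialspace=True strips the leading space before NA matching, so " ?" never
# matches and the "?" placeholders stay in the data as an ordinary category.
df = pd.read_csv(url, names=columns, na_values=" ?", skipinitialspace=True)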

In [3]:
df.head()

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [4]:
categorical_cols = df.select_dtypes(include=['object']).columns
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns

categorical_cols, numerical_cols

(Index(['workclass', 'education', 'marital-status', 'occupation',
        'relationship', 'race', 'sex', 'native-country', 'income'],
       dtype='object'),
 Index(['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss',
        'hours-per-week'],
       dtype='object'))

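The dataset contains both categorical and numerical features: nine categorical columns (including the target, income) and six numerical ones. Categorical features must be encoded as numbers before most models can use them, and numerical features benefit from scaling when the model is sensitive to feature magnitude.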
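Before encoding the categorical columns, it can help to check how often the "?" missing-value placeholder appears in them. The following is a minimal sketch on a small made-up frame (the rows and the names toy, placeholder_counts, and cleaned are illustrative assumptions, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the "?" placeholder used in the Adult data
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov", "Private"],
    "occupation": ["Sales", "?", "?", "Tech-support"],
    "age": [39, 50, 38, 53],
})

# Treat every non-numeric column as categorical
categorical = toy.select_dtypes(exclude="number").columns

# Count placeholder entries per categorical column
placeholder_counts = (toy[categorical] == "?").sum()
print(placeholder_counts)

# One option: turn the placeholders into real missing values
cleaned = toy.replace("?", np.nan)
print(cleaned.isna().sum())
```

Whether to keep "?" as its own category or convert it to a missing value is a modeling decision; the sketch only makes the choice visible.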

In [5]:
le = LabelEncoder()
df['income'] = le.fit_transform(df['income'])

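Label encoding was applied to the target variable because it is a binary class label, not because the classes are ordered. LabelEncoder assigns integer codes in sorted label order, so <=50K becomes 0 and >50K becomes 1.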
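The mapping that LabelEncoder produces can be inspected through its classes_ attribute. A small sketch, assuming a toy series with the two income labels (the variable names are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy target column using the two labels found in the income column
income = pd.Series(["<=50K", ">50K", "<=50K", ">50K", "<=50K"])

le = LabelEncoder()
encoded = le.fit_transform(income)

# classes_ is sorted; each label's position is its integer code
label_to_code = {label: code for code, label in enumerate(le.classes_)}
print(label_to_code)

# inverse_transform recovers the original strings from the codes
decoded = le.inverse_transform(encoded)
print(list(decoded))
```

Keeping the fitted encoder around makes it possible to translate model predictions back into the original labels.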

In [6]:
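# One-hot encode every categorical feature except the target; dtype=int keeps the dummies as 0/1 integers
df_encoded = pd.get_dummies(df, columns=categorical_cols.drop('income'), drop_first=True, dtype=int)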

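One-hot encoding was applied to the categorical features so that no artificial ordering is imposed on their categories. With drop_first=True, the first level of each feature is dropped, since it is implied when all of that feature's indicator columns are 0. Note that education does have a natural order, which education-num already captures as an integer.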
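To see what drop_first=True does, here is a minimal sketch on a three-row toy frame (the rows are made up; only the column names come from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male"],
    "race": ["White", "Black", "Asian-Pac-Islander"],
    "age": [39, 50, 38],
})

# Each categorical column becomes (number of levels - 1) indicator columns;
# the alphabetically first level is the one that gets dropped
encoded = pd.get_dummies(toy, columns=["sex", "race"], drop_first=True, dtype=int)
print(encoded)
```

Here sex yields one column (sex_Male) and race yields two (race_Black, race_White); a row with zeros in both race columns belongs to the dropped level, Asian-Pac-Islander.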

In [7]:
scaler = StandardScaler()
df_encoded[numerical_cols] = scaler.fit_transform(df_encoded[numerical_cols])

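Numerical features were standardized with StandardScaler to zero mean and unit variance, so that features with large ranges such as fnlwgt and capital-gain do not dominate scale-sensitive models. The scaler here is fit on the full dataset; when a model is evaluated on held-out data, it should be fit on the training split only to avoid leaking test-set statistics.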
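If the data is later split for evaluation, one common pattern is to fit the scaler on the training rows and reuse it on the test rows. A sketch under that assumption, using randomly generated stand-in values (the split ratio, seed, and data are illustrative, not from the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random stand-in values for two of the numerical columns
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(17, 90, size=200),
    "hours-per-week": rng.integers(1, 99, size=200),
})

train, test = train_test_split(toy, test_size=0.25, random_state=0)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean and std from training rows only
test_scaled = scaler.transform(test)        # reuse the training statistics on the test rows

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```

The training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately standardized, which is the expected behavior.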

In [8]:
df.shape, df_encoded.shape

((32561, 15), (32561, 101))

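After encoding and scaling, the dataset grows from 15 to 101 columns (100 features plus the income target), because one-hot encoding creates one indicator column per category level, minus the dropped first level. The result is fully numerical, so it can be passed directly to a model.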
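A quick sanity check can confirm that a frame really is fully numerical and free of missing values. The helper below is a hypothetical sketch (readiness_report is not part of the notebook or of pandas), shown on two made-up frames:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

def readiness_report(frame):
    """Hypothetical helper: list non-numeric columns and count missing values."""
    non_numeric = [col for col in frame.columns if not is_numeric_dtype(frame[col])]
    missing = int(frame.isna().sum().sum())
    return {"non_numeric": non_numeric, "missing": missing}

ready = pd.DataFrame({"age": [0.3, -1.1], "sex_Male": [1, 0], "income": [0, 1]})
not_ready = pd.DataFrame({"age": [0.3, np.nan], "workclass": ["Private", "State-gov"]})

print(readiness_report(ready))
print(readiness_report(not_ready))
```

Calling readiness_report(df_encoded) in the notebook would apply the same check to the processed data.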

In [9]:
df_encoded.to_csv("processed_adult_income.csv", index=False)

The fully preprocessed dataset was saved for future machine learning tasks.
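A later notebook could reload the saved file and separate features from the target. The sketch below uses a tiny made-up frame and a temporary file instead of processed_adult_income.csv, so it runs on its own (the file name processed_toy.csv and the X and y names are assumptions):

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for the processed dataset
toy = pd.DataFrame({
    "age": [0.1, -1.2, 0.7],
    "sex_Male": [1, 0, 1],
    "income": [0, 1, 0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "processed_toy.csv")
    toy.to_csv(path, index=False)  # index=False avoids an extra index column on reload
    reloaded = pd.read_csv(path)

# Separate the feature matrix from the encoded target
X = reloaded.drop(columns="income")
y = reloaded["income"]
print(X.shape, y.shape)
```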