# **Final Project Task 1 - Census Data Preprocess**

Requirements

- Target variable specification:
    - The target variable for this project is hours-per-week. 
    - Ensure all preprocessing steps are designed to support regression analysis on this target variable.
- Encode data  **3p**
- Handle missing values if any **1p**
- Correct errors, inconsistencies, remove duplicates if any **1p**
- Outlier detection and treatment if any **1p**
- Normalization / Standardization if necesarry **1p**
- Feature engineering **3p**
- Train test split, save it.
- Others?


Deliverable:

- Notebook code with no errors.
- Preprocessed data as csv.

In [30]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

In [31]:
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
    "hours-per-week", "native-country", "income"
]

data = pd.read_csv(data_url, header=None, names=columns, na_values=" ?", skipinitialspace=True)
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [32]:
# Drop exact duplicates
data.drop_duplicates(inplace=True)
print(f"Shape after dropping duplicates: {data.shape}")

# Drop unnecessary columns
# 'fnlwgt' is a sampling weight not useful for regression prediction in this context.
data.drop(columns=['fnlwgt'], inplace=True)

Shape after dropping duplicates: (32537, 15)


In [33]:
# 3. Handle Missing Values

# Check for missing values
print("\nMissing values per column:\n", data.isnull().sum())
for col in ['workclass', 'occupation', 'native-country']:
    data[col] = data[col].fillna(data[col].mode()[0])

print("Missing values after imputation:\n", data.isnull().sum())


Missing values per column:
 age               0
workclass         0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income            0
dtype: int64
Missing values after imputation:
 age               0
workclass         0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income            0
dtype: int64


In [34]:
# 4. Outlier Detection and Treatment  
# Focusing on 'age', 'capital-gain', 'capital-loss'.
# 'hours-per-week' is the target

def cap_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Cap values
    data[column] = np.where(data[column] < lower_bound, lower_bound, data[column])
    data[column] = np.where(data[column] > upper_bound, upper_bound, data[column])
    return data
data = cap_outliers_iqr(data, 'age')

In [35]:
# 5. Feature Engineering  

# A. Recode 'income' to binary 0/1 (It's a feature now, not target)
data['income_binary'] = data['income'].apply(lambda x: 1 if x == '>50K' else 0)
data.drop('income', axis=1, inplace=True)

# B. Net Capital Change
data['net_capital'] = data['capital-gain'] - data['capital-loss']
# Keeping 'net_capital' and drop the others to reduce dimensions.
data.drop(['capital-gain', 'capital-loss'], axis=1, inplace=True)

# C. Grouping Marital Status (Simplification)
data['marital_status_grouped'] = data['marital-status'].replace({
    'Married-civ-spouse': 'Married', 'Married-AF-spouse': 'Married', 'Married-spouse-absent': 'Married',
    'Divorced': 'Not-Married', 'Never-married': 'Not-Married', 'Separated': 'Not-Married', 'Widowed': 'Not-Married'
})
data.drop('marital-status', axis=1, inplace=True)

In [36]:
# 6. Encoding & Normalization/Standardization  

# Define features (X) and target (y)
X = data.drop('hours-per-week', axis=1)
y = data['hours-per-week']

# Identify column types
numeric_features = ['age', 'education-num', 'net_capital', 'income_binary']
categorical_features = ['workclass', 'occupation', 'relationship', 'race', 'sex', 'native-country', 'marital_status_grouped']

# Create a column transformer
# Standardization: StandardScaler for numeric features (Mean=0, Std=1)
# Encoding: OneHotEncoder for categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_features)
    ])

# Fit and transform the data
X_processed = preprocessor.fit_transform(X)

# Get feature names after OneHotEncoding for clarity
ohe_feature_names = preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_features)
all_feature_names = numeric_features + list(ohe_feature_names)

# Convert back to DataFrame for better visibility/saving
X_final = pd.DataFrame(X_processed, columns=all_feature_names)
y_final = y.reset_index(drop=True)

In [37]:
# 7. Train Test Split and save  

X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.2, random_state=42)

print(f"\nTraining Set Shape: {X_train.shape}")
print(f"Test Set Shape: {X_test.shape}")

# Combine X and y for the final CSV export 
train_export = pd.concat([X_train, y_train], axis=1)
test_export = pd.concat([X_test, y_test], axis=1)

# Save to CSV
train_export.to_csv('census_preprocessed_train.csv', index=False)
test_export.to_csv('census_preprocessed_test.csv', index=False)

print("\nFiles 'census_preprocessed_train.csv' and 'census_preprocessed_test.csv' saved successfully.")
train_df = pd.read_csv('census_preprocessed_train.csv')
test_df = pd.read_csv('census_preprocessed_test.csv')


Training Set Shape: (26029, 85)
Test Set Shape: (6508, 85)

Files 'census_preprocessed_train.csv' and 'census_preprocessed_test.csv' saved successfully.


In [38]:
train_df = pd.read_csv('census_preprocessed_train.csv')
test_df = pd.read_csv('census_preprocessed_test.csv')
train_df.head()
test_df.head()

Unnamed: 0,age,education-num,net_capital,income_binary,workclass_?,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,workclass_Self-emp-inc,...,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia,marital_status_grouped_Married,marital_status_grouped_Not-Married,hours-per-week
0,-0.557732,0.357049,-0.13372,-0.563377,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,40
1,-1.07416,1.134777,-0.13372,-0.563377,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,40
2,1.581757,-0.031815,-0.13372,-0.563377,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,60
3,-0.557732,-0.031815,-0.13372,-0.563377,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,40
4,0.327574,1.134777,-0.13372,1.775009,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,40


# Project Report: Census Data Preprocessing for Regression
Objective:
To prepare the "Adult" (Census) dataset for a regression analysis where the target variable is hours-per-week. The goal was to transform raw demographic data into a clean, numerical format suitable for machine learning algorithms.

Data cleaning and integrity

Missing Value Handling: We identified missing values represented as " ?" and converted them to NaN. Since the missing data existed in categorical columns (workclass, occupation, native-country), we imputed them using the Mode (most frequent value). This ensured no data rows were lost while maintaining statistical consistency.

Error Correction: Redundant columns like fnlwgt (statistcal weight irrelevant for prediction) were removed to reduce noise.

Outlier Treatment

Method: We applied the Interquartile Range (IQR) method to the age feature.

Logic: Values falling below $Q1 - 1.5 \times IQR$ or above $Q3 + 1.5 \times IQR$ were capped at those boundaries. This reduces the influence of extreme ages on the regression model without deleting valid data points.

Target Variable: we intentionally preserved outliers in hours-per-week (the target) to ensure the model learns to predict valid real-world extremes (e.g., 80-hour work weeks).

Feature Engineering

We created new features to improve predictive power:

net_capital: Combined capital-gain and capital-loss into a single feature (Gain - Loss) to capture the net financial status while reducing dimensionality.

income_binary: Converted the original income target into a binary feature (0 or 1). This serves as a strong predictor for hours worked.
marital_status_grouped: Simplified complex marital statuses into Married vs. Not-Married to help the model find clearer patterns.

Encoding and normalization

One-Hot Encoding: Categorical variables (e.g., race, occupation) were transformed into binary columns (0s and 1s). This allows the model to interpret non-numerical data.

Standardization (StandardScaler): Numerical features were scaled to have a mean of 0 and a standard deviation of 1.
 0.0 represents the average value. Negative values (e.g., -1.2) indicate the value is below the average. Positive values (e.g., +1.5) indicate the value is above the average
