# **Final Project Task 1 - Census Data Preprocess**

Requirements

- Target variable specification:
    - The target variable for this project is hours-per-week. 
    - Ensure all preprocessing steps are designed to support regression analysis on this target variable.
- Encode data  **3p**
- Handle missing values if any **1p**
- Correct errors, inconsistencies, remove duplicates if any **1p**
- Outlier detection and treatment if any **1p**
- Normalization / Standardization if necessary **1p**
- Feature engineering **3p**
- Train test split, save it.
- Others?


Deliverable:

- Notebook code with no errors.
- Preprocessed data as csv.

In [15]:
import pandas as pd
import numpy as np

In [2]:
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
    "hours-per-week", "native-country", "income"
]

data = pd.read_csv(data_url, header=None, names=columns, na_values=" ?", skipinitialspace=True)
data.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [4]:
data.describe()

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| count | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 |
| mean | 38.581647 | 189778.4 | 10.080679 | 1077.648844 | 87.30383 | 40.437456 |
| std | 13.640433 | 105550.0 | 2.57272 | 7385.292085 | 402.960219 | 12.347429 |
| min | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 117827.0 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 178356.0 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 237051.0 | 12.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 1484705.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |


In [6]:
data.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income            0
dtype: int64

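#### No missing values were detected by `isnull()`
This does not mean the data is complete. Missing entries are stored as the string `?`, and `na_values=" ?"` does not match them because `skipinitialspace=True` strips the leading space before the comparison. The `native-country` column, for example, contains 582 such entries after deduplication; they are replaced with `NaN` and imputed further down.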
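Because `isnull()` only counts true `NaN` values, a placeholder string has to be counted explicitly. The following is a minimal sketch on a hypothetical toy frame (the rows and the `text_cols` list are made up for illustration, not taken from the census file) showing one way to do that:

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only.
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov", "?"],
    "occupation": ["Sales", "?", "Adm-clerical", "Sales"],
    "native-country": ["United-States", "Cuba", "?", "United-States"],
    "age": [39, 50, 38, 53],
})

# isnull() reports nothing, because "?" is an ordinary string.
null_counts = toy.isnull().sum()

# Count the placeholder explicitly in each text column.
text_cols = ["workclass", "occupation", "native-country"]
placeholder_counts = (toy[text_cols] == "?").sum()

print(null_counts.sum())
print(placeholder_counts)
```

Running the same comparison on every text column of the real data would show whether columns other than `native-country` also carry the placeholder.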

In [7]:
data.duplicated().sum()

np.int64(24)

In [8]:
data = data.drop_duplicates()

In [11]:
data.duplicated().sum()

np.int64(0)

In [12]:
X = data.drop(columns=["hours-per-week"])
y = data["hours-per-week"]

In [13]:
X = data.drop(columns=["hours-per-week", "income"])
y = data["hours-per-week"]

In [16]:
for col in X.select_dtypes(include="object").columns:
    print(col, X[col].unique()[:5])

workclass ['State-gov' 'Self-emp-not-inc' 'Private' 'Federal-gov' 'Local-gov']
education ['Bachelors' 'HS-grad' '11th' 'Masters' '9th']
marital-status ['Never-married' 'Married-civ-spouse' 'Divorced' 'Married-spouse-absent'
 'Separated']
occupation ['Adm-clerical' 'Exec-managerial' 'Handlers-cleaners' 'Prof-specialty'
 'Other-service']
relationship ['Not-in-family' 'Husband' 'Wife' 'Own-child' 'Unmarried']
race ['White' 'Black' 'Asian-Pac-Islander' 'Amer-Indian-Eskimo' 'Other']
sex ['Male' 'Female']
native-country ['United-States' 'Cuba' 'Jamaica' 'India' '?']


In [17]:
X["native-country"] = X["native-country"].replace("?", np.nan)

In [18]:
X["native-country"].isnull().sum()

np.int64(582)

In [19]:
from sklearn.impute import SimpleImputer
 
country_imputer = SimpleImputer(strategy="most_frequent")
X["native-country"] = country_imputer.fit_transform(
    X[["native-country"]]
).ravel()

In [20]:
y.describe()

count    32537.000000
mean        40.440329
std         12.346889
min          1.000000
25%         40.000000
50%         40.000000
75%         45.000000
max         99.000000
Name: hours-per-week, dtype: float64

In [21]:
Q1 = y.quantile(0.25)
Q3 = y.quantile(0.75)
IQR = Q3 - Q1
 
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
 
lower_bound, upper_bound

(np.float64(32.5), np.float64(52.5))

In [22]:
outliers = y[(y < lower_bound) | (y > upper_bound)]
outliers.shape[0]

9002

In [23]:
# Cap the target at the IQR bounds; Series.clip keeps the index and name.
y = y.clip(lower=lower_bound, upper=upper_bound)

### Outlier Detection and Treatment
Outliers in the target variable (`hours-per-week`) were identified using the Interquartile Range (IQR) method. With Q1 = 40 and Q3 = 45 the IQR is only 5 hours, so the bounds are 32.5 and 52.5, and 9002 of the 32537 observations (about 28%) fall outside them. These values were capped to the lower and upper bounds to reduce their influence on the regression model while preserving all observations.
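The IQR rule described above can be sketched as a small helper. This is an illustrative version on hypothetical toy values; the function name `iqr_bounds` and the multiplier argument `k` are assumptions, not part of the notebook:

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) bounds at k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)

# Hypothetical weekly hours, chosen so the quartiles are 40 and 45.
hours = pd.Series([20, 40, 40, 40, 40, 45, 45, 60])

lower, upper = iqr_bounds(hours)

# Flag values outside the bounds, then cap them instead of dropping rows.
outside = (hours < lower) | (hours > upper)
capped = hours.clip(lower=lower, upper=upper)

print(lower, upper, int(outside.sum()))
```

Checking `outside.mean()` before clipping gives the share of rows that will be altered, which is a useful sanity check when the quartiles sit close together.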

In [24]:
X.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
native-country    object
dtype: object

In [25]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

In [26]:
categorical_cols = X.select_dtypes(include="object").columns
numeric_cols = X.select_dtypes(exclude="object").columns

In [27]:
encoder = ColumnTransformer(
    transformers=[
        (
            "cat",
            OneHotEncoder(handle_unknown="ignore", sparse_output=False),
            categorical_cols
        ),
        ("num", "passthrough", numeric_cols)
    ]
)
 

In [28]:
X_encoded = encoder.fit_transform(X)

In [29]:
X.shape, X_encoded.shape

((32537, 13), (32537, 106))

### Encoding
After encoding, the number of features increased from 13 to 106, reflecting the expansion of each categorical variable into one binary indicator column per category.

In [30]:
from sklearn.preprocessing import StandardScaler
 
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_encoded)

### Feature Scaling
Standardization was applied to the feature matrix so that all features have zero mean and unit variance. This is important for regression models that are sensitive to feature scales.
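The zero-mean, unit-variance property can be verified directly. The sketch below uses a hypothetical toy array and the hypothetical name `toy_scaler`; it also shows that the transformation is reversible:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix: two columns on very different scales.
toy = np.array([
    [17.0, 0.0],
    [37.0, 2174.0],
    [90.0, 99999.0],
    [48.0, 0.0],
])

toy_scaler = StandardScaler()
toy_scaled = toy_scaler.fit_transform(toy)

# StandardScaler uses the population standard deviation (ddof=0),
# which matches NumPy's default for ndarray.std().
col_means = toy_scaled.mean(axis=0)
col_stds = toy_scaled.std(axis=0)

# The fitted scaler can map scaled values back to the original units.
restored = toy_scaler.inverse_transform(toy_scaled)

print(np.round(col_means, 6), np.round(col_stds, 6))
```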

In [31]:
X_fe = X.copy()

In [32]:
X_fe["capital_balance"] = X_fe["capital-gain"] - X_fe["capital-loss"]

In [33]:
X_fe["age_squared"] = X_fe["age"] ** 2

In [34]:
X_fe["education_centered"] = X_fe["education-num"] - X_fe["education-num"].mean()

In [35]:
categorical_cols_fe = X_fe.select_dtypes(include="object").columns
numeric_cols_fe = X_fe.select_dtypes(exclude="object").columns

In [36]:
encoder_fe = ColumnTransformer(
    transformers=[
        (
            "cat",
            OneHotEncoder(handle_unknown="ignore", sparse_output=False),
            categorical_cols_fe
        ),
        ("num", "passthrough", numeric_cols_fe)
    ]
)

In [37]:
X_fe_encoded = encoder_fe.fit_transform(X_fe)

In [38]:
scaler_fe = StandardScaler()
X_fe_scaled = scaler_fe.fit_transform(X_fe_encoded)

### Feature Engineering
Additional features were created to capture non-linear relationships
and financial effects:
- Capital balance (capital-gain − capital-loss)
- Squared age to model non-linear age effects
- Centered education level to improve numerical stability
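The three engineered features can be checked on a small example. The sketch below uses a hypothetical toy frame with made-up values; it also checks how the centered column behaves once standardization is applied afterwards, which may be worth confirming on the real data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy rows; values are illustrative only.
toy = pd.DataFrame({
    "age": [25, 40, 60],
    "education-num": [9, 13, 16],
    "capital-gain": [0, 2174, 0],
    "capital-loss": [0, 0, 1902],
})

# Same three features as in the notebook.
toy["capital_balance"] = toy["capital-gain"] - toy["capital-loss"]
toy["age_squared"] = toy["age"] ** 2
toy["education_centered"] = toy["education-num"] - toy["education-num"].mean()

# Standardization subtracts the column mean itself, so a centered copy
# of a column ends up identical to the standardized original.
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)
same_after_scaling = np.allclose(scaled["education-num"], scaled["education_centered"])

print(toy["capital_balance"].tolist(), same_after_scaling)
```

If the two standardized columns coincide on the real data as they do here, a linear model would see them as duplicates, so one of them could be dropped without losing information.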

In [39]:
from sklearn.model_selection import train_test_split
 
X_train, X_test, y_train, y_test = train_test_split(
    X_fe_scaled,
    y,
    test_size=0.2,
    random_state=42
)

### Train–Test Split
The dataset was split into training (80%) and testing (20%) sets
to enable unbiased evaluation of regression models.

In [41]:
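# Keep the encoder's feature names as column headers so the CSVs are self-describing.
feature_names = encoder_fe.get_feature_names_out()

pd.DataFrame(X_train, columns=feature_names).to_csv("X_train_preprocessed.csv", index=False)
pd.DataFrame(X_test, columns=feature_names).to_csv("X_test_preprocessed.csv", index=False)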

In [42]:
y_train.to_csv("y_train.csv", index=False)
y_test.to_csv("y_test.csv", index=False)

## Final Output
- Preprocessed feature matrices were saved as CSV files
- Target variable files were saved separately
- All preprocessing steps were performed to support regression analysis
- The notebook runs end-to-end without errors
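A quick round-trip check can confirm that saved files reload with matching row counts and values. The sketch below writes hypothetical toy data to a temporary directory; the file names `X_demo.csv` and `y_demo.csv` and the column names `f0` and `f1` are placeholders, not the notebook's outputs:

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical toy features and target.
features = np.array([[0.5, -1.0], [1.5, 2.0]])
target = pd.Series([40.0, 32.5], name="hours-per-week")

with tempfile.TemporaryDirectory() as tmp:
    x_path = Path(tmp) / "X_demo.csv"
    y_path = Path(tmp) / "y_demo.csv"

    # Save without the index, as in the notebook.
    pd.DataFrame(features, columns=["f0", "f1"]).to_csv(x_path, index=False)
    target.to_csv(y_path, index=False)

    # Reload and compare against what was written.
    X_back = pd.read_csv(x_path)
    y_back = pd.read_csv(y_path)["hours-per-week"]

shapes_match = X_back.shape == features.shape
rows_aligned = len(X_back) == len(y_back)

print(shapes_match, rows_aligned)
```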