# **Final Project Task 1 - Census Data Preprocess**

Requirements

- Target variable specification:
    - The target variable for this project is hours-per-week. 
    - Ensure all preprocessing steps are designed to support regression analysis on this target variable.
- Encode data  **3p**
- Handle missing values if any **1p**
- Correct errors, inconsistencies, remove duplicates if any **1p**
- Outlier detection and treatment if any **1p**
- Normalization / Standardization if necessary **1p**
- Feature engineering **3p**
- Train test split, save it.
- Others?


Deliverable:

- Notebook code with no errors.
- Preprocessed data as csv.

## 1. Library imports and initial settings

The following libraries are imported:
- `pandas` and `numpy` for data manipulation
- `scikit-learn` for preprocessing and pipelines

Display options are set to make tables easier to read.

In [3]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

# Seed for reproducibility
RANDOM_STATE = 42

# Display options for readability
pd.set_option("display.max_columns", 100)
pd.set_option("display.width", 120)


## 2. Loading the dataset

In [2]:
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
    "hours-per-week", "native-country", "income"
]

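# With skipinitialspace=True the raw " ?" marker is read as "?", so na_values=" ?" does not match it.
data = pd.read_csv(data_url, header=None, names=columns, na_values=" ?", skipinitialspace=True)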
data.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


## 3. Initial data inspection

This step checks:
- the size of the dataset and the variable types
- the number of missing values (NaN) in each column
- whether duplicate observations exist

In [4]:
print("Shape (rows, columns):", data.shape)
display(data.dtypes)

# Missing values
missing = data.isna().sum().sort_values(ascending=False)
display(missing[missing > 0])

# Duplicates
dup = data.duplicated().sum()
print("Number of duplicate rows:", dup)

# Quick summary statistics for the regression target
display(data["hours-per-week"].describe())


Shape (rows, columns): (32561, 15)


age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
income            object
dtype: object

Series([], dtype: int64)

Number of duplicate rows: 24


count    32561.000000
mean        40.437456
std         12.347429
min          1.000000
25%         40.000000
50%         40.000000
75%         45.000000
max         99.000000
Name: hours-per-week, dtype: float64
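
The empty `Series` above means that no NaN values were detected. In `adult.data`, unknown entries are written as `?`, and because `skipinitialspace=True` strips the leading space, `na_values=" ?"` does not match them, so `?` survives as an ordinary category. A minimal sketch of how such placeholders could be counted and converted to NaN is shown below. It runs on a toy frame with illustrative values, not on the real data, and `toy` and `count_placeholders` are hypothetical names introduced only for this example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few rows of the census data (values are illustrative only).
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov", "?"],
    "occupation": ["Sales", "?", "Adm-clerical", "Tech-support"],
    "age": [25, 40, 31, 58],
})


def count_placeholders(frame: pd.DataFrame, marker: str = "?") -> pd.Series:
    """Count the cells equal to `marker` in each column."""
    # Numeric columns never equal the string marker, so they count as 0.
    return (frame == marker).sum()


placeholder_counts = count_placeholders(toy)

# Replace the marker with NaN so that isna() and SimpleImputer can see it.
cleaned = toy.replace("?", np.nan)
nan_counts = cleaned.isna().sum()

print(placeholder_counts)
print(nan_counts)
```

If a conversion like this were adopted, the `.astype(str)` call in the cleaning step would turn NaN into the string `"nan"`, so the whitespace stripping would have to be done with `.str.strip()` alone.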

## 4. Basic cleaning

The following steps are applied:
- removing duplicate rows (if any)
- dropping the `income` variable (the target is `hours-per-week`)
- stripping surrounding whitespace from categorical variables (for consistent encoding)


In [5]:
df = data.copy()

# 1) Remove duplicates
df = df.drop_duplicates()

# 2) Drop 'income' (not used in the regression)
df = df.drop(columns=["income"])

# 3) Strip whitespace from string (categorical) columns
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype(str).str.strip()

print("Shape after cleaning:", df.shape)
df.head()


Shape after cleaning: (32537, 14)


| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba |


## 5. Feature engineering

New variables are built that may help a regression model:
- `capital_total` = capital-gain − capital-loss
- `experience_estimate` = max(age − 18, 0) (a proxy for work experience)
- `has_capital` = 1 if either capital-gain or capital-loss is nonzero, otherwise 0

These variables are derived only from the predictors (X), not from the target.


In [6]:
df["capital_total"] = df["capital-gain"].fillna(0) - df["capital-loss"].fillna(0)
df["experience_estimate"] = (df["age"] - 18).clip(lower=0)
df["has_capital"] = ((df["capital-gain"].fillna(0) > 0) | (df["capital-loss"].fillna(0) > 0)).astype(int)

df[["capital-gain","capital-loss","capital_total","experience_estimate","has_capital"]].head()


| | capital-gain | capital-loss | capital_total | experience_estimate | has_capital |
|---|---|---|---|---|---|
| 0 | 2174 | 0 | 2174 | 21 | 1 |
| 1 | 0 | 0 | 0 | 32 | 0 |
| 2 | 0 | 0 | 0 | 20 | 0 |
| 3 | 0 | 0 | 0 | 35 | 0 |
| 4 | 0 | 0 | 0 | 10 | 0 |


## 6. Outlier detection & treatment (IQR)

The IQR rule is applied to the numeric variables:
- values outside the interval [Q1 − 1.5×IQR, Q3 + 1.5×IQR] are capped (winsorized)

The goal is to reduce the influence of extreme values on the regression.

In [7]:
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()

for col in num_cols:
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    df[col] = df[col].clip(lower, upper)

df[num_cols].describe().T.head(10)


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 32537.0 | 38.559855 | 13.554847 | 17.0 | 28.0 | 37.0 | 48.0 | 78.0 |
| fnlwgt | 32537.0 | 186824.961736 | 95118.115529 | 12285.0 | 117827.0 | 178356.0 | 236993.0 | 415742.0 |
| education-num | 32537.0 | 10.125165 | 2.459436 | 4.5 | 9.0 | 10.0 | 12.0 | 16.0 |
| capital-gain | 32537.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| capital-loss | 32537.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| hours-per-week | 32537.0 | 41.203246 | 6.187352 | 32.5 | 40.0 | 40.0 | 45.0 | 52.5 |
| capital_total | 32537.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| experience_estimate | 32537.0 | 20.571995 | 13.535966 | 0.0 | 10.0 | 19.0 | 30.0 | 60.0 |
| has_capital | 32537.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
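
The summary table above shows `capital-gain`, `capital-loss`, `capital_total` and `has_capital` collapsed to a constant 0. When at least 75% of a column is zero, Q1 = Q3 = 0, so the IQR is 0 and both fences equal 0. The target `hours-per-week` is also clipped, to [32.5, 52.5], and the fences are computed on the full data before the train/test split. One possible variant, sketched below on toy data, skips zero-IQR columns and computes the fences on the training rows only. The helper names `iqr_bounds` and `clip_with_train_bounds` are hypothetical, and the choice of columns to clip is an assumption left to the caller.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) IQR fences, or None when the IQR is zero."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    if iqr == 0:
        return None  # clipping would flatten the column to a single value
    return q1 - k * iqr, q3 + k * iqr


def clip_with_train_bounds(train, test, columns):
    """Clip `columns` in both frames using fences computed on `train` only."""
    train, test = train.copy(), test.copy()
    for col in columns:
        bounds = iqr_bounds(train[col])
        if bounds is None:
            continue  # leave zero-IQR columns untouched
        lower, upper = bounds
        train[col] = train[col].clip(lower, upper)
        test[col] = test[col].clip(lower, upper)
    return train, test


# Toy data: "gain" is mostly zero, so its IQR is 0 and it is left untouched.
toy_train = pd.DataFrame({"age": [20, 30, 40, 50, 95], "gain": [0, 0, 0, 0, 5000]})
toy_test = pd.DataFrame({"age": [18, 99], "gain": [0, 7000]})

clipped_train, clipped_test = clip_with_train_bounds(toy_train, toy_test, ["age", "gain"])
print(clipped_train)
print(clipped_test)
```

In this toy example the `age` fences are 0 and 80, so 95 and 99 are capped at 80, while `gain` keeps its original values.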


## 7. Defining the target variable and the train/test split

The target is `hours-per-week`.
The dataset is split into train (80%) and test (20%) before any transformer is fitted, to avoid **data leakage**.

In [8]:
TARGET = "hours-per-week"

X = df.drop(columns=[TARGET])
y = df[TARGET].astype(float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=RANDOM_STATE
)

print("Train:", X_train.shape, "Test:", X_test.shape)


Train: (26029, 16) Test: (6508, 16)


## 8. Preprocessing pipeline (encoding + standardization)

A `ColumnTransformer` is used:
- numeric: median imputation + StandardScaler
- categorical: most-frequent imputation + OneHotEncoder

The pipeline is fitted on the training set only and then applied to the test set.

In [9]:
numeric_features = X_train.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X_train.select_dtypes(exclude=[np.number]).columns.tolist()

numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, numeric_features),
    ("cat", categorical_pipeline, categorical_features)
])


## 9. Transformation and export

In [11]:
X_train_p = preprocessor.fit_transform(X_train)
X_test_p = preprocessor.transform(X_test)

feature_names = preprocessor.get_feature_names_out()

X_train_df = pd.DataFrame(
    X_train_p.toarray() if hasattr(X_train_p, "toarray") else X_train_p,
    columns=feature_names
)
X_test_df = pd.DataFrame(
    X_test_p.toarray() if hasattr(X_test_p, "toarray") else X_test_p,
    columns=feature_names
)

train_df = X_train_df.copy()
train_df[TARGET] = y_train.reset_index(drop=True)

test_df = X_test_df.copy()
test_df[TARGET] = y_test.reset_index(drop=True)

train_path = "preprocessed_train_task1.csv"
test_path = "preprocessed_test_task1.csv"

train_df.to_csv(train_path, index=False)
test_df.to_csv(test_path, index=False)

print("Saved:", train_path, train_df.shape)
print("Saved:", test_path, test_df.shape)


Saved: preprocessed_train_task1.csv (26029, 111)
Saved: preprocessed_test_task1.csv (6508, 111)
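
A quick sanity check after export could reload a saved file, confirm that the target column is present and that no values are missing, and split it back into features and target. The sketch below demonstrates this on a toy file written to a temporary directory rather than on the real exports. `check_export` is a hypothetical helper, not part of the notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

TARGET = "hours-per-week"


def check_export(path, target=TARGET):
    """Reload an exported CSV and split it back into features and target."""
    frame = pd.read_csv(path)
    if target not in frame.columns:
        raise ValueError(f"{target!r} column missing from {path}")
    if frame.isna().any().any():
        raise ValueError(f"{path} contains missing values")
    return frame.drop(columns=[target]), frame[target]


# Demonstration on a toy file written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy_train.csv"
    toy_frame = pd.DataFrame({"num__age": [0.5, -0.5], TARGET: [40.0, 45.0]})
    toy_frame.to_csv(toy_path, index=False)
    X_loaded, y_loaded = check_export(toy_path)

print(X_loaded.shape, y_loaded.shape)
```

In the notebook, the analogous calls would be `check_export(train_path)` and `check_export(test_path)`.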
