# **Final Project Task 1 - Census Data Preprocess**

Requirements

- Target variable specification:
    - The target variable for this project is hours-per-week. 
    - Ensure all preprocessing steps are designed to support regression analysis on this target variable.
- Encode data  **3p**
- Handle missing values if any **1p**
- Correct errors, inconsistencies, remove duplicates if any **1p**
- Outlier detection and treatment if any **1p**
- Normalization / Standardization if necessary **1p**
- Feature engineering **3p**
- Train test split, save it.
- Others?


Deliverable:

- Notebook code with no errors.
- Preprocessed data as csv.

In [4]:
import pandas as pd

In [7]:
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
    "hours-per-week", "native-country", "income"
]

data = pd.read_csv(data_url, header=None, names=columns, na_values=" ?", skipinitialspace=True)
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


Imports + settings


In [6]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.impute import SimpleImputer
    

Why?

pandas/numpy for data manipulation.

Pipeline/ColumnTransformer so that preprocessing is done correctly (no leakage).

SimpleImputer for missing values, OneHotEncoder for categorical columns, StandardScaler for numeric columns.

Quick inspection: dimensions, types, missing values

In [8]:
print("Shape:", data.shape)
data.info()

missing = data.isna().sum().sort_values(ascending=False)
display(missing[missing > 0])


Shape: (32561, 15)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB

Series([], dtype: int64)
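
The empty result above deserves a second look. The raw Adult file marks missing values with `?`, and since `skipinitialspace=True` removes the leading space, `na_values=" ?"` most likely never matches, so the placeholders stay in the data as ordinary strings. Below is a minimal sketch of how they could be counted and converted; the frame `demo` is a hypothetical stand-in for `data`, not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for `data`
demo = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov"],
    "occupation": ["?", "Sales", "?"],
    "age": [39, 50, 38],
})

# Count literal "?" placeholders per column
placeholder_counts = demo.eq("?").sum()
print(placeholder_counts[placeholder_counts > 0])

# Turn them into real missing values so isna() and SimpleImputer can see them
demo_clean = demo.replace("?", np.nan)
print(demo_clean.isna().sum())
```

If the placeholders are left as they are, the imputers further down have nothing to fill, and `?` is presumably treated as just another category by the one-hot encoder.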

Cleaning inconsistencies + duplicates
 Text normalization (strip spaces)

Even with skipinitialspace, it is worth calling strip() on the categorical columns, since skipinitialspace only removes the spaces that follow a delimiter.

In [9]:
cat_cols_guess = data.select_dtypes(include="object").columns
data[cat_cols_guess] = data[cat_cols_guess].apply(lambda s: s.str.strip())


Removing duplicates

In [10]:
dup_count = data.duplicated().sum()
print("Duplicate rows:", dup_count)

data = data.drop_duplicates().reset_index(drop=True)
print("Shape after drop_duplicates:", data.shape)


Duplicate rows: 24
Shape after drop_duplicates: (32537, 15)


Why?
Duplicates can distort the distributions and the model's performance.
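
Before dropping anything, it can help to look at the duplicated rows themselves. The sketch below uses a small hypothetical frame (`demo`), not the Adult data, and shows how `keep=False` exposes every member of a duplicate group.

```python
import pandas as pd

# Hypothetical rows; the first, second and fourth are identical
demo = pd.DataFrame({
    "age": [25, 25, 40, 25],
    "workclass": ["Private", "Private", "State-gov", "Private"],
    "hours-per-week": [40, 40, 38, 40],
})

# keep=False marks every member of a duplicate group, not only the repeats
dup_groups = demo[demo.duplicated(keep=False)]
print(dup_groups)

# drop_duplicates keeps the first occurrence of each group by default
deduped = demo.drop_duplicates().reset_index(drop=True)
print(deduped.shape)
```

In survey data, identical rows may describe different people, so whether to drop them is a judgment call. This notebook drops them.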

Defining the target + separating X/y

Required target: hours-per-week.

In [11]:
target = "hours-per-week"

y = data[target].copy()
X = data.drop(columns=[target]).copy()

y.describe()


count    32537.000000
mean        40.440329
std         12.346889
min          1.000000
25%         40.000000
50%         40.000000
75%         45.000000
max         99.000000
Name: hours-per-week, dtype: float64

Why?
In regression, y must be separated from X before any transformation is fitted.

Outliers (detection + treatment), without "damaging" the target
 Quick outlier check on y

In [12]:
y_quantiles = y.quantile([0.01, 0.05, 0.95, 0.99])
y_quantiles


0.01     8.0
0.05    18.0
0.95    60.0
0.99    80.0
Name: hours-per-week, dtype: float64

Observation: in Adult, hours-per-week has large values (e.g. 99). These are realistic, so the target is usually not trimmed aggressively.
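
As a complement to the quantiles, a common detection rule is the 1.5 × IQR fence. The sketch below applies it to a hypothetical series `y_demo` rather than the real target; it only flags values and does not modify them.

```python
import pandas as pd

# Hypothetical target values; the real `y` comes from the notebook
y_demo = pd.Series([1, 20, 35, 40, 40, 40, 40, 45, 50, 60, 99], name="hours-per-week")

q1, q3 = y_demo.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Detection only: the flagged values are reported, not removed
flagged = y_demo[(y_demo < lower_fence) | (y_demo > upper_fence)]
print(f"fences: [{lower_fence}, {upper_fence}], flagged: {len(flagged)} of {len(y_demo)}")
```

On a target concentrated around 40 hours, the IQR is narrow and this rule tends to flag many plausible values. That is one reason to leave the target untouched.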

Outliers in the numeric columns of X: winsorization (clipping) in the pipeline

We clip the numeric columns to the 1%-99% percentile range learned from train only (important: no leakage). We implement this with a custom transformer.

In [13]:
from sklearn.base import BaseEstimator, TransformerMixin

class QuantileClipper(BaseEstimator, TransformerMixin):
    def __init__(self, lower=0.01, upper=0.99):
        self.lower = lower
        self.upper = upper
        
    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.lower_bounds_ = np.nanquantile(X, self.lower, axis=0)
        self.upper_bounds_ = np.nanquantile(X, self.upper, axis=0)
        return self
    
    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.clip(X, self.lower_bounds_, self.upper_bounds_)


Why?

The outliers in capital-gain are very extreme. Clipping combined with a log transform helps considerably.
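
The idea behind the transformer can be sketched with plain NumPy: the bounds are computed on the training column and then reused, unchanged, on unseen data. The arrays `train_col` and `test_col` are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical heavy-tailed column: mostly zeros with a few large values
train_col = np.concatenate([np.zeros(95), rng.uniform(1_000, 99_999, size=5)])
test_col = np.array([0.0, 5_000.0, 250_000.0])

# The bounds come from the training column only
lo, hi = np.nanquantile(train_col, [0.01, 0.99])

# The same bounds are reused for unseen data
test_clipped = np.clip(test_col, lo, hi)
print(lo, hi, test_clipped)
```

Because most values are zero, the lower bound is zero as well, so only the upper tail is actually clipped in this example.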

Feature engineering

We do 3 things that are useful for regression:

- Log transform for capital-gain and capital-loss (reduces skew).
- Binning for age (optional but useful; the model can learn non-linearities).
- Dropping a redundant column (education and education-num encode the same information; we keep one).

First, we choose the numeric/categorical columns.

In [14]:
numeric_features = ["age", "fnlwgt", "education-num", "capital-gain", "capital-loss"]
categorical_features = [c for c in X.columns if c not in numeric_features]

numeric_features, categorical_features


(['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss'],
 ['workclass',
  'education',
  'marital-status',
  'occupation',
  'relationship',
  'race',
  'sex',
  'native-country',
  'income'])

Log transform for capital-gain/loss

We apply log1p ONLY to these two columns. The simplest approach is to create new columns before the pipeline (clear and easy to explain).

In [15]:
X_fe = X.copy()

X_fe["capital-gain-log"] = np.log1p(X_fe["capital-gain"])
X_fe["capital-loss-log"] = np.log1p(X_fe["capital-loss"])

# optional: drop the original columns so the information is not duplicated
X_fe = X_fe.drop(columns=["capital-gain", "capital-loss"])

# optional: age bins (categorical); right=False gives [0, 25), [25, 35), ... to match the labels
X_fe["age_bin"] = pd.cut(
    X_fe["age"],
    bins=[0, 25, 35, 45, 55, 65, 120],
    labels=["<25", "25-34", "35-44", "45-54", "55-64", "65+"],
    right=False
)

# optional: drop 'education' because 'education-num' already encodes it
if "education" in X_fe.columns:
    X_fe = X_fe.drop(columns=["education"])

X_fe.head()


Unnamed: 0,age,workclass,fnlwgt,education-num,marital-status,occupation,relationship,race,sex,native-country,income,capital-gain-log,capital-loss-log,age_bin
0,39,State-gov,77516,13,Never-married,Adm-clerical,Not-in-family,White,Male,United-States,<=50K,7.684784,0.0,35-44
1,50,Self-emp-not-inc,83311,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States,<=50K,0.0,0.0,45-54
2,38,Private,215646,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,United-States,<=50K,0.0,0.0,35-44
3,53,Private,234721,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,United-States,<=50K,0.0,0.0,45-54
4,28,Private,338409,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,Cuba,<=50K,0.0,0.0,25-34


Why?

log1p reduces the impact of extreme values.

age_bin captures non-linear relationships.

education-num is already a numeric mapping of education, so keeping education would duplicate information.
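
Both effects can be checked on toy data. The sketch below uses hypothetical values (`gain`, `ages`), not the real columns. It compares skewness before and after `log1p`, and shows how the `right` argument of `pd.cut` decides which bin a boundary age falls into.

```python
import numpy as np
import pandas as pd

# Hypothetical capital-gain-like column: 90 zeros and a few large values
gain = pd.Series([0] * 90 + [2_174, 5_013, 7_688, 15_024, 99_999] * 2, name="capital-gain")
gain_log = np.log1p(gain)

# log1p keeps zeros at zero and compresses the long right tail
print("skew before:", gain.skew(), "skew after:", gain_log.skew())

# With right=False the bins are [0, 25), [25, 35), [35, 120),
# so an age of exactly 25 falls into "25-34"
ages = pd.Series([24, 25, 34, 35])
binned = pd.cut(ages, bins=[0, 25, 35, 120], labels=["<25", "25-34", "35+"], right=False)
print(binned.tolist())
```

With the default `right=True`, the intervals are closed on the right instead, so an age of 25 would be assigned to the first bin.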

Preprocessing: missing values, encoding, standardization (done correctly, in a Pipeline)

We recompute the column lists after feature engineering.

In [16]:
numeric_features = X_fe.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X_fe.select_dtypes(exclude=[np.number]).columns.tolist()

numeric_features, categorical_features


(['age', 'fnlwgt', 'education-num', 'capital-gain-log', 'capital-loss-log'],
 ['workclass',
  'marital-status',
  'occupation',
  'relationship',
  'race',
  'sex',
  'native-country',
  'income',
  'age_bin'])

Numeric pipeline

- median imputation
- outlier clipping (QuantileClipper)
- standardization

In [17]:
numeric_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("clipper", QuantileClipper(lower=0.01, upper=0.99)),
    ("scaler", StandardScaler())
])


Categorical pipeline

- "most_frequent" imputation
- OneHotEncoder

In [18]:
categorical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])


ColumnTransformer

In [19]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, numeric_features),
        ("cat", categorical_pipeline, categorical_features)
    ],
    remainder="drop"
)


Why?

handle_unknown="ignore" avoids errors when categories that were not seen during fit appear in test.

Standardization is useful for many models (linear regression, ridge, lasso, SVR, etc.).

Train/Test split (and saving)

In [20]:
X_train, X_test, y_train, y_test = train_test_split(
    X_fe, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)


(26029, 14) (6508, 14) (26029,) (6508,)


Why?
We split before fitting any transformation so that no statistics from test are "seen".

Fit on train + transform train/test

In [21]:
X_train_proc = preprocessor.fit_transform(X_train)
X_test_proc = preprocessor.transform(X_test)

type(X_train_proc), X_train_proc.shape


(scipy.sparse._csr.csr_matrix, (26029, 99))

Why?
fit_transform on train only, transform on test.
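
The same rule, reduced to a single scaler on synthetic data (`X_demo` is randomly generated for illustration). The mean and standard deviation are estimated on the training part and reused for the test part.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(loc=40, scale=12, size=(200, 1))  # synthetic single feature

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_tr)   # statistics come from train only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # reuses the train mean and std

print(X_tr_scaled.mean(), X_te_scaled.mean())
```

The scaled training mean is zero up to floating-point error, while the test mean is only close to zero. This is expected and shows that the test rows did not influence the fitted statistics.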

Saving as CSV (deliverable)

X_train_proc is a sparse matrix (from OneHot). To save it as CSV, we convert it to a dense array (note: this can be large). For Adult it is fine.

First we extract the feature names after OneHot.


In [22]:
# feature names
ohe = preprocessor.named_transformers_["cat"].named_steps["onehot"]
cat_feature_names = ohe.get_feature_names_out(categorical_features)

all_feature_names = np.concatenate([numeric_features, cat_feature_names])

# convert to DataFrame
X_train_df = pd.DataFrame(X_train_proc.toarray() if hasattr(X_train_proc, "toarray") else X_train_proc,
                          columns=all_feature_names)
X_test_df = pd.DataFrame(X_test_proc.toarray() if hasattr(X_test_proc, "toarray") else X_test_proc,
                         columns=all_feature_names)

# add the target as a column (alternatively, save it separately)
train_out = X_train_df.copy()
train_out[target] = y_train.reset_index(drop=True)

test_out = X_test_df.copy()
test_out[target] = y_test.reset_index(drop=True)

train_out.to_csv("adult_preprocessed_train.csv", index=False)
test_out.to_csv("adult_preprocessed_test.csv", index=False)

print("Saved:", "adult_preprocessed_train.csv", "adult_preprocessed_test.csv")


Saved: adult_preprocessed_train.csv adult_preprocessed_test.csv


Why?

The requirements say "Train test split, save it" and "Preprocessed data as csv".

Here the preprocessed train/test sets are saved with the target included.
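
Since the target is stored in the same file as the features, a later modeling step has to split them again after loading. A minimal sketch, using a temporary directory and a hypothetical two-row frame (`saved`) in place of the real output:

```python
import os
import tempfile

import pandas as pd

target = "hours-per-week"

# Hypothetical tiny stand-in for the saved training file
saved = pd.DataFrame({"age": [0.1, -1.2], "sex_Male": [1.0, 0.0], target: [40, 13]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "adult_preprocessed_train.csv")
    saved.to_csv(path, index=False)
    loaded = pd.read_csv(path)

# pop removes the target column from the frame and returns it
y_loaded = loaded.pop(target)
X_loaded = loaded
print(X_loaded.columns.tolist(), y_loaded.tolist())
```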

Also saving the complete preprocessed dataset (for analysis)

In [23]:
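from sklearn.base import clone

# Fit a separate copy so the train-fitted `preprocessor` stays untouched (analysis only: statistics include test rows)
full_preprocessor = clone(preprocessor)
X_full_proc = full_preprocessor.fit_transform(X_fe)
# Recompute feature names from the refitted encoder, since its categories may differ from train
full_ohe = full_preprocessor.named_transformers_["cat"].named_steps["onehot"]
full_feature_names = np.concatenate([numeric_features, full_ohe.get_feature_names_out(categorical_features)])
X_full_df = pd.DataFrame(X_full_proc.toarray() if hasattr(X_full_proc, "toarray") else X_full_proc,
                         columns=full_feature_names)
full_out = X_full_df.copy()
full_out[target] = y.reset_index(drop=True)

full_out.to_csv("adult_preprocessed_full.csv", index=False)
print("Saved:", "adult_preprocessed_full.csv")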


Saved: adult_preprocessed_full.csv
