# **Final Project Task 3 - Census Modeling Regression**

Requirements
- Create a regression model on the Census dataset, with 'hours-per-week' target

- You can use models (estimators) from sklearn, but feel free to use any library for traditional ML.
    - Note: in sklearn, the LinearRegression estimator is based on OLS, a statistical method. Please use the SGDRegressor estimator, since this is based on gradient descent. 
    - You can use LinearRegression estimator, but only as comparison with the SGDRegressor - Optional.

- Model Selection and Setup **2p**:
    - Implement multiple models, to solve a regression problem using traditional ML: 
        - Linear Regression
        - Decision Tree Regression
        - Random Forest Regression - Optional
        - Ridge Regression - Optional
        - Lasso Regression - Optional
    - Choose a loss (or experiment with different losses) for the model and justify the choice. *1p*
        - MSE, MAE, RMSE, Huber Loss or others
    - Justify model choices based on dataset characteristics and task requirements; specify model pros and cons. *1p*


- Data Preparation
    - Use the preprocessed datasets from Task 1.
    - From the train set, create an extra validation set, if necessary. In total there will be three datasets: train, validation and test.
    - Be sure all models have their data preprocessed as needed. Some models require different, or no encoding for some features.


- Model Training and Experimentation **10p**
    - Establish a Baseline Model *2p*
        - For each model type, train a simple model with default settings as a baseline.
        - Evaluate its performance to establish a benchmark for comparison.
    - Make plots with train, validation loss and metric on epochs (or on steps), if applicable. - Optional
    - Feature Selection: - Optional
        - Use insights from EDA in Task 2 to identify candidate features by analyzing patterns, relationships, and distributions.
    - Experimentation: *8p*
        - For each baseline model type, iteratively experiment with different combinations of features and transformations.
        - Experiment with feature engineering techniques such as interaction terms, polynomial features, or scaling transformations.
        - Identify the model with the best performance metrics on the test set.
        - You may need several differently preprocessed versions of the dataset.
- Hyperparameter Tuning - Optional
  - Perform hyperparameter tuning only on the best-performing model after evaluating all model types and experiments. 
  - Consider using techniques like Grid Search for exhaustive tuning, Random Search for quicker exploration, or Bayesian Optimization for an intelligent, efficient search of hyperparameters.
  - Avoid tuning models that do not show strong baseline performance or are unlikely to outperform others based on experimentation.
  - Ensure that hyperparameter tuning is done after completing feature selection, baseline modeling, and experimentation, ensuring that the model is stable and representative of the dataset.


- Model Evaluation **3p**
    - Evaluate models on the test dataset using regression metrics: *1p*
        - Mean Absolute Error (MAE)
        - Mean Squared Error (MSE)
        - Root Mean Squared Error (RMSE)
        - R² Score
    - Choose one metric for model comparison and explain your choice *1p*
    - Compare the results across different models. Save all experiment results into a table. *1p*

Feature Importance - Optional
- For applicable models (e.g., Decision Tree Regression), analyze feature importance and discuss its relevance to the problem.
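
A minimal sketch of impurity-based feature importance, assuming a toy dataset with two hypothetical features (`signal` and `noise`) in place of the Census columns. It only illustrates reading `feature_importances_` from a fitted tree.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
n = 500

# Hypothetical features: only "signal" drives the target; "noise" is irrelevant.
X_demo = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y_demo = 3.0 * X_demo["signal"] + rng.normal(scale=0.1, size=n)

tree = DecisionTreeRegressor(max_depth=4, random_state=42).fit(X_demo, y_demo)

# Impurity-based importances are normalized to sum to 1 across features.
importances = pd.Series(
    tree.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

For a model wrapped in a `Pipeline` with a `ColumnTransformer`, the matching column names would typically come from the fitted preprocessing step's `get_feature_names_out()`, since one-hot encoding expands each categorical feature into several columns.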



Deliverables

- Notebook code with no errors.
- Code and results from experiments. Create a table with all experiment results, including the experiment name and metric values.
- Explain findings, choices, results.
- Potential areas for improvement or further exploration.


In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.linear_model import SGDRegressor, LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

import warnings
warnings.filterwarnings("ignore")

In [2]:
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
    "hours-per-week", "native-country", "income"
]

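# skipinitialspace=True strips the space after each comma, so missing values arrive as "?" rather than " ?".
data = pd.read_csv(data_url, header=None, names=columns, na_values="?", skipinitialspace=True)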
data.sample(10)

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5508 | 18 | Private | 201901 | 11th | 7 | Never-married | Sales | Own-child | White | Female | 0 | 0 | 10 | United-States | <=50K |
| 27067 | 53 | Self-emp-inc | 134854 | Assoc-voc | 11 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 40 | Greece | >50K |
| 16799 | 29 | Private | 198825 | HS-grad | 9 | Never-married | Exec-managerial | Not-in-family | White | Female | 0 | 0 | 38 | United-States | <=50K |
| 18568 | 50 | Private | 289390 | HS-grad | 9 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 47 | United-States | <=50K |
| 16395 | 23 | Private | 199915 | Bachelors | 13 | Never-married | Adm-clerical | Own-child | White | Female | 0 | 0 | 35 | United-States | <=50K |
| 7732 | 43 | Private | 350661 | Prof-school | 15 | Separated | Tech-support | Not-in-family | White | Male | 0 | 0 | 50 | Columbia | >50K |
| 7938 | 21 | Private | 117210 | HS-grad | 9 | Never-married | Machine-op-inspct | Own-child | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3975 | 24 | Private | 215251 | Bachelors | 13 | Never-married | Exec-managerial | Not-in-family | White | Female | 0 | 0 | 50 | United-States | <=50K |
| 11804 | 42 | Private | 270721 | Bachelors | 13 | Never-married | Prof-specialty | Not-in-family | White | Male | 0 | 0 | 32 | United-States | <=50K |
| 16887 | 63 | Private | 122442 | 12th | 8 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 40 | United-States | <=50K |


In [3]:
target = "hours-per-week"

X = data.drop(columns=[target])
y = data[target]


In [4]:
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42)

X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))


22792 4884 4885


In [5]:
categorical = X.select_dtypes(include="object").columns
numeric = X.select_dtypes(exclude="object").columns

preprocessor_scaled = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical)
])

preprocessor_tree = ColumnTransformer([
    ("num", "passthrough", numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical)
])

In [6]:
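def evaluate(model, X, y):
    """Return (MAE, MSE, RMSE, R2) for a fitted model on the given features and target."""
    pred = model.predict(X)
    mae = mean_absolute_error(y, pred)
    mse = mean_squared_error(y, pred)
    rmse = np.sqrt(mse)  # RMSE is in the same unit as the target (hours)
    r2 = r2_score(y, pred)
    return mae, mse, rmse, r2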

In [7]:
models = {
    "SGDRegressor": Pipeline([
        ("prep", preprocessor_scaled),
        ("model", SGDRegressor(max_iter=1000, random_state=42))
    ]),
    
    "LinearRegression": Pipeline([
        ("prep", preprocessor_scaled),
        ("model", LinearRegression())
    ]),
    
    "DecisionTree": Pipeline([
        ("prep", preprocessor_tree),
        ("model", DecisionTreeRegressor(random_state=42))
    ]),
    
    "RandomForest": Pipeline([
        ("prep", preprocessor_tree),
        ("model", RandomForestRegressor(random_state=42))
    ]),
    
    "Ridge": Pipeline([
        ("prep", preprocessor_scaled),
        ("model", Ridge())
    ]),
    
    "Lasso": Pipeline([
        ("prep", preprocessor_scaled),
        ("model", Lasso())
    ])
}

In [8]:
results = []

for name, model in models.items():
    model.fit(X_train, y_train)
    
    train_metrics = evaluate(model, X_train, y_train)
    val_metrics = evaluate(model, X_val, y_val)
    test_metrics = evaluate(model, X_test, y_test)
    
    results.append([
        name,
        *train_metrics,
        *val_metrics,
        *test_metrics
    ])

# Named result_columns so the dataset's `columns` list from cell 2 is not overwritten.
result_columns = [
    "Model",
    "Train MAE","Train MSE","Train RMSE","Train R2",
    "Val MAE","Val MSE","Val RMSE","Val R2",
    "Test MAE","Test MSE","Test RMSE","Test R2"
]

results_df = pd.DataFrame(results, columns=result_columns)
results_df.sort_values("Test RMSE")

| | Model | Train MAE | Train MSE | Train RMSE | Train R2 | Val MAE | Val MSE | Val RMSE | Val R2 | Test MAE | Test MSE | Test RMSE | Test R2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | RandomForest | 2.834797 | 16.985586 | 4.121357 | 0.887518 | 7.666588 | 121.472772 | 11.021469 | 0.212247 | 7.756333 | 126.644266 | 11.253633 | 0.195374 |
| 4 | Ridge | 7.627181 | 120.614273 | 10.982453 | 0.201266 | 7.746326 | 121.375465 | 11.017053 | 0.212878 | 7.774384 | 127.466839 | 11.290121 | 0.190147 |
| 1 | LinearRegression | 7.627529 | 120.613285 | 10.982408 | 0.201272 | 7.746898 | 121.39359 | 11.017876 | 0.21276 | 7.775416 | 127.486069 | 11.290973 | 0.190025 |
| 0 | SGDRegressor | 7.666585 | 120.989479 | 10.999522 | 0.198781 | 7.773044 | 121.546615 | 11.024818 | 0.211768 | 7.809221 | 127.618487 | 11.296835 | 0.189184 |
| 5 | Lasso | 7.50842 | 143.408327 | 11.975322 | 0.050319 | 7.690179 | 146.243809 | 12.093131 | 0.051606 | 7.75214 | 150.902247 | 12.284228 | 0.041252 |
| 2 | DecisionTree | 0.010574 | 0.110938 | 0.333074 | 0.999265 | 10.27068 | 232.371826 | 15.243747 | -0.506936 | 10.39826 | 235.025742 | 15.330549 | -0.493221 |


In [9]:
sgd_tuned = Pipeline([
    ("prep", preprocessor_scaled),
    ("model", SGDRegressor(
        loss="huber",      # robust to outliers
        alpha=0.0001,
        penalty="l2",
        max_iter=5000,
        learning_rate="adaptive",
        eta0=0.01,
        random_state=42
    ))
])

sgd_tuned.fit(X_train, y_train)

evaluate(sgd_tuned, X_test, y_test)

(7.568210758113829,
 141.0851130829604,
 np.float64(11.877925453670787),
 0.10362455049870023)

In [10]:
results_df.sort_values("Test RMSE")

| | Model | Train MAE | Train MSE | Train RMSE | Train R2 | Val MAE | Val MSE | Val RMSE | Val R2 | Test MAE | Test MSE | Test RMSE | Test R2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | RandomForest | 2.834797 | 16.985586 | 4.121357 | 0.887518 | 7.666588 | 121.472772 | 11.021469 | 0.212247 | 7.756333 | 126.644266 | 11.253633 | 0.195374 |
| 4 | Ridge | 7.627181 | 120.614273 | 10.982453 | 0.201266 | 7.746326 | 121.375465 | 11.017053 | 0.212878 | 7.774384 | 127.466839 | 11.290121 | 0.190147 |
| 1 | LinearRegression | 7.627529 | 120.613285 | 10.982408 | 0.201272 | 7.746898 | 121.39359 | 11.017876 | 0.21276 | 7.775416 | 127.486069 | 11.290973 | 0.190025 |
| 0 | SGDRegressor | 7.666585 | 120.989479 | 10.999522 | 0.198781 | 7.773044 | 121.546615 | 11.024818 | 0.211768 | 7.809221 | 127.618487 | 11.296835 | 0.189184 |
| 5 | Lasso | 7.50842 | 143.408327 | 11.975322 | 0.050319 | 7.690179 | 146.243809 | 12.093131 | 0.051606 | 7.75214 | 150.902247 | 12.284228 | 0.041252 |
| 2 | DecisionTree | 0.010574 | 0.110938 | 0.333074 | 0.999265 | 10.27068 | 232.371826 | 15.243747 | -0.506936 | 10.39826 | 235.025742 | 15.330549 | -0.493221 |


# Model Comparison

- Random Forest: this model achieved the lowest test RMSE (11.25), which suggests it captures some nonlinear patterns that the other models miss. The margin over the linear models is small (about 0.04), and the gap between its train RMSE (4.12) and test RMSE shows that it also overfits considerably.
- Standard linear models: Ridge, Linear Regression, and SGDRegressor performed almost identically, with test RMSE between 11.29 and 11.30, indicating a stable but limited linear relationship in the data.
- Lasso Regression: this model performed poorly, with a test RMSE of 12.28 and a very low R2 (0.04), suggesting that the L1 penalty at its default strength was too restrictive for this dataset.
- Decision Tree: this model failed badly because of overfitting. Although it was almost perfect on the training data (train R2 of 0.999), it had the worst test performance, with a negative R2 (-0.49).
- Tuned SGD: using the Huber loss made the model more robust to outliers, but it resulted in a test RMSE of 11.88, which did not beat the Random Forest baseline.

While the linear models (SGD, Ridge) fit a single linear relationship to these features, Random Forest came out ahead because it can pick up complex interactions, for example "age" affecting working hours differently depending on whether a person is self-employed or works in the private sector.
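
The kind of interaction described above can be illustrated on synthetic data. This is a sketch under stated assumptions: the features `age` and `self_employed` and the target `hours` are generated here for the demonstration and are not taken from the Census data. The slope on age flips sign with the flag, which an additive linear model cannot represent.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
age = rng.uniform(20, 60, size=n)           # hypothetical numeric feature
self_employed = rng.integers(0, 2, size=n)  # hypothetical binary flag

# Synthetic target: the effect of age changes sign depending on the flag.
slope = np.where(self_employed == 1, 0.5, -0.5)
hours = 40 + slope * (age - 40) + rng.normal(scale=1.0, size=n)

X_demo = np.column_stack([age, self_employed])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, hours, test_size=0.25, random_state=0
)

linear = LinearRegression().fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

r2_linear = r2_score(y_te, linear.predict(X_te))
r2_forest = r2_score(y_te, forest.predict(X_te))
print(f"linear R2={r2_linear:.3f}, forest R2={r2_forest:.3f}")
```

Adding an explicit `age * self_employed` product column would let the linear model recover this pattern, which is one reason the task suggests experimenting with interaction terms.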

# Metric Interpretation

- MAE (mean absolute error): the average error in hours. The best models are off by about 7.7 hours.
- RMSE (root mean squared error): similar to MAE, but it penalizes large errors more heavily. The baseline RMSE is about 11.29.
- R2: the share of the variation in working hours that is explained by the features. The score is low (about 0.19, or 19%), indicating that factors outside this dataset (such as personal choice or company-specific culture) strongly influence working hours.
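
To make the three metrics concrete, here is a small worked example in plain NumPy. The four target values and predictions are made-up numbers chosen so the arithmetic is easy to follow.

```python
import numpy as np

y_true = np.array([40.0, 40.0, 20.0, 60.0])  # toy weekly hours
y_pred = np.array([38.0, 42.0, 30.0, 50.0])  # toy predictions

errors = y_true - y_pred                     # [2, -2, -10, 10]

mae = np.mean(np.abs(errors))                # (2 + 2 + 10 + 10) / 4 = 6.0
mse = np.mean(errors ** 2)                   # (4 + 4 + 100 + 100) / 4 = 52.0
rmse = np.sqrt(mse)                          # about 7.21, pulled up by the two large errors

ss_res = np.sum(errors ** 2)                 # 208
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 800
r2 = 1.0 - ss_res / ss_tot                   # 0.74

print(mae, mse, round(rmse, 3), round(r2, 2))
```

RMSE exceeds MAE here because the two 10-hour misses dominate the squared sum. The same effect explains why RMSE (about 11.3) is well above MAE (about 7.7) in the results table.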

# Tuning and Conclusions

- Huber loss: unlike the standard squared error, the Huber loss is less sensitive to outliers (people who work extremely many or extremely few hours), which helps the model remain stable during training.
- Result: the tuned SGD model reached a test RMSE of 11.88. Its test MAE improved over the baseline SGDRegressor (7.57 vs. 7.81), but its RMSE was worse (11.88 vs. 11.30), and Random Forest remained the strongest model overall thanks to its ability to capture complex relationships between variables such as age, education, and occupation.
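
The difference between the Huber loss and the squared error can be sketched in a few lines of NumPy. The helper `huber_loss` and the threshold `epsilon=1.0` are illustrative assumptions for this example. The tuning cell in this notebook does not set `epsilon`, so `SGDRegressor` uses its own default there.

```python
import numpy as np

def huber_loss(residual, epsilon=1.0):
    """Quadratic for |residual| <= epsilon, linear beyond it."""
    abs_r = np.abs(residual)
    quadratic = 0.5 * abs_r ** 2
    linear = epsilon * (abs_r - 0.5 * epsilon)
    return np.where(abs_r <= epsilon, quadratic, linear)

residuals = np.array([0.5, 2.0, 20.0])      # small, medium, and outlier-sized errors

squared = 0.5 * residuals ** 2              # [0.125, 2.0, 200.0]
huber = huber_loss(residuals, epsilon=1.0)  # [0.125, 1.5, 19.5]

print(squared)
print(huber)
```

Both losses agree on the small residual, but the 20-hour outlier contributes 200 under the squared error and only 19.5 under Huber. This is consistent with the tuned model's lower MAE and higher RMSE: it gives up accuracy on extreme cases in exchange for a better fit on typical ones.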

This task shows that, while linear models are stable, the data contains nonlinear relationships that favor Random Forest-type methods. However, the low R2 across all models suggests that "hours-per-week" is a noisy target variable that is hard to predict accurately from these features.