# **Final Project Task 3 - Census Modeling Regression**

Requirements
- Create a regression model on the Census dataset, with 'hours-per-week' as the target

- You can use models (estimators) from sklearn, but feel free to use any library for traditional ML.
    - Note: in sklearn, the LinearRegression estimator is based on OLS, a statistical method. Please use the SGDRegressor estimator, since this is based on gradient descent. 
    - You can use LinearRegression estimator, but only as comparison with the SGDRegressor - Optional.

- Model Selection and Setup **2p**:
    - Implement multiple models, to solve a regression problem using traditional ML: 
        - Linear Regression
        - Decision Tree Regression
        - Random Forest Regression - Optional
        - Ridge Regression - Optional
        - Lasso Regression - Optional
    - Choose a loss (or experiment with different losses) for the model and justify the choice. *1p*
        - MSE, MAE, RMSE, Huber Loss or others
    - Justify model choices based on dataset characteristics and task requirements; specify model pros and cons. *1p*


- Data Preparation
    - Use the preprocessed datasets from Task 1.
    - From the train set, create an extra validation set, if necessary. So in total there will be: train, validation and test datasets.
    - Be sure all models have their data preprocessed as needed. Some models require different, or no encoding for some features.


- Model Training and Experimentation **10p**
    - Establish a Baseline Model *2p*
        - For each model type, train a simple model with default settings as a baseline.
        - Evaluate its performance to establish a benchmark for comparison.
    - Plot train and validation loss and metric over epochs (or steps), if applicable. - Optional
    - Feature Selection: - Optional
        - Use insights from EDA in Task 2 to identify candidate features by analyzing patterns, relationships, and distributions.
    - Experimentation: *8p*
        - For each baseline model type, iteratively experiment with different combinations of features and transformations.
        - Experiment with feature engineering techniques such as interaction terms, polynomial features, or scaling transformations.
        - Identify the model with the best performance metrics on the test set.
        - You may need multiple preprocessed datasets.
- Hyperparameter Tuning - Optional
  - Perform hyperparameter tuning only on the best-performing model after evaluating all model types and experiments. 
  - Consider using techniques like Grid Search for exhaustive tuning, Random Search for quicker exploration, or Bayesian Optimization for an intelligent, efficient search of hyperparameters.
  - Avoid tuning models that do not show strong baseline performance or are unlikely to outperform others based on experimentation.
  - Ensure that hyperparameter tuning is done after completing feature selection, baseline modeling, and experimentation, ensuring that the model is stable and representative of the dataset.


- Model Evaluation **3p**
    - Evaluate models on the test dataset using regression metrics: *1p*
        - Mean Absolute Error (MAE)
        - Mean Squared Error (MSE)
        - Root Mean Squared Error (RMSE)
        - R² Score
    - Choose one metric for model comparison and explain your choice *1p*
    - Compare the results across different models. Save all experiment results into a table. *1p*

Feature Importance - Optional
- For applicable models (e.g., Decision Tree Regression), analyze feature importance and discuss its relevance to the problem.



Deliverables

- Notebook code with no errors.
- Code and results from experiments. Create a table with all experiment results, including the experiment name and metric values.
- Explain findings, choices, results.
- Potential areas for improvement or further exploration.


In [1]:
import pandas as pd

In [2]:
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
    "hours-per-week", "native-country", "income"
]

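# Note: with skipinitialspace=True the missing-value marker is read as "?" (no leading
# space), so na_values=" ?" does not match it and "?" entries stay as plain strings.
data = pd.read_csv(data_url, header=None, names=columns, na_values=" ?", skipinitialspace=True)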
data.sample(10)

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23965 | 36 | Private | 126675 | Masters | 14 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 1977 | 50 | United-States | >50K |
| 26151 | 54 | Local-gov | 173050 | Masters | 14 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 7688 | 0 | 40 | United-States | >50K |
| 14530 | 48 | Private | 169324 | HS-grad | 9 | Never-married | Other-service | Unmarried | Black | Female | 0 | 0 | 32 | Haiti | <=50K |
| 28969 | 60 | Private | 252413 | Some-college | 10 | Married-civ-spouse | Craft-repair | Husband | Other | Male | 0 | 0 | 32 | United-States | >50K |
| 6 | 49 | Private | 160187 | 9th | 5 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 0 | 0 | 16 | Jamaica | <=50K |
| 28057 | 35 | Private | 341102 | 9th | 5 | Never-married | Handlers-cleaners | Not-in-family | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 14023 | 23 | Private | 376416 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 23009 | 31 | ? | 283531 | HS-grad | 9 | Divorced | ? | Unmarried | Black | Female | 0 | 0 | 20 | United-States | <=50K |
| 31396 | 23 | Private | 118023 | Some-college | 10 | Never-married | Exec-managerial | Not-in-family | White | Male | 0 | 0 | 45 | ? | <=50K |
| 27397 | 32 | Local-gov | 114733 | Bachelors | 13 | Divorced | Prof-specialty | Unmarried | White | Female | 0 | 0 | 35 | United-States | <=50K |


## Task 3 – Regression modeling (hours-per-week)

Objective: build and compare several regression models that predict
hours-per-week (hours worked per week), using the preprocessed data from Task 1.
The models are evaluated on the test set using MAE, MSE, RMSE, and R².

In [3]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# 1) Load the preprocessed data from Task 1 (NOT the raw adult.data)
train_df = pd.read_csv("preprocessed_census_train.csv")
test_df  = pd.read_csv("preprocessed_census_test.csv")

TARGET = "target_hours"

X_train_full = train_df.drop(columns=[TARGET])
y_train_full = train_df[TARGET]

X_test = test_df.drop(columns=[TARGET])
y_test = test_df[TARGET]

# 2) Additional split: train -> train/validation
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.2, random_state=42
)

X_train.shape, X_val.shape, X_test.shape


((20823, 93), (5206, 93), (6508, 93))

In [4]:
def eval_regression(y_true, y_pred):
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

results = []  # every experiment's metrics are collected here


### Choosing the metric for model comparison

I use RMSE as the primary comparison metric because it penalizes large errors more
heavily and is expressed in the same units as the target variable (hours/week).
MAE is more robust to outliers, and R² gives the proportion of variance explained,
but RMSE is the most useful for capturing the impact of badly wrong predictions.
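
The difference between RMSE and MAE can be seen on a small made-up example. The arrays below are illustrative values, not Census data: two sets of predictions have the same MAE, but the one that concentrates its error in a single large miss gets a higher RMSE.

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean absolute error: every hour of error counts the same
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    # Root mean squared error: squaring makes large misses dominate
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical targets: five people who all work 40 hours per week
y_true = np.array([40.0, 40.0, 40.0, 40.0, 40.0])

pred_even = np.array([38.0, 42.0, 38.0, 42.0, 38.0])   # every prediction is off by 2 hours
pred_spiky = np.array([40.0, 40.0, 40.0, 40.0, 30.0])  # one prediction is off by 10 hours

mae_even, rmse_even = mae(y_true, pred_even), rmse(y_true, pred_even)
mae_spiky, rmse_spiky = mae(y_true, pred_spiky), rmse(y_true, pred_spiky)

print(f"even errors : MAE={mae_even:.2f}  RMSE={rmse_even:.2f}")
print(f"single miss : MAE={mae_spiky:.2f}  RMSE={rmse_spiky:.2f}")
```

Both prediction sets total 10 hours of absolute error, so MAE cannot tell them apart; RMSE ranks the single 10-hour miss as worse.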


In [5]:
from sklearn.linear_model import SGDRegressor

sgd_baseline = SGDRegressor(random_state=42)
sgd_baseline.fit(X_train, y_train)

pred_val = sgd_baseline.predict(X_val)
pred_test = sgd_baseline.predict(X_test)

results.append({
    "Experiment": "SGDRegressor_baseline",
    **{f"VAL_{k}": v for k, v in eval_regression(y_val, pred_val).items()},
    **{f"TEST_{k}": v for k, v in eval_regression(y_test, pred_test).items()},
    "Notes": "SGD default (GD-based), baseline"
})


In [6]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)

pred_val = lr.predict(X_val)
pred_test = lr.predict(X_test)

results.append({
    "Experiment": "LinearRegression_baseline",
    **{f"VAL_{k}": v for k, v in eval_regression(y_val, pred_val).items()},
    **{f"TEST_{k}": v for k, v in eval_regression(y_test, pred_test).items()},
    "Notes": "OLS baseline, comparison only"
})


In [7]:
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor(random_state=42)
dt.fit(X_train, y_train)

pred_val = dt.predict(X_val)
pred_test = dt.predict(X_test)

results.append({
    "Experiment": "DecisionTree_baseline",
    **{f"VAL_{k}": v for k, v in eval_regression(y_val, pred_val).items()},
    **{f"TEST_{k}": v for k, v in eval_regression(y_test, pred_test).items()},
    "Notes": "Tree baseline (no linearity assumption)"
})


In [8]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(random_state=42, n_estimators=200, n_jobs=-1)
rf.fit(X_train, y_train)

pred_val = rf.predict(X_val)
pred_test = rf.predict(X_test)

results.append({
    "Experiment": "RandomForest_baseline",
    **{f"VAL_{k}": v for k, v in eval_regression(y_val, pred_val).items()},
    **{f"TEST_{k}": v for k, v in eval_regression(y_test, pred_test).items()},
    "Notes": "RF baseline (non-linear, robust)"
})


In [9]:
from sklearn.linear_model import Ridge, Lasso

ridge = Ridge(alpha=1.0, random_state=42)
ridge.fit(X_train, y_train)

results.append({
    "Experiment": "Ridge_alpha1",
    **{f"VAL_{k}": v for k, v in eval_regression(y_val, ridge.predict(X_val)).items()},
    **{f"TEST_{k}": v for k, v in eval_regression(y_test, ridge.predict(X_test)).items()},
    "Notes": "L2 regularization (reduces overfitting)"
})

lasso = Lasso(alpha=0.001, random_state=42, max_iter=10000)
lasso.fit(X_train, y_train)

results.append({
    "Experiment": "Lasso_alpha0.001",
    **{f"VAL_{k}": v for k, v in eval_regression(y_val, lasso.predict(X_val)).items()},
    **{f"TEST_{k}": v for k, v in eval_regression(y_test, lasso.predict(X_test)).items()},
    "Notes": "L1 regularization (implicit feature selection)"
})


In [10]:
sgd_huber = SGDRegressor(
    loss="huber",        # more robust to outliers than squared_error
    alpha=1e-4,          # regularization strength
    max_iter=2000,
    tol=1e-3,
    random_state=42
)
sgd_huber.fit(X_train, y_train)

results.append({
    "Experiment": "SGD_huber_alpha1e-4",
    **{f"VAL_{k}": v for k, v in eval_regression(y_val, sgd_huber.predict(X_val)).items()},
    **{f"TEST_{k}": v for k, v in eval_regression(y_test, sgd_huber.predict(X_test)).items()},
    "Notes": "SGD with Huber loss (robust) + L2"
})


In [11]:
dt_tuned = DecisionTreeRegressor(
    random_state=42,
    max_depth=12,
    min_samples_leaf=10
)
dt_tuned.fit(X_train, y_train)

results.append({
    "Experiment": "DecisionTree_depth12_leaf10",
    **{f"VAL_{k}": v for k, v in eval_regression(y_val, dt_tuned.predict(X_val)).items()},
    **{f"TEST_{k}": v for k, v in eval_regression(y_test, dt_tuned.predict(X_test)).items()},
    "Notes": "Tree with constraints to reduce overfitting"
})


In [12]:
rf_tuned = RandomForestRegressor(
    random_state=42,
    n_estimators=400,
    max_depth=18,
    min_samples_leaf=5,
    n_jobs=-1
)
rf_tuned.fit(X_train, y_train)

results.append({
    "Experiment": "RandomForest_400_depth18_leaf5",
    **{f"VAL_{k}": v for k, v in eval_regression(y_val, rf_tuned.predict(X_val)).items()},
    **{f"TEST_{k}": v for k, v in eval_regression(y_test, rf_tuned.predict(X_test)).items()},
    "Notes": "RF tuned for better generalization"
})
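
The hyperparameters above were chosen by hand. As a rough sketch of how a randomized search could automate that choice, the block below runs `RandomizedSearchCV` on synthetic data. The data, the search space, and `n_iter=5` are illustrative assumptions, not settings used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training data (assumption: not the Census features)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 5))
y_demo = 3.0 * np.sin(X_demo[:, 0]) + X_demo[:, 1] ** 2 + rng.normal(0, 0.3, size=300)

# Hypothetical search space; real ranges depend on the dataset size
param_distributions = {
    "n_estimators": [50, 100],
    "max_depth": [4, 8, None],
    "min_samples_leaf": [1, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,                                  # number of sampled configurations
    cv=3,                                      # 3-fold cross-validation on the training data
    scoring="neg_root_mean_squared_error",     # sklearn maximizes scores, so RMSE is negated
    random_state=42,
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_cv_rmse = -search.best_score_            # flip the sign back to a positive RMSE
print(best_params, round(best_cv_rmse, 3))
```

Because the search uses cross-validation on the training data only, the test set stays untouched until the final evaluation.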


In [13]:
results_df = pd.DataFrame(results)

# sort by the primary metric (RMSE on the validation set)
results_df_sorted = results_df.sort_values("VAL_RMSE", ascending=True)

results_df_sorted


| | Experiment | VAL_MAE | VAL_MSE | VAL_RMSE | VAL_R2 | TEST_MAE | TEST_MSE | TEST_RMSE | TEST_R2 | Notes |
|---|---|---|---|---|---|---|---|---|---|---|
| 8 | RandomForest_400_depth18_leaf5 | 7.265521 | 112.5991 | 10.61127 | 0.2501085 | 7.265081 | 113.332 | 10.64575 | 0.2590711 | RF tuned for better generalization |
| 7 | DecisionTree_depth12_leaf10 | 7.56938 | 119.9367 | 10.95156 | 0.2012409 | 7.569996 | 121.89 | 11.04038 | 0.2031213 | Tree with constraints to reduce overfitting |
| 3 | RandomForest_baseline | 7.636464 | 120.5139 | 10.97788 | 0.1973969 | 7.677958 | 122.1516 | 11.05222 | 0.2014115 | RF baseline (non-linear, robust) |
| 5 | Lasso_alpha0.001 | 7.768972 | 124.6357 | 11.16403 | 0.1699467 | 7.804766 | 125.82 | 11.21695 | 0.1774288 | L1 regularization (implicit feature selection) |
| 4 | Ridge_alpha1 | 7.769813 | 124.6516 | 11.16475 | 0.1698406 | 7.807437 | 125.8402 | 11.21785 | 0.1772961 | L2 regularization (reduces overfitting) |
| 1 | LinearRegression_baseline | 7.769853 | 124.6521 | 11.16477 | 0.1698369 | 7.807522 | 125.8408 | 11.21788 | 0.1772924 | OLS baseline, comparison only |
| 6 | SGD_huber_alpha1e-4 | 7.442114 | 130.6099 | 11.42847 | 0.1301593 | 7.512855 | 133.4823 | 11.55346 | 0.1273346 | SGD with Huber loss (robust) + L2 |
| 2 | DecisionTree_baseline | 10.16327 | 225.5717 | 15.01905 | -0.5022714 | 10.36978 | 235.8883 | 15.35865 | -0.5421631 | Tree baseline (no linearity assumption) |
| 0 | SGDRegressor_baseline | 127957500.0 | 4.610914e+16 | 214730400.0 | -307079400000000.0 | 545475800.0 | 5.524847e+20 | 23504990000.0 | -3.611971e+18 | SGD default (GD-based), baseline |


In [14]:
best = results_df_sorted.iloc[0]
best_experiment = best["Experiment"]
best


Experiment        RandomForest_400_depth18_leaf5
VAL_MAE                                 7.265521
VAL_MSE                                112.59905
VAL_RMSE                                10.61127
VAL_R2                                  0.250108
TEST_MAE                                7.265081
TEST_MSE                              113.332003
TEST_RMSE                               10.64575
TEST_R2                                 0.259071
Notes         RF tuned for better generalization
Name: 8, dtype: object

### Modeling conclusions

In this task, several regression models were implemented and evaluated for estimating
the number of hours worked per week (hours-per-week), using the preprocessed data from
Task 1. Models were fitted on the training set, compared on the validation set, and
evaluated on the test set using MAE, MSE, RMSE, and R².

The results indicate that the random forests and the depth-limited decision tree
outperform the linear and gradient-descent-based models on RMSE and R². The unconstrained
DecisionTree baseline is the exception: its R² is negative (about -0.50 on validation),
which points to overfitting. Of all the models tested, RandomForestRegressor with adjusted
parameters (400 trees, maximum depth 18, and a minimum of 5 samples per leaf) achieved the
best results on both the validation and test sets. It reached an RMSE of about 10.6 hours
and an R² of about 0.26 on the test set, the best generalization among the alternatives
evaluated.

The linear models (Linear Regression, Ridge, and Lasso) performed almost identically to
one another (test RMSE of about 11.22) and worse than the best non-linear model (test RMSE
of about 10.65 for the tuned random forest). This suggests that part of the relationship
between the explanatory variables and the target is non-linear and is not captured by a
simple linear model, even with regularization.
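
One way to probe the non-linearity claim is to give a linear model explicitly non-linear inputs. The sketch below uses synthetic data with a known quadratic relationship (not the Census features) and compares a plain `Ridge` with `PolynomialFeatures` followed by `Ridge`. The degree and `alpha` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data: the target depends on x squared, which a straight line cannot follow
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(1000, 1))
y = x[:, 0] ** 2 + rng.normal(0, 0.5, size=1000)

x_fit, x_hold = x[:800], x[800:]
y_fit, y_hold = y[:800], y[800:]

# Plain linear model on the raw feature
linear = Ridge(alpha=1.0).fit(x_fit, y_fit)

# Same estimator, but fed x and x^2 (scaled so the penalty treats both terms alike)
poly = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0),
).fit(x_fit, y_fit)

r2_linear = r2_score(y_hold, linear.predict(x_hold))
r2_poly = r2_score(y_hold, poly.predict(x_hold))
print(f"R2 linear: {r2_linear:.3f}   R2 with degree-2 features: {r2_poly:.3f}")
```

On the Census data the same pipeline could be added as another row in the results table, keeping in mind that polynomial expansion of 93 columns grows the feature count quickly.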

SGDRegressor in its default configuration diverged and produced extremely large errors
(validation RMSE on the order of 10^8), which is consistent with SGD's sensitivity to
feature scaling, outliers, and the choice of loss function. Using the Huber loss made
training stable and gave a validation MAE of about 7.44, but its RMSE (about 11.43)
remained worse than that of the linear models and the tuned tree-based models.
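
A common remedy for SGD's sensitivity to feature scale is to standardize the inputs inside a pipeline. The sketch below is a minimal illustration on synthetic data with two features on very different scales; the feature ranges and coefficients are assumptions chosen for the demo, and whether the Task 1 data is already scaled is not known here.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (hypothetical, loosely like age vs. fnlwgt)
rng = np.random.default_rng(42)
n = 2000
small = rng.normal(40, 10, size=n)
large = rng.normal(200_000, 100_000, size=n)
X_demo = np.column_stack([small, large])
y_demo = 0.3 * small + 1e-5 * large + rng.normal(0, 1, size=n)

X_fit, X_hold = X_demo[:1500], X_demo[1500:]
y_fit, y_hold = y_demo[:1500], y_demo[1500:]

# StandardScaler is fitted on the training split only, then reused for prediction
scaled_sgd = make_pipeline(StandardScaler(), SGDRegressor(random_state=42))
scaled_sgd.fit(X_fit, y_fit)

rmse_scaled = float(np.sqrt(mean_squared_error(y_hold, scaled_sgd.predict(X_hold))))
print(f"Hold-out RMSE with scaling: {rmse_scaled:.3f}")
```

Putting the scaler in the pipeline keeps validation and test statistics out of the fit, which avoids leaking information across the splits.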

In conclusion, RandomForestRegressor is the most suitable choice among the models tested
for this regression problem, owing to its ability to model complex non-linear
relationships, its robustness to outliers, and its better performance on unseen data.
The results also suggest inherent limits on how well hours-per-week can be predicted
from this dataset, as reflected by the modest R² values (about 0.26 at best).

Future directions for improvement include more advanced feature engineering, automated
selection of relevant variables, and boosting models (for example Gradient Boosting or
XGBoost), which might capture the complex structure of the data more effectively.
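
As a starting point for the boosting direction, the sketch below fits sklearn's `GradientBoostingRegressor` on synthetic data and compares it with a predict-the-mean baseline. The data and the hyperparameter values are assumptions for illustration; no boosting model was run on the Census data in this notebook.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic non-linear target with an interaction term (not the Census data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = 3.0 * np.sin(X_demo[:, 0]) + X_demo[:, 1] * X_demo[:, 2] + rng.normal(0, 0.5, size=1000)

X_fit, X_hold = X_demo[:800], X_demo[800:]
y_fit, y_hold = y_demo[:800], y_demo[800:]

def rmse(y_true, y_pred):
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Reference point: always predict the training mean
mean_baseline = DummyRegressor(strategy="mean").fit(X_fit, y_fit)

# Hypothetical settings: many shallow trees with a small learning rate
boost = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.05, max_depth=3, random_state=42
).fit(X_fit, y_fit)

rmse_mean = rmse(y_hold, mean_baseline.predict(X_hold))
rmse_boost = rmse(y_hold, boost.predict(X_hold))
print(f"mean baseline RMSE: {rmse_mean:.3f}   boosting RMSE: {rmse_boost:.3f}")
```

Appending the result to `results` in the same format as the other experiments would make it directly comparable with the random forest.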


In [15]:
# Feature importances are only defined for the tree-based models
importances = None
if "DecisionTree" in best_experiment:
    importances = pd.Series(dt_tuned.feature_importances_, index=X_train.columns)
elif "RandomForest" in best_experiment:
    importances = pd.Series(rf_tuned.feature_importances_, index=X_train.columns)

# Print explicitly: a bare expression inside an if-block is not displayed by the notebook
if importances is not None:
    print(importances.sort_values(ascending=False).head(15))
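
Impurity-based `feature_importances_` are computed on the training data and can favor high-cardinality features, so permutation importance on held-out data is a useful cross-check. The sketch below shows the idea on synthetic data; the feature names `f_strong`, `f_weak`, and `f_noise` are hypothetical and the model settings are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: one strong driver, one weak driver, one pure-noise column
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "f_strong": rng.normal(size=600),
    "f_weak": rng.normal(size=600),
    "f_noise": rng.normal(size=600),
})
y_demo = 5.0 * X_demo["f_strong"] + 0.5 * X_demo["f_weak"] + rng.normal(0, 0.5, size=600)

X_fit, X_hold = X_demo.iloc[:450], X_demo.iloc[450:]
y_fit, y_hold = y_demo.iloc[:450], y_demo.iloc[450:]

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_fit, y_fit)

# Shuffle each column of the held-out set in turn and measure the drop in score
result = permutation_importance(model, X_hold, y_hold, n_repeats=5, random_state=42)
perm_importances = pd.Series(result.importances_mean, index=X_demo.columns).sort_values(ascending=False)
print(perm_importances)
```

With the notebook's own objects, the equivalent call would pass the fitted forest together with `X_val` and `y_val`.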
