# **Final Project Task 3 - Census Modeling Regression**

Requirements
- Create a regression model on the Census dataset, with 'hours-per-week' target

- You can use models (estimators) from sklearn, but feel free to use any library for traditional ML. 
    - Note: in sklearn, the LinearRegression estimator is based on OLS, a statistical method. Please use the SGDRegressor estimator, since this is based on gradient descent. 
    - You can use LinearRegression estimator, but only as comparison with the SGDRegressor - Optional.

- Model Selection and Setup **2p**:
    - Implement multiple models, to solve a regression problem using traditional ML: 
        - Linear Regression
        - Decision Tree Regression
        - Random Forest Regression - Optional
        - Ridge Regression - Optional
        - Lasso Regression - Optional
    - Choose a loss (or experiment with different losses) for the model and justify the choice. *1p*
        - MSE, MAE, RMSE, Huber Loss or others
    - Justify model choices based on dataset characteristics and task requirements; specify model pros and cons. *1p*


- Data Preparation
    - Use the preprocessed datasets from Task 1.
    - From the train set, create an extra validation set, if necessary. So in total there will be: train, validation and test datasets.
    - Be sure all models have their data preprocessed as needed. Some models require different, or no encoding for some features.


- Model Training and Experimentation **10p**
    - Establish a Baseline Model *2p*
        - For each model type, train a simple model with default settings as a baseline.
        - Evaluate its performance to establish a benchmark for comparison.
    - Make plots with train, validation loss and metric on epochs (or on steps), if applicable. - Optional
    - Feature Selection: - Optional
        - Use insights from EDA in Task 2 to identify candidate features by analyzing patterns, relationships, and distributions.
    - Experimentation: *8p*
        - For each baseline model type, iteratively experiment with different combinations of features and transformations.
        - Experiment with feature engineering techniques such as interaction terms, polynomial features, or scaling transformations.
        - Identify the model with the best performance metrics on the test set.
        - You may need multiple datasets, each preprocessed differently.
- Hyperparameter Tuning - Optional
  - Perform hyperparameter tuning only on the best-performing model after evaluating all model types and experiments. 
  - Consider using techniques like Grid Search for exhaustive tuning, Random Search for quicker exploration, or Bayesian Optimization for an intelligent, efficient search of hyperparameters.
  - Avoid tuning models that do not show strong baseline performance or are unlikely to outperform others based on experimentation.
  - Ensure that hyperparameter tuning is done after completing feature selection, baseline modeling, and experimentation, ensuring that the model is stable and representative of the dataset.


- Model Evaluation **3p**
    - Evaluate models on the test dataset using regression metrics: *1p*
        - Mean Absolute Error (MAE)
        - Mean Squared Error (MSE)
        - Root Mean Squared Error (RMSE)
        - R² Score
    - Choose one metric for model comparison and explain your choice *1p*
    - Compare the results across different models. Save all experiment results in a table. *1p*

Feature Importance - Optional
- For applicable models (e.g., Decision Tree Regression), analyze feature importance and discuss its relevance to the problem.



Deliverables

- Notebook code with no errors.
- Code and results from experiments. Create a table with all experiment results, including the experiment name and metric values.
- Explain findings, choices, results.
- Potential areas for improvement or further exploration.


The goal of this notebook is to build and evaluate regression models that predict the hours-per-week variable from the Census / Adult Income dataset, using the datasets already preprocessed in Task 1:

train_preprocessed.csv

test_preprocessed.csv

This is a regression problem because the target variable is numeric and continuous.

In [26]:
# Load the required libraries
import pandas as pd
import numpy as np


from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline


from sklearn.linear_model import SGDRegressor, LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor


from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


import matplotlib.pyplot as plt
import seaborn as sns

In [27]:
# Use the previously preprocessed data (encoding + cleaning)
train_df = pd.read_csv("train_preprocessed.csv")
test_df = pd.read_csv("test_preprocessed.csv")


train_df.head()


Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,income_binary,workclass_?,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,...,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia,hours-per-week
0,-0.18807,-0.627661,-0.458342,0.0,0.0,-0.561303,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,38.0
1,0.992956,-0.76805,-0.458342,0.0,0.0,-0.561303,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,40.0
2,-0.335699,0.237442,-0.051335,0.0,0.0,1.781569,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,40.0
3,1.140584,-0.041966,-0.458342,0.0,0.0,-0.561303,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,40.0
4,-0.630955,-1.093903,-0.458342,0.0,0.0,-0.561303,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,52.5


In [28]:
# The target variable is hours-per-week
target = "hours-per-week"


X = train_df.drop(columns=[target])
y = train_df[target]


X_test = test_df.drop(columns=[target])
y_test = test_df[target]

In [29]:
# Hold out a validation set (20%) from the training set
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Final structure:
# Train: model fitting
# Validation: model selection and experiments
# Test: final evaluation

In [30]:
# Helper that fits a model and returns its validation metrics
def evaluate_model(model, X_tr, y_tr, X_val, y_val):
    model.fit(X_tr, y_tr)

    y_pred_val = model.predict(X_val)

    mae = mean_absolute_error(y_val, y_pred_val)
    mse = mean_squared_error(y_val, y_pred_val)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_val, y_pred_val)

    return mae, mse, rmse, r2

The following metrics were computed:

MAE (Mean Absolute Error): easy to interpret

MSE (Mean Squared Error): penalizes large errors

RMSE (Root Mean Squared Error): same unit as the target variable

R² Score: proportion of variance explained

Primary metric chosen: RMSE
It expresses the error in hours worked per week, so it is easy to interpret, and it weights large errors more heavily than MAE, which makes it more informative when predicting workload.
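To make the difference between the four metrics concrete, here is a small worked example computed by hand with NumPy. The four target values and predictions are hypothetical numbers chosen for easy arithmetic; they are not taken from the Census data.

```python
import numpy as np

# Hypothetical weekly hours: true values and model predictions.
y_true = np.array([40.0, 38.0, 50.0, 20.0])
y_pred = np.array([42.0, 38.0, 44.0, 24.0])

errors = y_true - y_pred                         # [-2, 0, 6, -4]

mae = np.mean(np.abs(errors))                    # (2 + 0 + 6 + 4) / 4
mse = np.mean(errors ** 2)                       # (4 + 0 + 36 + 16) / 4
rmse = np.sqrt(mse)                              # back in hours per week

ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

RMSE comes out larger than MAE here because the single 6-hour miss is squared before averaging, which is the sense in which RMSE penalizes large errors more.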

In [31]:
# Linear Regression
lin_reg = LinearRegression()
mae, mse, rmse, r2 = evaluate_model(lin_reg, X_train, y_train, X_val, y_val)


print(mae, rmse, r2)

4.402311867453494 5.505616909671608 0.20335499742382812


In [32]:
# SGDRegressor
# Use MSE loss (squared_error) and standardize the data
sgd_reg = Pipeline([
    ("scaler", StandardScaler()),
    ("sgd", SGDRegressor(loss="squared_error", random_state=42))
])

mae, mse, rmse, r2 = evaluate_model(sgd_reg, X_train, y_train, X_val, y_val)
print(mae, rmse, r2)

1182614.20097905 43585736.62073135 -49927708982378.29


In [33]:
# Decision Tree Regressor
dt_reg = DecisionTreeRegressor(random_state=42)

mae, mse, rmse, r2 = evaluate_model(dt_reg, X_train, y_train, X_val, y_val)
print(mae, rmse, r2)

5.382827506723012 7.582835863387755 -0.5111799007820794


In [34]:
# Random Forest Regressor
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
mae, mse, rmse, r2 = evaluate_model(rf_reg, X_train, y_train, X_val, y_val)

print(mae, rmse, r2)

4.30642121499003 5.535430882930745 0.19470366489480717


Models used and justification

Linear Regression: simple and interpretable, but assumes strictly linear relationships. Used only as a benchmark, per the requirements.

SGDRegressor: scalable and flexible (supports different losses), but sensitive to feature scaling and hyperparameters.

Decision Tree Regressor: captures nonlinear relationships and needs no scaling, but has a high risk of overfitting.

Random Forest Regressor: good performance and stable, but less interpretable and computationally more expensive.

Ridge & Lasso: useful for datasets with many features, because regularization shrinks the coefficients (Lasso can set some of them exactly to zero).
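As an illustration of how Ridge and Lasso behave differently when many features are present, the sketch below fits both on synthetic data where only three of twenty features carry signal. The data, the coefficients, and the `alpha` values are assumptions made for the example, not settings from this notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples, n_features = 1000, 20
X_demo = rng.normal(size=(n_samples, n_features))

# Only the first three features influence the target (hypothetical coefficients).
y_demo = (40.0 + 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1]
          + 1.5 * X_demo[:, 2] + rng.normal(size=n_samples))

X_scaled = StandardScaler().fit_transform(X_demo)

ridge_demo = Ridge(alpha=1.0).fit(X_scaled, y_demo)
lasso_demo = Lasso(alpha=0.2).fit(X_scaled, y_demo)

# Ridge shrinks coefficients but keeps them non-zero; Lasso can zero them out.
n_ridge = int(np.sum(ridge_demo.coef_ != 0))
n_lasso = int(np.sum(lasso_demo.coef_ != 0))
print(f"non-zero coefficients: Ridge={n_ridge}, Lasso={n_lasso}")
```

On the Census features, the same count of non-zero Lasso coefficients would show how many one-hot columns the model actually uses.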

EXPERIMENTATION

In [35]:
# Feature Engineering – Polynomial Features (SGD)
poly_pipeline = Pipeline([
    ("poly_features", PolynomialFeatures(degree=2, include_bias=False)),
    ("scaler", StandardScaler()),
    ("sgd", SGDRegressor(loss="squared_error", random_state=42))
])

mae, mse, rmse, r2 = evaluate_model(poly_pipeline, X_train, y_train, X_val, y_val)
print(mae, rmse, r2)

519800199665.3733 1964139137740.3965 -1.0139052151570497e+23


In [36]:
# Ridge Regression
ridge_reg = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge(alpha=1.0))
])

mae, mse, rmse, r2 = evaluate_model(ridge_reg, X_train, y_train, X_val, y_val)
print(mae, rmse, r2)

4.402309977242478 5.505614977455621 0.20335555659456783


In [37]:
# Lasso Regression
lasso_reg = Pipeline([
    ("scaler", StandardScaler()),
    ("lasso", Lasso(alpha=0.1))
])

mae, mse, rmse, r2 = evaluate_model(lasso_reg, X_train, y_train, X_val, y_val)
print(mae, rmse, r2)

4.399297889076194 5.517668525584432 0.19986352036252175


RESULTS COMPARISON TABLE

In [38]:
models = {
    "Linear Regression": lin_reg,
    "SGD Regressor": sgd_reg,
    "Decision Tree": dt_reg,
    "Random Forest": rf_reg,
    "Polynomial SGD": poly_pipeline,
    "Ridge": ridge_reg,
    "Lasso": lasso_reg,
}

results = []

for name, model in models.items():
    mae, mse, rmse, r2 = evaluate_model(
        model, X_train, y_train, X_val, y_val
    )
    results.append([name, mae, mse, rmse, r2])

results_df = pd.DataFrame(
    results, columns=["Model", "MAE", "MSE", "RMSE", "R2"]
)

results_df

Unnamed: 0,Model,MAE,MSE,RMSE,R2
0,Linear Regression,4.402312,30.31182,5.505617,0.203355
1,SGD Regressor,1182614.0,1899716000000000.0,43585740.0,-49927710000000.0
2,Decision Tree,5.382828,57.4994,7.582836,-0.5111799
3,Random Forest,4.306421,30.641,5.535431,0.1947037
4,Polynomial SGD,519800200000.0,3.857843e+24,1964139000000.0,-1.013905e+23
5,Ridge,4.40231,30.3118,5.505615,0.2033556
6,Lasso,4.399298,30.44467,5.517669,0.1998635


ANALYSIS OF RESULTS. KEY OBSERVATIONS

SGD Regressor and Polynomial SGD give disastrous results:

A huge RMSE and an enormously negative R² indicate that the model does not converge or has scaling / hyperparameter problems.

Possible causes are a learning rate that is too high, data that is not scaled appropriately (although both pipelines include a StandardScaler step), or polynomial features that are too aggressive.
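One way to probe the learning-rate hypothesis is to rerun SGDRegressor with a much smaller initial step. The sketch below uses synthetic data rather than the Census features, and the settings (`eta0=0.001`, `learning_rate="adaptive"`, Huber loss with `epsilon=5.0`) are illustrative assumptions, not tuned values. Whether they repair the runs above would have to be checked on the real data.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features: a linear signal plus noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(2000, 10))
true_coef = np.linspace(-2.0, 2.0, 10)            # hypothetical coefficients
y_demo = 40.0 + X_demo @ true_coef + rng.normal(size=2000)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

cautious_sgd = Pipeline([
    ("scaler", StandardScaler()),
    ("sgd", SGDRegressor(
        loss="huber", epsilon=5.0,                # caps the gradient of large residuals
        learning_rate="adaptive", eta0=0.001,     # small step, reduced when progress stalls
        max_iter=2000, tol=1e-4, random_state=42,
    )),
])
cautious_sgd.fit(X_tr, y_tr)

val_r2 = r2_score(y_va, cautious_sgd.predict(X_va))
print(f"validation R2 with the cautious settings: {val_r2:.3f}")
```

A finite, positive validation R² here only shows that the configuration is stable on well-behaved data. On the notebook's data it would be worth comparing it against the default `eta0` to see whether the step size is really the cause.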

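Decision Tree: MAE and RMSE are higher than for the linear models (RMSE 7.58 vs. about 5.5), which may indicate overfitting on the training set.
The negative R² (-0.51) means that on validation it performs worse than always predicting the mean.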
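The overfitting explanation can be checked by comparing training and validation R² for an unconstrained tree and a depth-limited one. Below is a minimal sketch on synthetic noisy data; the `max_depth=3` and `min_samples_leaf=20` values are illustrative assumptions rather than tuned settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a weak linear signal buried in noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1500, 5))
y_demo = 40.0 + 3.0 * X_demo[:, 0] + rng.normal(scale=5.0, size=1500)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

full_tree = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)
shallow_tree = DecisionTreeRegressor(
    max_depth=3, min_samples_leaf=20, random_state=42
).fit(X_tr, y_tr)

# .score() returns R²; a large train/validation gap signals overfitting.
scores = {}
for name, tree in [("unconstrained", full_tree), ("depth-limited", shallow_tree)]:
    scores[name] = (tree.score(X_tr, y_tr), tree.score(X_va, y_va))
    print(f"{name}: train R2={scores[name][0]:.3f}, val R2={scores[name][1]:.3f}")
```

Running the same comparison with `dt_reg` on `X_train` and `X_val` would show whether the notebook's tree memorizes the training set in the same way.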

The linear models show stable performance: RMSE ≈ 5.5 and R² ≈ 0.2, with only small differences between them. Regularization (Ridge/Lasso) brings no major improvement.

Random Forest: the lowest MAE (4.306 vs. 4.402 for Linear), an RMSE marginally higher than Linear (5.535 vs. 5.506), and a similar R² (≈0.195 vs. ≈0.203). Performance is comparable, but the model can capture nonlinear relationships.

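Evaluation on the test set

We selected the Random Forest Regressor as the final model. On validation RMSE it is marginally behind Ridge and Linear Regression (5.535 vs. 5.506), but it has the lowest validation MAE (4.306).
For the final evaluation, the code below refits the model on the training split (X_train, y_train), not on train + validation, and tests it on the test set.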

In [43]:
best_model = rf_reg
best_model.fit(X_train, y_train)


y_test_pred = best_model.predict(X_test)


mae_test = mean_absolute_error(y_test, y_test_pred)
mse_test = mean_squared_error(y_test, y_test_pred)
rmse_test = np.sqrt(mse_test)
r2_test = r2_score(y_test, y_test_pred)

mae_test, rmse_test, r2_test, mse_test

(4.275987251953639,
 np.float64(5.47172529255547),
 0.22543519716116545,
 29.939777677191238)
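If the final model is meant to see both the training and the validation rows before it is tested, the two splits can be concatenated and the model refit on the result. The sketch below demonstrates this on synthetic data; the feature names `f0` to `f3`, the data itself, and `n_estimators=50` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed Census data.
rng = np.random.default_rng(1)
X_all = pd.DataFrame(rng.normal(size=(600, 4)), columns=["f0", "f1", "f2", "f3"])
y_all = pd.Series(
    40.0 + 4.0 * X_all["f0"] + rng.normal(scale=2.0, size=600),
    name="hours-per-week",
)

# Same three-way structure as in the notebook: train, validation, test.
X_dev, X_te, y_dev, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_dev, y_dev, test_size=0.2, random_state=42)

# Recombine train + validation before the final fit.
X_full = pd.concat([X_tr, X_va], axis=0)
y_full = pd.concat([y_tr, y_va], axis=0)

final_model = RandomForestRegressor(n_estimators=50, random_state=42)
final_model.fit(X_full, y_full)

rmse_te = np.sqrt(mean_squared_error(y_te, final_model.predict(X_te)))
print(f"test RMSE after refitting on train + validation: {rmse_te:.3f}")
```

In this notebook, `X` and `y` from cell In [28] already hold the combined train + validation rows, so `best_model.fit(X, y)` would have the same effect as the concatenation above.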

The final results (MAE, MSE, RMSE, R²) reflect the model's performance on unseen data and allow the different approaches to be compared.