# **Final Project Task 3 - Census Modeling Regression**

Requirements
- Create a regression model on the Census dataset, with 'hours-per-week' target

- You can use models (estimators) from sklearn, but feel free to use any library for traditional ML.
    - Note: in sklearn, the LinearRegression estimator is based on OLS, a statistical method. Please use the SGDRegressor estimator, since it is based on gradient descent.
    - You can use the LinearRegression estimator, but only as a comparison with the SGDRegressor - Optional.

- Model Selection and Setup **2p**:
    - Implement multiple models, to solve a regression problem using traditional ML: 
        - Linear Regression
        - Decision Tree Regression
        - Random Forest Regression - Optional
        - Ridge Regression - Optional
        - Lasso Regression - Optional
    - Choose a loss (or experiment with different losses) for the model and justify the choice. *1p*
        - MSE, MAE, RMSE, Huber Loss or others
    - Justify model choices based on dataset characteristics and task requirements; specify model pros and cons. *1p*


- Data Preparation
    - Use the preprocessed datasets from Task 1.
    - From the train set, create an extra validation set, if necessary. In total there will be train, validation and test datasets.
    - Be sure all models have their data preprocessed as needed. Some models require different, or no encoding for some features.


- Model Training and Experimentation **10p**
    - Establish a Baseline Model *2p*
        - For each model type, train a simple model with default settings as a baseline.
        - Evaluate its performance to establish a benchmark for comparison.
    - Make plots with train, validation loss and metric on epochs (or on steps), if applicable. - Optional
    - Feature Selection: - Optional
        - Use insights from EDA in Task 2 to identify candidate features by analyzing patterns, relationships, and distributions.
    - Experimentation: *8p*
        - For each baseline model type, iteratively experiment with different combinations of features and transformations.
        - Experiment with feature engineering techniques such as interaction terms, polynomial features, or scaling transformations.
        - Identify the model with the best performance metrics on the test set.
        - You may need several differently preprocessed versions of the dataset.
- Hyperparameter Tuning - Optional
  - Perform hyperparameter tuning only on the best-performing model after evaluating all model types and experiments. 
  - Consider using techniques like Grid Search for exhaustive tuning, Random Search for quicker exploration, or Bayesian Optimization for an intelligent, efficient search of hyperparameters.
  - Avoid tuning models that do not show strong baseline performance or are unlikely to outperform others based on experimentation.
  - Ensure that hyperparameter tuning is done after completing feature selection, baseline modeling, and experimentation, ensuring that the model is stable and representative of the dataset.


- Model Evaluation **3p**
    - Evaluate models on the test dataset using regression metrics: *1p*
        - Mean Absolute Error (MAE)
        - Mean Squared Error (MSE)
        - Root Mean Squared Error (RMSE)
        - R² Score
    - Choose one metric for model comparison and explain your choice *1p*
    - Compare the results across different models. Save all experiment results into a table. *1p*

Feature Importance - Optional
- For applicable models (e.g., Decision Tree Regression), analyze feature importance and discuss its relevance to the problem.



Deliverables

- Notebook code with no errors.
- Code and results from experiments. Create a table with all experiment results, including the experiment name and metric results.
- Explain findings, choices, results.
- Potential areas for improvement or further exploration.


In [1]:
import pandas as pd

X_train = pd.read_csv("X_train.csv")
X_test = pd.read_csv("X_test.csv")
y_train = pd.read_csv("y_train.csv").squeeze()
y_test = pd.read_csv("y_test.csv").squeeze()


In [2]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((24111, 84), (6028, 84), (24111,), (6028,))

In [3]:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_train_mean = y_train.mean()

y_pred_baseline = np.full(shape=y_test.shape, fill_value=y_train_mean)


In [6]:
mae_baseline = mean_absolute_error(y_test, y_pred_baseline)
mse_baseline = mean_squared_error(y_test, y_pred_baseline)
rmse_baseline = np.sqrt(mean_squared_error(y_test, y_pred_baseline))
r2_baseline = r2_score(y_test, y_pred_baseline)

mae_baseline, mse_baseline, rmse_baseline, r2_baseline

(7.564415313250538,
 144.84646681796283,
 np.float64(12.035217771937608),
 -6.972577415731429e-05)

A baseline model that always predicts the mean of the training target was used as a benchmark. On the test set it reaches an RMSE of about 12.04 hours and an R² of about -0.00007. An R² this close to zero means the model explains none of the variance in the target, so these values set the minimum performance that later models must exceed.
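The same kind of baseline can also be expressed with sklearn's `DummyRegressor`. The sketch below is only an illustration on synthetic targets (the `*_demo` names and the generated values are assumptions, not the Census data); it shows why a constant mean prediction gives an R² at or slightly below zero.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

# Synthetic stand-in for the target (hypothetical values, not the Census data).
rng = np.random.default_rng(0)
y_train_demo = rng.normal(loc=40.0, scale=12.0, size=1000)
y_test_demo = rng.normal(loc=40.0, scale=12.0, size=250)

# DummyRegressor ignores the features, so a column of zeros is enough here.
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

dummy = DummyRegressor(strategy="mean")
dummy.fit(X_train_demo, y_train_demo)
y_pred_demo = dummy.predict(X_test_demo)

# Every prediction equals the training mean.
assert np.allclose(y_pred_demo, y_train_demo.mean())

# R² compares the model against the *test* mean. A constant that differs
# slightly from the test mean can only do worse, so the score is <= 0.
r2_demo = r2_score(y_test_demo, y_pred_demo)
print(f"R² of the mean predictor: {r2_demo:.5f}")
```

Because `DummyRegressor` follows the estimator API, it can be scored by the same evaluation code as the other models.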

In [7]:
from sklearn.model_selection import train_test_split

X_train_sub, X_val, y_train_sub, y_val = train_test_split(
    X_train,
    y_train,
    test_size=0.2,
    random_state=42
)

In [8]:
X_train_sub.shape, X_val.shape, y_train_sub.shape, y_val.shape

((19288, 84), (4823, 84), (19288,), (4823,))

In [9]:
from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(
    loss="squared_error",
    max_iter=1000,
    tol=1e-3,
    random_state=42
)

sgd_reg.fit(X_train_sub, y_train_sub)

| Parameter | Value |
|---|---|
| loss | 'squared_error' |
| penalty | 'l2' |
| alpha | 0.0001 |
| l1_ratio | 0.15 |
| fit_intercept | True |
| max_iter | 1000 |
| tol | 0.001 |
| shuffle | True |
| verbose | 0 |
| epsilon | 0.1 |


In [10]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

y_val_pred_sgd = sgd_reg.predict(X_val)

mae_sgd = mean_absolute_error(y_val, y_val_pred_sgd)
mse_sgd = mean_squared_error(y_val, y_val_pred_sgd)
rmse_sgd = np.sqrt(mean_squared_error(y_val, y_val_pred_sgd))
r2_sgd = r2_score(y_val, y_val_pred_sgd)

mae_sgd, mse_sgd, rmse_sgd, r2_sgd

(7.318598968481576,
 107.7663905972761,
 np.float64(10.381059223281413),
 0.2288329315380324)

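On the validation set, the SGDRegressor outperformed the mean baseline (which was scored on the test set): RMSE fell from about 12.04 to 10.38 and R² rose from roughly 0 to 0.229, so the model explains about 23% of the variance in the target. The gain in MAE was smaller, from 7.56 to 7.32.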

In [11]:
from sklearn.tree import DecisionTreeRegressor

dt_reg = DecisionTreeRegressor(
    random_state=42
)

dt_reg.fit(X_train_sub, y_train_sub)

| Parameter | Value |
|---|---|
| criterion | 'squared_error' |
| splitter | 'best' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | None |
| random_state | 42 |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |


In [12]:
y_val_pred_dt = dt_reg.predict(X_val)

mae_dt = mean_absolute_error(y_val, y_val_pred_dt)
mse_dt = mean_squared_error(y_val, y_val_pred_dt)
rmse_dt = np.sqrt(mean_squared_error(y_val, y_val_pred_dt))
r2_dt = r2_score(y_val, y_val_pred_dt)

mae_dt, mse_dt, rmse_dt, r2_dt

(9.61320754716981,
 202.79945054945054,
 np.float64(14.240767203681498),
 -0.45121551254654824)

RMSE was chosen as the primary comparison metric because it is expressed in the same units as the target variable and penalizes large prediction errors more heavily.
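A small worked example makes the difference between MAE and RMSE concrete. The numbers below are made up for illustration: both sets of predictions have the same MAE, but the one with a single large miss has twice the RMSE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical targets and two sets of predictions (illustration only).
y_true_demo = np.array([40.0, 40.0, 40.0, 40.0])
pred_even = np.array([45.0, 35.0, 45.0, 35.0])   # four errors of 5 hours
pred_spiky = np.array([40.0, 40.0, 40.0, 60.0])  # one error of 20 hours


def mae_rmse(y_true, y_pred):
    """Return (MAE, RMSE) for a pair of target and prediction arrays."""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    return mae, rmse


mae_even, rmse_even = mae_rmse(y_true_demo, pred_even)     # MAE 5.0, RMSE 5.0
mae_spiky, rmse_spiky = mae_rmse(y_true_demo, pred_spiky)  # MAE 5.0, RMSE 10.0

print(f"even errors:  MAE={mae_even:.1f}, RMSE={rmse_even:.1f}")
print(f"single spike: MAE={mae_spiky:.1f}, RMSE={rmse_spiky:.1f}")
```

If occasional large misses matter less for the task, MAE or a Huber loss would be a reasonable alternative to RMSE.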

In [13]:
results = pd.DataFrame({
    "Model": ["Baseline (Mean)", "SGDRegressor", "Decision Tree"],
    "MAE": [mae_baseline, mae_sgd, mae_dt],
    "MSE": [mse_baseline, mse_sgd, mse_dt],
    "RMSE": [rmse_baseline, rmse_sgd, rmse_dt],
    "R2": [r2_baseline, r2_sgd, r2_dt]
})

results

|   | Model | MAE | MSE | RMSE | R2 |
|---|---|---|---|---|---|
| 0 | Baseline (Mean) | 7.564415 | 144.846467 | 12.035218 | -7e-05 |
| 1 | SGDRegressor | 7.318599 | 107.766391 | 10.381059 | 0.228833 |
| 2 | Decision Tree | 9.613208 | 202.799451 | 14.240767 | -0.451216 |


In [14]:
sgd_final = SGDRegressor(
    loss="squared_error",
    max_iter=1000,
    tol=1e-3,
    random_state=42
)

sgd_final.fit(X_train, y_train)

| Parameter | Value |
|---|---|
| loss | 'squared_error' |
| penalty | 'l2' |
| alpha | 0.0001 |
| l1_ratio | 0.15 |
| fit_intercept | True |
| max_iter | 1000 |
| tol | 0.001 |
| shuffle | True |
| verbose | 0 |
| epsilon | 0.1 |


In [15]:
y_test_pred = sgd_final.predict(X_test)

mae_test = mean_absolute_error(y_test, y_test_pred)
mse_test = mean_squared_error(y_test, y_test_pred)
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))
r2_test = r2_score(y_test, y_test_pred)

mae_test, mse_test, rmse_test, r2_test

(7.37882564477379,
 112.55000321715379,
 np.float64(10.608958630193342),
 0.22291614475679522)

In [16]:
results["Test_MAE"] = [mae_baseline, np.nan, np.nan]
results["Test_MSE"] = [mse_baseline, np.nan, np.nan]
results["Test_RMSE"] = [rmse_baseline, np.nan, np.nan]
results["Test_R2"] = [r2_baseline, np.nan, np.nan]

results.loc[results["Model"] == "SGDRegressor", 
            ["Test_MAE", "Test_MSE", "Test_RMSE", "Test_R2"]] = [
    mae_test, mse_test, rmse_test, r2_test
]

results

|   | Model | MAE | MSE | RMSE | R2 | Test_MAE | Test_MSE | Test_RMSE | Test_R2 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Baseline (Mean) | 7.564415 | 144.846467 | 12.035218 | -7e-05 | 7.564415 | 144.846467 | 12.035218 | -7e-05 |
| 1 | SGDRegressor | 7.318599 | 107.766391 | 10.381059 | 0.228833 | 7.378826 | 112.550003 | 10.608959 | 0.222916 |
| 2 | Decision Tree | 9.613208 | 202.799451 | 14.240767 | -0.451216 | NaN | NaN | NaN | NaN |


A regression analysis was conducted to predict hours-per-week using preprocessed Census data. A mean-based baseline model was used to establish a minimum performance benchmark.

Among the evaluated models, the SGDRegressor achieved the best performance, outperforming both the baseline and the Decision Tree Regressor. The SGD model achieved an RMSE of approximately 10.6 hours on the test set and explained around 22% of the variance in working hours.

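The Decision Tree Regressor generalized poorly: with default parameters it reached an R² of about -0.45 on the validation set, worse than predicting the mean, which points to overfitting from an unrestricted tree depth. Based on the validation results, the SGDRegressor was selected as the final model, retrained on the full training set, and evaluated on the test set. Its validation and test metrics are close (RMSE 10.38 vs. 10.61), which suggests stable behavior, and it scales well to larger datasets because it is trained with gradient descent.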
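The overfitting pattern described for the default tree can be reproduced on synthetic data. The sketch below is an illustration under assumed data (a weak linear signal with heavy noise, hypothetical `*_demo` names), not a rerun of the Census experiment. It compares a fully grown tree with one limited to `max_depth=3`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: one informative feature, nine noise features (assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1500, 10))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=3.0, size=1500)

X_tr_demo, X_val_demo, y_tr_demo, y_val_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

tree_scores = {}
for depth in (None, 3):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_tr_demo, y_tr_demo)
    # .score returns R² for regressors.
    tree_scores[depth] = (
        tree.score(X_tr_demo, y_tr_demo),
        tree.score(X_val_demo, y_val_demo),
    )

for depth, (r2_train, r2_val) in tree_scores.items():
    print(f"max_depth={depth}: train R²={r2_train:.3f}, validation R²={r2_val:.3f}")
```

A near-perfect training score next to a much lower validation score is the signature of memorized noise; constraining `max_depth` or `min_samples_leaf` is the usual first remedy.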

Future improvements could include hyperparameter tuning, additional feature engineering, or exploring ensemble methods such as Random Forests.
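As one possible starting point for the tuning step, the sketch below runs a small grid search over `SGDRegressor` regularization settings, scored by RMSE. It uses synthetic data from `make_regression`; the grid values and the `*_demo` names are assumptions chosen for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the real training set.
X_demo, y_demo = make_regression(
    n_samples=500, n_features=10, noise=10.0, random_state=42
)

pipeline_demo = Pipeline([
    ("scale", StandardScaler()),
    ("sgd", SGDRegressor(random_state=42)),
])

# Hypothetical grid: regularization strength and penalty type.
param_grid_demo = {
    "sgd__alpha": [1e-4, 1e-2],
    "sgd__penalty": ["l2", "l1"],
}

search_demo = GridSearchCV(
    pipeline_demo,
    param_grid_demo,
    scoring="neg_root_mean_squared_error",  # sklearn maximizes, hence the sign
    cv=3,
)
search_demo.fit(X_demo, y_demo)

best_params_demo = search_demo.best_params_
best_rmse_demo = -search_demo.best_score_
print(best_params_demo, f"CV RMSE={best_rmse_demo:.2f}")
```

The search should be fitted on the training data only, so that the test set remains untouched until the final evaluation.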