# **Final Project Task 3 - Census Modeling Regression**

Requirements
- Create a regression model on the Census dataset, with 'hours-per-week' target

- You can use models (estimators) from sklearn, but feel free to use any library for traditional ML.
    - Note: in sklearn, the LinearRegression estimator is based on OLS, a statistical method. Please use the SGDRegressor estimator instead, since it is based on gradient descent.
    - You can use the LinearRegression estimator, but only as a comparison with the SGDRegressor - Optional.

- Model Selection and Setup **2p**:
    - Implement multiple models, to solve a regression problem using traditional ML: 
        - Linear Regression
        - Decision Tree Regression
        - Random Forest Regression - Optional
        - Ridge Regression - Optional
        - Lasso Regression - Optional
    - Choose a loss (or experiment with different losses) for the model and justify the choice. *1p*
        - MSE, MAE, RMSE, Huber Loss or others
    - Justify model choices based on dataset characteristics and task requirements; specify model pros and cons. *1p*


- Data Preparation
    - Use the preprocessed datasets from Task 1.
    - From the train set, create an extra validation set, if necessary, so that in total there are train, validation and test datasets.
    - Be sure all models have their data preprocessed as needed. Some models require different, or no encoding for some features.


- Model Training and Experimentation **10p**
    - Establish a Baseline Model *2p*
        - For each model type, train a simple model with default settings as a baseline.
        - Evaluate its performance to establish a benchmark for comparison.
    - Make plots with train, validation loss and metric on epochs (or on steps), if applicable. - Optional
    - Feature Selection: - Optional
        - Use insights from EDA in Task 2 to identify candidate features by analyzing patterns, relationships, and distributions.
    - Experimentation: *8p*
        - For each baseline model type, iteratively experiment with different combinations of features and transformations.
        - Experiment with feature engineering techniques such as interaction terms, polynomial features, or scaling transformations.
        - Identify the model that has the best performance metrics on the test set.
        - You may need multiple preprocessed versions of the dataset.
- Hyperparameter Tuning - Optional
  - Perform hyperparameter tuning only on the best-performing model after evaluating all model types and experiments. 
  - Consider using techniques like Grid Search for exhaustive tuning, Random Search for quicker exploration, or Bayesian Optimization for an intelligent, efficient search of hyperparameters.
  - Avoid tuning models that do not show strong baseline performance or are unlikely to outperform others based on experimentation.
  - Ensure that hyperparameter tuning is done after completing feature selection, baseline modeling, and experimentation, ensuring that the model is stable and representative of the dataset.


- Model Evaluation **3p**
    - Evaluate models on the test dataset using regression metrics: *1p*
        - Mean Absolute Error (MAE)
        - Mean Squared Error (MSE)
        - Root Mean Squared Error (RMSE)
        - R² Score
    - Choose one metric for model comparison and explain your choice *1p*
    - Compare the results across different models. Save all experiment results into a table. *1p*

Feature Importance - Optional
- For applicable models (e.g., Decision Tree Regression), analyze feature importance and discuss its relevance to the problem.



Deliverables

- Notebook code with no errors.
- Code and results from experiments. Create a table with all experiment results, including the experiment name and metric values.
- Explain findings, choices, results.
- Potential areas for improvement or further exploration.


In [1]:
import pandas as pd
import numpy as np
 
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [2]:
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
    "hours-per-week", "native-country", "income"
]

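# skipinitialspace=True strips the space after each comma, so missing values arrive as "?" (not " ?")
data = pd.read_csv(data_url, header=None, names=columns, na_values="?", skipinitialspace=True)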
data.sample(10)

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14785 | 17 | Private | 151141 | 11th | 7 | Never-married | Handlers-cleaners | Own-child | White | Male | 0 | 0 | 15 | United-States | <=50K |
| 17959 | 19 | Private | 29526 | Some-college | 10 | Never-married | Other-service | Own-child | White | Female | 0 | 0 | 18 | United-States | <=50K |
| 541 | 29 | Private | 133937 | Doctorate | 16 | Never-married | Prof-specialty | Own-child | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 15852 | 32 | Private | 185027 | Some-college | 10 | Married-civ-spouse | Sales | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 7832 | 43 | Private | 35910 | Some-college | 10 | Married-civ-spouse | Sales | Husband | White | Male | 0 | 0 | 43 | United-States | >50K |
| 20111 | 32 | Private | 45796 | 12th | 8 | Married-civ-spouse | Handlers-cleaners | Husband | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 17485 | 38 | Private | 229700 | Masters | 14 | Married-civ-spouse | Prof-specialty | Husband | Black | Male | 0 | 0 | 40 | United-States | >50K |
| 15875 | 59 | Private | 113959 | HS-grad | 9 | Married-civ-spouse | Adm-clerical | Husband | White | Male | 0 | 0 | 45 | United-States | >50K |
| 19989 | 42 | Private | 68729 | Some-college | 10 | Never-married | Craft-repair | Not-in-family | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | <=50K |
| 30143 | 54 | Self-emp-not-inc | 230951 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |


In [3]:
X_train = pd.read_csv("X_train_preprocessed.csv")
X_test = pd.read_csv("X_test_preprocessed.csv")
y_train = pd.read_csv("y_train.csv").squeeze()
y_test = pd.read_csv("y_test.csv").squeeze()

In [4]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((26029, 109), (6508, 109), (26029,), (6508,))

In [5]:
X_train_final, X_val, y_train_final, y_val = train_test_split(
    X_train,
    y_train,
    test_size=0.2,
    random_state=42
)

In [6]:
def evaluate_model(y_true, y_pred):
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mean_squared_error(y_true, y_pred),
        "RMSE": np.sqrt(mean_squared_error(y_true, y_pred)),
        "R2": r2_score(y_true, y_pred)
    }

In [7]:
from sklearn.linear_model import SGDRegressor

In [8]:
sgd = SGDRegressor(
    loss="squared_error",  # MSE
    max_iter=1000,
    random_state=42
)
 
sgd.fit(X_train_final, y_train_final)

| Parameter | Value |
|---|---|
| loss | 'squared_error' |
| penalty | 'l2' |
| alpha | 0.0001 |
| l1_ratio | 0.15 |
| fit_intercept | True |
| max_iter | 1000 |
| tol | 0.001 |
| shuffle | True |
| verbose | 0 |
| epsilon | 0.1 |


In [9]:
y_val_pred_sgd = sgd.predict(X_val)
sgd_results = evaluate_model(y_val, y_val_pred_sgd)
sgd_results

{'MAE': 25284963.33237634,
 'MSE': 2579337132202189.5,
 'RMSE': np.float64(50787174.87911874),
 'R2': -67789271709873.555}
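Errors of this magnitude usually mean that gradient descent diverged rather than that the model is merely weak. One common cause, and it is only an assumption here since the Task 1 preprocessing is not shown, is that at least one feature is still on a very large scale (for example something like `fnlwgt`). The sketch below uses synthetic, hypothetical data to show how `StandardScaler` can be chained in front of `SGDRegressor` so the scaling is learned on the training data only.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the census features (hypothetical data, not the real set):
# the first column is on a very large scale, similar in magnitude to `fnlwgt`.
rng = np.random.default_rng(42)
n = 2000
X = np.column_stack([
    rng.normal(190_000, 100_000, n),  # large-scale feature
    rng.normal(38, 13, n),            # age-like feature
    rng.integers(0, 2, n),            # one-hot-like feature
])
y = 40 + 0.1 * (X[:, 1] - 38) + 3 * X[:, 2] + rng.normal(0, 5, n)

# The scaler is fitted inside the pipeline, so every column has zero mean
# and unit variance before SGDRegressor sees it.
scaled_sgd = make_pipeline(StandardScaler(), SGDRegressor(random_state=42))
scaled_sgd.fit(X, y)

rmse_scaled = np.sqrt(mean_squared_error(y, scaled_sgd.predict(X)))
print(f"RMSE with scaling: {rmse_scaled:.2f}")
```

In the notebook, the same pipeline object could be fitted on `X_train_final` and scored on `X_val` in place of the bare `sgd` estimator; whether that resolves the divergence depends on what Task 1 already did to the features.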

In [10]:
from sklearn.linear_model import LinearRegression

In [11]:
lr = LinearRegression()
lr.fit(X_train_final, y_train_final)
 
y_val_pred_lr = lr.predict(X_val)
lr_results = evaluate_model(y_val, y_val_pred_lr)
lr_results

{'MAE': 4.322506939399718,
 'MSE': 29.43091630012911,
 'RMSE': np.float64(5.425026847871733),
 'R2': 0.22650654820114613}

In [12]:
from sklearn.tree import DecisionTreeRegressor

In [13]:
dt = DecisionTreeRegressor(random_state=42)
dt.fit(X_train_final, y_train_final)
 
y_val_pred_dt = dt.predict(X_val)
dt_results = evaluate_model(y_val, y_val_pred_dt)
dt_results

{'MAE': 5.33029197080292,
 'MSE': 57.038273146369576,
 'RMSE': np.float64(7.5523687109654265),
 'R2': -0.4990607268465417}

### Loss Function Choice
Mean Squared Error (MSE) was used as the primary loss function because it penalizes large errors more strongly, which is important when predicting working hours where extreme underestimation or overestimation is costly. RMSE is also reported for interpretability in the original units.
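To make the trade-off between the candidate losses concrete, here is a small worked sketch on four hypothetical residuals, one of which is an outlier. The helper names and the `delta` threshold are illustrative choices; the Huber form is the textbook one, and sklearn's `SGDRegressor(loss="huber")` controls its threshold through `epsilon` instead.

```python
import numpy as np

def mse(residuals):
    """Mean squared error: squares each residual, so outliers dominate."""
    return float(np.mean(residuals ** 2))

def mae(residuals):
    """Mean absolute error: every residual contributes linearly."""
    return float(np.mean(np.abs(residuals)))

def huber(residuals, delta=1.0):
    """Textbook Huber loss: quadratic up to `delta`, linear beyond it."""
    abs_r = np.abs(residuals)
    quadratic = 0.5 * residuals ** 2
    linear = delta * (abs_r - 0.5 * delta)
    return float(np.mean(np.where(abs_r <= delta, quadratic, linear)))

# Three small errors and one 20-hour outlier (hypothetical values).
residuals = np.array([1.0, -2.0, 1.0, 20.0])

print("MSE  :", mse(residuals))               # 101.5
print("RMSE :", np.sqrt(mse(residuals)))      # about 10.07
print("MAE  :", mae(residuals))               # 6.0
print("Huber:", huber(residuals, delta=5.0))  # 22.625
```

The single outlier accounts for almost all of the MSE, while MAE and Huber grow only linearly with it. That is the behaviour to weigh when deciding whether large misses should dominate training.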
### Model Selection Justification
 
- **SGD Regressor**: Scales well to large datasets and uses gradient descent,
  making it suitable for high-dimensional encoded data, but it is sensitive to
  feature scale and learning-rate settings.
- **Linear Regression**: Provides a strong interpretable baseline but assumes
  linear relationships.
- **Decision Tree Regressor**: Captures non-linear relationships but is prone
  to overfitting without tuning.
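The overfitting risk noted for the decision tree can be illustrated with a minimal sketch on synthetic, hypothetical data with a weak signal and substantial noise. It compares a fully grown tree with one capped at `max_depth=3`; the depth value is an arbitrary illustrative choice, not a recommendation for the census data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: one informative feature, four noise features, noisy target.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = 40 + 5 * (X[:, 0] > 0) + rng.normal(0, 5, size=2000)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)

# Default settings grow the tree until every leaf is pure.
full_tree = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)
# Capping the depth forces the tree to keep only the strongest splits.
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_tr, y_tr)

scores = {
    "full_train_r2": full_tree.score(X_tr, y_tr),
    "full_val_r2": full_tree.score(X_va, y_va),
    "shallow_train_r2": shallow_tree.score(X_tr, y_tr),
    "shallow_val_r2": shallow_tree.score(X_va, y_va),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

A near-perfect training R² next to a much lower (possibly negative) validation R² is the memorization pattern; the depth-limited tree gives up training fit in exchange for better validation behaviour on this synthetic example.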

In [14]:
results_df = pd.DataFrame.from_dict(
    {
        "SGDRegressor": sgd_results,
        "LinearRegression": lr_results,
        "DecisionTree": dt_results
    },
    orient="index"
)
 
results_df

| | MAE | MSE | RMSE | R2 |
|---|---|---|---|---|
| SGDRegressor | 25284960.0 | 2579337000000000.0 | 50787170.0 | -67789270000000.0 |
| LinearRegression | 4.322507 | 29.43092 | 5.425027 | 0.2265065 |
| DecisionTree | 5.330292 | 57.03827 | 7.552369 | -0.4990607 |


In [15]:
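# Note: in the validation table above, SGDRegressor has the highest RMSE
# and LinearRegression the lowest; the test evaluation below uses the SGD model.
best_model = sgd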

In [16]:
y_test_pred = best_model.predict(X_test)
test_results = evaluate_model(y_test, y_test_pred)
test_results

{'MAE': 171679063.35418826,
 'MSE': 6.962183174680933e+19,
 'RMSE': np.float64(8343969783.430985),
 'R2': -1.8011697001119795e+18}

### Model Comparison Metric
RMSE was chosen as the primary comparison metric because it preserves
the unit of the target variable (hours) while penalizing large errors.
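Selecting by RMSE can be done programmatically rather than by eye. The sketch below re-enters the validation metrics reported earlier in this notebook (rounded) and ranks the models; the variable names are illustrative.

```python
import pandas as pd

# Validation metrics copied (rounded) from the baseline runs above.
validation_results = pd.DataFrame(
    {
        "MAE": [25284963.33, 4.3225, 5.3303],
        "RMSE": [50787174.88, 5.4250, 7.5524],
        "R2": [-67789271709873.555, 0.2265, -0.4991],
    },
    index=["SGDRegressor", "LinearRegression", "DecisionTree"],
)

# Lower RMSE is better, so sort ascending and take the first row.
ranked = validation_results.sort_values("RMSE")
best_name = ranked.index[0]

print(ranked)
print("Lowest validation RMSE:", best_name)
```

On these numbers the ranking is LinearRegression, then DecisionTree, then SGDRegressor.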

## Final Model Evaluation Summary
 
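Three regression models were evaluated on the Census dataset. On the
validation set, LinearRegression achieved the lowest RMSE (5.43, R² = 0.23),
followed by the Decision Tree (RMSE 7.55, R² = -0.50). The SGDRegressor
baseline diverged (validation RMSE ≈ 5.1e7), and the test-set evaluation
above was run on that SGD model, so its result (RMSE ≈ 8.3e9) reflects the
divergence rather than the performance of the best model.
 
The default Decision Tree captured non-linear patterns but scored a negative
R² on validation, a sign of overfitting, while Linear Regression provided the
most stable and interpretable results.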
 
Future improvements may include:
- Hyperparameter tuning
- Ensemble methods
- Advanced feature selection
- Separate preprocessing pipelines per model type
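As a starting point for the hyperparameter tuning item, here is a minimal sketch of a grid search over a scaled `SGDRegressor` pipeline. The data is synthetic and hypothetical, and the grid values and step names (`scale`, `sgd`) are illustrative assumptions rather than tuned recommendations.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data with features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y = 40 + 2 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 3, size=500)

# Scaling lives inside the pipeline so each CV fold fits its own scaler.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("sgd", SGDRegressor(random_state=42)),
])

# Step-name prefixes ("sgd__") route each parameter to the right pipeline step.
param_grid = {
    "sgd__alpha": [1e-4, 1e-3, 1e-2],
    "sgd__penalty": ["l2", "l1"],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_root_mean_squared_error",  # sklearn maximizes scores, hence the sign
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_
print(best_params, f"CV RMSE: {best_rmse:.2f}")
```

Keeping the scaler inside the pipeline avoids leaking validation-fold statistics into training, and the same pattern extends to a separate pipeline per model type.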