# **Final Project Task 3 - Census Modeling Regression**

Requirements
- Create a regression model on the Census dataset, with 'hours-per-week' target

- You can use models (estimators) from sklearn, but feel free to use any library for traditional ML.
    - Note: in sklearn, the LinearRegression estimator is based on OLS, a statistical method. Please use the SGDRegressor estimator, since it is based on gradient descent.
    - You can use the LinearRegression estimator, but only as a comparison with the SGDRegressor - Optional.

- Model Selection and Setup **2p**:
    - Implement multiple models, to solve a regression problem using traditional ML: 
        - Linear Regression
        - Decision Tree Regression
        - Random Forest Regression - Optional
        - Ridge Regression - Optional
        - Lasso Regression - Optional
    - Choose a loss (or experiment with different losses) for the model and justify the choice. *1p*
        - MSE, MAE, RMSE, Huber Loss or others
    - Justify model choices based on dataset characteristics and task requirements; specify model pros and cons. *1p*


- Data Preparation
    - Use the preprocessed datasets from Task 1.
    - From the train set, create an extra validation set, if necessary, so that there are three datasets in total: train, validation and test.
    - Be sure all models have their data preprocessed as needed. Some models require different, or no encoding for some features.


- Model Training and Experimentation **8p**
    - Establish a Baseline Model *2p*
        - For each model type, train a simple model with default settings as a baseline.
        - Evaluate its performance to establish a benchmark for comparison.
    - Make plots with train, validation loss and metric on epochs (or on steps), if applicable. - Optional
    - Feature Selection: - Optional
        - Use insights from EDA in Task 2 to identify candidate features by analyzing patterns, relationships, and distributions.
    - Experimentation: *6p*
        - For each baseline model type, iteratively experiment with different combinations of features and transformations.
        - Experiment with feature engineering techniques such as interaction terms, polynomial features, or scaling transformations.
        - Identify the model with the best performance metrics on the test set.
        - You may need multiple preprocessed versions of the dataset.
- Hyperparameter Tuning **2p**
  - Perform hyperparameter tuning only on the best-performing model after evaluating all model types and experiments. *2p*
  - Consider using techniques like Grid Search for exhaustive tuning, Random Search for quicker exploration, or Bayesian Optimization for an intelligent, efficient search of hyperparameters.
  - Avoid tuning models that do not show strong baseline performance or are unlikely to outperform others based on experimentation.
  - Ensure that hyperparameter tuning is done after completing feature selection, baseline modeling, and experimentation, ensuring that the model is stable and representative of the dataset.


- Model Evaluation **3p**
    - Evaluate models on the test dataset using regression metrics: *1p*
        - Mean Absolute Error (MAE)
        - Mean Squared Error (MSE)
        - Root Mean Squared Error (RMSE)
        - R² Score
    - Choose one metric for model comparison and explain your choice *1p*
    - Compare the results across different models. Save all experiment results in a table. *1p*

Feature Importance - Optional
- For applicable models (e.g., Decision Tree Regression), analyze feature importance and discuss its relevance to the problem.



Deliverables

- Notebook code with no errors.
- Code and results from experiments. Create a table with all experiment results, including the experiment name and metric results.
- Explain findings, choices, results.
- Potential areas for improvement or further exploration.


In [9]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import SGDRegressor, LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
# --- Load Preprocessed Data ---
train_data_path = 'E:/Master/ADC/14.Machine_Learning/ubb-sociology-ml/final_project/Train_Preprocessed.csv'
test_data_path = 'E:/Master/ADC/14.Machine_Learning/ubb-sociology-ml/final_project/Test_Preprocessed.csv'

data_train = pd.read_csv(train_data_path)
data_test = pd.read_csv(test_data_path)

# Define target variable and extract features
target = 'hours-per-week'
X_train = data_train.drop(columns=[target])
y_train = data_train[target]
X_test = data_test.drop(columns=[target])
y_test = data_test[target]

# Split train data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# --- Model Evaluation Function ---
def evaluate_model(y_true, y_pred):
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# --- Baseline Model Selection ---
models = {
    "SGDRegressor": SGDRegressor(random_state=42),
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=42),
    "RandomForestRegressor": RandomForestRegressor(random_state=42),
    "RidgeRegression": Ridge(random_state=42),
    "LassoRegression": Lasso(random_state=42)
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_val_pred = model.predict(X_val)
    results[name] = evaluate_model(y_val, y_val_pred)

results_df = pd.DataFrame(results).T
print("Baseline Model Results:\n", results_df)

Baseline Model Results:
                             MAE        MSE      RMSE        R2
SGDRegressor           4.290110  29.492624  5.430711  0.224885
LinearRegression       4.306563  29.451021  5.426880  0.225978
DecisionTreeRegressor  5.406934  57.357400  7.573467 -0.507448
RandomForestRegressor  4.325599  30.827392  5.552242  0.189805
RidgeRegression        4.306604  29.450548  5.426836  0.225991
LassoRegression        4.758433  37.796275  6.147868  0.006651


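Out of all these models, SGDRegressor has the lowest validation MAE (4.29). LinearRegression and Ridge score marginally better on MSE and R² (29.45 vs. 29.49 and 0.226 vs. 0.225), but the differences are negligible, so SGDRegressor is used for the experiments below.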

In [10]:

# --- Feature Selection Based on Correlation ---
correlation_with_target = X_train.corrwith(y_train)
selected_features = correlation_with_target[abs(correlation_with_target) > 0.1].index

X_train_ftrs = X_train[selected_features]
X_val_ftrs = X_val[selected_features]
X_test_ftrs = X_test[selected_features]

# --- Experiment 1: Polynomial Features ---
# Add polynomial features without re-scaling
poly = PolynomialFeatures(degree=2, include_bias=False)

# Apply transformations
X_train_poly = poly.fit_transform(X_train_ftrs)
X_val_poly = poly.transform(X_val_ftrs)
X_test_poly = poly.transform(X_test_ftrs)

# Fit the SGDRegressor model
sgd_poly = SGDRegressor(random_state=42)
sgd_poly.fit(X_train_poly, y_train)

# Evaluate the model with polynomial features
poly_train_metrics = evaluate_model(y_train, sgd_poly.predict(X_train_poly))
poly_val_metrics = evaluate_model(y_val, sgd_poly.predict(X_val_poly))
print("Experiment 1 - Polynomial Features Results:")
print("Train Metrics:", poly_train_metrics)
print("Validation Metrics:", poly_val_metrics)



Experiment 1 - Polynomial Features Results:
Train Metrics: {'MAE': 4.3126048444028084, 'MSE': 28.902747425642865, 'RMSE': 5.376127549234939, 'R2': 0.2437855077469584}
Validation Metrics: {'MAE': 4.353720005686385, 'MSE': 29.57915307194361, 'RMSE': 5.438671995252482, 'R2': 0.22261063917999835}
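
Experiment 1 adds polynomial features without re-scaling them. Because SGD is sensitive to feature scale, one option worth trying is to standardize the expanded features inside a pipeline. The sketch below shows that idea on synthetic data (an assumption, not the Census data); it is an illustration rather than the configuration used in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (assumption: not the Census dataset)
X_demo, y_demo = make_regression(n_samples=400, n_features=4, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Expand to degree-2 terms, standardize them, then fit SGD.
# Each step is fitted on the training split only when .fit() is called.
poly_scaled_sgd = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    SGDRegressor(random_state=42),
)
poly_scaled_sgd.fit(X_tr, y_tr)

# Number of features after the polynomial expansion
n_poly = poly_scaled_sgd.named_steps["polynomialfeatures"].n_output_features_
val_r2 = poly_scaled_sgd.score(X_va, y_va)
print(n_poly, val_r2)
```

Wrapping the steps in a pipeline also keeps the transformer from ever being fitted on validation or test data.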


In [11]:
# --- Experiment 2: Hyperparameter Tuning ---
# Define hyperparameter grid
param_grid = {
    "alpha": [0.0001, 0.001, 0.01],
    "penalty": ["l2", "l1", "elasticnet"],
    "learning_rate": ["constant", "adaptive"],
    "max_iter": [1000, 2000]
}

# Perform GridSearchCV
grid_search = GridSearchCV(SGDRegressor(random_state=42), param_grid, cv=3, scoring="neg_mean_squared_error")
grid_search.fit(X_train_ftrs, y_train)

# Best model after hyperparameter tuning
# (GridSearchCV already refits it on the full training data, since refit=True by default)
best_sgd = grid_search.best_estimator_

# Evaluate the tuned model
tuned_train_metrics = evaluate_model(y_train, best_sgd.predict(X_train_ftrs))
tuned_val_metrics = evaluate_model(y_val, best_sgd.predict(X_val_ftrs))
print("Experiment 2 - Hyperparameter Tuning Results:")
print("Train Metrics:", tuned_train_metrics)
print("Validation Metrics:", tuned_val_metrics)
print("Best Parameters:", grid_search.best_params_)

Experiment 2 - Hyperparameter Tuning Results:
Train Metrics: {'MAE': 4.31715475203357, 'MSE': 29.512022705616648, 'RMSE': 5.432496912619155, 'R2': 0.22784436589969825}
Validation Metrics: {'MAE': 4.349385580180138, 'MSE': 29.964969570441184, 'RMSE': 5.474026814917989, 'R2': 0.21247073962197727}
Best Parameters: {'alpha': 0.0001, 'learning_rate': 'adaptive', 'max_iter': 1000, 'penalty': 'l1'}
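
The brief mentions Random Search as a quicker alternative to an exhaustive grid. A minimal sketch with `RandomizedSearchCV` follows, again on synthetic data rather than the Census data; the search ranges are illustrative assumptions, not tuned recommendations.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: not the Census dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# alpha is sampled from a continuous log-uniform range instead of a fixed list
param_distributions = {
    "alpha": loguniform(1e-5, 1e-1),
    "penalty": ["l2", "l1", "elasticnet"],
    "learning_rate": ["constant", "adaptive"],
}

random_search = RandomizedSearchCV(
    SGDRegressor(random_state=42),
    param_distributions,
    n_iter=10,  # number of sampled configurations, fixed regardless of grid size
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=42,
)
random_search.fit(X_demo, y_demo)
print(random_search.best_params_)
```

The cost is controlled by `n_iter` rather than by the size of the grid, which is what makes this approach cheaper when many hyperparameters are explored.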


In [15]:
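# --- Compare Results ---
# Note: the baseline loop stored validation metrics only, so the "Base" row
# reuses them in the Train_* columns. The baseline was also fit on all features,
# while the other two experiments use the correlation-selected features.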
comparison_results = {
    "Model": ["Base", "Polynomial Features", "Hyperparameter Tuning"],
    "Train_MAE": [results["SGDRegressor"]["MAE"], poly_train_metrics["MAE"], tuned_train_metrics["MAE"]],
    "Val_MAE": [results["SGDRegressor"]["MAE"], poly_val_metrics["MAE"], tuned_val_metrics["MAE"]],
    "Train_R2": [results["SGDRegressor"]["R2"], poly_train_metrics["R2"], tuned_train_metrics["R2"]],
    "Val_R2": [results["SGDRegressor"]["R2"], poly_val_metrics["R2"], tuned_val_metrics["R2"]],
}

comparison_df = pd.DataFrame(comparison_results)
print("\nComparison of Experiments:\n", comparison_df)

output_path = "E:/Master/ADC/14.Machine_Learning/ubb-sociology-ml/final_project/Experiment_Results.csv"
comparison_df.to_csv(output_path, index=False) 


Comparison of Experiments:
                    Model  Train_MAE   Val_MAE  Train_R2    Val_R2
0                   Base   4.290110  4.290110  0.224885  0.224885
1    Polynomial Features   4.312605  4.353720  0.243786  0.222611
2  Hyperparameter Tuning   4.317155  4.349386  0.227844  0.212471
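
In the table above, the Base row shows identical train and validation values because the baseline loop stored validation metrics only. A small helper that scores a fitted model on every split would avoid this. The sketch below uses synthetic data, and `evaluate_splits` is a hypothetical name introduced for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_splits(model, splits):
    """Score an already fitted model on each named (X, y) split."""
    scores = {}
    for name, (X, y) in splits.items():
        y_pred = model.predict(X)
        scores[name] = {"MAE": mean_absolute_error(y, y_pred), "R2": r2_score(y, y_pred)}
    return scores


# Synthetic stand-in data (assumption: not the Census dataset)
X_demo, y_demo = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

base = SGDRegressor(random_state=42).fit(X_tr, y_tr)
base_scores = evaluate_splits(base, {"train": (X_tr, y_tr), "val": (X_va, y_va)})
print(base_scores)
```

With the notebook's own variables, the call would pass `(X_train, y_train)` and `(X_val, y_val)` for the fitted baseline, giving separate train and validation entries for the comparison table.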


In [14]:
# --- Evaluate Base Model ---
# Refit on the selected features so all three test-set models share the same inputs
base_model = SGDRegressor(random_state=42)
base_model.fit(X_train_ftrs, y_train)
y_test_base = base_model.predict(X_test_ftrs)
base_test_metrics = evaluate_model(y_test, y_test_base)

# --- Evaluate Polynomial Features Model ---
# Fit the transformer on the training features only, then apply it to the test features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train_ftrs)
X_test_poly = poly.transform(X_test_ftrs)
sgd_poly.fit(X_train_poly, y_train)
y_test_poly = sgd_poly.predict(X_test_poly)
poly_test_metrics = evaluate_model(y_test, y_test_poly)

# --- Evaluate Hyperparameter-Tuned Model ---
# Parameters are the best ones reported by the grid search in Experiment 2
best_sgd = SGDRegressor(alpha=0.0001, learning_rate='adaptive', max_iter=1000, penalty='l1', random_state=42)
best_sgd.fit(X_train_ftrs, y_train)
y_test_tuned = best_sgd.predict(X_test_ftrs)
tuned_test_metrics = evaluate_model(y_test, y_test_tuned)


test_comparison_results = {
    "Model": ["Base", "Polynomial Features", "Hyperparameter Tuning"],
    "Test_MAE": [base_test_metrics["MAE"], poly_test_metrics["MAE"], tuned_test_metrics["MAE"]],
    "Test_R2": [base_test_metrics["R2"], poly_test_metrics["R2"], tuned_test_metrics["R2"]],
}

test_comparison_df = pd.DataFrame(test_comparison_results)
print("\nTest Set Comparison:\n", test_comparison_df)



Test Set Comparison:
                    Model  Test_MAE   Test_R2
0                   Base  4.337558  0.232241
1    Polynomial Features  4.345263  0.244144
2  Hyperparameter Tuning  4.347847  0.232171


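Based on the validation and test results, the more complex models show no meaningful improvement over the base model: test MAE stays between 4.34 and 4.35 for all three, and only the polynomial-features model nudges test R² up (0.244 vs. 0.232). The base SGDRegressor therefore works just as well.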
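
The optional feature-importance analysis from the brief is not covered above. As a possible starting point for further exploration, the sketch below reads impurity-based importances from a fitted decision tree. It uses synthetic data with made-up column names (`feature_0` to `feature_4`), so the printed ranking says nothing about the Census features.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data with hypothetical column names (assumption: not the Census dataset)
X_arr, y_demo = make_regression(n_samples=300, n_features=5, n_informative=2, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])

# A shallow tree keeps the example fast; max_depth=5 is an arbitrary choice
tree = DecisionTreeRegressor(max_depth=5, random_state=42)
tree.fit(X_demo, y_demo)

# Impurity-based importances, normalized by sklearn to sum to 1
importances = (
    pd.Series(tree.feature_importances_, index=X_demo.columns)
    .sort_values(ascending=False)
)
print(importances)
```

With the notebook's data, the tree would be fitted on `X_train` and `y_train`, and the index would be `X_train.columns`.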