# **Final Project Task 3 - Census Modeling Regression**

Requirements
- Create a regression model on the Census dataset, with 'hours-per-week' as the target.

- You can use models (estimators) from sklearn, but feel free to use any library for traditional ML.
    - Note: in sklearn, the LinearRegression estimator is based on OLS, a statistical method. Please use the SGDRegressor estimator, since it is based on gradient descent.
    - Optional: you can use the LinearRegression estimator, but only as a comparison with SGDRegressor.

- Model Selection and Setup **2p**:
    - Implement multiple models, to solve a regression problem using traditional ML: 
        - Linear Regression
        - Decision Tree Regression
        - Random Forest Regression - Optional
        - Ridge Regression - Optional
        - Lasso Regression - Optional
    - Choose a loss (or experiment with different losses) for the model and justify the choice. *1p*
        - MSE, MAE, RMSE, Huber Loss or others
    - Justify model choices based on dataset characteristics and task requirements; specify model pros and cons. *1p*


- Data Preparation
    - Use the preprocessed datasets from Task 1.
    - From the train set, create an extra validation set, if necessary, so that in total there are train, validation and test datasets.
    - Be sure all models have their data preprocessed as needed. Some models require different, or no encoding for some features.


- Model Training and Experimentation **10p**
    - Establish a Baseline Model *2p*
        - For each model type, train a simple model with default settings as a baseline.
        - Evaluate its performance to establish a benchmark for comparison.
    - Make plots with train, validation loss and metric on epochs (or on steps), if applicable. - Optional
    - Feature Selection: - Optional
        - Use insights from EDA in Task 2 to identify candidate features by analyzing patterns, relationships, and distributions.
    - Experimentation: *8p*
        - For each baseline model type, iteratively experiment with different combinations of features and transformations.
        - Experiment with feature engineering techniques such as interaction terms, polynomial features, or scaling transformations.
        - Identify the model with the best performance metrics on the test set.
        - You may need multiple differently preprocessed datasets.
- Hyperparameter Tuning - Optional
  - Perform hyperparameter tuning only on the best-performing model after evaluating all model types and experiments. 
  - Consider using techniques like Grid Search for exhaustive tuning, Random Search for quicker exploration, or Bayesian Optimization for an intelligent, efficient search of hyperparameters.
  - Avoid tuning models that do not show strong baseline performance or are unlikely to outperform others based on experimentation.
  - Ensure that hyperparameter tuning is done after completing feature selection, baseline modeling, and experimentation, ensuring that the model is stable and representative of the dataset.


- Model Evaluation **3p**
    - Evaluate models on the test dataset using regression metrics: *1p*
        - Mean Absolute Error (MAE)
        - Mean Squared Error (MSE)
        - Root Mean Squared Error (RMSE)
        - R² Score
    - Choose one metric for model comparison and explain your choice *1p*
    - Compare the results across different models. Save all experiment results in a table. *1p*

Feature Importance - Optional
- For applicable models (e.g., Decision Tree Regression), analyze feature importance and discuss its relevance to the problem.



Deliverables

- Notebook code with no errors.
- Code and results from experiments. Create a table with all experiment results, including the experiment name and metric values.
- Explain findings, choices, results.
- Potential areas for improvement or further exploration.


In [1]:
import pandas as pd
from sklearn.linear_model import SGDRegressor, LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Setup

In [2]:
X_train = pd.read_csv("X_train.csv")
X_test = pd.read_csv("X_test.csv")
y_train = pd.read_csv("y_train.csv")
y_test = pd.read_csv("y_test.csv")


In [3]:
X_train.shape, y_train.shape


((26029, 28), (26029, 1))

In [4]:
y_train.columns

Index(['hours-per-week'], dtype='object')

#### Create validation set

In [5]:
from sklearn.model_selection import train_test_split

X_train2, X_val, y_train2, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)


#### Verifying to ensure all columns are numerical/float

In [6]:
X_test.dtypes



age                                   int64
fnlwgt                              float64
education-num                       float64
capital-gain                        float64
capital-loss                        float64
income_binary                         int64
sex_Male                            float64
race_Asian-Pac-Islander             float64
race_Black                          float64
race_Other                          float64
race_White                          float64
workclass_group_Not-working         float64
workclass_group_Private             float64
workclass_group_Self-employed       float64
workclass_group_Unknown             float64
education_level_Primary             float64
education_level_Secondary           float64
marital_group_Previously_married    float64
marital_group_Single                float64
occupation_group_Office             float64
occupation_group_Professional       float64
occupation_group_Security           float64
occupation_group_Service        

#### Making sure there are no remaining missing values

In [7]:
X_train.isna().sum()
X_test.isna().sum()


age                                 0
fnlwgt                              0
education-num                       0
capital-gain                        0
capital-loss                        0
income_binary                       0
sex_Male                            0
race_Asian-Pac-Islander             0
race_Black                          0
race_Other                          0
race_White                          0
workclass_group_Not-working         0
workclass_group_Private             0
workclass_group_Self-employed       0
workclass_group_Unknown             0
education_level_Primary             0
education_level_Secondary           0
marital_group_Previously_married    0
marital_group_Single                0
occupation_group_Office             0
occupation_group_Professional       0
occupation_group_Security           0
occupation_group_Service            0
occupation_group_Technical          0
occupation_group_Unknown            0
native_region_United-States         0
age_group_Se

# Model training and setup

In [8]:

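# Baseline models with default settings and no feature scaling.
models = {
    "SGD": SGDRegressor(max_iter=1000, random_state=42),
    "LinearRegression": LinearRegression(),
    "DecisionTree": DecisionTreeRegressor(random_state=42)
}

results = []

for name, model in models.items():
    # Baselines are fit on the full training set (X_train, not X_train2).
    # ravel() turns the single-column target DataFrame into a 1-D array.
    model.fit(X_train, y_train.values.ravel())
    preds = model.predict(X_test)

    # Regression metrics on the test set.
    mae = mean_absolute_error(y_test, preds)
    mse = mean_squared_error(y_test, preds)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, preds)

    results.append([name, mae, mse, rmse, r2])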





In [9]:
results_df = pd.DataFrame(results, columns=["Model","MAE","MSE","RMSE","R2"])
# Only the last expression of a cell is displayed, so show the sorted table.
results_df.sort_values(by="RMSE")

| | Model | MAE | MSE | RMSE | R2 |
|---|---|---|---|---|---|
| 1 | LinearRegression | 7.725025 | 123.1171 | 11.09581 | 0.1950992 |
| 2 | DecisionTree | 10.440612 | 240.7464 | 15.51601 | -0.5739243 |
| 0 | SGD | 163481.617172 | 9011692000000.0 | 3001948.0 | -58915600000.0 |


##### RMSE was selected as the primary metric for model comparison because it penalizes large prediction errors more heavily than MAE and is expressed in the same unit as the target (hours per week), which makes it easy to interpret.
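
To make the difference between MAE and RMSE concrete, here is a minimal sketch that computes the four metrics by hand. The `regression_metrics` helper and the hour values are made up for illustration and are not taken from the Census data. Both prediction sets have the same MAE, but the one with a single large miss has twice the RMSE.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R2 for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = np.mean(errors ** 2)
    ss_res = np.sum(errors ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "MAE": float(np.mean(np.abs(errors))),
        "MSE": float(mse),
        "RMSE": float(np.sqrt(mse)),
        "R2": float(1.0 - ss_res / ss_tot),
    }


# Hypothetical weekly hours, for illustration only.
y_true = [40, 40, 50, 30]

# Four errors of 5 hours each.
spread = regression_metrics(y_true, [45, 35, 55, 25])
# Three perfect predictions and one error of 20 hours.
outlier = regression_metrics(y_true, [40, 40, 50, 50])

print(spread)
print(outlier)
```

Because the errors are squared before averaging, the single 20-hour miss dominates the second result, which is the behaviour the metric choice above relies on.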

# Experimenting

In [10]:


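# Wrapping the scaler and the model in a Pipeline ensures the scaler is
# fit on the training data only and reused unchanged at prediction time.
models_scaled = {
    "SGD_scaled": Pipeline([
        ("scaler", StandardScaler()),
        ("model", SGDRegressor(max_iter=1000, random_state=42))
    ]),
    "LinearRegression_scaled": Pipeline([
        ("scaler", StandardScaler()),
        ("model", LinearRegression())
    ]),
    # Tree splits do not depend on feature scale, so no scaler is used here.
    "DecisionTree": DecisionTreeRegressor(random_state=42)
}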


In [11]:
results_scaled = []

for name, model in models_scaled.items():
    # These models are fit on X_train2, the 80% split of the training data.
    model.fit(X_train2, y_train2.values.ravel())
    preds = model.predict(X_test)

    # Same regression metrics as for the baselines, on the test set.
    mae = mean_absolute_error(y_test, preds)
    mse = mean_squared_error(y_test, preds)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, preds)

    results_scaled.append([name, mae, mse, rmse, r2])

results_scaled_df = pd.DataFrame(
    results_scaled, columns=["Model","MAE","MSE","RMSE","R2"]
)


In [14]:
results_scaled_df.sort_values(by="RMSE")

| | Model | MAE | MSE | RMSE | R2 |
|---|---|---|---|---|---|
| 1 | LinearRegression_scaled | 7.72886 | 123.163281 | 11.097895 | 0.194797 |
| 0 | SGD_scaled | 7.715535 | 123.852874 | 11.128921 | 0.190289 |
| 2 | DecisionTree | 10.619622 | 245.486517 | 15.668009 | -0.604913 |


#### In this experiment, LinearRegression with StandardScaler has the lowest test RMSE (11.10), narrowly ahead of the scaled SGDRegressor (11.13), which has the slightly lower MAE (7.72 vs 7.73).
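
The comparison above ranks models directly on the test set. A common alternative is to pick the model on the validation split and touch the test set only once, for the final number. The sketch below illustrates that flow on synthetic data, since the Census CSV files are not available here; the data, the `candidates` dictionary and the `rmse` helper are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed features (assumption).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = 40 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=5, size=1000)

# First hold out a test set, then carve a validation set out of the rest.
X_tmp, X_te, y_tmp, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_tmp, y_tmp, test_size=0.2, random_state=42)

candidates = {
    "SGD_scaled": Pipeline([
        ("scaler", StandardScaler()),
        ("model", SGDRegressor(max_iter=1000, random_state=42)),
    ]),
    "LinearRegression": LinearRegression(),
}


def rmse(y_true, y_pred):
    """Root mean squared error as a plain float."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# Model selection uses the validation split only.
val_rmse = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    val_rmse[name] = rmse(y_va, model.predict(X_va))

best_name = min(val_rmse, key=val_rmse.get)

# The test set is used once, to report the chosen model's performance.
test_rmse = rmse(y_te, candidates[best_name].predict(X_te))
print(best_name, val_rmse, test_rmse)
```

In the notebook, `X_val` and `y_val` from the earlier split could play the role of `X_va` and `y_va`.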


In [15]:
all_results = pd.concat([results_df, results_scaled_df], ignore_index=True)
all_results.sort_values(by="RMSE")


| | Model | MAE | MSE | RMSE | R2 |
|---|---|---|---|---|---|
| 1 | LinearRegression | 7.725025 | 123.1171 | 11.09581 | 0.1950992 |
| 4 | LinearRegression_scaled | 7.72886 | 123.1633 | 11.0979 | 0.1947973 |
| 3 | SGD_scaled | 7.715535 | 123.8529 | 11.12892 | 0.1902889 |
| 2 | DecisionTree | 10.440612 | 240.7464 | 15.51601 | -0.5739243 |
| 5 | DecisionTree | 10.619622 | 245.4865 | 15.66801 | -0.6049135 |
| 0 | SGD | 163481.617172 | 9011692000000.0 | 3001948.0 | -58915600000.0 |


# Model evaluation

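#### Standardization mattered only for SGDRegressor, whose test RMSE dropped from about 3.0 million to 11.13. LinearRegression (RMSE 11.10 in both runs) and DecisionTree (15.52 vs 15.67) were essentially unchanged; neither model is sensitive to feature scale, so their small differences are more likely due to the experiment models being trained on X_train2, the 80% split of the training data.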
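
The SGDRegressor result above is typical of gradient-based fitting on unscaled features. The following is a minimal NumPy sketch of plain full-batch gradient descent on MSE, not of SGDRegressor's internals; the `gradient_descent_mse` helper, the learning rate, the step count and the synthetic feature with a standard deviation of 100 are all assumptions chosen for illustration.

```python
import numpy as np


def gradient_descent_mse(X, y, lr=0.1, n_steps=30):
    """Run full-batch gradient descent on MSE and return the final MSE."""
    Xb = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_steps):
        residual = Xb @ w - y
        grad = (2.0 / len(y)) * (Xb.T @ residual)
        w -= lr * grad
    return float(np.mean((Xb @ w - y) ** 2))


# One synthetic feature on a large scale (assumption, not Census data).
rng = np.random.default_rng(0)
x_raw = rng.normal(loc=0.0, scale=100.0, size=200)
y = 40.0 + 0.05 * x_raw + rng.normal(scale=5.0, size=200)

# Manual standardization: zero mean, unit variance.
x_scaled = (x_raw - x_raw.mean()) / x_raw.std()

mse_raw = gradient_descent_mse(x_raw.reshape(-1, 1), y)
mse_scaled = gradient_descent_mse(x_scaled.reshape(-1, 1), y)

print(f"MSE without scaling: {mse_raw:.3e}")
print(f"MSE with scaling:    {mse_scaled:.3f}")
```

With the same learning rate, the updates overshoot on the large-scale feature and the error grows at every step, while the standardized version settles near the noise level. An OLS solver has no learning rate, which is why LinearRegression does not show this effect.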


# Conclusions and Findings

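Scaling was essential for SGDRegressor: without it the model did not converge to a usable solution (test RMSE of about 3.0 million), and with StandardScaler it reached a test RMSE of 11.13. LinearRegression is fit with OLS and is not affected by feature scaling (RMSE 11.10 with and without it).
The lowest test RMSE was achieved by LinearRegression (11.096 unscaled, 11.098 scaled), with the scaled SGDRegressor close behind. All three explain only about 19% of the variance in hours-per-week (R² ≈ 0.19).

The Decision Tree with default settings had a test RMSE of 15.5 to 15.7 and a negative R², meaning it predicted worse than always predicting the mean. An unconstrained tree most likely overfits the training data, so it did not outperform the linear models.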
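
A negative R² for a default Decision Tree is consistent with overfitting. The sketch below shows the pattern on synthetic data with a weak signal and a lot of noise; the data and the `max_depth=3` setting are assumptions for illustration, not tuned values for the Census dataset.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: one informative feature, four noise features (assumption).
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 5))
y = 40 + 2 * X[:, 0] + rng.normal(scale=3, size=2000)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)

scores = {}
for label, depth in [("unconstrained", None), ("max_depth=3", 3)]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    scores[label] = {
        "train_r2": r2_score(y_tr, tree.predict(X_tr)),
        "val_r2": r2_score(y_va, tree.predict(X_va)),
    }

for label, values in scores.items():
    print(label, values)
```

The unconstrained tree memorizes the training set (train R² close to 1) and generalizes poorly, while the shallow tree trades training fit for a better validation score. Comparing train and validation metrics in this way is one check that could be applied to the notebook's Decision Tree.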

# Future Improvements

Try Random Forest or Ridge/Lasso regression.

Perform hyperparameter tuning on the best model.
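
As a starting point for both items, here is a hedged sketch of a grid search over the regularization strength of a Ridge model inside a scaling pipeline. It runs on synthetic data; the `alpha` grid, the 5-fold setting and the data itself are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed training data (assumption).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = 40 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=5, size=500)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Ridge()),
])

# Pipeline parameters are addressed as "<step name>__<parameter>".
param_grid = {"model__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_root_mean_squared_error",  # sklearn maximizes, hence the sign
    cv=5,
)
search.fit(X, y)

best_alpha = search.best_params_["model__alpha"]
best_cv_rmse = -search.best_score_
print(f"best alpha: {best_alpha}, cross-validated RMSE: {best_cv_rmse:.3f}")
```

Cross-validation on the training data keeps the test set out of the tuning loop, so the final test RMSE remains an unbiased estimate.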


