# **Final Project Task 3 - Census Modeling Regression**

Requirements
- Create a regression model on the Census dataset, with 'hours-per-week' as the target

- You can use models (estimators) from sklearn, but feel free to use any library for traditional ML.
    - Note: in sklearn, the LinearRegression estimator is based on OLS, a statistical method. Please use the SGDRegressor estimator, since it is based on gradient descent.
    - You can use the LinearRegression estimator, but only as a comparison with the SGDRegressor - Optional.

- Model Selection and Setup **2p**:
    - Implement multiple models, to solve a regression problem using traditional ML: 
        - Linear Regression
        - Decision Tree Regression
        - Random Forest Regression - Optional
        - Ridge Regression - Optional
        - Lasso Regression - Optional
    - Choose a loss (or experiment with different losses) for the model and justify the choice. *1p*
        - MSE, MAE, RMSE, Huber Loss or others
    - Justify model choices based on dataset characteristics and task requirements; specify model pros and cons. *1p*


- Data Preparation
    - Use the preprocessed datasets from Task 1.
    - From the train set, create an extra validation set, if necessary. In total there will be three datasets: train, validation and test.
    - Be sure all models have their data preprocessed as needed. Some models require different, or no encoding for some features.


- Model Training and Experimentation **10p**
    - Establish a Baseline Model *2p*
        - For each model type, train a simple model with default settings as a baseline.
        - Evaluate its performance to establish a benchmark for comparison.
    - Plot train and validation loss and metric over epochs (or steps), if applicable. - Optional
    - Feature Selection: - Optional
        - Use insights from EDA in Task 2 to identify candidate features by analyzing patterns, relationships, and distributions.
    - Experimentation: *8p*
        - For each baseline model type, iteratively experiment with different combinations of features and transformations.
        - Experiment with feature engineering techniques such as interaction terms, polynomial features, or scaling transformations.
        - Identify the model with the best performance metrics on the test set.
        - You may need several differently preprocessed datasets.
- Hyperparameter Tuning - Optional
  - Perform hyperparameter tuning only on the best-performing model after evaluating all model types and experiments. 
  - Consider using techniques like Grid Search for exhaustive tuning, Random Search for quicker exploration, or Bayesian Optimization for an intelligent, efficient search of hyperparameters.
  - Avoid tuning models that do not show strong baseline performance or are unlikely to outperform others based on experimentation.
  - Run hyperparameter tuning only after completing feature selection, baseline modeling, and experimentation, so that the model is stable and representative of the dataset.


- Model Evaluation **3p**
    - Evaluate models on the test dataset using regression metrics: *1p*
        - Mean Absolute Error (MAE)
        - Mean Squared Error (MSE)
        - Root Mean Squared Error (RMSE)
        - R² Score
    - Choose one metric for model comparison and explain your choice *1p*
    - Compare the results across different models. Save all experiment results in a table. *1p*

Feature Importance - Optional
- For applicable models (e.g., Decision Tree Regression), analyze feature importance and discuss its relevance to the problem.



Deliverables

- Notebook code with no errors.
- Code and results from experiments. Create a table with all experiment results, including the experiment name and metric values.
- Explain findings, choices, results.
- Potential areas for improvement or further exploration.


In [58]:
import pandas as pd

In [59]:
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
    "hours-per-week", "native-country", "income"
]

# With skipinitialspace=True the leading space is stripped, so missing values appear as "?"
data = pd.read_csv(data_url, header=None, names=columns, na_values="?", skipinitialspace=True)
data.sample(10)

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3758 | 39 | Private | 33355 | HS-grad | 9 | Married-civ-spouse | Transport-moving | Husband | White | Male | 7298 | 0 | 48 | United-States | >50K |
| 2005 | 27 | Private | 262478 | HS-grad | 9 | Never-married | Farming-fishing | Own-child | Black | Male | 0 | 0 | 30 | United-States | <=50K |
| 9705 | 51 | Private | 104651 | Bachelors | 13 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 12380 | 19 | Private | 187161 | Some-college | 10 | Never-married | Sales | Own-child | White | Female | 0 | 0 | 25 | United-States | <=50K |
| 23697 | 25 | Private | 174545 | HS-grad | 9 | Never-married | Adm-clerical | Unmarried | White | Female | 0 | 0 | 46 | United-States | <=50K |
| 8157 | 59 | Private | 121912 | Some-college | 10 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 14853 | 36 | Private | 398931 | HS-grad | 9 | Divorced | Craft-repair | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 7583 | 28 | Private | 175262 | Some-college | 10 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | Mexico | <=50K |
| 13279 | 48 | Private | 146919 | Some-college | 10 | Divorced | Adm-clerical | Unmarried | White | Female | 0 | 0 | 45 | United-States | <=50K |
| 28062 | 37 | Private | 263094 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 7298 | 0 | 40 | United-States | >50K |


In [60]:
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor, LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


In [61]:
X_train = pd.read_csv("X_train.csv")
X_test = pd.read_csv("X_test.csv")
y_train = pd.read_csv("y_train.csv").squeeze()
y_test = pd.read_csv("y_test.csv").squeeze()


In [62]:
X_train_sub, X_val, y_train_sub, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)


The training set is split into a training subset (80%) and a validation set (20%).

In [63]:
def evaluate_model(name, model, X_train, y_train, X_eval, y_eval):
    """Fit `model` on the training data and return regression metrics on the evaluation data."""
    model.fit(X_train, y_train)
    predictions = model.predict(X_eval)

    mae = mean_absolute_error(y_eval, predictions)
    mse = mean_squared_error(y_eval, predictions)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_eval, predictions)

    return {
        "Model": name,
        "MAE": mae,
        "MSE": mse,
        "RMSE": rmse,
        "R2": r2
    }

In [64]:
sgd = SGDRegressor(random_state=42)

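SGDRegressor fits a linear model with stochastic gradient descent, which scales well to large datasets. It is sensitive to the scale of the input features.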
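
Because gradient descent is sensitive to feature scale, a common setup wraps the regressor in a pipeline with a scaler. The sketch below is illustrative only: `X_demo` and `y_demo` are hypothetical synthetic stand-ins (one small-scale and one large-scale column), not the Census features, and it assumes that standardization is an appropriate transformation for the real data.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 2000

# Hypothetical synthetic data: two features on very different scales
X_demo = np.column_stack([
    rng.normal(40, 12, n),            # small-scale feature
    rng.normal(190_000, 100_000, n),  # large-scale feature
])
# The target depends linearly on the first feature only, plus noise
y_demo = 40 + 0.3 * (X_demo[:, 0] - 40) + rng.normal(0, 5, n)

# Hold out the last 20% of rows for evaluation
X_fit, X_hold = X_demo[:1600], X_demo[1600:]
y_fit, y_hold = y_demo[:1600], y_demo[1600:]

# Standardize the features inside the pipeline, then fit with SGD
scaled_sgd = make_pipeline(StandardScaler(), SGDRegressor(random_state=42))
scaled_sgd.fit(X_fit, y_fit)

r2_scaled = r2_score(y_hold, scaled_sgd.predict(X_hold))
print(f"R2 with scaling: {r2_scaled:.3f}")
```

Keeping the scaler inside the pipeline means it is fitted on the training rows only, so no statistics from the held-out rows leak into training.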

In [65]:
lr = LinearRegression()


Linear Regression (OLS) is used as a reference model.

In [66]:
dt = DecisionTreeRegressor(random_state=42)


A Decision Tree Regressor can model non-linear relationships.
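
A tree grown with default settings has no depth limit and can memorize noisy training data. The sketch below illustrates this on hypothetical synthetic data (`X_demo`, `y_demo` are not the Census features); the value `max_depth=4` is an arbitrary assumption for the demonstration, not a recommended setting.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical synthetic data: a noisy non-linear relationship
X_demo = rng.uniform(0, 10, size=(2000, 1))
y_demo = 5 * np.sin(X_demo[:, 0]) + rng.normal(0, 3, size=2000)

# Hold out the last 20% of rows for evaluation
X_fit, X_hold = X_demo[:1600], X_demo[1600:]
y_fit, y_hold = y_demo[:1600], y_demo[1600:]

# Unconstrained tree versus a depth-limited tree
deep_tree = DecisionTreeRegressor(random_state=42).fit(X_fit, y_fit)
shallow_tree = DecisionTreeRegressor(max_depth=4, random_state=42).fit(X_fit, y_fit)

r2_deep_train = r2_score(y_fit, deep_tree.predict(X_fit))
r2_deep_hold = r2_score(y_hold, deep_tree.predict(X_hold))
r2_shallow_hold = r2_score(y_hold, shallow_tree.predict(X_hold))

print(f"deep tree:    train R2={r2_deep_train:.3f}, held-out R2={r2_deep_hold:.3f}")
print(f"shallow tree: held-out R2={r2_shallow_hold:.3f}")
```

A large gap between training and held-out scores for the unconstrained tree is the usual sign of overfitting; limiting depth or leaf size is one way to reduce it.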

In [67]:
results = []

results.append(evaluate_model("SGD Regressor", sgd, X_train_sub, y_train_sub, X_val, y_val))
results.append(evaluate_model("Linear Regression", lr, X_train_sub, y_train_sub, X_val, y_val))
results.append(evaluate_model("Decision Tree", dt, X_train_sub, y_train_sub, X_val, y_val))

results_df = pd.DataFrame(results)
results_df


| | Model | MAE | MSE | RMSE | R2 |
|---|---|---|---|---|---|
| 0 | SGD Regressor | 2381979000.0 | 1.452363e+19 | 3810988000.0 | -9.078803e+16 |
| 1 | Linear Regression | 7.70563 | 119.2056 | 10.91813 | 0.254839 |
| 2 | Decision Tree | 10.16255 | 225.597 | 15.01989 | -0.4102195 |


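On the validation set, Linear Regression is the best model: it has the lowest MAE (7.71) and RMSE (10.92) and the highest R² (0.25).
The Decision Tree is second, but its RMSE (15.02) is clearly higher and its R² is negative (-0.41), which means it does worse than always predicting the mean of the target. An unconstrained tree has probably overfit the training subset.
The SGD Regressor is the worst: its errors are on the order of 10^9 and its R² is hugely negative. This suggests that gradient descent diverged, most likely because the features were not scaled to comparable ranges.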
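
To make the meaning of a negative R² concrete, the sketch below compares a mean-predicting baseline with deliberately poor predictions. It is a minimal illustration on hypothetical synthetic values (`y_demo` is not the real target), using sklearn's `DummyRegressor` as the baseline.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)

# Hypothetical target values; the dummy model ignores the features
y_demo = rng.normal(40, 12, size=500)
X_demo = np.zeros((500, 1))

# Baseline that always predicts the mean of the target it was fitted on
mean_model = DummyRegressor(strategy="mean").fit(X_demo, y_demo)
r2_mean = r2_score(y_demo, mean_model.predict(X_demo))

# Predictions that are consistently far from the target give a negative R2
bad_predictions = np.full_like(y_demo, y_demo.mean() + 50)
r2_bad = r2_score(y_demo, bad_predictions)

print(f"mean baseline R2: {r2_mean:.3f}")
print(f"poor predictions R2: {r2_bad:.3f}")
```

Any model with R² below zero on the evaluation data is therefore less accurate, in squared-error terms, than this constant baseline.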