# **Final Project Task 3 - Census Modeling Regression**

Requirements
- Create a regression model on the Census dataset, with 'hours-per-week' target

- You can use models (estimators) from sklearn, but feel free to use any library for traditional ML. 
    - Note: in sklearn, the LinearRegression estimator is based on OLS, a statistical method. Please use the SGDRegressor estimator, since it is based on gradient descent. 
    - You can use the LinearRegression estimator, but only as a comparison with the SGDRegressor - Optional.

- Model Selection and Setup **2p**:
    - Implement multiple models, to solve a regression problem using traditional ML: 
        - Linear Regression
        - Decision Tree Regression
        - Random Forest Regression - Optional
        - Ridge Regression - Optional
        - Lasso Regression - Optional
    - Choose a loss (or experiment with different losses) for the model and justify the choice. *1p*
        - MSE, MAE, RMSE, Huber Loss or others
    - Justify model choices based on dataset characteristics and task requirements; specify model pros and cons. *1p*


- Data Preparation
    - Use the preprocessed datasets from Task 1.
    - From the train set, create an extra validation set, if necessary. In total there will be train, validation and test datasets.
    - Be sure all models have their data preprocessed as needed. Some models require a different encoding, or no encoding, for some features.


- Model Training and Experimentation **10p**
    - Establish a Baseline Model *2p*
        - For each model type, train a simple model with default settings as a baseline.
        - Evaluate its performance to establish a benchmark for comparison.
    - Make plots with train, validation loss and metric on epochs (or on steps), if applicable. - Optional
    - Feature Selection: - Optional
        - Use insights from EDA in Task 2 to identify candidate features by analyzing patterns, relationships, and distributions.
    - Experimentation: *8p*
        - For each baseline model type, iteratively experiment with different combinations of features and transformations.
        - Experiment with feature engineering techniques such as interaction terms, polynomial features, or scaling transformations.
        - Identify the model with the best performance metrics on the test set.
        - You may need several differently preprocessed datasets.
- Hyperparameter Tuning - Optional
  - Perform hyperparameter tuning only on the best-performing model after evaluating all model types and experiments. 
  - Consider using techniques like Grid Search for exhaustive tuning, Random Search for quicker exploration, or Bayesian Optimization for an intelligent, efficient search of hyperparameters.
  - Avoid tuning models that do not show strong baseline performance or are unlikely to outperform others based on experimentation.
  - Ensure that hyperparameter tuning is done after completing feature selection, baseline modeling, and experimentation, ensuring that the model is stable and representative of the dataset.


- Model Evaluation **3p**
    - Evaluate models on the test dataset using regression metrics: *1p*
        - Mean Absolute Error (MAE)
        - Mean Squared Error (MSE)
        - Root Mean Squared Error (RMSE)
        - R² Score
    - Choose one metric for model comparison and explain your choice *1p*
    - Compare the results across different models. Save all experiment results in a table. *1p*

Feature Importance - Optional
- For applicable models (e.g., Decision Tree Regression), analyze feature importance and discuss its relevance to the problem.



Deliverables

- Notebook code with no errors.
- Code and results from experiments. Create a table with all experiment results, including the experiment name and metric values.
- Explain findings, choices, results.
- Potential areas for improvement or further exploration.


In [1]:
import pandas as pd

In [2]:
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
    "hours-per-week", "native-country", "income"
]

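# Note: skipinitialspace=True strips the leading space from each field, so missing
# values are read as "?" and na_values=" ?" does not match them (see the sample below).
data = pd.read_csv(data_url, header=None, names=columns, na_values=" ?", skipinitialspace=True)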
data.sample(10)

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1931 | 50 | ? | 156008 | 11th | 7 | Married-civ-spouse | ? | Own-child | Black | Female | 0 | 0 | 40 | United-States | <=50K |
| 2321 | 45 | Local-gov | 238386 | Some-college | 10 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 56 | 46 | Private | 216666 | 5th-6th | 3 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | Mexico | <=50K |
| 28959 | 33 | Private | 69727 | 7th-8th | 4 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 40 | Mexico | <=50K |
| 19160 | 36 | Self-emp-not-inc | 48093 | Some-college | 10 | Married-civ-spouse | Sales | Husband | White | Male | 0 | 0 | 92 | United-States | <=50K |
| 31221 | 48 | Private | 191858 | Some-college | 10 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 14736 | 49 | Private | 153536 | Some-college | 10 | Divorced | Prof-specialty | Not-in-family | White | Male | 14084 | 0 | 44 | United-States | >50K |
| 2008 | 28 | Private | 246595 | HS-grad | 9 | Never-married | Craft-repair | Own-child | White | Male | 0 | 0 | 70 | United-States | <=50K |
| 17467 | 50 | Local-gov | 139296 | 11th | 7 | Never-married | Craft-repair | Unmarried | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 11205 | 51 | Private | 89652 | HS-grad | 9 | Divorced | Adm-clerical | Unmarried | White | Female | 4787 | 0 | 24 | United-States | >50K |


In [3]:
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor, LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


In [4]:
X_train = pd.read_csv("X_train.csv")
X_test = pd.read_csv("X_test.csv")
y_train = pd.read_csv("y_train.csv").values.ravel()
y_test = pd.read_csv("y_test.csv").values.ravel()


In [5]:
X_train_sub, X_val, y_train_sub, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)


The training set is split 80/20 into a training subset and a validation set.
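
A validation set is typically used to choose between candidate models, so that the test set is touched only once for the final report. The sketch below illustrates that workflow on synthetic data from `make_regression`, which stands in for the Task 1 CSV files (an assumption made only so the example is self-contained); the `candidates` dictionary and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the preprocessed census data (assumption).
X, y = make_regression(n_samples=1000, n_features=8, noise=20.0, random_state=42)

# First split off the test set, then carve a validation set out of the train set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train_sub, X_val, y_train_sub, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)

candidates = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
}

# Compare the candidates on the validation set only.
val_rmse = {}
for name, model in candidates.items():
    model.fit(X_train_sub, y_train_sub)
    val_rmse[name] = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))

# Report the test score once, for the model selected on validation data.
best_name = min(val_rmse, key=val_rmse.get)
test_rmse = np.sqrt(
    mean_squared_error(y_test, candidates[best_name].predict(X_test))
)
print(f"Selected: {best_name}, test RMSE = {test_rmse:.2f}")
```

With these split sizes, 1000 rows become 640 training, 160 validation and 200 test rows.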

In [6]:
def evaluate_model(name, model, X_tr, y_tr, X_te, y_te):
    """Fit `model` on (X_tr, y_tr) and return regression metrics on (X_te, y_te)."""
    model.fit(X_tr, y_tr)
    preds = model.predict(X_te)
    mse = mean_squared_error(y_te, preds)  # computed once and reused for RMSE

    return {
        "Model": name,
        "MAE": mean_absolute_error(y_te, preds),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R2": r2_score(y_te, preds)
    }
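
To make the four metrics returned above concrete, the following sketch computes them by hand with NumPy on a few toy values and checks them against the sklearn functions. The numbers are made up for illustration and are not taken from the census data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values chosen for illustration only.
y_true = np.array([40.0, 45.0, 20.0, 60.0])
y_pred = np.array([38.0, 50.0, 25.0, 55.0])

errors = y_true - y_pred                  # [2, -5, -5, 5]
mae = np.mean(np.abs(errors))             # (2 + 5 + 5 + 5) / 4 = 4.25
mse = np.mean(errors ** 2)                # (4 + 25 + 25 + 25) / 4 = 19.75
rmse = np.sqrt(mse)                       # same unit as the target

# R2 compares the model's squared error with that of predicting the mean.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# The hand-computed values agree with sklearn's implementations.
print(np.isclose(mae, mean_absolute_error(y_true, y_pred)))
print(np.isclose(mse, mean_squared_error(y_true, y_pred)))
print(np.isclose(r2, r2_score(y_true, y_pred)))
```

Because R² is measured relative to a mean predictor, it becomes negative whenever a model's squared error exceeds that of always predicting the mean of the evaluated target.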


In [7]:
sgd = SGDRegressor(loss="squared_error", max_iter=1000, random_state=42)


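SGDRegressor fits a linear model with stochastic gradient descent, which scales well to large datasets. It is sensitive to feature scaling, so it assumes the features from Task 1 are already standardized.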

In [8]:
lr = LinearRegression()


Linear Regression (OLS) is used only as a reference model for comparison with the SGD Regressor.

In [9]:
dt = DecisionTreeRegressor(random_state=42)


Decision Tree Regressor can model non-linear relationships and does not require feature scaling.
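
A tree with default settings keeps splitting until it fits the training data almost perfectly, which often hurts generalization. The sketch below contrasts an unrestricted tree with a depth-limited one and prints `feature_importances_`, which is relevant for the optional feature importance analysis. The data is synthetic (an assumption): only the first of three features drives the target, through a sine function.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): feature 0 is informative, features 1 and 2 are noise.
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(600, 3))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=600)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=42)

deep = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)
shallow = DecisionTreeRegressor(max_depth=4, random_state=42).fit(X_tr, y_tr)

for name, tree in (("unlimited depth", deep), ("max_depth=4", shallow)):
    train_r2 = r2_score(y_tr, tree.predict(X_tr))
    val_r2 = r2_score(y_va, tree.predict(X_va))
    print(f"{name}: train R2 = {train_r2:.3f}, validation R2 = {val_r2:.3f}")

# Importances sum to 1; the informative feature should dominate.
print("Feature importances:", shallow.feature_importances_.round(3))
```

A large gap between training and validation R² for the unrestricted tree is the usual sign of overfitting, and limiting `max_depth` or `min_samples_leaf` is a common remedy.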

In [10]:
results = []

results.append(evaluate_model("SGD Regressor", sgd, X_train_sub, y_train_sub, X_test, y_test))
results.append(evaluate_model("Linear Regression", lr, X_train_sub, y_train_sub, X_test, y_test))
results.append(evaluate_model("Decision Tree", dt, X_train_sub, y_train_sub, X_test, y_test))

results_df = pd.DataFrame(results)
print(results_df)


               Model       MAE       MSE      RMSE        R2
0      SGD Regressor  0.276938  0.120449  0.347057  0.360035
1  Linear Regression  0.267910  0.118633  0.344431  0.369684
2      Decision Tree  0.191851  0.191851  0.438008 -0.019333


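Linear Regression performs best on the test set, with the lowest RMSE (0.344) and the highest R² (0.370).
SGD Regressor is a close second (RMSE 0.347, R² 0.360), which is expected because both estimators fit a linear model and differ mainly in how it is optimized.
Decision Tree ranks last on RMSE (0.438) and R² (-0.019), although it has the lowest MAE (0.192), so the ranking depends on the metric chosen. The negative R² means its squared error is worse than that of always predicting the mean of the test target; since the tree was trained with default settings and no depth limit, overfitting is the likely cause.
The Decision Tree's MAE equals its MSE (0.191851), which would be expected if every prediction error were exactly 0 or 1, so it is worth verifying that `y_train.csv` and `y_test.csv` contain the `hours-per-week` target.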
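
As a possible next step toward the optional hyperparameter tuning, the sketch below runs a small grid search over two tree parameters with 5-fold cross-validation. It uses synthetic data (an assumption) and an illustrative `param_grid`; the values are not tuned for the census dataset.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the preprocessed census data (assumption).
X, y = make_regression(n_samples=400, n_features=6, noise=15.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Illustrative grid: both parameters limit how far the tree can grow.
param_grid = {
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": [1, 5, 20],
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",  # sklearn maximizes, hence the sign
    cv=5,
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print(f"Cross-validated RMSE: {-search.best_score_:.2f}")

# The refitted best estimator is scored on the held-out test split at the end.
test_score = search.score(X_te, y_te)
print(f"Test RMSE: {-test_score:.2f}")
```

Cross-validation on the training data plays the role of the validation set here, so the test split is used only for the final number.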