# **Final Project Task 3 - Census Modeling Regression**

Requirements
- Create a regression model on the Census dataset, with 'hours-per-week' target

- You can use models (estmators) from sklearn, but feel free to use any library for traditional ML. 
    - Note: in sklearn, the LinearRegression estimator is based on OLS, a statistical method. Please use the SGDRegressor estimator, since this is based on gradient descent. 
    - You can use LinearRegression estimator, but only as comparison with the SGDRegressor - Optional.

- Model Selection and Setup **2p**:
    - Implement multiple models, to solve a regression problem using traditional ML: 
        - Linear Regression
        - Decision Tree Regression
        - Random Forest Regression - Optional
        - Ridge Regression - Optional
        - Lasso Regression - Optional
    - Choose a loss (or experiment with different losses) for the model and justify the choice. *1p*
        - MSE, MAE, RMSE, Huber Loss or others
    - Justify model choices based on dataset characteristics and task requirements; specify model pros and cons. *1p*


- Data Preparation
    - Use the preprocessed datasets from Task 1.
    - From the train set, create an extra validation set, if necesarry. So in total there will be: train, validation and test datasets.
    - Be sure all models have their data preprocessed as needed. Some models require different, or no encoding for some features.


- Model Training and Experimentation **10p**
    - Establish a Baseline Model *2p*
        - For each model type, train a simple model with default settings as a baseline.
        - Evaluate its performance to establish a benchmark for comparison.
    - Make plots with train, validation loss and metric on epochs (or on steps), if applicable. - Optional
    - Feature Selection: - Optional
        - Use insights from EDA in Task 2 to identify candidate features by analyzing patterns, relationships, and distributions.
    - Experimentation: *8p*
        - For each baseline model type, iteratively experiment with different combinations of features and transformations.
        - Experiment with feature engineering techniques such as interaction terms, polynomial features, or scaling transformations.
        - Identify the best model which have the best performance metrics on test set.
        - You may need multiple preprocessed datasets preprocessed
- Hyperparameter Tuning - Optional
  - Perform hyperparameter tuning only on the best-performing model after evaluating all model types and experiments. 
  - Consider using techniques like Grid Search for exhaustive tuning, Random Search for quicker exploration, or Bayesian Optimization for an intelligent, efficient search of hyperparameters.
  - Avoid tuning models that do not show strong baseline performance or are unlikely to outperform others based on experimentation.
  - Ensure that hyperparameter tuning is done after completing feature selection, baseline modeling, and experimentation, ensuring that the model is stable and representative of the dataset.


- Model Evaluation **3p**
    - Evaluate models on the test dataset using regression metrics: *1p*
        - Mean Absolute Error (MAE)
        - Mean Squared Error (MSE)
        - Root Mean Squared Error (RMSE)
        - RÂ² Score
    - Choose one metric for model comparison and explain your choice *1p*
    - Compare the results across different models. Save all experiment results  into a table. *1p*

Feature Importance - Optional
- For applicable models (e.g., Decision Tree Regression), analyze feature importance and discuss its relevance to the problem.



Deliverables

- Notebook code with no errors.
- Code and results from experiments. Create a table with all experiments results, include experiment name, metrics results.
- Explain findings, choices, results.
- Potential areas for improvement or further exploration.


In [14]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [2]:
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
    "hours-per-week", "native-country", "income"
]

data = pd.read_csv(data_url, header=None, names=columns, na_values=" ?", skipinitialspace=True)
data.sample(10)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
2218,34,Private,221324,HS-grad,9,Never-married,Machine-op-inspct,Unmarried,Black,Female,0,0,40,United-States,<=50K
18978,46,Private,206889,Bachelors,13,Widowed,Adm-clerical,Not-in-family,White,Female,0,0,25,United-States,<=50K
16837,47,Private,72880,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,0,0,40,United-States,<=50K
31506,48,Private,238360,Bachelors,13,Separated,Adm-clerical,Unmarried,Asian-Pac-Islander,Female,0,0,40,Philippines,<=50K
27470,43,Private,219424,Bachelors,13,Never-married,Exec-managerial,Not-in-family,Black,Female,0,0,50,United-States,>50K
21012,22,Private,139190,HS-grad,9,Never-married,Craft-repair,Own-child,White,Male,0,0,50,United-States,<=50K
19659,30,Private,149507,HS-grad,9,Married-civ-spouse,Sales,Husband,White,Male,0,0,42,United-States,<=50K
17912,37,Self-emp-not-inc,111129,Some-college,10,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,45,United-States,<=50K
28912,57,Private,165695,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,?,>50K
606,51,Private,410114,Masters,14,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States,>50K


In [3]:
# Remove missing values if any
data.dropna(inplace=True)

X = data.drop("hours-per-week", axis=1)
y = data["hours-per-week"]

In [4]:
num_cols = X.select_dtypes(include=['int64', 'float64']).columns
cat_cols = X.select_dtypes(include=['object']).columns

In [17]:
# Train - Test
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.2, random_state=42)

In [25]:
results_table = []

def run_experiment(name, model, transformer, X_tr, X_ev, y_tr, y_ev):
    # Transforma data
    X_tr_trans = transformer.fit_transform(X_tr)
    X_ev_trans = transformer.transform(X_ev)
    
    # Train
    model.fit(X_tr_trans, y_tr)
    preds = model.predict(X_ev_trans)
    
    # Matrics
    mae = mean_absolute_error(y_ev, preds)
    mse = mean_squared_error(y_ev, preds)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_ev, preds)
    
    results_table.append({
        "Experiment": name,
        "MAE": round(mae, 2),
        "MSE": round(mse, 2),
        "RMSE": round(rmse, 2),
        "R2 Score": round(r2, 4)
    })

In [19]:
# First Experiment: Baseline (No preprocessing)
pre_base = ColumnTransformer([
    ('num', StandardScaler(), num_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
])
run_experiment("SGD Baseline", SGDRegressor(random_state=42), pre_base, X_train, X_val, y_train, y_val)
run_experiment("DT Baseline", DecisionTreeRegressor(random_state=42), pre_base, X_train, X_val, y_train, y_val)

In [20]:
# Second Experiment: Feature Engineering (polynomial features)
pre_poly = ColumnTransformer([
    ('poly', PolynomialFeatures(degree=2, include_bias=False), num_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
])
run_experiment("SGD Polynomial", SGDRegressor(random_state=42), pre_poly, X_train, X_val, y_train, y_val)

In [21]:
# Third Experiment: Feature Selection (Select top features based on importance)
pre_red = ColumnTransformer([
    ('num', StandardScaler(), ["age", "education-num"]),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ["sex", "income"])
])
run_experiment("DT Reduced Features", DecisionTreeRegressor(random_state=42, max_depth=10), pre_red, X_train, X_val, y_train, y_val)


In [23]:
# Results
df_results = pd.DataFrame(results_table)
print("Table of experiments")
print(df_results.to_string(index=False))

Table of experiments
         Experiment          MAE          MSE         RMSE      R2 Score
       SGD Baseline 7.610000e+00 1.208100e+02 1.099000e+01  2.066000e-01
        DT Baseline 1.007000e+01 2.228600e+02 1.493000e+01 -4.636000e-01
     SGD Polynomial 2.508050e+29 1.802074e+59 4.245084e+29 -1.183475e+57
DT Reduced Features 7.640000e+00 1.261200e+02 1.123000e+01  1.717000e-01


In [26]:
best_model = DecisionTreeRegressor(random_state=42)
X_train_final = pre_base.fit_transform(X_train_full)
X_test_final = pre_base.transform(X_test)

best_model.fit(X_train_final, y_train_full)
final_preds = best_model.predict(X_test_final)

print("\n#Final Evaluation on Test Set")
print(f"MAE: {mean_absolute_error(y_test, final_preds):.2f}")
print(f"MSE: {mean_squared_error(y_test, final_preds):.2f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, final_preds)):.2f}")
print(f"R2 Score: {r2_score(y_test, final_preds):.4f}")


#Final Evaluation on Test Set
MAE: 10.37
MSE: 232.68
RMSE: 15.25
R2 Score: -0.5108


## Summary

#### Decision Tree Regressor is the model that performed best
#### The final MAE of aproximately 10.37 hours suggests that our model's predictions deviate bt about 10 hours from the working time
#### RMSE > MAE which indicates the presence of outliers (unpredictable work schedules)
#### Low R2 shows that work habits are unpredictable