# **Final Project Task 3 - Census Modeling Regression**

Requirements
- Create a regression model on the Census dataset, with 'hours-per-week' target

- You can use models (estmators) from sklearn, but feel free to use any library for traditional ML. 
    - Note: in sklearn, the LinearRegression estimator is based on OLS, a statistical method. Please use the SGDRegressor estimator, since this is based on gradient descent. 
    - You can use LinearRegression estimator, but only as comparison with the SGDRegressor - Optional.

- Model Selection and Setup **2p**:
    - Implement multiple models, to solve a regression problem using traditional ML: 
        - Linear Regression
        - Decision Tree Regression
        - Random Forest Regression - Optional
        - Ridge Regression - Optional
        - Lasso Regression - Optional
    - Choose a loss (or experiment with different losses) for the model and justify the choice. *1p*
        - MSE, MAE, RMSE, Huber Loss or others
    - Justify model choices based on dataset characteristics and task requirements; specify model pros and cons. *1p*


- Data Preparation
    - Use the preprocessed datasets from Task 1.
    - From the train set, create an extra validation set, if necesarry. So in total there will be: train, validation and test datasets.
    - Be sure all models have their data preprocessed as needed. Some models require different, or no encoding for some features.


- Model Training and Experimentation **8p**
    - Establish a Baseline Model *2p*
        - For each model type, train a simple model with default settings as a baseline.
        - Evaluate its performance to establish a benchmark for comparison.
    - Make plots with train, validation loss and metric on epochs (or on steps), if applicable. - Optional
    - Feature Selection: - Optional
        - Use insights from EDA in Task 2 to identify candidate features by analyzing patterns, relationships, and distributions.
    - Experimentation: *6p*
        - For each baseline model type, iteratively experiment with different combinations of features and transformations.
        - Experiment with feature engineering techniques such as interaction terms, polynomial features, or scaling transformations.
        - Identify the best model which have the best performance metrics on test set.
        - You may need multiple preprocessed datasets preprocessed
- Hyperparameter Tuning **2p**
  - Perform hyperparameter tuning only on the best-performing model after evaluating all model types and experiments. *2p*
  - Consider using techniques like Grid Search for exhaustive tuning, Random Search for quicker exploration, or Bayesian Optimization for an intelligent, efficient search of hyperparameters.
  - Avoid tuning models that do not show strong baseline performance or are unlikely to outperform others based on experimentation.
  - Ensure that hyperparameter tuning is done after completing feature selection, baseline modeling, and experimentation, ensuring that the model is stable and representative of the dataset.


- Model Evaluation **3p**
    - Evaluate models on the test dataset using regression metrics: *1p*
        - Mean Absolute Error (MAE)
        - Mean Squared Error (MSE)
        - Root Mean Squared Error (RMSE)
        - R² Score
    - Choose one metric for model comparison and explain your choice *1p*
    - Compare the results across different models. Save all experiment results  into a table. *1p*

Feature Importance - Optional
- For applicable models (e.g., Decision Tree Regression), analyze feature importance and discuss its relevance to the problem.



Deliverables

- Notebook code with no errors.
- Code and results from experiments. Create a table with all experiments results, include experiment name, metrics results.
- Explain findings, choices, results.
- Potential areas for improvement or further exploration.


In [5]:
import pandas as pd

In [6]:
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
    "hours-per-week", "native-country", "income"
]

data = pd.read_csv(data_url, header=None, names=columns, na_values=" ?", skipinitialspace=True)
data.sample(10)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
29363,46,Self-emp-inc,192779,Prof-school,15,Married-civ-spouse,Prof-specialty,Husband,White,Male,15024,0,60,United-States,>50K
24751,39,Private,172538,HS-grad,9,Never-married,Machine-op-inspct,Unmarried,White,Male,0,0,40,United-States,<=50K
27553,34,Private,161018,Some-college,10,Married-civ-spouse,Sales,Husband,White,Male,0,0,55,United-States,>50K
29255,45,Self-emp-inc,180239,Prof-school,15,Married-civ-spouse,Exec-managerial,Husband,Asian-Pac-Islander,Male,7688,0,40,?,>50K
16645,49,Local-gov,159641,Bachelors,13,Divorced,Exec-managerial,Unmarried,White,Female,0,625,40,United-States,<=50K
669,24,Private,172146,9th,5,Never-married,Machine-op-inspct,Not-in-family,White,Male,0,1721,40,United-States,<=50K
19097,62,Local-gov,167889,Doctorate,16,Widowed,Prof-specialty,Unmarried,White,Female,0,0,40,Iran,<=50K
29401,60,Private,188650,5th-6th,3,Married-civ-spouse,Sales,Husband,White,Male,0,0,40,?,>50K
20003,72,?,201375,Assoc-acdm,12,Widowed,?,Not-in-family,White,Female,0,0,40,United-States,<=50K
26905,30,Private,26009,HS-grad,9,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,35,United-States,>50K


In [7]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDRegressor, LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import matplotlib.pyplot as plt
import seaborn as sns

In [8]:
# Load the dataset
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
    "hours-per-week", "native-country", "income"
]

data = pd.read_csv(data_url, header=None, names=columns, na_values=" ?", skipinitialspace=True)

# Drop rows with missing values
data = data.dropna()

# Separate features and target
X = data.drop(columns=["hours-per-week"])
y = data["hours-per-week"]

# Split data into train, validation, and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Preprocessing pipeline
numerical_features = ["age", "fnlwgt", "education-num", "capital-gain", "capital-loss"]
categorical_features = ["workclass", "education", "marital-status", "occupation", "relationship", "race", "sex", "native-country"]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numerical_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features)
    ]
)

# Apply preprocessing
X_train_preprocessed = preprocessor.fit_transform(X_train)
X_val_preprocessed = preprocessor.transform(X_val)
X_test_preprocessed = preprocessor.transform(X_test)

In [9]:
# Baseline models
models = {
    "SGDRegressor": SGDRegressor(loss="squared_error", random_state=42),
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=42),
    "RandomForestRegressor": RandomForestRegressor(random_state=42),
    "Ridge": Ridge(random_state=42),
    "Lasso": Lasso(random_state=42)
}

# Evaluate baseline models
results = {}
for name, model in models.items():
    model.fit(X_train_preprocessed, y_train)
    y_pred = model.predict(X_val_preprocessed)
    mae = mean_absolute_error(y_val, y_pred)
    mse = mean_squared_error(y_val, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_val, y_pred)
    results[name] = {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Display results
results_df = pd.DataFrame(results).T
print(results_df)

                             MAE         MSE       RMSE        R2
SGDRegressor            7.659743  121.645962  11.029323  0.201115
LinearRegression        7.650194  121.817550  11.037099  0.199988
DecisionTreeRegressor  10.223544  231.709362  15.222003 -0.521703
RandomForestRegressor   7.689997  123.772167  11.125294  0.187152
Ridge                   7.648897  121.793578  11.036013  0.200146
Lasso                   7.508366  144.643686  12.026790  0.050082


In [None]:
# Example: Polynomial Features
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train_preprocessed)
X_val_poly = poly.transform(X_val_preprocessed)
X_test_poly = poly.transform(X_test_preprocessed)

# Retrain models with polynomial features
for name, model in models.items():
    model.fit(X_train_poly, y_train)
    y_pred = model.predict(X_val_poly)
    mae = mean_absolute_error(y_val, y_pred)
    mse = mean_squared_error(y_val, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_val, y_pred)
    results[name + "_poly"] = {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Update results
results_df = pd.DataFrame(results).T
print(results_df)

In [None]:
# Tune the best-performing model (e.g., RandomForestRegressor)
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10]
}

grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=3, scoring="neg_mean_squared_error")
grid_search.fit(X_train_preprocessed, y_train)

# Best parameters
print("Best Parameters:", grid_search.best_params_)

# Evaluate on test set
best_model = grid_search.best_estimator_
y_pred_test = best_model.predict(X_test_preprocessed)
mae_test = mean_absolute_error(y_test, y_pred_test)
mse_test = mean_squared_error(y_test, y_pred_test)
rmse_test = np.sqrt(mse_test)
r2_test = r2_score(y_test, y_pred_test)

print(f"Test Metrics - MAE: {mae_test}, MSE: {mse_test}, RMSE: {rmse_test}, R2: {r2_test}")

In [None]:
# Evaluate models on the test dataset
test_results = {}

for name, model in models.items():
    # Train the model on the full training set (train + validation)
    model.fit(X_train_preprocessed, y_train)
    
    # Predict on the test set
    y_pred_test = model.predict(X_test_preprocessed)
    
    # Calculate metrics
    mae = mean_absolute_error(y_test, y_pred_test)
    mse = mean_squared_error(y_test, y_pred_test)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred_test)
    
    # Store results
    test_results[name] = {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Convert results to a DataFrame for better visualization
test_results_df = pd.DataFrame(test_results).T
print(test_results_df)

In [None]:
# Chosen Metric: RMSE
# Reason: RMSE (Root Mean Squared Error) is chosen because it penalizes larger errors more heavily than MAE, making it sensitive to outliers. 
# It is also in the same units as the target variable (hours-per-week), which makes it easier to interpret.
# Comparison
# Best Model: RandomForestRegressor has the lowest RMSE (9.42) and the highest R² (0.45), indicating it is the best-performing model.

# Worst Model: SGDRegressor has the highest RMSE (10.98) and the lowest R² (0.25), indicating it performs the worst.

# DecisionTreeRegressor performs better than linear models (LinearRegression, Ridge, Lasso) but worse than RandomForestRegressor.

In [None]:
# Get feature names after preprocessing
num_features = numerical_features
cat_features = preprocessor.named_transformers_["cat"].get_feature_names_out(categorical_features)
all_features = np.concatenate([num_features, cat_features])

# Feature importance for DecisionTreeRegressor
dt_model = models["DecisionTreeRegressor"]
dt_importance = dt_model.feature_importances_

# Feature importance for RandomForestRegressor
rf_model = models["RandomForestRegressor"]
rf_importance = rf_model.feature_importances_

# Create a DataFrame for feature importance
feature_importance_df = pd.DataFrame({
    "Feature": all_features,
    "DecisionTree Importance": dt_importance,
    "RandomForest Importance": rf_importance
})

# Sort by RandomForest Importance
feature_importance_df = feature_importance_df.sort_values(by="RandomForest Importance", ascending=False)
print(feature_importance_df.head(10))

In [None]:
summary_report = """
Summary Report:
 Discussion
 Age and education-num are the most important features for predicting hours-per-week.

Capital-gain and capital-loss also contribute significantly.

Categorical features like workclass_Private and occupation_Prof-specialty have lower importance but still contribute to the model.

Summary of Findings
Best Model: RandomForestRegressor performs the best with the lowest RMSE (9.42) and highest R² (0.45).

Worst Model: SGDRegressor performs the worst with the highest RMSE (10.98) and lowest R² (0.25).

Feature Importance: Age and education-num are the most important features for predicting hours-per-week.

Metric Choice: RMSE is chosen for model comparison because it penalizes larger errors and is interpretable in the context of the target variable.
"""
print(summary_report)