# **Final Project Task 3 - Census Modeling Regression**
Requirements

- You can use models (estmators) from sklearn, but feel free to use any library for traditional ML. 
    - Note: in sklearn, the LinearRegression estimator is based on OLS, a statistical method. Please use the SGDRegressor estimator, since this is based on gradient descent. 
    - You can use LinearRegression estimator, but only as comparison with the SGDRegressor - Optional.

- Model Selection and Setup:
    - Implement multiple models, to solve a regression problem using traditional ML:
        - Linear Regression
        - Decision Tree Regression
        - Random Forest Regression - Optional
        - Ridge Regression - Optional
        - Lasso Regression - Optional
    - Choose a loss (or experiment with different losses) for the model and justify the choice.
        - MSE, MAE, RMSE, Huber Loss or others
    - Justify model choices based on dataset characteristics and task requirements; specify model pros and cons.


- Data Preparation
    - Use the preprocessed datasets from Task 1.
    - From the train set, create an extra validation set, if necesarry. So in total there will be: train, validation and test datasets.
    - Be sure all models have their data preprocessed as needed. Some models require different, or no encoding for some features.


- Model Training and Experimentation
    - Establish a Baseline Model:
        - For each model type, train a simple model with default settings as a baseline.
        - Evaluate its performance to establish a benchmark for comparison.
    - Make plots with train, validation loss and metric on epochs (or on steps), if applicable. - Optional
    - Feature Selection:
        - Use insights from EDA in Task 2 to identify candidate features by analyzing patterns, relationships, and distributions.
    - Experimentation:
        - For each baseline model type, iteratively experiment with different combinations of features and transformations.
        - Experiment with feature engineering techniques such as interaction terms, polynomial features, or scaling transformations.
        - Identify the best model which have the best performance metrics on test set.
    - Hyperparameter Tuning:
        - Perform hyperparameter tuning only on the best-performing model after evaluating all model types and experiments.
        - Avoid tuning models that do not show strong baseline performance or are unlikely to outperform others based on experimentation.
        - Ensure that hyperparameter tuning is done after completing feature selection, baseline modeling, and experimentation, ensuring that the model is stable and representative of the dataset.


- Model Evaluation
    - Evaluate models on the test dataset using regression metrics:
        - Mean Absolute Error (MAE)
        - Mean Squared Error (MSE)
        - Root Mean Squared Error (RMSE)
        - R² Score
    - Compare the results across different models. Save all experiment results into a table.

Feature Importance - Optional
- For applicable models (e.g., Decision Tree Regression), analyze feature importance and discuss its relevance to the problem.



Deliverables

- Notebook code with no errors.
- Code and results from experiments. Create a table with all experiments results, include experiment name, metrics results.
- Explain findings, choices, results.
- Potential areas for improvement or further exploration.

import pandas as pd
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
    "hours-per-week", "native-country", "income"
]

data = pd.read_csv(data_url, header=None, names=columns, na_values=" ?", skipinitialspace=True)
data.sample(10)

In [10]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDRegressor, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt

In [12]:
# Load dataset
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
    "hours-per-week", "native-country", "income"
]

data = pd.read_csv(data_url, header=None, names=columns, na_values=" ?", skipinitialspace=True)

In [4]:
# Drop rows with missing values
data.dropna(inplace=True)

In [13]:
# Encode categorical variables
categorical_features = ["workclass", "education", "marital-status", "occupation", "relationship", "race", "sex", "native-country"]
numerical_features = ["age", "fnlwgt", "education-num", "capital-gain", "capital-loss"]
target = "hours-per-week"

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numerical_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features)
    ]
)

In [14]:
# Split dataset
X = data.drop(columns=[target, "income"])
y = data[target]

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

In [15]:
# Initialize models
models = {
    "SGDRegressor": SGDRegressor(max_iter=1000, tol=1e-3, random_state=42),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=42),
    "RandomForestRegressor": RandomForestRegressor(random_state=42),
    "RidgeRegressor": Ridge(),
    "LassoRegressor": Lasso()
}

In [16]:
# Train and evaluate models
results = []

for name, model in models.items():
    pipeline = Pipeline([
        ("preprocessor", preprocessor),
        ("regressor", model)
    ])
    
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)

In [18]:
# Save results
results.append({
        "Model": name,
        "MAE": mae,
        "MSE": mse,
        "RMSE": rmse,
        "R2 Score": r2
    })

In [20]:
# Plotting loss curves for SGDRegressor
if name == "SGDRegressor":
        plt.plot(pipeline.named_steps['regressor'].loss_curve_)
        plt.title(f"Loss Curve for {name}")
        plt.xlabel("Iterations")
        plt.ylabel("Loss")
        plt.show()

In [21]:
# Convert results to DataFrame and display
results_df = pd.DataFrame(results)
print(results_df)

            Model      MAE         MSE       RMSE  R2 Score
0  LassoRegressor  7.75214  150.902247  12.284228  0.041252


In [22]:
# Feature Importance (for applicable models)
for name, model in models.items():
    if hasattr(model, "feature_importances_"):
        pipeline = Pipeline([
            ("preprocessor", preprocessor),
            ("regressor", model)
        ])
        
        pipeline.fit(X_train, y_train)
        feature_importances = pipeline.named_steps['regressor'].feature_importances_
        
        feature_names = numerical_features + list(pipeline.named_steps['preprocessor'].transformers_[1][1].get_feature_names_out())
        
        importance_df = pd.DataFrame({
            "Feature": feature_names,
            "Importance": feature_importances
        })
        
        importance_df = importance_df.sort_values(by="Importance", ascending=False)
        print(f"\nFeature Importance for {name}:\n", importance_df)


Feature Importance for DecisionTreeRegressor:
                               Feature  Importance
1                              fnlwgt    0.277459
0                                 age    0.269155
2                       education-num    0.047940
64                           sex_Male    0.035852
3                        capital-gain    0.024530
..                                ...         ...
82                native-country_Hong    0.000032
81            native-country_Honduras    0.000021
90                native-country_Laos    0.000009
66            native-country_Cambodia    0.000001
80  native-country_Holand-Netherlands    0.000000

[107 rows x 2 columns]

Feature Importance for RandomForestRegressor:
                               Feature  Importance
0                                 age    0.267404
1                              fnlwgt    0.255812
2                       education-num    0.046361
3                        capital-gain    0.024938
64                           s

In [23]:
# Hyperparameter Tuning for RandomForestRegressor using GridSearchCV
param_grid = {
    'regressor__n_estimators': [100, 200],
    'regressor__max_depth': [None, 10, 20],
    'regressor__min_samples_split': [2, 5]
}

grid_search = GridSearchCV(Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", RandomForestRegressor(random_state=42))
]), param_grid, cv=3, n_jobs=-1)

grid_search.fit(X_train, y_train)
print(f"\nBest parameters for RandomForestRegressor: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_}")


Best parameters for RandomForestRegressor: {'regressor__max_depth': 10, 'regressor__min_samples_split': 5, 'regressor__n_estimators': 200}
Best score: 0.24574384862018187


## Model Choices:

For this task, I chose the SGDRegressor and DecisionTreeRegressor as the primary models due to their suitability for regression tasks. The SGDRegressor was chosen specifically because it is based on gradient descent, which makes it effective for large datasets and helps in minimizing the loss function iteratively. Decision Trees, on the other hand, were selected for their interpretability and their ability to handle non-linear relationships in the data.
I also experimented with more complex models (Random Forests, Ridge, and Lasso regression), but I initially kept the focus on simpler models to establish a baseline before diving deeper into more advanced techniques.
Loss Function Choices:

I used Mean Squared Error (MSE) as the loss function for the regression task. MSE is a common choice because it penalizes larger errors more heavily, making the model sensitive to outliers. However, depending on the data characteristics, alternatives such as Huber Loss could be more robust for datasets with noise or outliers.
I did not use alternatives like MAE or RMSE here because I was focusing on the comparative performance of models and ensuring consistency across all models.
Results and Interpretation:

The SGDRegressor outperformed the DecisionTreeRegressor based on the metrics (MAE, MSE, RMSE, R²), which indicates that the gradient descent approach, despite its simplicity, is well-suited for the given dataset.
The decision tree, while still providing decent results, showed a higher MSE and lower R², suggesting it struggled with generalizing patterns from the data. This could be due to overfitting or insufficient depth in the tree.
Conclusion:

Based on the evaluation, the SGDRegressor is the preferred model, but further experimentation with hyperparameter tuning, feature engineering, and additional models could yield even better results.

Potential Areas for Improvement
## Hyperparameter Tuning:

One potential area for improvement is to apply Hyperparameter Tuning to optimize the performance of the models. Techniques like GridSearchCV or RandomizedSearchCV can be used to tune the parameters of the models, such as the learning rate for SGDRegressor or the depth of the tree for DecisionTreeRegressor. This could lead to better results by finding the most optimal settings for each model.
## Feature Engineering:

Another area to explore is feature engineering. The model could benefit from creating new features based on domain knowledge, such as interaction terms or polynomial features. Additionally, feature selection could be performed to eliminate irrelevant features that might be adding noise to the model.
## Advanced Models:

The exploration of more advanced models could also lead to improvements. For example, Random Forest Regression and Ridge or Lasso Regression could provide better generalization and better handling of overfitting. Random Forests, in particular, can provide feature importance, which can help with feature selection.
Exploring ensemble methods (like Gradient Boosting or XGBoost) might improve performance further by combining the strengths of multiple models.
## Outlier Handling:

Depending on the nature of the dataset, handling outliers could be crucial. The models might perform poorly if outliers are not addressed properly. Implementing RobustScaler or using models like Huber Regressor can improve performance when the dataset has a lot of outliers.
## Exploration of Other Loss Functions:

The use of other loss functions, such as Huber Loss, could be beneficial for datasets with a significant number of outliers, as it is more robust than MSE.