# **Final Project Task 3 - Census Modeling Regression**

Requirements

- You can use models (estimators) from sklearn, but feel free to use any library for traditional ML.
    - Note: in sklearn, the LinearRegression estimator is based on OLS, a statistical method. Please use the SGDRegressor estimator instead, since it is based on gradient descent.
    - You can use the LinearRegression estimator, but only as a comparison with the SGDRegressor - Optional.

- Model Selection and Setup:
    - Implement multiple models, to solve a regression problem using traditional ML:
        - Linear Regression
        - Decision Tree Regression
        - Random Forest Regression - Optional
        - Ridge Regression - Optional
        - Lasso Regression - Optional
    - Choose a loss (or experiment with different losses) for the model and justify the choice.
        - MSE, MAE, RMSE, Huber Loss or others
    - Justify model choices based on dataset characteristics and task requirements; specify model pros and cons.


- Data Preparation
    - Use the preprocessed datasets from Task 1.
    - From the train set, create an extra validation set, if necessary, so that in total there are train, validation and test datasets.
    - Be sure all models have their data preprocessed as needed. Some models require different, or no encoding for some features.


- Model Training and Experimentation
    - Establish a Baseline Model:
        - For each model type, train a simple model with default settings as a baseline.
        - Evaluate its performance to establish a benchmark for comparison.
    - Make plots with train, validation loss and metric on epochs (or on steps), if applicable. - Optional
    - Feature Selection:
        - Use insights from EDA in Task 2 to identify candidate features by analyzing patterns, relationships, and distributions.
    - Experimentation:
        - For each baseline model type, iteratively experiment with different combinations of features and transformations.
        - Experiment with feature engineering techniques such as interaction terms, polynomial features, or scaling transformations.
        - Identify the model with the best performance metrics on the test set.
    - Hyperparameter Tuning:
        - Perform hyperparameter tuning only on the best-performing model after evaluating all model types and experiments.
        - Avoid tuning models that do not show strong baseline performance or are unlikely to outperform others based on experimentation.
        - Ensure that hyperparameter tuning is done after completing feature selection, baseline modeling, and experimentation, ensuring that the model is stable and representative of the dataset.


- Model Evaluation
    - Evaluate models on the test dataset using regression metrics:
        - Mean Absolute Error (MAE)
        - Mean Squared Error (MSE)
        - Root Mean Squared Error (RMSE)
        - R² Score
    - Compare the results across different models. Save all experiment results into a table.

Feature Importance - Optional
- For applicable models (e.g., Decision Tree Regression), analyze feature importance and discuss its relevance to the problem.



Deliverables

- Notebook code with no errors.
- Code and results from experiments. Create a table with all experiment results, including the experiment name and the metric results.
- Explain findings, choices, results.
- Potential areas for improvement or further exploration.


In [12]:
import pandas as pd

In [13]:
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
    "hours-per-week", "native-country", "income"
]

data = pd.read_csv(data_url, header=None, names=columns, na_values=" ?", skipinitialspace=True)
data.sample(10)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
30215,50,Federal-gov,339905,Some-college,10,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,United-States,>50K
21079,46,Local-gov,324561,Masters,14,Married-civ-spouse,Prof-specialty,Wife,White,Female,0,0,45,United-States,>50K
1119,53,Private,288020,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,40,Japan,<=50K
1253,52,Federal-gov,202452,HS-grad,9,Divorced,Adm-clerical,Not-in-family,White,Female,0,0,43,United-States,<=50K
31712,24,Private,280960,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Wife,White,Female,0,0,24,United-States,<=50K
29651,24,Private,119156,HS-grad,9,Never-married,Transport-moving,Own-child,White,Male,0,0,50,United-States,<=50K
14555,19,Self-emp-not-inc,342384,11th,7,Married-civ-spouse,Craft-repair,Own-child,White,Male,0,2129,55,United-States,<=50K
9774,34,Private,344073,HS-grad,9,Separated,Adm-clerical,Not-in-family,White,Male,0,0,40,United-States,>50K
5201,76,Self-emp-not-inc,33213,Masters,14,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,30,?,>50K
11522,33,Private,252168,Some-college,10,Never-married,Other-service,Not-in-family,Black,Male,0,0,40,United-States,<=50K


In [14]:

#Data preparation
#Load and Preprocess the data

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt


# Load the dataset
data = pd.read_csv('Data_Cleaned.csv')

# Define features and target variable
X = data.drop(columns=['hours-per-week'])
y = data['hours-per-week']

# Identify categorical and numerical columns
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Split the data into train and test sets (60% train, 40% test).
# NOTE: the validation split below is commented out (and X_temp / y_temp are never defined),
# so no separate validation set is created in this notebook.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
# X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Preprocessing pipeline for numerical and categorical features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean'))
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
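
A 60/20/20 train/validation/test split can be obtained with two consecutive calls to `train_test_split`. The following is a minimal sketch on synthetic data, not the notebook's actual split; the names `X_demo`, `y_demo`, `X_tr`, `X_va`, `X_te` are hypothetical placeholders for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and target (assumption: 1000 rows, 3 features)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = rng.normal(size=1000)

# Step 1: keep 60% for training and hold out 40% as a temporary set
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42
)

# Step 2: split the temporary set in half: 20% validation, 20% test
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42
)

print(len(X_tr), len(X_va), len(X_te))
```

With such a split, model selection and tuning would use the validation part, and the test part would be touched only once for the final report.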



In [15]:
#2.Model Selection and Setup
#We will implement multiple regression models: Linear Regression, Decision Tree Regression, Random Forest Regression, Ridge Regression and Lasso Regression. 
#We will use Mean Squared Error (MSE) as our primary loss metric because it penalizes larger errors more significantly than smaller ones.

from sklearn.linear_model import SGDRegressor, LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Define a function to evaluate models
def evaluate_model(model):
    # Create a pipeline that includes preprocessing and model training
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                ('model', model)])
    
    # Fit the model on training data
    pipeline.fit(X_train, y_train)
    
    # Predict on the held-out test set (no separate validation set is created above)
    y_pred_test = pipeline.predict(X_test)
    
    # Calculate metrics
    mse = mean_squared_error(y_test, y_pred_test)
    mae = mean_absolute_error(y_test, y_pred_test)
    r2 = r2_score(y_test, y_pred_test)
    
    return mse, mae, r2, pipeline

# Initialize models
models = {
    'SGDRegressor': SGDRegressor(),
    'DecisionTreeRegressor': DecisionTreeRegressor(),
    'RandomForestRegressor': RandomForestRegressor(),
    'Ridge': Ridge(),
    'Lasso': Lasso()
}

# Evaluate each model and store results
results = {}
pipelines = {}

for name, model in models.items():
    mse, mae, r2, pipeline = evaluate_model(model)
    results[name] = {'MSE': mse, 'MAE': mae, 'R²': r2}
    pipelines[name] = pipeline

# Convert results to DataFrame for better visualization
results_df = pd.DataFrame(results).T
print(results_df)

                                MSE           MAE            R²
SGDRegressor           3.324678e+37  5.142403e+18 -2.299769e+35
DecisionTreeRegressor  2.236030e+02  1.012160e+01 -5.467221e-01
RandomForestRegressor  1.136168e+02  7.307988e+00  2.140823e-01
Ridge                  1.217337e+02  7.642093e+00  1.579353e-01
Lasso                  1.366873e+02  7.398384e+00  5.449747e-02
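
The SGDRegressor row above (MSE on the order of 1e37) is consistent with gradient descent diverging on unscaled inputs: the numeric pipeline only imputes values and does not standardize them. A possible remedy is to add a `StandardScaler` before the estimator. The sketch below illustrates this on synthetic data with one large-scale feature; the data and the names `X_demo`, `y_demo`, `scaled_sgd` are assumptions for illustration and do not reproduce the notebook's results.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): one feature in the hundreds of thousands,
# one feature in the tens, similar in spirit to "fnlwgt" and "age"
rng = np.random.default_rng(0)
n = 2000
big = rng.uniform(10_000, 500_000, size=n)
small = rng.uniform(17, 90, size=n)
X_demo = np.column_stack([big, small])
y_demo = 40 + 0.2 * (small - 50) + 1e-5 * (big - 250_000) + rng.normal(0, 1, size=n)

# Standardize features before SGD so that all gradients are on a comparable scale
scaled_sgd = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
scaled_sgd.fit(X_demo, y_demo)

r2_scaled = r2_score(y_demo, scaled_sgd.predict(X_demo))
print(f"R² with scaling: {r2_scaled:.3f}")
```

In the notebook's own pipeline, the equivalent change would be an extra `('scaler', StandardScaler())` step inside `numeric_transformer`; tree-based models do not need it, but it does not harm them either.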


In [16]:
#3. Model Training and Experimentation
#Baseline Model Training
#We will establish baseline performance by training each model with default settings and evaluating their performance.
#Hyperparameter Tuning
#For models that show strong baseline performance (e.g., Random Forest or Ridge), we can perform hyperparameter tuning using GridSearchCV or RandomizedSearchCV.
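
The tuning step described above is not implemented in this notebook. A minimal sketch with `GridSearchCV` on synthetic data might look as follows; the parameter grid, the data, and the names `X_demo`, `y_demo`, `search` are illustrative assumptions, not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (assumption, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(0, 0.5, size=300)

# Small, hypothetical grid; a real search would cover more values
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, None],
}

# scoring uses negative MSE because GridSearchCV maximizes the score
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best CV MSE:", -search.best_score_)
```

When the estimator is wrapped in a `Pipeline` with a step named `'model'`, as in this notebook, the grid keys need the step prefix, for example `model__n_estimators`.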


In [17]:
#4. Model Evaluation
#Evaluate models on the test dataset: use MAE, MSE, RMSE (the square root of MSE), and R² to compare model performance.
# Select the model with the lowest MSE in the results table (in this run: RandomForestRegressor).
# NOTE: the results table was computed on X_test, so the same data is used for model selection and for the final evaluation.
best_model_name = results_df['MSE'].idxmin()
print(f"Best model: {best_model_name}")

# Reuse the already fitted pipeline of the selected model to predict on the test set
pipeline_best = pipelines[best_model_name]
y_pred_test = pipeline_best.predict(X_test)

# Calculate metrics on test set
test_mse = mean_squared_error(y_test, y_pred_test)
test_mae = mean_absolute_error(y_test, y_pred_test)
test_r2 = r2_score(y_test, y_pred_test)

print(f"Test Set Metrics for {best_model_name}:")
print(f"MSE: {test_mse}, MAE: {test_mae}, R²: {test_r2}")

Best model: RandomForestRegressor
Test Set Metrics for RandomForestRegressor:
MSE: 113.61677179827473, MAE: 7.307987723954878, R²: 0.21408231104125985
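
RMSE is listed among the required metrics but is not computed above. It is the square root of MSE, so the reported test MSE of about 113.62 corresponds to an RMSE of about 10.66 hours per week. The helper below is a small sketch of a metric report that includes RMSE; the function name `regression_report` and the toy arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return MAE, MSE, RMSE and R² for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),  # RMSE is in the same unit as the target
        "R2": r2_score(y_true, y_pred),
    }


# Toy values (assumption) standing in for y_test and y_pred_test
y_true_demo = np.array([40.0, 35.0, 50.0, 20.0])
y_pred_demo = np.array([42.0, 33.0, 46.0, 24.0])

report = regression_report(y_true_demo, y_pred_demo)
print(report)
```

Collecting one such dictionary per experiment and passing the collection to `pd.DataFrame` would give the experiment table requested in the deliverables.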


Based on the general characteristics of the dataset and the requirements of the task, Random Forest regression is probably the best model. It balances accuracy, robustness, and the ability to handle different feature types without making strong assumptions about the underlying relationships. In the results above it has the lowest MSE (about 113.6) and MAE (about 7.31) and the highest R² (about 0.21). However, confirming this requires empirical evaluation: training the models and comparing them on the validation and test datasets with the chosen evaluation metrics.

In [18]:
#5. Findings and Conclusions
#Model Performance: Compare the performance of different models based on their evaluation metrics.
#Feature Importance: For tree-based models like Decision Trees or Random Forests, analyze feature importance to identify which features have the most significant impact on predicting "hours-per-week".


import seaborn as sns  # required for sns.barplot below

if best_model_name == 'RandomForestRegressor' and hasattr(pipeline_best.named_steps['model'], 'feature_importances_'):
    feature_importances = pipeline_best.named_steps['model'].feature_importances_
    
    # Access the fitted preprocessor
    preprocessor_fitted = pipeline_best.named_steps['preprocessor']

    encoded_feature_names = []
    
    # Look for the OneHotEncoder inside the preprocessor
    for name, transformer, cols in preprocessor_fitted.transformers_:
        if name == 'cat':  
            if isinstance(transformer, Pipeline):
                for step_name, step_transformer in transformer.named_steps.items():
                    if isinstance(step_transformer, OneHotEncoder):
                        encoded_feature_names = step_transformer.get_feature_names_out(cols)
                        break
            elif isinstance(transformer, OneHotEncoder):  
                encoded_feature_names = transformer.get_feature_names_out(cols)
            break

    # Combine the feature names in the order the ColumnTransformer emits them:
    # numerical columns ('num') first, then the one-hot encoded columns ('cat')
    feature_names = numerical_cols + list(encoded_feature_names)

    # Check that the number of feature names matches the number of importances
    if len(feature_names) != len(feature_importances):
        raise ValueError(f"Mismatch: {len(feature_names)} feature names vs {len(feature_importances)} importances.")

    # Build a DataFrame for visualization
    importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})
    importance_df = importance_df.sort_values(by='Importance', ascending=False)

    # Plot Feature Importance
    plt.figure(figsize=(10, 6))
    sns.barplot(y=importance_df['Feature'], x=importance_df['Importance'])
    plt.title('Feature Importance')
    plt.show()
else:
    print(f"Feature importance is not available for {best_model_name}.")

In [None]:
#6. Potential Areas for Improvement or Further Exploration
#Feature Engineering: Explore additional feature engineering techniques such as interaction terms or polynomial features.
#Advanced Models: Consider experimenting with more advanced regression techniques like Gradient Boosting or XGBoost.
#Cross-Validation: Implement k-fold cross-validation to ensure robustness in model evaluation.
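
The k-fold cross-validation mentioned above could be sketched as follows. The example uses synthetic data and a `Ridge` model purely for illustration; the names `X_demo`, `y_demo`, and `mse_per_fold` are assumptions and not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic linear data (assumption, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, size=200)

# 5 shuffled folds with a fixed seed for reproducibility
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# cross_val_score returns negative MSE, so flip the sign to get MSE per fold
scores = cross_val_score(Ridge(), X_demo, y_demo, cv=cv, scoring="neg_mean_squared_error")
mse_per_fold = -scores

print("MSE per fold:", mse_per_fold)
print("Mean MSE:", mse_per_fold.mean(), "Std:", mse_per_fold.std())
```

Passing the notebook's full `Pipeline` (preprocessor plus model) instead of a bare estimator would keep imputation and encoding inside each fold and avoid leaking information from the held-out fold.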