
In [1]:
# Standard library
import os
import time
import random

# Data loading
from huggingface_hub import hf_hub_download

# Data handling
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sn

# Scikit-learn: Preprocessing, Modeling, Evaluation
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    roc_curve
)
from sklearn.ensemble import GradientBoostingClassifier

# **Classical Model**
In this section, we implement a classical tree-based machine learning model to serve as a performance baseline for comparison against Large Language Models (LLMs) on a tabular classification task.

Scikit-learn Gradient Boosted Trees (GBT)

This project uses the Gradient Boosted Trees (GBT) implementation from the popular scikit-learn machine learning library.  
GBT is a powerful ensemble learning method that builds models by combining many weak learners (decision trees) into a strong predictive model.  
We selected this approach because it integrates well with tabular data, provides strong performance, and avoids the current compatibility limitations of TensorFlow Decision Forests with Keras 3.
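To make the idea of combining weak learners concrete, here is a minimal sketch of boosting with squared-error loss on synthetic data. It is illustrative only and is not how `GradientBoostingClassifier` is implemented internally; the synthetic data, the `learning_rate` value, and the number of rounds are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (assumed data, for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1                      # shrinkage applied to every tree
prediction = np.full_like(y, y.mean())   # start from a constant model
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(50):
    residuals = y - prediction           # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)               # weak learner fitted to the errors
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(f"Training MSE before boosting: {initial_mse:.4f}, after: {final_mse:.4f}")
```

Each shallow tree is a poor model on its own; the ensemble improves because every new tree is fitted to what the previous trees still get wrong.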

There are other existing implementations of GBT, including TensorFlow Decision Forests [1], XGBoost [2], and LightGBM [3].  
We chose to implement the model using scikit-learn [0] due to its simplicity, stability, and excellent compatibility with research workflows in Python.

References:

[0] Scikit-learn: https://scikit-learn.org/stable/  
[1] TensorFlow Decision Forests: https://www.tensorflow.org/decision_forests  
[2] XGBoost: https://xgboost.readthedocs.io/  
[3] LightGBM: https://lightgbm.readthedocs.io/


### 1. Setup

In [None]:
!pip install -U scikit-learn

### 2. Data Pre-Processing

Read the dataframes from the Hugging Face Hub.

In [None]:
repo_id = "" # Enter dataset repo

# Read train dataset
csv_filename = "train.csv"
csv_path = hf_hub_download(repo_id=repo_id, filename=csv_filename)
train_data = pd.read_csv(csv_path)

# Read validation dataset
csv_filename = "val.csv"
csv_path = hf_hub_download(repo_id=repo_id, filename=csv_filename)
val_data = pd.read_csv(csv_path)

# Read test dataset
csv_filename = "test.csv"
csv_path = hf_hub_download(repo_id=repo_id, filename=csv_filename)
test_data = pd.read_csv(csv_path)

First, we identify which columns hold numerical values and which are categorical. The target column is set aside separately.

In [None]:
# Identify numerical columns (int, float)
numerical_columns = train_data.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Identify categorical columns (object, category)
categorical_columns = train_data.select_dtypes(include=['object', 'category']).columns.tolist()

# Print the feature names (column names)
print("Numerical Features:", numerical_columns)
print("Categorical Features:", categorical_columns)

In [None]:
# Define the column roles based on the output above
TARGET_COLUMN_NAME = ""  # The column we're predicting
TARGET_LABELS = []  # Enter target labels

# Numeric features based on the output
NUMERIC_FEATURE_NAMES = [] # Enter numeric features

# Categorical features based on the output
CATEGORICAL_FEATURE_NAMES = [] # Enter categorical features
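As a hedged illustration of how these lists could be derived rather than typed by hand, the sketch below splits the feature columns of a toy dataframe by dtype after dropping the target. The column names `age`, `hours_per_week`, and `workclass` are hypothetical; only `income` is named in this notebook.

```python
import pandas as pd

# Toy dataframe; all columns except "income" are hypothetical
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "hours_per_week": [40, 13, 40],
    "workclass": ["State-gov", "Private", "Private"],
    "income": ["<=50K", ">50K", "<=50K"],
})
target = "income"

# Drop the target first so it cannot end up in either feature list
features = toy.drop(columns=[target])
numeric_features = features.select_dtypes(include="number").columns.tolist()
categorical_features = features.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_features)
print("Categorical:", categorical_features)
```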

Now we create copies of the dataframes to work on.

In [None]:
# Create copies to work on
train_data = train_data.copy()
val_data = val_data.copy()
test_data = test_data.copy()

Next, we print the shapes of the training, test, and validation dataframes.

In [None]:
print(f"Train data shape: {train_data.shape}")
print(f"Test data shape: {test_data.shape}")
print(f"Validation data shape: {val_data.shape}")

The target column (`income`) is first converted from string labels (`'<=50K'`, `'>50K'`) to numeric values (0, 1) for compatibility with scikit-learn.

In addition, all categorical feature columns are cast to string type to ensure consistency and avoid potential dtype issues during encoding.

This prepares the dataset for subsequent Label Encoding.

In [None]:
def prepare_dataframe(df):
    # Convert the target labels from string to integer.
    df[TARGET_COLUMN_NAME] = df[TARGET_COLUMN_NAME].map(
        TARGET_LABELS.index
    )
    # Cast the categorical features to string.
    for feature_name in CATEGORICAL_FEATURE_NAMES:
        df[feature_name] = df[feature_name].astype(str)


prepare_dataframe(train_data)
prepare_dataframe(val_data)
prepare_dataframe(test_data)
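The target mapping above relies on `list.index`, which raises a bare `ValueError` on any label missing from `TARGET_LABELS`. A minimal sketch of a more explicit variant follows; the helper name `encode_target` is hypothetical, and the label strings are the ones quoted in the text above.

```python
import pandas as pd

TARGET_LABELS = ["<=50K", ">50K"]  # label strings quoted in the text above

def encode_target(series, labels):
    """Map string labels to their position in `labels`; fail loudly on unknowns."""
    mapping = {label: i for i, label in enumerate(labels)}
    encoded = series.map(mapping)          # unknown labels become NaN
    if encoded.isna().any():
        unknown = series[encoded.isna()].unique().tolist()
        raise ValueError(f"Unexpected target labels: {unknown}")
    return encoded.astype("int64")

toy_target = pd.Series(["<=50K", ">50K", "<=50K"])
encoded = encode_target(toy_target, TARGET_LABELS)
print(encoded.tolist())
```

Because the helper returns a new series instead of modifying the dataframe in place, re-running the cell does not try to map already-encoded integers a second time.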

Before training the Gradient Boosted Trees (GBT) model, it is necessary to convert all categorical (non-numeric) features into numerical format, since `GradientBoostingClassifier` cannot process string data directly.

In this step, we apply Label Encoding to all categorical features. Label Encoding assigns a unique integer value to each category within a feature.

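Each encoder is fitted on the combined category values of the training, validation, and test sets, so that every category appearing in any split receives a code and `transform` cannot fail on an unseen value.

Only the feature vocabulary is shared across splits; no target information is used. Note that this is not a strict train-only fit: a category that occurs only in the validation or test set still gets its own integer code.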

In [None]:
# Save encoders to reuse
encoders = {}

for col in CATEGORICAL_FEATURE_NAMES:
    # Combine all values for fitting
    combined_values = pd.concat([train_data[col], val_data[col], test_data[col]], axis=0)

    le = LabelEncoder()
    le.fit(combined_values)  # Fit on all available data (to cover all categories)

    # Transform individually to preserve dataset structure
    train_data[col] = le.transform(train_data[col])
    val_data[col] = le.transform(val_data[col])
    test_data[col] = le.transform(test_data[col])

    encoders[col] = le  # Save for reuse

print("All categorical features encoded successfully.")
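If a strict train-only fit is preferred, one possible alternative is scikit-learn's `OrdinalEncoder`, which is designed for feature columns and can map unseen categories to a sentinel value. The sketch below is an assumption about how that could look, not the method used in this notebook; the toy data and the sentinel `-1` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy splits; "Never-worked" appears only in the test split (hypothetical data)
train = pd.DataFrame({"workclass": ["Private", "State-gov", "Private"]})
test = pd.DataFrame({"workclass": ["Private", "Never-worked"]})

# Categories unseen during fit are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train[["workclass"]])  # fit on train only
test_codes = encoder.transform(test[["workclass"]])

print(train_codes.ravel().tolist())
print(test_codes.ravel().tolist())
```

The trade-off is that all unseen categories collapse into a single code, so the model cannot distinguish between them.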

### 3. Model Training & Evaluation
We'll train a **Gradient Boosted Trees (GBT)** model using the `GradientBoostingClassifier` from the scikit-learn library.

This model is chosen because:
- It typically improves predictive performance over a single decision tree.
- It trains trees sequentially, each one fitted to correct the errors of the current ensemble; overfitting is then controlled through tree-size limits (`max_depth`, `min_samples_leaf`) and row subsampling (`subsample`).
- It handles numerical features and integer-encoded categorical features together (after preprocessing).
- It integrates with the rest of the scikit-learn ecosystem and is efficient on tabular data.

In [None]:
# Gradient Boosted Trees (GBT) model for classification with configured hyperparameters

model_gbt = GradientBoostingClassifier(
    n_estimators=250,       # Number of boosting stages (trees)
    max_depth=5,            # Maximum depth of each tree
    min_samples_leaf=6,     # Minimum number of samples required at a leaf node
    subsample=0.65,         # Fraction of samples to be used for fitting each base learner
    random_state=42         # For reproducibility
)

print("GBT model initialized successfully.")
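The number of boosting stages is fixed at 250 above. As a hedged aside, `GradientBoostingClassifier` also supports built-in early stopping through `n_iter_no_change` and `validation_fraction`; the sketch below shows that option on synthetic data. The data and the early-stopping values are assumptions, and this variant is not used elsewhere in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic binary classification data (assumed, for illustration only)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

model = GradientBoostingClassifier(
    n_estimators=250,         # upper bound on the number of trees
    max_depth=5,
    min_samples_leaf=6,
    subsample=0.65,
    validation_fraction=0.1,  # internal hold-out used to monitor the loss
    n_iter_no_change=10,      # stop after 10 stages without improvement
    random_state=42,
)
model.fit(X, y)

# n_estimators_ reports how many stages were actually fitted
print(f"Trees fitted: {model.n_estimators_} of {model.n_estimators}")
```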


The `run_experiment()` function separates features and target labels from the datasets, trains the scikit-learn Gradient Boosting Classifier on the training set, and evaluates its performance on the test set. The validation set is split into features and labels as well, but it is not scored in this version.

The function calculates standard classification metrics, including accuracy, precision, recall, F1-score, and ROC-AUC.

Training time is also measured to assess computational efficiency.

In [None]:
def run_experiment(model, train_data, val_data, test_data):
    # Exclude target
    exclude_columns = [TARGET_COLUMN_NAME]

    # Separate features and labels
    X_train = train_data.drop(columns=exclude_columns)
    y_train = train_data[TARGET_COLUMN_NAME]

    X_val = val_data.drop(columns=exclude_columns)
    y_val = val_data[TARGET_COLUMN_NAME]

    X_test = test_data.drop(columns=exclude_columns)
    y_test = test_data[TARGET_COLUMN_NAME]

    # Train the model
    start_time = time.time()
    model.fit(X_train, y_train)
    end_time = time.time()

    training_time = end_time - start_time

    # Evaluate on test set
    y_pred = model.predict(X_test)

    # Correctly Predicted
    correct_predictions = np.sum(y_test == y_pred)
    total_predictions = len(y_test)
    print(f"Correctly Predicted: {correct_predictions}/{total_predictions}")

    # Calculate metrics
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    accuracy = accuracy_score(y_test, y_pred)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

    print("Model evaluation complete:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")
    print(f"ROC AUC: {auc:.4f}")
    print(f"Training Time: {training_time:.4f} seconds")

    return model


In [None]:
model_gbt = run_experiment(model_gbt, train_data, val_data, test_data)
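To report the same metrics on more than one split, the scoring logic can be factored into a small helper. The sketch below is one possible shape for it; the name `score_split` and the synthetic data are hypothetical, and the helper assumes a binary target with a fitted classifier that exposes `predict_proba`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

def score_split(model, X, y):
    """Return a dict of metrics for one labelled split (binary target)."""
    y_pred = model.predict(X)
    y_proba = model.predict_proba(X)[:, 1]  # probability of the positive class
    return {
        "accuracy": accuracy_score(y, y_pred),
        "f1": f1_score(y, y_pred),
        "roc_auc": roc_auc_score(y, y_proba),
    }

# Synthetic data split 60/20/20 into train, validation, and test
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
val_scores = score_split(model, X_val, y_val)
test_scores = score_split(model, X_test, y_test)

for name, scores in [("validation", val_scores), ("test", test_scores)]:
    print(name, {k: round(v, 4) for k, v in scores.items()})
```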

We extract and display the relative importance of each feature using the trained model's `feature_importances_` attribute.

In [None]:
# Feature importance analysis for sklearn Gradient Boosting
feature_names = train_data.drop(columns=[TARGET_COLUMN_NAME]).columns
importances = model_gbt.feature_importances_

feature_importance_dict = dict(zip(feature_names, importances))
sorted_importances = sorted(feature_importance_dict.items(), key=lambda x: x[1], reverse=True)

print("Feature Importances:")
for feature, importance in sorted_importances:
    print(f"{feature}: {importance:.4f}")
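`feature_importances_` is computed from impurity reductions on the training data. As a complementary check, permutation importance measures how much a held-out score drops when one feature is shuffled. The sketch below uses `sklearn.inspection.permutation_importance` on synthetic data; the data and the `n_repeats` value are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with 6 features, 3 of them informative (assumed setup)
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature 5 times on the held-out split and record the score drop
result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)

ranking = result.importances_mean.argsort()[::-1]  # most important first
for idx in ranking:
    mean = result.importances_mean[idx]
    std = result.importances_std[idx]
    print(f"feature_{idx}: {mean:.4f} +/- {std:.4f}")
```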

Receiver Operating Characteristic (ROC) Curve:

The ROC curve is a graphical representation of a classifier's performance across different classification thresholds.  
It plots the **True Positive Rate (Sensitivity)** against the **False Positive Rate (1 - Specificity)** at various threshold settings.  
The ROC curve helps to evaluate the trade-off between sensitivity and specificity.  
A model with a curve closer to the top-left corner indicates better performance.  
The **Area Under the Curve (AUC)** summarizes the overall ability of the model to distinguish between the classes, where an AUC of 1.0 represents a perfect classifier and 0.5 represents random guessing.
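A small worked example may help make these definitions concrete. The sketch below computes the true and false positive rates at a single threshold by hand on four toy scores, then checks that the AUC equals the fraction of (positive, negative) pairs that the scores rank correctly. The toy labels, scores, and the helper `rates_at` are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Four toy examples: two negatives and two positives
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

def rates_at(threshold):
    """Return (TPR, FPR) when predicting positive for score >= threshold."""
    y_pred = (y_score >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tpr = tp / np.sum(y_true == 1)   # sensitivity
    fpr = fp / np.sum(y_true == 0)   # 1 - specificity
    return float(tpr), float(fpr)

tpr, fpr = rates_at(0.5)
print(f"Threshold 0.5 -> TPR = {tpr}, FPR = {fpr}")

# AUC as the share of positive/negative pairs ranked in the right order
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise_auc = float(np.mean(pos[:, None] > neg[None, :]))

auc = roc_auc_score(y_true, y_score)
print(f"Pairwise AUC = {pairwise_auc}, roc_auc_score = {auc}")
```

Here three of the four positive/negative pairs are ordered correctly, so both computations give an AUC of 0.75.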

In [None]:
# Prepare test data
X_test = test_data.drop(columns=[TARGET_COLUMN_NAME])
y_true = test_data[TARGET_COLUMN_NAME].values

# Predict probabilities for the positive class
y_proba = model_gbt.predict_proba(X_test)[:, 1]

# Compute ROC AUC score
roc_auc = roc_auc_score(y_true, y_proba)
print(f"ROC AUC: {roc_auc:.4f}")

# Compute ROC curve points
fpr, tpr, thresholds = roc_curve(y_true, y_proba)

# Plot the ROC curve
plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, label=f"ROC Curve (AUC = {roc_auc:.4f})", linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', label="Random Classifier (AUC = 0.5)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic (ROC) Curve")
plt.legend(loc="lower right")
plt.grid(True)
plt.tight_layout()
plt.show()