# CLASS BALANCING

In this notebook, we move from data preparation to the core modeling phase. Our goal is to train a robust classification model to predict vessel delays.

Before training, however, we must inspect the target distribution. If one class (e.g., "Delayed") is significantly more frequent than the other, the model will be biased toward it, leading to poor decision-making and "alert fatigue" in the port's operations.

## SETUP

In [1]:
import pandas as pd
import numpy as np
import os
from imblearn.combine import SMOTETomek
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, precision_score, classification_report, confusion_matrix

# Configuration and paths
PROJECT_PATH = '/Users/rober/smartport-ai-risk-early-warning'
DATA_PATH = os.path.join(PROJECT_PATH, '02_Data/03_Working/work_fs.csv')

# Load the working dataset (NaNs are dropped below because SMOTE cannot handle missing values)
df = pd.read_csv(DATA_PATH)

# --- STEP: DATA CLEANING & TRAIN/TEST SPLIT ---
initial_rows = df.shape[0]
df = df.dropna()
final_rows = df.shape[0]

print("--- Data Cleaning Summary ---")
print(f"Original rows: {initial_rows}")
print(f"Rows after dropna: {final_rows}")
print(f"Total rows removed: {initial_rows - final_rows}")

# Define Features and Target
target = "delay_flag"
X = df.drop(columns=[target])
y = df[target]

# Split BEFORE balancing to keep the test set representative of real-world data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("\n--- Initial Distribution (Training Set) ---")
print(y_train.value_counts(normalize=True))

--- Data Cleaning Summary ---
Original rows: 116481
Rows after dropna: 116468
Total rows removed: 13

--- Initial Distribution (Training Set) ---
delay_flag
1.0    0.860279
0.0    0.139721
Name: proportion, dtype: float64


As the output shows, the dataset is highly imbalanced: approximately 86% of the training records are marked as "Delayed" (`delay_flag = 1.0`).

- **Risk**: If we train the model now, it can learn that answering "Delayed" is correct 86% of the time without looking at the features. Such a model would be unusable for the Port Authority, because it fails to differentiate routine operations from high-risk delay events and so provides no predictive value.

- **Solution**: We need to balance the classes so the model learns the specific patterns that lead to each state.
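
The risk described above can be illustrated with a minimal sketch. It assumes a synthetic dataset (`X_toy`, `y_toy`, both hypothetical and not part of the project data) with the same 86/14 split, and uses scikit-learn's `DummyClassifier`, which ignores the features entirely.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)

# Hypothetical toy data: random features and an 86/14 label split.
X_toy = rng.normal(size=(1000, 3))        # features the dummy model never uses
y_toy = np.array([1] * 860 + [0] * 140)   # 1 = "Delayed", 0 = "On time"

# Always predict the most frequent class seen during fit.
dummy = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
accuracy = dummy.score(X_toy, y_toy)

print(f"Accuracy of always predicting 'Delayed': {accuracy:.2f}")
```

The accuracy equals the share of the majority class, which is why accuracy alone says little about a model trained on imbalanced data.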

## METHOD 1: NO BALANCING (JUST TO SEE HOW BIAS AFFECTS RECALL)

We train a quick logistic regression on the original unbalanced data (86/14) to demonstrate how the bias affects detection performance.

In [2]:
model_base = LogisticRegression(n_jobs=-1, max_iter=500, solver='lbfgs')
model_base.fit(X_train, y_train)
y_pred_base = model_base.predict(X_test)

print(f"Baseline Recall: {recall_score(y_test, y_pred_base):.4f}")

Baseline Recall: 1.0000


The baseline model, trained on the original unbalanced dataset (86/14 distribution), reaches a Recall of 1.0000. Although this looks ideal, it is a classic symptom of class-imbalance bias: a model that labels essentially every vessel as "Delayed" never misses a delay, but it rarely, if ever, identifies an on-time arrival. Its precision of 86.0%, reported in the evaluation section, simply mirrors the share of delayed vessels in the data.

This motivates applying SMOTE-Tomek in the following step to rebalance the classes and push the model to learn the features that distinguish delayed from on-time vessels.
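
To see why a perfect recall can hide a useless model, the sketch below scores a constant "Delayed" prediction on a toy 86/14 label vector. The arrays are illustrative assumptions, not the notebook's `y_test` or `y_pred_base`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels: 86 delayed vessels (1) and 14 on-time vessels (0).
y_true = np.array([1] * 86 + [0] * 14)
y_pred = np.ones_like(y_true)  # the model always answers "Delayed"

# Rows are true classes, columns are predicted classes, ordered [0, 1].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

recall_delayed = recall_score(y_true, y_pred, pos_label=1)
recall_on_time = recall_score(y_true, y_pred, pos_label=0)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Recall (Delayed): {recall_delayed:.2f}")
print(f"Recall (On time): {recall_on_time:.2f}")
```

Checking the recall of both classes, rather than only the positive one, exposes the bias that a single recall figure hides.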

## METHOD 2: SMOTE-Tomek (BALANCED)

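We use SMOTE-Tomek, a hybrid technique that combines two steps:

- **SMOTE**: Synthetically generates new examples of the minority class (here, on-time vessels) by interpolating between existing minority samples and their nearest neighbors.

- **Tomek Links**: Removes overlapping examples, i.e., pairs of mutual nearest neighbors that belong to opposite classes, to make the decision boundary cleaner.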
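
The two ideas can be sketched in plain NumPy. This is a simplified illustration on hypothetical toy points, not the imbalanced-learn implementation: `smote_sample` and `tomek_links` are made-up helper names, and the SMOTE step here interpolates toward the single nearest minority neighbor, whereas the library picks among several neighbors.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy 2-D data: 3 minority points (label 0) and 6 majority points (label 1).
# Point 7 is a majority sample sitting inside the minority region.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.0, 0.3],
              [1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [1.1, 1.3], [0.3, 0.2], [1.4, 1.0]])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1])


def smote_sample(X_min, rng):
    """Create one synthetic point on the segment between a random
    minority sample and its nearest minority neighbor."""
    i = rng.integers(len(X_min))
    dists = np.linalg.norm(X_min - X_min[i], axis=1)
    dists[i] = np.inf                      # exclude the point itself
    j = np.argmin(dists)
    lam = rng.random()                     # interpolation weight in [0, 1)
    return X_min[i] + lam * (X_min[j] - X_min[i])


def tomek_links(X, y):
    """Return index pairs (a, b) that are mutual nearest neighbors
    and carry different labels."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)
    return [(a, int(b)) for a, b in enumerate(nn)
            if nn[b] == a and y[a] != y[b] and a < b]


synthetic = smote_sample(X[y == 0], rng)
links = tomek_links(X, y)

# Drop both members of every link (one possible cleaning policy).
keep = np.ones(len(y), dtype=bool)
for a, b in links:
    keep[[a, b]] = False
X_clean, y_clean = X[keep], y[keep]

print("Synthetic minority point:", synthetic)
print("Tomek links:", links)
```

Because the synthetic point is a convex combination of two minority samples, it always falls between them; the Tomek step then removes the pairs where the two classes touch.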

In [4]:
smt = SMOTETomek(random_state=42)
X_resampled, y_resampled = smt.fit_resample(X_train, y_train)

print("\nBalanced class distribution (Training Set):")
print(y_resampled.value_counts(normalize=True))

# Train model with balanced data
model_smt = LogisticRegression(n_jobs=-1, max_iter=500, solver='lbfgs')
model_smt.fit(X_resampled, y_resampled)
y_pred_smt = model_smt.predict(X_test)

print(f"Balanced Recall: {recall_score(y_test, y_pred_smt):.4f}")


Balanced class distribution (Training Set):
delay_flag
0.0    0.5
1.0    0.5
Name: proportion, dtype: float64
Balanced Recall: 0.6254


## EVALUATE THE BALANCED MODEL

In [5]:
comparison_results = [
    {
        "Method": "Baseline (No Balancing)",
        "Recall (Detection)": recall_score(y_test, y_pred_base),
        "Precision": precision_score(y_test, y_pred_base),
        "Decision": "Unusable - Too much bias"
    },
    {
        "Method": "SMOTE-Tomek (Balanced)",
        "Recall (Detection)": recall_score(y_test, y_pred_smt),
        "Precision": precision_score(y_test, y_pred_smt),
        "Decision": "🏆 BEST CHOICE - High Sensitivity"
    }
]

df_comp = pd.DataFrame(comparison_results)
df_comp[['Recall (Detection)', 'Precision']] = (df_comp[['Recall (Detection)', 'Precision']] * 100).round(1).astype(str) + '%'
display(df_comp)

print("\n--- Detailed Classification Report (Balanced Model) ---")
print(classification_report(y_test, y_pred_smt))

| | Method | Recall (Detection) | Precision | Decision |
|---|---|---|---|---|
| 0 | Baseline (No Balancing) | 100.0% | 86.0% | Unusable - Too much bias |
| 1 | SMOTE-Tomek (Balanced) | 62.5% | 91.3% | 🏆 BEST CHOICE - High Sensitivity |



--- Detailed Classification Report (Balanced Model) ---
              precision    recall  f1-score   support

         0.0       0.21      0.63      0.32      4882
         1.0       0.91      0.63      0.74     30059

    accuracy                           0.63     34941
   macro avg       0.56      0.63      0.53     34941
weighted avg       0.82      0.63      0.68     34941
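
The report shows that the balanced model now detects about 63% of both classes, yet precision for the on-time class is only 0.21. A short worked calculation, using only the support and recall figures printed above, suggests why: on-time vessels are so rare in the test set that even a moderate share of misclassified delayed vessels outnumbers them. The recalls are rounded in the report, so the derived values are approximate.

```python
# Figures copied from the classification report; recalls are rounded,
# so everything derived from them is approximate.
support_on_time, support_delayed = 4882, 30059
recall_on_time, recall_delayed = 0.63, 0.6254

tn = recall_on_time * support_on_time   # on-time vessels predicted on time
fp = support_on_time - tn               # on-time vessels predicted delayed
tp = recall_delayed * support_delayed   # delayed vessels predicted delayed
fn = support_delayed - tp               # delayed vessels predicted on time

precision_on_time = tn / (tn + fn)
precision_delayed = tp / (tp + fp)
accuracy = (tn + tp) / (support_on_time + support_delayed)

print(f"Precision (On time): {precision_on_time:.2f}")
print(f"Precision (Delayed): {precision_delayed:.2f}")
print(f"Accuracy:            {accuracy:.2f}")
```

The derived values agree with the report to within rounding, which confirms that the low on-time precision follows from the class proportions in the untouched test set rather than from an error in the evaluation.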



## SAVE DATASET AFTER CLASS BALANCING

We export the resampled training data (`X_resampled`, `y_resampled`) in .pickle format to preserve data types for the final XGBoost training phase.

In [6]:
X_resampled.to_pickle(os.path.join(PROJECT_PATH, '02_Data/03_Working/X_balanced.pickle'))
y_resampled.to_pickle(os.path.join(PROJECT_PATH, '02_Data/03_Working/y_balanced.pickle'))

print(f"✔ SUCCESS: Balanced datasets saved at {PROJECT_PATH}/02_Data/03_Working/")
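
As a small illustration of why pickle is chosen over CSV here, the sketch below round-trips a toy frame through both formats in a temporary directory. The column `vessel_type` and the file names are hypothetical and only serve the example.

```python
import os
import tempfile

import pandas as pd

# Toy frame with dtypes that a CSV round trip does not keep.
df_demo = pd.DataFrame({
    "vessel_type": pd.Categorical(["cargo", "tanker", "cargo"]),  # hypothetical column
    "delay_flag": pd.Series([1.0, 0.0, 1.0], dtype="float32"),
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "demo.pickle")
    csv_path = os.path.join(tmp, "demo.csv")
    df_demo.to_pickle(pkl_path)
    df_demo.to_csv(csv_path, index=False)
    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

print("Pickle keeps dtypes:", bool((from_pickle.dtypes == df_demo.dtypes).all()))
print("CSV dtypes:", from_csv.dtypes.astype(str).to_dict())
```

Pickle files should only be loaded from trusted sources, which is the case for files produced earlier in the same project.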

