## Class-Weighted Logistic Regression

This notebook evaluates a cost-sensitive Logistic Regression model by applying class weighting to address class imbalance in flight delay prediction. The goal is to improve recall for delayed flights while preserving interpretability and comparability with the baseline model.
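As a rough illustration of what class weighting does, scikit-learn documents the `class_weight="balanced"` heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula on a small made-up label vector; the helper name `balanced_class_weights` and the 85/15 split are assumptions for illustration, not values from this dataset.

```python
import numpy as np

def balanced_class_weights(y):
    """Return {class_label: weight} using n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Illustrative labels only: 85 on-time (0) and 15 delayed (1) flights.
y_demo = np.array([0] * 85 + [1] * 15)
demo_weights = balanced_class_weights(y_demo)

# The minority class receives the larger weight, so its errors cost more in the loss.
print(demo_weights)
```

Under this heuristic the ratio of the two weights equals the inverse ratio of the class counts, which is why errors on the rarer delayed class are penalized more heavily.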


In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

In [2]:
X_train_scaled = pd.read_csv("../data/processed/X_train_scaled.csv")
X_test_scaled = pd.read_csv("../data/processed/X_test_scaled.csv")

y_train = pd.read_csv("../data/processed/y_train.csv").squeeze()
y_test = pd.read_csv("../data/processed/y_test.csv").squeeze()

feature_names = X_train_scaled.columns


In [3]:
log_reg_bal = LogisticRegression(
    class_weight="balanced",
    max_iter=1000,
    n_jobs=-1
)

log_reg_bal.fit(X_train_scaled, y_train)

LogisticRegression(class_weight='balanced', max_iter=1000, n_jobs=-1)


In [4]:
import joblib
joblib.dump(
    log_reg_bal,
    "../models/log_reg_class_weighted.joblib"
)


['../models/log_reg_class_weighted.joblib']

In [6]:
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

y_pred_bal = log_reg_bal.predict(X_test_scaled)
y_pred_proba_bal = log_reg_bal.predict_proba(X_test_scaled)[:, 1]


In [7]:
cm_bal = confusion_matrix(y_test, y_pred_bal)
cm_bal


array([[772229,  96822],
       [ 69413,  86745]])

In [8]:
print(classification_report(y_test, y_pred_bal))


              precision    recall  f1-score   support

         0.0       0.92      0.89      0.90    869051
         1.0       0.47      0.56      0.51    156158

    accuracy                           0.84   1025209
   macro avg       0.70      0.72      0.71   1025209
weighted avg       0.85      0.84      0.84   1025209



In [9]:
roc_auc_bal = roc_auc_score(y_test, y_pred_proba_bal)
roc_auc_bal


0.7830469222719344

### Comparison with Baseline Logistic Regression

Applying class weighting increased recall for delayed flights from 41% to 56%, so fewer delays are missed. The cost is lower precision (0.47 for the delayed class) and more false positives (96,822 on-time flights flagged as delayed), reflecting a more aggressive prediction strategy. ROC–AUC remained stable (0.783 for the weighted model), indicating that overall ranking ability was preserved. The results show how loss weighting can align model behavior with operational priorities without changing features or model structure.
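The recall and precision figures quoted above can be recomputed directly from the confusion matrix. The short sketch below does the arithmetic with the four counts copied from the `cm_bal` output; the variable names are chosen here for readability.

```python
# Counts copied from the confusion matrix above (rows: actual, columns: predicted).
tn, fp = 772229, 96822   # actual on-time flights
fn, tp = 69413, 86745    # actual delayed flights

recall = tp / (tp + fn)               # share of delayed flights that were caught
precision = tp / (tp + fp)            # share of delay alerts that were correct
false_positive_rate = fp / (fp + tn)  # share of on-time flights flagged as delayed

print(f"recall={recall:.4f} precision={precision:.4f} fpr={false_positive_rate:.4f}")
```

These values agree with the 0.56 recall and 0.47 precision shown for class 1.0 in the classification report.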


In [11]:
from sklearn.metrics import precision_score, recall_score, f1_score

def evaluate_threshold(y_true, y_proba, threshold):
    y_pred_thr = (y_proba >= threshold).astype(int)
    return {
        "threshold": threshold,
        "precision": precision_score(y_true, y_pred_thr),
        "recall": recall_score(y_true, y_pred_thr),
        "f1": f1_score(y_true, y_pred_thr)
    }
thresholds = [0.5, 0.55, 0.6, 0.65, 0.7]

results_bal = [
    evaluate_threshold(y_test, y_pred_proba_bal, t)
    for t in thresholds
]


pd.DataFrame(results_bal)


   threshold  precision    recall        f1
0       0.50   0.472552  0.555495  0.510678
1       0.55   0.546607  0.512654  0.529086
2       0.60   0.607307  0.479233  0.535721
3       0.65   0.650967  0.453861  0.534832
4       0.70   0.681863  0.433029  0.529677


### Threshold Tuning Results

Threshold tuning was performed on the class-weighted Logistic Regression model to balance recall and precision for delayed flight prediction. As the decision threshold increased, precision improved while recall decreased, reflecting a controlled trade-off between false positives and missed delays.

A threshold of **0.60** was selected as the operating point. At this threshold, the model reaches a recall of 0.48, above the 41% of the baseline model, with a precision of 0.61 and the highest F1-score (0.536) among the evaluated thresholds. The margin over the 0.65 threshold (F1 of 0.535) is small, so the choice also reflects a preference for recall. This operating point provides a practical balance between minimizing missed delays and avoiding excessive false delay alerts.
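The selection rule described here, picking the threshold with the highest F1-score, can be sketched in a few lines. The rows below are copied from the threshold table above; the names `sweep` and `f1_from` are introduced only for this illustration.

```python
# Rows copied from the threshold table: (threshold, precision, recall, f1).
sweep = [
    (0.50, 0.472552, 0.555495, 0.510678),
    (0.55, 0.546607, 0.512654, 0.529086),
    (0.60, 0.607307, 0.479233, 0.535721),
    (0.65, 0.650967, 0.453861, 0.534832),
    (0.70, 0.681863, 0.433029, 0.529677),
]

def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Pick the row with the largest reported F1-score.
best_threshold, best_precision, best_recall, best_f1 = max(sweep, key=lambda row: row[3])
print(best_threshold, round(f1_from(best_precision, best_recall), 4))
```

A different objective, such as a minimum recall constraint, would simply swap the `key` function.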


Note: The baseline model required a lower decision threshold (0.35) to compensate for conservative probability estimates. After introducing class-weighted training, predicted probabilities shifted upward, making a higher threshold (0.60) more appropriate. Thresholds are therefore model-specific and should be tuned after training.


## Probabilistic Calibration

In [12]:
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss


In [13]:
calibrated_model = CalibratedClassifierCV(
    estimator=log_reg_bal,
    method="sigmoid",
    cv=3
)


In [14]:
calibrated_model.fit(X_train_scaled, y_train)


CalibratedClassifierCV(cv=3,
                       estimator=LogisticRegression(class_weight='balanced',
                                                    max_iter=1000, n_jobs=-1),
                       method='sigmoid')


In [15]:
y_proba_cal = calibrated_model.predict_proba(X_test_scaled)[:, 1]


In [16]:
brier_uncal = brier_score_loss(y_test, y_pred_proba_bal)
brier_cal = brier_score_loss(y_test, y_proba_cal)

brier_uncal, brier_cal


(0.14946114177826647, 0.09616130992778794)

In [17]:
# Reuse the 0.60 threshold that was tuned on the uncalibrated probabilities
y_pred_cal = (y_proba_cal >= 0.60).astype(int)


In [18]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

confusion_matrix(y_test, y_pred_cal)


array([[850507,  18544],
       [100243,  55915]])

In [19]:
print(classification_report(y_test, y_pred_cal))

              precision    recall  f1-score   support

         0.0       0.89      0.98      0.93    869051
         1.0       0.75      0.36      0.48    156158

    accuracy                           0.88   1025209
   macro avg       0.82      0.67      0.71   1025209
weighted avg       0.87      0.88      0.87   1025209



In [20]:
roc_auc_score(y_test, y_proba_cal)

0.7828435396673289

### Probability Calibration

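Probability calibration was applied using Platt scaling (sigmoid calibration) to improve the reliability of predicted delay probabilities. Calibration lowered the Brier score from 0.149 to 0.096, while ROC–AUC was essentially unchanged (0.7830 vs. 0.7828), so the probability estimates became more reliable without changing how flights are ranked. At the 0.60 threshold tuned on the uncalibrated scores, however, delayed-flight recall fell to 0.36 (with precision rising to 0.75), which shows that a threshold chosen for one probability scale does not carry over to another.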
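The Brier score reported above is the mean squared difference between the predicted probability and the 0/1 outcome. The sketch below implements that definition on made-up numbers to show how two sets of scores can rank flights identically yet differ in Brier score; the helper `brier_score` and the demo arrays are illustrative assumptions, not notebook data.

```python
import numpy as np

def brier_score(y_true, y_proba):
    """Mean squared error between predicted probabilities and binary outcomes."""
    y_true = np.asarray(y_true, dtype=float)
    y_proba = np.asarray(y_proba, dtype=float)
    return float(np.mean((y_proba - y_true) ** 2))

# Illustrative data: three on-time flights and one delayed flight.
y_demo = [0, 0, 0, 1]
inflated = [0.4, 0.4, 0.4, 0.8]      # same ordering, probabilities pushed upward
closer_to_rate = [0.2, 0.2, 0.2, 0.6]  # same ordering, closer to the observed rate

print(brier_score(y_demo, inflated), brier_score(y_demo, closer_to_rate))
```

Both score vectors order the flights the same way, so a ranking metric such as ROC–AUC would not distinguish them, while the Brier score does.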


## Final Model Selection

After evaluating baseline Logistic Regression, class-weighted training, threshold tuning, and probability calibration, the final model selected is an uncalibrated class-weighted Logistic Regression with a decision threshold of 0.60.

This configuration provides a balanced trade-off between recall and precision, reducing missed delayed flights relative to the baseline while avoiding excessive false delay alerts. Probability calibration was analyzed but not selected because, at the chosen 0.60 threshold, it reduced delayed-flight recall from 0.48 to 0.36. This model is frozen as the final linear baseline for subsequent comparisons.
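One way to see why recall drops when a fixed threshold is applied to calibrated probabilities: a sigmoid recalibration of the logit is a monotone map, so it rescales scores without reordering them. The sketch below uses hypothetical probabilities and hypothetical calibration parameters `a` and `b` (not the values fitted in this notebook, and not scikit-learn's internal parametrization). It is only an approximation of the setup above, since `CalibratedClassifierCV` with `cv=3` averages several refitted models rather than remapping a single model's scores.

```python
import numpy as np

def platt_like(p, a, b):
    """Apply a sigmoid to a*logit(p) + b; monotone increasing whenever a > 0."""
    logit = np.log(p / (1.0 - p))
    return 1.0 / (1.0 + np.exp(-(a * logit + b)))

# Hypothetical uncalibrated probabilities and calibration parameters.
p_uncal = np.array([0.30, 0.55, 0.62, 0.70, 0.90])
a, b = 1.0, -1.7
p_cal = platt_like(p_uncal, a, b)

flags_uncal = p_uncal >= 0.60           # alerts at the tuned threshold
flags_cal_same_thr = p_cal >= 0.60      # same number reused on the new scale
mapped_thr = platt_like(np.array([0.60]), a, b)[0]
flags_cal_mapped = p_cal >= mapped_thr  # threshold carried through the same map

print(flags_uncal.sum(), flags_cal_same_thr.sum(), flags_cal_mapped.sum())
```

In this toy case, reusing 0.60 on the rescaled scores raises fewer alerts, while a threshold passed through the same map reproduces the original alerts, which suggests re-tuning the threshold if a calibrated model is adopted later.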


In [23]:
import joblib

joblib.dump(
    log_reg_bal,
    "../models/log_reg_final_class_weighted.joblib"
)


['../models/log_reg_final_class_weighted.joblib']

In [24]:
final_threshold = 0.60

joblib.dump(
    final_threshold,
    "../models/log_reg_final_threshold.joblib"
)


['../models/log_reg_final_threshold.joblib']