## Feature Scaling For Logistic Regression
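- Using `StandardScaler`, a common choice for logistic regression: standardized features help the solver converge and keep the L2 penalty from weighting features by their raw units.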

In [2]:
from sklearn.preprocessing import StandardScaler
import pandas as pd

In [3]:
# Load train/test splits
X_train = pd.read_csv("../data/processed/X_train.csv")
X_test  = pd.read_csv("../data/processed/X_test.csv")

y_train = pd.read_csv("../data/processed/y_train.csv").squeeze()
y_test  = pd.read_csv("../data/processed/y_test.csv").squeeze()


In [4]:
scaler = StandardScaler()

In [5]:
X_train_scaled = scaler.fit_transform(X_train)


In [6]:
X_test_scaled = scaler.transform(X_test)
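
As a minimal sketch of the fit-on-train, transform-on-test pattern used above: `StandardScaler` learns each column's mean and standard deviation from the training split only and reuses them for the test split, so no test-set statistics enter preprocessing. The toy arrays and the names `toy_scaler`, `train`, and `test` are illustrative assumptions, not the flight data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values, not the flight dataset)
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
test = np.array([[2.5, 250.0], [10.0, 1000.0]])

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(train)  # learns mean_ and scale_ from train
test_scaled = toy_scaler.transform(test)        # reuses the train statistics

# Equivalent manual computation: z = (x - train mean) / train std (population std, ddof=0)
manual_test = (test - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(test_scaled, manual_test))
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```

Note that pandas' `DataFrame.std()` defaults to the sample standard deviation (`ddof=1`), whereas `StandardScaler` uses the population form, so a pandas check on scaled data will not return exactly 1.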


In [7]:
X_train_scaled = pd.DataFrame(
    X_train_scaled,
    columns=X_train.columns,
    index=X_train.index
)

X_test_scaled = pd.DataFrame(
    X_test_scaled,
    columns=X_test.columns,
    index=X_test.index
)
import joblib

# Persist the fitted scaler so the same transformation can be reapplied later
joblib.dump(scaler, "../models/scaler.joblib")

In [21]:
X_train_scaled.mean().head()

MONTH               0.002748
DAY_OF_MONTH       -0.000877
CRS_ELAPSED_TIME    0.004421
DISTANCE            0.005313
DEP_1hrpre_num     -0.000138
dtype: float64

In [22]:
X_train_scaled.std().head()

MONTH               0.998392
DAY_OF_MONTH        0.997236
CRS_ELAPSED_TIME    1.003114
DISTANCE            1.003258
DEP_1hrpre_num      1.001172
dtype: float64

In [39]:
# Clean training set
train_mask = y_train.notna()
X_train_scaled = X_train_scaled.loc[train_mask]
y_train = y_train.loc[train_mask]

# Clean test set
test_mask = y_test.notna()
X_test_scaled = X_test_scaled.loc[test_mask]
y_test = y_test.loc[test_mask]



X_train_scaled.to_csv("../data/processed/X_train_scaled.csv", index=False)
X_test_scaled.to_csv("../data/processed/X_test_scaled.csv", index=False)


y_train.to_csv("../data/processed/y_train.csv", index=False)
y_test.to_csv("../data/processed/y_test.csv", index=False)



In [41]:
pd.Series(X_train_scaled.columns).to_csv(
    "../data/processed/feature_names.csv",
    index=False,
    header=False
)


### Saving Final Model-Ready Datasets

After handling missing target values and applying feature scaling, the final train and test datasets were saved. These datasets represent the exact inputs used by the baseline Logistic Regression model and will be reused for model interpretation and further experimentation to ensure consistency and reproducibility.
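
A later notebook could reload these files along the following lines. This is only a sketch: the helper name `load_model_ready` is hypothetical, and it assumes the file names written above and a single-column target file.

```python
from pathlib import Path

import pandas as pd


def load_model_ready(data_dir):
    """Reload the scaled feature matrices and targets from data_dir."""
    data_dir = Path(data_dir)
    X_train = pd.read_csv(data_dir / "X_train_scaled.csv")
    X_test = pd.read_csv(data_dir / "X_test_scaled.csv")
    # Each target file holds one column; squeeze it into a Series
    y_train = pd.read_csv(data_dir / "y_train.csv").squeeze("columns")
    y_test = pd.read_csv(data_dir / "y_test.csv").squeeze("columns")

    # Features and targets were filtered with the same mask, so row counts must match
    if len(X_train) != len(y_train) or len(X_test) != len(y_test):
        raise ValueError("feature and target row counts do not match")
    return X_train, X_test, y_train, y_test
```

In this project the call would presumably be `load_model_ready("../data/processed")`. The row-count check is a cheap guard against reusing a target file that was saved before the missing-target rows were dropped.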


In [24]:
y_train.isna().sum(), y_test.isna().sum()


(np.int64(0), np.int64(0))

## Training the Baseline Model Using Logistic Regression


In [25]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score
)


In [26]:
log_reg = LogisticRegression(
    max_iter=1000,
    n_jobs=-1
)


In [27]:
y_train.isna().sum(), y_test.isna().sum()


(np.int64(0), np.int64(0))

In [28]:
log_reg.fit(X_train_scaled, y_train)


Parameter            Value
penalty              'l2'
dual                 False
tol                  0.0001
C                    1.0
fit_intercept        True
intercept_scaling    1
class_weight         None
random_state         None
solver               'lbfgs'
max_iter             1000


In [42]:
import joblib

joblib.dump(log_reg,"../models/logistic_regression_baseline.pkl")

['../models/logistic_regression_baseline.pkl']

### Evaluation

In [36]:
# confusion_matrix is already imported above; only the per-metric scorers are new
from sklearn.metrics import precision_score, recall_score, f1_score

In [31]:
y_pred = log_reg.predict(X_test_scaled)
y_pred_proba = log_reg.predict_proba(X_test_scaled)[:, 1]

y_pred[:10], y_pred_proba[:10]


(array([0., 0., 0., 0., 0., 1., 0., 0., 1., 0.]),
 array([0.094313  , 0.13801995, 0.08572669, 0.10727584, 0.08599282,
        0.85164689, 0.0998856 , 0.10463178, 0.77952828, 0.09394212]))
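
For context on how the two arrays above relate: in a binary logistic regression, `predict_proba` is the sigmoid of the linear score from `decision_function`, and `predict` corresponds to cutting that probability at 0.5. The sketch below checks this on synthetic data from `make_classification`; the data and the names `X_toy`, `y_toy`, and `toy_model` are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem (not the flight data)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
toy_model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

scores = toy_model.decision_function(X_toy)    # linear score w.x + b
proba_manual = 1.0 / (1.0 + np.exp(-scores))   # sigmoid of the score
proba = toy_model.predict_proba(X_toy)[:, 1]   # probability of class 1

# Hard labels are the probabilities cut at 0.5
labels_from_proba = (proba > 0.5).astype(int)

print(np.allclose(proba, proba_manual))
print(np.array_equal(labels_from_proba, toy_model.predict(X_toy)))
```

This is why the threshold can be changed after training without refitting: only the cut applied to `y_pred_proba` moves, while the fitted coefficients stay the same.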

In [33]:
confusion_matrix(y_test, y_pred)



array([[846309,  22742],
       [ 92575,  63583]])

In [34]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.90      0.97      0.94    869051
         1.0       0.74      0.41      0.52    156158

    accuracy                           0.89   1025209
   macro avg       0.82      0.69      0.73   1025209
weighted avg       0.88      0.89      0.87   1025209
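
The class-1 figures in this report follow directly from the confusion matrix shown earlier. A short worked check, using the counts from that output (scikit-learn's layout is rows = actual class, columns = predicted class):

```python
import numpy as np

# Counts copied from the confusion_matrix output above
cm = np.array([[846309, 22742],
               [92575, 63583]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # share of predicted delays that were real delays
recall = tp / (tp + fn)      # share of real delays that were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

Most of the model's errors are false negatives: 92,575 of the 156,158 delayed flights in the test set were predicted as on time.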



In [35]:
roc_auc_score(y_test, y_pred_proba)

0.7789038250809166

### Threshold Selection & Error Trade-off Interpretation

The baseline Logistic Regression model outputs probabilities representing the estimated risk of flight delay. At the default classification threshold of 0.5, precision for delayed flights is 0.74 but recall is only 0.41, indicating conservative prediction behavior that prioritizes avoiding false alarms over capturing all delays.

Given the operational context, missing a delayed flight can lead to last-minute disruptions and reactive decision-making, while false delay warnings may cause unnecessary rescheduling and passenger confusion. As both error types carry costs, the objective is to increase recall for delayed flights while maintaining acceptable precision.

To better understand this trade-off, multiple probability thresholds were evaluated. Lowering the threshold increases recall by identifying more delayed flights at the expense of additional false positives. This analysis allows the decision threshold to be selected based on operational tolerance rather than relying on an arbitrary default value.


In [38]:
def evaluate_threshold(y_true, y_proba, threshold):
    y_pred_thr = (y_proba >= threshold).astype(int)
    return {
        "threshold": threshold,
        "precision": precision_score(y_true, y_pred_thr),
        "recall": recall_score(y_true, y_pred_thr),
        "f1": f1_score(y_true, y_pred_thr)
    }

thresholds = [0.5, 0.45, 0.4, 0.35, 0.3]

results = [evaluate_threshold(y_test, y_pred_proba, t) for t in thresholds]

pd.DataFrame(results)
    

   threshold  precision    recall        f1
0       0.50   0.736554  0.407171  0.524433
1       0.45   0.715274  0.420632  0.529739
2       0.40   0.689872  0.436539  0.534718
3       0.35   0.653436  0.456704  0.537638
4       0.30   0.597790  0.486354  0.536345


### Threshold Adjustment Rationale

At the default probability threshold of 0.5, the model reached a precision of 0.74 but a recall of only 0.41 for delayed flights, so it missed roughly 59% of actual delays. Given the operational objective of detecting delays early and avoiding last-minute disruptions, recall was prioritized over strict precision.

Lowering the classification threshold flags more flights as potentially delayed, at the cost of additional false positives. Thresholds between 0.35 and 0.40 provided a reasonable balance: recall rose to between 0.44 and 0.46 while precision stayed between 0.65 and 0.69, and F1 was highest at 0.35 (0.538) among the thresholds tested. This adjustment reflects a policy decision based on operational tolerance rather than a change to the underlying model.
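
One way to make "operational tolerance" explicit is to fix a minimum acceptable precision and take the threshold with the highest recall that still meets it. The sketch below does this with `precision_recall_curve`. The helper `pick_threshold`, the precision floor, and the toy labels and scores are all hypothetical; the notebook itself does not state a floor.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def pick_threshold(y_true, y_proba, min_precision):
    """Return (threshold, precision, recall) with the highest recall among
    thresholds whose precision is at least min_precision, or None if none qualify."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_proba)
    # precision and recall carry one extra trailing point with no threshold; drop it
    precision, recall = precision[:-1], recall[:-1]
    ok = precision >= min_precision
    if not ok.any():
        return None
    # Mask out thresholds below the floor, then take the best recall
    best = np.argmax(np.where(ok, recall, -1.0))
    return thresholds[best], precision[best], recall[best]


# Toy labels and scores (hypothetical, not the flight data)
y_true_toy = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_proba_toy = np.array([0.1, 0.2, 0.35, 0.4, 0.45, 0.6, 0.7, 0.9])

print(pick_threshold(y_true_toy, y_proba_toy, min_precision=0.6))
```

With the notebook's variables the call would look like `pick_threshold(y_test, y_pred_proba, min_precision=0.65)`, where 0.65 is only an example value. A threshold chosen this way is usually selected on a separate validation split, so that the test-set metrics remain an unbiased estimate.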
