## Model Interpretation: Logistic Regression Baseline

This notebook analyzes the trained Logistic Regression model used for flight delay prediction.  
The objective is to interpret feature coefficients, understand which factors increase or reduce delay risk, and validate whether the learned relationships align with domain intuition.  
No retraining or hyperparameter tuning is performed in this notebook.


In [1]:
import pandas as pd
import numpy as np


In [2]:
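# Load the scaled feature matrices produced by the preprocessing step.
X_train_scaled = pd.read_csv("../data/processed/X_train_scaled.csv")
X_test_scaled = pd.read_csv("../data/processed/X_test_scaled.csv")

# squeeze() turns each single-column DataFrame into a Series.
y_train = pd.read_csv("../data/processed/y_train.csv").squeeze()
y_test = pd.read_csv("../data/processed/y_test.csv").squeeze()

# Column names are used below to label the model coefficients.
feature_names = X_train_scaled.columns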


In [3]:
import joblib

log_reg = joblib.load("../models/logistic_regression_baseline.pkl")


In [4]:
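# Pair each feature name with its coefficient; for a binary model,
# coef_ has a single row, so coef_[0] holds one value per feature.
coef_df = pd.DataFrame({
    "feature": feature_names,
    "coefficient": log_reg.coef_[0]
})

# Rank features by coefficient magnitude, regardless of sign.
coef_df["abs_coefficient"] = coef_df["coefficient"].abs()
coef_df = coef_df.sort_values("abs_coefficient", ascending=False)

coef_df.head(15)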


| | feature | coefficient | abs_coefficient |
|---|---|---|---|
| 17 | late_airjet_when_turnaround_within_180 | 1.045774 | 1.045774 |
| 18 | affected_turnaround_lessthan120 | -0.724197 | 0.724197 |
| 61 | dep_hour | 0.466358 | 0.466358 |
| 22 | has_turnaround | -0.334572 | 0.334572 |
| 21 | affected_turnaround_lessthan45 | 0.301052 | 0.301052 |
| 19 | affected_turnaround_lessthan90 | 0.225574 | 0.225574 |
| 39 | OP_CARRIER_AA | 0.202905 | 0.202905 |
| 20 | affected_turnaround_lessthan60 | 0.195278 | 0.195278 |
| 38 | MKT_CARRIER_WN | 0.19292 | 0.19292 |
| 55 | OP_CARRIER_WN | 0.19292 | 0.19292 |


### Model Interpretation Summary

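The largest Logistic Regression coefficients belong to **operational factors**. The strongest positive contributor to delay risk is `late_airjet_when_turnaround_within_180` (+1.05), the presence of a late incoming aircraft combined with a limited turnaround window, which highlights the cascading effect of upstream delays. The tighter turnaround thresholds `affected_turnaround_lessthan45`, `affected_turnaround_lessthan60`, and `affected_turnaround_lessthan90` carry positive coefficients (+0.30, +0.20, and +0.23), while `affected_turnaround_lessthan120` (-0.72) and `has_turnaround` (-0.33) are negative, suggesting that moderate, well-planned turnaround schedules are associated with lower risk. Because the model was fit on scaled inputs, each coefficient describes the change in log-odds per unit of the scaled feature, not the raw one.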
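Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below is illustrative only: it rebuilds a small `coef_df` from values in the table above instead of loading the model, and the `odds_ratio` column is an addition that does not appear in the original notebook.

```python
import numpy as np
import pandas as pd

# Top coefficients copied from the table above (illustrative subset).
coef_df = pd.DataFrame({
    "feature": [
        "late_airjet_when_turnaround_within_180",
        "affected_turnaround_lessthan120",
        "dep_hour",
        "has_turnaround",
    ],
    "coefficient": [1.045774, -0.724197, 0.466358, -0.334572],
})

# exp(coefficient) is the multiplicative change in the odds of a delay
# for a one-unit increase in the (scaled) feature, holding others fixed.
coef_df["odds_ratio"] = np.exp(coef_df["coefficient"])

print(coef_df.round(3))
```

An odds ratio above 1 raises the odds of a delay and one below 1 lowers them. The top coefficient of 1.045774, for example, corresponds to an odds ratio of about 2.85.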

Time-based features also show meaningful influence. `dep_hour` has the third-largest coefficient (+0.47): flights scheduled later in the day exhibit higher delay risk, consistent with the accumulation of delays across daily operations. Airport congestion indicators, such as elevated arrival volumes prior to departure, further contribute to increased delay likelihood, although their coefficients are smaller in magnitude than those displayed above.
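To express the `dep_hour` effect per actual hour rather than per scaled unit, the coefficient can be divided by the feature's standard deviation. This is a minimal sketch that assumes the scaling step was standardization (subtract the mean, divide by the standard deviation), which the notebook does not confirm. The value of `dep_hour_std` is hypothetical and would need to be read from the fitted scaler or the unscaled training data.

```python
import math

# Coefficient for dep_hour on the scaled feature (from the table above).
coef_scaled = 0.466358

# HYPOTHETICAL: standard deviation of raw dep_hour in the training data.
# The real value would come from the fitted scaler or the unscaled data.
dep_hour_std = 4.5

# Under standardization, x_scaled = (x - mean) / std, so one raw hour
# corresponds to 1 / std scaled units.
coef_per_hour = coef_scaled / dep_hour_std

# Multiplicative change in the odds of a delay per additional hour.
odds_ratio_per_hour = math.exp(coef_per_hour)

print(f"log-odds per hour:   {coef_per_hour:.4f}")
print(f"odds ratio per hour: {odds_ratio_per_hour:.4f}")
```

If a different scaler was used (for example min-max scaling), the divisor would be that scaler's range instead of the standard deviation.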

Weather-related features, including precipitation and perceived temperature, have a measurable but comparatively smaller impact; none appears among the largest coefficients shown above. This suggests that while adverse weather affects operations, its influence is secondary to scheduling and operational constraints in this dataset.

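Carrier-specific effects are present but moderate (about +0.19 to +0.20 for `OP_CARRIER_AA`, `MKT_CARRIER_WN`, and `OP_CARRIER_WN`), which may reflect differences in network structure and operational strategies rather than isolated performance issues. `MKT_CARRIER_WN` and `OP_CARRIER_WN` have identical coefficients (0.19292), which suggests the two columns carry the same information and share a single effect between them.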
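Identical coefficients on two one-hot columns are a hint that the columns themselves may be identical. One way to check is to look for duplicated columns in the feature matrix. The sketch below uses a small made-up DataFrame `X` as a stand-in; in the notebook the same check would run on `X_train_scaled`.

```python
import pandas as pd

# Toy stand-in for the feature matrix. Column names mirror the table
# above, but the values are made up for illustration.
X = pd.DataFrame({
    "MKT_CARRIER_WN": [1, 0, 0, 1, 0],
    "OP_CARRIER_WN":  [1, 0, 0, 1, 0],
    "OP_CARRIER_AA":  [0, 1, 0, 0, 1],
})

# Transpose so each feature becomes a row, then flag rows that repeat
# an earlier one. keep="first" leaves the first copy unflagged.
is_duplicate = X.T.duplicated(keep="first")
duplicate_columns = is_duplicate[is_duplicate].index.tolist()

print(duplicate_columns)
```

Any column reported here repeats an earlier one exactly and could be dropped before fitting, so that the effect is carried by a single coefficient.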

Overall, the signs and relative magnitudes of the learned coefficients align well with domain expectations. This suggests that the feature engineering and preprocessing steps captured meaningful patterns, and the coefficients give no indication of leakage or spurious relationships.
