## Time-Based Train/Test Split

In [1]:
import pandas as pd

In [2]:
X = pd.read_csv("../data/processed/X_features.csv")
y = pd.read_csv("../data/processed/y_target.csv").squeeze()
flight_dates = pd.read_csv("../data/processed/flight_dates.csv", parse_dates=["FL_DATE"])

In [4]:
X.shape

(5559463, 67)

In [5]:
flight_dates.size

5559463

In [6]:
y.size

5559463

In [7]:
flight_dates["FL_DATE"].min(), flight_dates["FL_DATE"].max()


(Timestamp('2023-01-01 00:00:00'), Timestamp('2023-11-30 00:00:00'))

In [8]:
flight_dates["FL_DATE"].dt.to_period("M").value_counts().sort_index()


FL_DATE
2023-01    483189
2023-02    452628
2023-03    512573
2023-04    498023
2023-05    513203
2023-06    505769
2023-07    528232
2023-08    535350
2023-09    503005
2023-10    527896
2023-11    499595
Freq: M, Name: count, dtype: int64

In [9]:
# Define split date
split_date = pd.Timestamp("2023-10-01")

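# Create boolean masks (X, y, and flight_dates share the same default row index,
# so masks built from flight_dates align with X and y)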
train_mask = flight_dates["FL_DATE"] < split_date
test_mask  = flight_dates["FL_DATE"] >= split_date

# Apply split
X_train = X.loc[train_mask]
X_test  = X.loc[test_mask]

y_train = y.loc[train_mask]
y_test  = y.loc[test_mask]

# Check sizes
X_train.shape, X_test.shape


((4531972, 67), (1027491, 67))

In [10]:
flight_dates.loc[train_mask, "FL_DATE"].max(), flight_dates.loc[test_mask, "FL_DATE"].min()


(Timestamp('2023-09-30 00:00:00'), Timestamp('2023-10-01 00:00:00'))

In [11]:
# Save train/test splits
X_train.to_csv("../data/processed/X_train.csv", index=False)
X_test.to_csv("../data/processed/X_test.csv", index=False)

y_train.to_csv("../data/processed/y_train.csv", index=False)
y_test.to_csv("../data/processed/y_test.csv", index=False)

print("Train/test splits saved successfully.")


Train/test splits saved successfully.


### Time-Based Train/Test Split

The dataset covers 1 January to 30 November 2023, with between roughly 450,000 and 535,000 flights in each month.  
To simulate real-world prediction conditions, a chronological split was applied at 1 October 2023:

- Training set: January 2023 – September 2023 (4,531,972 flights, about 81.5%)  
- Test set: October 2023 – November 2023 (1,027,491 flights, about 18.5%)  

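Because every training flight predates 1 October 2023 and every test flight falls on or after it, the model is trained only on past data and evaluated on later, unseen flights, so the split itself introduces no temporal leakage.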
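
The same logic can be wrapped in a small reusable helper. The sketch below is illustrative only: the function name `time_based_split` and the toy data are assumptions, not part of the notebook above. It checks that the inputs have the same number of rows before splitting, and it uses a positional boolean mask so the result does not depend on index labels.

```python
import pandas as pd


def time_based_split(X, y, dates, split_date):
    """Hypothetical helper: rows dated before split_date go to train, the rest to test."""
    # The mask is positional, so all three inputs must describe the same rows.
    if not (len(X) == len(y) == len(dates)):
        raise ValueError("X, y and dates must have the same number of rows")

    # A plain NumPy boolean array avoids any reliance on matching index labels.
    train_mask = (dates < pd.Timestamp(split_date)).to_numpy()

    X_train, X_test = X[train_mask], X[~train_mask]
    y_train, y_test = y[train_mask], y[~train_mask]
    return X_train, X_test, y_train, y_test


# Toy data standing in for the real feature matrix (values are made up).
toy_dates = pd.Series(
    pd.to_datetime(["2023-09-29", "2023-10-01", "2023-09-30", "2023-11-15"])
)
toy_X = pd.DataFrame({"feature": [1.0, 2.0, 3.0, 4.0]})
toy_y = pd.Series([0, 1, 0, 1], name="target")

X_tr, X_te, y_tr, y_te = time_based_split(toy_X, toy_y, toy_dates, "2023-10-01")

# Sanity check: every training date must precede every test date.
assert toy_dates.loc[X_tr.index].max() < toy_dates.loc[X_te.index].min()
print(X_tr.shape, X_te.shape)
```

The final assertion mirrors the boundary check in cell 10: the latest training date must be strictly earlier than the earliest test date.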
