# 03 - Feature Engineering

**Objective**: Derive temporal, route, distance, and airline features from the cleaned flight data for use in predicting the `DELAYED` target.

**Features Created**:
- Temporal features: DepHour, DepPartOfDay
- Route features: Route, StateRoute
- Categorical groupings: DistanceGroup
- Airline standardization: Airline

In [1]:
# Load cleaned data
import pandas as pd
import numpy as np

data_path = 'data/flight_data_2018_2024_cleaned.csv'
df = pd.read_csv(data_path, low_memory=False)
df.columns = df.columns.str.strip()

print(f"Loaded data shape: {df.shape}")
print(f"Target distribution:\n{df['DELAYED'].value_counts(normalize=True)}")

# Extract departure hour from CRSDepTime (scheduled departure as HHMM).
# Zero-pad to four digits so that e.g. 530 -> "0530" -> hour 5.
df['DepHour'] = df['CRSDepTime'].astype(str).str.zfill(4).str[:2].astype(int)
print(f"\nDepHour range: {df['DepHour'].min()} - {df['DepHour'].max()}")


Loaded data shape: (582425, 46)
Target distribution:
DELAYED
0    0.616778
1    0.383222
Name: proportion, dtype: float64

DepHour range: 0 - 23
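
The string slicing in the cell above relies on `CRSDepTime` having been read as an integer. If the column were parsed as a float, `astype(str)` would produce values like `"530.0"` and the first two characters would no longer be the hour. A minimal sketch of an arithmetic alternative on toy values follows; the helper name `hhmm_to_hour` is hypothetical, and wrapping 2400 to hour 0 is an assumption rather than something this notebook confirms about the data.

```python
import pandas as pd

def hhmm_to_hour(hhmm: pd.Series) -> pd.Series:
    """Hypothetical helper: convert HHMM-style times (e.g. 530, 1430) to an hour."""
    numeric = pd.to_numeric(hhmm, errors="coerce")  # tolerate strings and floats
    hour = (numeric // 100) % 24                    # assumption: 2400 wraps to hour 0
    return hour.astype("Int64")                     # nullable integer keeps missing values

# Toy values, not taken from the flight dataset.
sample = pd.Series([5, 530, 1430, 2359, 2400, None])
hours = hhmm_to_hour(sample)
print(hours.tolist())
```

The nullable `Int64` dtype is used so that a missing scheduled time stays missing instead of raising during the integer cast.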


In [2]:
def part_of_day(hour):
    """Map an hour (0-23) to a coarse part-of-day label."""
    if 5 <= hour < 12:
        return "Morning"
    if 12 <= hour < 17:
        return "Afternoon"
    if 17 <= hour < 21:
        return "Evening"
    return "Night"  # hours 21-23 and 0-4

df['DepPartOfDay'] = df['DepHour'].apply(part_of_day)
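
`Series.apply` calls `part_of_day` once per row. As a sketch of a vectorized alternative with the same hour boundaries, `numpy.select` can assign the labels in one pass. The toy `hours` series below is illustrative and stands in for `df['DepHour']`.

```python
import numpy as np
import pandas as pd

# Toy hours covering each boundary of the part-of-day mapping.
hours = pd.Series([0, 5, 11, 12, 16, 17, 20, 21, 23], name="DepHour")

# Series.between is inclusive on both ends, so these match the
# half-open ranges [5, 12), [12, 17), [17, 21) for integer hours.
conditions = [
    hours.between(5, 11),
    hours.between(12, 16),
    hours.between(17, 20),
]
choices = ["Morning", "Afternoon", "Evening"]

# Anything not matched by a condition falls through to "Night".
part = pd.Series(np.select(conditions, choices, default="Night"),
                 index=hours.index, name="DepPartOfDay")
print(part.tolist())
```

Both approaches should give the same labels for integer hours; the vectorized form mainly matters for run time on a frame of several hundred thousand rows.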


In [3]:
df['Route'] = df['Origin'] + "_" + df['Dest']


In [4]:
df['StateRoute'] = df['OriginState'] + "_" + df['DestState']


In [5]:
# Standardize airline values (strip stray whitespace).
# Column names were already stripped after loading, so a single lookup suffices.
if 'Operating_Airline' not in df.columns:
    raise KeyError("Operating_Airline column not found")

df['Airline'] = df['Operating_Airline'].str.strip()

print(f"Unique airlines: {df['Airline'].nunique()}")


Unique airlines: 21


In [6]:
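# Bins are right-closed: (0, 300], (300, 800], ..., (2500, 5000].
# Distances outside (0, 5000] become NaN.
df['DistanceGroup'] = pd.cut(df['Distance'],
                             bins=[0, 300, 800, 1500, 2500, 5000],
                             labels=['Short', 'Medium', 'Long', 'VeryLong', 'UltraLong'])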
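
With its default settings, `pd.cut` uses right-closed intervals and excludes the lowest edge, so a distance of exactly 0 or anything above 5000 is left unbinned. The sketch below reuses the same bins and labels on toy distances to show that edge behavior. Whether such values occur in the flight data is not checked here.

```python
import pandas as pd

bins = [0, 300, 800, 1500, 2500, 5000]
labels = ['Short', 'Medium', 'Long', 'VeryLong', 'UltraLong']

# Toy distances chosen to sit on or just outside the bin edges.
distances = pd.Series([0, 1, 300, 301, 5000, 5001])
groups = pd.cut(distances, bins=bins, labels=labels)

print(groups.tolist())
print(f"Unbinned rows: {groups.isna().sum()}")
```

On these toy values, 0 and 5001 come back as NaN, while 300 lands in `Short` and 301 in `Medium`. Running `df['DistanceGroup'].isna().sum()` after the cell above would be one way to confirm that no real rows are dropped from the grouping.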


In [7]:
cat_cols = [
    'Airline', 'Origin', 'Dest', 'Route', 'StateRoute',
    'DepPartOfDay', 'Month', 'DayOfWeek'
]
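
`cat_cols` is defined here but not used elsewhere in this notebook. One possible use, shown purely as an assumption about later steps, is casting the listed columns to the pandas `category` dtype before encoding or modeling. The toy frame and the `toy_cat_cols` name below are illustrative stand-ins.

```python
import pandas as pd

# Toy frame standing in for df; only a few of the listed columns are reproduced.
toy = pd.DataFrame({
    "Airline": ["AA", "DL", "AA", "UA"],
    "Origin": ["JFK", "ATL", "JFK", "ORD"],
    "Month": [1, 1, 2, 2],
})
toy_cat_cols = ["Airline", "Origin", "Month"]

# Cast each listed column to the category dtype in one call.
toy = toy.astype({col: "category" for col in toy_cat_cols})
print(toy.dtypes)
```

Note that the category dtype does not survive a round trip through CSV, so a cast like this would need to be repeated after the engineered file is reloaded.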


In [8]:
# Save engineered dataset
output_path = "data/flight_data_2018_2024_engineered.csv"
df.to_csv(output_path, index=False)

print(f"\nEngineered dataset saved to: {output_path}")
print(f"Final shape: {df.shape}")
print("\nNew features created:")
print(f"  - DepHour: {df['DepHour'].nunique()} unique values")
print(f"  - DepPartOfDay: {df['DepPartOfDay'].nunique()} categories")
print(f"  - Route: {df['Route'].nunique()} unique routes")
print(f"  - StateRoute: {df['StateRoute'].nunique()} unique state routes")
print(f"  - DistanceGroup: {df['DistanceGroup'].nunique()} groups")
print(f"  - Airline: {df['Airline'].nunique()} airlines")



Engineered dataset saved to: data/flight_data_2018_2024_engineered.csv
Final shape: (582425, 51)

New features created:
  - DepHour: 24 unique values
  - DepPartOfDay: 4 categories
  - Route: 5880 unique routes
  - StateRoute: 1252 unique state routes
  - DistanceGroup: 5 groups
  - Airline: 21 airlines
