# ✈️ Saudi Flight Delays — **Light Edition (2023)**

This notebook analyzes a compact dataset of Saudi flight operations for **2023**, focusing on delay patterns and operational insights.

**What you'll see:**
- Data loading & cleaning
- KPIs and descriptive stats
- Visuals (matplotlib-only): status mix, average delays by airline and weather, day-of-week vs hour heatmap
- Simple predictive baseline (linear regression) for delay minutes
- Actionable recommendations (editable section)

**How to run:** Put `saudi_flight_delays_light.csv` in the same folder and run cells top-to-bottom.

In [ ]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 160)

## 1) Load Data

In [ ]:
df = pd.read_csv('saudi_flight_delays_light.csv')
df.head()

## 2) Clean & Feature Engineering
- Convert date/time
- Derive hour of scheduled/actual departure, weekend flag
- Ensure delay is numeric (NaN for cancelled)

In [ ]:
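# Parse the flight date
df['date'] = pd.to_datetime(df['date'])
# Hour of day (0-23); a missing actual departure (e.g. a cancelled flight) becomes NaN
df['scheduled_hour'] = pd.to_datetime(df['scheduled_departure'], format='%H:%M:%S').dt.hour
df['actual_hour'] = pd.to_datetime(df['actual_departure'], format='%H:%M:%S', errors='coerce').dt.hour
# Non-numeric delay values become NaN
df['delay_minutes'] = pd.to_numeric(df['delay_minutes'], errors='coerce')
# Friday and Saturday form the weekend in Saudi Arabia
df['is_weekend'] = df['day_of_week'].isin(['Friday','Saturday'])
# First day of each flight's month, for monthly grouping
df['month'] = df['date'].dt.to_period('M').dt.to_timestamp()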
df.head()
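The parsing steps above can be checked on a tiny frame before running them on the full file. This is a minimal sketch: the three rows are made up for illustration and only reuse the notebook's column names, with `None` standing in for a cancelled flight.

```python
import pandas as pd

# Hypothetical three-row sample; only the column names come from the notebook.
sample = pd.DataFrame({
    "scheduled_departure": ["08:15:00", "23:50:00", "14:05:00"],
    "actual_departure": ["08:40:00", None, "14:05:00"],  # None: no departure recorded
    "day_of_week": ["Friday", "Monday", "Saturday"],
})

sched = pd.to_datetime(sample["scheduled_departure"], format="%H:%M:%S")
actual = pd.to_datetime(sample["actual_departure"], format="%H:%M:%S", errors="coerce")

sample["scheduled_hour"] = sched.dt.hour
sample["actual_hour"] = actual.dt.hour  # float column, NaN where parsing failed
sample["is_weekend"] = sample["day_of_week"].isin(["Friday", "Saturday"])

# Count rows whose actual departure could not be parsed
n_unparsed = int(actual.isna().sum())
print(sample)
print("Rows without a parsed actual departure:", n_unparsed)
```

Comparing the unparsed count with the number of cancelled flights in the real data is a quick way to spot malformed time strings.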

## 3) Quick KPIs

In [ ]:
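total_flights = len(df)
# Percentage share per status; the index labels get a ' %' suffix
status_mix = df['status'].value_counts(normalize=True).rename(lambda x: f"{x} %").mul(100).round(1)
# mean() skips NaN, so flights without a delay value (e.g. cancelled) are excluded
avg_delay = df['delay_minutes'].mean()
print('Total flights:', total_flights)
print(status_mix)
print('Average delay (min):', round(float(avg_delay), 1))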
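A mean alone can hide a long tail of heavy delays, so a few extra descriptive statistics may help. The sketch below is illustrative only: `delay_summary` is a hypothetical helper, the 15-minute cutoff is an assumed threshold rather than something defined by the dataset, and the example values are made up.

```python
import pandas as pd

def delay_summary(delays: pd.Series, threshold: float = 15.0) -> pd.Series:
    """Summarize delay minutes, ignoring NaN (e.g. cancelled flights)."""
    valid = delays.dropna()
    return pd.Series({
        "flights_with_delay_value": len(valid),
        "mean_min": valid.mean(),
        "median_min": valid.median(),
        "p90_min": valid.quantile(0.9),
        # Share of flights delayed by more than `threshold` minutes (assumed cutoff)
        "share_over_threshold_pct": (valid > threshold).mean() * 100,
    })

# Made-up delays in minutes; None mimics a cancelled flight
example = pd.Series([0, 5, 10, 20, 65, None])
summary = delay_summary(example)
print(summary)
```

In the notebook, the same helper could be called as `delay_summary(df['delay_minutes'])`.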

## 4) Visual Analysis (matplotlib)

In [ ]:
# Status distribution
counts = df['status'].value_counts()
plt.figure()
plt.bar(counts.index, counts.values)
plt.title('Flight Status Distribution')
plt.xlabel('Status')
plt.ylabel('Count')
plt.tight_layout(); plt.show()

# Average delay by airline
delay_airline = df.groupby('airline')['delay_minutes'].mean().sort_values()
plt.figure()
plt.bar(delay_airline.index, delay_airline.values)
plt.title('Average Delay by Airline (min)')
plt.xticks(rotation=20)
plt.ylabel('Minutes')
plt.tight_layout(); plt.show()

# Average delay by weather
delay_weather = df.groupby('weather_condition')['delay_minutes'].mean().sort_values()
plt.figure()
plt.bar(delay_weather.index, delay_weather.values)
plt.title('Average Delay by Weather (min)')
plt.xticks(rotation=20)
plt.ylabel('Minutes')
plt.tight_layout(); plt.show()
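The same grouping pattern extends to origin airports. The sketch below also reports how many flights back each mean, since an average over a handful of flights is noisy. `airport_delay_table` and its `min_flights` parameter are hypothetical, and the airport codes and delays in the example are made-up sample values.

```python
import pandas as pd

def airport_delay_table(df: pd.DataFrame, min_flights: int = 1) -> pd.DataFrame:
    """Mean delay and number of flights with a delay value, per origin airport."""
    table = (
        df.groupby("origin_airport")["delay_minutes"]
          .agg(mean_delay="mean", flights="count")  # count ignores NaN delays
    )
    # Drop airports with too few observations to trust the mean
    table = table[table["flights"] >= min_flights]
    return table.sort_values("mean_delay", ascending=False)

# Made-up sample; None mimics a cancelled flight
flights = pd.DataFrame({
    "origin_airport": ["RUH", "RUH", "JED", "JED", "JED", "DMM"],
    "delay_minutes": [10.0, 30.0, 5.0, None, 15.0, 60.0],
})
table = airport_delay_table(flights, min_flights=2)
print(table)
```

The resulting `mean_delay` column can be drawn with the same `plt.bar` pattern used for the airline and weather charts.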

In [ ]:
# Heatmap: mean delay by day of week vs scheduled hour
pivot = df.pivot_table(index='day_of_week', columns='scheduled_hour', values='delay_minutes', aggfunc='mean')
pivot = pivot.reindex(['Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday'])
plt.figure()
plt.imshow(pivot.to_numpy(), aspect='auto')
# Label ticks with the actual hours and day names instead of array positions
plt.xticks(range(len(pivot.columns)), pivot.columns)
plt.yticks(range(len(pivot.index)), pivot.index)
plt.title('Mean Delay (min) by Day vs Hour')
plt.xlabel('Scheduled Hour')
plt.ylabel('Day of Week')
plt.colorbar(label='Minutes')
plt.tight_layout(); plt.show()
pivot.head()

## 5) Simple Predictive Baseline

In [ ]:
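# Keep only flights with a known delay (cancelled flights have NaN)
model_df = df.dropna(subset=['delay_minutes']).copy()
# One-hot encode the categorical columns; scheduled_hour stays numeric, so the model assumes a linear effect of hour
features = pd.get_dummies(model_df[['airline','origin_airport','weather_condition','scheduled_hour']], drop_first=True, dtype=float)
target = model_df['delay_minutes']
# Hold out 20% of flights for evaluation
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
lr = LinearRegression()
lr.fit(X_train, y_train)
pred = lr.predict(X_test)
# Mean absolute error on the held-out flights, in minutes
mae = mean_absolute_error(y_test, pred)
print('MAE (minutes):', round(float(mae), 2))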
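An MAE figure is easier to judge next to a trivial reference. One common choice, sketched here as an assumption rather than part of the original analysis, is to always predict the training-set mean. `naive_mae` is a hypothetical helper and the numbers in the example are made up.

```python
import numpy as np

def naive_mae(y_train, y_test) -> float:
    """MAE obtained by always predicting the mean of the training targets."""
    y_train = np.asarray(y_train, dtype=float)
    y_test = np.asarray(y_test, dtype=float)
    return float(np.mean(np.abs(y_test - y_train.mean())))

# Made-up delays in minutes: the training mean is 20, so both test errors are 20
reference = naive_mae([10, 20, 30], [0, 40])
print("Naive MAE (minutes):", reference)
```

In the notebook, `naive_mae(y_train, y_test)` can be compared with `mae`. If the regression does not clearly beat this reference, the features carry little signal about delay minutes.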

## 6) Recommendations (edit below)
- **Weather buffers:** Add schedule buffers for the weather conditions with the highest mean delays in the weather chart (e.g. Rain/Fog/Sandstorm).
- **Peak hours staffing:** Allocate extra ground staff during the hours with the highest mean delays in the day-vs-hour heatmap.
- **Airport ops:** Focus improvements on the airports with the highest average delays.
- **Proactive comms:** Notify passengers when the predicted delay exceeds a chosen threshold.
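The proactive-comms idea can be sketched as a simple filter on model predictions. This is an illustration under assumptions: `flag_for_notification` is a hypothetical helper, and the 30-minute default is a placeholder threshold, not a value derived from the data.

```python
import pandas as pd

def flag_for_notification(predicted_delay, threshold_min: float = 30.0) -> pd.Series:
    """Return True for flights whose predicted delay exceeds the threshold."""
    predicted = pd.Series(predicted_delay, dtype=float)
    return predicted > threshold_min

# Made-up predictions in minutes
flags = flag_for_notification([5.0, 30.0, 45.5])
print(flags.tolist())
```

In the notebook, passing `pred` from the baseline model gives one flag per test flight. The threshold should be chosen with the model's MAE in mind, since predictions close to the cutoff are unreliable.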