Week 1: Project Initialization and Dataset Setup
Step 1: Define goals, KPIs, and workflow

Goal: Predict flight delays more accurately using cleaned and preprocessed data.

KPIs:

- Model accuracy / AUC on delay prediction
- Reduction of nulls (target = 0 nulls in cleaned dataset)
- Faster preprocessing by saving a reusable cleaned CSV

Workflow:

- Load raw CSV
- Explore schema, datatypes, and memory usage
- Handle nulls and optimize memory usage
- Save cleaned CSV
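
The "0 nulls" KPI above can be checked mechanically. A minimal sketch, assuming a hypothetical helper `null_kpi` (not part of this notebook) and a toy DataFrame rather than the real flight data:

```python
import pandas as pd

def null_kpi(frame: pd.DataFrame) -> dict:
    """Summarize nulls for the 'target = 0 nulls' KPI (hypothetical helper)."""
    per_column = frame.isnull().sum()
    return {
        "total_nulls": int(per_column.sum()),
        "columns_with_nulls": per_column[per_column > 0].index.tolist(),
        "kpi_met": bool(per_column.sum() == 0),
    }

# Toy data for illustration only, not taken from the flight dataset
toy = pd.DataFrame({"ArrDelay": [34, None, 80], "Origin": ["IND", "IND", None]})
report = null_kpi(toy)
print(report)
```

Running the same helper on the cleaned DataFrame before saving would give a single pass/fail value for the KPI.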

Step 1: Import Libraries

In [0]:
# Step 1: imports
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import col, trim, lit

spark = SparkSession.builder.getOrCreate()
print("✅ Imports ready, Spark session available.")


✅ Imports ready, Spark session available.


Step 2: Load Dataset

In [0]:
# Path to raw dataset
raw_path = "/Volumes/workspace/default/airlines/Flight_delay.csv"

# Load
df = pd.read_csv(raw_path)

print("✅ Raw dataset loaded")
print("Rows:", df.shape[0], "Columns:", df.shape[1])
df.head()


✅ Raw dataset loaded
Rows: 484551 Columns: 29


Unnamed: 0,DayOfWeek,Date,DepTime,ArrTime,CRSArrTime,UniqueCarrier,Airline,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Org_Airport,Dest,Dest_Airport,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
0,4,03-01-2019,1829,1959,1925,WN,Southwest Airlines Co.,3920,N464WN,90,90,77,34,34,IND,Indianapolis International Airport,BWI,Baltimore-Washington International Airport,515,3,10,0,N,0,2,0,0,0,32
1,4,03-01-2019,1937,2037,1940,WN,Southwest Airlines Co.,509,N763SW,240,250,230,57,67,IND,Indianapolis International Airport,LAS,McCarran International Airport,1591,3,7,0,N,0,10,0,0,0,47
2,4,03-01-2019,1644,1845,1725,WN,Southwest Airlines Co.,1333,N334SW,121,135,107,80,94,IND,Indianapolis International Airport,MCO,Orlando International Airport,828,6,8,0,N,0,8,0,0,0,72
3,4,03-01-2019,1452,1640,1625,WN,Southwest Airlines Co.,675,N286WN,228,240,213,15,27,IND,Indianapolis International Airport,PHX,Phoenix Sky Harbor International Airport,1489,7,8,0,N,0,3,0,0,0,12
4,4,03-01-2019,1323,1526,1510,WN,Southwest Airlines Co.,4,N674AA,123,135,110,16,28,IND,Indianapolis International Airport,TPA,Tampa International Airport,838,4,9,0,N,0,0,0,0,0,16
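
An alternative to downcasting after the load is to declare narrow dtypes when reading the CSV. The sketch below uses two rows from the preview above, reduced to a few columns; the dtype choices are assumptions and should be checked against the value ranges in the full file:

```python
import io
import pandas as pd

# Two rows copied from the preview above, reduced to a few columns
csv_text = """DayOfWeek,Date,DepTime,UniqueCarrier,ArrDelay,Origin,Dest
4,03-01-2019,1829,WN,34,IND,BWI
4,03-01-2019,1937,WN,57,IND,LAS
"""

# Assumed dtype choices; verify value ranges on the full file before using them
dtypes = {"DayOfWeek": "int8", "DepTime": "int16", "ArrDelay": "int16"}

small = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
print(small.dtypes)
```

Columns that contain missing values cannot be read as plain NumPy integers, so this approach only fits columns known to be complete.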


Step 3: Explore schema & datatypes

In [0]:
print("\n📌 Data Types:")
print(df.dtypes)

print("\n📌 Missing values:")
print(df.isnull().sum())

print("\n📌 Memory usage (MB):", round(df.memory_usage().sum() / 1024**2, 2))



📌 Data Types:
DayOfWeek             int64
Date                 object
DepTime               int64
ArrTime               int64
CRSArrTime            int64
UniqueCarrier        object
Airline              object
FlightNum             int64
TailNum              object
ActualElapsedTime     int64
CRSElapsedTime        int64
AirTime               int64
ArrDelay              int64
DepDelay              int64
Origin               object
Org_Airport          object
Dest                 object
Dest_Airport         object
Distance              int64
TaxiIn                int64
TaxiOut               int64
Cancelled             int64
CancellationCode     object
Diverted              int64
CarrierDelay          int64
WeatherDelay          int64
NASDelay              int64
SecurityDelay         int64
LateAircraftDelay     int64
dtype: object

📌 Missing values:
DayOfWeek               0
Date                    0
DepTime                 0
ArrTime                 0
CRSArrTime              0
UniqueCarr
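
Note that `memory_usage()` without `deep=True` counts only the 8-byte references held by object columns, not the strings they point to, so the figure printed in this step is likely an underestimate for text-heavy columns. A small sketch on toy data (not the real dataset) showing the difference:

```python
import pandas as pd

# Toy frame for illustration; the strings stand in for columns such as Airline
toy = pd.DataFrame({
    "Distance": [515, 1591, 828],
    "Airline": pd.Series(["Southwest Airlines Co."] * 3, dtype=object),
})

shallow = toy.memory_usage().sum()         # object column counted as references only
deep = toy.memory_usage(deep=True).sum()   # also counts the Python string payloads
print("shallow bytes:", shallow, "deep bytes:", deep)
```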

Step 4: Sampling (optional for faster testing)

In [0]:
# Take a 1% sample for quick checks
df_sample = df.sample(frac=0.01, random_state=42)
print("✅ Sample size:", df_sample.shape)


✅ Sample size: (4846, 29)
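
A plain random sample can under-represent small groups. If quick checks need every carrier present, one option is to sample within each group. A sketch on toy data, where the "AA" rows and the 50% fraction are illustrative assumptions:

```python
import pandas as pd

# Toy frame: 8 WN flights and 4 AA flights (illustrative values only)
toy = pd.DataFrame({
    "UniqueCarrier": ["WN"] * 8 + ["AA"] * 4,
    "ArrDelay": range(12),
})

# Sample 50% within each carrier so both carriers stay represented
stratified = toy.groupby("UniqueCarrier").sample(frac=0.5, random_state=42)
counts = stratified["UniqueCarrier"].value_counts().to_dict()
print(counts)
```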


Step 5: Memory optimization (convert types)

In [0]:
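# Downcast numeric columns to the smallest dtype that can hold their values
# (int64 → int8/int16/int32, float64 → float32).
# The loop variable is named `column` so it does not shadow pyspark's `col`.
for column in df.select_dtypes(include=["int64"]).columns:
    df[column] = pd.to_numeric(df[column], downcast="integer")

for column in df.select_dtypes(include=["float64"]).columns:
    df[column] = pd.to_numeric(df[column], downcast="float")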

print("✅ Memory optimized. New size (MB):", round(df.memory_usage().sum() / 1024**2, 2))


✅ Memory optimized. New size (MB): 50.37
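
To see which dtypes `downcast` actually picks, the same loops can be run on a toy frame. `Big` and `Ratio` are hypothetical columns added only to show the int32 and float32 cases:

```python
import pandas as pd

toy = pd.DataFrame({
    "DayOfWeek": pd.Series([4, 4, 4], dtype="int64"),        # fits in int8
    "Distance": pd.Series([515, 1591, 828], dtype="int64"),  # fits in int16
    "Big": pd.Series([1, 40_000, 3], dtype="int64"),         # needs int32 (hypothetical)
    "Ratio": pd.Series([0.5, 1.25, 2.0], dtype="float64"),   # becomes float32 (hypothetical)
})

for name in toy.select_dtypes(include=["int64"]).columns:
    toy[name] = pd.to_numeric(toy[name], downcast="integer")
for name in toy.select_dtypes(include=["float64"]).columns:
    toy[name] = pd.to_numeric(toy[name], downcast="float")

result = toy.dtypes.astype(str).to_dict()
print(result)
```

Because integer columns may end up as int8 or int16, later steps that select numeric columns should not assume only 32-bit and 64-bit types.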


Week 2: Preprocessing and Feature Engineering

Step 6: Handle nulls rigorously

In [0]:
# Categorical → "Unknown"
for column in df.select_dtypes(include="object").columns:
    df[column] = df[column].fillna("Unknown")

# Numeric → median
# include="number" also matches the int8/int16 columns produced by the Step 5 downcast
for column in df.select_dtypes(include="number").columns:
    df[column] = df[column].fillna(df[column].median())

print("✅ Nulls handled. Remaining null count:", df.isnull().sum().sum())


✅ Nulls handled. Remaining null count: 0
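
The fill strategy can be verified on a toy frame with mixed numeric widths. This is a sketch with illustrative values; the median fill shown is the notebook's choice, not the only reasonable one:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "TaxiIn": pd.Series([3, 3, 6], dtype="int8"),
    "AirTime": pd.Series([77.0, np.nan, 107.0], dtype="float32"),
    "TailNum": pd.Series(["N464WN", None, "N334SW"], dtype=object),
})

# "number" matches every numeric width, including int8 and float32
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
for name in numeric_cols:
    toy[name] = toy[name].fillna(toy[name].median())
for name in toy.select_dtypes(include="object").columns:
    toy[name] = toy[name].fillna("Unknown")

remaining = int(toy.isnull().sum().sum())
print(numeric_cols, remaining)
```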


Step 7: Format datetime

In [0]:
# Peek at raw values before conversion
print("📌 First 5 raw Date values:", df["Date"].head().tolist())

# errors="coerce" converts unparseable values to NaT instead of raising, so a
# try/except cannot select a fallback format. Parse with each candidate format
# and keep whichever leaves the fewest NaT values (adjust the list if needed).
candidate_formats = ["%Y-%m-%d", "%d/%m/%Y"]  # e.g. 2019-03-01, 01/03/2019
parsed = [pd.to_datetime(df["Date"], format=fmt, errors="coerce") for fmt in candidate_formats]
df["Date"] = max(parsed, key=lambda s: s.notna().sum())

# Double check result
print("\n📌 Converted Date column sample:")
print(df["Date"].head())
print("Date column dtype:", df["Date"].dtype)


📌 First 5 raw Date values: [Timestamp('2019-03-01 00:00:00'), Timestamp('2019-03-01 00:00:00'), Timestamp('2019-03-01 00:00:00'), Timestamp('2019-03-01 00:00:00'), Timestamp('2019-03-01 00:00:00')]

📌 Converted Date column sample:
0   2019-03-01
1   2019-03-01
2   2019-03-01
3   2019-03-01
4   2019-03-01
Name: Date, dtype: datetime64[ns]
Date column dtype: datetime64[ns]
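
The raw preview in Step 2 shows dates such as `03-01-2019`, which can be read day-first or month-first. The sketch below shows how the two readings differ and that a non-matching format with `errors="coerce"` silently yields `NaT`. Which reading is correct for this dataset is not established here and should be confirmed against the data source:

```python
import pandas as pd

raw = pd.Series(["03-01-2019", "03-01-2019"])  # raw format shown in the Step 2 preview

day_first = pd.to_datetime(raw, format="%d-%m-%Y", errors="coerce")    # 3 January 2019
month_first = pd.to_datetime(raw, format="%m-%d-%Y", errors="coerce")  # 1 March 2019
iso = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")          # no match: NaT, no exception

print(day_first.iloc[0].date(), month_first.iloc[0].date(), int(iso.isna().sum()))
```

The converted values printed above (`2019-03-01`) correspond to the month-first reading.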


Step 8: Feature Engineering

In [0]:
# Month, Day of Week, Day Number
df["Month"] = df["Date"].dt.month
df["DayOfWeek"] = df["Date"].dt.dayofweek   # Monday=0, Sunday=6; overwrites the raw DayOfWeek column
df["DayNumber"] = df["Date"].dt.day

# Convert DepTime (HHMM format) into Hour (HH)
df["Hour"] = (df["DepTime"] // 100).astype(int)

# Create Route as Origin-Dest
df["Route"] = df["Origin"] + "-" + df["Dest"]

# ✅ Double check
print("📌 Sample with new features:")
print(df[["Date", "Month", "DayOfWeek", "DayNumber", "Hour", "Origin", "Dest", "Route"]].head())


📌 Sample with new features:
        Date  Month  DayOfWeek  DayNumber  Hour Origin Dest    Route
0 2019-03-01      3          4          1    18    IND  BWI  IND-BWI
1 2019-03-01      3          4          1    19    IND  LAS  IND-LAS
2 2019-03-01      3          4          1    16    IND  MCO  IND-MCO
3 2019-03-01      3          4          1    14    IND  PHX  IND-PHX
4 2019-03-01      3          4          1    13    IND  TPA  IND-TPA
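
Integer division by 100 works for typical HHMM values, but two edge cases are worth checking: times shortly after midnight (stored as small integers such as 5) and a possible 2400 value. Whether this dataset contains 2400 is not known, so the wrap-around below is an assumption:

```python
import pandas as pd

dep_time = pd.Series([1829, 5, 959, 2400])  # HHMM integers; 5 means 00:05

hour = dep_time // 100      # hour part of HHMM
minute = dep_time % 100     # minute part of HHMM
hour_wrapped = hour % 24    # assumption: treat 2400 as midnight (hour 0)

print(hour.tolist(), hour_wrapped.tolist(), minute.tolist())
```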


Step 9: Save Preprocessed Data

In [0]:
# Step 9: Save the cleaned dataset
output_path = "/Volumes/workspace/default/airlines/Flight_delay_cleaned.csv"
df.to_csv(output_path, index=False)

print(f"✅ Cleaned dataset saved to: {output_path}")
print("📌 Final shape:", df.shape)


✅ Cleaned dataset saved to: /Volumes/workspace/default/airlines/Flight_delay_cleaned.csv
📌 Final shape: (484551, 33)
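
CSV stores no dtype information, so the downcast integers and the parsed `Date` column revert to defaults when the file is read back. A sketch of the round trip using an in-memory buffer instead of the real path:

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "Date": pd.to_datetime(["2019-03-01", "2019-03-01"]),
    "DayOfWeek": pd.Series([4, 4], dtype="int8"),
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Plain reload: Date comes back as text and DayOfWeek as int64
plain = pd.read_csv(io.StringIO(buffer.getvalue()))

# Reload with explicit hints to restore the intended dtypes
restored = pd.read_csv(
    io.StringIO(buffer.getvalue()),
    parse_dates=["Date"],
    dtype={"DayOfWeek": "int8"},
)
print(plain.dtypes, restored.dtypes, sep="\n\n")
```

Passing `parse_dates` and `dtype` when reloading the cleaned file would keep later steps working with the same types as this notebook.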


Step 10: Duplicate removal

In [0]:
import pandas as pd

# Paths
cleaned_path = "/Volumes/workspace/default/airlines/Flight_delay_cleaned.csv"
final_path = "/Volumes/workspace/default/airlines/Flight_delay_final.csv"

# Load cleaned dataset
df_cleaned = pd.read_csv(cleaned_path)

# Drop duplicates
before_shape = df_cleaned.shape
df_final = df_cleaned.drop_duplicates()
after_shape = df_final.shape

# Save final dataset
df_final.to_csv(final_path, index=False)

print("📌 Shape before duplicate removal:", before_shape)
print("📌 Shape after duplicate removal:", after_shape)
print("📌 Duplicate rows left:", df_final.duplicated().sum())
print(f"✅ Final cleaned dataset saved at: {final_path}")


📌 Shape before duplicate removal: (484551, 33)
📌 Shape after duplicate removal: (484549, 33)
📌 Duplicate rows left: 0
✅ Final cleaned dataset saved at: /Volumes/workspace/default/airlines/Flight_delay_final.csv
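
`drop_duplicates()` with no arguments removes only rows that match on every column. If a flight should be unique by a smaller key, the `subset` parameter can express that. The key below (date, flight number, tail number) is an assumed example, not a rule stated by the dataset:

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": ["2019-03-01", "2019-03-01", "2019-03-01"],
    "FlightNum": [3920, 3920, 509],
    "TailNum": ["N464WN", "N464WN", "N763SW"],
    "ArrDelay": [34, 34, 57],
})

# How many rows repeat an earlier row exactly?
exact_dupes = int(toy.duplicated().sum())

# Assumed business key: one record per date, flight number, and tail number
key = ["Date", "FlightNum", "TailNum"]
deduped = toy.drop_duplicates(subset=key, keep="first").reset_index(drop=True)
print(exact_dupes, len(deduped))
```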


Step 11: Compare Cleaned vs Final

In [0]:
# Load final dataset
df_final = pd.read_csv(final_path)

print("\n📌 CLEANED Dataset Shape:", df_cleaned.shape)
print("📌 FINAL Dataset Shape:", df_final.shape)

# Check nulls
print("\n📌 Null values in FINAL dataset:")
print(df_final.isnull().sum().sum(), "total nulls (should be 0)")

# Compare extra columns
print("\n📌 Columns in FINAL dataset:", len(df_final.columns))
print("Extra derived columns:", set(df_final.columns) - set(df_cleaned.columns))

# Quick sample
print("\n📌 Sample 3 rows from FINAL dataset:")
print(df_final.head(3))



📌 CLEANED Dataset Shape: (484551, 33)
📌 FINAL Dataset Shape: (484549, 33)

📌 Null values in FINAL dataset:
0 total nulls (should be 0)

📌 Columns in FINAL dataset: 33
Extra derived columns: set()

📌 Sample 3 rows from FINAL dataset:
   DayOfWeek        Date  DepTime  ArrTime  ...  Month DayNumber Hour    Route
0          4  2019-03-01     1829     1959  ...      3         1   18  IND-BWI
1          4  2019-03-01     1937     2037  ...      3         1   19  IND-LAS
2          4  2019-03-01     1644     1845  ...      3         1   16  IND-MCO

[3 rows x 33 columns]
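
The cleaned and final files share the same columns, so their set difference is empty. To list the derived features, the final columns can be compared against the raw schema instead. A sketch using a hypothetical subset of the raw column names:

```python
import pandas as pd

raw_columns = ["Date", "DepTime", "Origin", "Dest"]  # subset of the raw schema, for illustration
final = pd.DataFrame({
    "Date": ["2019-03-01"], "DepTime": [1829], "Origin": ["IND"], "Dest": ["BWI"],
    "Month": [3], "DayNumber": [1], "Hour": [18], "Route": ["IND-BWI"],
})

# Columns present in the final frame but absent from the raw schema
derived = sorted(set(final.columns) - set(raw_columns))
print("Derived columns:", derived)
```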
