# Week 1 — Data Cleaning and EDA
###  **Intern:** Sarthak Mokal , sarthakmokal198@gamil.com  
### **Project:** AirFly Insights — Infosys Springboard  

### Objective: Load the raw CSV, perform EDA, remove duplicates, handle missing values simply, derive basic features, save a cleaned CSV, and compute KPIs for Week 1.


In [0]:
import pandas as pd
import numpy as np
import time

# Load dataset 
df = pd.read_csv("/Volumes/workspace/default/airlines/Flight_delay.csv")

print("Data loaded.")
print("Original shape:", df.shape)
display(df.head())


Data loaded.
Original shape: (484551, 29)


DayOfWeek,Date,DepTime,ArrTime,CRSArrTime,UniqueCarrier,Airline,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Org_Airport,Dest,Dest_Airport,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
4,03-01-2019,1829,1959,1925,WN,Southwest Airlines Co.,3920,N464WN,90,90,77,34,34,IND,Indianapolis International Airport,BWI,Baltimore-Washington International Airport,515,3,10,0,N,0,2,0,0,0,32
4,03-01-2019,1937,2037,1940,WN,Southwest Airlines Co.,509,N763SW,240,250,230,57,67,IND,Indianapolis International Airport,LAS,McCarran International Airport,1591,3,7,0,N,0,10,0,0,0,47
4,03-01-2019,1644,1845,1725,WN,Southwest Airlines Co.,1333,N334SW,121,135,107,80,94,IND,Indianapolis International Airport,MCO,Orlando International Airport,828,6,8,0,N,0,8,0,0,0,72
4,03-01-2019,1452,1640,1625,WN,Southwest Airlines Co.,675,N286WN,228,240,213,15,27,IND,Indianapolis International Airport,PHX,Phoenix Sky Harbor International Airport,1489,7,8,0,N,0,3,0,0,0,12
4,03-01-2019,1323,1526,1510,WN,Southwest Airlines Co.,4,N674AA,123,135,110,16,28,IND,Indianapolis International Airport,TPA,Tampa International Airport,838,4,9,0,N,0,0,0,0,0,16


## Quick EDA
We will check: dtypes, missing counts, duplicate rows, and basic numeric summary for delay columns.


In [0]:
# basic info
print("Dtypes:")
display(df.dtypes)

# missing values
print("\nMissing values per column (top 20):")
display(df.isnull().sum().sort_values(ascending=False).head(20))

# duplicates
print("\nExact duplicate rows:", df.duplicated().sum())

# numeric summary for delay columns (if present)
delay_cols = ["ArrDelay","DepDelay","CarrierDelay","WeatherDelay","NASDelay","SecurityDelay","LateAircraftDelay"]
present_delay_cols = [c for c in delay_cols if c in df.columns]
if present_delay_cols:
    print("\nDelay columns summary:")
    display(df[present_delay_cols].describe().T)
else:
    print("\nNo standard delay columns detected.")


Dtypes:


DayOfWeek             int64
Date                 object
DepTime               int64
ArrTime               int64
CRSArrTime            int64
UniqueCarrier        object
Airline              object
FlightNum             int64
TailNum              object
ActualElapsedTime     int64
CRSElapsedTime        int64
AirTime               int64
ArrDelay              int64
DepDelay              int64
Origin               object
Org_Airport          object
Dest                 object
Dest_Airport         object
Distance              int64
TaxiIn                int64
TaxiOut               int64
Cancelled             int64
CancellationCode     object
Diverted              int64
CarrierDelay          int64
WeatherDelay          int64
NASDelay              int64
SecurityDelay         int64
LateAircraftDelay     int64
dtype: object


Missing values per column (top 20):


Dest_Airport        1479
Org_Airport         1177
DayOfWeek              0
SecurityDelay          0
NASDelay               0
WeatherDelay           0
CarrierDelay           0
Diverted               0
CancellationCode       0
Cancelled              0
TaxiOut                0
TaxiIn                 0
Distance               0
Dest                   0
Origin                 0
Date                   0
DepDelay               0
ArrDelay               0
AirTime                0
CRSElapsedTime         0
dtype: int64


Exact duplicate rows: 2

Delay columns summary:


count,mean,std,min,25%,50%,75%,max
484551.0,60.90776409500754,56.97542038382813,15.0,25.0,42.0,76.0,1707.0
484551.0,57.49808585680351,55.99101236850621,6.0,23.0,40.0,72.0,1710.0
484551.0,17.41943985256454,39.41789257389614,0.0,0.0,2.0,19.0,1707.0
484551.0,3.1532841744212683,19.503656630515582,0.0,0.0,0.0,0.0,1148.0
484551.0,13.599420907190368,31.45465484663617,0.0,0.0,1.0,13.0,1357.0
484551.0,0.0820326446545358,1.8847739628803888,0.0,0.0,0.0,0.0,392.0
484551.0,26.653586516176837,40.53599410662208,0.0,0.0,13.0,36.0,1254.0


## Cleaning plan (simple & safe)
1. Drop exact duplicate rows.  
2. For delay numeric columns: fill missing with 0.  
3. For categorical columns (Origin, Dest, TailNum, CancellationCode, UniqueCarrier): fill missing with 'Unknown'.  
4. Convert Date to datetime; extract Month & DayOfWeek.  
5. Extract Hour from DepTime if present (HHMM).  
6. Create Route = Origin-Dest.  
7. Save cleaned CSV as `Flight_delay_cleaned.csv`.  
8. Compute KPIs (before & after numbers).


###  Save KPI baseline (before cleaning)

In [0]:
# KPI baseline before cleaning
kpi = {}
kpi['rows_before'] = len(df)
kpi['nulls_before'] = int(df.isnull().sum().sum())
kpi['dup_before'] = int(df.duplicated().sum())
# simple data quality counts before
kpi['neg_arrival_before'] = int((df['ArrDelay'] < 0).sum()) if 'ArrDelay' in df.columns else None

# print baseline
print("KPI baseline:")
for k,v in kpi.items():
    print(f"{k}: {v}")


KPI baseline:
rows_before: 484551
nulls_before: 2656
dup_before: 2
neg_arrival_before: 0


### Remove duplicates & fill delays

In [0]:
start_time = time.time()

# 1) drop exact duplicates
df = df.drop_duplicates()

# 2) fill delay numeric columns with 0
for c in present_delay_cols:
    df[c] = pd.to_numeric(df[c], errors="coerce").fillna(0)

# 3) fill categorical columns with 'Unknown'
for c in ["Origin","Dest","TailNum","CancellationCode","UniqueCarrier","Airline"]:
    if c in df.columns:
        df[c] = df[c].fillna("Unknown")

print("Cleaning steps applied (duplicates removed, simple fills).")


Cleaning steps applied (duplicates removed, simple fills).


In [0]:


# 1) check what airport-name columns exist (helps if column name variants exist)
print("Columns present:", [c for c in df.columns if "Org" in c or "Dest" in c or "Airport" in c][:30])

# 2) Normalize whitespace & case for airport columns if they exist
for col in ["Org_Airport", "Dest_Airport", "Origin", "Dest"]:
    if col in df.columns:
        # strip whitespace and replace empty strings with NaN
        df[col] = df[col].astype(str).str.strip().replace({"": pd.NA, "nan": pd.NA})

# 3) Fill common categorical columns with 'Unknown' 
for c in ["Org_Airport", "Dest_Airport", "Origin", "Dest", "TailNum", "CancellationCode"]:
    if c in df.columns:
        df[c] = df[c].fillna("Unknown")

# optional:
for c in ["Org_Airport", "Dest_Airport"]:
    if c in df.columns:
        df[c + "_was_missing"] = (df[c] == "Unknown").astype("int8")

print("Filled airport/categorical nulls with 'Unknown' where applicable.")


Columns present: ['Org_Airport', 'Dest', 'Dest_Airport']
Filled airport/categorical nulls with 'Unknown' where applicable.


In [0]:
# — drop rows with nulls in critical columns
critical = [c for c in ["Date","Origin","Dest"] if c in df.columns]
print("Rows before drop:", len(df))
df = df.dropna(subset=critical)
print("Rows after drop:", len(df))

Rows before drop: 484549
Rows after drop: 484549


### Date/time handling & features

In [0]:
# Date -> Month, DayOfWeek
if "Date" in df.columns:
    df["Date"] = pd.to_datetime(df["Date"], errors="coerce")
    df["Month"] = df["Date"].dt.month
    df["DayOfWeek"] = df["Date"].dt.day_name()

# DepTime -> Hour (simple HHMM extraction)
if "DepTime" in df.columns:
    try:
        s = df["DepTime"].fillna(0).astype(int).astype(str).str.zfill(4)
        df["Hour"] = s.str[:2].astype(int)
    except Exception:
        pass

# Route
if "Origin" in df.columns and "Dest" in df.columns:
    df["Route"] = df["Origin"].astype(str) + "-" + df["Dest"].astype(str)

print("Feature engineering done (Month, DayOfWeek, Hour, Route where possible).")


  df["Date"] = pd.to_datetime(df["Date"], errors="coerce")


Feature engineering done (Month, DayOfWeek, Hour, Route where possible).


### Extra simple exploration

In [0]:
# top airlines by count
if "UniqueCarrier" in df.columns:
    print("Top airlines by count:")
    display(df["UniqueCarrier"].value_counts().head())

# top routes
if "Route" in df.columns:
    print("Top 10 routes:")
    display(df["Route"].value_counts().head(10))

# average arrival delay by day of week
if "DayOfWeek" in df.columns and "ArrDelay" in df.columns:
    print("Average arrival delay by DayOfWeek:")
    display(df.groupby("DayOfWeek")["ArrDelay"].mean().sort_values(ascending=False))


Top airlines by count:


WN    119048
AA     73053
MQ     58698
UA     56896
OO     50384
Name: UniqueCarrier, dtype: int64

Top 10 routes:


ORD-LGA    1920
LGA-ORD    1615
LAX-SFO    1603
SFO-LAX    1457
LAS-LAX    1305
HOU-DAL    1276
DAL-HOU    1200
ORD-LAX    1154
PHX-LAS    1152
DFW-ORD    1125
Name: Route, dtype: int64

Average arrival delay by DayOfWeek:


DayOfWeek
Monday       64.116298
Saturday     62.234384
Tuesday      61.390009
Thursday     61.335392
Sunday       60.093748
Wednesday    59.625807
Friday       57.909918
Name: ArrDelay, dtype: float64

### Save cleaned CSV

In [0]:
# Step 9: Save the cleaned dataset (persistent path in workspace volume)
output_path = "/Volumes/workspace/default/airlines/Flight_delay_cleaned.csv"

df.to_csv(output_path, index=False)

print(f"✅ Cleaned dataset saved to: {output_path}")
print("📌 Final shape:", df.shape)


✅ Cleaned dataset saved to: /Volumes/workspace/default/airlines/Flight_delay_cleaned.csv
📌 Final shape: (484549, 34)


In [0]:
# KPIs after cleaning
kpi['rows_after'] = len(df)
kpi['nulls_after'] = int(df.isnull().sum().sum())
kpi['dup_after'] = int(df.duplicated().sum())
kpi['neg_arrival_after'] = int((df['ArrDelay'] < 0).sum()) if 'ArrDelay' in df.columns else None
kpi['time_taken_s'] = round(time.time() - start_time, 1)

# derived KPI metrics
kpi['null_reduction_pct'] = round(100 * (kpi['nulls_before'] - kpi['nulls_after']) / (kpi['nulls_before'] if kpi['nulls_before'] else 1), 1)

print("KPI results (copy these numbers into your Week-1 PDF):")
for k,v in kpi.items():
    print(f"{k}: {v}")


KPI results (copy these numbers into your Week-1 PDF):
rows_before: 484551
nulls_before: 2656
dup_before: 2
neg_arrival_before: 0
rows_after: 484549
nulls_after: 2656
dup_after: 0
neg_arrival_after: 0
time_taken_s: 108.4
null_reduction_pct: 0.0


In [0]:
print("Final shape:", kpi['rows_after'])
print("Top missing values after cleaning:")
display(df.isnull().sum().sort_values(ascending=False).head(15))

print("Sample rows after cleaning:")
display(df.head())


Final shape: 484549
Top missing values after cleaning:


Dest_Airport         1479
Org_Airport          1177
DayOfWeek               0
Hour                    0
Month                   0
LateAircraftDelay       0
SecurityDelay           0
NASDelay                0
WeatherDelay            0
CarrierDelay            0
Diverted                0
CancellationCode        0
Cancelled               0
TaxiOut                 0
TaxiIn                  0
dtype: int64

Sample rows after cleaning:


DayOfWeek,Date,DepTime,ArrTime,CRSArrTime,UniqueCarrier,Airline,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Org_Airport,Dest,Dest_Airport,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay,Month,Hour,Route
Friday,2019-03-01T00:00:00.000Z,1829,1959,1925,WN,Southwest Airlines Co.,3920,N464WN,90,90,77,34,34,IND,Indianapolis International Airport,BWI,Baltimore-Washington International Airport,515,3,10,0,N,0,2,0,0,0,32,3,18,IND-BWI
Friday,2019-03-01T00:00:00.000Z,1937,2037,1940,WN,Southwest Airlines Co.,509,N763SW,240,250,230,57,67,IND,Indianapolis International Airport,LAS,McCarran International Airport,1591,3,7,0,N,0,10,0,0,0,47,3,19,IND-LAS
Friday,2019-03-01T00:00:00.000Z,1644,1845,1725,WN,Southwest Airlines Co.,1333,N334SW,121,135,107,80,94,IND,Indianapolis International Airport,MCO,Orlando International Airport,828,6,8,0,N,0,8,0,0,0,72,3,16,IND-MCO
Friday,2019-03-01T00:00:00.000Z,1452,1640,1625,WN,Southwest Airlines Co.,675,N286WN,228,240,213,15,27,IND,Indianapolis International Airport,PHX,Phoenix Sky Harbor International Airport,1489,7,8,0,N,0,3,0,0,0,12,3,14,IND-PHX
Friday,2019-03-01T00:00:00.000Z,1323,1526,1510,WN,Southwest Airlines Co.,4,N674AA,123,135,110,16,28,IND,Indianapolis International Airport,TPA,Tampa International Airport,838,4,9,0,N,0,0,0,0,0,16,3,13,IND-TPA
