# Flight delays and weather conditions – Data Exploration

**Purpose**
- Explore data for flight delay and weather condition use cases
- Validate assumptions before adding API endpoints
- Prototype logic for FastAPI services

**Author:** Rashed  
**Date:** 2025-12-24


In [1]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("robikscube/flight-delay-dataset-20182022")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/robikscube/flight-delay-dataset-20182022?dataset_version_number=4...


100%|██████████| 3.73G/3.73G [00:18<00:00, 218MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/robikscube/flight-delay-dataset-20182022/versions/4


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
%cd /content/drive/MyDrive/Colab\ Notebooks/FlightDelay
%ls

/content/drive/MyDrive/Colab Notebooks/FlightDelay
[0m[01;34maeroapi-python[0m/     [01;34mfeatures[0m/        [01;34mnotebooks[0m/        sample_df.parquet
[01;34maeromarket_api[0m/     [01;34mFlightWeather[0m/   [01;34mopenmeteo_api[0m/    [01;34msrc[0m/
bootstrap_paths.py  full_df.parquet  [01;34m__pycache__[0m/      [01;34msublemmentary_folder[0m/
[01;34mcatboost_info[0m/      main.py          pyproject.toml    weather.parquet
[01;34mdata[0m/               [01;34mmodels[0m/          requirements.txt


In [4]:
import pandas as pd
# We only need a few columns for our analysis
KEEP_COLS = [
    # ---- Date & time (join keys) ----
    "FlightDate",
    "DepTimeBlk",

    # ---- Airports (for direction) ----
    "Origin",
    "Dest",

    # ---- Flight identity (API join) ----
    "IATA_Code_Operating_Airline",

    # ---- Targets / outcomes ----
    "DepDelayMinutes",

    # ---- Operational signal ----
    "Distance",
    "CRSElapsedTime",

    # ---- Cancellation info (to drop)----
    "Cancelled",

    # ---- Arrival time to calculate counts ----
    "ArrTimeBlk",

]

df_2018 = pd.read_parquet(f'{path}/Combined_Flights_2018.parquet', columns=KEEP_COLS)
df_2019 = pd.read_parquet(f'{path}/Combined_Flights_2019.parquet', columns=KEEP_COLS)
df_2020 = pd.read_parquet(f'{path}/Combined_Flights_2020.parquet', columns=KEEP_COLS)
df_2021 = pd.read_parquet(f'{path}/Combined_Flights_2021.parquet', columns=KEEP_COLS)
df_2022 = pd.read_parquet(f'{path}/Combined_Flights_2022.parquet', columns=KEEP_COLS)

Now we will need to concatenate the data from different years into a single DataFrame for easier analysis.

In [18]:
df = pd.concat([df_2018, df_2019, df_2020, df_2021, df_2022], ignore_index=True)

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29193782 entries, 0 to 29193781
Data columns (total 10 columns):
 #   Column                       Dtype         
---  ------                       -----         
 0   FlightDate                   datetime64[us]
 1   DepTimeBlk                   object        
 2   Origin                       object        
 3   Dest                         object        
 4   IATA_Code_Operating_Airline  object        
 5   DepDelayMinutes              float64       
 6   Distance                     float64       
 7   CRSElapsedTime               float64       
 8   Cancelled                    bool          
 9   ArrTimeBlk                   object        
dtypes: bool(1), datetime64[us](1), float64(3), object(5)
memory usage: 2.0+ GB


In [20]:
df.describe()

Unnamed: 0,FlightDate,DepDelayMinutes,Distance,CRSElapsedTime
count,29193782,28430700.0,29193780.0,29193760.0
mean,2020-04-23 22:27:03.485606,12.78311,779.7346,138.7605
min,2018-01-01 00:00:00,0.0,16.0,-292.0
25%,2019-03-18 00:00:00,0.0,354.0,88.0
50%,2020-02-08 00:00:00,0.0,626.0,121.0
75%,2021-07-17 00:00:00,5.0,1014.0,169.0
max,2022-07-31 00:00:00,7223.0,5812.0,1645.0
std,,46.17337,581.2739,70.77316


In [21]:
# TODO: do not train the model, until you make sure that you can obtain distance and CRSElapsedTime from the API
df.head()

Unnamed: 0,FlightDate,DepTimeBlk,Origin,Dest,IATA_Code_Operating_Airline,DepDelayMinutes,Distance,CRSElapsedTime,Cancelled,ArrTimeBlk
0,2018-01-23,1200-1259,ABY,ATL,9E,0.0,145.0,62.0,False,1300-1359
1,2018-01-24,1200-1259,ABY,ATL,9E,0.0,145.0,62.0,False,1300-1359
2,2018-01-25,1200-1259,ABY,ATL,9E,0.0,145.0,62.0,False,1300-1359
3,2018-01-26,1200-1259,ABY,ATL,9E,0.0,145.0,62.0,False,1300-1359
4,2018-01-27,1400-1459,ABY,ATL,9E,0.0,145.0,60.0,False,1500-1559


Our goal is to predict delays, so we do not need cancelled flights. We filter them out during the data loading phase.

**Update:** After deciding to do classification instead of regression, we concluded that cancelled flights matter, so we no longer run the next cell.

In [9]:
df = df[df['Cancelled'] == 0].copy()
df.drop(columns=['Cancelled'], inplace=True)
df.head()

Unnamed: 0,FlightDate,DepTimeBlk,Origin,Dest,IATA_Code_Operating_Airline,DepDelayMinutes,Distance,CRSElapsedTime,ArrTimeBlk
0,2018-01-23,1200-1259,ABY,ATL,9E,0.0,145.0,62.0,1300-1359
1,2018-01-24,1200-1259,ABY,ATL,9E,0.0,145.0,62.0,1300-1359
2,2018-01-25,1200-1259,ABY,ATL,9E,0.0,145.0,62.0,1300-1359
3,2018-01-26,1200-1259,ABY,ATL,9E,0.0,145.0,62.0,1300-1359
4,2018-01-27,1400-1459,ABY,ATL,9E,0.0,145.0,60.0,1500-1559
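
As a rough illustration of why cancelled flights matter for classification, the sketch below folds `Cancelled` into a binary target. The 15-minute cutoff and the `disrupted` column name are assumptions made for this example, not decisions taken in this notebook.

```python
import pandas as pd

# Hypothetical cutoff: not fixed anywhere in this notebook.
DELAY_THRESHOLD_MIN = 15

# Toy rows; the cancelled flight has no recorded departure delay.
toy = pd.DataFrame({
    "DepDelayMinutes": [0.0, 44.0, None, 5.0],
    "Cancelled": [False, False, True, False],
})

# A flight counts as disrupted if it was cancelled or departed late.
# A missing delay compares as False, so cancelled rows rely on the flag.
toy["disrupted"] = (
    toy["Cancelled"] | (toy["DepDelayMinutes"] >= DELAY_THRESHOLD_MIN)
).astype(int)

print(toy)
```

With a label like this, dropping cancelled rows would remove positive examples, which is why the filtering cell is skipped.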


In [22]:
df.shape

(29193782, 10)

Let's refine the date & time columns for departure and arrival times to create a proper datetime representation.

In [23]:
df['FlightDate'] = pd.to_datetime(df['FlightDate'])
df['Hour'] = df['DepTimeBlk'].str.slice(0, 2).astype(int)
df.drop(columns=['DepTimeBlk'], inplace=True)
df["datetime"] = (
    df["FlightDate"]
    + pd.to_timedelta(df["Hour"], unit="h")
)
df.head()

Unnamed: 0,FlightDate,Origin,Dest,IATA_Code_Operating_Airline,DepDelayMinutes,Distance,CRSElapsedTime,Cancelled,ArrTimeBlk,Hour,datetime
0,2018-01-23,ABY,ATL,9E,0.0,145.0,62.0,False,1300-1359,12,2018-01-23 12:00:00
1,2018-01-24,ABY,ATL,9E,0.0,145.0,62.0,False,1300-1359,12,2018-01-24 12:00:00
2,2018-01-25,ABY,ATL,9E,0.0,145.0,62.0,False,1300-1359,12,2018-01-25 12:00:00
3,2018-01-26,ABY,ATL,9E,0.0,145.0,62.0,False,1300-1359,12,2018-01-26 12:00:00
4,2018-01-27,ABY,ATL,9E,0.0,145.0,60.0,False,1500-1559,14,2018-01-27 14:00:00


In [24]:
df['Hour_Arrival'] = df['ArrTimeBlk'].str.slice(0, 2).astype(int)
df['arrival_next_day'] = df['Hour_Arrival'] < df['Hour']
df["arr_datetime"] = (
    df["FlightDate"]
    + pd.to_timedelta(df["Hour_Arrival"], unit="h")
    + pd.to_timedelta(df["arrival_next_day"].astype(int), unit="D")
)
df.drop(columns=['ArrTimeBlk', 'Hour_Arrival', 'arrival_next_day', 'Hour', 'FlightDate'], inplace=True)
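
To make the overnight-arrival rule concrete, here is a small sketch on toy rows (the time-block values are made up for illustration). It applies the same logic as the cells above: if the arrival hour is smaller than the departure hour, the arrival is assumed to fall on the next day.

```python
import pandas as pd

# Toy data: one daytime flight and one late-night flight.
toy = pd.DataFrame({
    "FlightDate": pd.to_datetime(["2018-01-23", "2018-01-23"]),
    "DepTimeBlk": ["1200-1259", "2300-2359"],
    "ArrTimeBlk": ["1300-1359", "0001-0559"],
})

# The first two characters of a block are the starting hour.
dep_hour = toy["DepTimeBlk"].str.slice(0, 2).astype(int)
arr_hour = toy["ArrTimeBlk"].str.slice(0, 2).astype(int)

# Heuristic: an earlier arrival hour means the flight landed the next day.
next_day = arr_hour < dep_hour

toy["datetime"] = toy["FlightDate"] + pd.to_timedelta(dep_hour, unit="h")
toy["arr_datetime"] = (
    toy["FlightDate"]
    + pd.to_timedelta(arr_hour, unit="h")
    + pd.to_timedelta(next_day.astype(int), unit="D")
)

print(toy[["datetime", "arr_datetime"]])
```

One caveat worth keeping in mind: both blocks are in local time, so a short westbound flight that crosses a time zone could arrive at an earlier local hour on the same day and be pushed to the next day by this heuristic.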


In [25]:
df.head(3)

Unnamed: 0,Origin,Dest,IATA_Code_Operating_Airline,DepDelayMinutes,Distance,CRSElapsedTime,Cancelled,datetime,arr_datetime
0,ABY,ATL,9E,0.0,145.0,62.0,False,2018-01-23 12:00:00,2018-01-23 13:00:00
1,ABY,ATL,9E,0.0,145.0,62.0,False,2018-01-24 12:00:00,2018-01-24 13:00:00
2,ABY,ATL,9E,0.0,145.0,62.0,False,2018-01-25 12:00:00,2018-01-25 13:00:00


To count the flights departing from and arriving at an airport within the same hour, we group by origin and departure datetime, and then do the same for destination and arrival datetime. We can then add these counts to the main dataframe as new features.

First, we will group by 'Origin' and 'datetime' to get the count of departures per hour for each airport.

In [26]:
dep_counts = (
    df.groupby(["Origin", "datetime"])
      .size()
      .reset_index(name="departures_per_hour")
      .rename(columns={"Origin": "airport", "datetime": "hour"})
)

arr_counts = (
    df.groupby(["Dest", "arr_datetime"])
      .size()
      .reset_index(name="arrivals_per_hour")
      .rename(columns={"Dest": "airport", "arr_datetime": "hour"})
)


In [27]:
dep_counts.head()

Unnamed: 0,airport,hour,departures_per_hour
0,ABE,2018-01-01 06:00:00,2
1,ABE,2018-01-01 09:00:00,2
2,ABE,2018-01-01 17:00:00,1
3,ABE,2018-01-01 20:00:00,1
4,ABE,2018-01-02 06:00:00,3


In [28]:
arr_counts.head()

Unnamed: 0,airport,hour,arrivals_per_hour
0,ABE,2018-01-01 09:00:00,1
1,ABE,2018-01-01 16:00:00,1
2,ABE,2018-01-01 17:00:00,2
3,ABE,2018-01-01 19:00:00,1
4,ABE,2018-01-01 22:00:00,1


In [29]:
congestion = (
    dep_counts
    .merge(
        arr_counts,
        on=["airport", "hour"],
        how="outer"
    )
    .fillna(0)
)

congestion["scheduled_congestion"] = (
    congestion["departures_per_hour"]
    + congestion["arrivals_per_hour"]
)

congestion.head(10)

Unnamed: 0,airport,hour,departures_per_hour,arrivals_per_hour,scheduled_congestion
0,ABE,2018-01-01 06:00:00,2.0,0.0,2.0
1,ABE,2018-01-01 09:00:00,2.0,1.0,3.0
2,ABE,2018-01-01 16:00:00,0.0,1.0,1.0
3,ABE,2018-01-01 17:00:00,1.0,2.0,3.0
4,ABE,2018-01-01 19:00:00,0.0,1.0,1.0
5,ABE,2018-01-01 20:00:00,1.0,0.0,1.0
6,ABE,2018-01-01 22:00:00,0.0,1.0,1.0
7,ABE,2018-01-02 06:00:00,3.0,0.0,3.0
8,ABE,2018-01-02 09:00:00,1.0,1.0,2.0
9,ABE,2018-01-02 15:00:00,0.0,1.0,1.0
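
The same computation on three toy flights may make the outer merge easier to follow. This is only a sketch: the flights are invented, and unlike the cell above it casts the counts back to integers, since `fillna(0)` after an outer merge leaves them as floats.

```python
import pandas as pd

# Three toy flights: two ABE -> ATL, one ATL -> ABE.
flights = pd.DataFrame({
    "Origin": ["ABE", "ABE", "ATL"],
    "Dest": ["ATL", "ATL", "ABE"],
    "datetime": pd.to_datetime(
        ["2018-01-01 06:00", "2018-01-01 06:00", "2018-01-01 09:00"]
    ),
    "arr_datetime": pd.to_datetime(
        ["2018-01-01 09:00", "2018-01-01 09:00", "2018-01-01 11:00"]
    ),
})

dep_counts = (
    flights.groupby(["Origin", "datetime"])
    .size()
    .reset_index(name="departures_per_hour")
    .rename(columns={"Origin": "airport", "datetime": "hour"})
)
arr_counts = (
    flights.groupby(["Dest", "arr_datetime"])
    .size()
    .reset_index(name="arrivals_per_hour")
    .rename(columns={"Dest": "airport", "arr_datetime": "hour"})
)

# Outer merge keeps hours that have only departures or only arrivals.
congestion = dep_counts.merge(arr_counts, on=["airport", "hour"], how="outer")

# Missing counts mean "no flights"; cast back to integers after filling.
count_cols = ["departures_per_hour", "arrivals_per_hour"]
congestion[count_cols] = congestion[count_cols].fillna(0).astype("int64")
congestion["scheduled_congestion"] = congestion[count_cols].sum(axis=1)

print(congestion)
```

Here ATL at 09:00 has one departure and two arrivals, so its `scheduled_congestion` is 3, while the other two airport-hours appear on only one side of the merge.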


In [30]:
congestion[congestion['scheduled_congestion'] == 132]

Unnamed: 0,airport,hour,departures_per_hour,arrivals_per_hour,scheduled_congestion
300958,ATL,2018-01-12 20:00:00,43.0,89.0,132.0
300984,ATL,2018-01-14 08:00:00,53.0,79.0,132.0
301015,ATL,2018-01-15 20:00:00,42.0,90.0,132.0
301065,ATL,2018-01-18 13:00:00,66.0,66.0,132.0
301072,ATL,2018-01-18 20:00:00,42.0,90.0,132.0
...,...,...,...,...,...
4485313,ORD,2022-07-22 18:00:00,54.0,78.0,132.0
4485351,ORD,2022-07-24 18:00:00,53.0,79.0,132.0
4485370,ORD,2022-07-25 18:00:00,54.0,78.0,132.0
4485427,ORD,2022-07-28 18:00:00,54.0,78.0,132.0


Now let's merge these counts back into the main dataframe.

In [31]:
df = df.merge(
    congestion[["airport", "hour", "scheduled_congestion"]],
    left_on=["Origin", "datetime"],
    right_on=["airport", "hour"],
    how="left",
    validate="many_to_one"
)
df

Unnamed: 0,Origin,Dest,IATA_Code_Operating_Airline,DepDelayMinutes,Distance,CRSElapsedTime,Cancelled,datetime,arr_datetime,airport,hour,scheduled_congestion
0,ABY,ATL,9E,0.0,145.0,62.0,False,2018-01-23 12:00:00,2018-01-23 13:00:00,ABY,2018-01-23 12:00:00,1.0
1,ABY,ATL,9E,0.0,145.0,62.0,False,2018-01-24 12:00:00,2018-01-24 13:00:00,ABY,2018-01-24 12:00:00,1.0
2,ABY,ATL,9E,0.0,145.0,62.0,False,2018-01-25 12:00:00,2018-01-25 13:00:00,ABY,2018-01-25 12:00:00,1.0
3,ABY,ATL,9E,0.0,145.0,62.0,False,2018-01-26 12:00:00,2018-01-26 13:00:00,ABY,2018-01-26 12:00:00,1.0
4,ABY,ATL,9E,0.0,145.0,60.0,False,2018-01-27 14:00:00,2018-01-27 15:00:00,ABY,2018-01-27 14:00:00,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...
29193777,MSY,EWR,YX,25.0,1167.0,185.0,False,2022-03-31 19:00:00,2022-03-31 23:00:00,MSY,2022-03-31 19:00:00,11.0
29193778,CLT,EWR,YX,44.0,529.0,129.0,True,2022-03-17 17:00:00,2022-03-17 19:00:00,CLT,2022-03-17 17:00:00,64.0
29193779,ALB,ORD,YX,378.0,723.0,158.0,False,2022-03-08 17:00:00,2022-03-08 18:00:00,ALB,2022-03-08 17:00:00,7.0
29193780,EWR,PIT,YX,113.0,319.0,86.0,False,2022-03-25 21:00:00,2022-03-25 22:00:00,EWR,2022-03-25 21:00:00,34.0
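
The `validate="many_to_one"` argument is what protects this merge from silently duplicating flights. The sketch below, on invented toy frames, shows it raising when the lookup table has repeated keys, and uses `indicator=True` to count flights that found no match.

```python
import pandas as pd

flights = pd.DataFrame({
    "Origin": ["ABY", "ABY"],
    "datetime": pd.to_datetime(["2018-01-23 12:00", "2018-01-23 12:00"]),
})
lookup_ok = pd.DataFrame({
    "airport": ["ABY"],
    "hour": pd.to_datetime(["2018-01-23 12:00"]),
    "scheduled_congestion": [2],
})
# Same table with its only key repeated, which breaks the many-to-one contract.
lookup_bad = pd.concat([lookup_ok, lookup_ok], ignore_index=True)

merge_args = dict(
    left_on=["Origin", "datetime"],
    right_on=["airport", "hour"],
    how="left",
    validate="many_to_one",
)

# indicator=True adds a "_merge" column that flags unmatched left rows.
merged = flights.merge(lookup_ok, indicator=True, **merge_args)
unmatched = int((merged["_merge"] == "left_only").sum())

try:
    flights.merge(lookup_bad, **merge_args)
    raised = False
except pd.errors.MergeError:
    raised = True

print(len(merged), unmatched, raised)
```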


In [32]:
df[df['scheduled_congestion'] == 132]

Unnamed: 0,Origin,Dest,IATA_Code_Operating_Airline,DepDelayMinutes,Distance,CRSElapsedTime,Cancelled,datetime,arr_datetime,airport,hour,scheduled_congestion
593,ATL,FAY,9E,8.0,331.0,78.0,False,2018-01-14 08:00:00,2018-01-14 09:00:00,ATL,2018-01-14 08:00:00,132.0
599,ATL,FAY,9E,0.0,331.0,76.0,False,2018-01-20 08:00:00,2018-01-20 09:00:00,ATL,2018-01-20 08:00:00,132.0
605,ATL,FAY,9E,47.0,331.0,76.0,False,2018-01-27 08:00:00,2018-01-27 09:00:00,ATL,2018-01-27 08:00:00,132.0
1298,ATL,FAY,9E,42.0,331.0,75.0,False,2018-01-18 13:00:00,2018-01-18 14:00:00,ATL,2018-01-18 13:00:00,132.0
1305,ATL,FAY,9E,0.0,331.0,75.0,False,2018-01-25 13:00:00,2018-01-25 14:00:00,ATL,2018-01-25 13:00:00,132.0
...,...,...,...,...,...,...,...,...,...,...,...,...
29192145,ORD,IND,YX,83.0,177.0,63.0,False,2022-03-26 18:00:00,2022-03-26 20:00:00,ORD,2022-03-26 18:00:00,132.0
29192150,ORD,ROC,YX,71.0,528.0,114.0,False,2022-03-26 18:00:00,2022-03-26 21:00:00,ORD,2022-03-26 18:00:00,132.0
29192170,ORD,CMH,YX,0.0,296.0,75.0,False,2022-03-26 18:00:00,2022-03-26 20:00:00,ORD,2022-03-26 18:00:00,132.0
29192196,ORD,CVG,YX,0.0,264.0,77.0,False,2022-03-26 18:00:00,2022-03-26 20:00:00,ORD,2022-03-26 18:00:00,132.0


In [33]:
df.to_parquet("./data/df_v1.parquet")

**Update**: We realized that congestion at the arrival airport also affects departure delay, so it is added here.

In [None]:
import pandas as pd

In [None]:
full_df = pd.read_parquet("./data/full_df_v2.parquet")

In [None]:
full_df.columns

Index(['Origin', 'Dest', 'IATA_Code_Operating_Airline', 'DepDelayMinutes',
       'Distance', 'CRSElapsedTime', 'arr_datetime', 'scheduled_congestion',
       'dep_snowfall', 'dep_rain', 'dep_precipitation', 'dep_wind_speed_10m',
       'dep_wind_gusts_10m', 'dep_cloud_cover_low', 'dep_cloud_cover',
       'dep_temperature_2m', 'dep_apparent_temperature',
       'dep_surface_pressure', 'dep_relative_humidity_2m', 'dep_pressure_msl',
       'dep_date_local', 'arr_date', 'arr_snowfall', 'arr_rain',
       'arr_precipitation', 'arr_wind_speed_10m', 'arr_wind_gusts_10m',
       'arr_cloud_cover_low', 'arr_cloud_cover', 'arr_temperature_2m',
       'arr_apparent_temperature', 'arr_surface_pressure',
       'arr_relative_humidity_2m', 'arr_pressure_msl'],
      dtype='object')

In [None]:
full_df = full_df.rename(columns={'scheduled_congestion': 'dep_scheduled_congestion'})

In [None]:
full_df.head(1)['arr_datetime']

Unnamed: 0,arr_datetime
0,2018-01-23 13:00:00


In [None]:
dep_counts = (
    full_df.groupby(["Origin", "dep_date_local"])
      .size()
      .reset_index(name="departures_per_hour")
      .rename(columns={"Origin": "airport", "dep_date_local": "hour"})
)

arr_counts = (
    full_df.groupby(["Dest", "arr_datetime"])
      .size()
      .reset_index(name="arrivals_per_hour")
      .rename(columns={"Dest": "airport", "arr_datetime": "hour"})
)


In [None]:
congestion = (
    dep_counts
    .merge(
        arr_counts,
        on=["airport", "hour"],
        how="outer"
    )
    .fillna(0)
)

congestion["scheduled_congestion"] = (
    congestion["departures_per_hour"]
    + congestion["arrivals_per_hour"]
)
congestion.head(6)

Unnamed: 0,airport,hour,departures_per_hour,arrivals_per_hour,scheduled_congestion
0,ABE,2018-01-01 06:00:00,2.0,0.0,2.0
1,ABE,2018-01-01 09:00:00,2.0,1.0,3.0
2,ABE,2018-01-01 16:00:00,0.0,1.0,1.0
3,ABE,2018-01-01 17:00:00,1.0,2.0,3.0
4,ABE,2018-01-01 19:00:00,0.0,1.0,1.0
5,ABE,2018-01-01 20:00:00,1.0,0.0,1.0


In [None]:
full_df = full_df.merge(
    congestion[["airport", "hour", "scheduled_congestion"]],
    left_on=["Dest", "arr_datetime"],
    right_on=["airport", "hour"],
    how="left",
    validate="many_to_one"
)

In [None]:
cols_to_drop = ['airport', 'hour']
full_df.drop(columns=cols_to_drop, inplace=True)
full_df.rename(columns={'scheduled_congestion': 'arr_scheduled_congestion'}, inplace=True)

In [None]:
sample_df = full_df.sample(n=100000, random_state=42)

In [None]:
sample_df.to_parquet("./data/sample_df_v3.parquet")
full_df.to_parquet("./data/full_df_v3.parquet")

Now we can add weather data to our dataset. The classes imported below were created to simplify the API calls in the notebook.

In [None]:
%%capture
!pip install -r requirements.txt

In [None]:
from openmeteo_api.src.openmeteoapi.WeatherData import Weather
from openmeteo_api.src.openmeteoapi.APICaller import OpenMeteoAPICaller
import os
from dotenv import load_dotenv

The flight delay dataset covers 2018 to 2022, so all we need to do is fetch the weather data for this time range for each unique airport and then join the two datasets.

In [None]:
print(set(df['Origin'].unique()) == set(df['Dest'].unique()))
print(len(set(df['Origin'].unique())))
# print(df[df['Origin']== "ISN"])

True
388


In [None]:
from tqdm import tqdm
import time

BATCH_SIZE = 10
length = len(set(df['Origin'].unique()))
airports = list(df['Origin'].unique())
start_date = "2018-01-01"
end_date = "2022-12-31"
api_caller_weather = OpenMeteoAPICaller()

dfs = []

for i in tqdm(range(0, length, BATCH_SIZE)):
    airport_list = airports[i:i+BATCH_SIZE]
    print(airport_list)
    if len(airport_list) == 0: break
    w = Weather(
        api_caller=api_caller_weather,
        airport_code=airport_list,
        code_type="iata",
        start_date=start_date,
        end_date=end_date,
    )

    for attempt in range(3):
        try:
            df_weather = w.to_hourly_dataframe()
            dfs.append(df_weather)
            break
        except Exception as e:
            if "limit" in str(e).lower():
                time.sleep(60)
            else:
                raise

    time.sleep(5)

weather_df = pd.concat(dfs, ignore_index=True)

  0%|          | 0/39 [00:00<?, ?it/s]

['ABY', 'ATL', 'MOB', 'BUF', 'DFW', 'BTV', 'CVG', 'LGA', 'CHO', 'EWN']


  3%|▎         | 1/39 [00:07<04:46,  7.53s/it]

['MCI', 'MGM', 'MSP', 'DCA', 'FAY', 'OAJ', 'STL', 'CWA', 'DTW', 'RDU']


  3%|▎         | 1/39 [00:10<06:46, 10.70s/it]


KeyboardInterrupt: 
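
The retry logic in the loop above can be pulled into a small helper. The sketch below is one way to do it; `fetch_with_retry` and its parameters are hypothetical names introduced for this example, and the rate-limit check mirrors the notebook's `"limit" in str(e).lower()` test rather than any documented Open-Meteo behaviour.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    max_attempts: int = 3,
    wait_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying only when the error message mentions a limit."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as exc:
            is_rate_limit = "limit" in str(exc).lower()
            # Re-raise unrelated errors, and rate-limit errors on the last try.
            if not is_rate_limit or attempt == max_attempts:
                raise
            sleep(wait_seconds)
    raise AssertionError("unreachable")


# Demo with a fake fetcher that fails twice before succeeding.
calls = {"n": 0}


def flaky_fetch() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("Rate limit exceeded")
    return "ok"


# The sleep is replaced by a no-op so the demo finishes immediately.
result = fetch_with_retry(flaky_fetch, sleep=lambda _seconds: None)
print(result, calls["n"])
```

In the batch loop this would be called as `dfs.append(fetch_with_retry(w.to_hourly_dataframe))`. The helper raises after the final failed attempt, so a batch cannot be dropped without notice.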

In [None]:
weather_df = pd.read_parquet("./data/weather_01022026_V3_big.parquet")

In [None]:
weather_df[weather_df["queried_airport_code"] == "JFK"].head(3)

Unnamed: 0,date,snowfall,rain,precipitation,wind_speed_10m,wind_gusts_10m,cloud_cover_low,cloud_cover,temperature_2m,apparent_temperature,surface_pressure,relative_humidity_2m,pressure_msl,queried_airport_code
6840432,2017-12-31 22:00:00-07:00,0.0,0.0,0.0,18.59845,32.399998,0.0,0.0,-13.1,-19.658886,1026.895264,51.989731,1027.300049,JFK
6840433,2017-12-31 23:00:00-07:00,0.0,0.0,0.0,18.11841,31.319998,0.0,0.0,-13.3,-19.781853,1027.194824,53.764278,1027.599976,JFK
6840434,2018-01-01 00:00:00-07:00,0.0,0.0,0.0,18.11841,30.599998,0.0,0.0,-13.5,-19.981882,1027.594238,54.644852,1028.0,JFK


In [None]:
weather_df["date_local"] = weather_df["date"].dt.tz_localize(None)

In [None]:
weather_df.head(3)

Unnamed: 0,date,snowfall,rain,precipitation,wind_speed_10m,wind_gusts_10m,cloud_cover_low,cloud_cover,temperature_2m,apparent_temperature,surface_pressure,relative_humidity_2m,pressure_msl,queried_airport_code,date_local
0,2017-01-31 00:00:00-07:00,0.0,0.0,0.0,17.283749,31.319998,0.0,26.0,1.55,-3.807626,841.16626,59.465061,1024.800049,ABQ,2017-01-31 00:00:00
1,2017-01-31 01:00:00-07:00,0.0,0.0,0.0,10.446206,27.0,0.0,21.0,0.05,-4.299987,840.681946,66.755997,1025.300049,ABQ,2017-01-31 01:00:00
2,2017-01-31 02:00:00-07:00,0.0,0.0,0.0,4.802999,16.199999,0.0,30.0,-0.2,-3.811146,840.85968,63.962601,1025.699951,ABQ,2017-01-31 02:00:00


## Duplicates
One problem we face when collecting weather data is daylight saving time (DST). For example, the following rows repeat an airport and local timestamp that already appear earlier in the data:

In [None]:
weather_df[weather_df.duplicated(subset=["queried_airport_code", "date_local"])]

Unnamed: 0,date,snowfall,rain,precipitation,wind_speed_10m,wind_gusts_10m,cloud_cover_low,cloud_cover,temperature_2m,apparent_temperature,surface_pressure,relative_humidity_2m,pressure_msl,queried_airport_code,date_local
6673,2017-11-05 01:00:00-07:00,0.0,0.0,0.0,11.841756,19.080000,0.0,0.0,11.6500,8.442201,837.136108,58.163963,1012.900024,ABQ,2017-11-05 01:00:00
14738,2017-11-05 01:00:00-07:00,0.0,0.0,0.0,14.291592,28.080000,0.0,39.0,-1.6500,-6.763009,967.247986,64.854851,1016.299988,ABR,2017-11-05 01:00:00
22800,2017-11-05 01:00:00-07:00,0.0,0.0,0.0,9.346143,20.519999,1.0,4.0,4.9500,2.038206,1011.125061,89.389397,1018.599976,ACV,2017-11-05 01:00:00
30866,2017-11-05 01:00:00-07:00,0.0,0.0,0.0,14.400000,24.840000,100.0,100.0,22.5000,24.967684,1012.389221,93.223862,1015.200012,AEX,2017-11-05 01:00:00
38928,2017-11-05 01:00:00-07:00,0.0,0.2,0.2,11.983188,20.880001,0.0,97.0,5.1500,0.807792,967.731445,54.021641,1010.900024,ALW,2017-11-05 01:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19381464,2018-11-04 01:00:00-07:00,0.0,0.0,0.0,2.620839,19.799999,0.0,95.0,16.9505,14.088112,880.911682,26.388680,1018.900024,BIH,2018-11-04 01:00:00
19390200,2019-11-03 01:00:00-07:00,0.0,0.0,0.0,4.896529,13.320000,0.0,0.0,15.7005,11.573908,880.364807,13.520775,1018.900024,BIH,2019-11-03 01:00:00
19398936,2020-11-01 01:00:00-07:00,0.0,0.0,0.0,5.001280,14.040000,0.0,0.0,18.1005,13.779694,883.833252,9.132733,1021.700012,BIH,2020-11-01 01:00:00
19407840,2021-11-07 01:00:00-07:00,0.0,0.0,0.0,9.449572,28.080000,0.0,0.0,9.5505,5.935616,873.216919,48.630173,1013.799988,BIH,2021-11-07 01:00:00


Notice the pattern in these duplicates: they all fall on the days when daylight saving time ends (early November). When we request data from the Open-Meteo API in local time, it returns each timestamp with the correct offset. Once we keep only the date and time in `date_local` and drop the offset, the hour that the clock repeats shows up twice. We are only interested in the reading after the clock change, because flight schedules follow the new time. Let's remove the duplicates and keep only the later reading.

In [None]:
weather_df = (
    weather_df
    .sort_values("date")
    .drop_duplicates(
        subset=["queried_airport_code", "date_local"],
        keep="last"
    )
)


In [None]:
assert not weather_df.duplicated(
    ["queried_airport_code", "date_local"]
).any()
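
To see why stripping the offset creates duplicates, here is a toy reproduction around the end of DST on 2018-11-04. The `America/New_York` time zone and the temperature values are assumptions chosen for the example; the deduplication step is the same `sort_values` plus `drop_duplicates(keep="last")` used above.

```python
import pandas as pd

# Four consecutive hours in UTC that straddle the clock change in New York.
utc_hours = pd.to_datetime(
    [
        "2018-11-04 04:00",
        "2018-11-04 05:00",
        "2018-11-04 06:00",
        "2018-11-04 07:00",
    ],
    utc=True,
)

toy = pd.DataFrame({
    "date": utc_hours.tz_convert("America/New_York"),
    "temperature_2m": [10.0, 9.5, 9.0, 8.5],
})

# Dropping the offset makes the repeated local hour indistinguishable.
toy["date_local"] = toy["date"].dt.tz_localize(None)
dupes = toy[toy.duplicated("date_local", keep=False)]

# Sorting by the tz-aware column orders rows by true instant,
# so keep="last" retains the reading taken after the clock change.
deduped = toy.sort_values("date").drop_duplicates("date_local", keep="last")

print(dupes)
print(deduped)
```

The two rows at 01:00 local time differ only in their offset (-04:00 and -05:00), and the one kept is the later reading.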


In [None]:
full_df = df.merge(weather_df, left_on=["Origin", "datetime"], right_on=["queried_airport_code", "date_local"], how="left")

In [None]:
full_df.head(3)

Unnamed: 0,Origin,Dest,IATA_Code_Operating_Airline,DepDelayMinutes,Distance,CRSElapsedTime,datetime,arr_datetime,airport,hour,...,wind_gusts_10m,cloud_cover_low,cloud_cover,temperature_2m,apparent_temperature,surface_pressure,relative_humidity_2m,pressure_msl,queried_airport_code,date_local
0,ABY,ATL,9E,0.0,145.0,62.0,2018-01-23 12:00:00,2018-01-23 13:00:00,ABY,2018-01-23 12:00:00,...,36.360001,0.0,88.0,17.049999,13.199808,1009.687012,31.722488,1016.599976,ABY,2018-01-23 12:00:00
1,ABY,ATL,9E,0.0,145.0,62.0,2018-01-24 12:00:00,2018-01-24 13:00:00,ABY,2018-01-24 12:00:00,...,31.68,0.0,88.0,14.0,10.193374,1017.16156,34.573521,1024.199951,ABY,2018-01-24 12:00:00
2,ABY,ATL,9E,0.0,145.0,62.0,2018-01-25 12:00:00,2018-01-25 13:00:00,ABY,2018-01-25 12:00:00,...,30.599998,0.0,33.0,15.2,11.500547,1024.242065,26.188848,1031.300049,ABY,2018-01-25 12:00:00


In [None]:
display(len(full_df) == len(df))

True

In [None]:
df.isna().sum()

Unnamed: 0,0
Origin,0
Dest,0
IATA_Code_Operating_Airline,0
DepDelayMinutes,1113
Distance,0
CRSElapsedTime,4
datetime,0
arr_datetime,0
airport,0
hour,0


In [None]:
full_df.isna().sum()

Unnamed: 0,0
Origin,0
Dest,0
IATA_Code_Operating_Airline,0
DepDelayMinutes,1113
Distance,0
CRSElapsedTime,4
datetime,0
arr_datetime,0
airport,0
hour,0


Before saving the dataset, it is a good idea to drop the helper columns that were created only to build it.

In [None]:
full_df.columns
col_to_drop = ['datetime', 'queried_airport_code', "airport", "hour", "date"]
full_df.drop(columns=col_to_drop, inplace=True)

In [None]:
full_df.head()

Unnamed: 0,Origin,Dest,IATA_Code_Operating_Airline,DepDelayMinutes,Distance,CRSElapsedTime,arr_datetime,scheduled_congestion,snowfall,rain,...,wind_speed_10m,wind_gusts_10m,cloud_cover_low,cloud_cover,temperature_2m,apparent_temperature,surface_pressure,relative_humidity_2m,pressure_msl,date_local
0,ABY,ATL,9E,0.0,145.0,62.0,2018-01-23 13:00:00,1.0,0.0,0.0,...,15.905319,36.360001,0.0,88.0,17.049999,13.199808,1009.687012,31.722488,1016.599976,2018-01-23 12:00:00
1,ABY,ATL,9E,0.0,145.0,62.0,2018-01-24 13:00:00,1.0,0.0,0.0,...,13.493999,31.68,0.0,88.0,14.0,10.193374,1017.16156,34.573521,1024.199951,2018-01-24 12:00:00
2,ABY,ATL,9E,0.0,145.0,62.0,2018-01-25 13:00:00,1.0,0.0,0.0,...,12.096214,30.599998,0.0,33.0,15.2,11.500547,1024.242065,26.188848,1031.300049,2018-01-25 12:00:00
3,ABY,ATL,9E,0.0,145.0,62.0,2018-01-26 13:00:00,1.0,0.0,0.0,...,17.106628,38.519997,9.0,86.0,18.1,14.929314,1024.808716,46.488724,1031.800049,2018-01-26 12:00:00
4,ABY,ATL,9E,0.0,145.0,60.0,2018-01-27 15:00:00,1.0,0.0,0.0,...,11.988594,36.360001,100.0,100.0,20.799999,19.435747,1017.820312,54.349552,1024.699951,2018-01-27 14:00:00


It is also better to run our experiments on a small sample of the data to keep runtimes short, and then train the final model on the full dataset.

In [None]:
sample_df = full_df.sample(n=100000, random_state=42)

In [None]:
full_df.to_parquet("./data/full_df.parquet")
sample_df.to_parquet("./data/sample_df.parquet")

**Update**: We realized that arrival weather affects delay even more than departure weather, so it is added here.

In [None]:
import pandas as pd

In [None]:
full_df = pd.read_parquet("./data/full_df.parquet")
weather_df = pd.read_parquet("./data/weather_01022026_V3_big.parquet")

In [None]:
airports = ['ITO', 'LIH', 'KOA', 'OGG']
start_date = "2017-12-30"
end_date = "2018-01-02"

api_caller_weather = OpenMeteoAPICaller()

w = Weather(
    api_caller=api_caller_weather,
    airport_code=airports,
    code_type="iata",
    start_date=start_date,
    end_date=end_date,
)
updated_df = w.to_hourly_dataframe()['data']

Fetching weather data from: https://archive-api.open-meteo.com/v1/archive
Successfully fetched data for 4 airport(s).


In [None]:
weather_df["date"] = pd.to_datetime(weather_df["date"])
updated_df['date'] = pd.to_datetime(updated_df['date'])
weather_df["date_local"] = weather_df["date"].dt.tz_localize(None)
updated_df["date_local"] = updated_df["date"].dt.tz_localize(None)

In [None]:
weather_df = pd.concat([weather_df, updated_df], ignore_index=True)

In [None]:
weather_df.head(2)

Unnamed: 0,date,snowfall,rain,precipitation,wind_speed_10m,wind_gusts_10m,cloud_cover_low,cloud_cover,temperature_2m,apparent_temperature,surface_pressure,relative_humidity_2m,pressure_msl,queried_airport_code,date_local
0,2017-01-31 00:00:00-07:00,0.0,0.0,0.0,17.283749,31.319998,0.0,26.0,1.55,-3.807626,841.16626,59.465061,1024.800049,ABQ,2017-01-31 00:00:00
1,2017-01-31 01:00:00-07:00,0.0,0.0,0.0,10.446206,27.0,0.0,21.0,0.05,-4.299987,840.681946,66.755997,1025.300049,ABQ,2017-01-31 01:00:00


In [None]:
weather_df = (
    weather_df
    .sort_values("date")
    .drop_duplicates(
        subset=["queried_airport_code", "date_local"],
        keep="last"
    )
)

In [None]:
arr_weather = weather_df.copy()
arr_weather = arr_weather.add_prefix("arr_")
arr_weather.head(2)

Unnamed: 0,arr_date,arr_snowfall,arr_rain,arr_precipitation,arr_wind_speed_10m,arr_wind_gusts_10m,arr_cloud_cover_low,arr_cloud_cover,arr_temperature_2m,arr_apparent_temperature,arr_surface_pressure,arr_relative_humidity_2m,arr_pressure_msl,arr_queried_airport_code,arr_date_local
266112,2017-01-30 23:00:00-07:00,0.0,0.0,0.0,15.077082,30.239998,0.0,0.0,8.7,4.768111,993.656677,64.382835,1017.0,HSV,2017-01-30 23:00:00
596736,2017-01-30 23:00:00-07:00,0.0,0.0,0.0,14.76439,24.119999,0.0,0.0,12.25,8.745193,1010.128967,58.51899,1018.5,SHV,2017-01-30 23:00:00


In [None]:
full_df = full_df.merge(arr_weather, left_on=["Dest", "arr_datetime"], right_on=["arr_queried_airport_code", "arr_date_local"], how="left")

In [None]:
full_df['arr_cloud_cover'].isna().sum()

np.int64(0)
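
The departure and arrival joins follow the same pattern, so they could share one function. The sketch below uses a hypothetical helper, `attach_weather`, on invented toy frames. Unlike the cells in this notebook, it drops the `date` column and both join keys, so treat it as an illustration of the pattern rather than a drop-in replacement.

```python
import pandas as pd

WEATHER_KEYS = ["queried_airport_code", "date_local"]


def attach_weather(
    flights: pd.DataFrame,
    weather: pd.DataFrame,
    airport_col: str,
    time_col: str,
    prefix: str,
) -> pd.DataFrame:
    """Left-join hourly weather onto flights and prefix the weather columns."""
    renamed = weather.drop(columns=["date"], errors="ignore").add_prefix(prefix)
    prefixed_keys = [prefix + key for key in WEATHER_KEYS]
    merged = flights.merge(
        renamed,
        left_on=[airport_col, time_col],
        right_on=prefixed_keys,
        how="left",
        validate="many_to_one",
    )
    # The prefixed keys duplicate the flight columns, so drop them.
    return merged.drop(columns=prefixed_keys)


# Toy inputs with a single weather variable.
flights = pd.DataFrame({
    "Origin": ["ABY"],
    "Dest": ["ATL"],
    "datetime": pd.to_datetime(["2018-01-23 12:00"]),
    "arr_datetime": pd.to_datetime(["2018-01-23 13:00"]),
})
weather = pd.DataFrame({
    "queried_airport_code": ["ABY", "ATL"],
    "date_local": pd.to_datetime(["2018-01-23 12:00", "2018-01-23 13:00"]),
    "temperature_2m": [17.0, 11.9],
})

out = attach_weather(flights, weather, "Origin", "datetime", "dep_")
out = attach_weather(out, weather, "Dest", "arr_datetime", "arr_")
print(out)
```

Prefixing before the merge avoids the separate rename map for departure columns, at the cost of requiring the weather table to be deduplicated on its keys first.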

In [None]:
full_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28416515 entries, 0 to 28416514
Data columns (total 36 columns):
 #   Column                       Dtype         
---  ------                       -----         
 0   Origin                       object        
 1   Dest                         object        
 2   IATA_Code_Operating_Airline  object        
 3   DepDelayMinutes              float64       
 4   Distance                     float64       
 5   CRSElapsedTime               float64       
 6   arr_datetime                 datetime64[ns]
 7   scheduled_congestion         float64       
 8   snowfall                     float32       
 9   rain                         float32       
 10  precipitation                float32       
 11  wind_speed_10m               float32       
 12  wind_gusts_10m               float32       
 13  cloud_cover_low              float32       
 14  cloud_cover                  float32       
 15  temperature_2m               float32       
 16

In [None]:
dep_rename_map = {
    "snowfall": "dep_snowfall",
    "rain": "dep_rain",
    "precipitation": "dep_precipitation",
    "wind_speed_10m": "dep_wind_speed_10m",
    "wind_gusts_10m": "dep_wind_gusts_10m",
    "cloud_cover_low": "dep_cloud_cover_low",
    "cloud_cover": "dep_cloud_cover",
    "temperature_2m": "dep_temperature_2m",
    "apparent_temperature": "dep_apparent_temperature",
    "surface_pressure": "dep_surface_pressure",
    "relative_humidity_2m": "dep_relative_humidity_2m",
    "pressure_msl": "dep_pressure_msl",
    "date_local": "dep_date_local",
}
cols_to_drop = ['arr_queried_airport_code', 'arr_date_local']
full_df = full_df.rename(columns=dep_rename_map)


In [None]:
full_df.drop(columns=cols_to_drop, inplace=True)

In [None]:
display(full_df.head(1), full_df[full_df['arr_snowfall'].isna()])

Unnamed: 0,Origin,Dest,IATA_Code_Operating_Airline,DepDelayMinutes,Distance,CRSElapsedTime,arr_datetime,scheduled_congestion,dep_snowfall,dep_rain,...,arr_precipitation,arr_wind_speed_10m,arr_wind_gusts_10m,arr_cloud_cover_low,arr_cloud_cover,arr_temperature_2m,arr_apparent_temperature,arr_surface_pressure,arr_relative_humidity_2m,arr_pressure_msl
0,ABY,ATL,9E,0.0,145.0,62.0,2018-01-23 13:00:00,1.0,0.0,0.0,...,0.0,22.819571,43.560001,0.0,69.0,11.9,6.463589,979.427368,44.244251,1015.400024


Unnamed: 0,Origin,Dest,IATA_Code_Operating_Airline,DepDelayMinutes,Distance,CRSElapsedTime,arr_datetime,scheduled_congestion,dep_snowfall,dep_rain,...,arr_precipitation,arr_wind_speed_10m,arr_wind_gusts_10m,arr_cloud_cover_low,arr_cloud_cover,arr_temperature_2m,arr_apparent_temperature,arr_surface_pressure,arr_relative_humidity_2m,arr_pressure_msl


In [None]:
full_df.to_parquet("./data/full_df_v2.parquet")

In [None]:
sample_df = full_df.sample(n=100000, random_state=42)

In [None]:
sample_df.to_parquet("./data/sample_df_v2.parquet")

This concludes the construction of the dataset. Let's move on to feature engineering to improve model accuracy.