# Flight delays and weather condition – Data Exploration

**Purpose**
- Explore data for flight delays and weather condition use cases
- Validate assumptions before adding API endpoints
- Prototype logic for FastAPI services

**Author:** Rashed  
**Date:** 2025-12-24


In [1]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("robikscube/flight-delay-dataset-20182022")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'flight-delay-dataset-20182022' dataset.
Path to dataset files: /kaggle/input/flight-delay-dataset-20182022


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
%cd /content/drive/MyDrive/Colab\ Notebooks/FlightDelay
%ls

/content/drive/MyDrive/Colab Notebooks/FlightDelay
aeroapi-python/     main.py         pyproject.toml         weather.parquet
aeromarket_api/     notebooks/      requirements.txt
bootstrap_paths.py  openmeteo_api/  src/
data/               __pycache__/    sublemmentary_folder/


In [4]:
%ls /kaggle/input/flight-delay-dataset-20182022/

Airlines.csv                   Combined_Flights_2021.csv
Combined_Flights_2018.csv      Combined_Flights_2021.parquet
Combined_Flights_2018.parquet  Combined_Flights_2022.csv
Combined_Flights_2019.csv      Combined_Flights_2022.parquet
Combined_Flights_2019.parquet  raw/
Combined_Flights_2020.csv      readme.html
Combined_Flights_2020.parquet  readme.md


In [7]:
import pandas as pd
# We only need a few columns for our analysis
KEEP_COLS = [
    # ---- Date & time (join keys) ----
    "FlightDate",
    "DepTimeBlk",

    # ---- Airports (for direction) ----
    "Origin",
    "Dest",

    # ---- Flight identity (API join) ----
    "IATA_Code_Operating_Airline",

    # ---- Targets / outcomes ----
    "DepDelayMinutes",

    # ---- Operational signal ----
    "Distance",
    "CRSElapsedTime",

    # ---- Cancellation info (to drop)----
    "Cancelled",

    # ---- Arrival time to calculate counts ----
    "ArrTimeBlk",

]

df_2018 = pd.read_parquet(f'{path}/Combined_Flights_2018.parquet', columns=KEEP_COLS)
df_2019 = pd.read_parquet(f'{path}/Combined_Flights_2019.parquet', columns=KEEP_COLS)
df_2020 = pd.read_parquet(f'{path}/Combined_Flights_2020.parquet', columns=KEEP_COLS)
df_2021 = pd.read_parquet(f'{path}/Combined_Flights_2021.parquet', columns=KEEP_COLS)
df_2022 = pd.read_parquet(f'{path}/Combined_Flights_2022.parquet', columns=KEEP_COLS)

Next, we concatenate the yearly frames into a single DataFrame so that the rest of the analysis runs over the whole period.

In [51]:
df = pd.concat([df_2018, df_2019, df_2020, df_2021, df_2022], ignore_index=True)

In [52]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29193782 entries, 0 to 29193781
Data columns (total 10 columns):
 #   Column                       Dtype         
---  ------                       -----         
 0   FlightDate                   datetime64[us]
 1   DepTimeBlk                   object        
 2   Origin                       object        
 3   Dest                         object        
 4   IATA_Code_Operating_Airline  object        
 5   DepDelayMinutes              float64       
 6   Distance                     float64       
 7   CRSElapsedTime               float64       
 8   Cancelled                    bool          
 9   ArrTimeBlk                   object        
dtypes: bool(1), datetime64[us](1), float64(3), object(5)
memory usage: 2.0+ GB
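The combined frame already takes more than 2 GB, and half of its columns are `object` strings. A minimal sketch of one way to shrink it, assuming the airport and airline codes are low-cardinality and the delay values do not need `float64` precision; the toy frame and its values are made up for illustration:

```python
import pandas as pd

# Toy stand-in for the flights table (values are illustrative only).
toy = pd.DataFrame({
    "Origin": ["ABY", "ATL", "ORD", "ATL"] * 1000,
    "Dest": ["ATL", "FAY", "MSN", "FSM"] * 1000,
    "DepDelayMinutes": [0.0, 8.0, 18.0, 0.0] * 1000,
})

before = toy.memory_usage(deep=True).sum()

# Repeated short strings are stored once when the column is categorical.
compact = toy.astype({"Origin": "category", "Dest": "category"})
# Whole-minute delays of this size are represented exactly in float32.
compact["DepDelayMinutes"] = compact["DepDelayMinutes"].astype("float32")

after = compact.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
```

One caveat: grouping on categorical columns may emit rows for unobserved category combinations depending on the pandas version, so passing `observed=True` to `groupby` is the safer choice if this conversion is adopted.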


In [53]:
df.describe()

Unnamed: 0,FlightDate,DepDelayMinutes,Distance,CRSElapsedTime
count,29193782,28430700.0,29193780.0,29193760.0
mean,2020-04-23 22:27:03.485606,12.78311,779.7346,138.7605
min,2018-01-01 00:00:00,0.0,16.0,-292.0
25%,2019-03-18 00:00:00,0.0,354.0,88.0
50%,2020-02-08 00:00:00,0.0,626.0,121.0
75%,2021-07-17 00:00:00,5.0,1014.0,169.0
max,2022-07-31 00:00:00,7223.0,5812.0,1645.0
std,,46.17337,581.2739,70.77316


In [54]:
# TODO: do not train the model until Distance and CRSElapsedTime are confirmed to be obtainable from the API
df.head()

Unnamed: 0,FlightDate,DepTimeBlk,Origin,Dest,IATA_Code_Operating_Airline,DepDelayMinutes,Distance,CRSElapsedTime,Cancelled,ArrTimeBlk
0,2018-01-23,1200-1259,ABY,ATL,9E,0.0,145.0,62.0,False,1300-1359
1,2018-01-24,1200-1259,ABY,ATL,9E,0.0,145.0,62.0,False,1300-1359
2,2018-01-25,1200-1259,ABY,ATL,9E,0.0,145.0,62.0,False,1300-1359
3,2018-01-26,1200-1259,ABY,ATL,9E,0.0,145.0,62.0,False,1300-1359
4,2018-01-27,1400-1459,ABY,ATL,9E,0.0,145.0,60.0,False,1500-1559


Our goal is to predict delay, so cancelled flights are not useful. We filter them out here and then drop the `Cancelled` column, which is constant afterwards.

In [55]:
df = df[df['Cancelled'] == 0].copy()
df.drop(columns=['Cancelled'], inplace=True)
df.head()

Unnamed: 0,FlightDate,DepTimeBlk,Origin,Dest,IATA_Code_Operating_Airline,DepDelayMinutes,Distance,CRSElapsedTime,ArrTimeBlk
0,2018-01-23,1200-1259,ABY,ATL,9E,0.0,145.0,62.0,1300-1359
1,2018-01-24,1200-1259,ABY,ATL,9E,0.0,145.0,62.0,1300-1359
2,2018-01-25,1200-1259,ABY,ATL,9E,0.0,145.0,62.0,1300-1359
3,2018-01-26,1200-1259,ABY,ATL,9E,0.0,145.0,62.0,1300-1359
4,2018-01-27,1400-1459,ABY,ATL,9E,0.0,145.0,60.0,1500-1559


In [56]:
df.shape

(28416515, 9)
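The `describe()` output earlier shows fewer non-null `DepDelayMinutes` values than rows, so it may be worth checking whether any missing targets survive the cancellation filter. A small sketch on a made-up frame; whether such rows exist in the real data is not known from the outputs shown here:

```python
import pandas as pd

# Toy frame: one cancelled flight and one flown flight with no recorded delay.
toy = pd.DataFrame({
    "DepDelayMinutes": [0.0, None, 12.0, None],
    "Cancelled": [False, True, False, False],
})

# Same filter as in the notebook, written with boolean negation.
flown = toy.loc[~toy["Cancelled"]].drop(columns="Cancelled")

# Count flown flights that still lack the target value.
missing_target = int(flown["DepDelayMinutes"].isna().sum())
print("rows without a target:", missing_target)

# Rows without a target cannot be used for training, so drop them.
usable = flown.dropna(subset=["DepDelayMinutes"])
```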

Next, we build hourly timestamps for departure and arrival by combining `FlightDate` with the start hour of each time block.

In [57]:
df['FlightDate'] = pd.to_datetime(df['FlightDate'])
df['Hour'] = df['DepTimeBlk'].str.slice(0, 2).astype(int)
df.drop(columns=['DepTimeBlk'], inplace=True)
df["datetime"] = (
    df["FlightDate"]
    + pd.to_timedelta(df["Hour"], unit="h")
)
df.head()

Unnamed: 0,FlightDate,Origin,Dest,IATA_Code_Operating_Airline,DepDelayMinutes,Distance,CRSElapsedTime,ArrTimeBlk,Hour,datetime
0,2018-01-23,ABY,ATL,9E,0.0,145.0,62.0,1300-1359,12,2018-01-23 12:00:00
1,2018-01-24,ABY,ATL,9E,0.0,145.0,62.0,1300-1359,12,2018-01-24 12:00:00
2,2018-01-25,ABY,ATL,9E,0.0,145.0,62.0,1300-1359,12,2018-01-25 12:00:00
3,2018-01-26,ABY,ATL,9E,0.0,145.0,62.0,1300-1359,12,2018-01-26 12:00:00
4,2018-01-27,ABY,ATL,9E,0.0,145.0,60.0,1500-1559,14,2018-01-27 14:00:00
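Taking the first two characters of the block label assumes that every block is one hour wide. The BTS block labels are believed to include a wider overnight block (`0001-0559`); this is an assumption that can be checked with `df['DepTimeBlk'].unique()` before the column is dropped. The sketch below, on made-up labels, shows how the width of each block could be inspected:

```python
import pandas as pd

# Example labels; "0001-0559" is the assumed overnight block.
blocks = pd.Series(["1200-1259", "0001-0559", "2300-2359"])

# Start hour, as used for the departure timestamp.
start_hour = blocks.str.slice(0, 2).astype(int)
# End hour, taken from the second half of the label.
end_hour = blocks.str.slice(5, 7).astype(int)

# Number of clock hours each block spans.
block_width_hours = end_hour - start_hour + 1
print(block_width_hours.tolist())
```

If the wide block is present, every flight scheduled between midnight and 06:00 lands in hour 0, which inflates any per-hour count for that hour.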


In [58]:
df['Hour_Arrival'] = df['ArrTimeBlk'].str.slice(0, 2).astype(int)
df['arrival_next_day'] = df['Hour_Arrival'] < df['Hour']
df["arr_datetime"] = (
    df["FlightDate"]
    + pd.to_timedelta(df["Hour_Arrival"], unit="h")
    + pd.to_timedelta(df["arrival_next_day"].astype(int), unit="D")
)
df.drop(columns=['ArrTimeBlk', 'Hour_Arrival', 'arrival_next_day', 'Hour', 'FlightDate'], inplace=True)


In [59]:
df.head()

Unnamed: 0,Origin,Dest,IATA_Code_Operating_Airline,DepDelayMinutes,Distance,CRSElapsedTime,datetime,arr_datetime
0,ABY,ATL,9E,0.0,145.0,62.0,2018-01-23 12:00:00,2018-01-23 13:00:00
1,ABY,ATL,9E,0.0,145.0,62.0,2018-01-24 12:00:00,2018-01-24 13:00:00
2,ABY,ATL,9E,0.0,145.0,62.0,2018-01-25 12:00:00,2018-01-25 13:00:00
3,ABY,ATL,9E,0.0,145.0,62.0,2018-01-26 12:00:00,2018-01-26 13:00:00
4,ABY,ATL,9E,0.0,145.0,60.0,2018-01-27 14:00:00,2018-01-27 15:00:00
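The arrival timestamp relies on a simple rule: if the arrival hour is smaller than the departure hour, the flight is assumed to land on the following day. A minimal sketch of that rule on two made-up flights:

```python
import pandas as pd

toy = pd.DataFrame({
    "FlightDate": pd.to_datetime(["2018-01-23", "2018-01-23"]),
    "DepTimeBlk": ["1200-1259", "2300-2359"],
    "ArrTimeBlk": ["1300-1359", "0001-0559"],
})

dep_hour = toy["DepTimeBlk"].str.slice(0, 2).astype(int)
arr_hour = toy["ArrTimeBlk"].str.slice(0, 2).astype(int)

# Arrival hour earlier than departure hour is read as "next calendar day".
next_day = arr_hour < dep_hour

toy["arr_datetime"] = (
    toy["FlightDate"]
    + pd.to_timedelta(arr_hour, unit="h")
    + pd.to_timedelta(next_day.astype(int), unit="D")
)
print(toy[["DepTimeBlk", "ArrTimeBlk", "arr_datetime"]])
```

Because departure and arrival blocks are local to different airports, a short westbound flight across a timezone boundary could also show an earlier arrival hour on the same day; the rule would misplace such flights by one day.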


To measure how busy an airport is in a given hour, we count the flights departing from it and the flights arriving at it within that hour. We group by origin and departure datetime, do the same for destination and arrival datetime, and later add the combined count to the main dataframe as a new feature.

First, we group by 'Origin' and 'datetime' to get the departures per hour for each airport, and by 'Dest' and 'arr_datetime' to get the arrivals per hour.

In [60]:
dep_counts = (
    df.groupby(["Origin", "datetime"])
      .size()
      .reset_index(name="departures_per_hour")
      .rename(columns={"Origin": "airport", "datetime": "hour"})
)

arr_counts = (
    df.groupby(["Dest", "arr_datetime"])
      .size()
      .reset_index(name="arrivals_per_hour")
      .rename(columns={"Dest": "airport", "arr_datetime": "hour"})
)


In [61]:
dep_counts.head()

Unnamed: 0,airport,hour,departures_per_hour
0,ABE,2018-01-01 06:00:00,2
1,ABE,2018-01-01 09:00:00,2
2,ABE,2018-01-01 17:00:00,1
3,ABE,2018-01-01 20:00:00,1
4,ABE,2018-01-02 06:00:00,3


In [62]:
arr_counts.head()

Unnamed: 0,airport,hour,arrivals_per_hour
0,ABE,2018-01-01 09:00:00,1
1,ABE,2018-01-01 16:00:00,1
2,ABE,2018-01-01 17:00:00,2
3,ABE,2018-01-01 19:00:00,1
4,ABE,2018-01-01 22:00:00,1


In [63]:
congestion = (
    dep_counts
    .merge(
        arr_counts,
        on=["airport", "hour"],
        how="outer"
    )
    .fillna(0)
)

congestion["scheduled_congestion"] = (
    congestion["departures_per_hour"]
    + congestion["arrivals_per_hour"]
)

congestion.head(10)

Unnamed: 0,airport,hour,departures_per_hour,arrivals_per_hour,scheduled_congestion
0,ABE,2018-01-01 06:00:00,2.0,0.0,2.0
1,ABE,2018-01-01 09:00:00,2.0,1.0,3.0
2,ABE,2018-01-01 16:00:00,0.0,1.0,1.0
3,ABE,2018-01-01 17:00:00,1.0,2.0,3.0
4,ABE,2018-01-01 19:00:00,0.0,1.0,1.0
5,ABE,2018-01-01 20:00:00,1.0,0.0,1.0
6,ABE,2018-01-01 22:00:00,0.0,1.0,1.0
7,ABE,2018-01-02 06:00:00,3.0,0.0,3.0
8,ABE,2018-01-02 09:00:00,1.0,1.0,2.0
9,ABE,2018-01-02 15:00:00,0.0,1.0,1.0
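The outer merge leaves `NaN` where an airport has only departures or only arrivals in an hour, which is why the counts above appear as floats after `fillna(0)`. A sketch of the same computation on a made-up frame, with the counts cast back to integers:

```python
import pandas as pd

# Three toy flights between two airports.
flights = pd.DataFrame({
    "Origin": ["ABE", "ABE", "ATL"],
    "Dest": ["ATL", "ATL", "ABE"],
    "datetime": pd.to_datetime(["2018-01-01 06:00"] * 3),
    "arr_datetime": pd.to_datetime(["2018-01-01 08:00"] * 3),
})

dep = (
    flights.groupby(["Origin", "datetime"]).size()
    .reset_index(name="departures_per_hour")
    .rename(columns={"Origin": "airport", "datetime": "hour"})
)
arr = (
    flights.groupby(["Dest", "arr_datetime"]).size()
    .reset_index(name="arrivals_per_hour")
    .rename(columns={"Dest": "airport", "arr_datetime": "hour"})
)

count_cols = ["departures_per_hour", "arrivals_per_hour"]
congestion = dep.merge(arr, on=["airport", "hour"], how="outer").fillna(0)
# Counts are whole numbers, so restore an integer dtype after filling.
congestion = congestion.astype({col: "int64" for col in count_cols})
congestion["scheduled_congestion"] = congestion[count_cols].sum(axis=1)
print(congestion)
```

Each flight contributes one departure and one arrival, so the total of `scheduled_congestion` should equal twice the number of flights; that makes a cheap sanity check on the full data.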


In [64]:
congestion[congestion['scheduled_congestion'] == 132]

Unnamed: 0,airport,hour,departures_per_hour,arrivals_per_hour,scheduled_congestion
294598,ATL,2018-01-02 14:00:00,51.0,81.0,132.0
294699,ATL,2018-01-07 20:00:00,48.0,84.0,132.0
294801,ATL,2018-01-13 08:00:00,44.0,88.0,132.0
294820,ATL,2018-01-14 08:00:00,53.0,79.0,132.0
294851,ATL,2018-01-15 20:00:00,42.0,90.0,132.0
...,...,...,...,...,...
4423238,ORD,2022-06-23 18:00:00,53.0,79.0,132.0
4423252,ORD,2022-06-24 13:00:00,71.0,61.0,132.0
4423371,ORD,2022-06-30 18:00:00,49.0,83.0,132.0
4423637,ORD,2022-07-14 18:00:00,54.0,78.0,132.0


Now let's merge the congestion count back into the main dataframe, matching each flight on its origin airport and departure hour.

In [65]:
df = df.merge(
    congestion[["airport", "hour", "scheduled_congestion"]],
    left_on=["Origin", "datetime"],
    right_on=["airport", "hour"],
    how="left",
    validate="many_to_one"
)
df

Unnamed: 0,Origin,Dest,IATA_Code_Operating_Airline,DepDelayMinutes,Distance,CRSElapsedTime,datetime,arr_datetime,airport,hour,scheduled_congestion
0,ABY,ATL,9E,0.0,145.0,62.0,2018-01-23 12:00:00,2018-01-23 13:00:00,ABY,2018-01-23 12:00:00,1.0
1,ABY,ATL,9E,0.0,145.0,62.0,2018-01-24 12:00:00,2018-01-24 13:00:00,ABY,2018-01-24 12:00:00,1.0
2,ABY,ATL,9E,0.0,145.0,62.0,2018-01-25 12:00:00,2018-01-25 13:00:00,ABY,2018-01-25 12:00:00,1.0
3,ABY,ATL,9E,0.0,145.0,62.0,2018-01-26 12:00:00,2018-01-26 13:00:00,ABY,2018-01-26 12:00:00,1.0
4,ABY,ATL,9E,0.0,145.0,60.0,2018-01-27 14:00:00,2018-01-27 15:00:00,ABY,2018-01-27 14:00:00,1.0
...,...,...,...,...,...,...,...,...,...,...,...
28416510,EWR,MEM,YX,154.0,946.0,182.0,2022-03-19 20:00:00,2022-03-19 22:00:00,EWR,2022-03-19 20:00:00,30.0
28416511,MSY,EWR,YX,25.0,1167.0,185.0,2022-03-31 19:00:00,2022-03-31 23:00:00,MSY,2022-03-31 19:00:00,11.0
28416512,ALB,ORD,YX,378.0,723.0,158.0,2022-03-08 17:00:00,2022-03-08 18:00:00,ALB,2022-03-08 17:00:00,6.0
28416513,EWR,PIT,YX,113.0,319.0,86.0,2022-03-25 21:00:00,2022-03-25 22:00:00,EWR,2022-03-25 21:00:00,34.0
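After the merge, `airport` and `hour` duplicate `Origin` and `datetime`, and every flight should have found a congestion value because the counts were derived from the same rows. A sketch of how both points could be handled, using made-up frames:

```python
import pandas as pd

flights = pd.DataFrame({
    "Origin": ["ABY", "EWR", "EWR"],
    "datetime": pd.to_datetime(
        ["2018-01-23 12:00", "2022-03-19 20:00", "2022-03-19 20:00"]
    ),
})
congestion = pd.DataFrame({
    "airport": ["ABY", "EWR"],
    "hour": pd.to_datetime(["2018-01-23 12:00", "2022-03-19 20:00"]),
    "scheduled_congestion": [1, 30],
})

merged = flights.merge(
    congestion,
    left_on=["Origin", "datetime"],
    right_on=["airport", "hour"],
    how="left",
    validate="many_to_one",  # fails if congestion has duplicate keys
)

# Every flight should match its own airport-hour row.
unmatched = int(merged["scheduled_congestion"].isna().sum())

# The join keys from the right side carry no new information.
merged = merged.drop(columns=["airport", "hour"])
print(merged)
```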


In [66]:
df[df['scheduled_congestion'] == 132]

Unnamed: 0,Origin,Dest,IATA_Code_Operating_Airline,DepDelayMinutes,Distance,CRSElapsedTime,datetime,arr_datetime,airport,hour,scheduled_congestion
563,ATL,FAY,9E,0.0,331.0,76.0,2018-01-13 08:00:00,2018-01-13 09:00:00,ATL,2018-01-13 08:00:00,132.0
564,ATL,FAY,9E,8.0,331.0,78.0,2018-01-14 08:00:00,2018-01-14 09:00:00,ATL,2018-01-14 08:00:00,132.0
576,ATL,FAY,9E,47.0,331.0,76.0,2018-01-27 08:00:00,2018-01-27 09:00:00,ATL,2018-01-27 08:00:00,132.0
657,ATL,FSM,9E,0.0,579.0,123.0,2018-01-02 14:00:00,2018-01-02 15:00:00,ATL,2018-01-02 14:00:00,132.0
1235,ATL,FAY,9E,0.0,331.0,75.0,2018-01-25 13:00:00,2018-01-25 14:00:00,ATL,2018-01-25 13:00:00,132.0
...,...,...,...,...,...,...,...,...,...,...,...
28415352,ORD,MSN,YX,18.0,109.0,57.0,2022-03-28 09:00:00,2022-03-28 10:00:00,ORD,2022-03-28 09:00:00,132.0
28416213,ORD,PIT,YX,9.0,413.0,95.0,2022-03-31 18:00:00,2022-03-31 21:00:00,ORD,2022-03-31 18:00:00,132.0
28416240,ORD,IND,YX,0.0,177.0,72.0,2022-03-31 18:00:00,2022-03-31 20:00:00,ORD,2022-03-31 18:00:00,132.0
28416330,ORD,CMH,YX,0.0,296.0,85.0,2022-03-31 18:00:00,2022-03-31 21:00:00,ORD,2022-03-31 18:00:00,132.0


Now we can add weather data to our dataset. The helper classes imported below were written to simplify the API calls from the notebook.

In [33]:
%%capture
!pip install -r requirements.txt

In [34]:
from openmeteo_api.src.openmeteoapi.WeatherData import Weather
from openmeteo_api.src.openmeteoapi.APICaller import OpenMeteoAPICaller
import os
from dotenv import load_dotenv

The flight delay dataset covers 2018 to 2022, so we only need to fetch the weather data for that period for each unique airport and then join it to the flights.

In [35]:
print(set(df['Origin'].unique()) == set(df['Dest'].unique()))
print(len(set(df['Origin'].unique())))
# print(df[df['Origin']== "ISN"])

True
388


In [36]:
from tqdm import tqdm
import time

BATCH_SIZE = 10
airports = list(df['Origin'].unique())
start_date = "2018-01-01"
end_date = "2022-12-31"
api_caller_weather = OpenMeteoAPICaller()

dfs = []

# Request hourly weather in batches of BATCH_SIZE airports
for i in tqdm(range(0, len(airports), BATCH_SIZE)):
    airport_list = airports[i:i+BATCH_SIZE]
    print(airport_list)
    w = Weather(
        api_caller=api_caller_weather,
        airport_code=airport_list,
        code_type="iata",
        start_date=start_date,
        end_date=end_date,
    )

    # Up to three attempts per batch; wait a minute when a rate limit is hit
    for attempt in range(3):
        try:
            df_weather = w.to_hourly_dataframe()
            dfs.append(df_weather)
            break
        except Exception as e:
            if "limit" in str(e).lower():
                time.sleep(60)
            else:
                raise
    else:
        # All attempts were rate-limited: fail instead of silently skipping the batch
        raise RuntimeError(f"Weather download failed for {airport_list}")

    time.sleep(5)

weather_df = pd.concat(dfs, ignore_index=True)

  0%|          | 0/39 [00:00<?, ?it/s]

['ABY', 'ATL', 'MOB', 'BUF', 'DFW', 'BTV', 'CVG', 'LGA', 'CHO', 'EWN']


  3%|▎         | 1/39 [00:07<04:46,  7.53s/it]

['MCI', 'MGM', 'MSP', 'DCA', 'FAY', 'OAJ', 'STL', 'CWA', 'DTW', 'RDU']


  3%|▎         | 1/39 [00:10<06:46, 10.70s/it]


KeyboardInterrupt: 
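The retry logic in the download loop can be isolated into a small helper, which makes it testable without calling the weather API. The sketch below is a hypothetical helper (`call_with_retry` is not part of the project's packages), and the error message used in the demonstration is invented:

```python
import time


def call_with_retry(fn, max_attempts=3, wait_seconds=60.0, sleep=time.sleep):
    """Call fn(); wait and retry on rate-limit style errors, re-raise otherwise."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            is_rate_limit = "limit" in str(exc).lower()
            # Give up on unrelated errors or once the attempts are used up.
            if not is_rate_limit or attempt == max_attempts:
                raise
            # Linear backoff: wait longer after each failed attempt.
            sleep(wait_seconds * attempt)


# Demonstration with a stand-in that fails twice before succeeding.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("request limit exceeded")  # made-up message
    return "ok"


waits = []  # record the waits instead of actually sleeping
result = call_with_retry(flaky_fetch, sleep=waits.append)
print(result, waits)
```

In the notebook this could wrap `w.to_hourly_dataframe`, so that a batch which keeps hitting the limit raises an error rather than being left out of `dfs`.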

In [67]:
weather_df = pd.read_parquet("weather.parquet")

In [40]:
weather_df[weather_df["queried_airport_code"] == "JFK"].head(10)

Unnamed: 0,date,snowfall,rain,precipitation,wind_speed_10m,wind_gusts_10m,cloud_cover_low,cloud_cover,temperature_2m,apparent_temperature,surface_pressure,relative_humidity_2m,pressure_msl,queried_airport_code
4820640,2018-01-01 00:00:00-05:00,0.0,0.0,0.0,18.59845,32.399998,0.0,0.0,-13.1,-19.658886,1026.895264,51.989731,1027.300049,JFK
4820641,2018-01-01 01:00:00-05:00,0.0,0.0,0.0,18.11841,31.319998,0.0,0.0,-13.3,-19.781853,1027.194824,53.764278,1027.599976,JFK
4820642,2018-01-01 02:00:00-05:00,0.0,0.0,0.0,18.11841,30.599998,0.0,0.0,-13.5,-19.981882,1027.594238,54.644852,1028.0,JFK
4820643,2018-01-01 03:00:00-05:00,0.0,0.0,0.0,17.826363,30.239998,0.0,0.0,-13.7,-20.141098,1027.693848,55.301258,1028.099976,JFK
4820644,2018-01-01 04:00:00-05:00,0.0,0.0,0.0,18.584509,30.960001,0.0,0.0,-13.9,-20.455217,1027.593628,55.7248,1028.0,JFK
4820645,2018-01-01 05:00:00-05:00,0.0,0.0,0.0,19.107151,32.039997,0.0,0.0,-14.05,-20.688435,1027.793213,55.440041,1028.199951,JFK
4820646,2018-01-01 06:00:00-05:00,0.0,0.0,0.0,19.35314,32.399998,0.0,0.0,-14.1,-20.781181,1027.993164,54.706738,1028.400024,JFK
4820647,2018-01-01 07:00:00-05:00,0.0,0.0,0.0,19.602652,32.399998,0.0,0.0,-14.0,-20.705467,1028.293335,55.93824,1028.699951,JFK
4820648,2018-01-01 08:00:00-05:00,0.0,0.0,0.0,19.862083,33.48,0.0,0.0,-13.55,-20.305321,1028.793701,52.307777,1029.199951,JFK
4820649,2018-01-01 09:00:00-05:00,0.0,0.0,0.0,22.183128,38.880001,0.0,0.0,-12.45,-19.562,1029.095337,45.608284,1029.5,JFK


In [68]:
import airportsdata

IATA_DB = airportsdata.load("IATA")

def airport_code_to_timezone(airport_code):
    airport = IATA_DB.get(airport_code)
    if airport is None:
        return None
    return airport.get("tz")

In [72]:
df['timezone'] = df['Origin'].map(airport_code_to_timezone)

In [79]:
df[df['timezone'].isna()]['Origin'].unique()[0]

'ISN'

This tells us that the only airport with a missing timezone is ISN, so we can simply hard-code it.

In [85]:
df.loc[df["Origin"] == "ISN", 'timezone'] = "America/Chicago"
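A hard-coded fix is easier to maintain when it lives in one override table that the lookup consults first. The sketch below uses a small dict as a stand-in for `airportsdata.load("IATA")` (same code-to-record shape, made-up subset), and `TZ_OVERRIDES` is a hypothetical name:

```python
import pandas as pd

# Stand-in for the airportsdata IATA table; only two airports included.
FAKE_IATA_DB = {
    "ABY": {"tz": "America/New_York"},
    "ORD": {"tz": "America/Chicago"},
}

# Manual fixes for codes missing from the lookup table.
TZ_OVERRIDES = {"ISN": "America/Chicago"}


def airport_code_to_timezone(code, db=FAKE_IATA_DB, overrides=TZ_OVERRIDES):
    """Return the IANA timezone for an IATA code, or None if unknown."""
    if code in overrides:
        return overrides[code]
    record = db.get(code)
    return None if record is None else record.get("tz")


origins = pd.Series(["ABY", "ISN", "ORD", "XXX"])
timezones = origins.map(airport_code_to_timezone)

# List every code that is still unresolved, not just the first one.
unresolved = sorted(origins[timezones.isna()].unique())
print(unresolved)
```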

In [87]:
df[df['Origin'] == "ISN"].head()

Unnamed: 0,Origin,Dest,IATA_Code_Operating_Airline,DepDelayMinutes,Distance,CRSElapsedTime,datetime,arr_datetime,airport,hour,scheduled_congestion,timezone
24218,ISN,DEN,OO,0.0,576.0,118.0,2018-01-19 17:00:00,2018-01-19 18:00:00,ISN,2018-01-19 17:00:00,1.0,America/Chicago
24856,ISN,MSP,OO,0.0,546.0,121.0,2018-01-20 06:00:00,2018-01-20 08:00:00,ISN,2018-01-20 06:00:00,1.0,America/Chicago
25798,ISN,MSP,OO,14.0,546.0,119.0,2018-01-05 12:00:00,2018-01-05 14:00:00,ISN,2018-01-05 12:00:00,2.0,America/Chicago
25827,ISN,MSP,OO,0.0,546.0,123.0,2018-01-05 06:00:00,2018-01-05 08:00:00,ISN,2018-01-05 06:00:00,1.0,America/Chicago
27503,ISN,MSP,OO,0.0,546.0,121.0,2018-01-06 06:00:00,2018-01-06 08:00:00,ISN,2018-01-06 06:00:00,1.0,America/Chicago


In [92]:
df.head()

Unnamed: 0,Origin,Dest,IATA_Code_Operating_Airline,DepDelayMinutes,Distance,CRSElapsedTime,datetime,arr_datetime,airport,hour,scheduled_congestion,timezone
0,ABY,ATL,9E,0.0,145.0,62.0,2018-01-23 12:00:00,2018-01-23 13:00:00,ABY,2018-01-23 12:00:00,1.0,America/New_York
1,ABY,ATL,9E,0.0,145.0,62.0,2018-01-24 12:00:00,2018-01-24 13:00:00,ABY,2018-01-24 12:00:00,1.0,America/New_York
2,ABY,ATL,9E,0.0,145.0,62.0,2018-01-25 12:00:00,2018-01-25 13:00:00,ABY,2018-01-25 12:00:00,1.0,America/New_York
3,ABY,ATL,9E,0.0,145.0,62.0,2018-01-26 12:00:00,2018-01-26 13:00:00,ABY,2018-01-26 12:00:00,1.0,America/New_York
4,ABY,ATL,9E,0.0,145.0,60.0,2018-01-27 14:00:00,2018-01-27 15:00:00,ABY,2018-01-27 14:00:00,1.0,America/New_York


In [93]:
weather_df.head()

Unnamed: 0,date,snowfall,rain,precipitation,wind_speed_10m,wind_gusts_10m,cloud_cover_low,cloud_cover,temperature_2m,apparent_temperature,surface_pressure,relative_humidity_2m,pressure_msl,queried_airport_code
0,2018-01-01 00:00:00-05:00,0.0,0.0,0.0,16.575644,30.960001,100.0,100.0,3.65,-0.640734,1017.892456,86.149536,1025.199951,ABY
1,2018-01-01 01:00:00-05:00,0.0,0.0,0.0,16.575644,31.68,100.0,100.0,2.85,-1.760402,1018.070068,78.900909,1025.400024,ABY
2,2018-01-01 02:00:00-05:00,0.0,0.0,0.0,18.175545,34.919998,100.0,100.0,2.3,-2.882902,1018.353088,68.517944,1025.699951,ABY
3,2018-01-01 03:00:00-05:00,0.0,0.0,0.0,18.003599,35.639999,99.0,99.0,1.6,-3.862596,1018.831116,59.252598,1026.199951,ABY
4,2018-01-01 04:00:00-05:00,0.0,0.0,0.0,18.003599,36.0,72.0,96.0,0.9,-4.777263,1018.91156,52.844017,1026.300049,ABY


In [94]:
weather_df["date_local"] = weather_df["date"].dt.tz_localize(None)

In [95]:
weather_df.head()

Unnamed: 0,date,snowfall,rain,precipitation,wind_speed_10m,wind_gusts_10m,cloud_cover_low,cloud_cover,temperature_2m,apparent_temperature,surface_pressure,relative_humidity_2m,pressure_msl,queried_airport_code,date_local
0,2018-01-01 00:00:00-05:00,0.0,0.0,0.0,16.575644,30.960001,100.0,100.0,3.65,-0.640734,1017.892456,86.149536,1025.199951,ABY,2018-01-01 00:00:00
1,2018-01-01 01:00:00-05:00,0.0,0.0,0.0,16.575644,31.68,100.0,100.0,2.85,-1.760402,1018.070068,78.900909,1025.400024,ABY,2018-01-01 01:00:00
2,2018-01-01 02:00:00-05:00,0.0,0.0,0.0,18.175545,34.919998,100.0,100.0,2.3,-2.882902,1018.353088,68.517944,1025.699951,ABY,2018-01-01 02:00:00
3,2018-01-01 03:00:00-05:00,0.0,0.0,0.0,18.003599,35.639999,99.0,99.0,1.6,-3.862596,1018.831116,59.252598,1026.199951,ABY,2018-01-01 03:00:00
4,2018-01-01 04:00:00-05:00,0.0,0.0,0.0,18.003599,36.0,72.0,96.0,0.9,-4.777263,1018.91156,52.844017,1026.300049,ABY,2018-01-01 04:00:00
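`tz_localize(None)` drops the offset and keeps the wall-clock time of whatever timezone the column carries. The rows shown (ABY and JFK) are both Eastern, so it is not visible here whether airports in other zones are stored in their own local time. If they are not, a per-airport conversion would be needed before stripping the offset. A sketch under that assumption, with made-up rows stored in UTC and a hypothetical `tz_by_airport` mapping:

```python
import pandas as pd

# Two toy weather rows stored in UTC.
weather = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-01 05:00", "2018-01-01 06:00"], utc=True),
    "queried_airport_code": ["JFK", "ORD"],
})
tz_by_airport = {"JFK": "America/New_York", "ORD": "America/Chicago"}

parts = []
for code, group in weather.groupby("queried_airport_code"):
    # Convert to the airport's own zone, then drop the offset.
    local = group["date"].dt.tz_convert(tz_by_airport[code]).dt.tz_localize(None)
    parts.append(local)

# The pieces keep their original index, so assignment aligns row by row.
weather["date_local"] = pd.concat(parts)
print(weather)
```

Both rows end up at local midnight on 2018-01-01, which is the form that matches the naive `datetime` column built from the flight schedule.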


In [None]:
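# Join on airport and hour; matching on the hour alone would pair each flight with every airport's weather
df.merge(
    weather_df,
    left_on=["Origin", "datetime"],
    right_on=["queried_airport_code", "date_local"],
    how="left",
)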
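The choice of join keys matters a great deal for this merge. A small sketch on made-up frames compares a join on the timestamp alone with a join on airport and timestamp together:

```python
import pandas as pd

hour = pd.Timestamp("2018-01-01 09:00")

flights = pd.DataFrame({
    "Origin": ["JFK", "ORD"],
    "datetime": [hour, hour],
    "DepDelayMinutes": [0.0, 18.0],
})
weather = pd.DataFrame({
    "queried_airport_code": ["JFK", "ORD"],
    "date_local": [hour, hour],
    "temperature_2m": [-12.45, -5.0],  # illustrative values
})

# Timestamp only: each flight is paired with every airport's weather.
time_only = flights.merge(weather, left_on="datetime", right_on="date_local")

# Airport and timestamp: one weather row per flight.
keyed = flights.merge(
    weather,
    left_on=["Origin", "datetime"],
    right_on=["queried_airport_code", "date_local"],
    how="left",
    validate="many_to_one",
)
print(len(time_only), len(keyed))
```

With 388 airports and roughly 28 million flights, the timestamp-only join would multiply the row count by the number of airports reporting in each hour, so the keyed form is the one that fits in memory and keeps one row per flight.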