# INTRODUCTION

This notebook is the completed data loading and joining notebook. Unlike the previous notebooks, this notebook uses the 2020-2025 flight delay data, weather data for origin airports, and airport codes with their latitudes and longitudes.

First, we'll walk through loading each dataset from the "raw" data. As an important note, we will be switching between the `../data/raw/` and `../data/intermediate/` directories. The scraped delay data files are in the raw directory, but the combined weather data was pulled in batches from the OpenMeteo API and needed some preprocessing, so it is in the intermediate directory.

For cleaning and EDA in your local coding environment, you can just use `pd.read_csv` or `duckdb.read_csv` + `duckdb.sql` on the datasets in the intermediate folder. The intermediate folder contains three datasets:

1. `airport_codes.csv` which contains a dataset of airport codes and their latitudes and longitudes.
2. `delays_PHL_2020_2025.csv` which contains all flights arriving at PHL between 2020 and 2025. 
3. `origin_weather_data.csv` which contains weather data from 2020 to 2025 for every unique origin airport. 

There will also be a final combined dataset (before pre-processing) in the `../data/processed/` directory. We will create that file in **this** notebook. This notebook will be focused on cleanly writing the code that will be used in the project's final Colab notebook, so it will act as if we are working from the raw data files.

In [1]:
# Imports
import pandas as pd
import duckdb

## LOAD DELAY DATA

To load the flight delay data, we need to read in every delay CSV file in `../data/raw/`. Each CSV file contains information about flights in a given month/year. They all share the same naming convention of `{YYYY}-{MM}.csv`, which we can take advantage of to load them all into a single duckdb view (`data_view`) before querying for just the flights arriving at PHL for our raw data frame.

In [2]:
%%time

# Because the flight delay data has EVERY variable from the 
# OST database, we need to pick just the ones we want to look at
COLUMNS_TO_SELECT = [
    "FlightDate", "DOT_ID_Reporting_Airline", "Tail_Number", 
    "Flight_Number_Reporting_Airline", "OriginAirportID", "Origin",
    "DestAirportID", "Dest", "CRSDepTime",
    "DepTime", "DepDelay", "TaxiOut",
    "WheelsOff", "WheelsOn", "TaxiIn",
    "CRSArrTime", "ArrDelay", "Cancelled",
    "CancellationCode", "Diverted", "CRSElapsedTime",
    "ActualElapsedTime", "AirTime", "Distance",
    "CarrierDelay", "WeatherDelay", "NASDelay",
    "SecurityDelay","LateAircraftDelay"
]

# Define a glob pattern matching every {YYYY}-{MM}.csv file
# we want to load into a view in duckdb
data_view = duckdb.read_csv(
    "../data/raw/[0-9][0-9][0-9][0-9]-[0-9][0-9].csv",
    auto_detect = True
)


# Programmatically define our query by joining our list of columns into a string
# of comma-separated column names to pass into the SQL query.
# The string is built outside the f-string because reusing double quotes
# inside an f-string is only valid on Python 3.12+.
columns_sql = ", ".join(COLUMNS_TO_SELECT)
query = f"SELECT {columns_sql} FROM data_view WHERE Dest = 'PHL'"

# Use duckdb.sql to query the dataview and then store result in a pandas dataframe
delays_raw = duckdb.sql(query).df()


CPU times: user 1min 19s, sys: 5.09 s, total: 1min 25s
Wall time: 11.1 s
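
As a quick sanity check on the naming convention, the same bracket pattern can be tried against a few file names with Python's standard `fnmatch` module. This is only a sketch: the file names below are examples chosen for illustration, and it assumes `fnmatch` treats this simple pattern the same way duckdb's glob matching does.

```python
from fnmatch import fnmatch

# Same pattern as the file-name part of the path passed to duckdb.read_csv
PATTERN = "[0-9][0-9][0-9][0-9]-[0-9][0-9].csv"

# Example file names, for illustration only
candidates = [
    "2020-01.csv",
    "2025-07.csv",
    "airports.csv",
    "destination_weather_data.csv",
]

# Keep only the names that follow the {YYYY}-{MM}.csv convention
matched = [name for name in candidates if fnmatch(name, PATTERN)]
print(matched)
```

Only the monthly delay files should match, which is why the other CSV files in the raw directory are not pulled into the view.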


## LOAD AIRPORT CODE DATA

The airport code dataset is relatively small, so duckdb doesn't provide any particular performance improvement. I've included both methods with their execution times below just to illustrate.

In [3]:
%%time

airport_codes_raw = pd.read_csv("../data/raw/airports.csv")

CPU times: user 15.9 ms, sys: 8.24 ms, total: 24.1 ms
Wall time: 25.9 ms


In [4]:
%%time

data_view = duckdb.read_csv(
    "../data/raw/airports.csv",
    auto_detect = True
)

airport_codes_raw = duckdb.sql("""
                               SELECT * 
                               FROM data_view;
                               """).df()

CPU times: user 42.2 ms, sys: 4.46 ms, total: 46.7 ms
Wall time: 41.5 ms


## LOAD ORIGIN WEATHER DATA
The weather data is reasonably large, so there might be some benefit to using duckdb over `pd.read_csv`. I'll run both methods to see the difference directly. From the results below, there's actually not much of a difference between the two (about 130 ms of wall time either way). For the Colab notebook, it'll be simpler to just use `pd.read_csv()`.

In [5]:
%%time

origin_weather_raw = pd.read_csv("../data/intermediate/origin_weather_data.csv")

CPU times: user 105 ms, sys: 23.8 ms, total: 129 ms
Wall time: 130 ms


In [6]:
%%time

data_view = duckdb.read_csv(
    "../data/intermediate/origin_weather_data.csv",
    auto_detect = True
)

origin_weather_raw = duckdb.sql("SELECT * FROM data_view;").df()
origin_weather_raw = origin_weather_raw.drop(columns = ["column00"])

CPU times: user 170 ms, sys: 33 ms, total: 203 ms
Wall time: 132 ms


## LOAD DESTINATION WEATHER DATA
The weather data for the destination is much smaller since it's only for PHL.

In [7]:
%%time

dest_weather_raw = pd.read_csv("../data/raw/destination_weather_data.csv")
dest_weather_raw = dest_weather_raw.drop(columns = ["Unnamed: 0"])

CPU times: user 5.1 ms, sys: 4.64 ms, total: 9.74 ms
Wall time: 9.52 ms
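
The `Unnamed: 0` column typically appears when a CSV was written together with its DataFrame index. A minimal sketch of an alternative to dropping it, assuming the first column of the file really is that saved index: pass `index_col=0` to `pd.read_csv`. The CSV text below is made up for illustration.

```python
import io

import pandas as pd

# Toy CSV written with an index column (note the empty first header field)
csv_text = ",dest_time,dest_weather_code\n0,2020-01-01,3\n1,2020-01-02,3\n"

# Default read: the saved index shows up as an "Unnamed: 0" column
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 treats the first column as the index instead
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```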


## BRIEF LOOK AT DELAY DATA
In this section, we'll briefly look at the raw delay data to get an idea of how many rows, columns, NA values, etc. there are. 

In [8]:
delays_raw.head()

Unnamed: 0,FlightDate,DOT_ID_Reporting_Airline,Tail_Number,Flight_Number_Reporting_Airline,OriginAirportID,Origin,DestAirportID,Dest,CRSDepTime,DepTime,...,Diverted,CRSElapsedTime,ActualElapsedTime,AirTime,Distance,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
0,2020-01-01,20409,N655JB,976,11697,FLL,14100,PHL,2152,2143,...,0.0,158.0,166.0,130.0,992.0,,,,,
1,2020-01-02,20409,N591JB,976,11697,FLL,14100,PHL,2152,2152,...,0.0,158.0,147.0,128.0,992.0,,,,,
2,2020-01-03,20409,N657JB,976,11697,FLL,14100,PHL,2152,2150,...,0.0,158.0,143.0,124.0,992.0,,,,,
3,2020-01-04,20409,N709JB,976,11697,FLL,14100,PHL,2152,2215,...,0.0,158.0,134.0,119.0,992.0,,,,,
4,2020-01-05,20409,N627JB,976,11697,FLL,14100,PHL,2152,2149,...,0.0,158.0,153.0,131.0,992.0,,,,,


In [9]:
delays_raw.shape

(488392, 29)

In [10]:
delays_raw.describe()

Unnamed: 0,FlightDate,DOT_ID_Reporting_Airline,Flight_Number_Reporting_Airline,OriginAirportID,DestAirportID,DepDelay,TaxiOut,TaxiIn,ArrDelay,Cancelled,Diverted,CRSElapsedTime,ActualElapsedTime,AirTime,Distance,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
count,488392,488392.0,488392.0,488392.0,488392.0,476427.0,476198.0,476153.0,475374.0,488392.0,488392.0,488392.0,475374.0,475374.0,488392.0,94035.0,94035.0,94035.0,94035.0,94035.0
mean,2022-12-18 09:04:10.677324,20065.29539,2673.708894,12619.935425,14100.0,13.075508,17.261234,7.514561,7.518152,0.025002,0.001652,149.178987,143.829379,119.064278,892.80055,28.343393,3.500388,12.907215,0.179508,35.9829
min,2020-01-01 00:00:00,19393.0,6.0,10154.0,14100.0,-56.0,1.0,1.0,-88.0,0.0,0.0,-56.0,35.0,18.0,80.0,0.0,0.0,0.0,0.0,0.0
25%,2021-08-27 00:00:00,19805.0,1300.0,11066.0,14100.0,-7.0,12.0,5.0,-16.0,0.0,0.0,102.0,98.0,72.0,453.0,0.0,0.0,0.0,0.0,0.0
50%,2023-01-11 00:00:00,19805.0,2301.0,12892.0,14100.0,-3.0,15.0,6.0,-7.0,0.0,0.0,129.0,127.0,101.0,690.0,3.0,0.0,1.0,0.0,3.0
75%,2024-05-20 00:00:00,20416.0,4508.0,13931.0,14100.0,7.0,19.0,9.0,8.0,0.0,0.0,170.0,168.0,141.0,1013.0,22.0,0.0,17.0,0.0,39.0
max,2025-07-31 00:00:00,20452.0,8815.0,15919.0,14100.0,3403.0,179.0,296.0,3407.0,1.0,1.0,397.0,584.0,555.0,2522.0,3403.0,1332.0,1217.0,277.0,2557.0
std,,335.343967,1715.869855,1561.446414,0.0,67.863894,8.774088,5.535872,69.242305,0.156133,0.040616,63.608243,63.931164,62.623689,588.519116,97.634385,27.082548,29.851321,2.960894,84.952342


In [11]:
delays_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 488392 entries, 0 to 488391
Data columns (total 29 columns):
 #   Column                           Non-Null Count   Dtype         
---  ------                           --------------   -----         
 0   FlightDate                       488392 non-null  datetime64[us]
 1   DOT_ID_Reporting_Airline         488392 non-null  int64         
 2   Tail_Number                      484298 non-null  object        
 3   Flight_Number_Reporting_Airline  488392 non-null  int64         
 4   OriginAirportID                  488392 non-null  int64         
 5   Origin                           488392 non-null  object        
 6   DestAirportID                    488392 non-null  int64         
 7   Dest                             488392 non-null  object        
 8   CRSDepTime                       488392 non-null  object        
 9   DepTime                          476427 non-null  object        
 10  DepDelay                         476427 non-

In [12]:
delays_raw.isna().sum()

FlightDate                              0
DOT_ID_Reporting_Airline                0
Tail_Number                          4094
Flight_Number_Reporting_Airline         0
OriginAirportID                         0
Origin                                  0
DestAirportID                           0
Dest                                    0
CRSDepTime                              0
DepTime                             11965
DepDelay                            11965
TaxiOut                             12194
WheelsOff                           12194
WheelsOn                            12239
TaxiIn                              12239
CRSArrTime                              0
ArrDelay                            13018
Cancelled                               0
CancellationCode                   476181
Diverted                                0
CRSElapsedTime                          0
ActualElapsedTime                   13018
AirTime                             13018
Distance                          

In [13]:
delays_raw.columns

Index(['FlightDate', 'DOT_ID_Reporting_Airline', 'Tail_Number',
       'Flight_Number_Reporting_Airline', 'OriginAirportID', 'Origin',
       'DestAirportID', 'Dest', 'CRSDepTime', 'DepTime', 'DepDelay', 'TaxiOut',
       'WheelsOff', 'WheelsOn', 'TaxiIn', 'CRSArrTime', 'ArrDelay',
       'Cancelled', 'CancellationCode', 'Diverted', 'CRSElapsedTime',
       'ActualElapsedTime', 'AirTime', 'Distance', 'CarrierDelay',
       'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay'],
      dtype='object')

## COMBINE DATASETS
In this section, we'll use pd.merge to combine the datasets into a unified dataset that we can clean, explore, visualize, and model from.

In [14]:
# First create copies of all datasets so we're not using the raw objects
delays_df = delays_raw.copy()
airport_codes_df = airport_codes_raw.copy()
origin_weather_df = origin_weather_raw.copy()
dest_weather_df = dest_weather_raw.copy()

To join the origin weather data, our left table needs both a date and a latitude/longitude pair. So, we will left join `airport_codes_df` to `delays_df` on `delays_df.Origin == airport_codes_df.code`. We'll keep only the columns we need from `airport_codes_df` and prefix them with `origin_`, so that we're left with basically `delays_df` plus new `origin_latitude` and `origin_longitude` columns (along with `origin_code` and `origin_name`).

In [15]:
# Look at columns we want to add to delays
airport_codes_df[["code", "name", "latitude", "longitude"]].head()

Unnamed: 0,code,name,latitude,longitude
0,AAA,Anaa,-17.350665,-145.51112
1,AAB,Arrabury Airport,-26.696783,141.049092
2,AAC,El Arish International Airport,31.074284,33.829172
3,AAD,Adado Airport,6.096286,46.637708
4,AAE,Les Salines Airport,36.821392,7.811857


In [16]:
# Join airport_codes_df to delays_df -> use indexing to only select relevant columns from airport_codes_df

delays_df1 = pd.merge(left = delays_df, right = airport_codes_df[["code", "name", "latitude", "longitude"]].add_prefix("origin_"),
                      how = "left", left_on = "Origin", right_on = "origin_code")

delays_df1.head()

Unnamed: 0,FlightDate,DOT_ID_Reporting_Airline,Tail_Number,Flight_Number_Reporting_Airline,OriginAirportID,Origin,DestAirportID,Dest,CRSDepTime,DepTime,...,Distance,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay,origin_code,origin_name,origin_latitude,origin_longitude
0,2020-01-01,20409,N655JB,976,11697,FLL,14100,PHL,2152,2143,...,992.0,,,,,,FLL,Fort Lauderdale-Hollywood International Airport,26.072017,-80.150997
1,2020-01-02,20409,N591JB,976,11697,FLL,14100,PHL,2152,2152,...,992.0,,,,,,FLL,Fort Lauderdale-Hollywood International Airport,26.072017,-80.150997
2,2020-01-03,20409,N657JB,976,11697,FLL,14100,PHL,2152,2150,...,992.0,,,,,,FLL,Fort Lauderdale-Hollywood International Airport,26.072017,-80.150997
3,2020-01-04,20409,N709JB,976,11697,FLL,14100,PHL,2152,2215,...,992.0,,,,,,FLL,Fort Lauderdale-Hollywood International Airport,26.072017,-80.150997
4,2020-01-05,20409,N627JB,976,11697,FLL,14100,PHL,2152,2149,...,992.0,,,,,,FLL,Fort Lauderdale-Hollywood International Airport,26.072017,-80.150997
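
To make the behavior of this left join concrete, here is a small sketch on toy data. The frames `flights` and `airports` are hypothetical stand-ins for `delays_df` and `airport_codes_df`, and the values (including the unmatched code `XXX`) are illustrative only.

```python
import pandas as pd

# Toy stand-ins for delays_df and airport_codes_df (values are illustrative)
flights = pd.DataFrame(
    {"Origin": ["FLL", "MIA", "XXX"], "ArrDelay": [5.0, -3.0, 12.0]}
)
airports = pd.DataFrame(
    {
        "code": ["FLL", "MIA"],
        "latitude": [26.072017, 25.79],
        "longitude": [-80.150997, -80.29],
    }
)

# Prefix the right-hand columns so they are easy to identify after the join
joined = pd.merge(
    left=flights,
    right=airports.add_prefix("origin_"),
    how="left",
    left_on="Origin",
    right_on="origin_code",
)

# A left join keeps every flight; codes with no match get NaN coordinates
print(len(joined) == len(flights))
print(joined["origin_latitude"].isna().sum())
```

Counting the NaN values in `origin_latitude` after the real join is a cheap way to spot origin codes that are missing from the airport table.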


Now that we have a latitude and longitude column for each origin airport, we can join our new dataframe with `origin_weather_df`. First, we need to unpack the tuple `origin_lat_long` column into separate latitude and longitude columns, then we need to make sure both tables have standardized values for latitude and longitude. Finally, we can join the tables using their date, latitude, and longitude columns.

The tuples in `origin_lat_long` in `origin_weather_df` were converted to strings when read in from CSV. So, the very first step is to reconvert them to tuples of floats.

In [17]:
origin_weather_df

Unnamed: 0,origin_time,origin_temperature_2m_mean,origin_temperature_2m_max,origin_temperature_2m_min,origin_apparent_temperature_mean,origin_apparent_temperature_max,origin_apparent_temperature_min,origin_wind_speed_10m_max,origin_wind_gusts_10m_max,origin_wind_direction_10m_dominant,origin_shortwave_radiation_sum,origin_et0_fao_evapotranspiration,origin_precipitation_sum,origin_rain_sum,origin_snowfall_sum,origin_precipitation_hours,origin_weather_code,origin_lat_long
0,2020-01-01,18.9,23.7,14.4,19.2,24.5,13.3,11.3,20.5,351,15.43,2.79,0.0,0.0,0.0,0.0,1,"(26.072, -80.151)"
1,2020-01-02,21.4,25.5,17.2,22.5,27.2,17.4,16.3,31.3,111,14.55,2.88,0.0,0.0,0.0,0.0,3,"(26.072, -80.151)"
2,2020-01-03,24.8,27.0,22.9,26.6,28.1,25.2,25.4,46.1,160,12.22,2.86,0.0,0.0,0.0,0.0,3,"(26.072, -80.151)"
3,2020-01-04,25.3,29.0,23.3,27.5,30.0,25.5,24.1,43.9,189,14.09,3.30,0.0,0.0,0.0,0.0,3,"(26.072, -80.151)"
4,2020-01-05,17.9,23.1,13.1,15.7,25.7,9.0,26.3,45.0,335,14.94,3.15,0.5,0.5,0.0,3.0,51,"(26.072, -80.151)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
224285,2025-07-27,21.4,23.7,18.7,20.9,22.8,19.6,22.7,51.1,201,14.73,3.55,0.2,0.2,0.0,1.0,51,"(41.671, -70.284)"
224286,2025-07-28,23.8,27.1,21.7,26.8,31.1,22.8,21.6,43.6,275,21.77,4.20,0.1,0.1,0.0,1.0,51,"(41.671, -70.284)"
224287,2025-07-29,25.5,33.3,20.1,28.2,37.2,22.4,19.5,46.4,227,26.02,5.76,0.0,0.0,0.0,0.0,3,"(41.671, -70.284)"
224288,2025-07-30,27.6,31.9,22.5,30.6,36.9,26.4,17.8,36.0,249,25.85,5.95,0.0,0.0,0.0,0.0,3,"(41.671, -70.284)"


In [18]:
# Convert to tuple, extract latitude and longitude
import ast
origin_weather_df["origin_lat_long"] = origin_weather_df["origin_lat_long"].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
origin_weather_df[["origin_latitude", "origin_longitude"]] = pd.DataFrame(origin_weather_df["origin_lat_long"].tolist(), index = origin_weather_df.index)
origin_weather_df.head()

Unnamed: 0,origin_time,origin_temperature_2m_mean,origin_temperature_2m_max,origin_temperature_2m_min,origin_apparent_temperature_mean,origin_apparent_temperature_max,origin_apparent_temperature_min,origin_wind_speed_10m_max,origin_wind_gusts_10m_max,origin_wind_direction_10m_dominant,origin_shortwave_radiation_sum,origin_et0_fao_evapotranspiration,origin_precipitation_sum,origin_rain_sum,origin_snowfall_sum,origin_precipitation_hours,origin_weather_code,origin_lat_long,origin_latitude,origin_longitude
0,2020-01-01,18.9,23.7,14.4,19.2,24.5,13.3,11.3,20.5,351,15.43,2.79,0.0,0.0,0.0,0.0,1,"(26.072, -80.151)",26.072,-80.151
1,2020-01-02,21.4,25.5,17.2,22.5,27.2,17.4,16.3,31.3,111,14.55,2.88,0.0,0.0,0.0,0.0,3,"(26.072, -80.151)",26.072,-80.151
2,2020-01-03,24.8,27.0,22.9,26.6,28.1,25.2,25.4,46.1,160,12.22,2.86,0.0,0.0,0.0,0.0,3,"(26.072, -80.151)",26.072,-80.151
3,2020-01-04,25.3,29.0,23.3,27.5,30.0,25.5,24.1,43.9,189,14.09,3.3,0.0,0.0,0.0,0.0,3,"(26.072, -80.151)",26.072,-80.151
4,2020-01-05,17.9,23.1,13.1,15.7,25.7,9.0,26.3,45.0,335,14.94,3.15,0.5,0.5,0.0,3.0,51,"(26.072, -80.151)",26.072,-80.151
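
As an alternative to `ast.literal_eval`, the same split can be sketched with pandas string methods. This assumes every value has exactly the `"(lat, long)"` shape shown above, with a comma followed by a single space; the toy Series below stands in for the `origin_lat_long` column.

```python
import pandas as pd

# Toy column shaped like origin_lat_long after reading from CSV
lat_long = pd.Series(["(26.072, -80.151)", "(41.671, -70.284)"])

# Strip the parentheses, split on ", ", and cast both parts to float
parts = lat_long.str.strip("()").str.split(", ", expand=True).astype(float)
parts.columns = ["origin_latitude", "origin_longitude"]

print(parts)
```

The `literal_eval` approach is more forgiving of formatting differences, so this version is only worth considering if the parsing step becomes a bottleneck.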


In [19]:
# Round coordinates to 2 decimals in both tables so the join keys match exactly
delays_df1["origin_latitude"] = delays_df1["origin_latitude"].round(2)
delays_df1["origin_longitude"] = delays_df1["origin_longitude"].round(2)
origin_weather_df["origin_latitude"] = origin_weather_df["origin_latitude"].round(2)
origin_weather_df["origin_longitude"] = origin_weather_df["origin_longitude"].round(2)
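
Rounding makes the keys from the two tables line up, but it could also make two nearby airports share the same rounded pair. The sketch below shows one way to check for that. The helper name `find_coordinate_collisions` and the toy airports `A`, `B`, and `C` are hypothetical and not part of the project code.

```python
import pandas as pd


def find_coordinate_collisions(df: pd.DataFrame, decimals: int = 2) -> pd.DataFrame:
    """Return rounded coordinate pairs shared by more than one airport code."""
    rounded = df.assign(
        lat_key=df["latitude"].round(decimals),
        lon_key=df["longitude"].round(decimals),
    )
    counts = (
        rounded.groupby(["lat_key", "lon_key"])["code"]
        .nunique()
        .reset_index(name="n_codes")
    )
    return counts[counts["n_codes"] > 1].reset_index(drop=True)


# Toy airports: A and B are made-up codes a few hundred metres apart
airports = pd.DataFrame(
    {
        "code": ["A", "B", "C"],
        "latitude": [40.001, 40.003, 26.072017],
        "longitude": [-75.001, -75.002, -80.150997],
    }
)

collisions = find_coordinate_collisions(airports)
print(collisions)
```

An empty result on the real origin airports would suggest the 2-decimal keys are still unique per airport.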


In [20]:
# Combine delays_df1 and origin_weather_df
delays_df2 = pd.merge(left = delays_df1, right = origin_weather_df, how = "left",
                      left_on = ["FlightDate", "origin_latitude", "origin_longitude"], right_on = ["origin_time", "origin_latitude", "origin_longitude"])

delays_df2.head()

Unnamed: 0,FlightDate,DOT_ID_Reporting_Airline,Tail_Number,Flight_Number_Reporting_Airline,OriginAirportID,Origin,DestAirportID,Dest,CRSDepTime,DepTime,...,origin_wind_gusts_10m_max,origin_wind_direction_10m_dominant,origin_shortwave_radiation_sum,origin_et0_fao_evapotranspiration,origin_precipitation_sum,origin_rain_sum,origin_snowfall_sum,origin_precipitation_hours,origin_weather_code,origin_lat_long
0,2020-01-01,20409,N655JB,976,11697,FLL,14100,PHL,2152,2143,...,20.5,351.0,15.43,2.79,0.0,0.0,0.0,0.0,1.0,"(26.072, -80.151)"
1,2020-01-02,20409,N591JB,976,11697,FLL,14100,PHL,2152,2152,...,31.3,111.0,14.55,2.88,0.0,0.0,0.0,0.0,3.0,"(26.072, -80.151)"
2,2020-01-03,20409,N657JB,976,11697,FLL,14100,PHL,2152,2150,...,46.1,160.0,12.22,2.86,0.0,0.0,0.0,0.0,3.0,"(26.072, -80.151)"
3,2020-01-04,20409,N709JB,976,11697,FLL,14100,PHL,2152,2215,...,43.9,189.0,14.09,3.3,0.0,0.0,0.0,0.0,3.0,"(26.072, -80.151)"
4,2020-01-05,20409,N627JB,976,11697,FLL,14100,PHL,2152,2149,...,45.0,335.0,14.94,3.15,0.5,0.5,0.0,3.0,51.0,"(26.072, -80.151)"
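
Because this is a left join on rounded floats, flights without a matching weather row silently get NaN weather values. One way to count them is pandas' `indicator=True` option, sketched here on toy frames that stand in for `delays_df1` and `origin_weather_df` (all values are illustrative).

```python
import pandas as pd

# Toy stand-ins for delays_df1 and origin_weather_df
flights = pd.DataFrame(
    {
        "FlightDate": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-02"]),
        "origin_latitude": [26.07, 26.07, 41.67],
        "origin_longitude": [-80.15, -80.15, -70.28],
    }
)
weather = pd.DataFrame(
    {
        "origin_time": pd.to_datetime(["2020-01-01", "2020-01-02"]),
        "origin_latitude": [26.07, 26.07],
        "origin_longitude": [-80.15, -80.15],
        "origin_precipitation_sum": [0.0, 0.0],
    }
)

checked = pd.merge(
    left=flights,
    right=weather,
    how="left",
    left_on=["FlightDate", "origin_latitude", "origin_longitude"],
    right_on=["origin_time", "origin_latitude", "origin_longitude"],
    indicator=True,  # adds a "_merge" column: "both" or "left_only"
)

# Rows marked "left_only" found no weather record
unmatched = checked[checked["_merge"] == "left_only"]
print(len(unmatched))
```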


Lastly, we want to add the destination airport's weather features to the data. Since we have already filtered the data so that the only destination airport is PHL, we only need to join the two tables on their date columns.

In [21]:
# Need to convert dest_time column to datetime to match delays_df2 FlightDate column
dest_weather_df["dest_time"] = pd.to_datetime(dest_weather_df["dest_time"])
dest_weather_df.head()

Unnamed: 0,dest_time,dest_temperature_2m_mean,dest_temperature_2m_max,dest_temperature_2m_min,dest_apparent_temperature_mean,dest_apparent_temperature_max,dest_apparent_temperature_min,dest_wind_speed_10m_max,dest_wind_gusts_10m_max,dest_wind_direction_10m_dominant,dest_shortwave_radiation_sum,dest_et0_fao_evapotranspiration,dest_precipitation_sum,dest_rain_sum,dest_snowfall_sum,dest_precipitation_hours,dest_weather_code
0,2020-01-01,-16.8,-10.9,-22.6,-21.6,-15.3,-27.7,8.0,28.8,338,10.44,0.69,0.0,0.0,0.0,0.0,3
1,2020-01-02,-16.1,-10.4,-21.5,-20.8,-14.7,-26.4,7.9,26.3,326,9.66,0.64,0.0,0.0,0.0,0.0,3
2,2020-01-03,-15.0,-10.4,-19.7,-19.6,-14.8,-24.5,7.1,35.3,332,8.78,0.65,0.0,0.0,0.0,0.0,3
3,2020-01-04,-17.9,-12.2,-23.6,-22.8,-16.9,-28.9,9.1,40.7,354,7.64,0.54,0.0,0.0,0.0,0.0,3
4,2020-01-05,-18.7,-14.0,-25.4,-23.4,-18.9,-30.5,8.8,25.9,13,10.25,0.54,0.2,0.0,0.14,2.0,71


In [22]:
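# Note: `how` is not set, so this is an inner join (the pandas default);
# flights on dates with no row in dest_weather_df would be dropped
delays_df3 = pd.merge(left = delays_df2, right = dest_weather_df,
                      left_on = "FlightDate", right_on = "dest_time")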

delays_df3.head()

Unnamed: 0,FlightDate,DOT_ID_Reporting_Airline,Tail_Number,Flight_Number_Reporting_Airline,OriginAirportID,Origin,DestAirportID,Dest,CRSDepTime,DepTime,...,dest_wind_speed_10m_max,dest_wind_gusts_10m_max,dest_wind_direction_10m_dominant,dest_shortwave_radiation_sum,dest_et0_fao_evapotranspiration,dest_precipitation_sum,dest_rain_sum,dest_snowfall_sum,dest_precipitation_hours,dest_weather_code
0,2020-01-01,20409,N655JB,976,11697,FLL,14100,PHL,2152,2143,...,8.0,28.8,338,10.44,0.69,0.0,0.0,0.0,0.0,3
1,2020-01-02,20409,N591JB,976,11697,FLL,14100,PHL,2152,2152,...,7.9,26.3,326,9.66,0.64,0.0,0.0,0.0,0.0,3
2,2020-01-03,20409,N657JB,976,11697,FLL,14100,PHL,2152,2150,...,7.1,35.3,332,8.78,0.65,0.0,0.0,0.0,0.0,3
3,2020-01-04,20409,N709JB,976,11697,FLL,14100,PHL,2152,2215,...,9.1,40.7,354,7.64,0.54,0.0,0.0,0.0,0.0,3
4,2020-01-05,20409,N627JB,976,11697,FLL,14100,PHL,2152,2149,...,8.8,25.9,13,10.25,0.54,0.2,0.0,0.14,2.0,71
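
Since `pd.merge` defaults to an inner join, it can be worth comparing row counts before and after this step. The sketch below uses toy frames to show the difference between the default and `how="left"`, plus the `validate="many_to_one"` option, which raises an error if a date appears more than once in the weather table. The frames and values are illustrative only.

```python
import pandas as pd

# Toy stand-ins for delays_df2 and dest_weather_df
flights = pd.DataFrame(
    {
        "FlightDate": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-03"]),
        "Origin": ["FLL", "MIA", "FLL"],
    }
)
dest_weather = pd.DataFrame(
    {
        "dest_time": pd.to_datetime(["2020-01-01", "2020-01-02"]),
        "dest_weather_code": [3, 3],
    }
)

# Default how="inner": the 2020-01-03 flight has no weather row and is dropped
inner = pd.merge(flights, dest_weather, left_on="FlightDate", right_on="dest_time")

# how="left" keeps every flight; validate checks dates are unique on the right
left = pd.merge(
    flights,
    dest_weather,
    how="left",
    left_on="FlightDate",
    right_on="dest_time",
    validate="many_to_one",
)

print(len(inner), len(left))
```

If the two counts match on the real data, every flight date has a destination weather row and the choice of join type makes no difference.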


In [23]:
delays_df3

Unnamed: 0,FlightDate,DOT_ID_Reporting_Airline,Tail_Number,Flight_Number_Reporting_Airline,OriginAirportID,Origin,DestAirportID,Dest,CRSDepTime,DepTime,...,dest_wind_speed_10m_max,dest_wind_gusts_10m_max,dest_wind_direction_10m_dominant,dest_shortwave_radiation_sum,dest_et0_fao_evapotranspiration,dest_precipitation_sum,dest_rain_sum,dest_snowfall_sum,dest_precipitation_hours,dest_weather_code
0,2020-01-01,20409,N655JB,976,11697,FLL,14100,PHL,2152,2143,...,8.0,28.8,338,10.44,0.69,0.0,0.0,0.00,0.0,3
1,2020-01-02,20409,N591JB,976,11697,FLL,14100,PHL,2152,2152,...,7.9,26.3,326,9.66,0.64,0.0,0.0,0.00,0.0,3
2,2020-01-03,20409,N657JB,976,11697,FLL,14100,PHL,2152,2150,...,7.1,35.3,332,8.78,0.65,0.0,0.0,0.00,0.0,3
3,2020-01-04,20409,N709JB,976,11697,FLL,14100,PHL,2152,2215,...,9.1,40.7,354,7.64,0.54,0.0,0.0,0.00,0.0,3
4,2020-01-05,20409,N627JB,976,11697,FLL,14100,PHL,2152,2149,...,8.8,25.9,13,10.25,0.54,0.2,0.0,0.14,2.0,71
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
488387,2025-07-19,20416,N648NK,1617,13303,MIA,14100,PHL,1103,1052,...,16.5,52.9,148,20.73,3.67,3.7,3.7,0.00,11.0,53
488388,2025-07-21,20416,N680NK,1617,13303,MIA,14100,PHL,1529,1522,...,10.3,33.5,24,18.89,3.23,8.7,8.7,0.00,5.0,65
488389,2025-07-25,20416,N680NK,1617,13303,MIA,14100,PHL,1529,1522,...,12.6,37.8,202,23.23,4.05,0.0,0.0,0.00,0.0,3
488390,2025-07-26,20416,N905NK,1617,13303,MIA,14100,PHL,1103,1055,...,10.7,37.4,76,22.94,3.84,11.3,11.3,0.00,11.0,63


Now, we have the fully merged dataset. I will save this file to `../data/processed` as `delays_PHL_coord_weather_data.csv`.

In [24]:
delays_df3.to_csv("../data/processed/delays_PHL_coord_weather_data.csv")

In [25]:
delays_df3["origin_lat_long"] = delays_df3["origin_lat_long"].astype(str)

In [26]:
# Also save as parquet to send to GitHub
delays_df3.to_parquet("../data/processed/delays_PHL_coord_weather_data.parquet", index=False, engine="pyarrow", compression="snappy")