# Final Project - Flight Delays

#### Team members:
- Carla Cortez
- Redwan Hussain
- Anqi Liu
- Murray Stokely


### 1. Abstract
In the aviation industry, airlines have a keen interest in predicting flight delays because delays are costly and erode customer satisfaction. Accurately predicting a delay enables flight operations to proactively mitigate the costs associated with rerouting flights. Our team proposes to analyze historical flight and weather data from 2015-2021 and build a machine learning model that predicts such delays within a 2-hour window. Our objective is to design the classification model that yields the best results on our chosen evaluation metrics.

The project timeline will span 4 phases and consist of exploratory data analysis, data cleaning, and model creation. Given the volume of the dataset, we will implement a pipeline to train, validate, and test our models. During the experimentation, we will detail all transformations, joins, anomalies, and choices behind feature engineering and testing metrics.

### 2. Data description

This project relies on three main datasets.

1. **Flights data** This is a subset of the passenger flights' on-time performance data taken from the TranStats data collection available from the U.S. Department of Transportation (DOT). There are approximately 100 features in this data set.

2. **Weather** This provides weather information from NOAA for the same time period 2015-2021 as the flights data.

3. **Airports** This provides more detailed information about the airports in the flights dataset.

#### 2.1 EDA

In [None]:
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sns  
sns.set(style="darkgrid")  
from pyspark.sql.functions import col,isnan,when,count
import statsmodels.api as sm
from tabulate import tabulate
data_BASE_DIR = "dbfs:/mnt/mids-w261/datasets_final_project_2022/"



##### 2.1.1 Full Flight (Main) Data

In [None]:
df_airlines_full = spark.read.parquet(f"{data_BASE_DIR}parquet_airlines_data/")
df_airlines_full.createOrReplaceTempView("df_airlines_full_tb")

In [None]:
%sql

select
year, month,
concat(cast(year as string), case when month >= 10 then cast(month as string) else concat('0',cast(month as string)) end) as year_month,
sum(case when dep_delay>15 then 1 else 0 end) as delay_cnt,
sum(case when dep_delay>15 then 1 else 0 end)/count(dep_delay) as delay_pct
from df_airlines_full_tb
group by year, month, year_month
order by year, month;

year,month,year_month,delay_cnt,delay_pct
2015,1,201501,175414,0.1913700522134533
2015,2,201502,176200,0.2153339264589423
2015,3,201503,183702,0.1860563571432912
2015,4,201504,153274,0.1593622764078869
2015,5,201505,173236,0.176175667182609
2015,6,201506,226180,0.2283728359709934
2015,7,201507,212222,0.2055574174126471
2015,8,201508,185980,0.1838411915771909
2015,9,201509,112126,0.121081685809191
2015,10,201510,115470,0.1193300897430067



##### 2.1.2 Data To Be Excluded

1. Flights from 2020 show very different trends compared to all other years. We consider 2020 an outlier year and will exclude it from further analysis.
2. Cancelled flights are not included in the "delay" analysis.

##### 2.1.3 Variables that could be used

##### 2.1.3.1 Flight

In [None]:
df_airlines_full.printSchema()

root
 |-- QUARTER: integer (nullable = true)
 |-- MONTH: integer (nullable = true)
 |-- DAY_OF_MONTH: integer (nullable = true)
 |-- DAY_OF_WEEK: integer (nullable = true)
 |-- FL_DATE: string (nullable = true)
 |-- OP_UNIQUE_CARRIER: string (nullable = true)
 |-- OP_CARRIER_AIRLINE_ID: integer (nullable = true)
 |-- OP_CARRIER: string (nullable = true)
 |-- TAIL_NUM: string (nullable = true)
 |-- OP_CARRIER_FL_NUM: integer (nullable = true)
 |-- ORIGIN_AIRPORT_ID: integer (nullable = true)
 |-- ORIGIN_AIRPORT_SEQ_ID: integer (nullable = true)
 |-- ORIGIN_CITY_MARKET_ID: integer (nullable = true)
 |-- ORIGIN: string (nullable = true)
 |-- ORIGIN_CITY_NAME: string (nullable = true)
 |-- ORIGIN_STATE_ABR: string (nullable = true)
 |-- ORIGIN_STATE_FIPS: integer (nullable = true)
 |-- ORIGIN_STATE_NM: string (nullable = true)
 |-- ORIGIN_WAC: integer (nullable = true)
 |-- DEST_AIRPORT_ID: integer (nullable = true)
 |-- DEST_AIRPORT_SEQ_ID: integer (nullable = true)
 |-- DEST_CITY_MARKET_

In [None]:
pd_df_airlines = df_airlines_full.filter("YEAR <= 2018").sample(0.001).toPandas()

In [None]:
airline_cols = ['QUARTER',
 'MONTH',
 'DAY_OF_MONTH',
 'DAY_OF_WEEK',
 'FL_DATE', # string
 'OP_UNIQUE_CARRIER', # string
 'OP_CARRIER_AIRLINE_ID',
 'OP_CARRIER', # string
 'TAIL_NUM', # string
 'OP_CARRIER_FL_NUM',
 'ORIGIN_AIRPORT_ID',
 'ORIGIN_AIRPORT_SEQ_ID',
 'ORIGIN_CITY_MARKET_ID',
 'ORIGIN',
 'ORIGIN_CITY_NAME', # string
 'ORIGIN_STATE_ABR', # string
 'ORIGIN_STATE_FIPS',
 'ORIGIN_STATE_NM', # string
 'ORIGIN_WAC',
 'DEST_AIRPORT_ID',
 'DEST_AIRPORT_SEQ_ID',
 'DEST_CITY_MARKET_ID',
 'DEST', # string
 'DEST_CITY_NAME', # string
 'DEST_STATE_ABR', # string
 'DEST_STATE_FIPS',
 'DEST_STATE_NM', # string
 'DEST_WAC',
 'CRS_DEP_TIME',
 'DEP_TIME',
 'DEP_DELAY',
 'DEP_DELAY_NEW',
 'DEP_DEL15',
 'DEP_DELAY_GROUP',
 'DEP_TIME_BLK', # string
 'TAXI_OUT',
 'WHEELS_OFF',
 'WHEELS_ON',
 'TAXI_IN',
 'CRS_ARR_TIME',
 'ARR_TIME',
 'ARR_DELAY',
 'ARR_DELAY_NEW',
 'ARR_DEL15',
 'ARR_DELAY_GROUP',
 'ARR_TIME_BLK', # string
 'CANCELLED',
 'CANCELLATION_CODE', # string
 'DIVERTED',
 'CRS_ELAPSED_TIME',
 'ACTUAL_ELAPSED_TIME',
 'AIR_TIME',
 'FLIGHTS',
 'DISTANCE',
 'DISTANCE_GROUP',
 'CARRIER_DELAY',
 'WEATHER_DELAY',
 'NAS_DELAY',
 'SECURITY_DELAY',
 'LATE_AIRCRAFT_DELAY',
 'FIRST_DEP_TIME',
 'TOTAL_ADD_GTIME',
 'LONGEST_ADD_GTIME',
#  'DIV_AIRPORT_LANDINGS',
#  'DIV_REACHED_DEST',
#  'DIV_ACTUAL_ELAPSED_TIME',
#  'DIV_ARR_DELAY',
#  'DIV_DISTANCE',
#  'DIV1_AIRPORT',
#  'DIV1_AIRPORT_ID',
#  'DIV1_AIRPORT_SEQ_ID',
#  'DIV1_WHEELS_ON',
#  'DIV1_TOTAL_GTIME',
#  'DIV1_LONGEST_GTIME',
#  'DIV1_WHEELS_OFF',
#  'DIV1_TAIL_NUM',
#  'DIV2_AIRPORT',
#  'DIV2_AIRPORT_ID',
#  'DIV2_AIRPORT_SEQ_ID',
#  'DIV2_WHEELS_ON',
#  'DIV2_TOTAL_GTIME',
#  'DIV2_LONGEST_GTIME',
#  'DIV2_WHEELS_OFF',
#  'DIV2_TAIL_NUM',
#  'DIV3_AIRPORT',
#  'DIV3_AIRPORT_ID',
#  'DIV3_AIRPORT_SEQ_ID',
#  'DIV3_WHEELS_ON',
#  'DIV3_TOTAL_GTIME',
#  'DIV3_LONGEST_GTIME',
#  'DIV3_WHEELS_OFF',
#  'DIV3_TAIL_NUM',
#  'DIV4_AIRPORT',
#  'DIV4_AIRPORT_ID',
#  'DIV4_AIRPORT_SEQ_ID',
#  'DIV4_WHEELS_ON',
#  'DIV4_TOTAL_GTIME',
#  'DIV4_LONGEST_GTIME',
#  'DIV4_WHEELS_OFF',
#  'DIV4_TAIL_NUM',
#  'DIV5_AIRPORT',
#  'DIV5_AIRPORT_ID',
#  'DIV5_AIRPORT_SEQ_ID',
#  'DIV5_WHEELS_ON',
#  'DIV5_TOTAL_GTIME',
#  'DIV5_LONGEST_GTIME',
#  'DIV5_WHEELS_OFF',
#  'DIV5_TAIL_NUM',
 'YEAR']

sanity_check_table = [['Variable Name', 
#                        'Unique Counts', 
                       'Missing Pct', 
                       'Min Value', 
                       'Max Value', 
                       'Enough Records']]

for var in airline_cols:
    sanity_check_table +=[[var, 
#                            pd_df_airlines[var].nunique(), 
                           round(pd_df_airlines[var].isnull().sum() / pd_df_airlines.shape[0],4), 
                           pd_df_airlines[var].dropna().min(), 
                           pd_df_airlines[var].dropna().max(), 
                           'Y' if pd_df_airlines[var].count()/pd_df_airlines.shape[0] > 0.5 else 'N']]
    
print(tabulate(sanity_check_table, floatfmt=".2f"))

---------------------  -----------  ------------  ----------  --------------
Variable Name          Missing Pct  Min Value     Max Value   Enough Records
QUARTER                0.0          1             4           Y
MONTH                  0.0          1             12          Y
DAY_OF_MONTH           0.0          1             31          Y
DAY_OF_WEEK            0.0          1             7           Y
FL_DATE                0.0          2015-01-01    2018-12-31  Y
OP_UNIQUE_CARRIER      0.0          9E            YX          Y
OP_CARRIER_AIRLINE_ID  0.0          19393         21171       Y
OP_CARRIER             0.0          9E            YX          Y
TAIL_NUM               0.002        215NV         N9EAMQ      Y
OP_CARRIER_FL_NUM      0.0          1             7439        Y
ORIGIN_AIRPORT_ID      0.0          10135         16218       Y
ORIGIN_AIRPORT_SEQ_ID  0.0          1013503       1621801     Y
ORIGIN_CITY_MARKET_ID  0.0          30070         35991       Y
ORIGIN        

In [None]:
%sql

select
DAY_OF_MONTH,
sum(case when dep_delay>15 then 1 else 0 end) as delay_cnt,
sum(case when dep_delay>15 then 1 else 0 end)/count(dep_delay) as delay_pct
from df_airlines_full_tb
where year <= 2018
and cancelled = 0
group by DAY_OF_MONTH
order by DAY_OF_MONTH;

DAY_OF_MONTH,delay_cnt,delay_pct
1,272004,0.1747010856998987
2,283616,0.1811886223030867
3,268220,0.1745099187372722
4,252668,0.164143674763368
5,271450,0.1730349741642751
6,268764,0.1712081987096511
7,263946,0.1695474213241793
8,280384,0.1779720941097015
9,292240,0.1856541156590073
10,271534,0.1733121257341705



In [None]:
%sql

select
DAY_OF_WEEK,
sum(case when dep_delay>15 then 1 else 0 end) as delay_cnt,
sum(case when dep_delay>15 then 1 else 0 end)/count(dep_delay) as delay_pct
from df_airlines_full_tb
where year <= 2018
and cancelled = 0
group by DAY_OF_WEEK
order by DAY_OF_WEEK;

DAY_OF_WEEK,delay_cnt,delay_pct
1,1304606,0.1833796348617054
2,1134786,0.1641927960896514
3,1139830,0.1625719060091544
4,1327972,0.186241458727714
5,1369984,0.1911252666368117
6,904478,0.1550542042911545
7,1163792,0.1721170273685406



##### 2.1.3.2 Weather

In [None]:
df_weather_full = spark.read.parquet(f"{data_BASE_DIR}parquet_weather_data/")
df_weather_full.createOrReplaceTempView("df_weather_full_tb")
pd_df_weather = df_weather_full.filter("YEAR <= 2018").sample(0.001).toPandas()

In [None]:
df_weather_full.printSchema()

root
 |-- STATION: string (nullable = true)
 |-- DATE: string (nullable = true)
 |-- LATITUDE: string (nullable = true)
 |-- LONGITUDE: string (nullable = true)
 |-- ELEVATION: string (nullable = true)
 |-- NAME: string (nullable = true)
 |-- REPORT_TYPE: string (nullable = true)
 |-- SOURCE: string (nullable = true)
 |-- HourlyAltimeterSetting: string (nullable = true)
 |-- HourlyDewPointTemperature: string (nullable = true)
 |-- HourlyDryBulbTemperature: string (nullable = true)
 |-- HourlyPrecipitation: string (nullable = true)
 |-- HourlyPresentWeatherType: string (nullable = true)
 |-- HourlyPressureChange: string (nullable = true)
 |-- HourlyPressureTendency: string (nullable = true)
 |-- HourlyRelativeHumidity: string (nullable = true)
 |-- HourlySkyConditions: string (nullable = true)
 |-- HourlySeaLevelPressure: string (nullable = true)
 |-- HourlyStationPressure: string (nullable = true)
 |-- HourlyVisibility: string (nullable = true)
 |-- HourlyWetBulbTemperature: string (nu

In [None]:
sanity_check_table = [['Variable Name', 
#                        'Unique Counts', 
                       'Missing Pct', 
                       'Min Value', 
                       'Max Value', 
                       'Enough Records']]

weather_cols = ['STATION',
 'DATE',
 'LATITUDE',
 'LONGITUDE',
 'ELEVATION',
#  'NAME',
 'REPORT_TYPE',
 'SOURCE',
 'HourlyAltimeterSetting',
 'HourlyDewPointTemperature',
 'HourlyDryBulbTemperature',
 'HourlyPrecipitation',
 'HourlyPresentWeatherType',
 'HourlyPressureChange',
 'HourlyPressureTendency',
 'HourlyRelativeHumidity',
 'HourlySkyConditions',
 'HourlySeaLevelPressure',
 'HourlyStationPressure',
 'HourlyVisibility',
 'HourlyWetBulbTemperature',
 'HourlyWindDirection',
 'HourlyWindGustSpeed',
 'HourlyWindSpeed',
 'Sunrise',
 'Sunset',
 'DailyAverageDewPointTemperature',
 'DailyAverageDryBulbTemperature',
 'DailyAverageRelativeHumidity',
 'DailyAverageSeaLevelPressure',
 'DailyAverageStationPressure',
 'DailyAverageWetBulbTemperature',
 'DailyAverageWindSpeed',
 'DailyCoolingDegreeDays',
 'DailyDepartureFromNormalAverageTemperature',
 'DailyHeatingDegreeDays',
 'DailyMaximumDryBulbTemperature',
 'DailyMinimumDryBulbTemperature',
 'DailyPeakWindDirection',
 'DailyPeakWindSpeed',
 'DailyPrecipitation',
 'DailySnowDepth',
 'DailySnowfall',
 'DailySustainedWindDirection',
 'DailySustainedWindSpeed',
 'DailyWeather',
#  'MonthlyAverageRH',
#  'MonthlyDaysWithGT001Precip',
#  'MonthlyDaysWithGT010Precip',
#  'MonthlyDaysWithGT32Temp',
#  'MonthlyDaysWithGT90Temp',
#  'MonthlyDaysWithLT0Temp',
#  'MonthlyDaysWithLT32Temp',
#  'MonthlyDepartureFromNormalAverageTemperature',
#  'MonthlyDepartureFromNormalCoolingDegreeDays',
#  'MonthlyDepartureFromNormalHeatingDegreeDays',
#  'MonthlyDepartureFromNormalMaximumTemperature',
#  'MonthlyDepartureFromNormalMinimumTemperature',
#  'MonthlyDepartureFromNormalPrecipitation',
#  'MonthlyDewpointTemperature',
#  'MonthlyGreatestPrecip',
#  'MonthlyGreatestPrecipDate',
#  'MonthlyGreatestSnowDepth',
#  'MonthlyGreatestSnowDepthDate',
#  'MonthlyGreatestSnowfall',
#  'MonthlyGreatestSnowfallDate',
#  'MonthlyMaxSeaLevelPressureValue',
#  'MonthlyMaxSeaLevelPressureValueDate',
#  'MonthlyMaxSeaLevelPressureValueTime',
#  'MonthlyMaximumTemperature',
#  'MonthlyMeanTemperature',
#  'MonthlyMinSeaLevelPressureValue',
#  'MonthlyMinSeaLevelPressureValueDate',
#  'MonthlyMinSeaLevelPressureValueTime',
#  'MonthlyMinimumTemperature',
#  'MonthlySeaLevelPressure',
#  'MonthlyStationPressure',
#  'MonthlyTotalLiquidPrecipitation',
#  'MonthlyTotalSnowfall',
#  'MonthlyWetBulb',
 'AWND',
 'CDSD',
 'CLDD',
 'DSNW',
 'HDSD',
 'HTDD',
 'NormalsCoolingDegreeDay',
 'NormalsHeatingDegreeDay',
#  'ShortDurationEndDate005',
#  'ShortDurationEndDate010',
#  'ShortDurationEndDate015',
#  'ShortDurationEndDate020',
#  'ShortDurationEndDate030',
#  'ShortDurationEndDate045',
#  'ShortDurationEndDate060',
#  'ShortDurationEndDate080',
#  'ShortDurationEndDate100',
#  'ShortDurationEndDate120',
#  'ShortDurationEndDate150',
#  'ShortDurationEndDate180',
#  'ShortDurationPrecipitationValue005',
#  'ShortDurationPrecipitationValue010',
#  'ShortDurationPrecipitationValue015',
#  'ShortDurationPrecipitationValue020',
#  'ShortDurationPrecipitationValue030',
#  'ShortDurationPrecipitationValue045',
#  'ShortDurationPrecipitationValue060',
#  'ShortDurationPrecipitationValue080',
#  'ShortDurationPrecipitationValue100',
#  'ShortDurationPrecipitationValue120',
#  'ShortDurationPrecipitationValue150',
#  'ShortDurationPrecipitationValue180',
#  'REM',
#  'BackupDirection',
#  'BackupDistance',
#  'BackupDistanceUnit',
#  'BackupElements',
#  'BackupElevation',
#  'BackupEquipment',
#  'BackupLatitude',
#  'BackupLongitude',
#  'BackupName',
 'WindEquipmentChangeDate',
 'YEAR'
               ]

for var in weather_cols:
    sanity_check_table +=[[var, 
#                            pd_df_weather[var].nunique(), 
                           round(pd_df_weather[var].isnull().sum()/pd_df_weather.shape[0],4), 
                           pd_df_weather[var].dropna().min(), 
                           pd_df_weather[var].dropna().max(), 
                           'Y' if pd_df_weather[var].count()/pd_df_weather.shape[0] > 0.5 else 'N']]
    
print(tabulate(sanity_check_table))

------------------------------------------  -----------  -------------------  -------------------  --------------
Variable Name                               Missing Pct  Min Value            Max Value            Enough Records
STATION                                     0.0          00702699999          A5125600451          Y
DATE                                        0.0          2015-01-01T00:00:00  2018-12-31T23:59:00  Y
LATITUDE                                    0.0076       -0.0166667           9.993861             Y
LONGITUDE                                   0.0076       -0.005456            99.9666666           Y
ELEVATION                                   0.0076       -1.0                 999.1                Y
REPORT_TYPE                                 0.0          CRN05                SY-MT                Y
SOURCE                                      0.0          1                    O                    Y
HourlyAltimeterSetting                      0.4781       27.97   

In [None]:
%sql

select 
year,
month(date) as month,
max(concat(cast(year as string), case when month(date) > 9 then cast(month(date) as string) else concat('0',cast(month(date) as string)) end)) as year_month,
avg(cast(regexp_replace(HourlyAltimeterSetting,'[^0-9.]+','') as float)) as altimeter,
avg(cast(regexp_replace(HourlySeaLevelPressure,'[^0-9.]+','') as float)) as sea_level_pressure,
avg(cast(regexp_replace(HourlyStationPressure,'[^0-9.]+','') as float)) as station_pressure,
avg(cast(regexp_replace(HourlyWindSpeed,'[^0-9.]+','') as float)) as wind_speed
from df_weather_full_tb
where year <= 2018
group by year, month(date)
order by year, month(date)

year,month,year_month,altimeter,sea_level_pressure,station_pressure,wind_speed
2015,1,201501,30.06473766718564,30.019837584157283,28.84106605595806,8.231777744107491
2015,2,201502,30.04511910131013,30.02732800969941,28.829686204683632,8.412876530160824
2015,3,201503,30.06001503168958,30.036951671806538,28.87534460313336,8.102456908552844
2015,4,201504,29.964504146175397,29.96090556918036,28.759775482371357,8.179527987373811
2015,5,201505,30.000689347384423,29.959020679414312,28.780613345610583,7.830066649185492
2015,6,201506,29.97430284563576,29.936081344545432,28.79352913728112,7.131825208958373
2015,7,201507,29.94673039331379,29.892646408550576,28.777643558969032,6.992404343885842
2015,8,201508,29.97600403696272,29.937781116211088,28.8154343941312,6.662581771153774
2015,9,201509,29.993101945888835,29.968418783594487,28.828767484451348,6.855071248585863
2015,10,201510,30.017696893316405,30.0078111949378,28.851700269625656,7.322301654997074



##### 2.1.3.3 Station

In [None]:
df_stations = spark.read.parquet(f"{data_BASE_DIR}stations_data/*")
df_stations.createOrReplaceTempView("df_station_tb")
pd_df_stations = df_stations.sample(0.1).toPandas()

In [None]:
df_stations.printSchema()

root
 |-- usaf: string (nullable = true)
 |-- wban: string (nullable = true)
 |-- station_id: string (nullable = true)
 |-- lat: double (nullable = true)
 |-- lon: double (nullable = true)
 |-- neighbor_id: string (nullable = true)
 |-- neighbor_name: string (nullable = true)
 |-- neighbor_state: string (nullable = true)
 |-- neighbor_call: string (nullable = true)
 |-- neighbor_lat: double (nullable = true)
 |-- neighbor_lon: double (nullable = true)
 |-- distance_to_neighbor: double (nullable = true)



In [None]:
sanity_check_table = [['Variable Name', 
#                        'Unique Counts', 
                       'Missing Pct', 
                       'Min Value', 
                       'Max Value', 
                       'Enough Records']]

for var in list(pd_df_stations.columns):
    sanity_check_table +=[[var, 
#                            pd_df_stations[var].nunique(), 
                           round(pd_df_stations[var].isnull().sum()/pd_df_stations.shape[0],4), 
                           pd_df_stations[var].dropna().min(), 
                           pd_df_stations[var].dropna().max(), 
                           'Y' if pd_df_stations[var].count()/pd_df_stations.shape[0] > 0.5 else 'N']]
    
print(tabulate(sanity_check_table))

--------------------  -----------  ------------------------  -----------------------------  --------------
Variable Name         Missing Pct  Min Value                 Max Value                      Enough Records
usaf                  0.0          690020                    A51256                         Y
wban                  0.0          00102                     96402                          Y
station_id            0.0          69002093218               A5125600451                    Y
lat                   0.0          17.7                      71.333                         Y
lon                   0.0          -176.65                   174.1                          Y
neighbor_id           0.0          69002093218               A5125600451                    Y
neighbor_name         0.0          A L MANGHAM JR RGNL ARPT  ZEPHYRHILLS MUNICIPAL AIRPORT  Y
neighbor_state        0.0          AK                        WY                             Y
neighbor_call         0.0         

#### 2.2 Table description
a) df_flights: It contains a time series of the flight schedules from the year 2015 to 2021, including:
- Time period information:
  - Flight date and time
  - Day of the week/day of the month and year
- Flight operational data:
  - Scheduled departure/arrival time
  - Departure/arrival time
  - Departure/arrival delay in minutes, calculated as the difference between the actual and scheduled departure/arrival time.
  - Flight cancellation; in our case we will ignore the canceled flights.
- Metadata associated with the flight's origin and destination airports:
  - IATA airport code which is the airport location's unique 3 letter identifier
  - Airport state and city
- Carrier information
 
b) df_neighbor_stations: It provides metadata about weather stations with neighbor airports. It includes:
- Station unique identifier
- neighbor_call corresponds to the ICAO airport code which is defined by the International Civil Aviation Organization
- Neighbor airport name, state

c) df_weather_by_station: It contains a time series of weather information per weather station, including:
- Station unique identifier
- Time period information 
- Station metadata
- Weather metrics 

d) df_iata_icao_codes: External resource that contains the mapping between the IATA and ICAO unique airport codes

#### 2.3 Table joins

a) df_neighbor_stations <> df_iata_icao_codes
- We will perform an inner join between the df_neighbor_stations and df_iata_icao_codes tables, using the df_neighbor_stations.neighbor_call and the df_iata_icao_codes.icao_code columns to enhance the df_neighbor_stations table with a new corresponding IATA airport code.

b) df_neighbor_stations <> df_flights
- Both tables can be joined by matching the df_flights.ORIGIN and df_flights.DEST columns, which contain the IATA airport codes for the origin and destination airports, respectively, against the df_neighbor_stations.iata_code column created above. We will need to join the tables twice: once on df_flights.ORIGIN and df_neighbor_stations.iata_code, and once on df_flights.DEST and df_neighbor_stations.iata_code. We can perform inner joins between these two tables because we are only interested in flights with associated weather information.
- This relationship is many-to-many because multiple flights can be mapped to multiple neighbor stations.
- With the result of these joins, we will create a new table called df_flights_station that contains the origin/destination flight and airport information along with the associated weather station_id. This table can be stored in Azure Blob Storage.
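Joins (a) and (b) can be illustrated with a minimal sketch in plain Python. This is only an illustration of the join logic on toy rows, not the Spark implementation; the station IDs and the output column names `origin_station_id` and `dest_station_id` are hypothetical.

```python
# Toy rows; station IDs are made up for illustration.
stations = [
    {"station_id": "ST_A", "neighbor_call": "KSFO"},
    {"station_id": "ST_B", "neighbor_call": "KJFK"},
    {"station_id": "ST_C", "neighbor_call": "XXXX"},  # no IATA mapping, dropped by the inner join
]
iata_icao = {"KSFO": "SFO", "KJFK": "JFK"}  # ICAO code -> IATA code
flights = [
    {"ORIGIN": "SFO", "DEST": "JFK"},
    {"ORIGIN": "SFO", "DEST": "ZZZ"},  # destination has no station, dropped by the inner join
]

# Join (a): inner join stations with the IATA/ICAO mapping on neighbor_call.
stations_with_iata = [
    dict(s, iata_code=iata_icao[s["neighbor_call"]])
    for s in stations
    if s["neighbor_call"] in iata_icao
]

# Index station IDs by IATA code; a code may map to several stations.
stations_by_iata = {}
for s in stations_with_iata:
    stations_by_iata.setdefault(s["iata_code"], []).append(s["station_id"])

# Join (b): inner join flights with stations twice, on ORIGIN and on DEST.
flights_station = [
    dict(f, origin_station_id=o, dest_station_id=d)
    for f in flights
    for o in stations_by_iata.get(f["ORIGIN"], [])
    for d in stations_by_iata.get(f["DEST"], [])
]
```

Only the SFO to JFK flight survives, because both of its airports have a mapped station.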

c) df_flights_station <> df_weather_by_station
- We will join the df_flights_station and the df_weather_by_station tables using the df_flights_station.station_id and df_weather_by_station.station columns and calculate the average of each weather metric on a time interval of two hours before the flight departure.
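The two-hour averaging in join (c) can be sketched as follows. This is a plain-Python illustration under the assumption that the window is the half-open interval from two hours before departure up to (but not including) departure; the function name, row layout, and sample readings are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean


def avg_weather_before_departure(departure, station_id, weather_rows, metric, window_hours=2):
    """Average `metric` over readings from `station_id` taken in
    [departure - window_hours, departure). Returns None if there are none."""
    window_start = departure - timedelta(hours=window_hours)
    values = [
        row[metric]
        for row in weather_rows
        if row["station"] == station_id
        and window_start <= row["date"] < departure
        and row[metric] is not None
    ]
    return mean(values) if values else None


# Made-up readings for two stations.
weather_rows = [
    {"station": "S1", "date": datetime(2015, 1, 1, 9, 30), "wind_speed": 8.0},   # too early
    {"station": "S1", "date": datetime(2015, 1, 1, 10, 30), "wind_speed": 10.0},
    {"station": "S1", "date": datetime(2015, 1, 1, 11, 30), "wind_speed": 12.0},
    {"station": "S2", "date": datetime(2015, 1, 1, 11, 0), "wind_speed": 30.0},  # other station
]
departure = datetime(2015, 1, 1, 12, 0)
avg_wind = avg_weather_before_departure(departure, "S1", weather_rows, "wind_speed")
```

Restricting the window to readings strictly before departure keeps information from after the flight out of the features.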

#### 2.4 Data Cleaning and Validation

We are joining several complex data sets and expect to find a number of issues that require data cleaning. We may need to merge some levels of the categorical variables, handle missing values, and normalize numeric quantities.

##### 2.4.1 NULL Values

We will drop all columns with more than 50% NULL values, which are identified through data summarization.
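As a rough sketch of this rule, the function below keeps a column only if at most 50% of its values are missing. The first three missing fractions are taken from the sanity-check tables above; `HYPOTHETICAL_SPARSE_COLUMN` and the function name are made up for illustration.

```python
def columns_to_keep(missing_fraction, max_missing=0.5):
    """Return the column names whose missing fraction does not exceed max_missing."""
    return [name for name, frac in missing_fraction.items() if frac <= max_missing]


missing_fraction = {
    "TAIL_NUM": 0.002,
    "LATITUDE": 0.0076,
    "HourlyAltimeterSetting": 0.4781,
    "HYPOTHETICAL_SPARSE_COLUMN": 0.93,  # illustrative value, not from the data
}
kept = columns_to_keep(missing_fraction)
```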

##### 2.4.2 Outliers

We will carefully report any outlier data that we believe should be excluded, and only exclude it after a thorough investigation.

##### 2.4.3 Normalization and Scaling

We will normalize variables that should be treated numerically, such as the weather measurements.
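One common choice is z-score standardization, sketched below under the assumption that the mean and standard deviation are estimated on the training split only and then reused for the other splits. The helper names and sample values are hypothetical.

```python
from statistics import mean, pstdev


def fit_standardizer(train_values):
    """Estimate mean and standard deviation from training data only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    if sigma == 0:
        sigma = 1.0  # constant column: avoid division by zero
    return mu, sigma


def standardize(values, mu, sigma):
    """Apply the training-set statistics to any split."""
    return [(v - mu) / sigma for v in values]


train_wind = [6.0, 8.0, 10.0]  # made-up wind speeds
test_wind = [8.0, 12.0]
mu, sigma = fit_standardizer(train_wind)
train_scaled = standardize(train_wind, mu, sigma)
test_scaled = standardize(test_wind, mu, sigma)
```

Fitting on the training split alone avoids leaking information from validation or test data into the features.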

##### 2.4.4 Class Imbalance

We have about a 5:1 imbalance between our "no delay" and "delay" classes.  For some of our modelling work, we may need to boost the number of training examples from our "delay" category to build a predictive model.
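One possible way to boost the minority class is random oversampling with replacement, sketched here on toy rows. This is only one option among several (class weights and undersampling are alternatives); the `delayed` label key and the function name are assumptions, and such resampling would normally be applied to the training split only.

```python
import random


def oversample_minority(rows, label_key="delayed", seed=0):
    """Duplicate random minority-class rows until both classes have equal counts."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    minority, majority = sorted([positives, negatives], key=len)
    if not minority:
        return list(rows)  # nothing to resample from
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced


# Toy data with the roughly 5:1 imbalance described above.
rows = [{"delayed": 0}] * 5 + [{"delayed": 1}]
balanced = oversample_minority(rows)
n_delayed = sum(r["delayed"] for r in balanced)
```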

### 3. Machine Learning Algorithms and Metrics

#### 3.1 Model

#### 3.1.1 Logistic Regression

Our first ML algorithm will be Logistic Regression. With the variety of variables available to us, we can explore the feature space and build a logistic regression model that uses the source and destination city, airline, weather, seasonality, recent delay history, and other features described above to classify all flights into two categories: those with a departure delay of >= 15 minutes, and those without.

##### 3.1.1.1 Example Simple Logistic Regression Model

$$\text{logit}(p_i) = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \dots + \beta_n x_{n,i}$$

Here $p_i$ is the probability that flight $i$ is delayed, and the features $x_{1,i}, \dots, x_{n,i}$ will be things like "airline carrier", "source city", "destination city", "raining at source city", "raining at destination city", "is winter", "is thunderstorm", etc.

##### 3.1.1.2 Loss Function and Regularization

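We will aim to minimize the log loss over the training set $D$, where $y$ is the observed label (1 for a delay, 0 otherwise) and $y'$ is the predicted probability of a delay:

$$\text{Log Loss} = \sum_{(x,y) \in D} -y \log(y') - (1-y)\log(1-y')$$

We have a large number of features to choose from, and so we will explore adding an L2 regularization term to avoid overfitting on a large set of features.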
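The loss and penalty can be sketched in plain Python as follows. This is an illustrative implementation, not the training code; the penalty form `l2 * sum(w^2)` (bias excluded), the clipping constant `eps`, and all names and sample values are assumptions.

```python
import math


def sigmoid(z):
    """Inverse of the logit: maps a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Predicted delay probability for one feature vector x."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def regularized_log_loss(weights, bias, X, y, l2=0.0, eps=1e-15):
    """Summed log loss over (X, y) plus an L2 penalty on the weights."""
    total = 0.0
    for xi, yi in zip(X, y):
        # Clip probabilities so log() never sees exactly 0 or 1.
        p = min(max(predict_proba(weights, bias, xi), eps), 1 - eps)
        total += -yi * math.log(p) - (1 - yi) * math.log(1 - p)
    penalty = l2 * sum(w * w for w in weights)
    return total + penalty


# Two made-up examples with two binary features each.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1, 0]
loss_zero = regularized_log_loss([0.0, 0.0], 0.0, X, y)  # every prediction is 0.5
penalty_gap = (
    regularized_log_loss([1.0, -2.0], 0.0, X, y, l2=0.1)
    - regularized_log_loss([1.0, -2.0], 0.0, X, y, l2=0.0)
)
```

With all-zero weights each example contributes $\log 2$ to the loss, and the gap between the regularized and unregularized loss equals the penalty term alone.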

#### 3.1.2 Random Forests

We will also fit a Random Forest model to this data set. We believe a random forest will be useful because it is robust to the inclusion of irrelevant features and is invariant to monotonic transformations of the feature values, such as scaling.

#### 3.2 Metrics and Baseline

We approach this problem from a business perspective: we want to predict delays so that we can minimize their impact on our affected passengers. For this reason, a False Negative is more costly to us than a False Positive. For 2015-2018, excluding cancelled flights, 17.4% of flights are delayed more than 15 minutes from their scheduled departure. We set as our baseline predictor a model that always predicts a delay. We compute the precision and recall of this "Always Predict Delay" model, and will measure our logistic regression and other models against this baseline.

In [None]:
%sql

select
--year, month,
-- concat(cast(year as string), case when month >= 10 then cast(month as string) else concat('0',cast(month as string)) end) as year_month,
-- sum(case when dep_delay>15 then 1 else 0 end) as delay_cnt,
sum(case when dep_delay>15 then 1 else 0 end)/count(dep_delay) as delay_pct
from df_airlines_full_tb
where year <= 2018
and cancelled = 0;

delay_pct
0.174117145151199


##### Metrics

Precision and recall are defined in terms of True positive (TP), False positive (FP), True Negative (TN), and False Negative (FN) classifications.

Precision = $$\frac{TP}{TP+FP}$$

Recall = $$\frac{TP}{TP+FN}$$

F1 = $$\frac{2TP}{2TP+FP+FN}$$

For a given observed delay percentage D, assumed to be 17.4% here, these metrics for an "Always predict delay" strategy are computed as

Precision = $$\frac{17.4}{17.4 + (100-17.4)} = \frac{17.4}{100} = .174$$
Recall = $$\frac{17.4}{17.4 + 0} = 1$$
F1 = $$\frac{2 * 17.4}{2 * 17.4 + (100-17.4) + 0} = \frac{34.8}{117.4} = .296$$
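The arithmetic above can be checked with a short sketch. The function name is hypothetical, and returning 0 when a denominator is zero is a convention assumed here for the "never predict delay" case, where precision is otherwise undefined.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1; a zero denominator yields 0.0 by convention."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return precision, recall, f1


delay_rate = 0.174  # observed share of delayed flights

# Always predict delay: every delayed flight is a TP, every on-time flight an FP.
always = precision_recall_f1(tp=delay_rate, fp=1 - delay_rate, fn=0.0)

# Never predict delay: no positives are predicted, so every delayed flight is an FN.
never = precision_recall_f1(tp=0.0, fp=0.0, fn=delay_rate)
```

Working with shares rather than raw counts gives the same ratios, since the total number of flights cancels out.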

We intend to update these metrics with our logistic regression model and other models in later phases of the project.

| Model | Precision | Recall | F1 |
|---|---|---|---|
| Always predict delay | .174 | 1.0 | .296 |
| Never predict delay  | 0 | 0 | 0 |
| Logistic Regression #1 | TBD | TBD | TBD |

### 4. Machine Learning Pipelines

#### 4.1 Pipeline description

a) Data Engineering:
- Step 1	
  - Ingest provided parquet files containing flights, weather, and station information and create corresponding dataframes.
  - Ingest external source airport code data in parquet format and create dataframe.
  - Perform EDA.
  - Data Cleaning.
  - Create a final dataframe with the result of the joined dataframes and store it in Azure Blob Storage in parquet format.

- Step 2
  - Ingest final_dataset from Azure Blob Storage.
  - Select features.
  - Split the data into Train, Validation, and Test datasets.
  - Normalize the Train, Validation, and Test datasets using statistics computed from the Train dataset only.
  - Store the normalized Train, Validation, and Test datasets in Azure Blob Storage in parquet format.

b) Model Training
- Ingest the Train and Validation datasets from Azure Blob Storage
- Build baseline model
- Build and Train the classification model
- Perform cross-validation and hyperparameter tuning

c) Model Evaluation
- Ingest the Test dataset from Azure Blob Storage
- Run the trained model on the Test dataset
- Evaluate model
- Store predictions in Azure Blob Storage

#### 4.2 Pipeline Block Diagram
<img src="https://github.com/carla-cortez/261/blob/main/ml_diagram.png?raw=true" width="75%">

#### 4.3 Data Split Plan

The time series data from 2015 to 2021 will be split into Train, Validation, and Test datasets. We will remove the data from the year 2020 as it is considered an outlier.
- Train dataset: 2015 - 2018
- Validation dataset: 2019
- Test dataset: 2021
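The year-based split can be expressed as a small helper, sketched below. The function name is hypothetical; returning `None` marks years that are excluded, such as 2020.

```python
def assign_split(year):
    """Map a flight year to its dataset split; None means the year is excluded."""
    if 2015 <= year <= 2018:
        return "train"
    if year == 2019:
        return "validation"
    if year == 2021:
        return "test"
    return None  # 2020 (outlier year) and anything outside 2015-2021


splits = {year: assign_split(year) for year in range(2015, 2022)}
```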

For cross-validation:
We will use a rolling-window cross-validation technique for time series, with one fold per held-out year and a training window that grows by one year in each fold. For example:
- train on data from 2015 to predict 2016.  
- train on data from 2015 to 2016 to predict 2017.
- train on data from 2015 to 2017 to predict 2018.
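The folds listed above can be generated with a short sketch. The function name is hypothetical; each fold pairs all earlier years (training) with the next year (held out).

```python
def expanding_window_folds(train_years):
    """Return (training_years, held_out_year) pairs in chronological order."""
    years = sorted(train_years)
    return [(years[:i], years[i]) for i in range(1, len(years))]


folds = expanding_window_folds([2015, 2016, 2017, 2018])
for fit_years, predict_year in folds:
    print(f"train on {fit_years} to predict {predict_year}")
```

Because each fold trains only on years before the one it predicts, no future information leaks into training.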

### 5. Project Timeline

<img src="https://github.com/carla-cortez/261/blob/main/gantt_diagram.png?raw=true" width="75%">