%md
# Final Project - Flight Delays

__Team members:__
- Carla Cortez
- Redwan Hussain
- Anqi Liu
- Murray Stokely


### 1. Abstract
In the aviation industry, airlines have taken an interest in predicting flight delays because of their financial impact and their effect on customer satisfaction. Our team's goal is to analyze historical flight and weather data from 2015-2021 and build a classification model that predicts delays two hours before scheduled departure. This report outlines the steps we followed to develop and compare models within a machine learning pipeline. We chose precision and recall as our evaluation metrics, with an emphasis on recall, because reducing False Negatives means fewer passengers are caught unprepared by a delay.

The data we used included passenger flight on-time performance data from the US Department of Transportation and a weather dataset from the National Oceanic and Atmospheric Administration. Each set included datetime fields and weather station IDs that were used during the join process. We also used an existing list of IATA and ICAO codes to merge our data by airport code and time zone.

The overall project spanned four phases and consisted of the following steps: preprocessing data, joins and splits, feature engineering, model creation, hyperparameter tuning, and model evaluation. The full dataset was split based on year: 2015 through 2020 for training and 2021 for test (held-out). Our base model was developed using logistic regression and compared against random forest, gradient-boosted tree, and multilayer perceptron techniques. A Pearson correlation matrix and importance table were used to identify ws_origin_HourlyWindSpeed as the most predictive feature for our response variable, DEP_DEL15, across our experiments. Grid search was performed to find the optimal hyperparameters.

All models were trained with cross validation and tested against the held-out set. The logistic regression model had the best results with precision and recall at 0.563 and 0.594, respectively, and F1 and F2 scores at 0.579 and 0.588, respectively. The random forest model also performed well with precision and recall at 0.593 and 0.560, respectively, and F1 and F2 scores at 0.576 and 0.566, respectively.

### 2. Data description

This project relies on three main datasets.

1. **Flights data** 

This is a subset of the passenger flight on-time performance data taken from the TranStats data collection available from the U.S. Department of Transportation (DOT). There are approximately 100 features in this data set.

2. **Weather** 

This provides weather information from NOAA for the same time period 2015-2021 as the flights data.

3. **Airports** 
This provides more detailed information about the airports in the flights dataset.

#### 2.1 Table description
a) df_flights: It contains a time series of the flight schedules from the year 2015 to 2021, including:
- Time period information:
  - Flight date and time
  - Day of the week/day of the month and year
- Flight operational data:
  - Scheduled departure/arrival time
  - Departure/arrival time
  - Departure/arrival delay in minutes, calculated as the difference between the scheduled and actual departure/arrival time
  - Flight cancellation (we ignore cancelled flights in our analysis)
- Metadata associated with the flight's origin and destination airports:
  - IATA airport code which is the airport location's unique 3 letter identifier
  - Airport state and city
- Carrier information
 
b) df_neighbor_stations: It provides metadata about weather stations with neighbor airports. It includes:
- Station unique identifier
- neighbor_call corresponds to the ICAO airport code which is defined by the International Civil Aviation Organization
- Neighbor airport name, state

c) df_weather_by_station: It contains a time series of weather information per weather station, including:
- Station unique identifier
- Time period information 
- Station metadata
- Weather metrics
- Weather date lag timestamp

d) df_iata_icao_codes: External resource that contains the mapping between the IATA and ICAO unique airport codes and the airport timezone (source: https://openflights.org/data.html)

#### 2.2 Discussion of Data Leakage

Data Leakage refers to any information from outside the training set that is used to create the model. In the context of predicting flight delays, it also covers any data that would not be known to us two hours before the scheduled departure of the flight.

##### 2.2.1 Examples of Data Leakage

###### 2.2.1.1 Normalization - Using Mean Only From the training set

For many of our columns it was necessary to normalize the data.  However, it was important to only take mean values calculated from our training set.  Otherwise, we would be leaking data from our test set.

###### 2.2.1.2 Page Rank Per Year

We computed the PageRank feature separately for each year to ensure that the ranking from the test data set doesn't impact the results for the other years.

###### 2.2.1.3 Events Unknown Two Hours Before Flight Departure

This came up when looking at natural disaster data, for example. NOAA keeps records of when major named hurricanes and tropical storms made landfall in the USA. This data is readily available at daily granularity, but using it directly would be a form of data leakage, because we cannot be sure that the landfall was known, or had already happened, two hours before departure. To address this concern, we can lag this feature by a day, under the assumption that if a hurricane made landfall in a state yesterday, nearby airports may experience delays today.

#### 2.3 Table joins

a) df_neighbor_stations <> df_iata_icao_codes
- We performed an inner join between the df_neighbor_stations and df_iata_icao_codes tables, using the df_neighbor_stations.neighbor_call and the df_iata_icao_codes.icao_code columns to enhance the df_neighbor_stations table with a new corresponding IATA airport code.

b) df_flights <> df_iata_icao_codes
- We performed a left join between the df_flights and df_iata_icao_codes tables to obtain the timezone of each airport by IATA code. This is used later to convert the timestamps to UTC.

c) df_neighbor_stations <> df_flights
- Both tables were joined using the df_flights.ORIGIN and df_flights.DEST columns, which contain the IATA airport codes for the origin and destination airports, respectively, with the df_neighbor_stations.iata_code column created above. We joined the tables twice: once on df_flights.ORIGIN and df_neighbor_stations.iata_code, and once on df_flights.DEST and df_neighbor_stations.iata_code. We used inner joins because we are only interested in flights with associated weather information.
- This relationship is many-to-many because multiple flights can be mapped to multiple neighbor stations.
- With the result of these joins, we created a new table called df_flights_station that contains the origin/destination flight and airport information along with the associated weather station_id. This table was stored in Azure Blob Storage.

d) df_flights_station <> df_weather_by_station
- We joined the df_flights_station and df_weather_by_station tables using the df_flights_station.station_id and df_weather_by_station.station columns. To obtain the latest weather measurement available two hours before the scheduled flight departure, we created two columns that define the acceptable window of time (from 6 hours before the flight to 2 hours before the flight) to consider when selecting the latest weather measurement.

e) The final df_flights_weather_station table contains the flight/weather data for the origin and destination airports. It includes a column holding the previous available measurement timestamp for the weather station (lag). This was used to obtain the origin and destination lag weather measurements by matching df_flights_weather_station.ws_origin/dest_weather_lag against the weather.DATE and station_id columns.
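
The window selection described above can be sketched in plain Python. This is a minimal illustration for a single station, not the Spark join itself; the function name `latest_obs_in_window`, its parameters, and the sample timestamps are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

def latest_obs_in_window(obs_times, dep_time, min_lead_h=2, max_lead_h=6):
    """Return (latest, lag) observation times for one station.

    obs_times must be sorted ascending. `latest` is the newest observation in
    [dep_time - max_lead_h, dep_time - min_lead_h]; `lag` is the observation
    immediately before it (or None). Returns (None, None) if the window is empty.
    """
    window_end = dep_time - timedelta(hours=min_lead_h)
    window_start = dep_time - timedelta(hours=max_lead_h)
    i = bisect_right(obs_times, window_end) - 1  # newest obs at or before window_end
    if i < 0 or obs_times[i] < window_start:
        return None, None
    lag = obs_times[i - 1] if i > 0 else None
    return obs_times[i], lag

# Hypothetical hourly reports at minute 53 and a 10:30 scheduled departure.
obs = [datetime(2019, 7, 1, h, 53) for h in range(12)]
dep = datetime(2019, 7, 1, 10, 30)
latest, lag = latest_obs_in_window(obs, dep)
```

Here the window runs from 04:30 to 08:30, so the 07:53 report is selected and the 06:53 report becomes its lag. Reports after 08:30 are never used, which is what keeps the feature free of leakage.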

##### 2.3.1 Final Join Result

**Total number of rows:** 
14,104,916

**Total join time:**
It took a total of 3.38 hours to perform the final join

**Cluster specification:**
1-4 Workers: 16-64 GB Memory, 4-16 Cores
1 Driver: 16 GB Memory, 4 Cores
Runtime: 11.3.x-cpu-ml-scala2.12

<img src="https://user-images.githubusercontent.com/88794396/204161781-0651e1c4-3a41-49ca-b7c0-903a295fe1d5.png?raw=true>" width=60%>

#### 2.4 Data Cleaning and Validation

We joined several complex data sets and expected to find a number of issues requiring data cleaning.


##### 2.4.1 Excluded data
Cancelled flights are not included in the "delay" analysis. 

| Exclusion          | Number of Rows|
|--------------------|------------------|
|Cancelled flights   |     874,041      |
|Duplicated flights  |     31,746,841   |

##### 2.4.2 NULL Values

We drop all columns with more than 50% NULL values, which are identified through data summarization.
<br>From the flights dataset, only the core variables related to the airport and flight are kept.
<br>From the weather dataset, only the well-reported hourly variables are kept.
<br>For details, please refer to section 7.

##### 2.4.3 Outliers

We carefully reported any outlier data that we believed should be excluded, and excluded it only after a thorough investigation.
<br>During cross validation, we noticed that the percentage of delayed flights dropped significantly starting in March 2020, leaving the data imbalanced across years, as shown in the plot below.

<img src="https://user-images.githubusercontent.com/88794396/206003118-59f8bdef-890d-472b-8052-e828ea0d8596.png?raw=true>" width=65%>
<br>Therefore, we tried several approaches to handle this "outlier" period.
<br>We either excluded the few months that showed significant discrepancies compared to all other time periods, or carefully balanced the data by month.
<br>For details, please refer to section 6.4.



##### 2.4.4 Normalization and Scaling

We applied a min-max scaler to all features after encoding and vectorization. Nulls in each numeric variable are filled with that variable's mean, computed from the training data only.
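
To illustrate why the statistics must come from the training data only, here is a minimal plain-Python sketch of mean imputation followed by min-max scaling. The helper names `fit_impute_scale` and `transform` and the sample values are hypothetical, and this stands in for, rather than reproduces, the Spark transformers used in the pipeline.

```python
def fit_impute_scale(train_col):
    """Learn mean, min, and max from the training column only (None = null)."""
    present = [v for v in train_col if v is not None]
    return {
        "mean": sum(present) / len(present),
        "min": min(present),
        "max": max(present),
    }

def transform(col, stats):
    """Fill nulls with the training mean, then min-max scale with training bounds."""
    span = stats["max"] - stats["min"]
    out = []
    for v in col:
        v = stats["mean"] if v is None else v
        out.append((v - stats["min"]) / span if span else 0.0)
    return out

train_wind = [0.0, 10.0, None, 20.0]  # hypothetical training values
test_wind = [5.0, None, 30.0]         # hypothetical held-out values

stats = fit_impute_scale(train_wind)        # fitted on train only
train_scaled = transform(train_wind, stats)
test_scaled = transform(test_wind, stats)   # reuses the training statistics
```

Note that a held-out value larger than the training maximum scales to more than 1. That is expected: refitting the scaler on the test set to avoid it would leak test information into the features.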

#### 2.5 Class Imbalance: 

We have about a 5:1 imbalance between our "no delay" and "delay" classes. For some of our modelling work, we may need to boost the number of training examples from our "delay" category to build a predictive model.

##### 2.5.1 Balancing Input Classes:
DownSampling vs OverSampling

In order to ensure an accurate classifier, we must deal with the fact that non-delayed flights are significantly over-represented in our data set. The two approaches we explored are DownSampling the over-represented class (preferred) and SMOTE (Synthetic Minority Oversampling TEchnique).

##### 2.5.2 DownSampling
We compute the ratio of delayed to non-delayed flights for each year and use it as the sampling fraction for the majority class (non-delayed flights), giving a more balanced dataset for both EDA and modelling.
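
A minimal sketch of per-year downsampling in plain Python follows. It is an illustration under stated assumptions: it keeps an exact number of majority-class rows per year rather than applying an approximate sampling fraction, and the function name `downsample_by_year` and the toy rows are hypothetical.

```python
import random
from collections import defaultdict

def downsample_by_year(rows, seed=0):
    """rows: iterable of (year, dep_del15) pairs.

    Keeps every delayed flight and, within each year, a random subset of
    non-delayed flights of the same size.
    """
    rng = random.Random(seed)
    by_year = defaultdict(lambda: {0: [], 1: []})
    for year, label in rows:
        by_year[year][label].append((year, label))
    kept = []
    for year in sorted(by_year):
        delayed, on_time = by_year[year][1], by_year[year][0]
        n_keep = min(len(delayed), len(on_time))
        kept.extend(delayed)
        kept.extend(rng.sample(on_time, n_keep))
    return kept

# Toy data with roughly the 5:1 imbalance described above.
rows = [(2015, 1)] * 2 + [(2015, 0)] * 10 + [(2016, 1)] * 3 + [(2016, 0)] * 15
balanced = downsample_by_year(rows)
```

Sampling within each year, rather than over the whole dataset at once, keeps the class balance stable across the yearly cross-validation folds.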

The plot below shows that the data is balanced per year after applying the Downsampling technique 

<br>
<br>
<br>

<img src="https://user-images.githubusercontent.com/88794396/205867988-6eb249cd-0eef-4090-ae0d-de6eb026924e.png?raw=true>" width=80%>

#### 2.6 Feature Engineering

Joining the weather features to the flights dataset was our first round of feature engineering.
<br>After that, event-based, time-based, and graph-based features were added on top of the existing dataset.

___Delay vs Non-Delay by State___
<br>`The plot shows the number of delayed and non-delayed flights by departure state`
<br>There is no obvious difference between delayed and non-delayed flights.


___Delay vs Non-Delay by Dew Point Temp___
<br>`The plot shows the number of delayed and non-delayed flights by Dew Point Temperature`
<br>A relatively steep increase from 10 to 30 can be observed for non-delayed flights, which is less pronounced for delayed flights.


___Delay vs Non-Delay by Dry Bulb Temp___
<br>`The plot shows the number of delayed and non-delayed flights by Dry Bulb Temperature`
<br>The trends do not differ significantly between delayed and non-delayed flights.


___Delay vs Non-Delay by Relative Humidity___
<br>`The plot shows the number of delayed and non-delayed flights by Relative Humidity`
<br>The overall trends are similar for delayed and non-delayed flights, but the relative changes are not always consistent.


___Delay vs Non-Delay by Pressure___
<br>`The plot shows the number of delayed and non-delayed flights by Station Pressure`
<br>The trends do not differ significantly between delayed and non-delayed flights.

In [None]:
%sql

select 
year, quarter, month,
origin_state_abr, 
ws_origin_HourlyDewPointTemperature,
ws_origin_HourlyDryBulbTemperature,
ws_origin_HourlyRelativeHumidity,
ws_origin_HourlyStationPressure,
ws_origin_HourlyVisibility,
ws_origin_HourlyWetBulbTemperature,
ws_origin_HourlyWindSpeed,
ws_origin_OCV_CLR,
sum(dep_del15) as delay_cnt,
sum(1-dep_del15) as non_deplay_cnt
from joined_table_temp
group by year, quarter, month,
origin_state_abr, 
ws_origin_HourlyDewPointTemperature,
ws_origin_HourlyDryBulbTemperature,
ws_origin_HourlyRelativeHumidity,
ws_origin_HourlyStationPressure,
ws_origin_HourlyVisibility,
ws_origin_HourlyWetBulbTemperature,
ws_origin_HourlyWindSpeed,
ws_origin_OCV_CLR;

year,quarter,month,origin_state_abr,ws_origin_HourlyDewPointTemperature,ws_origin_HourlyDryBulbTemperature,ws_origin_HourlyRelativeHumidity,ws_origin_HourlyStationPressure,ws_origin_HourlyVisibility,ws_origin_HourlyWetBulbTemperature,ws_origin_HourlyWindSpeed,ws_origin_OCV_CLR,delay_cnt,non_deplay_cnt
2016,3,7,AK,48.0,66.0,52.0,29.81999969482422,10.0,56.0,7.0,1.0,0.0,3.0
2015,2,6,AK,20.0,27.0,75.0,29.93000030517578,10.0,25.0,17.0,1.0,0.0,1.0
2017,3,7,AK,52.0,52.0,100.0,30.059999465942383,10.0,52.0,6.0,1.0,0.0,2.0
2017,3,9,CO,43.0,45.0,93.0,24.549999237060547,10.0,44.0,7.0,1.0,1.0,7.0
2019,1,2,MN,6.0,10.0,84.0,29.489999771118164,3.0,9.0,6.0,1.0,3.0,7.0
2021,2,4,IL,47.0,51.0,86.0,28.8799991607666,9.0,49.0,8.0,1.0,3.0,59.0
2017,2,6,CO,54.0,62.0,75.0,24.770000457763672,10.0,57.0,10.0,1.0,1.0,6.0
2021,2,4,MN,28.0,42.0,58.0,29.07999992370605,10.0,36.0,15.0,1.0,7.0,23.0
2015,4,11,MO,40.0,58.0,51.0,29.40999984741211,10.0,49.0,14.0,1.0,0.0,9.0
2020,2,6,IL,44.0,78.0,30.0,29.600000381469727,10.0,59.0,14.0,0.0,0.0,1.0


Output can only be rendered in Databricks


##### 2.6.1 Time-based feature - Flight day is holiday
Our analysis strongly suggests a seasonal impact on flight delays. We created a new feature that identifies whether a flight date is considered a holiday so that we could see its impact on delays. This was accomplished by joining our main dataset with an external file of US holidays (source: https://www.kaggle.com/datasets/donnetew/us-holiday-dates-2004-2021).

<br>

The plot shows the total number of delays per date and identifies each date as a holiday or non-holiday. We can observe that the highest numbers of delays occur close to a holiday; therefore, we label as a Holiday any date within +/- 2 days of a US holiday date.
<br>
<br>
<img src="https://user-images.githubusercontent.com/88794396/205879149-9f7ec94a-e85f-43ee-a643-27a3d7e0e788.png?raw=true>" width=100%>
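
The +/- 2 day labelling rule can be sketched as follows. This is a minimal illustration; the function name `is_near_holiday` and the two sample holiday dates are hypothetical stand-ins for the joined holiday file.

```python
from datetime import date

def is_near_holiday(flight_date, holidays, window_days=2):
    """True if flight_date falls within window_days of any holiday."""
    return any(abs((flight_date - h).days) <= window_days for h in holidays)

# Two sample holidays standing in for the external holiday dataset.
holidays = {date(2019, 7, 4), date(2019, 12, 25)}

flag_jul6 = is_near_holiday(date(2019, 7, 6), holidays)    # 2 days after July 4
flag_jul7 = is_near_holiday(date(2019, 7, 7), holidays)    # 3 days after July 4
flag_dec23 = is_near_holiday(date(2019, 12, 23), holidays) # 2 days before Dec 25
```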

##### 2.6.2 Graph-based feature - Origin Airport Pagerank
We are using pagerank to create a feature that identifies the traffic per origin airports per year. 


The graph below shows the top ten airports with the highest pagerank results per year. We can observe that DFW (Dallas/Fort Worth International Airport) ranks as number 1 in 2015, 2018, 2019 and 2020, and ATL (Hartsfield-Jackson Atlanta International Airport) ranks as number 1 in 2016 and 2017.

<br>

<img src="https://user-images.githubusercontent.com/88794396/204217090-5b53689e-ed6c-4d9e-810b-723e857cd102.png?raw=true>" width=100%>
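
For readers unfamiliar with PageRank, the sketch below shows the idea with a small power iteration in plain Python. It is not necessarily the implementation used for the feature; the function name `pagerank`, the damping factor of 0.85, the iteration count, and the toy route counts are all assumptions made for illustration. To mirror the per-year feature, it would be run once on each year's routes.

```python
def pagerank(edges, damping=0.85, iters=50):
    """edges: list of (origin, dest, weight). Returns {airport: score}.

    Scores sum to 1. Each airport passes its rank to its destinations in
    proportion to the number of flights on each route.
    """
    nodes = sorted({n for o, d, _ in edges for n in (o, d)})
    out_weight = {n: 0.0 for n in nodes}
    for o, _, w in edges:
        out_weight[o] += w
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # Rank held by airports with no outgoing flights is spread evenly.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {n: base for n in nodes}
        for o, d, w in edges:
            new_rank[d] += damping * rank[o] * w / out_weight[o]
        rank = new_rank
    return rank

# Toy route counts for a single year (hypothetical numbers).
flights_one_year = [
    ("DFW", "ATL", 3), ("ATL", "DFW", 3),
    ("ORD", "DFW", 2), ("DFW", "ORD", 1), ("ORD", "ATL", 1),
]
ranks = pagerank(flights_one_year)
top_airport = max(ranks, key=ranks.get)
```

In this toy graph DFW receives the most incoming traffic from well-connected airports, so it ends up with the highest score.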

#### 2.7 Time-based features - Weather lag features

Time series analysis assumes that past events can carry information about the future. Therefore, we created 20 lag features corresponding to the hourly weather measurement taken one observation before the latest measurement available two hours prior to departure.
<br>

- ws_origin_HourlyAltimeterSetting_lag     
- ws_origin_HourlyDewPointTemperature_lag 
- ws_origin_HourlyDryBulbTemperature_lag  
- ws_origin_HourlyRelativeHumidity_lag    
- ws_origin_OCV_CLR_lag                   
- ws_origin_HourlyStationPressure_lag     
- ws_origin_HourlyVisibility_lag          
- ws_origin_HourlyWetBulbTemperature_lag  
- ws_origin_HourlyWindDirection_lag       
- ws_origin_HourlyWindSpeed_lag           
- ws_dest_HourlyAltimeterSetting_lag      
- ws_dest_HourlyDewPointTemperature_lag   
- ws_dest_HourlyDryBulbTemperature_lag    
- ws_dest_HourlyRelativeHumidity_lag      
- ws_dest_OCV_CLR_lag                     
- ws_dest_HourlyStationPressure_lag       
- ws_dest_HourlyVisibility_lag            
- ws_dest_HourlyWetBulbTemperature_lag    
- ws_dest_HourlyWindDirection_lag         
- ws_dest_HourlyWindSpeed_lag

### 3. Machine Learning Algorithms

#### 3.1 Logistic Regression

Our first ML algorithm is Logistic Regression. With a variety of regression variables available to us, we believe we can explore the feature space and build a logistic regression model that uses the source and destination city, airline, weather, seasonality, recent delay history, and many other robust features described above to classify all flights into two categories: those with a departure delay of 15 minutes or more, and those without.

##### 3.1.1 Example Simple Logistic Regression Model

$$logit(p_i) = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + ... + \beta_n x_{n,i}$$

We have a large number of features to choose from, so we will explore adding an L2 regularization term to avoid overfitting on a large set of features.
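
As a sketch of what an L2 penalty looks like, the regularized log-loss can be written as below. The symbols $N$ (number of training flights), $y_i$ (the observed DEP_DEL15 label), and $\lambda$ (the regularization strength) are notation introduced for this illustration, with $p_i$ the predicted delay probability from the model above.

$$J(\beta) = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1-y_i)\log(1-p_i)\right] + \lambda \sum_{j=1}^{n}\beta_j^2$$

Larger values of $\lambda$ shrink the coefficients toward zero, trading a little training fit for less variance. The intercept $\beta_0$ is conventionally left unpenalized, which is why the penalty sum starts at $j=1$.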

#### 3.2 Random Forests

We looked to fit a Random Forest model to this data set.  We believe a random forest will be useful because it is robust to inclusion of irrelevant features, and invariant under scaling and transformation of many of the feature values.

#### 3.3 Gradient-boosted Tree
We further looked into gradient-boosted trees, which we expected to perform better than Random Forest because each new tree helps correct the errors made by the previously trained trees.

#### 3.4 Multilayer perceptron classifier
MLPs can be applied to complex non-linear problems and can handle large input data with relatively fast performance. Therefore, we included one as one of our experiments.

#### 3.5 Hyperparameter Tuning

We chose grid search for the experiments below. Unlike random search, where only a random subset of parameter values is tested, grid search evaluates every combination of the values we specify.
<br>Since we had no prior expectations before the experiments, we wanted to set the specific values to be tested in order to better understand how each parameter affects the model's performance.
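
The exhaustive nature of grid search can be sketched in a few lines of plain Python. This is an illustration only: the function name `grid_search`, the grid values, and the stand-in scoring function `toy_score` are hypothetical, and in the actual experiments the score would come from cross-validated model evaluation.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate score_fn on every combination in param_grid; return the best."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    n_tried = 0
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        n_tried += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_tried

# Hypothetical grid; the parameter names are for illustration only.
grid = {"regParam": [0.0, 0.01, 0.1], "maxIter": [10, 50]}

def toy_score(params):
    # Stand-in for a cross-validated metric; peaks at regParam=0.01, maxIter=50.
    return -abs(params["regParam"] - 0.01) - abs(params["maxIter"] - 50) / 1000

best_params, best_score, n_tried = grid_search(grid, toy_score)
```

With 3 values for one parameter and 2 for the other, all 6 combinations are evaluated. The cost grows multiplicatively with each added parameter, which is the main trade-off against random search.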

#### 3.6 Metrics for evaluation

Precision and recall are defined in terms of True positive (TP), False positive (FP), True Negative (TN), and False Negative (FN) classifications.

Precision = $$\frac{TP}{TP+FP}$$

Recall = $$\frac{TP}{TP+FN}$$

F1 = $$\frac{2TP}{2TP+FP+FN}$$


<br>
We care more about False Negatives, so we weight recall more heavily than precision by setting beta to 2.

F_beta = $$\frac{(1+\beta^2)TP}{(1+\beta^2)TP + \beta^2FN + FP}$$

F2 = $$\frac{5TP}{5TP + 4FN + FP}$$
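
The formulas above translate directly into code. The sketch below uses hypothetical confusion-matrix counts (not results from our models) to show how F2 rewards recall more than F1 does.

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_beta(tp, fp, fn, beta=1.0):
    """F-beta from raw counts; beta > 1 weights recall more than precision."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

# Hypothetical counts chosen so that recall exceeds precision.
tp, fp, fn = 60, 40, 20
p = precision(tp, fp)          # 60 / 100
r = recall(tp, fn)             # 60 / 80
f1 = f_beta(tp, fp, fn, beta=1)
f2 = f_beta(tp, fp, fn, beta=2)
```

Because recall is higher than precision in this example, F2 comes out above F1. When precision is the higher of the two, the ordering reverses.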

### 4. Machine Learning Pipelines

#### 4.1 Pipeline Description

a) Data Engineering:
- Step 1	
  - Ingest provided parquet files containing flights, weather, and station information and create corresponding dataframes.
  - Ingest external source airport code data in parquet format and create dataframe.
  - Perform EDA.
  - Data Cleaning.
  - Create a final dataframe with the result of the joined dataframes and store it in Azure Blob Storage in parquet format.

- Step 2
  - Ingest final_dataset from  Azure Blob Storage.
  - Select features
  - Split the data into Train, Validation, and Test datasets.
  - Normalize the Train, Validation, and Test datasets using statistics from the Train dataset only.
  - Store the normalized Train, Validation, and Test datasets in Azure Blob Storage in parquet format.

b) Model Training
- Ingest the Train and Validation datasets from  Azure Blob Storage
- Build baseline model
- Build and Train the classification model
- Perform cross-validation and hyperparameter tuning

c) Model Evaluation
- Ingest the Test dataset from Azure Blob Storage
- Run the trained model on the Test dataset
- Evaluate model
- Store predictions in Azure Blob Storage

#### 4.2 Pipeline Block Diagram
<img src="https://github.com/carla-cortez/261/blob/main/ml_diagram.png?raw=true>" width=75%>

#### 4.3 Data Split Plan

The time series data from 2015 to 2021 was split into Train and Test datasets.
- Train dataset: 2015 - 2020
- Test dataset: 2021

#### 4.4 Cross Validation

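We used rolling-origin cross validation for time series: each fold trains on all years up to a cutoff and tests on the following year, so the training window grows by one year per fold (the last fold tests on 2019 and 2020 together).
<br>Please refer to the printout below for the folds used in cross validation.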

In [None]:
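from pyspark.sql import functions as F

# `spark` and `blob_url` are assumed to be defined earlier in the notebook.
trainDF_VA_full = spark.read.parquet(f"{blob_url}/trainDF_VA_V7")
trainDF_VA_pred = trainDF_VA_full.select(['DEP_DEL15','features_scaled', 'YEAR'])
# Create a dictionary of dataframes for the custom CV function to loop through.
# Each fold trains on all earlier years and tests on the following year(s).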

d = {}

d['df1'] = trainDF_VA_pred.filter(trainDF_VA_pred.YEAR <= 2016)\
                       .withColumn('cv', F.when(trainDF_VA_pred.YEAR == 2015, 'train')
                                             .otherwise('test'))
d['df2'] = trainDF_VA_pred.filter(F.col('YEAR').between(2015, 2017))\
                       .withColumn('cv', F.when(trainDF_VA_pred.YEAR <= 2016, 'train')
                                             .otherwise('test'))
    
d['df3'] = trainDF_VA_pred.filter(F.col('YEAR').between(2015, 2018))\
                       .withColumn('cv', F.when(trainDF_VA_pred.YEAR <= 2017, 'train')
                                             .otherwise('test'))
    
d['df4'] = trainDF_VA_pred.filter(F.col('YEAR').between(2015, 2020))\
                       .withColumn('cv', F.when(trainDF_VA_pred.YEAR <= 2018, 'train')
                                             .otherwise('test'))

d['df1'].groupby('YEAR','cv').count().orderBy('YEAR').show()
d['df2'].groupby('YEAR','cv').count().orderBy('YEAR').show()
d['df3'].groupby('YEAR','cv').count().orderBy('YEAR').show()
d['df4'].groupby('YEAR','cv').count().orderBy('YEAR').show()

+----+-----+-------+
|YEAR|   cv|  count|
+----+-----+-------+
|2015|train|2095014|
|2016| test|1888884|
+----+-----+-------+

+----+-----+-------+
|YEAR|   cv|  count|
+----+-----+-------+
|2015|train|2095014|
|2016|train|1888884|
|2017| test|2009356|
+----+-----+-------+

+----+-----+-------+
|YEAR|   cv|  count|
+----+-----+-------+
|2015|train|2095014|
|2016|train|1888884|
|2017|train|2009356|
|2018| test|2589118|
+----+-----+-------+

+----+-----+-------+
|YEAR|   cv|  count|
+----+-----+-------+
|2015|train|2095014|
|2016|train|1888884|
|2017|train|2009356|
|2018|train|2589118|
|2019| test|2699488|
|2020| test| 794780|
+----+-----+-------+



#### 4.5 Area Under the Receiver Operating Curve (AUROC)

The ROC curve shows the behavior of the classifier at every threshold by plotting the True Positive rate against the False Positive rate. The larger the area under the ROC curve, the better the model separates the two classes of a binary classifier.
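
One way to build intuition for AUROC is its rank interpretation: it equals the probability that a randomly chosen delayed flight receives a higher score than a randomly chosen non-delayed flight. The sketch below computes it that way in plain Python. The function name `auroc` and the sample scores are hypothetical, and the pairwise loop is only practical for small examples.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical predicted delay probabilities
area = auroc(labels, scores)
```

Of the four positive/negative pairs here, three are ranked correctly, giving an area of 0.75. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation.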

### 5. Experiments
We approach this problem from a business perspective: we want to predict delays so that we can minimize the impact on affected passengers. For this reason, the cost of a False Negative is more important to us than the cost of a False Positive.

For our business case, a False Negative means we failed to notify passengers that their flight would be delayed. Passengers are unlikely to mind if they prepared for a delay that never happened, but they would be far more upset if they were told the flight was on time and it was then delayed.
Therefore, from the passenger's perspective, we made minimizing False Negatives the primary goal of our research.

#### 5.1 Feature Selection

Although collinearity is not a major concern for tree-based models or neural networks, we removed highly correlated variables for logistic regression to help ensure convergence.

##### 5.1.1 Pearson Correlation Plot - on Training Data 

Before running logistic regression experiments, we first check the Pearson correlation matrix to understand the relationship between each feature and DEP_DEL15 (our response variable).
<br><br>Based on the plot below, a few variables are highly correlated with each other, as expected. To improve model performance, we keep only one variable from each highly correlated group (such as quarter and month) and focus on the variables with the strongest positive or negative correlation with the response variable.
<br>
<br>From the Pearson correlation matrix, we identified 8 temperature-related variables and 4 pressure-related variables, and kept only 1 from each group for logistic regression modeling.
<br>In addition, the origin station index and its city index are also highly correlated; therefore, only the station is kept.
<br>
<br>
<img src="https://user-images.githubusercontent.com/88794396/205848474-9568ecd0-856b-499e-b980-64d1c69e359e.png?raw=true>" width=50%>

##### 5.1.2 Feature Importance
Below is the Pearson correlation of each feature with the response variable. Features with higher absolute values are considered more important, especially for the logistic regression model.

|Feature | Pearson Correlation with response var|
|---|---|
|ws_origin_HourlyWindSpeed | 0.0915 |
|ws_origin_HourlyWindSpeed_lag | 0.0911 |
|ws_dest_HourlyWindSpeed_lag | 0.0849 |
|ws_dest_HourlyWindSpeed | 0.0829 |
|ws_origin_OCV_CLR_lag | 0.0726 |
|ws_origin_OCV_CLR | 0.0683 |
|ws_origin_HourlyVisibility_lag | -0.0671 |
|ws_origin_HourlyVisibility | -0.0643 |
|ws_origin_HourlyAltimeterSetting | -0.0617 |
|ws_origin_HourlyAltimeterSetting_lag | -0.0573 |
|ws_dest_OCV_CLR_lag | 0.0569 |
|ws_dest_HourlyAltimeterSetting | -0.0564 |
|ws_dest_OCV_CLR | 0.0528 |
|ORIGIN_STATE_ABR_Index | -0.0525 |
|ws_dest_HourlyAltimeterSetting_lag | -0.0521 |
|ws_dest_HourlyVisibility_lag | -0.0494 |
|ORIGIN_CITY_NAME_Index | -0.0473 |
|ws_dest_HourlyVisibility | -0.0460 |
|ORIGIN_Index | -0.0458 |
|ws_origin_HourlyDryBulbTemperature_lag | 0.0411 |
|ws_dest_HourlyDryBulbTemperature_lag | 0.0398 |
|ws_origin_HourlyWindDirection | 0.0373 |
|ws_origin_HourlyWindDirection_lag | 0.0369 |
|ws_origin_HourlyDryBulbTemperature | 0.0369 |
|ws_dest_HourlyDryBulbTemperature | 0.0358 |
|ws_origin_HourlyWetBulbTemperature_lag | 0.0340 |
|ws_dest_HourlyWindDirection | 0.0336 |
|ws_dest_HourlyWindDirection_lag | 0.0325 |
|ws_origin_HourlyWetBulbTemperature | 0.0314 |
|ws_dest_HourlyRelativeHumidity_lag | -0.0303 |
|MONTH | -0.0291 |
|ws_dest_HourlyWetBulbTemperature_lag | 0.0286 |
|ws_origin_HourlyDewPointTemperature_lag | 0.0285 |
|CRS_ELAPSED_TIME | 0.0284 |
|ws_origin_HourlyDewPointTemperature | 0.0272 |
|ws_dest_HourlyWetBulbTemperature | 0.0263 |
|DISTANCE | 0.0254 |
|DISTANCE_GROUP | 0.0252 |
|ws_dest_HourlyRelativeHumidity | -0.0249 |
|AIRPORT_ORIGIN_PAGERANK | 0.0229 |
|ws_dest_HourlyDewPointTemperature_lag | 0.0193 |
|ws_origin_HourlyRelativeHumidity_lag | -0.0183 |
|ws_dest_HourlyDewPointTemperature | 0.0182 |
|TAIL_NUM_Index | -0.0163 |
|ws_dest_HourlyStationPressure_lag | 0.0142 |
|ws_dest_HourlyStationPressure | 0.0132 |
|ws_origin_HourlyRelativeHumidity | -0.0126 |
|DEST_STATE_ABR_Index | -0.0121 |
|OP_UNIQUE_CARRIER_Index | -0.0104 |
|DEST_CITY_NAME_Index | -0.0094 |
|OP_CARRIER_FL_NUM_Index | -0.0094 |
|DEST_Index | -0.0075 |
|FL_DATE_IS_HOLIDAY_Index | -0.0038 |
|DAY_OF_MONTH | -0.0037 |
|ws_origin_HourlyStationPressure_lag | 0.0029 |
|ws_origin_HourlyStationPressure | 0.0016 |
|DAY_OF_WEEK | -0.0016 |
|YEAR | 0.0000 |
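
For reference, each value in the table is a Pearson correlation coefficient between one feature and the binary response. A minimal plain-Python sketch of that computation follows; the function name `pearson` and the sample values are hypothetical and are not drawn from our dataset.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical wind speeds and delay labels (1 = delayed 15+ minutes).
wind_speed = [3.0, 5.0, 8.0, 12.0, 20.0, 25.0]
dep_del15 = [0, 0, 0, 1, 0, 1]
r = pearson(wind_speed, dep_del15)
```

Pearson correlation only captures linear association, so a feature with a small coefficient can still be useful to a tree-based model that exploits non-linear splits.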

##### 5.1.3 Dictionary of selected features

The table below shows the dictionary of the features selected after the correlation analysis, prior to encoding and grouped by family.

|Feature by family                               |  Description                                                                      | Type |
|---------------------------------------|-----------------------------------------------------------------------------------|---|     
|  **Flight**| 
|  OP_CARRIER_FL_NUM                    |  Carrier-Flight identifier                                                        | Categorical
|  OP_UNIQUE_CARRIER                    |  Flight Carrier unique identifier                                                 | Categorical
|  TAIL_NUM                             |  Tail Number Flight tracker                                                       | Categorical
|  YEAR                                 |  Flight year                                                                      | Numerical 
|  MONTH                                |  Flight Month                                                                     | Numerical
|  DAY_OF_MONTH                         |  Flight Day of the month                                                          | Numerical
|  DAY_OF_WEEK                          |  Flight Day of week                                                               | Numerical
|  CRS_ELAPSED_TIME                     |  CRS estimated flight elapsed time                                                | Numerical                                                 
|  **Airport**|
|  ORIGIN                               |  Origin Airport IATA code                                                         | Categorical                         
|  ORIGIN_CITY_NAME                     |  Origin Airport City Name                                                         | Categorical
|  ORIGIN_STATE_ABR                     |  Origin Airport State abbreviation                                                | Categorical
|  DEST                                 |  Destination Airport IATA code                                                    | Categorical
|  DEST_CITY_NAME                       |  Destination Airport City Name                                                    | Categorical
|  DEST_STATE_ABR                       |  Destination Airport State abbreviation                                           | Categorical
|  DISTANCE                             |  Distance between airports                                                        |Numerical
|  DISTANCE_GROUP                       |  Distance Intervals, every 250 Miles, for Flight Segment                          |Numerical  
|  **Graph-based**|
|  AIRPORT_ORIGIN_PAGERANK              |  Graph feature to determine Origin Airport pagerank per year                      |Numerical
|  **Event-based**| 
|  FL_DATE_IS_HOLIDAY                   |  Flight date is within +/- 2 days of a US holiday (1 = true, 0 = false)           | Categorical         
|  **Weather Telemetry**|   
|  ws_origin_HourlyAltimeterSetting     |  Origin Weather station Hourly Altimeter measurement                              |Numerical
|  ws_origin_HourlyDewPointTemperature  |  Origin Weather station Hourly Dew Point Temperature measurement                  |Numerical
|  ws_origin_HourlyDryBulbTemperature   |  Origin Weather station Hourly Dry Bulb Temperature measurement                   |Numerical
|  ws_origin_HourlyRelativeHumidity     |  Origin Weather station Hourly Relative Humidity measurement                      |Numerical
|  ws_origin_OCV_CLR                    |  Origin Weather station Hourly Sky Code, OCV (overcast) = 1, CLR (clear) = 0      |Numerical
|  ws_origin_HourlyStationPressure      |  Origin Weather station Hourly Station Pressure measurement                       |Numerical
|  ws_origin_HourlyVisibility           |  Origin Weather station Hourly Visibility measurement                             |Numerical
|  ws_origin_HourlyWetBulbTemperature   |  Origin Weather station Hourly Wet Bulb Temperature measurement                   |Numerical
|  ws_origin_HourlyWindDirection        |  Origin Weather station Hourly Wind Direction measurement                         |Numerical
|  ws_origin_HourlyWindSpeed            |  Origin Weather station Hourly Wind Speed measurement                             |Numerical
|  ws_dest_HourlyAltimeterSetting       |  Destination Weather station Hourly Altimeter measurement                         |Numerical
|  ws_dest_HourlyDewPointTemperature    |  Destination Weather station Hourly Dew Point Temperature measurement             |Numerical
|  ws_dest_HourlyDryBulbTemperature     |  Destination Weather station Hourly Dry Bulb Temperature measurement              |Numerical
|  ws_dest_HourlyRelativeHumidity       |  Destination Weather station Hourly Relative Humidity measurement                 |Numerical
|  ws_dest_OCV_CLR                      |  Destination Weather station Hourly Sky Code, OCV (overcast) = 1, CLR (clear) = 0 |Numerical
|  ws_dest_HourlyStationPressure        |  Destination Weather station Hourly Station Pressure measurement                  |Numerical
|  ws_dest_HourlyVisibility             |  Destination Weather station Hourly Visibility measurement                        |Numerical
|  ws_dest_HourlyWetBulbTemperature     |  Destination Weather station Hourly Wet Bulb Temperature measurement              |Numerical
|  ws_dest_HourlyWindDirection          |  Destination Weather station Hourly Wind Direction measurement                    |Numerical
|  ws_dest_HourlyWindSpeed              |  Destination Weather station Hourly Wind Speed measurement                        |Numerical
|  **Time-based Weather features**|  
|  Lag features for weather telemetry   |  Weather telemetry from the previous available timestamp ~3 hours before departure | Numerical/Categorical

<br>
The table below shows the number of features per family.
<br>

| Feature Family | No. of features |
|--|--|
| Flight | 8 |
| Airport | 8 |
| Graph-based | 1 |
| Event-based | 1 |
| Weather Telemetry | 20 |
| Time-based Weather features | 20 |
| **Total** | 58 |
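
The lag features listed in the feature table can be pictured with a small lookup: for each departure, take the most recent weather observation recorded at or before a cutoff roughly three hours earlier. The sketch below is a plain-Python illustration under that assumption; the function name `lagged_observation`, the sample readings, and the exact 3-hour cutoff rule are hypothetical and are not taken from the project's join code.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

def lagged_observation(obs_times, obs_values, departure, lag=timedelta(hours=3)):
    """Return the latest observation taken at or before `departure - lag`.

    `obs_times` must be sorted ascending. Returns None when no observation
    is old enough to satisfy the lag.
    """
    cutoff = departure - lag
    idx = bisect_right(obs_times, cutoff)  # number of observations <= cutoff
    return obs_values[idx - 1] if idx > 0 else None

# Hypothetical hourly wind-speed readings for one station (06:53 ... 11:53).
times = [datetime(2019, 7, 1, hour, 53) for hour in range(6, 12)]
wind_speed = [5.0, 6.0, 8.0, 11.0, 9.0, 7.0]

departure = datetime(2019, 7, 1, 12, 30)
lagged = lagged_observation(times, wind_speed, departure)
print(lagged)  # reading from 08:53, the last one before the 09:30 cutoff
```

Using only observations older than the cutoff keeps information that would not be available at prediction time out of the feature.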

##### 5.1.4 EDA for the important variables after downsampling

In [None]:
%sql

select
MONTH,
ws_origin_HourlyWindSpeed, 
ws_origin_OCV_CLR,
ws_origin_HourlyVisibility,
ws_origin_HourlyAltimeterSetting,
sum(dep_del15) as delay_cnt,
sum(1-dep_del15) as non_delay_cnt
from trainDF_VA_full
group by MONTH,
ws_origin_HourlyWindSpeed, 
ws_origin_OCV_CLR,
ws_origin_HourlyVisibility,
ws_origin_HourlyAltimeterSetting;

MONTH,ws_origin_HourlyWindSpeed,ws_origin_OCV_CLR,ws_origin_HourlyVisibility,ws_origin_HourlyAltimeterSetting,delay_cnt,non_delay_cnt
12,9.0,0.6326512732810423,9.324322874495108,30.03218427128645,0.0,4.0
7,6.0,0.6326512732810423,6.0,29.96999931335449,22.0,28.0
2,6.0,0.0,10.0,30.07999992370605,90.0,142.0
4,8.0,0.6326512732810423,9.9399995803833,30.03218427128645,4388.0,5818.0
4,11.0,1.0,2.5,30.200000762939453,5.0,5.0
11,8.0,0.6326512732810423,8.699999809265137,30.03218427128645,102.0,116.0
10,11.0,0.6326512732810423,10.0,29.88999938964844,273.0,283.0
5,3.0,0.6326512732810423,10.0,30.059999465942383,430.0,459.0
5,7.0,1.0,10.0,30.239999771118164,37.0,42.0
4,15.0,0.6326512732810423,10.0,29.850000381469727,233.0,161.0


Output can only be rendered in Databricks


#### 5.2 Baseline

Across our entire data set, we see that 17.4% of flights are delayed more than 15 minutes from their scheduled departure. We set as our baseline predictor a model that always predicts a delay. We compute the precision and recall of this "Always Predict Delay" model, and will measure our logistic regression and other models against this baseline.

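For a given observed delay rate D, assumed to be 17.4% here, an "Always Predict Delay" strategy flags every flight, so no delayed flight is missed (recall of 1) and the share of correct positive predictions equals D. The metrics are therefore computed as

$$\text{Precision} = D, \qquad \text{Recall} = 1, \qquad F_1 = \frac{2D}{1 + D}, \qquad F_2 = \frac{5D}{4D + 1}$$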

We intend to update these metrics with our logistic regression model and other models in later phases of the project.


##### 5.2.1 Baseline results

|   | Baseline Experiment | Train | Test | Metrics - Precision | Metrics - Recall | Metrics - F1 | Metrics - F2 |
|---|---|---|---|---|---|---|---|
| Baseline | Always Predict Delay | 14,272,964 | 3,504,568 | 0.174 | 1.0 | 0.296 | 0.513 |
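
The baseline row above can be reproduced from the delay rate alone. The following is a minimal sketch, assuming only the 17.4% delay rate quoted in this section; the function name `always_positive_metrics` is illustrative.

```python
def always_positive_metrics(delay_rate):
    """Precision, recall, F1 and F2 for a classifier that flags every flight as delayed."""
    precision = delay_rate   # every flight is flagged, so TP / (TP + FP) equals the delay rate
    recall = 1.0             # no delayed flight is missed
    f1 = 2 * precision * recall / (precision + recall)
    f2 = 5 * precision * recall / (4 * precision + recall)
    return precision, recall, f1, f2

p, r, f1, f2 = always_positive_metrics(0.174)
print(f"precision={p:.3f} recall={r:.1f} F1={f1:.3f} F2={f2:.3f}")
```

Because recall is fixed at 1, this baseline sets a deceptively high F2 floor; any useful model has to trade some of that recall for a large gain in precision.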


<br>

#### 5.3 Modeling Experiments

All experiments used the 58 selected features after downsampling, encoding, vectorizing, and normalizing. To do this, we set up a pipeline that applies one-hot encoding and label encoding to our categorical variables and then uses the VectorAssembler to combine them into a single feature vector that can be used with all our models. We also ran grid search through the provided Custom Cross Validator for hyperparameter selection. The provided Custom Cross Validator uses F1 as the metric to be maximized.
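
To make the cost of grid search concrete, the sketch below enumerates every hyperparameter combination for one grid in plain Python. It uses the grid values from logistic regression experiment 1; the helper `expand_grid` is hypothetical and only mirrors what a parameter-grid builder produces, it is not the Custom Cross Validator itself.

```python
from itertools import product

def expand_grid(grid):
    """Return every hyperparameter combination in `grid` as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]

# Grid values taken from logistic regression experiment 1.
lr_grid = {
    "maxIter": [12, 15],
    "regParam": [0.003, 0.005],
    "elasticNetParam": [0.07, 0.09],
}

candidates = expand_grid(lr_grid)
print(len(candidates))   # one model fit per candidate, per cross-validation fold
for params in candidates:
    print(params)
```

The number of fits grows multiplicatively with each added hyperparameter, which is one reason the grids in the experiments below stay small.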

We used pyspark.ml.evaluation.MulticlassClassificationEvaluator to evaluate the model on the Cross Validation dataset built from the train dataset. 
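
For reference, the per-class precision and recall reported in the tables below can be computed from labels and predictions as in this plain-Python sketch. It treats "delay" (label 1) as the positive class; the function name and the toy labels are illustrative, and this is not the evaluator's internal implementation.

```python
def precision_recall(labels, predictions, positive=1):
    """Precision and recall for the `positive` class from paired labels and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example: 1 = delayed, 0 = not delayed.
labels      = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 1, 0, 1, 0]

precision, recall = precision_recall(labels, predictions)
print(precision, recall)
```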


**Cluster specifications**
- Workers: 1-10 (16-160 GB Memory, 4-40 Cores)
- Driver: 1 (16 GB Memory, 4 Cores)


##### 5.3.1 Logistic Regression

We used pyspark.ml.classification.LogisticRegression to build a model based on our training data to predict membership of the "delay" or "no delay" class.


##### 5.3.1.1 Experiment Results

| Experiment  | No. Train rows | No. Test rows |Hyperparameters| Metrics - Precision | Metrics - Recall | Metrics - F1 | Metrics - F2 |Execution time| Notes
|---|---|---|---|---|---|---|---|---|---|
| 1 |11,281,860 | 2,028,276|maxIter= [12, 15], regParam = [0.003, 0.005], elasticNetParam = [0.07, 0.09]|0.569|0.556|0.562|0.558|56 mins|Excluded all data from year 2020|
| 2 |11,607,072   | 2,028,276|maxIter= [10, 5], regParam = [0, 0], elasticNetParam = [1, 1]|0.563|0.594|0.579|**0.588**|6 mins|Excluded Mar to Dec from year 2020|
| 3 |12,076,640   | 2,028,276|maxIter= [10, 5], regParam = [0.01, 0.01], elasticNetParam = [0.9, 0.8]|0.553|0.557|0.554|0.555|10 mins|Included all year 2020|

Logistic regression was not significantly affected by the underlying data used for cross validation. The results are consistent across experiments, and the model performed well.
<br>This may be because we understand the mechanics of logistic regression better than those of the other algorithms; we therefore made sure that settings such as regularization were applied properly when we trained the model.
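
The three experiments in the table above differ only in how 2020 is treated in the training data. A minimal sketch of that row filter follows, assuming training covers 2015 through 2020 with 2021 held out; the variant labels and the function name are hypothetical.

```python
def keep_for_training(year, month, variant):
    """Decide whether a flight from (year, month) belongs in the training set.

    variant is one of "exclude_2020", "exclude_mar_dec_2020" or "include_2020"
    (hypothetical labels for experiments 1, 2 and 3).
    """
    if not 2015 <= year <= 2020:          # 2021 is the held-out test year
        return False
    if variant == "exclude_2020":
        return year != 2020
    if variant == "exclude_mar_dec_2020":
        return not (year == 2020 and month >= 3)
    if variant == "include_2020":
        return True
    raise ValueError(f"unknown variant: {variant}")

for variant in ("exclude_2020", "exclude_mar_dec_2020", "include_2020"):
    kept_months_2020 = [m for m in range(1, 13) if keep_for_training(2020, m, variant)]
    print(variant, kept_months_2020)
```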

##### 5.3.2 Best Result

Experiment number 2 had the best F2 score (0.588) of all logistic regression experiments.

**ROC curve**

<img src="https://user-images.githubusercontent.com/88794396/205852347-b47ee546-8497-4fec-8c78-8f6c41f05233.png?raw=true" width=30%>

#### 5.4 Random Forest
We used pyspark.ml.classification.RandomForestClassifier to build a random forest model based on our training data to predict membership of the "delay" or "no delay" class.

##### 5.4.1 Experiment Results

|  Experiment | No. Train rows | No. Test rows | Hyperparameters| Metrics - Precision | Metrics - Recall | Metrics - F1 | Metrics - F2 |Execution time| Notes
|---|---|---|---|---|---|---|---|---|---|
| 1 | 11,281,860| 2,028,276 |numTrees=[1, 3, 5], maxDepth = [1, 10, 20], maxBins = [60, 100, 500]|0.593 | 0.560 | 0.576 | **0.566**  |1.44 hours|Excluded all data from year 2020|
| 2 | 11,607,072| 2,028,276 |numTrees = [1, 3, 5], maxDepth = [1, 10, 20], maxBins = [60, 100, 500]|0.591 | 0.557 | 0.570 | 0.558  |1.39 hours|Excluded Mar to Dec from year 2020|
| 3 | 12,076,640| 2,028,276 |numTrees = [1, 3, 5], maxDepth = [1, 10, 20], maxBins = [60, 100, 500]|0.577 | 0.457| 0.510 | 0.477  |1.39 hours|Included all year 2020|

The random forest results are slightly affected by the data used for cross validation. The experiment that used imbalanced data for the last fold gave us the worst result in the table above.
<br>However, when we switched to the same data used for the best logistic regression run, random forest also produced a comparable result.
<br>In short, it is still an effective and robust model to use.

##### 5.4.2 Best result

Experiment number 1 had the best F2 score (0.566) of all random forest experiments.

##### 5.4.2.1 Feature Importance from Random Forest

|Feature|Rank|Importance|
|---|---|---|
|ws_origin_HourlyWindSpeed| 1 |0.0934|
|ws_dest_HourlyWindSpeed_lag| 2 |0.0911|
|ws_origin_HourlyWindSpeed_lag| 3 |0.0668|
|MONTH| 4 |0.0523|
|ws_origin_HourlyVisibility| 5 |0.0484|
|ws_origin_HourlyDryBulbTemperature_lag| 6 |0.0465|
|ws_origin_HourlyVisibility_lag| 7 |0.0439|
|ws_dest_HourlyWindSpeed| 8 |0.0416|
|ORIGIN_Index| 9 |0.0331|
|ws_origin_HourlyAltimeterSetting| 10 |0.0273|
|ws_origin_OCV_CLR_lag| 11 |0.0263|
|ws_dest_HourlyVisibility| 12 |0.0224|
|ws_origin_HourlyDryBulbTemperature| 13 |0.0205|
|ws_origin_HourlyRelativeHumidity| 14 |0.0196|
|ORIGIN_CITY_NAME_Index| 15 |0.0189|
|ws_dest_HourlyDryBulbTemperature| 16 |0.0186|
|CRS_ELAPSED_TIME| 17 |0.0177|
|ws_dest_HourlyVisibility_lag| 18 |0.0173|
|OP_UNIQUE_CARRIER_Index| 19 |0.0170|
|ws_origin_HourlyAltimeterSetting_lag| 20 |0.0161|
|ws_dest_OCV_CLR_lag| 21 |0.0153|
|ws_dest_HourlyAltimeterSetting| 22 |0.0150|
|ws_origin_HourlyWetBulbTemperature_lag| 23 |0.0149|
|ws_origin_HourlyDewPointTemperature| 24 |0.0147|
|TAIL_NUM_Index| 25 |0.0139|
|ws_dest_HourlyAltimeterSetting_lag| 26 |0.0137|
|ws_dest_HourlyRelativeHumidity| 27 |0.0134|
|ws_dest_HourlyDryBulbTemperature_lag| 28 |0.0125|
|ws_origin_HourlyWetBulbTemperature| 29 |0.0120|
|ws_origin_OCV_CLR| 30 |0.0118|
|ws_origin_HourlyDewPointTemperature_lag| 31 |0.0106|
|ws_dest_HourlyRelativeHumidity_lag| 32 |0.0104|
|DEST_Index| 33 |0.0089|
|ws_origin_HourlyRelativeHumidity_lag| 34 |0.0087|
|ORIGIN_STATE_ABR_Index| 35 |0.0083|
|DEST_CITY_NAME_Index| 36 |0.0074|
|AIRPORT_ORIGIN_PAGERANK| 37 |0.0068|
|DEST_STATE_ABR_Index| 38 |0.0061|
|DISTANCE| 39 |0.0059|
|ws_dest_HourlyWindDirection_lag| 40 |0.0058|
|DAY_OF_MONTH| 41 |0.0043|
|ws_dest_HourlyDewPointTemperature| 42 |0.0042|
|DISTANCE_GROUP| 43 |0.0037|
|ws_dest_HourlyWetBulbTemperature_lag| 44 |0.0036|
|ws_dest_HourlyStationPressure| 45 |0.0032|
|ws_dest_HourlyStationPressure_lag| 46 |0.0030|
|ws_origin_HourlyStationPressure| 47 |0.0029|
|ws_origin_HourlyStationPressure_lag| 48 |0.0029|
|ws_dest_HourlyDewPointTemperature_lag| 49 |0.0029|
|ws_origin_HourlyWindDirection_lag| 50 |0.0022|
|ws_dest_HourlyWetBulbTemperature| 51 |0.0020|
|ws_dest_HourlyWindDirection| 52 |0.0019|
|ws_dest_OCV_CLR| 53 |0.0018|
|DAY_OF_WEEK| 54 |0.0016|
|OP_CARRIER_FL_NUM_Index| 55 |0.0015|
|ws_origin_HourlyWindDirection| 56 |0.0014|
|FL_DATE_IS_HOLIDAY_Index| 57 |0.0010|
|YEAR| 58 |0.0005|
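
One way to read the table above is to look at how quickly importance accumulates down the ranking. The sketch below does this for the top ten rows, copied from the table; the helper name `cumulative_importance` is illustrative.

```python
# Top ten rows of the feature importance table (feature name, importance).
top_features = [
    ("ws_origin_HourlyWindSpeed", 0.0934),
    ("ws_dest_HourlyWindSpeed_lag", 0.0911),
    ("ws_origin_HourlyWindSpeed_lag", 0.0668),
    ("MONTH", 0.0523),
    ("ws_origin_HourlyVisibility", 0.0484),
    ("ws_origin_HourlyDryBulbTemperature_lag", 0.0465),
    ("ws_origin_HourlyVisibility_lag", 0.0439),
    ("ws_dest_HourlyWindSpeed", 0.0416),
    ("ORIGIN_Index", 0.0331),
    ("ws_origin_HourlyAltimeterSetting", 0.0273),
]

def cumulative_importance(features):
    """Yield (rank, name, running total) for features sorted by descending importance."""
    total = 0.0
    ranked = sorted(features, key=lambda item: item[1], reverse=True)
    for rank, (name, importance) in enumerate(ranked, start=1):
        total += importance
        yield rank, name, total

rows = list(cumulative_importance(top_features))
for rank, name, running in rows:
    print(f"{rank:2d}  {name:40s} {running:.4f}")
```

The importance values of these ten features sum to about 0.54, and eight of the ten are weather measurements or their lags.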


##### 5.4.2.2 ROC Curve
<img src="https://user-images.githubusercontent.com/88794396/205857559-5a5b53da-02e1-45a7-9d0c-3038b94d5976.png?raw=true" width=30%>

#### 5.5 Gradient-boosted Tree
We used pyspark.ml.classification.GBTClassifier to build a gradient-boosted tree model based on our training data to predict membership of the "delay" or "no delay" class.


##### 5.5.1 Experiment Results
| Experiment  | No. Train | No. Test | Hyperparameters| Metrics - Precision | Metrics - Recall | Metrics - F1 | Metrics - F2 |Execution time| Notes
|---|---|---|---|---|---|---|---|---|---|
| 1 | 13,377,810 | 2,221,195 |maxIter=[1, 2]|0.643 | 0.026 | 0.050 | 0.033 |43 mins|Data was not well downsampled for year 2020|
| 2 | 12,076,640| 2,028,276 |maxIter=[10, 20]| 0.499 | 0.213 | 0.298 | 0.240 |56 mins|Included all year 2020|
| 3 | 11,607,072| 2,028,276 |maxIter=[50, 100]| 0.552 | 0.489 | 0.518 | **0.500** |20 mins|Excluded Mar to Dec from year 2020|

We conducted only a few grid searches with the gradient-boosted tree model; however, performance improved significantly once we had balanced data for cross validation.
<br>There is still considerable room to improve our gradient-boosted tree model, as we would expect it to be more predictive than random forest if trained properly. Due to time and computing power constraints, we did not run many experiments with it, but we did see its potential and would explore it further given more resources.

##### 5.5.2 Best result 
Experiment number 3 had the best F2 Metric = 0.500 from all Gradient-boosted Tree experiments

##### 5.5.2.1 ROC curve
<img src="https://user-images.githubusercontent.com/88794396/205848620-619452f9-1cce-474f-ab9c-ac10af304ae2.png?raw=true" width=30%>

#### 5.6 Multilayer perceptron classifier - NN

By default, the nodes in the output layer use the softmax function and the nodes in the intermediate layers use the sigmoid function. It was not clear from the documentation whether these defaults could be changed to use different activation functions. Hence, all our experiments share a similar architecture. We used an output layer with 2 softmax nodes for the binary classification.
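
The architecture described above (sigmoid hidden layers, a 2-node softmax output) can be illustrated with a small forward pass in plain Python. This is a sketch only: the weights are random rather than trained, the layer sizes are one of the configurations from our grid, and none of the helper names come from the Spark API.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(values):
    shift = max(values)                        # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, weights, biases):
    """Forward pass: sigmoid on hidden layers, softmax on the output layer."""
    activation = x
    last = len(weights) - 1
    for i, (w, b) in enumerate(zip(weights, biases)):
        z = [sum(wi * a for wi, a in zip(row, activation)) + bi for row, bi in zip(w, b)]
        activation = softmax(z) if i == last else [sigmoid(v) for v in z]
    return activation

layers = [58, 16, 8, 2]                        # 58 inputs, two hidden layers, 2 output nodes
rng = random.Random(0)                         # random weights, for illustration only
weights = [[[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
           for n_in, n_out in zip(layers, layers[1:])]
biases = [[0.0] * n_out for n_out in layers[1:]]

x = [rng.random() for _ in range(layers[0])]   # one scaled feature vector
probs = forward(x, weights, biases)
print(probs)                                   # probabilities for "no delay" and "delay"
```

The two softmax outputs always sum to 1, so the predicted class is simply the node with the larger probability.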


##### 5.6.1 Experiment Results

| Experiment  | No. Train rows | No. Test rows | Hyperparameters| Metrics - Precision | Metrics - Recall | Metrics - F1 | Metrics - F2 |Execution time| Notes
|---|---|---|---|---|---|---|---|---|---|
| 1 | 11,281,860 | 2,028,276 |n_layers = [[58, 16, 8, 2], [58, 20, 10, 2]], maxIter = [50, 60], blockSize = [10,10], 16 sigmoid - 2 Softmax|0.572 | 0.505 | 0.537 | 0.518 |1.10 hours|Excluded all data from year 2020|
| 2 | 11,607,072 | 2,028,276 |n_layers = [[58, 16, 8, 2], [58, 20, 10, 2]], maxIter = [50, 60], blockSize = [128, 128], 16 sigmoid - 2 Softmax|0.563 | 0.537 | 0.550 |**0.542** |55 mins|Excluded Mar to Dec from year 2020|
| 3 | 12,076,640 | 2,028,276 |n_layers = [[58, 16, 8, 2], [58, 20, 10, 2]], maxIter = [50, 60], blockSize = [128, 128], 16 sigmoid - 2 Softmax|0.551 | 0.533 | 0.542| 0.537 |24 mins|Included all year 2020|

The multilayer perceptron classifier produced consistent results across the different datasets used for cross validation. <br>Although the impact of the data is not material, the multilayer perceptron outperformed only the gradient-boosted tree model among the machine learning algorithms we compared. Additionally, due to resource constraints, we were not able to run the model for more iterations, which could potentially improve the results.
<br>This also suggests that we could train this model better with more solid knowledge of this classifier.

##### 5.6.2 Best result

Experiment number 2 had the best F2 score (0.542) of all multilayer perceptron classifier experiments.

##### 5.6.2.1 ROC Curve
<img src="https://user-images.githubusercontent.com/88794396/205860890-fd94dedc-d1d5-4c8f-81f7-3fa21c16dd63.png?raw=true" width=30%>

#### 5.7 Best Model Results Summary

|ML Algorithm| No. Train rows | No. Test rows | No. Features| Hyperparameters| Metrics - Precision | Metrics - Recall | Metrics - F1 | Metrics - F2 |Execution time|  Notes
|---|---|---|---|---|---|---|---|---|---|---|
| Logistic Regression |11,607,072   | 2,028,276|58| maxIter= [10, 5], regParam = 0, elasticNetParam = 1|0.563|0.594|0.579|**0.588**|6 mins|Excluded Mar to Dec from year 2020|

The logistic regression model gave us the best result in a relatively short execution time. For this research specifically, logistic regression is the best fit in terms of both predictive performance and scalability.

### 6. Results 

___Discussion___

In phase 1, we started off with a simple baseline predictor model, which was set to always predict a delay, and a simplified logistic regression model using only the Month feature. The baseline model was built on the observation that 17.4% of flights are delayed more than 15 minutes. The initial logistic regression model, on the other hand, showed that each month had a delayed flight percentage between 5% and 25%, which indicated to us that additional features were required to improve the predictive performance. 
 
As we continued experiments from phases 2 through 4, we saw gradual improvement as we revised the training data. Downsampling and min-max scaling were fundamental to our success. The overall experiment confirmed our assumption that excluding March through December 2020 would yield the best results, given the impact of COVID-19 on the airline industry. While the logistic regression model produced the best result, the random forest and multilayer perceptron classifiers were close in F2 score. It is noteworthy that the best logistic regression run took significantly less time to execute: 6 minutes, compared with 20 minutes to 1.44 hours for the best runs of the other models. However, repeated runs of the same experiment completed with different run times, so execution time may not be a reliable metric for success.
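
Downsampling and min-max scaling can be sketched in plain Python as follows. This assumes the majority class is reduced to roughly a 1:1 ratio with the delayed class, which is an assumption for illustration rather than the exact ratio used in our pipeline; the function names and toy rows are hypothetical.

```python
import random

def downsample_majority(rows, label_key="DEP_DEL15", seed=42):
    """Randomly drop non-delayed rows until both classes are roughly the same size."""
    delayed = [r for r in rows if r[label_key] == 1]
    on_time = [r for r in rows if r[label_key] == 0]
    keep_fraction = min(1.0, len(delayed) / len(on_time)) if on_time else 0.0
    rng = random.Random(seed)
    kept = [r for r in on_time if rng.random() < keep_fraction]
    return delayed + kept

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                    # constant column: map to the midpoint of the range
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Toy data with the same 17.4% delay rate as the full dataset.
rows = [{"DEP_DEL15": 1}] * 174 + [{"DEP_DEL15": 0}] * 826
balanced = downsample_majority(rows)
n_delayed = sum(r["DEP_DEL15"] for r in balanced)
print(n_delayed, len(balanced) - n_delayed)

print(min_max_scale([5.0, 10.0, 15.0]))
```

Sampling each majority row independently gives an approximately balanced set without needing an exact count, which suits distributed data.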


___Gap Analysis___

True positives and false negatives are important to the business problem: the former represent successfully detected flight delays and the latter represent undetected delays. Both affect whether customers miss their flights, which has further financial implications for the airline company. For this reason, we evaluate precision and recall, with more emphasis on recall, and therefore use the F2 score (or F_beta, where beta = 2).
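
The F_beta score mentioned above combines precision and recall into one number, treating recall as beta times as important as precision. A minimal sketch, using the rounded precision and recall of the best logistic regression run as inputs:

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: recall is treated as `beta` times as important as precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Rounded precision and recall of the best logistic regression run (experiment 2).
precision, recall = 0.563, 0.594
print(f"F1 = {f_beta(precision, recall, 1):.3f}")
print(f"F2 = {f_beta(precision, recall, 2):.3f}")
```

Because the inputs here are already rounded to three decimals, the recomputed scores can differ from the reported ones in the last digit.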

According to the leaderboard, many teams that are using precision-recall scored between 0.5 and 0.6 for both metrics with a similar number of records. Given that recall is more relevant for our team's goal, our current model can be considered reliable and we are also in good standing when benchmarked against our peers.

### 7. Conclusion

<br>___Project Focus___

The focus of our project is to accurately predict a flight delay by analyzing flight and weather data from 2015-2021. This analysis is important in order to proactively account for these delays and plan accordingly, allowing us to reduce costs associated with rerouting flights and the impact on our passengers.

<br>___Hypothesis___

Our main hypothesis states that a combination of weather conditions two hours before flight departure and route features can accurately forecast a flight delay greater than 15 minutes. 


<br>___Future / Thoughts___

Overall, the project was successful, but there is room for improvement. We were able to create multiple models that are close in predictive performance, and our evaluation metrics suggest that we can correctly predict a flight delay more than 50% of the time. According to the gap analysis, our model is consistent with those developed by our peers.

One of the challenges during this project was the limited cluster size, which led to slow run times during the join process. Because of this, we were not able to replicate experiments that had the same inputs within a similar execution time. Also, due to time constraints, we were not able to complete an ensemble of the best predictions per model, which could potentially return better results.

Data from 2022 was not included because the year was not yet complete when this project ended. We downloaded a sample of flight data from January 2022 through September 2022 to understand how it might affect future models. Shown below is the percentage of flights that were delayed per month; on close examination, it follows a seasonal trend similar to that of 2015-2021, despite any increase in flight activity after COVID-19 restrictions were lifted. Additionally, the number of delayed versus non-delayed flights per state follows a similar pattern. We therefore do not expect this data to cause unexpected changes in our classifier models. Due to time constraints, exploratory data analysis was not performed on 2022 weather information. We expect ws_origin_HourlyWindSpeed to remain the most predictive variable.

Our final recommendation is to continue this work in a future project with a larger cluster for better scaling, more recent flight data, additional event-based features, and more time for planning models.

Percentage of Flights Delayed per State in 2022

<img src="https://user-images.githubusercontent.com/88794396/206003292-1ccf9003-8043-49f0-b6e5-91790339a902.png?raw=true" width=65%>

### 8. Project Timeline
<img src="https://user-images.githubusercontent.com/88794396/206005854-faaf1f2c-e5ba-4674-9a55-77dcd10fc3de.png?raw=true" width=65%>