# Flight Prediction - Phase 3

### Credit Assignment Plan
| Member's Name | Tasks | Start&End Date | Hours |
| --- | --- | --- | --- |
| Vinh Bui | Implemented hyperparameter tuning using HyperOpt, improved time series cross-validation, built the ML pipeline, contributed to writing the report | Apr 10 - 20 | 40 |
| Derek Dewald | Validated EDA on Complete Dataset, Finalized FE, Generated Data Sets, Recreated 5 Year Dataset, Finalized EDA Comments, FE Comments, Explored Creating 2022-2024 Dataset | April 20 | 50 |
| Chi So | Abstract, Gap analysis, Model evaluation, Metric, PageRank, ML Algorithms, Code Validation | April 8-21 | 41 |
| Michael B | ML Algorithms, Metric, Report Discussion, Results, Conclusion | April 8-21 | 35 |


## Abstract

Our project centers on predicting flight delays within a 2-hour window, crucial for optimizing airline operations. Utilizing a comprehensive dataset comprising flight schedules, airline details, weather conditions, and delay records, we conduct exploratory data analysis (EDA), feature selection, engineering, and parameter tuning before model training. For instance, engineering features like previous flight arrival times enhance predictive accuracy. While logistic regression, random forests, and neural networks via PySpark are employed, XGBoost is excluded due to ongoing distributed system experimentation, rendering it unsuitable for production. Our baseline logistic regression model achieves over an 80% F-beta score, underscoring its predictive capability. Evaluation primarily considers F-Beta with a beta value of 0.5 to ensure model reliability. Subsequent steps involve utilizing EDA to finalize the desired model, testing optimal feature composition, implementing pipelines for model optimization, and fine-tuning. Ultimately, our research aims to furnish airlines with actionable insights for operational enhancement and empower passengers with refined travel planning capabilities, thereby elevating the efficiency and convenience of air travel while minimizing disruptions.



## Evaluation Metrics

To evaluate the performance of our model, we utilize the F-beta score. This metric is particularly advantageous for datasets with imbalanced classes, such as those involving flight delays. The F-beta score allows for flexibility in emphasizing either precision or recall by adjusting the beta value according to business needs. For this project, we have selected a beta value of 0.5, prioritizing precision over recall because accurate prediction of events, such as flight delays, is considered more critical than the ability to capture all possible instances of delays. The F-beta score is calculated using the following formula:

$$ F_{\beta} = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{(\beta^2 \cdot \text{precision}) + \text{recall}} $$
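
As a minimal sketch of the formula above (plain Python, not the project's PySpark evaluator; the function names are illustrative), the score can be computed from confusion-matrix counts:

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision and recall from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    denominator = beta**2 * precision + recall
    if denominator == 0:
        return 0.0  # both precision and recall are zero
    return (1 + beta**2) * precision * recall / denominator
```

With beta = 0.5, a model with precision 0.8 and recall 0.5 scores higher than one with precision 0.5 and recall 0.8, which is the behavior we want when false alarms are costlier than missed delays.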


## Time series cross validation

In addition to the F-beta score, we implement time series-based cross-validation to evaluate our training score. Traditional cross-validation methods are not suitable for time series data as they could disrupt the temporal sequence, leading to misleading model assessments. This specialized approach ensures that the temporal integrity of the data is maintained throughout the model evaluation process.

<img src ='https://vinhqbui.github.io/261-final-project/Images/2020-03-27-image1.png' width="1200" height="700">

To enhance the relevance of more recent data, which is often more reflective of current conditions and trends, we apply a weighting scheme to the F-beta scores of more recent observations. The weighted score calculation is as follows:

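$$
\text{score} = \frac{\sum_{i=1}^{n} (F_{\beta, i} \times i)}{\sum_{i=1}^{n} i}
$$

Here $F_{\beta, i}$ is the F-beta score of the $i$-th data chunk in chronological order and $n$ is the number of chunks. This method assigns increasing weights to more recent data chunks, thereby aligning the model's evaluation with the most pertinent and timely information available.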
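
A minimal sketch of this weighted average, assuming the per-chunk F-beta scores are supplied oldest first (the function name is illustrative):

```python
def weighted_cv_score(fold_scores):
    """Weighted mean of fold scores where fold i (1-based, oldest first) has weight i."""
    if not fold_scores:
        raise ValueError("at least one fold score is required")
    weights = range(1, len(fold_scores) + 1)
    return sum(s * w for s, w in zip(fold_scores, weights)) / sum(weights)
```

For three folds scoring 0.6, 0.7, and 0.8, the weighted score is (0.6·1 + 0.7·2 + 0.8·3) / 6, which is pulled toward the most recent fold compared with the plain mean of 0.7.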


## Data
<br>
<div align="center">
  <img src ='https://derek-dewald.github.io/DATASCI261_storage/data_sets.png' width="800" height="800">
</div>
<br>

The primary dataset used in the project comprises approximately 32 million records of domestic commercial airline flights in the United States of America during the period 2015 - 2019. It is a rich collection of flight data, including date, carrier, destination airport, city, expected departure time, arrival time, delays, and time in air. In total there are 109 fields representing various dimensions and attributes of each individual flight. The file used for this analysis was sourced from dbfs:/mnt/mids-w261/datasets_final_project_2022/parquet_airlines_data/, a consolidation of monthly data warehoused and owned by the Department of Transportation.


## Exploratory Data Analysis

Our approach to prediction starts from the belief that we must first clearly define, and then understand, the problem:

##### Can we predict 2 hours in advance whether a scheduled departure will be delayed by more than 15 minutes?

Flight departure delays are not determined by data; they are determined by process, people, and circumstance. To predict effectively we need to explore the intersection of these items, and only once we understand it can we identify what data might be necessary for prediction. With these influences in mind, we identified a number of factors that we believed warranted comprehensive consideration and analysis: the Airline, Route, Airport, Individual Plane, Time (hour, day of week, day of month, day of year, and month of year), and Weather.

Having conceptualized the problem, we sought to understand the data more intimately. To accomplish this we created a series of artifacts, including Dictionaries and Dataframes, which can be found in the attached Phase 2 Workbook. Specifically, we iterated over every column, returning counts of the number of populated observations, number of delayed flights, and number of unique observations for each variable. We encourage review of the code and dictionaries, which contain more detailed information.

Many assumptions were necessary in the processing and transformation of the data. Some were more operationally risky than others, while others were relatively straightforward. We have attempted to document them directly in the code; however, one overriding assumption concerns the completeness of the data. We have trusted the Department of Transportation to provide complete and fair datasets, an assumption that is nearly impossible to validate. While we have no reason to doubt the completeness, it is critically important that this risk is understood, as the analysis is contingent on the data. An example of a less risky assumption concerns the consistent use of mapping codes (OP_CARRIER, OP_CARRIER_AIRLINE_ID, OP_UNIQUE_CARRIER): we have explicitly stated our belief in the dictionary and built in controls to test a number of assumptions, ensuring consistency as the datasets incrementally grew in size (evidenced in the Phase 2 Workbook).

Our analysis started with understanding the airlines included within the dataset, the total number of flights, and their relative on-time performance.

<br>
<img src ='https://derek-dewald.github.io/DATASCI261_storage/eda_3.png' width="500" height="500">
<br>

As illustrated above, not all airlines are equal: differences in both quality (% Delays) and quantity (Number of Flights) are immediately evident. We further sought to understand how airline performance over time might impact our analysis. As shown below, we see both variability and the emergence and demise of a number of smaller airlines during our time period.

<br>
<img src ='https://derek-dewald.github.io/DATASCI261_storage/airline_perf_year.png' width="500" height="500">
<br>

Having confirmed our belief that airlines are an important feature, we sought to understand more about their performance and identified two interrelated features, routes and airports, that we believe would be meaningful in capturing the relationship and improving prediction. To understand this relationship incrementally, we started with the economics principle of ceteris paribus and sought to understand impacts in isolation. We took a random sample of routes and compared how different airlines performed on the same route: if airlines performed similarly on individual routes, the variance in overall performance might be driven more by the routes they flew than by how they performed.

<br>
<img src ='https://derek-dewald.github.io/DATASCI261_storage/eda_2.png' width="500" height="500">
<br>

As evidenced, performance on the same route differed markedly over the five-year period (a pattern largely mirrored across all samples reviewed). This is sufficient evidence to suggest that routes are also an important feature for foundational consideration.


Extending the review while holding airport constant, we sampled a number of small, medium, and large airports, looking at the performance of all routes (with distinct analyses for departing and arriving flights) over the period to search for potential variances in performance. While the mean value for individual airports appears relatively similar (as expected), differences were evident, specifically around the edges, and appear to warrant inclusion for further testing.

<br>
<img src ='https://derek-dewald.github.io/DATASCI261_storage/eda_4.png' width="500" height="500">
<br>

Finally, in looking at the combination of Airline, Airport, and Route, we sought to understand whether the phenomenon of "Make it up in the Air" was true, that is, whether there is a material difference in departure delay based on the length of travel. Initially this was a contested variable within the group: some members believed that because departure (our focus) is the initial event, subsequent events such as travel time are irrelevant, as they occur after the fact. While this argument is interesting, we will later discuss and review evidence suggesting that what happened in the past strongly impacts the future, so we continued forward. Future work might include lagging the distance travelled to capture the distance of the previous flight; however, given the operating model of the airline industry (using the same plane to travel back and forth between the same 2 points), simplifying assumptions can be made.

<br>
<img src ='https://derek-dewald.github.io/DATASCI261_storage/eda_1.png' width="500" height="500">
<br>

Continuing our review of pertinent features, we look to time, which has many dimensions that require consideration, including hour of day, day of week, day of month, and month of year. Looking at hour of the day as an example, we can clearly see an increasing trend in both total number of flights and expected delay when aggregated across all travel. However, when we review individual airlines or individual airports, the relationship varies by location. This raises the question of how best to capture the feature within the model; specifically, we must be increasingly mindful of what is being captured and how. This is a theme we will review within feature engineering and revisit in both next steps and the gap analysis.

<br>
<img src ='https://derek-dewald.github.io/DATASCI261_storage/activity_by_time.png' width="500" height="500">
<br>

With respect to the final variable, weather, we reviewed its impact across a number of dimensions, and how to treat it became a topic of discussion and debate. Intuitively no one argued against the negative impact of adverse weather on travel. The question of how explicitly to capture it is complicated, and our group ultimately maintains that it is adequately captured through other variables. Without getting into details before they have been presented (they will be discussed in Feature Engineering), we believe that the impact of weather is felt in two material locations (at the airport and during travel) and that exclusion from the model is appropriate for both, but for different reasons.







## Feature Engineering

As discussed in the Exploratory Data Analysis, we identified several factors that could be meaningful when predicting delays. Accordingly, we needed to consider various ways to incorporate these into our models, and determine whether the data was readily available, could be sourced, or how it might be derived.

During the EDA, we considered the complex relationships among routes, airlines, and airports. While we initially analyzed these relationships sequentially, we now aim to more dynamically decouple the structure for modeling purposes.


<br> Airlines <br>

Ultimately, we believe the impact of individual airlines manifests in two distinct ways on performance. Firstly, it is based on the operational effectiveness of processes and procedures, and secondly, on the portfolio of individual routes serviced by the airlines. While we considered many different ways to capture the impact of operational effectiveness for individual carriers, given limited data and insights into their inner workings and composition, a binary relationship via one-hot encoding was determined as the optimal way to capture the uniqueness of operational efficiency. We believe this will also capture a portion of the variability related to route composition. However, we have explored implementing PageRank to further capture this impact.

<br> Airport <br>

In considering how an individual airport might impact delays, it is essential to examine the operational requirements of the airport and identify potential challenges. These include facilities, hangars, the number of terminals, and what we argue to be the most critical factor: the number of active runways. FAA regulations stringently control traffic volume to prevent potential crashes and minimize the effects of wake turbulence. Given the limited physical space, this constraint is not flexible in the short term. As such, our group believes that by modeling the recently departed and pending flight schedule at the time of prediction, we can provide critical input into our model, which will increase prediction power.
We propose to model this using four variables. We assume this information is readily available at prediction time, which we believe to be a sound assumption for all but one of the variables:

<br>FLIGHTS_SCHEDULED_2HRS_OR_LESS_BEFORE_CRS_DEP<br>
The total number of flights which were scheduled to depart the airport in the 2 hours leading into our prediction. <br>

<br>FLIGHTS_DELAYED_2HRS_OR_LESS_BEFORE_CRS_DEP<br>
The total number of flights which were delayed at the airport in the 2 hours leading into our prediction <br>

<br>FLIGHTS_SCHEDULED_2HRS_BEFORE_PREDICITION<br>
The total number of flights which are scheduled to depart during the 2-hour period between prediction and scheduled departure.

<br>FLIGHTS_DELAYED_2HRS_BEFORE_PREDICITION * <br>
The total number of flights which were delayed during the 2-hour period between prediction and scheduled departure. <br>
* This variable was not included in the model because it is not available at prediction time. However, we believe there is an opportunity to enhance our model in the future by building it, along with the other 3, into a comprehensive attribute that could be assigned to an airport to dynamically build a unique capacity attribute.<br>

By including these variables, we believe we drastically simplify the required model. They measure the capacity of the airport to handle demand spikes relative to its steady-state capacity, and by capturing how the airport has been performing in near real-time, they increase predictive power and remove ambiguity. For example, the impact of a sudden or unexpected adverse weather event at an airport will manifest as a decrease in performance, which the model can identify as anomalous (an increase in delays in the past 2 hours, relative to the total number of flights). Because the past two hours have performed worse than expected, the model can then infer that the next two hours will be negatively impacted. We believe that this logic also holds for event-based risks, such as holidays, major events (Super Bowl, Stanley Cup Playoffs (Go Hockey!)), or lesser-known but equally popular events, such as regional World Pokémon Go Fest events.
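
The window logic behind these airport variables can be sketched in plain Python. This is an illustration under stated assumptions, not the project's PySpark implementation: the record keys (`origin`, `crs_dep`, `delayed`) and the output names are hypothetical, and the sketch assumes the delay status of flights scheduled before the prediction time is already known.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=2)


def airport_window_counts(flights, origin, crs_dep):
    """Count same-airport activity around the prediction time of one flight.

    flights: iterable of dicts with hypothetical keys
             "origin" (str), "crs_dep" (datetime), "delayed" (bool).
    """
    prediction_time = crs_dep - WINDOW         # we predict 2 hours before departure
    lookback_start = prediction_time - WINDOW  # start of the "recent past" window

    scheduled_before = delayed_before = scheduled_pending = 0
    for flight in flights:
        if flight["origin"] != origin:
            continue
        dep = flight["crs_dep"]
        if lookback_start <= dep < prediction_time:
            # Scheduled in the 2 hours leading into the prediction.
            scheduled_before += 1
            # Assumption: delay status of these flights is known by now.
            if flight["delayed"]:
                delayed_before += 1
        elif prediction_time <= dep < crs_dep:
            # Scheduled between the prediction and our own departure.
            scheduled_pending += 1

    return {
        "scheduled_before": scheduled_before,
        "delayed_before": delayed_before,
        "scheduled_pending": scheduled_pending,
    }
```

The ratio of `delayed_before` to `scheduled_before` is one way to express how the airport is performing relative to its recent load, which is the signal the paragraph above relies on.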

<br> Routes <br>


As discussed in the EDA, routes showed a uniqueness that is complex, as it appears less related to distance and more related to other circumstances. For example, as reviewed in the notebook, despite nearly identical distances and similar flight times, performance on SF - JFK and JFK - SF is materially worse than on LA - JFK and JFK - LA. We considered tackling this with one-hot encoding; however, given the sheer number of destinations (315) and the desire to capture direction within the model, PageRank would be more appropriate than one-hot encoding. That being said, given our desire to also preserve the uniqueness of each airline, we experienced difficulties developing and implementing the model at scale, due to both logical constraints and processing performance. As such, we were not able to fully capture this relationship in the models and will strive to explore it in future iterations (please refer to the notebook discussing our attempted PageRank implementation, which represents considerable and valued development and experimental work).

<br>Individual Plane Performance<br>

During the EDA, we considered the relevant attributes at the individual plane level and determined that arrival was not directly applicable, as it occurs after our prediction. However, what previously happened to the plane (i.e., did it arrive late at the airport where it is currently grounded) could be insightful. To see why, consider the financial model of airlines, specifically their costs. The capital cost of planes is significant (likely the greatest expense next to employees), so maximizing utilization optimizes return on investment, and airlines are highly incentivized to keep planes full of passengers in the air. Minimizing idle time often means aggressive turnaround times from offboarding to onboarding, so a late arrival can cause issues for the next departure. However, we must be mindful that not all delays are equal: as we have already seen, travel between the hours of 12 am - 6 am is limited, so it stands to reason that arrival delays for flights arriving in the late evening or early morning are less consequential. With that in mind, we have brought forward the following variables:

<br>PREVIOUS_FLIGHT_ARRIVED_LATE<br>
A binary flag indicating whether the previous trip of the plane arrived at the airport late. The variable was created by bringing forward the departure delay from the previous flight using the function create_lagged_column, which accounted for technical challenges (specifically, ensuring that only information related to the explicit TAIL_NUM was carried forward).

<br>PLANE_FORECAST_TURNAROUND_TIME<br>
The time between the arrival of the previous flight and the next scheduled departure. This variable is expected to provide additional context and to capture the expected material increase in the magnitude of delays in short turnaround situations.

<br>PREVIOUS_DIVERTED<br>
A binary flag identifying whether the previous flight was diverted. This variable has been included for a similar rationale as the above: if a plane has a narrow turnaround window and it did not land at the previous destination, the potential for it to leave on time is minimal (depending on timing it could be re-routed back, or passengers could be moved onto different flights, but it is an operational complexity).
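
The lagging idea behind these three variables can be sketched as follows. This is a plain-Python illustration, not the project's `create_lagged_column` function: the input keys (`tail_num`, `crs_dep`, `arr_time`, `arr_delay_min`, `diverted`) and the 15-minute lateness threshold are assumptions, and the sketch does not check whether the previous flight's outcome is already known 2 hours before departure.

```python
from datetime import datetime


def add_previous_flight_features(flights, late_threshold_min=15):
    """Attach previous-flight features per tail number.

    flights: list of dicts with hypothetical keys "tail_num", "crs_dep" (datetime),
             "arr_time" (datetime), "arr_delay_min" (number), "diverted" (bool).
    Returns new dicts in scheduled-departure order; the first flight of each
    plane has None for all three features.
    """
    last_by_tail = {}
    enriched = []
    for flight in sorted(flights, key=lambda f: f["crs_dep"]):
        prev = last_by_tail.get(flight["tail_num"])  # only the same plane's history
        row = dict(flight)
        if prev is None:
            row["PREVIOUS_FLIGHT_ARRIVED_LATE"] = None
            row["PLANE_FORECAST_TURNAROUND_TIME"] = None
            row["PREVIOUS_DIVERTED"] = None
        else:
            gap = flight["crs_dep"] - prev["arr_time"]
            row["PREVIOUS_FLIGHT_ARRIVED_LATE"] = int(
                prev["arr_delay_min"] > late_threshold_min
            )
            row["PLANE_FORECAST_TURNAROUND_TIME"] = gap.total_seconds() / 60  # minutes
            row["PREVIOUS_DIVERTED"] = int(prev["diverted"])
        enriched.append(row)
        last_by_tail[flight["tail_num"]] = flight
    return enriched
```

Keying the lookup on the tail number is what keeps one plane's history from leaking into another's, the technical challenge noted above.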

<img src ='https://github.com/ChiBerkeley/chiberkeley.github.io/blob/main/pipeline.png?raw=true'>


### Data Augmentation

For categorical features, we implement one-hot encoding universally. While decision tree-based algorithms can handle categorical data natively, logistic regression and neural networks require categorical variables to be converted into a format they can interpret effectively. To ensure compatibility across all model types we plan to use, one-hot encoding is applied to all categorical features. Additionally, we standardize numerical features to improve model performance by aligning the scale of the data.
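
To make the two transformations concrete, here is a minimal plain-Python sketch of one-hot encoding and standardization. It only illustrates the idea; the function names are hypothetical, and the sketch uses the population standard deviation, which may differ from the scaler used in the actual pipeline.

```python
from statistics import mean, pstdev


def fit_one_hot(values):
    """Learn a category -> position index from the training values."""
    return {category: i for i, category in enumerate(sorted(set(values)))}


def one_hot(value, index):
    """Encode one value; categories unseen during fitting become all zeros."""
    vector = [0.0] * len(index)
    if value in index:
        vector[index[value]] = 1.0
    return vector


def fit_standardizer(values):
    """Learn the mean and (population) standard deviation from training values."""
    mu = mean(values)
    sigma = pstdev(values)
    return mu, (sigma if sigma > 0 else 1.0)  # avoid dividing by zero


def standardize(value, mu, sigma):
    """Rescale a value to zero mean and unit variance under the fitted statistics."""
    return (value - mu) / sigma
```

Fitting both transformations on the training window only, and then applying them unchanged to later data, keeps information from the test period out of the features.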

Furthermore, we have decided against using upsampling or downsampling strategies commonly applied to balance class distributions. Time series data requires a different approach to sampling because traditional methods could disrupt the temporal sequence, leading to potential biases in the model training process. Given that the class imbalance in our dataset is not pronounced, and considering that tree-based algorithms are inherently more tolerant to imbalanced data, we conclude that modifying the sampling of the training set is unnecessary at this stage. More consideration and analysis would be required to implement such techniques appropriately for time series data.


### Input features
'HourlySkyConditions', 'HourlyWetBulbTemperature', 'HourlyStationPressure', 'HourlyWindDirection', 'HourlyRelativeHumidity', 'HourlyWindSpeed', 'HourlyDewPointTemperature', 'HourlyDryBulbTemperature', 'HourlyVisibility', 'CRS_ELAPSED_TIME', 'ACTUAL_ELAPSED_TIME', 'OP_UNIQUE_CARRIER', and created feature "PREV_FLIGHT_DELAY"

### Loss Function
For **Logistic Regression**, we use Binary Cross-Entropy Loss with elastic net regularization.
$$
H(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} [y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)]
$$
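
A small sketch of this loss, together with an elastic net penalty in its commonly used form (a mix of L1 and L2 terms controlled by a mixing parameter). The parameter names `reg_param` and `alpha` are illustrative assumptions, not taken from our code:

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


def elastic_net_penalty(weights, reg_param, alpha):
    """reg_param * (alpha * L1 + (1 - alpha) / 2 * L2^2) over the model weights."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (alpha * l1 + (1 - alpha) / 2 * l2_squared)
```

With `alpha = 0` the penalty is pure ridge (L2) and with `alpha = 1` it is pure lasso (L1); the regularized objective is the sum of the two functions.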

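**Gradient Boosted Trees** minimize the log loss below, with labels $y_i \in \{-1, +1\}$ and $F(x_i)$ the ensemble's raw prediction. **Random Forest** does not minimize an explicit loss function; in MLlib its trees choose splits by impurity (Gini by default).

$$\text{Log Loss} = 2 \sum_{i=1}^{N} \log\left(1 + \exp(-2 y_i F(x_i))\right)$$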

The Multilayer Perceptron Classifier (MLPC) is a classifier based on a **feedforward artificial neural network**. Its multilayer architecture lets it learn intricate non-linear relationships within datasets, enabling more precise predictions. Our network comprises an input layer, a hidden layer of ten neurons, and an output layer of two neurons. The input-to-output relationship within the MLPC can be written in matrix form as:

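$$y(x) = f_K(\dots f_2(w_2^T f_1(w_1^T x + b_1) + b_2) \dots + b_K)$$

for a network with $K + 1$ layers, where $w_k$ and $b_k$ are the weights and biases of layer $k$ and $f_k$ is its activation function.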

Activation function for intermediate layers is **Sigmoid**:
$$f(z_i) = \frac{1}{1 + e^{-z_i}}$$

Activation function for output layer is **Softmax**:
$$f(z_i) = \frac{e^{z_i}}{\sum_{k=1}^{N}e^{z_k}}$$
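
The forward pass of such a network, with one sigmoid hidden layer and a softmax output, can be sketched in plain Python. This is an illustration only (MLlib performs the real computation); the weight layout of one row per output neuron is an assumption of this sketch.

```python
import math


def sigmoid(z):
    """Logistic activation used for hidden layers."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    """Softmax over a list of scores, shifted by the max for numerical stability."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def dense(x, weights, biases):
    """Affine layer: weights holds one row of input weights per output neuron."""
    return [
        sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
        for row, b in zip(weights, biases)
    ]


def forward(x, hidden_w, hidden_b, out_w, out_b):
    """Class probabilities for input x through one hidden layer."""
    hidden = [sigmoid(z) for z in dense(x, hidden_w, hidden_b)]
    return softmax(dense(hidden, out_w, out_b))
```

The two output probabilities always sum to one, and the predicted class is the index with the larger probability.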



### Experiments

We adopted a bottom-up approach, beginning with a minimal set of features and incrementally adding more to evaluate their impact on model performance.
Initially, we utilized only two features: `PREVIOUS_FLIGHT_ARRIVED_LATE` and `OP_CARRIER_AIRLINE_ID`. Below is the performance:

#### Previous delayed and carrier id

| Model                 | Train CV | Test  |
|-----------------------|----------|-------|
| Logistic Regression   | 0.5102   | 0.84  |
| Random Forest         | 0.517    | 0.7277|
| Gradient Boosted Trees| 0.5146   | 0.8099|
| Deep Neural Network   | 0.5144   | 0.8095|

We then added weather data to see how much performance would improve if we included the weather-related features with few null values.
We can see that they add little predictive power to the models.

#### With weather features
| Model | Train CV Score | Test Score |
| ---------------------- | ------ | ------ |
| Logistic Regression    | 0.7673 | 0.8065 |
| Random Forest          | 0.6764 | 0.7278 |
| Gradient Boosted Trees | 0.7775 | 0.8113 | 
| Deep Neural Network    | 0.7731 | 0.8113 | 

After evaluating weather-related and pagerank features, which did not improve our model, we excluded them from our final configuration.

Additional features were then introduced, including `MONTH`, `ORIGIN`, `DEST`, `PREVIOUS_DIVERTED`, `PLANE_FORECAST_TURNAROUND_TIME`, `FLIGHTS_DEPARTED_2HRS_BEFORE_PREDICTION`, `FLIGHTS_DELAYED_2HRS_BEFORE_PREDICTION`, and `FLIGHTS_SCHEDULED_2HRS_OR_LESS_BEFORE_CRS_DEP`. These enhanced our predictive accuracy.

We used default hyperparameters from the MLlib library as a baseline, followed by hyperparameter tuning with HyperOpt over a 12-month dataset:
- **Logistic Regression**: Tuned `elasticnet` parameter.
- **Random Forest**: Best parameters found were:
  - `{maxBins: 46, maxDepth: 20, minInfoGain: 0.9670683775019292, minInstancesPerNode: 2, numTrees: 239}`
- **GBT**: Best parameters found were:
  - `{'maxBins': 126, 'maxDepth': 11, 'minInstancesPerNode': 3, 'stepSize': 0.27286063089767754, 'subsamplingRate': 0.6445718010290064}`

- **Multilayer Perceptron**: The best-performing architecture was: [639Sigmoid, 7Sigmoid, 2Softmax]  
  
| Model                 | Train CV | Test  |
|-----------------------|----------|-------|
| Logistic Regression   | 0.7294   | 0.7498|
| Random Forest         | 0.6765   | 0.7138|
| **Gradient Boosted Trees** | **0.8880**   | **0.9011** |
| Deep Neural Network   | TBD      | 0.8767|

GBT proved to be the best performer, trained on data from 2015 to 2018 and tested with 2019 data, achieving an F0.5 score of **0.891**.

#### Findings
GBT did not scale efficiently across multiple executors in our distributed setting, a significant limitation given our expectation of parallel processing benefits in PySpark environments. Each tree in the GBT model is dependent on the previous tree, constraining the ability to parallelize the model training effectively. This sequential dependency contrasts sharply with models like Random Forest, where trees are built independently and can be distributed across multiple nodes.

Additionally, the weather-related data did not provide as much predictive power as expected. The probable cause for this is the manner in which weather data are recorded. Advanced technological capabilities allow us to predict adverse weather conditions a few days in advance. Consequently, if flights are expected to be delayed due to weather disruptions, passengers are likely notified well before the 2-hour window prior to departure. This advance notification could explain why weather data failed to enhance our model's predictive accuracy.

Regarding the use of PageRank, we believe that our current approach lacks predictive power due to improper formulation of edge weights in the graph. It is crucial to invest more time in refining how we construct and weight the graphs. Additionally, executing PageRank on a dataset spanning three years is proving to be excessively time-consuming. Addressing this challenge will require a focused effort to optimize our computational strategies and potentially streamline the dataset for more efficient processing.


### Cluster Size and Training Time
**Cluster Configuration**: 224 GB RAM, 64 cores CPU

**Training and Testing Time**:
- Logistic Regression: 10 mins
- Random Forest Classifier: 8 mins
- Gradient Boosted Trees: 2 hrs
- Deep Neural Network (1 hidden layer): 1.2 hrs

#### Gradient Boosted Trees in PySpark
In PySpark, the sequential nature of Gradient Boosted Trees limits their scalability in distributed environments. Unlike Logistic Regression or Random Forest, where computations or tree-building can be distributed across multiple nodes concurrently, GBT requires each tree to be built sequentially. This makes it challenging to leverage the full computational power of a cluster, leading to longer training times and less efficient use of resources.


### Train test CV

Given the sequential nature of time series data, traditional cross-validation techniques are unsuitable as they risk violating the temporal order of the dataset. To preserve this order and ensure accurate model evaluation, it is essential to employ time series cross-validation using a sliding window approach.
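
One way to generate sliding-window splits is sketched below. It is a generic illustration rather than our pipeline code: the function name and parameters are hypothetical, and it assumes the rows are already sorted chronologically so that positions can stand in for time.

```python
def sliding_window_splits(n_samples, train_size, val_size, step=None):
    """Yield (train_indices, val_indices) ranges over chronologically sorted rows.

    Each validation block immediately follows its training block, and the
    window then slides forward by `step` rows (default: val_size).
    """
    if train_size <= 0 or val_size <= 0:
        raise ValueError("train_size and val_size must be positive")
    step = step or val_size
    start = 0
    while start + train_size + val_size <= n_samples:
        train_idx = range(start, start + train_size)
        val_idx = range(start + train_size, start + train_size + val_size)
        yield train_idx, val_idx
        start += step
```

For example, `sliding_window_splits(10, 4, 2)` produces three folds, and in every fold all training positions precede all validation positions, so the model is never validated on data older than what it was trained on.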




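To aggregate the F-beta score across cross-validation folds, we use a weighted average instead of the mean, so that folds closer to the current date carry higher weight:

$$
\text{score} = \frac{\sum_{i=1}^{n} (F_{\beta, i} \times i)}{\sum_{i=1}^{n} i}
$$

where $F_{\beta, i}$ is the F-beta score of fold $i$, with folds numbered from oldest to most recent.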




### Train Test split
We utilize the initial three quarters of the dataset as the training set, reserving the final quarter for testing purposes.
By fitting the preprocessing steps beforehand, we aim to minimize redundant computations and ensure consistent feature transformations across the training and testing datasets.

We also count the number of features after transformation, since the input layer of the Multilayer Perceptron Classifier (MLPC) must match that count.
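
A chronological three-quarters split can be sketched as below. The function name and the `key` argument are hypothetical; the sketch assumes each row carries a sortable time field.

```python
def chronological_split(rows, key, train_fraction=0.75):
    """Sort rows by time and cut once: earliest rows train, latest rows test."""
    ordered = sorted(rows, key=key)
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]
```

Applied to twelve months of data keyed by month, this puts months 1 to 9 in the training set and months 10 to 12 in the test set. Any preprocessing statistics should then be fitted on the first part only and reused on the second.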

## Hyperparameters Tuning

Hyperparameter tuning plays a crucial role in our development. We automate it for three classification algorithms: Logistic Regression, Random Forest, and Gradient Boosted Trees (GBT). The tuning code systematically explores combinations of model parameters to identify the configuration that optimizes the model's performance metrics: F-beta, precision, and recall. This automated process helps fine-tune the models for better predictive accuracy and generalization on unseen data.
We perform hyperparameter tuning with time series cross-validation, which avoids using future data to predict earlier outcomes.

## Logistic Regression (Baseline Model)

After hyperparameter tuning, we see that an elastic net parameter of 0.1 works best (so far). Here we rely on the default standardization of Logistic Regression.

## Random Forest

Tree-based methods do not require standardization, so we can fit the model directly.

## Gradient Boosted Decision Trees

GBT is another tree-based ML algorithm. Unlike Random Forest, which builds its trees in parallel, GBT uses boosting: each new tree is built to correct the errors of the previous trees.

## Multilayer Perceptron

The Multilayer Perceptron is the model reported as "Deep Neural Network" in our results tables.

## Results

The results are taken from the **Experiments** tab in Databricks. Cross-validation is performed with 3 folds.
 
#### 2 Features
| Model | Train CV Score | Test Score |
| --- | --- | --- |
| Logistic Regression | 51.02% | 84% |
| Random Forest | 51.7% | 72.77% |
| Gradient Boosted Trees | 51.46% | 80.99% | 
| Deep Neural Network | 51.14% | 80.95% | 

#### 14 Features
| Model | Train CV Score | Test Score |
| ---------------------- | ------ | ------ |
| Logistic Regression    | 76.73% | 80.65% |
| Random Forest          | 67.64% | 72.78% |
| Gradient Boosted Trees | 77.75% | 81.13% | 
| Deep Neural Network    | 77.31% | 81.13% | 

## Discussion

In the context of time series machine learning, leakage occurs when information from the future is inadvertently included in the model. This leads to overly optimistic performance metrics and unreliable predictions on new data that was not used to build the model, so leakage results in poor performance when deployed.

Let's consider a hypothetical example with flight delay prediction: Suppose we're building a model to predict flight delays based on historical data. We have features like weather conditions, airport congestion, and departure time. Leakage could occur if, during feature engineering, we unknowingly include future information. For instance, adding the actual departure delay as a feature. This would lead the model to effectively "cheat" by using information that wouldn't actually be available at the time of prediction.
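
A simple guard against this kind of leakage is to record when each feature value becomes known and compare it with the prediction cutoff. The sketch below is hypothetical (it is not part of our pipeline, and the feature names in the usage note are made up for illustration):

```python
from datetime import datetime, timedelta


def leaking_features(feature_times, crs_dep, horizon=timedelta(hours=2)):
    """Return the names of features that only become known after the cutoff.

    feature_times: dict mapping feature name -> datetime the value is available.
    The cutoff is the scheduled departure minus the prediction horizon.
    """
    cutoff = crs_dep - horizon
    return sorted(name for name, known_at in feature_times.items() if known_at > cutoff)
```

For a 12:00 departure the cutoff is 10:00, so a weather observation from 09:30 passes the check while the actual departure delay, known only after 12:00, is flagged.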

In the implementation of our machine learning pipeline, we've taken meticulous steps to ensure there is no data leakage, maintaining the integrity of our predictions during both training and testing phases. Our newly engineered features are crafted solely from data available before the prediction time, preserving the predictive power of our model.
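A leakage-safe feature can be sketched as follows. This is not our actual feature code: it assumes predictions are made 2 hours before scheduled departure, and the 4-hour lookback, the function name, and the `(actual_departure, was_delayed)` record layout are all hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: the prediction is made 2 hours before scheduled departure.
PREDICTION_LEAD = timedelta(hours=2)


def delayed_flights_known_at_prediction(history, scheduled_departure,
                                        lookback=timedelta(hours=4)):
    """Count delayed departures observed in the window that ends at prediction time.

    `history` is a list of (actual_departure, was_delayed) tuples. Anything
    that happened at or after the prediction time is ignored, so the feature
    never sees the future.
    """
    prediction_time = scheduled_departure - PREDICTION_LEAD
    window_start = prediction_time - lookback
    return sum(
        1
        for actual_departure, was_delayed in history
        if was_delayed and window_start <= actual_departure < prediction_time
    )


# Made-up records for a flight scheduled at 12:00 (prediction time 10:00).
history = [
    (datetime(2019, 1, 1, 5, 0), True),    # too old: before the lookback window
    (datetime(2019, 1, 1, 7, 30), True),   # counted
    (datetime(2019, 1, 1, 8, 0), False),   # on time: not counted
    (datetime(2019, 1, 1, 9, 59), True),   # counted
    (datetime(2019, 1, 1, 10, 30), True),  # future relative to prediction time
]
feature_value = delayed_flights_known_at_prediction(history, datetime(2019, 1, 1, 12, 0))
```

The strict `< prediction_time` bound is what enforces the cutoff; a window bounded by the scheduled departure itself would leak the last two hours of information.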

The classification of flight delays is imbalanced, and we address this challenge in two ways. First, we evaluate our models with a weighted F-Beta score with Beta = 0.5, which weights precision more heavily than recall. Second, our final model, Gradient Boosted Trees, is adept at handling such class imbalances, which further supports its use in real-world applications.
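To make the metric concrete, here is a small pure-Python sketch of F-Beta and a support-weighted average across classes. It is an illustration under our own assumptions (function names and sample labels are hypothetical), not the evaluator used in our pipeline.

```python
def f_beta(precision, recall, beta=0.5):
    """F-Beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta < 1 favors precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


def weighted_f_beta(y_true, y_pred, beta=0.5):
    """Average the per-class F-Beta scores, weighting each class by its support."""
    total = 0.0
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total += (tp + fn) * f_beta(precision, recall, beta)
    return total / len(y_true)


# Made-up labels: 1 = delayed, 0 = on time.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
score = weighted_f_beta(y_true, y_pred)
```

Swapping precision and recall shows the effect of Beta = 0.5: `f_beta(0.8, 0.5)` is higher than `f_beta(0.5, 0.8)`, so a model is rewarded more for avoiding false alarms than for catching every delay.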


## Summary and Conclusion


In conclusion, this flight delay prediction project used currently available flight data to achieve an F-beta score of 90%, prioritizing precision over recall. After experimenting with several machine learning algorithms, including Random Forest, Gradient Boosted Trees (GBT), and Neural Networks, we determined that the GBT model yielded the highest performance. The feature engineering also considered external factors such as holidays, major events, and airline-specific operations. The experiment was designed to avoid data leakage and to follow machine learning best practices. This project demonstrates that historical flight data can be used effectively for delay prediction, and it underscores the importance of careful model selection and design in achieving accurate and reliable results in the aviation industry.

Looking ahead, the deployment of this model in real-world scenarios holds significant promise for improving operational efficiency and passenger satisfaction in the aviation industry. Moreover, expanding the model to include additional features, such as social media sentiment analysis or aircraft maintenance schedules, could further refine its accuracy and applicability. By continuously refining and updating the model, we aim to create a powerful tool that supports airlines and airports in making informed decisions, ultimately leading to smoother and more efficient flight operations.



## Gap Analysis
Feature importance analysis plays a crucial role in understanding the underlying mechanisms of predictive models. By discerning which features have the most significant impact on model outcomes, we gain insights into the factors driving predictions and can prioritize resources accordingly.

In our predictive model, we identified several features that contribute to the prediction of flight delay. On closer examination, however, one feature, 'PLANE_FORECAST_TURNAROUND_TIME,' dominates, accounting for approximately 63% of the total feature importance. It is a pivotal factor in determining whether a flight will depart late, because the time at which the aircraft's previous flight arrived is highly correlated with whether its next scheduled departure is delayed.

<img src ='https://github.com/ChiBerkeley/chiberkeley.github.io/blob/main/FI.png?raw=true' width="500" height="500">



While 'PLANE_FORECAST_TURNAROUND_TIME' holds substantial importance, it's essential to acknowledge the presence of other features in our model. The additional features, such as 'FLIGHTS_DELAYED_2HRS_BEFORE_PREDICTION' and 'FLIGHTS_SCHEDULED_2HRS_OR_LESS_BEFORE_CRS_DEP,' despite their lower individual contributions, might still carry valuable information. However, some features, particularly those resulting from one-hot encoding, exhibit sparse distributions. This sparsity can obscure their contribution to the model and may warrant further investigation into the encoding process or potential feature reduction techniques. The large number of sparse features also increases time and space complexity, and we observed that model training time suffered on the large dataset. Therefore, we should consider adding a dimensionality reduction technique such as PCA, or graph centrality analysis, to remove the majority of weak features.
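Before reaching for PCA or graph centrality, a simpler filter could serve as a first pass: keep the highest-importance features until their combined share crosses a threshold. The sketch below is that simpler alternative, not one of the techniques named above. Only the 0.63 share comes from our analysis; the other importance values, the `sparse_one_hot_*` names, and the threshold are made up for illustration.

```python
def select_by_cumulative_importance(importances, threshold=0.9):
    """Keep top-ranked features until their normalized shares sum to `threshold`."""
    total = sum(importances.values())
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    kept, running = [], 0.0
    for name, value in ranked:
        if running >= threshold:
            break
        kept.append(name)
        running += value / total
    return kept


# Hypothetical importances; only the 0.63 value is taken from our results.
importances = {
    "PLANE_FORECAST_TURNAROUND_TIME": 0.63,
    "FLIGHTS_DELAYED_2HRS_BEFORE_PREDICTION": 0.20,
    "FLIGHTS_SCHEDULED_2HRS_OR_LESS_BEFORE_CRS_DEP": 0.10,
    "sparse_one_hot_a": 0.04,
    "sparse_one_hot_b": 0.03,
}
kept = select_by_cumulative_importance(importances, threshold=0.9)
```

With these made-up numbers the two sparse columns are dropped while the three named features are kept.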

In addition to the core features mentioned above, our next model should consider incorporating supplementary variables such as weather conditions, and other flight information. While these features may not individually exert significant influence, their collective impact enhances the predictive capacity of the model by capturing contextual nuances and external factors that influence the model prediction. Furthermore, leveraging information from sources such as the Joint On-Time Performance Weather (OTPW) dataset provides valuable insights into weather patterns and their impact on flight operations. By integrating weather-related features, such as temperature, precipitation, and wind speed, our model could gain a more comprehensive understanding of the environmental factors.



The following figure illustrates the confusion matrix generated by our best-performing model when applied to the one-year dataset. The matrix provides a comprehensive overview of the model's predictive performance, particularly in distinguishing between on-time and delayed flights.
From the confusion matrix, several key observations emerge. Firstly, there is a notable class imbalance, with a significantly larger number of on-time flights compared to delayed flights in the ground truth dataset. This imbalance can skew the model's predictions, leading to a bias towards classifying instances as on-time. The imbalance in label distribution, particularly the scarcity of delayed flight entries, poses a significant challenge for the model. To address this issue, strategies such as oversampling techniques or adjusting class weights during model training may be employed to mitigate the impact of class imbalances and improve the model's ability to accurately classify delayed flights.

<img src ='https://github.com/ChiBerkeley/chiberkeley.github.io/blob/main/download.png?raw=true' width="500" height="500">
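One way to adjust class weights, as suggested in the discussion of the confusion matrix, is inverse-frequency weighting. The sketch below is a minimal version under our own assumptions: the function name and the 8-to-2 label split are hypothetical and do not reflect the actual class ratio in our data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """weight(c) = n_samples / (n_classes * count(c)), so rarer classes weigh more."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Made-up labels: 0 = on time (majority), 1 = delayed (minority).
labels = [0] * 8 + [1] * 2
weights = balanced_class_weights(labels)
row_weights = [weights[label] for label in labels]
```

The per-row weights sum to the number of rows, so the overall scale of the loss is unchanged while each delayed flight counts four times as much as an on-time one in this example. Such weights could be supplied as a weight column where the estimator supports one.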


To illustrate the implications of feature importance on model performance, consider the following misclassification scenario. Despite the dominance of the 'PLANE_FORECAST_TURNAROUND_TIME' feature in our predictive model, misclassifications can still occur due to the complex interplay of multiple variables.
In an attempt to visualize the patterns of misclassification, we employed t-SNE (t-distributed Stochastic Neighbor Embedding) to reduce the dimensionality of the feature space and visualize the data points. However, our analysis revealed a lack of clear separation between correctly classified examples and both kinds of misclassified examples (misclassified as delayed and misclassified as not delayed). This suggests that misclassified instances are not easily explainable based on our current set of features alone.

<img src ='https://github.com/ChiBerkeley/chiberkeley.github.io/blob/main/tsne.png?raw=true' width="500" height="500">



These misclassification instances may occur due to the presence of confounding variables or unforeseen interactions between features. For example, certain combinations of weather conditions, flight schedules, and airport logistics may lead to outcomes that deviate from the model's predictions, contributing to misclassification errors. By conducting in-depth analyses of misclassified examples and identifying patterns or commonalities among them, we can iteratively enhance the model's robustness and predictive accuracy.

## Appendix

#### Sourced Data

Department of Transportation <br> https://www.transtats.bts.gov/DL_SelectFields.aspx?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr <br><br>
NOAA: National Oceanic and Atmospheric Administration <br>https://www.ncei.noaa.gov/data/local-climatological-data/archive/ <br><br>
Airport Codes <br> https://datahub.io/core/airport-codes <br><br>




### Notebooks







#### Gap Analysis
https://adb-4248444930383559.19.azuredatabricks.net/?o=4248444930383559#notebook/2340268507058565/command/2340268507075308

#### Generation of Combined Dataset
https://adb-4248444930383559.19.azuredatabricks.net/?o=4248444930383559#notebook/2926108717727786/command/2340268507073628


#### Exploratory Data Analysis
https://adb-4248444930383559.19.azuredatabricks.net/?o=4248444930383559#notebook/3295227484016242/command/1093733074358004


#### ML Model
https://adb-4248444930383559.19.azuredatabricks.net/?o=4248444930383559#notebook/2926108717726595
