# Portfolio: Commercial Aviation Fatalities
**Mallory Cotton-Whitehorn**

This notebook is the portfolio for my Capstone in Data Science. It synthesizes my analysis and findings from the Kaggle competition Reducing Commercial Aviation Fatalities and links to the other Kaggle notebooks that contain my analysis of the dataset.

# Project Description

The project I analyzed is the Kaggle competition Reducing Commercial Aviation Fatalities.

Most flight-related incidents stem from pilots' loss of "airplane state awareness" during flight. Loss of airplane state awareness is poor attention management caused by distraction, drowsiness, or other dangerous cognitive states.

This dataset uses real physiological data. In order to get this data, pilots were subjected to various distracting events.

Each pilot was placed in three different cognitive states while the physiological data was collected: Channelized Attention (CA), Diverted Attention (DA), and Startle/Surprise (SS). At any given time, a pilot is either in exactly one of these states or in baseline, which means no event.

Three Cognitive States Defined:
1. Channelized Attention (CA) = State of being focused on one task with the exclusion of all others.

   *Benchmarked by playing a puzzle-based video game.*

2. Diverted Attention (DA) = State of being diverted by actions or thought processes associated with a decision.

   *Benchmarked by periodically solving a math problem before returning to the task.*

3. Startle/Surprise (SS) = An unexpected event that causes a mild shock.

   *Benchmarked by watching movie clips with jump scares.*


**Goal** = My goal in this project is to build a model to detect these pilots' different states of cognitive awareness. 


# Exploratory Analysis

When I started this project, I initially had to look at the data to visualize its main characteristics and how I would approach this dataset. I browsed through the data to see how big the data set was and to get an actual look at the labels within the dataset. Below is a brief look into some of my initial findings while exploring the training dataset.

In [None]:
# Standard library
import gc
import itertools
import os
import warnings

# Numerical, plotting, and utility libraries, using the standard aliases
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
from tqdm.notebook import tqdm

# scikit-learn preprocessing, model selection, and metrics
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, log_loss

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Silence warnings and set the plot style
warnings.simplefilter(action='ignore')
sns.set_style('whitegrid')

This dataset has 28 columns and 4,867,421 rows.

In [None]:
train = pd.read_csv('../input/reducing-commercial-aviation-fatalities/train.csv')
train = train.sample(frac=1, random_state=1)
print(train.shape)

Below is the list of column names in the original dataset provided in the Kaggle competition.

In [None]:
for col in train.columns:
    print(col)

# States of Awareness Labeled

The different states of awareness are labeled in the dataset with the letters A through D.
* A = Baseline
* B = SS
* C = CA
* D = DA
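
For plots and reports, it can help to translate the letter codes into readable names. The sketch below is a minimal illustration on a toy DataFrame; the `EVENT_NAMES` dictionary and the `event_name` column are hypothetical names that do not appear in the original notebooks.

```python
import pandas as pd

# Mapping from the dataset's letter codes to the states listed above.
EVENT_NAMES = {
    "A": "Baseline",
    "B": "Startle/Surprise (SS)",
    "C": "Channelized Attention (CA)",
    "D": "Diverted Attention (DA)",
}

# Toy stand-in for the `event` column of train.csv (values are made up).
toy = pd.DataFrame({"event": ["A", "C", "A", "D", "B"]})

# Add a readable label next to each letter code.
toy["event_name"] = toy["event"].map(EVENT_NAMES)
print(toy["event_name"].value_counts())
```

The same `map` call should work on the full `train['event']` column, assuming it only contains the four letters above.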

In [None]:
plt.figure(figsize=(15,10))
sns.countplot(x='event', data=train)
plt.xlabel("State of the pilot", fontsize=15)
plt.ylabel("Count", fontsize=12)
plt.title("Target Labels", fontsize=20)
plt.show()

The graph below shows how often each state of awareness, including the baseline state, occurs within each experiment.

In [None]:
plt.figure(figsize=(15,10))
sns.countplot(x='experiment', hue='event', data=train)
plt.xlabel("State of Awareness and Experiment of the Pilot", fontsize=12)
plt.ylabel("Count (log)", fontsize=12)
plt.yscale('log')
plt.title("Target Labels for Different Experiments", fontsize=15)
plt.show()

**For a more in-depth look at the exploratory analysis stage of this project, please view the notebook linked below.**


***EDA Notebook: https://www.kaggle.com/code/mallorycw/eda-reducing-commercial-aviation-fatalities?scriptVersionId=94677397***

# Model Training


After exploring my data, I proceeded with this project by training two different models. Below is a summary of the models I trained.

To set up the training data, I started by storing the "event" column, which is what I am predicting, in its own variable. I then dropped the following columns from the training dataset: crew, experiment, time, seat, and event. The columns other than event were dropped because they are not needed for prediction. All of the remaining columns became the input features. Finally, I did an 80/20 split and trained on only the 20% portion of the data.
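
The preparation steps described above can be sketched as follows. This is a minimal illustration on synthetic data, not the code from the linked notebook: the dropped column names come from the text, while the variable names (`features`, `target`), the use of `train_size=0.2` to obtain the 20% training portion, and `random_state=1` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for train.csv with a handful of its columns.
rng = np.random.default_rng(1)
n = 100
toy = pd.DataFrame({
    "crew": rng.integers(1, 9, n),
    "experiment": rng.choice(["CA", "DA", "SS"], n),
    "time": rng.random(n),
    "seat": rng.integers(0, 2, n),
    "eeg_fp1": rng.normal(size=n),
    "eeg_f7": rng.normal(size=n),
    "event": rng.choice(list("ABCD"), n),
})

# The event column is the prediction target.
target = toy["event"]

# Drop the columns named in the text; what remains are the input features.
features = toy.drop(columns=["crew", "experiment", "time", "seat", "event"])

# Assumption: the 20% portion is the one used for fitting.
X_train, X_valid, y_train, y_valid = train_test_split(
    features, target, train_size=0.2, random_state=1
)
print(X_train.shape, X_valid.shape)
```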

After processing and splitting my data, I began training two different models: a decision tree model and a random forest model. For both of these models, I searched over the following parameter values.

                      'max_depth': [8,16, 24, 32, 40],
                      'min_samples_leaf': [8, 16, 24, 32, 40]
                      
**Note: My initial models were scored on accuracy rather than neg_log_loss. The choice of scoring metric did not influence these models much; however, for the rest of this project, I scored on neg_log_loss.**
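
A grid search over these two parameters might look like the sketch below. It uses the parameter values listed above with `GridSearchCV`, which the notebook imports, but everything else is an assumption made for illustration: the synthetic data from `make_classification`, the five-fold `cv`, the `random_state`, and the choice to score on `neg_log_loss` here rather than accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small synthetic four-class problem standing in for the pilot data.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=4, random_state=1
)

# Parameter values taken from the text above.
param_grid = {
    "max_depth": [8, 16, 24, 32, 40],
    "min_samples_leaf": [8, 16, 24, 32, 40],
}

# Every combination in the grid is evaluated with cross-validation.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    scoring="neg_log_loss",  # higher (closer to zero) is better
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Swapping `DecisionTreeClassifier` for `RandomForestClassifier` gives the equivalent search for the second model, at a much higher run time.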

Below is a look at the model performances.

**Decision Tree Model:**

                Best Parameters: {'max_depth': 40, 'min_samples_leaf': 8}
                Best CV Score:   0.9589346206679769
                Training Acc:    0.988669153541475
                Wall time: 19min 25s

**Random Forest Model:**

                  Best Parameters: {'max_depth': 40, 'min_samples_leaf': 8}
                  Best CV Score:   0.9283920104511193
                  Training Acc:    0.9646901305640092
                  Wall time: 2h 13min 46s
                
Both models scored well after training, with the decision tree model scoring slightly higher on both cross-validation and training accuracy, so I decided to test the decision tree model first. The random forest model was not far behind. It took much longer to run; however, I did not rule it out for future training.



## Decision Tree Graph

In [None]:
from IPython.display import Image
img = '../input/images-aviation/Screenshot (214).png'
Image(filename=img)

## Random Forest Graph

In [None]:
from IPython.display import Image
img2 = '../input/images-aviation/Screenshot (215).png'
Image(filename=img2)

Based on the graphs, the CV score rises steadily instead of leveling off, which suggests that both models could still improve with further training.

**Take a closer look into my initial model training by viewing the notebook attached within this cell.**

***Initial Training Notebook: https://www.kaggle.com/code/mallorycw/mcw-aviation-modeling-process***

### Decision Tree Testing

Using the best parameters from my training notebook, shown above, I ran the model on the test dataset and submitted the predictions to the competition.



Decision Tree Model Scores:
        
        Private Score: 12.76630
        Public Score:  12.88538
        
This model did not give me a good public or private score; therefore, I knew I needed to train and test more models.
        
        
Below are two links to my model testing for the decision tree model. Both models are the same; however, the second link has a different graph.


***Decision Tree Model Testing 1: 
https://www.kaggle.com/code/mallorycw/mcw-aviation-submission-3-9-22***

***Decision Tree Model Testing 2: https://www.kaggle.com/code/mallorycw/mcw-decision-tree-model-reducing-aviation***

**Comments:** When I first tried to run these models on the test set, I encountered issues loading the test data. Overall, the test data was too large to import at once. I was working from my laptop, and Kaggle has limited storage, so I had to load the test data and generate predictions in chunks. I ended up using this method for all future testing of my models.
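
Chunked prediction can be sketched with the `chunksize` argument of `pd.read_csv`. The example below is a hedged illustration rather than the code from the submission notebooks: the test file is simulated in memory, the model is a small tree fit on random data, and the chunk size, the `id` column, and the submission layout (one probability column per event label) are assumptions.

```python
import io

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Fit a small model on synthetic data so there is something to predict with.
rng = np.random.default_rng(1)
X_train = pd.DataFrame(rng.normal(size=(200, 2)), columns=["eeg_fp1", "eeg_f7"])
y_train = rng.choice(list("ABCD"), 200)
model = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X_train, y_train)

# Simulated test file held in memory (the real file would be read from disk).
test_rows = pd.DataFrame(rng.normal(size=(25, 2)), columns=["eeg_fp1", "eeg_f7"])
test_rows.insert(0, "id", range(25))
csv_buffer = io.StringIO(test_rows.to_csv(index=False))

# Read and predict one chunk at a time so the whole file is never in memory.
parts = []
for chunk in pd.read_csv(csv_buffer, chunksize=10):
    proba = model.predict_proba(chunk[["eeg_fp1", "eeg_f7"]])
    part = pd.DataFrame(proba, columns=model.classes_)
    part.insert(0, "id", chunk["id"].to_numpy())
    parts.append(part)

# Stitch the per-chunk predictions back together.
submission = pd.concat(parts, ignore_index=True)
print(submission.shape)
```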

**Conclusion:** Based on my decision tree results, I knew that I needed to go in a different direction during my modeling process. The competition is scored on multi-class log loss, where lower is better, and my public and private scores were too high.
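
To see how a log loss score can get this large, the toy example below compares hard 0/1 probability predictions with softer ones that make the same single mistake. It is only an illustration of how the metric behaves, using made-up labels and probabilities; it is not a diagnosis of the submitted decision tree.

```python
import numpy as np
from sklearn.metrics import log_loss

labels = ["A", "B", "C", "D"]
y_true = ["A", "A", "C", "D"]

# Hard predictions: full confidence in one class, and the second row is wrong.
hard = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],  # true label is A, predicted C
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])

# Softer predictions with the same mistake on the second row.
soft = np.array([
    [0.85, 0.05, 0.05, 0.05],
    [0.30, 0.05, 0.60, 0.05],
    [0.05, 0.05, 0.85, 0.05],
    [0.05, 0.05, 0.05, 0.85],
])

hard_loss = log_loss(y_true, hard, labels=labels)
soft_loss = log_loss(y_true, soft, labels=labels)
print(hard_loss, soft_loss)
```

A single fully confident wrong prediction dominates the average, which is one reason models that output well-spread probabilities can score much better on this metric.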

# Model Training With Optuna

Modeling with Optuna was the next step in my modeling process.

What is Optuna?

Optuna is a hyperparameter optimization framework. The user defines a search space for each hyperparameter and a score to optimize, and Optuna runs repeated trials to look for the combination with the best score. Optuna is very versatile.

How is Optuna beneficial for modeling within this dataset?

As you can see in the last section, I had to choose the max depth and min samples leaf values for both of my models. Instead of choosing the values myself, Optuna allows me to submit a range for each parameter and chooses values for me. Optuna uses the results of earlier trials to guide which values it tries next.

I chose to train a variety of models on my dataset using Optuna. During all of these modeling processes, I used neg_log_loss scoring.

I used the same beginning processes as the regular models without Optuna by setting up the x and y training variables with an 80/20 split.

For the decision tree model, I chose a max_depth range of 2 to 500 and a min_samples_leaf range from 1 to 500.

For my Decision Tree Model with Optuna, my best parameters and scoring are as follows:

            Max Depth: 192
            Min_Samples_Leaf: 155
            Score: -0.1807557283078063
            Wall Time: 1h 49min 13s
   
***Decision Tree Model With Optuna: https://www.kaggle.com/code/mallorycw/mcw-aviation-dt-model-w-optuna***
            
Next, I chose to train a Light Gradient Boosting Machine (LightGBM) model. I chose a range of 20 to 150 for the number of estimators and a max_depth range of 2 to 40. The learning rate ranged from 0 to 1 and the subsample from 0.6 to 1, both with log=False. Since there are several different types of gradient boosting, I also let Optuna decide between gbdt, dart, and goss.

For my LGBM Model with Optuna, my best parameters and scoring are as follows:

            n_estimators: 150
            Max Depth: 9
            learning_rate: 0.17833752251
            subsample: 0.6819212428783524
            boosting_type: gbdt
            Score: -0.15114824241046781
            Wall Time: 2h 16min 57s

            
            
***LGBM Model with Optuna: https://www.kaggle.com/code/mallorycw/mcw-aviation-lightgbm***

Next, I made a random forest model. I chose a max_depth range of 2 to 556 and a min_inst range from 1 to 64, and the number of estimators between 20 and 300.

Since my random forest model took a long time to run, I decided to test three different models from the Optuna output. Two of them use the best-scoring parameters that Optuna had reported before the study ultimately finished; therefore, I have three different random forest models. Here are the parameters for each of the three models.

Model 1:

            n_estimators: 124
            max_depth: 156
            min_samples_leaf: 1
            Score: -0.2396593729280911

Model 2:

            n_estimators: 145
            max_depth: 138
            min_samples_leaf: 6

Model 3:

            n_estimators: 248
            max_depth: 112
            min_samples_leaf: 11
            
***Random Forest Model With Optuna: https://www.kaggle.com/code/mallorycw/mcw-aviation-rf-model***

Lastly, I made an extra random trees model. I chose a max_depth range of 2 to 300, a min_inst range of 1 to 64, and a number of estimators between 20 and 300. I chose a number of features between 4 and 12, and the split criterion could be either gini or entropy.

For my Extra Random Tree Model with Optuna, my best parameters and scoring are as follows:

            n_estimators: 135
            max_depth: 525
            min_inst: 1
            features: 10
            criterion: entropy
            Score: -0.10811535177329061
            Wall Time: 2h 22min 17s

***Extra-Random Trees: https://www.kaggle.com/code/mallorycw/mcw-extra-random-trees-aviation*** 

# Testing Optuna Model

**Decision Tree Score:**

                Private Score: 10.92707
                Public Score:  10.50096
                
***Decision Tree Optuna Submission: https://www.kaggle.com/code/mallorycw/mcw-dt-submission-w-optuna?scriptVersionId=92201062***
 



**LGBM Score:**

                Private Score: 1.25951
                Public Score:  0.81812


***LGBM Model with Optuna Submission: https://www.kaggle.com/code/mallorycw/mcw-lgbm-submission-w-optuna?scriptVersionId=92202554***


                            

**Random Forest Score:**


**Model 1:**
 
                 Private Score: 0.82668
                 Public Score:  0.62620

***Random Forest Model 1 Submission: https://www.kaggle.com/code/mallorycw/mcw-rf-submission-version3-w-optuna?scriptVersionId=92772167***                 
 

**Model 2:**
 
                 Private Score: 0.72443
                 Public Score:  0.57131

***Random Forest Model 2 Submission: https://www.kaggle.com/code/mallorycw/mcw-rf-submission-w-optuna?scriptVersionId=92364284***   


**Model 3:**
 
                 Private Score: 0.70192
                 Public Score:  0.55555
 

***Random Forest Model 3 Submission: https://www.kaggle.com/code/mallorycw/mcw-rf-submission-version2-w-optuna?scriptVersionId=92364768***  
 
 
***After submitting all of the models to the Kaggle competition, the best model overall was the Random Forest: Model 3.***
 

# Feature Engineering


**What is feature engineering?**

Feature engineering is the process of transforming the columns already in a dataset into new features that help a model learn. Depending on the dataset, you can use various techniques to do this.

**What is the best way to perform feature engineering for my model?**

The "Reducing Commercial Aviation Fatalities" dataset uses physiological data from different pilots. EEG is shorthand for electroencephalography, a method of recording the brain's electrical activity through electrodes on the scalp. The EEG column labels in the dataset are as follows: eeg_fp1, eeg_f7, eeg_f8, eeg_t4, etc. Each of these columns represents a different electrode from the EEG data collected.

Doctors do not use just one electrode to detect patterns; instead, they use electrode pairs to detect activity. There are numerous ways to pair these electrodes, and I found a picture of different montages that shows how I can combine these EEG columns.
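
One common way to turn an electrode pair into a single feature is to take the difference between the two signals, as in a bipolar EEG montage. The sketch below assumes that approach; the text does not state how the columns were combined, so the subtraction, the two example pairs, and the new column names are all hypothetical, and the readings are made up.

```python
import pandas as pd

# Toy EEG readings; column names follow the dataset, values are made up.
toy = pd.DataFrame({
    "eeg_fp1": [10.0, 12.0, 9.5],
    "eeg_f7": [4.0, 5.0, 6.5],
    "eeg_f8": [7.0, 7.5, 8.0],
    "eeg_t4": [1.0, 2.5, 3.0],
})

# Hypothetical electrode pairs; a real montage defines many more.
ELECTRODE_PAIRS = [("fp1", "f7"), ("f8", "t4")]

# Assumption: each pair feature is the first signal minus the second.
for first, second in ELECTRODE_PAIRS:
    toy[f"{first}_{second}"] = toy[f"eeg_{first}"] - toy[f"eeg_{second}"]

print(toy[["fp1_f7", "f8_t4"]])
```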

**Here is a depiction of the Montages:**

In [None]:
from IPython.display import Image
img3 = '../input/images-aviation/Screenshot (216).png'
Image(filename=img3)

I attached a link below with some visualizations and Montages 1 through 3. I did not use all of the montages; however, I did construct all of these montages within this Kaggle Notebook to start the feature engineering process.

Initial Feature Engineering Notebook:
https://www.kaggle.com/code/mallorycw/mcw-aviation-feature-engineering/edit/run/93027900

# Training Notebooks

Instead of using all of the montages depicted in the image for my models, I decided to use the top three. The training notebooks are below.

**Random Forest Model Optuna with Feature Engineering:**

            Best Score: -0.20848149619455
            n_estimators: 240
            max_depth: 398
            min_inst: 1

***Random Forest Optuna Model with Montage 2: 
https://www.kaggle.com/code/mallorycw/mcw-aviation-rf-model-w-feature-engineering***

**Extra Random Trees with Feature Engineering (Original Features + Montage 2):**

            Best Score: -0.09885626885827817
            n_estimators: 256
            max_depth: 125
            min_inst: 1
            features: 11
            criterion: 'entropy'

***Extra Random Trees Optuna Model with Original Features and Montage 2: https://www.kaggle.com/code/mallorycw/mcw-extra-random-trees-w-feature-engineering***

**Extra Random Trees with Feature Engineering (Montage 1 + 3):**

            Best Score: -0.07354581038472197
            n_estimators: 115
            max_depth: 96
            min_inst: 1
            features: 12
            criterion: 'entropy'

***Extra Random Trees Optuna Model with Montage 1 and 3: https://www.kaggle.com/code/mallorycw/mcw-ert-w-features-montage-1-and-3***


# All Original Features and Montages

I submitted a notebook using all of the original features and montages, with the following parameters taken from the extra random trees notebook.
            
            n_estimators: 115
            max_depth: 96
            min_inst: 1
            features: 12
            criterion: 'entropy'
            
**Submission Score:**
            
            Private Score: 0.77650
            Public Score:  0.57646

***Testing on All Features and Montages: https://www.kaggle.com/code/mallorycw/mcw-2-submission-ert-w-features-montage-1***


# Final Conclusion

Based on the scores of all of these models, my Random Forest Model with Optuna (Model 3) is the best overall model. Its scores are below.

**Model 3:**
 
                 Private Score: 0.70192
                 Public Score:  0.55555
                 

Since this model was my best model, I made a notebook to get a full view of the model and how it was performing. I used 70% of the training data and the parameters from this model.

The model was a Random Forest Model with a max depth of 112, min_samples_leaf of 11, n_estimators of 248, and a random_state of 1. I used cross_val_predict to predict each of the three states and the baseline (no event).

After that, I generated a classification report of the predictions. Overall, I had a 94% accuracy. My weighted average is 93%, which is good since the weighted average accounts for the number of samples in each class.
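
The evaluation described above can be sketched as follows. The model parameters come from the text, while the synthetic data and the three-fold `cv` are assumptions made so the example runs quickly; the numbers it prints will not match the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict

# Small synthetic four-class problem standing in for the pilot data.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=4, random_state=1
)

# Parameters reported for Random Forest Model 3.
model = RandomForestClassifier(
    n_estimators=248, max_depth=112, min_samples_leaf=11, random_state=1
)

# Each row is predicted by a model that did not see it during fitting.
y_pred = cross_val_predict(model, X, y, cv=3)

# Per-class precision, recall, and F1, plus accuracy and the averaged rows.
report = classification_report(y, y_pred)
print(report)
```

In the report, the `weighted avg` row averages each per-class metric using the number of true samples in that class as the weight.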
    
***Analysis of Best Model: https://www.kaggle.com/code/mallorycw/best-model-accuracy-2?scriptVersionId=94761300***


***Reference to the Original Random Forest Model 3 Submission using Optuna: https://www.kaggle.com/code/mallorycw/mcw-rf-submission-version2-w-optuna?scriptVersionId=92364768***  

# Future Input

If someone were to continue this project, I would suggest looking further into feature engineering, in particular feature engineering on a random forest model tuned with Optuna, although that would require a longer run time. I would also suggest trying ensembling.
