In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
from sklearn.linear_model import LogisticRegression, Lasso, Ridge
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

#import spacy
#spacy.prefer_gpu()

## Notebook Objectives

The objective of this notebook is to fine-tune the model created in 3-modelling. As a general overview:
* The data is subsetted to accidents where a second pilot is present. This is used as a proxy metric to look at commercial flights. The Federal Aviation Administration (FAA) requires two pilots at all times for most aircraft that exceed 12,500 pounds (source below).
* Data pre-processing (including NLP) is repeated, utilizing functions from the 3-modelling notebook
* The y variable is changed: we now classify accidents whose highest injury is Fatal or Serious (y=1) versus Minor or None (y=0)
* Model parameters are tuned in gridsearch slightly differently to adjust for new data
* Results are interpreted

Source: https://flygv.com/blog/the-importance-and-benefits-of-utilizing-dual-pilot-operations#:~:text=The%20Federal%20Aviation%20Administration%20

(the functions below are the same as in 3-modelling notebook and used for pre-processing and modelling in this notebook):

In [2]:
def get_relevant_lemmas(text):
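    '''
    Returns the words of `text` that are not English stop words,
    lemmatized with WordNet and joined in their original order.
    '''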
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))
    return " ".join([lemmatizer.lemmatize(word) for word in text.split() if not word.lower() in stop_words])

In [3]:
def get_relevant_words(doc):
    '''
    This function creates a string of lemmas in a SpaCy doc that are
    not stop words or punctuation, in the order that they appeared.
    
    Parameters
    ----------
    doc: A SpaCy doc object
    
    Returns
    -------
    A string that contains non-stop, non-punctuation, lemmatized words
    '''
    return " ".join([token.lemma_ for token in doc if (not token.is_stop) and (not token.is_punct)])

In [4]:
def convert_text_df_to_vectors(X, n_gram_range=(1,1)):
    '''
    Converts a dataframe consisting of a single column of text into
    a dataframe of vectorized text data with stop words and
    conjugation removed.
    
    Parameters
    ----------
    X: A dataframe consisting of a single column of text data
    n_gram_range: A tuple containing the range of word chunks size to
                  use when grouping words for vectorization. Defaults
                  to (1,1).
    
    Returns
    -------
    A dataframe with columns for each vectorized n-gram of text, with
    values containing that text's tfidf value for the document in the
    corresponding row of the original dataframe.
    '''
    
    # Convert text entries into strings of relevant words
    X_lemmas = X.applymap(get_relevant_lemmas)
    
    # Vectorize our text dataset and store it as a DataFrame
    tfidf = TfidfVectorizer(ngram_range=n_gram_range)
    X_nlp_sparse = tfidf.fit_transform(X_lemmas[0])
    X_tfidf = pd.DataFrame(X_nlp_sparse.todense(), columns=tfidf.get_feature_names_out())
    return X_tfidf

In [5]:
def convert_text_to_vectors(X, should_combine_nouns=False, n_gram_range=(1,1)):
    '''
    Converts a dataframe consisting of a single column of text into
    a dataframe of vectorized text data with stop words, punctuation,
    and conjugation removed.
    
    Parameters
    ----------
    X: A dataframe consisting of a single column of text data
    should_combine_nouns: A boolean indicating whether or not
                          adjacent nouns in the text data should
                          be vectorized together. Defaults to False.
    n_gram_range: A tuple containing the range of word chunks size to
                  use when grouping words for vectorization. Defaults
                  to (1,1).
    
    Returns
    -------
    A dataframe with columns for each vectorized n-gram of text, with
    values containing that text's tfidf value for the document in the
    corresponding row of the original dataframe.
    '''
    
    # Importing SpaCy locally, as the top-level import is commented out
    import spacy
    # Loading in SpaCy English model
    nlp = spacy.load('en_core_web_sm')
    # Deciding whether or not to merge noun chunks
    if should_combine_nouns:
        nlp.add_pipe("merge_noun_chunks")
    
    # Converting text data to SpaCy doc objects
    X_nlp = X.applymap(nlp)
    # Convert text features into strings of relevant words
    X_lemmas = X_nlp.applymap(get_relevant_words)
    
    # Vectorize our text dataset and store it as a DataFrame
    tfidf = TfidfVectorizer(ngram_range=n_gram_range)
    X_nlp_sparse = tfidf.fit_transform(X_lemmas[0])
    X_tfidf = pd.DataFrame(X_nlp_sparse.todense(), columns=tfidf.get_feature_names_out())
    return X_tfidf

# Data Pre-Processing

In [6]:
# Importing cleaned dataset
accident_df = pd.read_pickle('./data/cm_vehicles_flattened_joined')
accident_df.columns = accident_df.columns.str.lower()

The data is subsetted to accidents where a second pilot is present below. This is used as a proxy metric to look at commercial flights.

In [7]:
# look at values for secondpilotpresent
accident_df['secondpilotpresent'].value_counts().to_frame()

Unnamed: 0,secondpilotpresent
False,21691
True,4651


In [8]:
# create new dataframe for secondpilotpresent data
spp_true = accident_df.loc[accident_df['secondpilotpresent']==True].copy()

In [9]:
# look at shape of new data
spp_true.shape

(4651, 58)

Note that the new data where a second pilot was present has **4651 rows**, which is significantly smaller than our starting dataset which we modelled with in 3-modelling notebook.

In [10]:
# drop secondpilot present as we already filtered on it and we do not need the column anymore
spp_true.drop(columns='secondpilotpresent', inplace=True)

The first thing that we must do is identify the columns of data that are of interest to us and remove or replace any missing values (also known as NaNs) from them. Having any NaN values in our dataset will make it impossible for our future model to accept our data for training. As such, it is crucial that they be addressed beforehand.

However, we should not simply remove all NaNs from our dataset. If we remove any row of data that has a NaN in it, we will be losing a significant amount of data that would otherwise be in the other columns of those rows. To prevent an extreme loss of data, we should only focus on cleaning the columns that we plan to use in our training.
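The difference between dropping rows with a NaN in any column and dropping rows with a NaN only in the columns of interest can be sketched on a toy example. This is a minimal illustration using plain Python dictionaries rather than the notebook's dataframe; the column names and values are hypothetical, and `None` stands in for NaN.

```python
# Hypothetical rows; None marks a missing value (column names are illustrative only)
rows = [
    {"cause": "stall", "latitude": 35.1, "engine_count": None},
    {"cause": None,    "latitude": 40.2, "engine_count": 2},
    {"cause": "icing", "latitude": None, "engine_count": None},
    {"cause": "fuel",  "latitude": 29.9, "engine_count": None},
]

def drop_missing(rows, subset=None):
    """Keep rows with no missing value in `subset` (or in any column if subset is None)."""
    kept = []
    for row in rows:
        columns = subset if subset is not None else list(row)
        if all(row[column] is not None for column in columns):
            kept.append(row)
    return kept

# Dropping on every column discards far more rows than dropping on the
# two columns we actually plan to train on.
dropped_on_any_column = drop_missing(rows)
dropped_on_subset = drop_missing(rows, subset=["cause", "latitude"])
print(len(dropped_on_any_column), len(dropped_on_subset))
```

In pandas, the same restriction is expressed through the `subset` argument of `dropna`.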

Since our model is attempting to determine the factors that affect accident severity, we can use the probable cause reports and analysis narrative reports for each accident to train our model on the aspects that distinguish each accident. We should not include the factual narrative report, however, as it contains too much text and would make our dataset too large to store in a notebook such as this for use in training.

In addition to this text data, we can train our model on the latitude, longitude, time, state, and plane manufacturer (make) of each accident. These contextual factors for each accident are commonly believed to play a significant role in accident severity, so they should be included in our model so we can infer whether or not these factors are as important as many believe them to be.

In [11]:
# Defining columns of interest
text_columns_of_interest = ['cm_probablecause','analysisnarrative','factualnarrative']
non_text_columns_of_interest = ['cm_latitude', 'cm_longitude', 'cm_eventdate', 'cm_state', 'make']

Having selected our columns of interest, with an eye towards minimizing the number of columns we use to avoid overfitting our model later on, we can inspect the NaNs present in each column to determine whether or not to remove or replace them.

In the cell below, we can see that the cm_probablecause column has a very small number of NaNs (9 of 4,651 rows, roughly 0.2% of the dataset). As such, we can safely remove these NaNs without meaningfully affecting the distribution of our data.

In [12]:
# Displaying presence of NaN values in key text columns
print(f'The probable cause column has {spp_true.cm_probablecause.isna().sum()} NaN values.')
print(f'The analysis narrative column has {spp_true.analysisnarrative.isna().sum()} NaN values.')
print(f'The factual narrative column has {spp_true.factualnarrative.isna().sum()} NaN values.')

The probable cause column has 9 NaN values.
The analysis narrative column has 0 NaN values.
The factual narrative column has 0 NaN values.


In [13]:
# Removing NaNs from columns with text
spp_true = spp_true.dropna(subset=text_columns_of_interest)

Moving on to the contextual feature columns, we can see that the cm_latitude and cm_longitude columns are each missing approximately 1.5% of the dataset's values, while cm_state and make are each missing less than 0.5%. As these are very small proportions of the data, these entries can be safely dropped without fear of misrepresenting our dataset.

In [14]:
# Displaying presence of NaN values in key non-descriptive columns
spp_true[non_text_columns_of_interest].isna().sum()

cm_latitude     68
cm_longitude    69
cm_eventdate     0
cm_state        20
make             4
dtype: int64

In [15]:
# Removing NaNs from columns without descriptive text
spp_true = spp_true.dropna(subset=non_text_columns_of_interest)

In [16]:
spp_true.shape

(4558, 57)

## Converting the Data

Now that our data has been cleaned of NaN values, we can begin converting it into an AI-friendly format that will allow us to better train our model.

First, we can convert the cm_eventdate column from a date-and-time string into the number of seconds that have elapsed since the Unix epoch (January 1, 1970). This allows us to represent our datetime data as a single numeric value, which our future model will be better able to interpret.

In [None]:
# Converting accident date information from strings to seconds since epoch
spp_true.loc[:,'cm_eventdate'] = pd.to_datetime(spp_true.cm_eventdate,
                                                  format='%Y-%m-%dT%H:%M:%SZ').astype('int64')//1e9
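The conversion can be sanity-checked on individual timestamps. The sketch below uses only the standard library and assumes the timestamps are in UTC, which is what the trailing `Z` in the format string suggests; the helper name `to_epoch_seconds` is hypothetical.

```python
from datetime import datetime, timezone

def to_epoch_seconds(stamp):
    """Convert a 'YYYY-MM-DDTHH:MM:SSZ' string to whole seconds since the Unix epoch."""
    # The literal 'Z' is matched by the format; we then mark the result as UTC
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())

epoch_start = to_epoch_seconds("1970-01-01T00:00:00Z")
one_day_later = to_epoch_seconds("1970-01-02T00:00:00Z")
print(epoch_start, one_day_later)
```

The epoch itself maps to 0 and each later day adds 86,400 seconds, so later accidents always receive larger values.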

Below, we remove self-referential words from our two text columns, because we are not interested in words that merely restate the outcome we are trying to classify. For example, "fatal" would almost certainly be more strongly associated with fatal/serious accidents than with minor/no-injury accidents.

In [18]:
# Remove self-referential words that would leak the outcome into the text
spp_true['analysisnarrative'] = spp_true['analysisnarrative'].str.lower()
removed_words = [' fatally', ' wreckage', ' fatal', ' autopsy', ' surviving', ' survive', ' survived',
                 ' died', ' death', ' witnesses', ' witness', ' medical', ' medicine', ' medication', ' toxicology',
                 ' toxicological', ' destroyed', ' crash', ' crashed', 'injury', 'injured', 'injuries']
# Replace longer words first so that, e.g., ' crashed' is not left behind as 'ed'
for word in sorted(removed_words, key=len, reverse=True):
    spp_true['analysisnarrative'] = spp_true['analysisnarrative'].map(lambda x: x.replace(word, ' '))
    spp_true['cm_probablecause'] = spp_true['cm_probablecause'].map(lambda x: x.replace(word, ' '))
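When words are stripped by substring replacement, the order of replacement matters whenever one removed word is a prefix of another (for example ' survive' and ' survived'). The following sketch, on a hypothetical sentence and with a hypothetical helper `strip_words`, compares replacing in list order with replacing the longest words first.

```python
def strip_words(text, words, longest_first):
    """Replace each word in `words` with a space, optionally longest word first."""
    ordered = sorted(words, key=len, reverse=True) if longest_first else words
    for word in ordered:
        text = text.replace(word, ' ')
    return text

sample = "the pilot survived the crash"  # hypothetical sentence
words = [' survive', ' survived', ' crash']

# In list order, ' survive' is removed first and leaves a stray 'd' behind
result_list_order = strip_words(sample, words, longest_first=False)
# Longest first removes ' survived' as a whole word
result_longest_first = strip_words(sample, words, longest_first=True)

print(result_list_order.split())
print(result_longest_first.split())
```

Stray fragments such as 'd' or 'ed' would otherwise become their own tokens during vectorization.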

In [19]:
# Splitting data into train and test sets as well as feature and target sets
spp_true_train, spp_true_test = train_test_split(spp_true)

In [20]:
spp_true_train.shape, spp_true_test.shape

((3418, 57), (1140, 57))
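The shapes above follow from the default 25% test fraction of `train_test_split`. Because no `random_state` is passed, the rows that land in each set change from run to run; passing a fixed `random_state` would make the split repeatable. The sketch below illustrates both points with the standard library only. It is not scikit-learn's shuffling procedure, and the helper name and seed are hypothetical.

```python
import math
import random

def shuffle_split(n_rows, test_size=0.25, seed=42):
    """Return (train_indices, test_indices) from a seeded shuffle of row positions."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # a fixed seed gives a repeatable order
    n_test = math.ceil(test_size * n_rows)    # test set size is rounded up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffle_split(4558)
print(len(train_idx), len(test_idx))
```

With 4,558 rows this yields 3,418 training rows and 1,140 test rows, matching the shapes reported above.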

In order to further prepare our data for use in our aviation model, we must separate it based on the kind of processing that it will need.

The columns cm_probablecause and analysisnarrative both contain large quantities of string data that correspond to a report on the events of a given accident. In order to use this in a model, natural language processing (NLP) methods will need to be applied to convert these strings of text into vectors of word frequencies. As these NLP steps are fundamentally different from the preprocessing that will be required for the remainder of the data, we ought to separate these columns out.

In [21]:
# Selecting relevant data from dataset

# Separating out target data
spp_true_severity = spp_true_train[['cm_highestinjury']]
spp_true_severity_test = spp_true_test[['cm_highestinjury']]
# Separating out text data for vectorization
spp_true_text_columns = spp_true_train[text_columns_of_interest]
spp_true_text_columns_test = spp_true_test[text_columns_of_interest]
# Selecting relevant non-text columns
spp_true_trimmed = spp_true_train[non_text_columns_of_interest]
spp_true_trimmed_test = spp_true_test[non_text_columns_of_interest]

#### Converting and Combining Data

When performing NLP preprocessing, it is important to collect all the text data that we are interested in into a single place. The process of NLP and vectorization undertaken in this project does not rely on the structure of any given sentence, and instead focuses on the frequency with which a word appears. As such, we can feel free to combine our probable cause and analysis narrative report columns into a single column without affecting the final result of our NLP.

In [22]:
# Combining text columns
spp_true_text = pd.DataFrame(spp_true_text_columns.cm_probablecause + " " + \
                   spp_true_text_columns.analysisnarrative
                   #+ " " + accident_df_text_columns.factualnarrative
                   )
spp_true_text_test = pd.DataFrame(spp_true_text_columns_test.cm_probablecause + " " + \
                                     spp_true_text_columns_test.analysisnarrative
                                     #+ " " + accident_df_text_columns_test.factualnarrative
                                     )


### Natural Language Pre-Processing
In this natural language processing phase, we will correct three issues that negatively affect the quality of our report data with regard to machine learning. These issues are stop words, conjugation, and the string format.

The vast majority of English sentences contain numerous words that mean little in and of themselves and serve only to maintain the grammar and syntax of the sentence. Such words include 'is' and 'to'. These words are known as stop words, and our future model will attempt to train on them if they are not removed from our dataset. Since these words do not actually carry any meaning, our model's attempts to fit to them will only result in overfitting and increased training time, resulting in a worse model overall.

In addition to this, many English words that do have meaning come in multiple forms depending on the syntax of the sentence around them. Verb conjugation and plural forms are both examples of this. If multiple versions of the same word are allowed to exist in our dataset, our model will attempt to train on each of them as if they were completely different words. This would lead to the model attempting to fit to multiple highly correlated features and dividing up its weights across the various forms of a word that appear in the dataset. In order to combat this, we can attempt to convert every word in our dataset to its syntactically neutral form, known as a lemma, through the process of lemmatization.

Lastly, all of our report data is represented in string format, that is to say, as blocks of text. This is a significant issue, as our machine learning model is incapable of training on any data that is not purely numeric. In order to use our accident report text data in our machine learning model, we need to convert it into a numeric format. This can be accomplished through the process of vectorization, which creates a column for each unique word in our dataset and stores a tf-idf weight reflecting how frequently that word appears in each row of our dataset. In this format, our future model will be capable of training on our report data and inferring the factors that contribute most to accident severity.

(repeat from 3-modelling notebook, functions are called in at the top of the notebook)

In [23]:
# Lemmatizing text data and removing stop words
spp_true_text_lemmas = spp_true_text.applymap(get_relevant_lemmas)
spp_true_text_test_lemmas = spp_true_text_test.applymap(get_relevant_lemmas)

In [24]:
# Vectorizing lemmatized text data and storing it as a DataFrame
tfidf = TfidfVectorizer()
tfidf.fit(spp_true_text_lemmas[0])
spp_true_tfidf = pd.DataFrame(tfidf.transform(spp_true_text_lemmas[0]).todense(),
                                 columns = tfidf.get_feature_names_out())
spp_true_tfidf_test = pd.DataFrame(tfidf.transform(spp_true_text_test_lemmas[0]).todense(),
                                      columns = tfidf.get_feature_names_out())
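To make the tf-idf values concrete, the sketch below recomputes them by hand for three hypothetical documents. It assumes scikit-learn's default settings for `TfidfVectorizer` (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and uses simple whitespace tokenization, so it is an illustration of the weighting rather than a replacement for the vectorizer.

```python
import math
from collections import Counter

# Hypothetical, already-lemmatized documents
docs = ["engine failure during climb", "engine fire", "pilot error during landing"]

def tfidf(docs):
    """Return (vocabulary, rows) of L2-normalized tf-idf weights with smoothed idf."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({word for doc in tokenized for word in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each word
    doc_freq = Counter(word for doc in tokenized for word in set(doc))
    idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[word] * idf[word] for word in vocab]
        norm = math.sqrt(sum(value * value for value in raw))
        rows.append([value / norm for value in raw])
    return vocab, rows

vocab, rows = tfidf(docs)
# In "engine fire", the rarer word "fire" receives more weight than "engine"
print(rows[1][vocab.index("fire")], rows[1][vocab.index("engine")])
```

Because each row is normalized to unit length, every individual tf-idf value lies between 0 and 1.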

## Numeric and Categorical Pre-Processing

With our report data addressed, we can move on to processing our contextual data. All of our contextual data falls into one of two groups: numeric data and categorical data. Each group has its own issues that must be corrected to maximize future model performance.

Numeric data such as latitude, longitude, and event time are all represented purely as numbers. Since these data types are already compatible with our machine learning models, they theoretically do not need to be modified in order to be used. However, our future models will make use of a group of techniques known as 'regularization' to reduce overfitting and improve performance, and regularization is sensitive to the scale of the data given to the model. To maximize the accuracy of our final model, it would therefore be beneficial to bring all of our numeric data onto a common scale while preserving the variation within the data that our model needs in order to learn. This can be accomplished through the process of scaling.

By contrast, categorical data such as the state the accident took place in and the plane's manufacturer are all represented as strings of text, like our reports. However, these simple strings of text do not need to be pre-processed with NLP. Instead, we can focus directly on encoding them in a numeric format. One direct way is to create a column for each possible value within `cm_state` and `make` that contains a 1 if a row has that value and a 0 if it does not. This is known as One-Hot Encoding, and it is the simplest approach to numerically encoding nominal categorical data.

The following cell uses a column transformer to scale numeric columns and encode categorical columns.

In [25]:
# One-Hot encoding non-NLP categorical data
# and scaling non-NLP numeric data
pre_processor = ColumnTransformer(
    [('scaler', StandardScaler(), ['cm_latitude', 'cm_longitude', 'cm_eventdate']),
     ('onehot', OneHotEncoder(drop = 'first', handle_unknown = 'ignore'), ['cm_state', 'make'])],
    verbose_feature_names_out = False
)
pre_processor.fit(spp_true_trimmed)
# Storing processed non-NLP data for merging with NLP data
spp_true_trimmed_transformed = pd.DataFrame(pre_processor.transform(spp_true_trimmed).todense(), 
                                               columns = pre_processor.get_feature_names_out())
spp_true_trimmed_transformed_test = pd.DataFrame(pre_processor.transform(spp_true_trimmed_test).todense(), 
                                                    columns = pre_processor.get_feature_names_out())
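What the two transformers do to individual columns can be sketched by hand. The example below standardizes a numeric column using the population standard deviation, as `StandardScaler` does, and one-hot encodes a categorical column while dropping the first category in sorted order, mirroring `drop='first'`. The values and helper names are hypothetical.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean, std = fmean(values), pstdev(values)
    return [(value - mean) / std for value in values]

def one_hot_drop_first(values):
    """One-hot encode values, dropping the first category in sorted order."""
    categories = sorted(set(values))
    kept = categories[1:]  # the dropped category is represented by all zeros
    encoded = [[1 if value == category else 0 for category in kept] for value in values]
    return kept, encoded

scaled = standardize([30.0, 40.0, 50.0])                      # hypothetical latitudes
kept, encoded = one_hot_drop_first(['AK', 'TX', 'AK', 'CA'])  # hypothetical states
print(scaled)
print(kept, encoded)
```

Dropping one category per column removes a redundant column, since its value is implied when all remaining columns are 0.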



#### Combining and Saving the Data

The data is saved for later use, but it is not read back in, because this notebook goes straight into modelling.

In [26]:
# Combining datasets and saving
spp_true_train = pd.concat([spp_true_trimmed_transformed, 
                               spp_true_tfidf], axis = 1)
spp_true_test = pd.concat([spp_true_trimmed_transformed_test, 
                              spp_true_tfidf_test], axis = 1)

In [27]:
# Saving data for use in modelling
spp_true_train.to_pickle('./data/spp_true_train')
spp_true_test.to_pickle('./data/spp_true_test')
spp_true_severity.to_pickle('./data/spp_true_train_severity')
spp_true_severity_test.to_pickle('./data/spp_true_test_severity')

In [28]:
# Using the in-memory datasets under the names used for modelling
# (the same objects that were pickled in the previous cell)
X_train_tfidf = spp_true_train
X_test_tfidf = spp_true_test
y_train = spp_true_severity
y_test = spp_true_severity_test

## Iteration on Modelling

We have a working model from the 3-modelling notebook; however, we have made some changes to the data. We are looking at flights where a second pilot is present, resulting in a dataframe with 4,558 rows. Here, we will also change the target classes before iterating on our previous model.

Here we are changing our `y` variable compared to 3-modelling notebook. 
We are making:
* the class `y=1` as `cm_highestinjury` = `Fatal` or `Serious`
* the class `y=0` as `cm_highestinjury` = `None` or `Minor`

We did this as we want to treat fatal or serious cases as "bad accidents".

In [29]:
# give categories to ys based on cm_highestinjury
y_train = ((y_train=='Fatal')|(y_train=='Serious')).astype('int').cm_highestinjury
y_test = ((y_test=='Fatal')|(y_test=='Serious')).astype('int').cm_highestinjury

In [31]:
# look at value counts for y_train
y_train.value_counts().to_frame()

Unnamed: 0,cm_highestinjury
0,2397
1,1021


Since our processed data is both extremely large and extremely sparse, we can convert it to a sparse format in order to improve model training speed by representing our data in a more compact way.

In [32]:
# Converting feature datasets into sparse matrices for time efficiency
X_train_sparse = X_train_tfidf.astype(pd.SparseDtype("float64",0)).sparse.to_coo()
X_test_sparse = X_test_tfidf.astype(pd.SparseDtype("float64",0)).sparse.to_coo()

#### Establishing the Baseline

When training a classifier model, it is important to establish a baseline accuracy that our model must surpass in order to be considered meaningful. This baseline is defined for classifiers as the accuracy that you would get by simply predicting the most common class (in this case y=0, minor or no injury) every single time.

The following cells display the baseline accuracy as ~0.677 on the test set (~0.701 on the training set). Note that this is much lower than our previous baseline of ~0.819.

In [33]:
# Checking the majority class (baseline)
y_test.value_counts(normalize=True)

0    0.677193
1    0.322807
Name: cm_highestinjury, dtype: float64

In [34]:
y_train.value_counts(normalize=True)

0    0.701287
1    0.298713
Name: cm_highestinjury, dtype: float64
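The baseline is simply the share of the majority class. As a minimal sketch, the training-set figure can be reproduced from the class counts shown earlier for `y_train` (2,397 rows with y=0 and 1,021 rows with y=1):

```python
from collections import Counter

# Class counts taken from the y_train value counts shown above
y_train_counts = Counter({0: 2397, 1: 1021})

# Always predicting the most common class is correct exactly this often
majority_class, majority_count = y_train_counts.most_common(1)[0]
baseline_accuracy = majority_count / sum(y_train_counts.values())
print(majority_class, round(baseline_accuracy, 6))
```

Any model whose test accuracy does not clearly exceed this share has learned little beyond the class imbalance.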

## Logistic Regression

When selecting a model to train, it's important to keep in mind what we need our model to do. While popular models such as Gradient Boosting, Random Forest, and Neural Networks can all boast high predictive power, they do so at the cost of interpretability. Such models can rarely explain the reasoning behind their decisions accurately, especially when given sparse data like our own.

However, there is a simpler model whose priorities better align with our own. Logistic Regression is a simple classifier model that employs the principles of linear regression to apply a single numeric weight to each word in our dataset, which it uses to predict the odds of an accident being serious or fatal when that word is present. This model allows us to see, in quantifiable terms, the effect that a word has on our model's decision to classify an accident as serious/fatal or not. And when given binary target data in particular, Logistic Regression can communicate which class a particular word is associated with through the sign of its numeric weight.

For this project, where we aim to determine the factors that affect accident severity by examining a trained model, interpretability is the most important consideration. As such, we will be using Logistic Regression in our modelling process.
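How the sign of a weight translates into a prediction can be sketched with a two-word toy model. The intercept, weights, and feature values below are hypothetical and are not taken from the fitted model; the function simply applies the logistic (sigmoid) function to a weighted sum.

```python
import math

def predicted_probability(intercept, weights, features):
    """Logistic regression prediction: sigmoid of the weighted sum of features."""
    z = intercept + sum(weight * value for weight, value in zip(weights, features))
    return 1 / (1 + math.exp(-z))

intercept = -0.5
weights = [2.0, -1.5]  # hypothetical weights for two words

base = predicted_probability(intercept, weights, [0.0, 0.0])
with_positive_word = predicted_probability(intercept, weights, [0.4, 0.0])
with_negative_word = predicted_probability(intercept, weights, [0.0, 0.4])
print(with_negative_word, base, with_positive_word)
```

A word with a positive weight raises the predicted probability of the y=1 class, and a word with a negative weight lowers it.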

The following cell trains several Logistic Regression models and selects the one with the best performance for use in future inference.

The options for `C` in the gridsearch below are adjusted from the 3-modelling notebook to search for the best `C`.

In [35]:
# Defining logistic regression parameters to sweep over
logreg_params = {
    'penalty':['l1','l2'],
    'C':[0.01,0.1,1,2,3,4,5,6,7,8,10,12,14,100]
}
# GridSearching logistic regression classifiers
logreg_grid = GridSearchCV(LogisticRegression(solver='liblinear'), logreg_params, n_jobs=-1)
logreg_grid.fit(X_train_sparse, y_train)
# Printing out train and test accuracy scores
print(f'Training accuracy: {logreg_grid.score(X_train_sparse, y_train)}')
print(f'Testing accuracy: {logreg_grid.score(X_test_sparse, y_test)}')

Training accuracy: 0.9482153306026916
Testing accuracy: 0.8517543859649123


Looking at the above testing accuracy of approximately 0.85, we can see that our model has outperformed the baseline accuracy of 0.677 and thus has meaningfully learned to predict severity from our data. 

Compared with the previous model (from the 3-modelling notebook), we see an increase in variance: the gap between the training accuracy (about 0.95) and the testing accuracy (about 0.85) is larger. However, given that we removed self-referential words and have a much smaller dataset, this makes sense. The baseline for this iteration is also much lower. We select this as the final version of our model as it addresses our problem statement better.

Given this, we can now look at the features within the model that have the highest weight in order to make inferences about factors that most contribute to fatal accidents.

In [40]:
# checking the best estimator
logreg_grid.best_estimator_

LogisticRegression(C=5, solver='liblinear')

In [41]:
parameter_weight_df = pd.DataFrame(logreg_grid.best_estimator_.coef_[0], index = X_train_tfidf.columns)

In [42]:
parameter_weight_df.sort_values(0, ascending=False).head(60) # radar, initiate, attendant, airframe, ankle, canopy, rule, bar, transmission, turbulence, spin

Unnamed: 0,0
accident,5.647813
attendant,5.409655
passenger,4.624972
impact,4.203972
likely,3.721031
preimpact,3.450029
radar,3.421872
serious,3.287594
ankle,3.0139
turbulence,2.9816


## Results Interpretation

Below, we look at features that carry large weights in our model. We also want to consider how frequently each feature occurs in our data (a heavily weighted feature that occurs only once is likely not worth making suggestions to regulators on). The function below prints the number of rows (of 4,558) in which the word occurs and the percentage of those rows that are serious/fatal accidents.

In [43]:
def print_word_occurence(df, word):
    '''
    Prints how many rows mention `word` in cm_probablecause or
    analysisnarrative, and what share of those rows are serious or fatal.
    '''
    # Rows where either text column contains the word
    contains_word = df['cm_probablecause'].str.contains(word) | df['analysisnarrative'].str.contains(word)
    occurence = int(contains_word.sum())
    serious_fatal_occurence = int((contains_word & df['cm_highestinjury'].isin(['Serious', 'Fatal'])).sum())

    print(f'There are {occurence} rows where cm_probablecause or analysisnarrative contains {word}.')
    print(f'{serious_fatal_occurence}, or {round(100*(serious_fatal_occurence/occurence), 2)} % of those, are serious or fatal.')

#### "Radar"

In [44]:
print_word_occurence(spp_true, 'radar')

There are 281 rows where cm_probablecause or analysisnarrative contains radar.
235, or 83.63 % of those, are serious or fatal.


In [45]:
radar_df = spp_true.loc[((spp_true['cm_probablecause'].str.contains('radar'))|(spp_true['analysisnarrative'].str.contains('radar')))\
            &(spp_true['cm_highestinjury'].isin(['Serious', 'Fatal']))][['cm_probablecause', 'analysisnarrative']] #235 rows

In [46]:
# radar data altitude
np.exp(10.590722)

39764.18850836774

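The word "radar" was of high importance. Exponentiating the coefficient above gives an odds ratio of about 39,764: a one-unit increase in the tf-idf value of "radar", with all other features held fixed, multiplies the odds of an accident being serious or fatal by that factor. Because each tf-idf value lies between 0 and 1, a full one-unit increase is far larger than the effect of one additional occurrence of the word, so this figure is best read as an indicator of direction and relative strength.
The association arises because radar data was often examined in investigations of serious or fatal accidents. Specifically, it was used to see the flight's altitude and if/when radar connection was lost with the flight.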
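Since the features are tf-idf weights rather than word counts, the odds ratio for a realistic change in a feature is `exp(coefficient * change)`. The sketch below uses the coefficient value exponentiated in the cell above; the tf-idf increase of 0.1 is a hypothetical figure chosen only for illustration.

```python
import math

coefficient = 10.590722   # value exponentiated in the cell above
tfidf_increase = 0.1      # hypothetical change in the word's tf-idf value

# Odds ratio for a full one-unit increase in the feature
odds_ratio_per_unit = math.exp(coefficient)
# Odds ratio for the smaller, hypothetical increase
odds_ratio_for_increase = math.exp(coefficient * tfidf_increase)
print(round(odds_ratio_per_unit), round(odds_ratio_for_increase, 2))
```

These factors multiply the odds of the y=1 class, not its probability, so "times as likely" is only an approximation when the probability is small.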

An example is:

In [47]:
for i in radar_df.iloc[2]:
    print(i)
    print('\n')

the pilot's improper in flight planning/decision making, his flight into known icing conditions, and his failure to maintain adequate airspeed which resulted in the inadvertent stall/spin and impact with terrain. Factors contributing to the accident were the pilot's improper pre-flight planning/preparation, the icing conditions, and the inadvertent stall/spin.


the airplane departed las vegas, nevada, approximately 0919, on an ifr flight plan to midland, texas. the pilot climbed to an initial cruising altitude of 13,000 feet. at 1005, the pilot contacted albuquerque artcc (zab) and reported that he was level at 13,000 feet. at 1009, the pilot requested to climb to 15,000 and the zab controller approved the request. at 1013:55, the pilot contacted albuquerque flight watch and reported that he was approximately 23 miles west of flagstaff, arizona at 15,000 feet, and that about 20 miles west of his position, at 13,000 feet, he encountered "light mixed icing."  the pilot requested any pir

#### "Airframe"

In [49]:
print_word_occurence(spp_true, 'airframe')

There are 651 rows where cm_probablecause or analysisnarrative contains airframe.
344, or 52.84 % of those, are serious or fatal.


In [50]:
np.exp(6.982968)

1078.1134638873286

"Airframe" refers to the mechanical structure of an aircraft and was of high importance in classifying fatal/serious accidents. Exponentiating the coefficient above gives an odds ratio of about 1,078 for a one-unit increase in the tf-idf value of "airframe" in the combined probable cause and analysis narrative text, with all other features held fixed.

source: https://en.wikipedia.org/wiki/Airframe

#### "Transmission"

In [152]:
#transmission
print_word_occurence(spp_true, 'transmission')

There are 91 rows where cm_probablecause or analysisnarrative contains transmission.
56, or 61.54 % of those, are serious or fatal.


In [164]:
np.exp(4.448531)

85.5012503660406

Exponentiating the coefficient above gives an odds ratio of about 86 for a one-unit increase in the tf-idf value of "transmission" in the combined probable cause and analysis narrative text, with all other features held fixed. In the context of aircraft, "transmission" can refer to radio transmission or engine transmission, both of which were found in the data. In particular, many serious/fatal flights have information about radio transmissions in their narratives, because radio transmissions are an important component of investigating aircraft accidents and why they happened.

In [153]:
transmission_df = spp_true.loc[((spp_true['cm_probablecause'].str.contains('transmission'))|(spp_true['analysisnarrative'].str.contains('transmission')))\
            &(spp_true['cm_highestinjury'].isin(['Serious', 'Fatal']))][['cm_probablecause', 'analysisnarrative']] # 56 rows

In [166]:
for i in transmission_df.iloc[9]:
    print(i)
    print('\n')
# engine transmission, radio transmission

The pilot's failure to maintain terrain clearance while executing an instrument approach.  A factor was the night instrument meteorological conditions.


while executing an ils approach in night instrument meteorological conditions, the approach controller instructed the pilot that radar services were terminated, and to switch to the advisory frequency.  the pilot acknowledged the instruction, and no further transmissions were received from the airplane.  review of radar data revealed that the airplane intercepted the final approach course for the runway 3 localizer, where it began a gradual descent.  about 4 minutes prior to the accident, the airplane was recorded on the localizer course, at a ground speed of 115 knots; however, radar coverage was subsequently lost.  the airplane impacted a wooded area about 1/2 mile west of the runway, approximately abeam the 500-foot markers painted on the runway surface.  the  path was oriented approximately 90 degrees left of the inbound approach 

#### "Turbulence"

In [172]:
print_word_occurence(spp_true, 'turbulence')

There are 242 rows where cm_probablecause or analysisnarrative contains turbulence.
194, or 80.17 % of those, are serious or fatal.


In [173]:
np.exp(4.425944)

83.59168053440504

A one-unit increase in the "turbulence" feature (its occurrence in the flight narrative or probable cause) multiplied the odds of the flight being serious or fatal by about 84.

In [167]:
turbulence_df = spp_true.loc[((spp_true['cm_probablecause'].str.contains('turbulence'))|(spp_true['analysisnarrative'].str.contains('turbulence')))\
            &(spp_true['cm_highestinjury'].isin(['Serious', 'Fatal']))][['cm_probablecause', 'analysisnarrative']]

In [171]:
for i in turbulence_df.iloc[6]:
    print(i)
    print('\n')

The pilot not maintaining aircraft control while encountering windshear during initial climbout, resulting in a stall at a low altitude.


while on initial climb after takeoff the airplane encountered windshear and the airplane stalled, subsequently impacting the terrain.  the pilot reported the takeoff and initial climb were normal until approximately 300 feet above ground level when the airplane encountered "severe turbulence, causing left turn and loss of lift."  the pilot stated the engine was producing "full power" when the airplane impacted the terrain.  the winds were 020 degrees magnetic at 12 knots, with gusts of 17 knots.




#### "Installation" - important!

In [178]:
# installation
print_word_occurence(spp_true, 'installation')

There are 189 rows where cm_probablecause or analysisnarrative contains installation.
60, or 31.75 % of those, are serious or fatal.


In [183]:
np.exp(3.970811)

53.02751872341782

A one-unit increase in the "installation" feature (its occurrence in the flight narrative or probable cause) multiplied the odds of the flight being serious or fatal by about 53. Many accident reports give the incorrect installation of some part of the aircraft as the probable cause.

It is imperative to improve the training of maintenance personnel/mechanics who install parts on aircraft.

In [185]:
installation_df = spp_true.loc[((spp_true['cm_probablecause'].str.contains('installation'))|(spp_true['analysisnarrative'].str.contains('installation')))\
            &(spp_true['cm_highestinjury'].isin(['Serious', 'Fatal']))][['cm_probablecause']]

Some examples are:

In [196]:
# Print the probable cause for a few example rows
for row in (8, 10, 19, 22):
    for i in installation_df.iloc[row]:
        print(i)

The improper installation of the left tailrotor control cable by company maintenance personnel.
The improper installation of the helicoil by the non-certificated mechanic, and the inadequate inspection of the installation by the certificated mechanic.
A shorted terminal lug on the landing gear hydraulic pump which resulted in a cabin fire. Contributing to the accident was the lack of clear installation procedures for the hydraulic pump.
A loss of engine power due to the in-flight separation of the 1-3-5 cylinder induction tube elbow, which was caused by the improper installation of the induction tube elbow by maintenance personnel.


#### Other terms: "ankle", "bar", "canopy"

These terms had high feature importance but were not very frequent in the data.

In [174]:
# ankle
print_word_occurence(spp_true, 'ankle')

There are 56 rows where cm_probablecause or analysisnarrative contains ankle.
55, or 98.21 % of those, are serious or fatal.


In [175]:
# bar
print_word_occurence(spp_true, ' bar ')
# steering bar, control bar, etc.

There are 30 rows where cm_probablecause or analysisnarrative contains  bar .
10, or 33.33 % of those, are serious or fatal.
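Padding the search term with spaces avoids matching words such as "barometric", but it also misses "bar" at the start of a text or next to punctuation. A regular-expression word boundary is one possible alternative. The sketch below uses made-up narratives, and it is an assumption that `print_word_occurence` passes its term to `str.contains`.

```python
import pandas as pd

# Hypothetical narratives, for illustration only.
narratives = pd.Series([
    'the steering bar failed',              # surrounded by spaces
    'the pilot released the control bar.',  # followed by a period
    'bar separated in flight',              # at the start of the text
    'the barometric pressure was low',      # substring of another word
])

padded = narratives.str.contains(' bar ', regex=False)
bounded = narratives.str.contains(r'\bbar\b', regex=True)

print(padded.tolist())   # [True, False, False, False]
print(bounded.tolist())  # [True, True, True, False]
```

With word boundaries, the count of 30 rows reported above could change, so the two approaches are not interchangeable without re-running the cell.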


In [None]:
# canopy
print_word_occurence(spp_true, 'canopy')

There are 31 rows where cm_probablecause or analysisnarrative contains canopy.
20, or 64.52 % of those, are serious or fatal.
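The counts printed in this section can be gathered into one table to compare frequency and severity share side by side. This is a small sketch that copies the numbers from the outputs above by hand. The column names are assumptions, and "transmission" is left out because its counts are not printed in this section.

```python
import pandas as pd

# Counts copied from the print_word_occurence outputs in this section.
counts = pd.DataFrame(
    {
        'rows': [242, 189, 56, 30, 31],
        'serious_or_fatal': [194, 60, 55, 10, 20],
    },
    index=['turbulence', 'installation', 'ankle', 'bar', 'canopy'],
)

# Share of matching rows that are serious or fatal, in percent.
counts['share_pct'] = (100 * counts['serious_or_fatal'] / counts['rows']).round(2)

print(counts.sort_values('rows', ascending=False))
```

Sorting by `rows` puts "turbulence" and "installation" at the top, well ahead of "ankle", "bar" and "canopy" in frequency.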


## Summary of Findings

* "Radar", "transmission", and "airframe" are common words in investigations of serious/fatal aircraft accidents. It is encouraging that the industry relies on them consistently in such cases
* "Installation" is both a high-importance and a high-frequency feature. In the examples, many serious/fatal accidents cite "improper installation" of some aircraft part as the probable cause. This should be a focus for regulators and researchers
* No aircraft make or model had both high feature importance and high frequency
* Apart from "turbulence", no weather-related terms had high feature importance (this was our guess initially)