In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Preliminary Modelling

### Importing data

With our data fully cleaned, examined, and processed, we can move on to training an interpretable classifier model to determine the factors that affect accident severity.

We will begin by loading our processed data.

In [2]:
# Importing cleaned and vectorized datasets
X_train_tfidf = pd.read_pickle('./data/train')
X_test_tfidf = pd.read_pickle('./data/test')
y_train = pd.read_pickle('./data/train_severity')
y_test = pd.read_pickle('./data/test_severity')

Since our processed data is both extremely large and extremely sparse, we can convert it to a sparse format. A sparse format stores only the non-zero entries, which represents our data more compactly and speeds up model training.

In [3]:
# Converting feature datasets into sparse matrices for time efficiency
X_train_sparse = X_train_tfidf.astype(pd.SparseDtype("float64",0)).sparse.to_coo()
X_test_sparse = X_test_tfidf.astype(pd.SparseDtype("float64",0)).sparse.to_coo()
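
To make the space saving concrete, here is a minimal sketch on a small synthetic matrix. The shape and the roughly 5% density are assumptions chosen for illustration, not properties of our dataset. It compares the memory footprint of a dense DataFrame with the same data stored using `pd.SparseDtype`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a TF-IDF matrix (hypothetical data, mostly zeros)
rng = np.random.default_rng(0)
dense = pd.DataFrame(rng.random((1000, 50)))
dense = dense.where(dense > 0.95, 0.0)  # keep roughly 5% of the entries

# Same values, stored sparsely with 0 as the fill value
sparse = dense.astype(pd.SparseDtype("float64", 0))

dense_bytes = dense.memory_usage(deep=True).sum()
sparse_bytes = sparse.memory_usage(deep=True).sum()
density = sparse.sparse.density  # fraction of entries that are non-zero

print(f"Dense:  {dense_bytes} bytes")
print(f"Sparse: {sparse_bytes} bytes")
print(f"Density: {density:.3f}")
```

The lower the density, the larger the saving, which is why this conversion suits word-count style features in particular.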

Similarly, we can convert our target data, accident severity, into a binary numeric format to increase the interpretability of our model's results. Since we are primarily concerned with extremely severe accidents, we can encode severity as either 'fatal' (1) or 'non-fatal' (0), as done in the cell below.

In [4]:
# Converting multi-class severity target data into a binary of fatal vs non-fatal
y_train = (y_train == 'Fatal').astype('int').cm_highestinjury
y_test = (y_test == 'Fatal').astype('int').cm_highestinjury

### Establishing the Baseline

When training a classifier model, it is important to establish a baseline accuracy that our model must surpass in order to be considered meaningful. For classifiers, this baseline is the accuracy we would get by simply predicting the most common class (in this case, 'non-fatal') every single time.

The following cell displays the baseline accuracy.

In [7]:
# Checking the share of the majority class (taking the max avoids assuming which label is most common)
print(f'The baseline accuracy is {y_test.value_counts(normalize=True).max()}')

The baseline accuracy is 0.8192307692307692
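
The same kind of baseline can be cross-checked with scikit-learn's `DummyClassifier`, which always predicts the most frequent training label. The sketch below uses a small made-up label vector (an assumption for illustration) rather than our actual `y_test`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 8 non-fatal (0) and 2 fatal (1)
toy_labels = np.array([0] * 8 + [1] * 2)
# The dummy model ignores the features, so a placeholder column is enough
toy_features = np.zeros((len(toy_labels), 1))

# Always predict whichever label was most frequent during fit
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(toy_features, toy_labels)

baseline_accuracy = dummy.score(toy_features, toy_labels)
print(f"Baseline accuracy: {baseline_accuracy}")
```

Fitting the dummy model on training labels and scoring it on test labels gives a baseline that is directly comparable to a real model's test accuracy.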


## Logistic Regression

When selecting a model to train, it's important to keep in mind what we need our model to do. While popular models such as Gradient Boosting, Random Forest, and Neural Networks can all boast high predictive power, they do so at the cost of interpretability. Such models can rarely explain the reasoning behind their decisions accurately, especially when given sparse data like our own.

However, there is a simpler model whose priorities better align with our own. Logistic Regression is a simple classifier that, like linear regression, assigns a single numeric weight to each word in our dataset and combines those weights to predict the log-odds of an accident being fatal. This lets us see, in quantifiable terms, the effect that a word has on our model's decision to classify an accident as fatal or not. With a binary target in particular, the sign of a word's weight tells us which class the word is associated with: positive for fatal, negative for non-fatal.
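
As a rough illustration of how a single weight translates into odds, the sketch below uses a made-up intercept and weight (assumptions, not values from our trained model). In a logistic regression the log-odds are a linear function of the features, so raising one word's TF-IDF value by some amount multiplies the odds of the positive class by `exp(weight * amount)`.

```python
import numpy as np

def sigmoid(z):
    """Map log-odds to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical model with a single word feature
intercept = -2.0
weight = 1.5
tfidf_values = np.array([0.0, 0.2, 0.4])

# Log-odds are linear in the feature value
log_odds = intercept + weight * tfidf_values
probabilities = sigmoid(log_odds)
odds = probabilities / (1.0 - probabilities)

# Moving from a TF-IDF value of 0.0 to 0.2 scales the odds by exp(weight * 0.2)
odds_ratio = odds[1] / odds[0]
print(f"Probabilities: {probabilities}")
print(f"Odds ratio for a 0.2 increase: {odds_ratio}")
```

A positive weight gives an odds ratio above 1 (the word pushes toward the positive class), while a negative weight gives a ratio below 1.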

For this project, where we aim to determine the factors that affect accident severity by examining a trained model, interpretability is the most important consideration. As such, we will be using Logistic Regression in our modelling process.

The following cell trains several Logistic Regression models, one per combination of penalty type and regularization strength, and selects the one with the best cross-validated performance for use in future inference.

In [8]:
# Defining logistic regression parameters to sweep over
logreg_params = {
    'penalty':['l1','l2'],
    'C':[0.01,0.1,1,10,100]
}
# GridSearching logistic regression classifiers
logreg_grid = GridSearchCV(LogisticRegression(solver='liblinear'), logreg_params, n_jobs=-1)
logreg_grid.fit(X_train_sparse, y_train)
# Printing out train and test accuracy scores
print(f'Training accuracy: {logreg_grid.score(X_train_sparse, y_train)}')
print(f'Testing accuracy: {logreg_grid.score(X_test_sparse, y_test)}')

Training accuracy: 0.9917367146317139
Testing accuracy: 0.9484330484330484
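
It can also be useful to check which hyperparameters the search selected and what cross-validated score drove that choice. The sketch below runs the same kind of grid search on synthetic data from `make_classification` (a stand-in, not our accident reports), so the names prefixed with `toy_` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (hypothetical)
toy_X, toy_y = make_classification(n_samples=300, n_features=20, random_state=0)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, random_state=0
)

# Same parameter grid as the search above
toy_params = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10, 100]}
toy_grid = GridSearchCV(LogisticRegression(solver="liblinear"), toy_params, cv=5)
toy_grid.fit(toy_X_train, toy_y_train)

# best_score_ is the mean cross-validated accuracy used to pick the winner
print(f"Best parameters: {toy_grid.best_params_}")
print(f"Mean CV accuracy of best model: {toy_grid.best_score_}")
print(f"Held-out accuracy: {toy_grid.score(toy_X_test, toy_y_test)}")
```

The selection is made on `best_score_`, not on the training or testing accuracy, so the held-out test set plays no part in choosing the model.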


Looking at the above testing accuracy of approximately 0.95, we can see that our model has outperformed the baseline accuracy of 0.82 and thus has meaningfully learned to predict severity from our data.

Given this, we can now look at the features within the model that have the highest weights in order to make inferences about the factors that most contribute to fatal accidents.

In [11]:
# Displaying the model parameters that most strongly push a prediction toward fatal
parameter_weight_df = pd.DataFrame(logreg_grid.best_estimator_.coef_[0], index = X_train_tfidf.columns)
parameter_weight_df.columns = ['Weight']
parameter_weight_df.sort_values('Weight', ascending=False).head(30)

| Word | Weight |
| --- | --- |
| witnesses | 13.295267 |
| witness | 12.087108 |
| fatally | 10.945271 |
| wreckage | 10.257947 |
| likely | 9.327795 |
| would | 7.404621 |
| consistent | 6.941478 |
| airframe | 6.845285 |
| fatal | 6.666822 |
| radar | 6.287115 |


Looking at the above parameters, we can see three major things that can guide further improvements in our model.

Firstly and most importantly, our model _did_ successfully identify words such as 'airframe' that provide insight into flight safety. Looking into the reports that mention the word 'airframe' shows that accidents involving airframe damage worthy of reporting were much more likely to be fatal.

However, these words are being obscured by two kinds of uninformative terms: self-referential terms such as 'fatally' and 'wreckage', which directly state that an accident was fatal, and procedural terms such as 'witness' and 'detected', which appear more often in fatal accident reports but hold no insight in and of themselves. We could improve our model's inferential abilities by removing these terms from our data to force the model to learn more insightful ways of determining which accidents were fatal.
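
One way this removal might look is sketched below on a tiny made-up TF-IDF frame. The frame, the `leaky_terms` list, and the variable names are assumptions for illustration. The actual list of terms to exclude would need to be chosen by reviewing the full set of weights.

```python
import pandas as pd

# Hypothetical TF-IDF frame; column names mirror a few words discussed above
toy_tfidf = pd.DataFrame(
    {
        "airframe": [0.3, 0.0],
        "fatally": [0.5, 0.0],
        "wreckage": [0.2, 0.1],
        "witness": [0.0, 0.4],
        "engine": [0.1, 0.6],
    }
)

# Illustrative list of outcome-revealing and procedural terms (an assumption)
leaky_terms = ["fatally", "fatal", "wreckage", "witness", "witnesses", "detected"]

# errors="ignore" skips any listed term that is not a column in this frame
filtered_tfidf = toy_tfidf.drop(columns=leaky_terms, errors="ignore")
print(list(filtered_tfidf.columns))
```

The same list would have to be dropped from both the training and testing frames so that their columns stay aligned.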

And equally importantly, we can see that the makes of smaller private planes are considered to be important predictors by our model. This, along with our EDA, highlights the fact that our dataset's overrepresentation of private planes is likely having a substantial effect on our model's predictions. Since the project's overall goal is to provide recommendations for what parts of _commercial_ air travel should be regulated or researched, we would likely get more relevant results by excluding private planes from our model's training.

## Continued Work

Since both our model and our data could be further altered to achieve more meaningful results, it would be prudent to refrain from drawing any inferences just yet and instead focus on implementing these improvements in our next notebook!