# Classifying the Severity of Road Accidents

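FARS (the Fatality Analysis Reporting System) is a collection of statistics on US road traffic accidents. The class label (target variable), INJURY_SEVERITY, records the severity of the injury sustained by each person involved in an accident. The dataset has 29 features and over 100K examples.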

We will first carry out exploratory data analysis (EDA) on the FARS dataset and normalize the data. Then, we will create four different machine learning pipelines for classifying the severity of the accident and evaluate them. Finally, we will present the results of our experiment and discuss them.

We will start out by installing and importing the required packages.

In [None]:
!pip install ydata_profiling


In [None]:
import pandas as pd
from ydata_profiling import ProfileReport
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

from IPython.display import display
from IPython.display import HTML

We will then upload the FARS dataset.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

from google.colab import files
uploaded = files.upload()

Mounted at /content/drive


Saving fars.csv to fars.csv


In [None]:
# Importing the FARS dataset to a DataFrame
fars = pd.read_csv('fars.csv')

We will now look at the first five rows of the dataset.

In [None]:
fars.head()

Unnamed: 0,CASE_STATE,AGE,SEX,PERSON_TYPE,SEATING_POSITION,RESTRAINT_SYSTEM-USE,AIR_BAG_AVAILABILITY/DEPLOYMENT,EJECTION,EJECTION_PATH,EXTRICATION,...,DRUG_TEST_RESULTS_(2_of_3),DRUG_TEST_TYPE_(3_of_3),DRUG_TEST_RESULTS_(3_of_3),HISPANIC_ORIGIN,TAKEN_TO_HOSPITAL,RELATED_FACTOR_(1)-PERSON_LEVEL,RELATED_FACTOR_(2)-PERSON_LEVEL,RELATED_FACTOR_(3)-PERSON_LEVEL,RACE,INJURY_SEVERITY
0,Alabama,34,Male,Driver,Front_Seat_-_Left_Side_(Drivers_Side),None_Used/Not_Applicable,Air_Bag_Available_but_Not_Deployed_for_this_Seat,Totally_Ejected,Unknown,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Non-Hispanic,No,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,White,Fatal_Injury
1,Alabama,20,Male,Driver,Front_Seat_-_Left_Side_(Drivers_Side),None_Used/Not_Applicable,Deployed_Air_Bag_from_Front,Totally_Ejected,Unknown,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Non-Hispanic,No,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,White,Fatal_Injury
2,Alabama,43,Male,Driver,Front_Seat_-_Left_Side_(Drivers_Side),Lap_and_Shoulder_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Extricated,...,0,Not_Tested_for_Drugs,0,Non-Hispanic,No,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Black,Fatal_Injury
3,Alabama,38,Female,Passenger_of_a_Motor_Vehicle_in_Transport,Front_Seat_-_Right_Side,Lap_and_Shoulder_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Extricated,...,0,Not_Tested_for_Drugs,0,Not_a_Fatality_(Not_Applicable),Yes,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_a_Fatality_(Not_Applicable),Incapaciting_Injury
4,Alabama,50,Male,Driver,Front_Seat_-_Left_Side_(Drivers_Side),Lap_and_Shoulder_Belt,Deployed_Air_Bag_from_Front,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Non-Hispanic,Yes,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Black,Fatal_Injury


We will now explore the FARS dataset.

In [None]:
profile_fars = ProfileReport(fars, title='FARS Dataset')
profile_fars.to_notebook_iframe()  # renders the report inline; returns None


The profile report shows us that the FARS dataset has no missing values. However, it contains a lot of duplicate rows, so we will drop them before proceeding any further.

In [None]:
fars = fars.drop_duplicates()
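
Before dropping rows, it can help to count how many duplicates there actually are. The following is a minimal sketch on a toy DataFrame with made-up values (`toy`, `n_duplicates`, and `deduplicated` are illustrative names, not part of the notebook); the same two calls should apply to `fars`.

```python
import pandas as pd

# Toy frame standing in for `fars` (hypothetical values)
toy = pd.DataFrame({
    "CASE_STATE": ["Alabama", "Alabama", "Alaska", "Alabama"],
    "AGE": [34, 34, 20, 34],
})

# duplicated() marks every row that repeats an earlier row
n_duplicates = int(toy.duplicated().sum())
deduplicated = toy.drop_duplicates()

print(f"{n_duplicates} duplicate rows; {len(toy)} -> {len(deduplicated)} rows")
```

Comparing the row counts before and after makes it easy to report how much data the de-duplication removed.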

We will now take a look at the data types in the FARS dataset and make sure that every column has the type we want, that is, int64 for numerical data and category for categorical data.

In [None]:
fars.dtypes

Unnamed: 0,0
CASE_STATE,object
AGE,int64
SEX,object
PERSON_TYPE,object
SEATING_POSITION,object
RESTRAINT_SYSTEM-USE,object
AIR_BAG_AVAILABILITY/DEPLOYMENT,object
EJECTION,object
EJECTION_PATH,object
EXTRICATION,object


In [None]:
fars['CASE_STATE'] = fars['CASE_STATE'].astype('category')
fars['SEX'] = fars['SEX'].astype('category')
fars['PERSON_TYPE'] = fars['PERSON_TYPE'].astype('category')
fars['SEATING_POSITION'] = fars['SEATING_POSITION'].astype('category')
fars['RESTRAINT_SYSTEM-USE'] = fars['RESTRAINT_SYSTEM-USE'].astype('category')
fars['AIR_BAG_AVAILABILITY/DEPLOYMENT'] = fars['AIR_BAG_AVAILABILITY/DEPLOYMENT'].astype('category')
fars['EJECTION'] = fars['EJECTION'].astype('category')
fars['EJECTION_PATH'] = fars['EJECTION_PATH'].astype('category')
fars['EXTRICATION'] = fars['EXTRICATION'].astype('category')
fars['NON_MOTORIST_LOCATION'] = fars['NON_MOTORIST_LOCATION'].astype('category')
fars['POLICE_REPORTED_ALCOHOL_INVOLVEMENT'] = fars['POLICE_REPORTED_ALCOHOL_INVOLVEMENT'].astype('category')
fars['METHOD_ALCOHOL_DETERMINATION'] = fars['METHOD_ALCOHOL_DETERMINATION'].astype('category')
fars['ALCOHOL_TEST_TYPE'] = fars['ALCOHOL_TEST_TYPE'].astype('category')
fars['POLICE-REPORTED_DRUG_INVOLVEMENT'] = fars['POLICE-REPORTED_DRUG_INVOLVEMENT'].astype('category')
fars['METHOD_OF_DRUG_DETERMINATION'] = fars['METHOD_OF_DRUG_DETERMINATION'].astype('category')
fars['DRUG_TEST_TYPE_(1_of_3)'] = fars['DRUG_TEST_TYPE_(1_of_3)'].astype('category')
fars['DRUG_TEST_TYPE_(2_of_3)'] = fars['DRUG_TEST_TYPE_(2_of_3)'].astype('category')
fars['DRUG_TEST_TYPE_(3_of_3)'] = fars['DRUG_TEST_TYPE_(3_of_3)'].astype('category')
fars['HISPANIC_ORIGIN'] = fars['HISPANIC_ORIGIN'].astype('category')
fars['TAKEN_TO_HOSPITAL'] = fars['TAKEN_TO_HOSPITAL'].astype('category')
fars['RELATED_FACTOR_(1)-PERSON_LEVEL'] = fars['RELATED_FACTOR_(1)-PERSON_LEVEL'].astype('category')
fars['RELATED_FACTOR_(2)-PERSON_LEVEL'] = fars['RELATED_FACTOR_(2)-PERSON_LEVEL'].astype('category')
fars['RELATED_FACTOR_(3)-PERSON_LEVEL'] = fars['RELATED_FACTOR_(3)-PERSON_LEVEL'].astype('category')
fars['RACE'] = fars['RACE'].astype('category')
fars['INJURY_SEVERITY'] = fars['INJURY_SEVERITY'].astype('category')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  fars['CASE_STATE'] = fars['CASE_STATE'].astype('category')
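
The warning above is most likely raised because `fars` is the result of `drop_duplicates()`, which pandas can flag as a derived copy of the original frame. A minimal sketch of one way to avoid it, on a toy DataFrame with made-up values: take an explicit `.copy()` after de-duplicating, and convert all categorical columns in a single `astype` call. The names `toy` and `categorical_columns` are illustrative assumptions.

```python
import pandas as pd

# Toy frame standing in for `fars` (hypothetical values)
toy = pd.DataFrame({
    "SEX": ["Male", "Male", "Female"],
    "AGE": [34, 34, 38],
})

# .copy() makes the de-duplicated frame an independent object,
# so later column assignments are unambiguous
toy = toy.drop_duplicates().copy()

# Convert every listed column in one call instead of one line per column
categorical_columns = ["SEX"]
toy = toy.astype({col: "category" for col in categorical_columns})

print(toy.dtypes)
```

For `fars`, `categorical_columns` would hold the column names converted one by one in the cell above.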

The profile report shows us that there are quite a few strong correlations among the variables. This is how we will deal with them:

i) DRUG_TEST_RESULTS_(1_of_3) is highly correlated with DRUG_TEST_RESULTS_(2_of_3) and DRUG_TEST_RESULTS_(3_of_3). Furthermore, DRUG_TEST_RESULTS_(2_of_3) and DRUG_TEST_RESULTS_(3_of_3) are mostly zeros (88.4% and 89.5% of values, respectively). So, we will drop DRUG_TEST_RESULTS_(2_of_3) and DRUG_TEST_RESULTS_(3_of_3).

ii) EJECTION is highly correlated with EJECTION_PATH. We will drop EJECTION, as EJECTION_PATH has more distinct values and thus holds more information.

iii) HISPANIC_ORIGIN is highly correlated with RACE. However, we will keep both variables as they contain dissimilar information about the individual.

iv) POLICE-REPORTED_DRUG_INVOLVEMENT is highly correlated with POLICE_REPORTED_ALCOHOL_INVOLVEMENT. In this case, we will drop both columns: in POLICE-REPORTED_DRUG_INVOLVEMENT, 74.0% of values are Not_Reported and 6.2% are Reported_Unknown; in POLICE_REPORTED_ALCOHOL_INVOLVEMENT, 45.0% of values are Not_reported and 10.7% are Unknown_(Police_Reported). These two columns therefore add little valuable information to the dataset.

v) SEX is highly correlated with AGE, but we will keep both variables as they contain dissimilar information about the individual.

vi) TAKEN_TO_HOSPITAL is highly correlated with INJURY_SEVERITY, but we will keep both variables, as TAKEN_TO_HOSPITAL is an explanatory variable while INJURY_SEVERITY is the response variable.

In [None]:
fars = fars.drop(columns=[
    'DRUG_TEST_RESULTS_(2_of_3)',
    'DRUG_TEST_RESULTS_(3_of_3)',
    'EJECTION',
    'POLICE-REPORTED_DRUG_INVOLVEMENT',
    'POLICE_REPORTED_ALCOHOL_INVOLVEMENT'
    ])

Furthermore, several columns are dominated by a single uninformative value:

i) In DRUG_TEST_TYPE_(1_of_3), 63.5% of values are Not_Tested_for_Drugs and 15.7% are Unknown_if_Tested_for_Drugs; in DRUG_TEST_TYPE_(2_of_3), 88.4% are Not_Tested_for_Drugs and 9.2% are Unknown_if_Tested_for_Drugs; and in DRUG_TEST_TYPE_(3_of_3), 89.5% are Not_Tested_for_Drugs and 9.2% are Unknown_if_Tested_for_Drugs. Therefore, we will drop these three columns as they do not add much valuable information to the dataset.

ii) In METHOD_ALCOHOL_DETERMINATION, 83.4% of values are Not_Reported. Therefore, we will drop this column for the same reason.

iii) In METHOD_OF_DRUG_DETERMINATION, 93.9% of values are Not_Reported. Therefore, we will drop this column as well.

iv) Not_Applicable_-_Driver/None_-_All_Other_Persons accounts for 95.6% of values in RELATED_FACTOR_(1)-PERSON_LEVEL, 98.3% of values in RELATED_FACTOR_(2)-PERSON_LEVEL, and 99.4% of values in RELATED_FACTOR_(3)-PERSON_LEVEL. Therefore, we will drop these three columns too.

In [None]:
fars = fars.drop(columns=[
    'DRUG_TEST_TYPE_(1_of_3)',
    'DRUG_TEST_TYPE_(2_of_3)',
    'DRUG_TEST_TYPE_(3_of_3)',
    'METHOD_ALCOHOL_DETERMINATION',
    'METHOD_OF_DRUG_DETERMINATION',
    'RELATED_FACTOR_(1)-PERSON_LEVEL',
    'RELATED_FACTOR_(2)-PERSON_LEVEL',
    'RELATED_FACTOR_(3)-PERSON_LEVEL'
    ])
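
The percentages quoted above were read off the profile report. As a rough sketch, the share of the most frequent value in each column can also be computed directly with `value_counts(normalize=True)`. The toy data, the value `Evidential_Test`, and the 0.8 `threshold` below are illustrative assumptions; the columns above were selected by inspection, not by a fixed cut-off.

```python
import pandas as pd

# Toy frame standing in for `fars` (hypothetical values)
toy = pd.DataFrame({
    "METHOD_OF_DRUG_DETERMINATION": ["Not_Reported"] * 9 + ["Evidential_Test"],
    "SEX": ["Male"] * 6 + ["Female"] * 4,
})

# Share of rows taken by the most frequent value in each column
top_share = toy.apply(lambda col: col.value_counts(normalize=True).iloc[0])

threshold = 0.8  # hypothetical cut-off for "dominated by one value"
low_information = top_share[top_share > threshold].index.tolist()

print(top_share)
print(low_information)
```

A list like `low_information` is only a starting point: a column can be dominated by one value and still be predictive for the rare classes, so each candidate deserves a look before it is dropped.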

In [None]:
fars.columns

Index(['CASE_STATE', 'AGE', 'SEX', 'PERSON_TYPE', 'SEATING_POSITION',
       'RESTRAINT_SYSTEM-USE', 'AIR_BAG_AVAILABILITY/DEPLOYMENT',
       'EJECTION_PATH', 'EXTRICATION', 'NON_MOTORIST_LOCATION',
       'ALCOHOL_TEST_TYPE', 'ALCOHOL_TEST_RESULT',
       'DRUG_TEST_RESULTS_(1_of_3)', 'HISPANIC_ORIGIN', 'TAKEN_TO_HOSPITAL',
       'RACE', 'INJURY_SEVERITY'],
      dtype='object')

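We will now move on to normalizing the numerical data using feature standardization: from each value we subtract the column mean and then divide by the column's sample standard deviation.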

In [None]:
fars['AGE'] = (fars['AGE'] - np.mean(fars['AGE'])) / np.std(fars['AGE'], ddof=1)
fars['ALCOHOL_TEST_RESULT'] = (fars['ALCOHOL_TEST_RESULT'] - np.mean(fars['ALCOHOL_TEST_RESULT'])) / np.std(fars['ALCOHOL_TEST_RESULT'], ddof=1)
fars['DRUG_TEST_RESULTS_(1_of_3)'] = (fars['DRUG_TEST_RESULTS_(1_of_3)'] - np.mean(fars['DRUG_TEST_RESULTS_(1_of_3)'])) / np.std(fars['DRUG_TEST_RESULTS_(1_of_3)'], ddof=1)
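
The three lines above apply the same z-score formula to each numerical column. A minimal sketch of the same computation as a reusable helper, checked on a few made-up ages (`standardize`, `age`, and `z` are illustrative names, not part of the notebook):

```python
import numpy as np
import pandas as pd

def standardize(series: pd.Series) -> pd.Series:
    """z-score using the sample standard deviation (ddof=1)."""
    return (series - series.mean()) / series.std(ddof=1)

# Hypothetical values standing in for fars['AGE']
age = pd.Series([34, 20, 43, 38, 50], name="AGE")
z = standardize(age)

# After standardization the mean is ~0 and the sample std is ~1
print(z.mean(), z.std(ddof=1))
```

Note that the statistics here, as in the cell above, are computed on the whole column before any train/test split. A stricter protocol would estimate the mean and standard deviation on each training fold only.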

In [None]:
fars.dtypes

Unnamed: 0,0
CASE_STATE,category
AGE,float64
SEX,category
PERSON_TYPE,category
SEATING_POSITION,category
RESTRAINT_SYSTEM-USE,category
AIR_BAG_AVAILABILITY/DEPLOYMENT,category
EJECTION_PATH,category
EXTRICATION,category
NON_MOTORIST_LOCATION,category


Now we will move on to preparing the data for machine learning.

First, we will separate the dataset into the explanatory variables and the response variable. Then, we will one-hot encode the categorical explanatory variables and label encode the response variable.

In [None]:
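# Separating the dataset into explanatory variables (X) and the response (y)
X = fars.iloc[:, :-1]
y = fars.iloc[:, -1]

# One-hot encoding the categorical columns; dtype=float keeps the resulting
# NumPy array purely numeric alongside the standardized float columns
X = pd.get_dummies(X, dtype=float).to_numpy()
# Label encoding the response; le.classes_ maps the integers back to labels
le = LabelEncoder()
y = le.fit_transform(y)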
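
To make the two encodings concrete, here is a small sketch on a toy frame with made-up rows. The names `toy`, `X_toy`, `y_toy`, and `le_toy` are illustrative, and passing `dtype=float` to `get_dummies` is an assumption made here to keep the output purely numeric.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for `fars` (hypothetical values)
toy = pd.DataFrame({
    "AGE": [0.5, -1.2, 0.7],
    "SEX": pd.Categorical(["Male", "Female", "Male"]),
    "INJURY_SEVERITY": pd.Categorical(
        ["Fatal_Injury", "No_Injury", "Fatal_Injury"]),
})

# One-hot encoding: numeric columns pass through unchanged,
# each category of SEX becomes its own 0/1 column
X_toy = pd.get_dummies(toy.iloc[:, :-1], dtype=float)

# Label encoding: each class becomes an integer, in sorted order
le_toy = LabelEncoder()
y_toy = le_toy.fit_transform(toy.iloc[:, -1])

print(list(X_toy.columns))
print(y_toy, list(le_toy.classes_))
```

`le_toy.inverse_transform` converts predicted integers back to the original class names, which is useful when reporting per-class results.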

We will build four different models: a Random Forest Classifier, a Decision Tree Classifier, a Support Vector Classifier, and a Logistic Regression Classifier.


We will now instantiate the classifiers and define the parameter grid for each grid search. Since the profile report showed us that there is a lot of class imbalance in the data, we will set the class_weight parameter of all our machine learning models to balanced, which weights each class inversely to its frequency in the training data, to carry out cost-sensitive classification.

In [None]:
# Random Forest Classifier and its associated parameters for grid search
rfc = RandomForestClassifier(class_weight="balanced")
param_grid_rfc = dict(max_depth = [2, 3, 4], n_estimators = [100, 200, 500])

# Decision Tree Classifier and its associated parameters for grid search
dtc = DecisionTreeClassifier(class_weight="balanced")
param_grid_dtc = dict(max_depth = [2, 4, 10])

# Support Vector Classifier and its associated parameters for grid search
svc = SVC(class_weight="balanced", max_iter = 100)
param_grid_svc = dict(C = [0.01, 0.1, 1.0])

# Logistic Regression Classifier and its associated parameters for grid search
lr = LogisticRegression(class_weight="balanced")
param_grid_lr = dict(C = [0.01, 0.1, 1.0])
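
With `class_weight="balanced"`, scikit-learn weights each class by `n_samples / (n_classes * count_of_that_class)`, so rare classes count for more in the loss. The sketch below reproduces those weights on a made-up label vector (`y_toy` is illustrative, not the FARS labels) using `compute_class_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: six of class 0, three of class 1, one of class 2
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
classes = np.unique(y_toy)

# Weights scikit-learn derives for class_weight="balanced"
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# The same weights from the formula n_samples / (n_classes * class_count)
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```

Running the same call on the encoded `y` would show how strongly each injury-severity class is up- or down-weighted.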

For each of the models, we will run five-fold stratified cross-validation. For each fold, we will split the data into training and test sets, use grid search cross-validation on the training set to choose the best-performing parameters, and predict the test set with the resulting best estimator. Finally, we will store the weighted F1, precision, and recall scores of each fold.

Afterwards, we will calculate the average of these scores to compare the models.

In [None]:
# We will create a function to automate the process described above.

def process(X, y, classifier, param_grid):

  print("The following is for ", classifier, ":")

  f1_scores = []
  precision_scores = []
  recall_scores = []

  # Initialize the five-fold stratified cross-validation
  kf = StratifiedKFold(n_splits = 5, shuffle = True)

  for train_index, test_index in kf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Carrying out grid search cross-validation
    grid_search = GridSearchCV(
        classifier,
        param_grid = param_grid,
        cv = 5,
        scoring = "f1_weighted",
        n_jobs = -1)
    grid_search.fit(X_train, y_train)
    # Choosing the best estimator
    estimator = grid_search.best_estimator_
    # Predicting the test data with the best estimator
    predictions = estimator.predict(X_test)
    # Storing the f1 scores
    f1_score_1 = f1_score(y_test, predictions, average = "weighted")
    f1_scores.append(f1_score_1)
    # Storing the precision scores
    precision_score_1 = precision_score(y_test, predictions, average = "weighted")
    precision_scores.append(precision_score_1)
    # Storing the recall scores
    recall_score_1 = recall_score(y_test, predictions, average = "weighted")
    recall_scores.append(recall_score_1)

  print("Average F1 Score: {0}".format(np.average(f1_scores)))
  print("Average Precision Score: {0}".format(np.average(precision_scores)))
  print("Average Recall Score: {0}".format(np.average(recall_scores)))
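
The function above is a form of nested cross-validation: an outer stratified loop estimates performance, and an inner grid search tunes the parameters on each training fold. As a hedged alternative sketch, the same structure can be written by passing a `GridSearchCV` object to `cross_validate`. The synthetic data from `make_classification` and the fixed `random_state` values are assumptions made so that the example is small and repeatable; they are not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Small synthetic, imbalanced three-class problem standing in for FARS
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=0)

# Inner loop: grid search over max_depth, scored by weighted F1
inner = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=0),
    param_grid={"max_depth": [2, 4, 10]},
    cv=5,
    scoring="f1_weighted")

# Outer loop: five stratified folds for the performance estimate
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    inner, X_toy, y_toy, cv=outer,
    scoring=["f1_weighted", "precision_weighted", "recall_weighted"])

for metric in ("f1_weighted", "precision_weighted", "recall_weighted"):
    print(metric, scores["test_" + metric].mean())
```

Fixing `random_state` in the outer splitter makes the fold assignment repeatable, which helps when comparing several classifiers on the same folds.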

In [None]:
process(X, y, rfc, param_grid_rfc)

The following is for  RandomForestClassifier(class_weight='balanced') :
Average F1 Score: 0.7394957820406888
Average Precision Score: 0.7790617355615987
Average Recall Score: 0.7600426188209328


In [None]:
process(X, y, dtc, param_grid_dtc)

The following is for  DecisionTreeClassifier(class_weight='balanced') :
Average F1 Score: 0.7395038780470266
Average Precision Score: 0.7927526374174689
Average Recall Score: 0.7582039999005719


In [None]:
process(X, y, svc, param_grid_svc)

The following is for  SVC(class_weight='balanced', max_iter=100) :
Average F1 Score: 0.44964306836017875
Average Precision Score: 0.4743750012804785
Average Recall Score: 0.4527439506374093


In [None]:
process(X, y, lr, param_grid_lr)

The following is for  LogisticRegression(class_weight='balanced') :


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Average F1 Score: 0.7707956293620686
Average Precision Score: 0.7805380865722025
Average Recall Score: 0.7746656405916436
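
The convergence warning above means that the solver stopped at its iteration limit (`max_iter` defaults to 100 in `LogisticRegression`) before reaching its tolerance. As the message suggests, one option is to raise `max_iter`. The following is only a sketch on synthetic data from `make_classification`; the value 1000 is an illustrative assumption, and whether it is enough for the one-hot encoded FARS features has not been verified here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the FARS dataset)
X_toy, y_toy = make_classification(n_samples=500, n_features=20, random_state=0)

# Same settings as `lr` above, but with a higher iteration cap
lr_toy = LogisticRegression(class_weight="balanced", max_iter=1000)
lr_toy.fit(X_toy, y_toy)

# n_iter_ reports how many iterations the solver actually used
print(lr_toy.n_iter_)
```

If `n_iter_` stays below the cap, the solver converged; if it equals the cap, the limit was reached again and the coefficients may still be changing.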


To conclude:

Among the classifiers, the Support Vector Classifier performed especially poorly, with all three average scores below 0.5. This is most likely because we capped the maximum number of iterations at 100. Had we let the classifier run until convergence, it might have scored better, but we chose not to, as it had still not converged after running for two hours.

The other three classifiers performed moderately well: their average F1, precision, and recall scores all lay between 0.7 and 0.8. Among these three, the Logistic Regression classifier performed the best, as its average F1 (0.771) and average recall (0.775) scores were higher than those of the other two, although the Decision Tree Classifier had the highest average precision (0.793).