# Predicting Road Traffic Accident Severity


### Introduction
The FARS dataset is a collection of statistics on US road traffic accidents. It has 30 features and over 100k examples. The objective of this project is to predict the severity of the injury sustained in an accident. We build machine learning pipelines that classify injury severity into one of 7 class labels, using appropriate data cleaning, preprocessing, modelling, and hyperparameter tuning steps.

### Business Understanding
The business goal is to improve road safety by using machine learning to predict and classify the severity of traffic accidents. Such predictions can help stakeholders, such as transportation authorities and emergency services, allocate resources efficiently and implement targeted preventive measures. The success criterion for this project is a robust machine learning pipeline that accurately classifies accident severity into one of the seven predefined labels. Key performance indicators are classification accuracy, precision, and recall, which together reflect the model's ability to distinguish between levels of injury severity. Ultimately, success would be measured by real-world impact: safer roads and lower human and economic costs from traffic accidents.

### Research Question
"Can we identify the injury severity accurately using the given dataset?"

# Importing Libraries

In [None]:
!pip install sweetviz



In [None]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import sweetviz as sv
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from imblearn.over_sampling import SMOTE
from collections import Counter


# Import the dataset

In [None]:
##importing the dataset
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:

# Reading the data into a Pandas DataFrame
df = pd.read_csv('/content/drive/MyDrive/fars.csv', delimiter=',')


# Data Formatting and Cleaning

The initial step in cleaning the FARS dataset involved checking for null values, which, upon inspection, were found to be absent. Subsequently, duplicate entries were identified and removed to ensure data integrity. The dataset's target variable, injury severity, comprised eight classes, with one designated as an unknown class. Recognizing that the unknown class did not contribute to a meaningful understanding of injury severity outcomes, a strategic decision was made to exclude it from the analysis, resulting in a refined target variable with seven distinct classes. This preprocessing step aimed to enhance the model's predictive accuracy and interpretability by focusing on relevant and meaningful injury severity categories for subsequent machine learning model development and evaluation.


In [None]:
percent_missing = round(df.isnull().sum()/len(df)*100,2)
percent_missing

CASE_STATE                             0.0
AGE                                    0.0
SEX                                    0.0
PERSON_TYPE                            0.0
SEATING_POSITION                       0.0
RESTRAINT_SYSTEM-USE                   0.0
AIR_BAG_AVAILABILITY/DEPLOYMENT        0.0
EJECTION                               0.0
EJECTION_PATH                          0.0
EXTRICATION                            0.0
NON_MOTORIST_LOCATION                  0.0
POLICE_REPORTED_ALCOHOL_INVOLVEMENT    0.0
METHOD_ALCOHOL_DETERMINATION           0.0
ALCOHOL_TEST_TYPE                      0.0
ALCOHOL_TEST_RESULT                    0.0
POLICE-REPORTED_DRUG_INVOLVEMENT       0.0
METHOD_OF_DRUG_DETERMINATION           0.0
DRUG_TEST_TYPE_(1_of_3)                0.0
DRUG_TEST_RESULTS_(1_of_3)             0.0
DRUG_TEST_TYPE_(2_of_3)                0.0
DRUG_TEST_RESULTS_(2_of_3)             0.0
DRUG_TEST_TYPE_(3_of_3)                0.0
DRUG_TEST_RESULTS_(3_of_3)             0.0
HISPANIC_OR

In [None]:
df = df.drop_duplicates()

In [None]:
df = df[df['INJURY_SEVERITY'] != 'Unknown']

# Data Exploration

The dataset has 92605 entries and 29 features. The correlation matrix indicates the relationships between variables in the FARS dataset. 'AGE' shows a weak negative correlation with 'ALCOHOL_TEST_RESULT,' implying a slight decrease in positive alcohol test results with increasing age. The correlation between 'AGE' and 'DRUG_TEST_RESULTS_(1_of_3)' is very weakly positive. 'ALCOHOL_TEST_RESULT' has a weak positive correlation with 'DRUG_TEST_RESULTS_(2_of_3),' suggesting a slight association between alcohol and the second drug test result. The strongest correlation is between 'DRUG_TEST_RESULTS_(2_of_3)' and 'DRUG_TEST_RESULTS_(3_of_3),' indicating a highly positive relationship between the second and third drug test results. These findings provide insights into potential associations but do not establish causation, emphasizing the need for a comprehensive analysis of all relevant factors in predicting accident severity.

The use of the Sweetviz library for exploratory data analysis (EDA) has unveiled important insights into the FARS dataset, particularly highlighting that 24 out of 30 features are categorical, with many instances of unknown or not reported values. The EDA report emphasizes the prevalence of such values in certain features, like 'Police Reported Drug Involvement,' where 90% of the data is not reported. This observation underscores the need for feature selection or dimensionality reduction techniques. Techniques such as removing features with a high percentage of missing values or employing methods like Principal Component Analysis (PCA) can be considered to streamline the dataset and retain only those features deemed most informative for predicting accident severity. This step is crucial to improve model interpretability, reduce computational complexity, and enhance the overall efficiency of the machine learning pipeline.

The injury severity variable is heavily imbalanced: the majority class (fatal injuries) has 41442 entries, while the smallest class ("died prior to the accident") has only 9. This disparity calls for class imbalance correction, so that the model is not biased toward predicting the majority class and can capture patterns in the minority classes, improving its performance in predicting injury severity across all classes.
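
To make the size of the imbalance concrete, the short sketch below recomputes class shares, the majority-to-minority ratio, and the accuracy of a trivial majority-class predictor. The counts are copied from the `value_counts()` output in this notebook; the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output in this notebook.
class_counts = {
    "Fatal_Injury": 41442,
    "No_Injury": 15642,
    "Incapaciting_Injury": 14230,
    "Nonincapaciting_Evident_Injury": 12945,
    "Possible_Injury": 8104,
    "Injured_Severity_Unknown": 233,
    "Died_Prior_to_Accident": 9,
}

total = sum(class_counts.values())
majority = max(class_counts, key=class_counts.get)
minority = min(class_counts, key=class_counts.get)

# Share of each class, largest first.
for label, count in sorted(class_counts.items(), key=lambda kv: -kv[1]):
    print(f"{label:32s} {count:6d}  {count / total:7.3%}")

# Ratio between the largest and the smallest class.
imbalance_ratio = class_counts[majority] / class_counts[minority]
print(f"Imbalance ratio ({majority} : {minority}) = {imbalance_ratio:.0f} : 1")

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = class_counts[majority] / total
print(f"Majority-class baseline accuracy = {majority_baseline:.3f}")
```

The baseline is a useful reference point: any model evaluated on the original class distribution should be compared against it, not against chance.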

In [None]:
df.shape

(92605, 30)

In [None]:
# Pairwise Pearson correlations between the numeric columns only
correlation_matrix = df.corr(numeric_only=True)
for column in correlation_matrix.columns:
    print(f"Correlation of '{column}' with other columns:")
    print(correlation_matrix[column])
    print("\n")

Correlation of 'AGE' with other columns:
AGE                           1.000000
ALCOHOL_TEST_RESULT          -0.080741
DRUG_TEST_RESULTS_(1_of_3)    0.031057
DRUG_TEST_RESULTS_(2_of_3)    0.024676
DRUG_TEST_RESULTS_(3_of_3)    0.025657
Name: AGE, dtype: float64


Correlation of 'ALCOHOL_TEST_RESULT' with other columns:
AGE                          -0.080741
ALCOHOL_TEST_RESULT           1.000000
DRUG_TEST_RESULTS_(1_of_3)    0.035253
DRUG_TEST_RESULTS_(2_of_3)    0.081232
DRUG_TEST_RESULTS_(3_of_3)    0.104368
Name: ALCOHOL_TEST_RESULT, dtype: float64


Correlation of 'DRUG_TEST_RESULTS_(1_of_3)' with other columns:
AGE                           0.031057
ALCOHOL_TEST_RESULT           0.035253
DRUG_TEST_RESULTS_(1_of_3)    1.000000
DRUG_TEST_RESULTS_(2_of_3)    0.618618
DRUG_TEST_RESULTS_(3_of_3)    0.612122
Name: DRUG_TEST_RESULTS_(1_of_3), dtype: float64


Correlation of 'DRUG_TEST_RESULTS_(2_of_3)' with other columns:
AGE                           0.024676
ALCOHOL_TEST_RESULT        



In [None]:
report = sv.analyze(df)
report.show_html('report.html')


Report report.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


In [None]:
df['INJURY_SEVERITY'].value_counts()

Fatal_Injury                      41442
No_Injury                         15642
Incapaciting_Injury               14230
Nonincapaciting_Evident_Injury    12945
Possible_Injury                    8104
Injured_Severity_Unknown            233
Died_Prior_to_Accident                9
Name: INJURY_SEVERITY, dtype: int64

# Data Preprocessing

The data preprocessing steps begin with an 80-20 split into training and test sets, so that models are trained and evaluated on distinct data. Numerical features are MinMax-scaled to a common range, and categorical features are one-hot encoded into binary indicator columns. The response variable, injury severity, is mapped to integer labels from 1 to 7 in increasing order of severity and then passed through a Label Encoder. To address class imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) is applied to the training set, equalizing every severity class at 33,174 instances. Together, these steps aim to improve the performance and generalizability of the machine learning models by ensuring a balanced representation of classes and appropriate feature transformations.


## Splitting the data

In [None]:

# Splitting the data into training and testing sets
X = df.drop('INJURY_SEVERITY', axis=1)  # Features
y = df['INJURY_SEVERITY']  # Target variable

# Splitting while ensuring shuffling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)


In [None]:
np.unique(y_train)

array(['Died_Prior_to_Accident', 'Fatal_Injury', 'Incapaciting_Injury',
       'Injured_Severity_Unknown', 'No_Injury',
       'Nonincapaciting_Evident_Injury', 'Possible_Injury'], dtype=object)
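
The split above shuffles but does not stratify, so the share of a very rare class in each split is left to chance. As a hedged illustration of the alternative, the sketch below uses the `stratify` argument of `train_test_split` on hypothetical toy labels (not the FARS data); it is not what was run in this notebook.

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 90 samples of a common class, 10 of a rare one.
y_toy = np.array(["common"] * 90 + ["rare"] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy asks for the same class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, shuffle=True, stratify=y_toy
)

print("train:", Counter(y_tr))
print("test: ", Counter(y_te))
```

With stratification, the rare class keeps its 10% share in both the training and the test split.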

## Scaling

In [None]:
numerical_columns = X_train.select_dtypes(include=np.number).columns.tolist()

# Initialize MinMaxScaler
scaler = MinMaxScaler()

# Fit on the training data only and transform its numerical columns
X_train[numerical_columns] = scaler.fit_transform(X_train[numerical_columns])

# Apply the same fitted transformation to the numerical columns in X_test
X_test[numerical_columns] = scaler.transform(X_test[numerical_columns])


## Encoding

In [None]:
# Extracting categorical columns
categorical_columns = X_train.select_dtypes(include=['object']).columns.tolist()
# Reset the index to a default RangeIndex before encoding
X_train.reset_index(drop=True, inplace=True)

# One-hot encoding without dropping first column
X_train = pd.get_dummies(X_train, columns=categorical_columns, drop_first=False)
X_train.shape

(74084, 354)

In [None]:

# Extract categorical columns in X_test
categorical_columns_test = X_test.select_dtypes(include=['object']).columns.tolist()

# Reset the index to a default RangeIndex before encoding
X_test.reset_index(drop=True, inplace=True)

# One-hot encoding without dropping first column using the same columns as X_train
X_test = pd.get_dummies(X_test, columns=categorical_columns_test, drop_first=False)

# Ensure the columns in X_test match the columns in X_train after encoding
# Add missing columns in X_test (if any) with zeros
missing_cols = set(X_train.columns) - set(X_test.columns)
for col in missing_cols:
    X_test[col] = 0

# Reorder columns in X_test to match the order of columns in X_train
X_test = X_test[X_train.columns]
X_test.shape

(18521, 354)
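
The manual column alignment above can also be expressed with scikit-learn transformers. The following is a minimal sketch, assuming hypothetical toy frames whose column names only mimic the FARS features: `OneHotEncoder(handle_unknown="ignore")` encodes a category unseen during fitting as all zeros, so train and test always end up with the same columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy frames; the column names only mimic the FARS features.
train = pd.DataFrame({"AGE": [20, 40, 60], "SEX": ["Male", "Female", "Male"]})
test = pd.DataFrame({"AGE": [30, 50], "SEX": ["Female", "Unknown"]})

preprocess = ColumnTransformer(
    [
        ("num", MinMaxScaler(), ["AGE"]),
        # Categories unseen during fit are encoded as all zeros instead of raising.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["SEX"]),
    ],
    sparse_threshold=0,  # always return a dense array, for easy inspection
)

train_enc = preprocess.fit_transform(train)  # learn ranges and categories on train only
test_enc = preprocess.transform(test)        # reuse them on test

print(train_enc.shape, test_enc.shape)
print(test_enc)
```

The second test row contains the unseen category "Unknown", so both of its one-hot columns are zero while its scaled age is still computed from the training range.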

In [None]:

# Assuming y_train is a Series after resetting the index
label_mapping = {
    'Fatal_Injury': 6,
    'No_Injury': 2,
    'Incapaciting_Injury': 5,
    'Nonincapaciting_Evident_Injury': 4,
    'Possible_Injury': 3,
    'Injured_Severity_Unknown': 1,
    'Died_Prior_to_Accident': 7
}

# Initialize LabelEncoder
encoder = LabelEncoder()

# Transform the 'INJURY_SEVERITY' values using the label mapping
y_train = y_train.map(label_mapping)

# Fit and transform using LabelEncoder
y_train = encoder.fit_transform(y_train)
# Adding 1 to shift the labels to start from 1
y_train += 1
y_train.shape

(74084,)

In [None]:
# Transform the 'INJURY_SEVERITY' values in y_test using the label mapping
y_test = y_test.map(label_mapping)

# Transform y_test using the same encoder as used on y_train
y_test = encoder.transform(y_test)
# Adding 1 to shift the labels to start from 1
y_test += 1
y_test.shape

(18521,)
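
As a small check of the encoding logic above, the sketch below shows that when every class is present, the dictionary mapping alone already yields the final labels 1 to 7, and the Label Encoder plus the `+ 1` shift leaves them unchanged. It also shows the case where the two would differ. The sample series are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

label_mapping = {
    "Injured_Severity_Unknown": 1,
    "No_Injury": 2,
    "Possible_Injury": 3,
    "Nonincapaciting_Evident_Injury": 4,
    "Incapaciting_Injury": 5,
    "Fatal_Injury": 6,
    "Died_Prior_to_Accident": 7,
}

# Hypothetical sample containing every class once.
y_all = pd.Series(list(label_mapping))
mapped_all = y_all.map(label_mapping)                        # ordinal codes 1..7
encoded_all = LabelEncoder().fit_transform(mapped_all) + 1   # 0..6, shifted back to 1..7
same_when_complete = np.array_equal(mapped_all.to_numpy(), encoded_all)
print("All classes present, identical labels:", same_when_complete)

# Hypothetical sample in which the lowest class is missing: the encoder
# re-ranks the remaining codes, so the labels no longer match the mapping.
y_partial = y_all[y_all != "Injured_Severity_Unknown"]
mapped_partial = y_partial.map(label_mapping)                # codes 2..7
encoded_partial = LabelEncoder().fit_transform(mapped_partial) + 1
same_when_partial = np.array_equal(mapped_partial.to_numpy(), encoded_partial)
print("One class missing, identical labels:", same_when_partial)
```

In this notebook all seven classes appear in `y_train`, so the extra encoding step does not change the labels.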

## Handling Class Imbalance

In [None]:

# Initialize SMOTE
smote = SMOTE(random_state=42)

# Apply SMOTE to the training data
X_train, y_train = smote.fit_resample(X_train, y_train)


print('Original dataset shape', Counter(y))
print('Resampled dataset shape', Counter(y_train))


Original dataset shape Counter({'Fatal_Injury': 41442, 'No_Injury': 15642, 'Incapaciting_Injury': 14230, 'Nonincapaciting_Evident_Injury': 12945, 'Possible_Injury': 8104, 'Injured_Severity_Unknown': 233, 'Died_Prior_to_Accident': 9})
Resampled dataset shape Counter({6: 33174, 3: 33174, 5: 33174, 4: 33174, 2: 33174, 1: 33174, 7: 33174})
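
To illustrate what SMOTE does when it raises a minority class to 33,174 rows, here is a deliberately simplified NumPy sketch of its core idea: each synthetic row is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. The function name and toy data are hypothetical, and this is not the `imblearn` implementation.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                    # random base sample
    nn = neighbours[base, rng.integers(0, k, size=n_new)]    # one of its k neighbours
    gap = rng.random((n_new, 1))                             # position along the segment
    return X_min[base] + gap * (X_min[nn] - X_min[base])

# Hypothetical minority class with four samples in two dimensions.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_oversample(X_min, n_new=6)
print(synthetic.round(3))
```

Because the new rows are interpolated, one-hot columns can take fractional values after resampling. `imblearn` also provides `SMOTENC` for data with categorical columns, which may be worth considering here.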


# Logistic Regression with LDA and K-fold cross validation
A logistic regression model with Linear Discriminant Analysis (LDA) and 5-fold cross-validation serves as the baseline. The validation accuracy is about 72% in every fold, which suggests that the model fits consistently across folds. Because the folds are drawn from the SMOTE-resampled training data, the model's generalization still has to be checked on the independent test set.

On the test set, accuracy drops slightly to about 71% (0.7074), a small decrease on new, unseen data. Alternative models or hyperparameter tuning may improve on this. Nonetheless, logistic regression with LDA is a useful baseline on which subsequent iterations can build.

### Model training

In [None]:


# Set up the pipeline including LDA and Logistic Regression
pipeline = Pipeline([
    ('lda', LinearDiscriminantAnalysis()),  # LDA step
    ('classifier', LogisticRegression(max_iter=1000))  # Logistic Regression step
])

# Convert Pandas DataFrame to NumPy arrays
X_train_np = X_train.to_numpy()

# Set up k-fold cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Perform k-fold cross-validation without hyperparameter tuning
for train_index, val_index in kfold.split(X_train_np):
    X_train_fold, X_val_fold = X_train_np[train_index], X_train_np[val_index]
    y_train_fold, y_val_fold = y_train[train_index], y_train[val_index]

    # Fit the model on the training fold
    pipeline.fit(X_train_fold, y_train_fold)

    # Evaluate the model on the validation fold
    y_pred = pipeline.predict(X_val_fold)
    accuracy = accuracy_score(y_val_fold, y_pred)
    print("Accuracy on Validation Set:", accuracy)


Accuracy on Validation Set: 0.7198777021789682
Accuracy on Validation Set: 0.7241409008698648
Accuracy on Validation Set: 0.720631297907157
Accuracy on Validation Set: 0.7244364059169305
Accuracy on Validation Set: 0.724005770514394


### Model Visualization

In [None]:
pipeline

### Model Evaluation

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

# Predict on the test set using the trained pipeline
y_pred_test = pipeline.predict(X_test)

# Calculate Accuracy on the test set
accuracy_test = accuracy_score(y_test, y_pred_test)
print(f'Accuracy on Test Set: {accuracy_test:.4f}')

# Calculate Precision, Recall, and F1-score on the test set
precision_test = precision_score(y_test, y_pred_test, average='weighted')
recall_test = recall_score(y_test, y_pred_test, average='weighted')
f1_test = f1_score(y_test, y_pred_test, average='weighted')

print(f'Precision on Test Set: {precision_test:.4f}')
print(f'Recall on Test Set: {recall_test:.4f}')
print(f'F1-score on Test Set: {f1_test:.4f}')



Accuracy on Test Set: 0.7074




Precision on Test Set: 0.7114
Recall on Test Set: 0.7074
F1-score on Test Set: 0.7026
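
In the results reported here, weighted recall equals accuracy (0.7074 in both cases). This holds by construction: averaging per-class recall with class-frequency weights gives the overall fraction of correct predictions. The sketch below demonstrates this on hypothetical labels, and also shows macro averaging and the `zero_division` argument for a class that is never predicted.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    precision_score,
    recall_score,
)

# Hypothetical labels: class 3 occurs in y_true but is never predicted.
y_true = np.array([1, 1, 1, 2, 2, 3])
y_pred = np.array([1, 1, 2, 2, 2, 1])

acc = accuracy_score(y_true, y_pred)
rec_weighted = recall_score(y_true, y_pred, average="weighted")
rec_macro = recall_score(y_true, y_pred, average="macro")
# Precision for class 3 is undefined (no predictions); count it as 0 explicitly.
prec_weighted = precision_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"accuracy        = {acc:.4f}")
print(f"weighted recall = {rec_weighted:.4f}")
print(f"macro recall    = {rec_macro:.4f}")
print(f"weighted prec.  = {prec_weighted:.4f}")

# Per-class breakdown, which weighted averages tend to hide.
print(classification_report(y_true, y_pred, zero_division=0))
```

With classes as imbalanced as in this dataset, macro averages and the per-class report give a clearer picture of how the rare classes are handled.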


# Implementing PCA on train and test set
For the next set of models, we use Principal Component Analysis (PCA) to reduce the dimensionality of the 354 encoded features. PCA does not select original features; it projects the data onto a smaller set of orthogonal components, here enough to retain 95% of the variance, which simplifies the input the models have to learn from.

In [None]:

# Initialize PCA and fit it to the scaled training data
pca = PCA(n_components=0.95)  # Choose the number of components or explained variance ratio
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
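
As a hedged illustration of what `n_components=0.95` does, the sketch below fits PCA to synthetic data (10 columns driven by 3 latent factors, an assumption chosen for the example) and inspects how many components were kept and how much variance they explain.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in: 10 observed columns driven by 3 latent factors plus small noise.
rng = np.random.default_rng(42)
latent = rng.normal(size=(500, 3))
X_demo = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(500, 10))

pca_demo = PCA(n_components=0.95)  # a float in (0, 1) is read as a variance target
X_reduced = pca_demo.fit_transform(X_demo)

cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print("Components kept:", pca_demo.n_components_)
print("Cumulative explained variance:", cumulative.round(4))
print("Reduced shape:", X_reduced.shape)
```

The same attributes (`n_components_` and `explained_variance_ratio_`) can be read from the fitted `pca` object in this notebook to see how many of the 354 encoded columns remain.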


# SVM with k-fold cross validation and PCA
We applied a Support Vector Machine (SVM) to the PCA-reduced features and evaluated it with 5-fold cross-validation. Accuracy on the validation folds was about 79%, and the model was consistent on the separate test set, also reaching 79% accuracy (0.7918). On the test set, precision was 0.7827, recall was 0.7918, and the F1-score was 0.7856. SVMs are powerful algorithms for classification tasks such as predicting injury severity. They work by finding the hyperplane that best separates the classes in feature space; with the RBF kernel used here, that separation is made in an implicit, higher-dimensional feature space. PCA reduces the dimensionality of the dataset while retaining most of its variance, which lowers the computational cost of training.



### Model Training

In [None]:

# Set up the pipeline with the SVM classifier (inputs are already scaled and PCA-reduced)
svm_pipeline = Pipeline([
    ('svm', SVC(kernel='rbf', random_state=42))  # SVM Classifier step
])

# Set up k-fold cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Perform k-fold cross-validation
for train_index, val_index in kfold.split(X_train_pca):
    X_train_fold, X_val_fold = X_train_pca[train_index], X_train_pca[val_index]
    y_train_fold, y_val_fold = y_train[train_index], y_train[val_index]

    # Fit the model on the training fold
    svm_pipeline.fit(X_train_fold, y_train_fold)

    # Evaluate the model on the validation fold
    y_pred = svm_pipeline.predict(X_val_fold)
    accuracy = accuracy_score(y_val_fold, y_pred)
    print("Accuracy on Validation Set:", accuracy)


Accuracy on Validation Set: 0.792136766859013
Accuracy on Validation Set: 0.7923951425372492
Accuracy on Validation Set: 0.7904357936439583
Accuracy on Validation Set: 0.7909911073789376
Accuracy on Validation Set: 0.7942639364382146


### Model Interpretation

In [None]:
svm_pipeline

### Model Evaluation

In [None]:

# Predict on the test set using the trained pipeline
y_pred_test = svm_pipeline.predict(X_test_pca)

# Calculate Accuracy on the test set
accuracy_test = accuracy_score(y_test, y_pred_test)
print(f'Accuracy on Test Set: {accuracy_test:.4f}')

# Calculate Precision, Recall, and F1-score on the test set
precision_test = precision_score(y_test, y_pred_test, average='weighted')
recall_test = recall_score(y_test, y_pred_test, average='weighted')
f1_test = f1_score(y_test, y_pred_test, average='weighted')

print(f'Precision on Test Set: {precision_test:.4f}')
print(f'Recall on Test Set: {recall_test:.4f}')
print(f'F1-score on Test Set: {f1_test:.4f}')




Accuracy on Test Set: 0.7918
Precision on Test Set: 0.7827
Recall on Test Set: 0.7918
F1-score on Test Set: 0.7856




# Decision Tree
A Decision Tree model with k-fold cross-validation is first trained without hyperparameter tuning and reaches between 81% and 82% accuracy on the validation folds. On the test set, however, accuracy is 72.07%, with precision, recall, and F1-score of 0.7334, 0.7207, and 0.7266, respectively. One likely contributor to this gap is that the validation folds are drawn from the SMOTE-resampled training data, whereas the test set keeps the original class distribution.

After hyperparameter tuning, the best parameters were {'max_depth': 15, 'min_samples_leaf': 2, 'min_samples_split': 5}. The tuned tree reaches 75.46% accuracy on the test set, with precision of 0.7578, recall of 0.7546, and F1-score of 0.7474.

Decision Trees are a powerful and interpretable machine learning algorithm widely used for classification and regression tasks. They work by recursively partitioning the data into subsets based on feature conditions, forming a tree-like structure. Hyperparameter tuning optimizes parameters such as the maximum depth and the minimum number of samples for leaf and split nodes, which limits overfitting and leads to a more generalized model. In this case, tuning raised test accuracy from 72.07% to 75.46%.

## Decision Tree with k-fold cross validation and without hyperparameter tuning


### Model Training

In [None]:

# Set up the pipeline including Decision Tree Classifier
dt_pipeline = Pipeline([
    ('dt', DecisionTreeClassifier(random_state=42))  # Decision Tree Classifier step
])


# Set up k-fold cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Perform k-fold cross-validation without hyperparameter tuning
for train_index, val_index in kfold.split(X_train_pca):
    X_train_fold, X_val_fold = X_train_pca[train_index], X_train_pca[val_index]
    y_train_fold, y_val_fold = y_train[train_index], y_train[val_index]

    # Fit the model on the training fold
    dt_pipeline.fit(X_train_fold, y_train_fold)

    # Evaluate the model on the validation fold
    y_pred = dt_pipeline.predict(X_val_fold)
    accuracy = accuracy_score(y_val_fold, y_pred)
    print("Accuracy on Validation Set:", accuracy)


Accuracy on Validation Set: 0.8141202308156059
Accuracy on Validation Set: 0.8181896477478253
Accuracy on Validation Set: 0.8158857979502196
Accuracy on Validation Set: 0.8156880477144026
Accuracy on Validation Set: 0.8193699804060892


### Model Interpretation

In [None]:
dt_pipeline

### Model Evaluation

In [None]:

# Predict on the test set using the trained pipeline
y_pred_test = dt_pipeline.predict(X_test_pca)

# Calculate Accuracy on the test set
accuracy_test = accuracy_score(y_test, y_pred_test)
print(f'Accuracy on Test Set: {accuracy_test:.4f}')

# Calculate Precision, Recall, and F1-score on the test set
precision_test = precision_score(y_test, y_pred_test, average='weighted')
recall_test = recall_score(y_test, y_pred_test, average='weighted')
f1_test = f1_score(y_test, y_pred_test, average='weighted')

print(f'Precision on Test Set: {precision_test:.4f}')
print(f'Recall on Test Set: {recall_test:.4f}')
print(f'F1-score on Test Set: {f1_test:.4f}')



Accuracy on Test Set: 0.7207
Precision on Test Set: 0.7334
Recall on Test Set: 0.7207
F1-score on Test Set: 0.7266


## Decision Tree with k-fold cross validation and hyperparameter tuning

### Model Training

In [None]:

# Define hyperparameters to tune
param_grid = {
    'max_depth': [10, 15],
    'min_samples_split': [5, 10],
    'min_samples_leaf': [2, 4]
}
dt = DecisionTreeClassifier()
# Use GridSearchCV for hyperparameter tuning with 5-fold cross-validation
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train_pca, y_train)

# Best hyperparameters found
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

# Train the model with the best parameters
best_dt = DecisionTreeClassifier(**best_params)
best_dt.fit(X_train_pca, y_train)

# Evaluate model performance on test set
accuracy = best_dt.score(X_test_pca, y_test)
print(f"Accuracy on test set: {accuracy:.4f}")


Fitting 5 folds for each of 8 candidates, totalling 40 fits
Best hyperparameters: {'max_depth': 15, 'min_samples_leaf': 2, 'min_samples_split': 5}
Accuracy on test set: 0.7546
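
The cell above retrains a new tree from `best_params_`. As a hedged alternative, `GridSearchCV` already refits the best configuration on the full training data (`refit=True` by default) and exposes it as `best_estimator_`, together with the mean cross-validated score as `best_score_`. The sketch below runs on synthetic data, which is an assumption for illustration, and fixes `random_state` so that repeated runs agree.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the PCA-reduced features (assumption, not the real data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

param_grid = {
    "max_depth": [10, 15],
    "min_samples_split": [5, 10],
    "min_samples_leaf": [2, 4],
}

grid_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),  # fixed seed so reruns agree
    param_grid=param_grid,
    cv=5,
)
grid_search.fit(X_tr, y_tr)

best_dt = grid_search.best_estimator_  # already refit on all of X_tr
print("Best hyperparameters:", grid_search.best_params_)
print(f"Mean cross-validated accuracy: {grid_search.best_score_:.4f}")
print(f"Accuracy on held-out test set: {best_dt.score(X_te, y_te):.4f}")
```

Printing `best_score_` next to the test accuracy makes it possible to compare validation and test performance for the tuned model directly.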


### Model Evaluation

In [None]:

# Predict on the test set using the best model found
y_pred = best_dt.predict(X_test_pca)

# Calculate Accuracy on the test set
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy on Test Set: {accuracy:.4f}')

# Calculate Precision, Recall, and F1-score on the test set
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print(f'Precision on Test Set: {precision:.4f}')
print(f'Recall on Test Set: {recall:.4f}')
print(f'F1-score on Test Set: {f1:.4f}')


Accuracy on Test Set: 0.7546
Precision on Test Set: 0.7578
Recall on Test Set: 0.7546
F1-score on Test Set: 0.7474




# Random Forest
A Random Forest classifier with 100 estimators is first trained with k-fold cross-validation and without hyperparameter tuning, reaching about 84% accuracy on the validation folds. On the test set, accuracy is 75.44%, with precision, recall, and F1-score of 0.7502, 0.7544, and 0.7521, respectively.

After hyperparameter tuning, the best parameters were {'max_depth': 15, 'n_estimators': 200}. The tuned forest reaches 78.81% accuracy on the test set, with precision of 0.7750, recall of 0.7881, and F1-score of 0.7779.

Random Forest is an ensemble learning method that builds many decision trees during training and, for classification, outputs the mode of the classes predicted by the individual trees. It handles complex datasets well, captures intricate relationships between features, and is often less sensitive to hyperparameter choices than a single decision tree. K-fold cross-validation gives a more robust estimate of model performance, and hyperparameter tuning refines the model's parameters to improve accuracy on the test set.
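
The "mode of the classes" idea can be sketched in a few lines of NumPy. The function name and the per-tree predictions below are hypothetical. Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than by counting hard votes, so this illustrates the textbook rule, not that library's exact behaviour.

```python
import numpy as np

def majority_vote(tree_predictions):
    """Return the most frequent label per sample; rows are trees, columns are samples."""
    tree_predictions = np.asarray(tree_predictions)
    n_classes = tree_predictions.max() + 1
    # Count the votes per class for every sample (column).
    votes = np.apply_along_axis(np.bincount, 0, tree_predictions, minlength=n_classes)
    return votes.argmax(axis=0)  # ties go to the smallest label

# Hypothetical predictions from three trees for four samples (labels 0..2).
preds = [
    [0, 1, 2, 2],
    [0, 2, 2, 1],
    [1, 1, 2, 0],
]
print(majority_vote(preds))  # → [0 1 2 0]
```

The last sample receives one vote per class, so the tie is resolved in favour of the smallest label.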

## Random Forest with k-fold cross validation and without hyperparameter tuning


### Model Training

In [None]:
# Create a pipeline with the Random Forest Classifier (PCA was already applied to the inputs)
rf_pipeline = Pipeline([
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Perform k-fold cross-validation without hyperparameter tuning
for train_index, val_index in kfold.split(X_train_pca):
    X_train_fold, X_val_fold = X_train_pca[train_index], X_train_pca[val_index]
    y_train_fold, y_val_fold = y_train[train_index], y_train[val_index]

    # Fit the model on the training fold
    rf_pipeline.fit(X_train_fold, y_train_fold)

    # Evaluate the model on the validation fold
    y_pred = rf_pipeline.predict(X_val_fold)
    accuracy = accuracy_score(y_val_fold, y_pred)
    print("Accuracy on Validation Set:", accuracy)


Accuracy on Validation Set: 0.8440702781844802
Accuracy on Validation Set: 0.8445654982344328
Accuracy on Validation Set: 0.8425846180346224
Accuracy on Validation Set: 0.8435070947182568
Accuracy on Validation Set: 0.8463062248347437


### Model Interpretation

In [None]:
rf_pipeline

### Model Evaluation

In [None]:

# Predict on the test set using the trained pipeline
y_pred_test = rf_pipeline.predict(X_test_pca)

# Calculate Accuracy on the test set
accuracy_test = accuracy_score(y_test, y_pred_test)
print(f'Accuracy on Test Set: {accuracy_test:.4f}')

# Calculate Precision, Recall, and F1-score on the test set
precision_test = precision_score(y_test, y_pred_test, average='weighted')
recall_test = recall_score(y_test, y_pred_test, average='weighted')
f1_test = f1_score(y_test, y_pred_test, average='weighted')

print(f'Precision on Test Set: {precision_test:.4f}')
print(f'Recall on Test Set: {recall_test:.4f}')
print(f'F1-score on Test Set: {f1_test:.4f}')




Accuracy on Test Set: 0.7544
Precision on Test Set: 0.7502
Recall on Test Set: 0.7544
F1-score on Test Set: 0.7521




## Random Forest with k-fold cross validation and hyperparameter tuning

### Model Training

In [None]:
# Define hyperparameters to tune
param_grid = {
    'n_estimators': [100,200],
    'max_depth': [10, 15],
}
rf = RandomForestClassifier(random_state=42)
# Use GridSearchCV for hyperparameter tuning with 5-fold cross-validation
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train_pca, y_train)

# Best hyperparameters found
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

# Train the model with the best parameters
best_rf = RandomForestClassifier(**best_params)
best_rf.fit(X_train_pca, y_train)

# Evaluate model performance on test set
accuracy = best_rf.score(X_test_pca, y_test)
print(f"Accuracy on test set: {accuracy:.4f}")


Fitting 5 folds for each of 4 candidates, totalling 20 fits
Best hyperparameters: {'max_depth': 15, 'n_estimators': 200}
Accuracy on test set: 0.7881


### Model Evaluation

In [None]:

# Predict on the test set using the best model found
y_pred = best_rf.predict(X_test_pca)

# Calculate Accuracy on the test set
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy on Test Set: {accuracy:.4f}')

# Calculate Precision, Recall, and F1-score on the test set
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print(f'Precision on Test Set: {precision:.4f}')
print(f'Recall on Test Set: {recall:.4f}')
print(f'F1-score on Test Set: {f1:.4f}')


Accuracy on Test Set: 0.7881
Precision on Test Set: 0.7750
Recall on Test Set: 0.7881
F1-score on Test Set: 0.7779




# Conclusion
In this project, we aimed to predict the severity of road traffic accidents using the FARS dataset, which consists of 30 features and over 100,000 examples. The project involved an extensive preprocessing phase to enhance the quality of the data for machine learning. The key preprocessing steps included an 80-20 train-test split, MinMax scaling for numerical features, One-Hot encoding for categorical features, and addressing class imbalance using SMOTE.

For model implementation, several algorithms were employed. Logistic Regression with LDA, Support Vector Machine (SVM) with PCA, Random Forest, and Decision Tree models were implemented. Each model underwent k-fold cross-validation to ensure robust evaluation. After initial model implementation, hyperparameter tuning was performed on Random Forest and Decision Tree models to optimize their performance.

Support Vector Machine (SVM) with PCA achieved about 79% accuracy on the validation folds and 79.18% on the test set, the highest test accuracy among the models implemented, narrowly ahead of the tuned Random Forest (78.81%). The project demonstrates a systematic approach, from data preprocessing to model implementation and tuning, that leads to a reasonably well-performing model for predicting accident severity.