# Assignment 5

## Driving Behavior Analysis

### Authored by:
Ganti Uday

### Introduction and Overview

Road rage is a very common occurrence. A survey conducted in 2019 found that 82% of participants had committed an act of road rage in the past year. Shockingly, 218 murders and more than 12 thousand injuries have been attributed to road rage over the last 7 years in the United States alone. Most of these incidents occur when drivers with impaired judgement speed, accelerate or brake too quickly, or swerve in one direction too abruptly.

The data includes accelerometer readings along the X, Y and Z axes, gyroscope readings along the same three axes, and a timestamp.

This dataset was taken from Kaggle. It can be found at "https://www.kaggle.com/datasets/outofskills/driving-behavior".

### Installing and importing necessary packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler


from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from xgboost import XGBClassifier


from sklearn.metrics import  confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn import preprocessing


from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import LabelEncoder


from sklearn.tree import plot_tree
from sklearn.tree import export_text
from sklearn.utils import resample



### Loading and Preprocessing

In [2]:
random_seed = 1
np.random.seed(random_seed)

In [3]:
df = pd.read_csv("motion_data.csv")
print(df.head())
print(df.columns)

       AccX      AccY      AccZ     GyroX     GyroY     GyroZ       Class  \
0  0.758194 -0.217791  0.457263  0.000000  0.000000  0.000000  AGGRESSIVE   
1  0.667560 -0.038610  0.231416 -0.054367 -0.007712  0.225257  AGGRESSIVE   
2  2.724449 -7.584121  2.390926  0.023824  0.013668 -0.038026  AGGRESSIVE   
3  2.330950 -7.621754  2.529024  0.056810 -0.180587 -0.052076  AGGRESSIVE   
4  2.847215 -6.755621  2.224640 -0.031765 -0.035201  0.035277  AGGRESSIVE   

   Timestamp  
0     818922  
1     818923  
2     818923  
3     818924  
4     818924  
Index(['AccX', 'AccY', 'AccZ', 'GyroX', 'GyroY', 'GyroZ', 'Class',
       'Timestamp'],
      dtype='object')


In [4]:
df["GyroX"] = np.radians(df["GyroX"])
df["GyroY"] = np.radians(df["GyroY"])
df["GyroZ"] = np.radians(df["GyroZ"])

df = df.drop(columns = ["Timestamp"])

df.head()

       AccX      AccY      AccZ     GyroX     GyroY     GyroZ       Class
0  0.758194 -0.217791  0.457263  0.000000  0.000000  0.000000  AGGRESSIVE
1  0.667560 -0.038610  0.231416 -0.000949 -0.000135  0.003931  AGGRESSIVE
2  2.724449 -7.584121  2.390926  0.000416  0.000239 -0.000664  AGGRESSIVE
3  2.330950 -7.621754  2.529024  0.000992 -0.003152 -0.000909  AGGRESSIVE
4  2.847215 -6.755621  2.224640 -0.000554 -0.000614  0.000616  AGGRESSIVE
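The gyroscope conversion in cell 4 can be checked by hand on a single value. The sketch below assumes, as the call to `np.radians` implies, that the raw gyroscope readings are in degrees; the variable names are illustrative only.

```python
import math

# GyroZ of row 1 before conversion, copied from the first df.head() output.
gyro_z_deg = 0.225257

# Degrees to radians: multiply by pi / 180 (this is what np.radians does element-wise).
gyro_z_rad = math.radians(gyro_z_deg)

print(round(gyro_z_rad, 6))  # 0.003931, matching GyroZ of row 1 after conversion
```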


The decision to remove the timestamp variable is subjective. We felt that it is not a reliable predictor, since the moment at which a reading was taken is not, logically, a factor in road rage.

In [5]:
df.isnull().sum()

AccX     0
AccY     0
AccZ     0
GyroX    0
GyroY    0
GyroZ    0
Class    0
dtype: int64

### Encoding and Verification

In [6]:
df['Class']=df['Class'].astype('category')

enc = LabelEncoder() 
df['Class']=enc.fit_transform(df['Class'])

In [7]:
print(pd.unique(df['Class']))
df['Class'].value_counts()

[0 1 2]


2    2604
1    2197
0    1927
Name: Class, dtype: int64
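The class counts above give a useful reference point for the accuracies reported later: the score of a classifier that always predicts the most frequent class. A minimal sketch of that arithmetic, using only the counts printed above (the variable names are hypothetical):

```python
# Class counts copied from the value_counts() output above.
class_counts = {0: 1927, 1: 2197, 2: 2604}

total = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)

# Accuracy of always predicting the majority class on the full dataset.
baseline_accuracy = class_counts[majority_label] / total

print(total, majority_label, round(baseline_accuracy, 3))
```

On the full dataset this baseline is roughly 0.387, so a model needs to clear that figure before it can be said to have learned anything from the sensor readings.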

### Splitting the Data

In [8]:
target = 'Class'

predictors = list(df.columns)
predictors.remove(target)

X=df[predictors]
y=df[target]


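# Undersample the larger classes so that every class has the same number of rows.
# Note: X_res and y_res are only inspected here; the split in the next cell
# still uses the original X and y.
rus = RandomUnderSampler(random_state=1)
X_res, y_res = rus.fit_resample(X, y)
y_res.value_counts()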

0    1927
1    1927
2    1927
Name: Class, dtype: int64

In [9]:
train_X, test_X, train_y, test_y = train_test_split(X,y, test_size=0.3, random_state=1)
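The undersampled `X_res` and `y_res` are not the arrays passed to `train_test_split` above. One way the two steps could be combined is to split first (stratified) and then undersample only the training portion, so that the test set keeps its natural class distribution. The sketch below is an assumption about how that might look, not what the notebook ran: it uses synthetic features, the class counts reported earlier, and a plain NumPy undersampler in place of `RandomUnderSampler`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic stand-in: labels follow the reported class counts, features are random.
counts = {0: 1927, 1: 2197, 2: 2604}
y_demo = np.concatenate([np.full(n, label) for label, n in counts.items()])
X_demo = rng.normal(size=(len(y_demo), 6))

# Stratified split keeps the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

# Random undersampling of the training part only: keep as many rows per class
# as the smallest training class has.
n_min = np.bincount(y_tr).min()
keep = np.concatenate([
    rng.choice(np.flatnonzero(y_tr == label), size=n_min, replace=False)
    for label in np.unique(y_tr)
])
X_tr_bal, y_tr_bal = X_tr[keep], y_tr[keep]

print(np.bincount(y_tr_bal))  # equal counts per class
print(np.bincount(y_te))      # test set keeps the original proportions
```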

## KNN Model

In [10]:
scaler = StandardScaler()
scaler.fit(train_X)
X_train = scaler.transform(train_X)
X_test = scaler.transform(test_X)

In [11]:
score_measure = 'accuracy'


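# Initial model: k is set to the square root of the number of training rows.
# Note: the model is fit on the unscaled train_X / test_X; the scaled X_train and
# X_test from the previous cell are not used here or in the grid search below.
knn = KNeighborsClassifier(n_neighbors = int(len(train_y)**(1/2)), metric='euclidean')
knn.fit(train_X, train_y)
y_pred = knn.predict(test_X)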

param_grid = {
    'n_neighbors': list(range(1,(76*2),2)),
    'metric': ['euclidean', 'cosine']
}
gridSearch = GridSearchCV(KNeighborsClassifier(), param_grid, scoring=score_measure,
                          n_jobs=-1)

gridSearch.fit(train_X, train_y)
print(score_measure, 'score: ', gridSearch.best_score_)
print('parameters: ', gridSearch.best_params_)

KNNAccuracy = gridSearch.best_score_

accuracy score:  0.44022170027368457
parameters:  {'metric': 'euclidean', 'n_neighbors': 127}


In [12]:
confusion_matrix(test_y, y_pred)

array([[168, 100, 317],
       [ 78, 135, 445],
       [ 56, 150, 570]], dtype=int64)

In [13]:
print(classification_report(test_y, y_pred))

              precision    recall  f1-score   support

           0       0.56      0.29      0.38       585
           1       0.35      0.21      0.26       658
           2       0.43      0.73      0.54       776

    accuracy                           0.43      2019
   macro avg       0.44      0.41      0.39      2019
weighted avg       0.44      0.43      0.40      2019
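The figures in the classification report can be recomputed directly from the confusion matrix, which helps when reading the two outputs together. A small sketch in plain Python (rows are true classes, columns are predicted classes, as in scikit-learn's `confusion_matrix`):

```python
# Confusion matrix copied from the output of cell 12.
cm = [
    [168, 100, 317],
    [78, 135, 445],
    [56, 150, 570],
]
n_classes = len(cm)
total = sum(sum(row) for row in cm)

# Accuracy: correctly classified rows (the diagonal) over all rows.
accuracy = sum(cm[i][i] for i in range(n_classes)) / total

# Recall per class: diagonal entry over its row sum (all rows truly in that class).
recall = [cm[i][i] / sum(cm[i]) for i in range(n_classes)]

# Precision per class: diagonal entry over its column sum (all rows predicted as that class).
precision = [
    cm[i][i] / sum(cm[r][i] for r in range(n_classes)) for i in range(n_classes)
]

print(round(accuracy, 2))
print([round(r, 2) for r in recall])
print([round(p, 2) for p in precision])
```

The low recall for classes 0 and 1 comes from the last column: most rows from every true class are predicted as class 2.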



#### Analysis:

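The grid search reached its highest cross-validated accuracy of 44.02% with a k-value of 127 and the Euclidean metric. The confusion matrix and classification report above describe the initial model (k set to the square root of the training set size), which scored 43% on the test set. Although this is fairly low, it can be attributed to the quality of the data provided. More comprehensive data with more variables could have produced a more accurate model.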
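KNN is distance based, so feature scale matters, and the scaled arrays from cell 10 are not the ones passed to `fit` in cell 11. One possible refinement is to place the scaler and the classifier in a single `Pipeline` so that scaling is refit inside every cross-validation fold. This is a sketch on synthetic data with a deliberately small, hypothetical grid, not a rerun of the notebook, and it makes no claim about how the real scores would change.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in: six features on very different scales, three classes.
X_demo = rng.normal(size=(300, 6)) * np.array([1, 1, 1, 0.01, 0.01, 0.01])
y_demo = rng.integers(0, 3, size=300)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Step-prefixed parameter names ("knn__...") route values to the KNN step.
param_grid = {
    "knn__n_neighbors": [5, 15, 25],
    "knn__metric": ["euclidean", "cosine"],
}

search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=5)
search.fit(X_demo, y_demo)

# The refit best pipeline scales and predicts in one call.
y_pred_demo = search.best_estimator_.predict(X_demo[:10])
print(search.best_params_)
```

Taking predictions from `search.best_estimator_` also keeps the confusion matrix and the reported best score tied to the same model.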

## Decision Tree

In [14]:
score_measure = 'accuracy'

dtree=DecisionTreeClassifier(random_state=random_seed)
_ = dtree.fit(train_X, train_y)

y_pred = dtree.predict(test_X)

param_grid = { 
              'max_depth': [2, 5, 10, 20, 30, 40],
              'min_samples_split': [6, 10,20],       
              'min_samples_leaf': [1, 3,  4]
              }
gridSearch = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, scoring=score_measure,
                          n_jobs=-1)

gridSearch.fit(train_X, train_y)
print(score_measure, 'score: ', gridSearch.best_score_)
print('parameters: ', gridSearch.best_params_)

DTreeAccuracy = gridSearch.best_score_

accuracy score:  0.43172755188837825
parameters:  {'max_depth': 5, 'min_samples_leaf': 3, 'min_samples_split': 10}


#### Analysis:

Even after tuning hyperparameters such as max_depth, min_samples_leaf and min_samples_split, the accuracy was only 43.17%. The difference between the candidate settings is minute. The data present is not a good fit for the decision tree model.
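To give a sense of how much work the grid search above does, the number of candidate settings and model fits can be counted from the grid itself. The sketch assumes 5-fold cross-validation, which is `GridSearchCV`'s default in recent scikit-learn releases when `cv` is not passed; the installed version is not stated in the notebook.

```python
from itertools import product

# The decision tree grid from cell 14.
param_grid = {
    "max_depth": [2, 5, 10, 20, 30, 40],
    "min_samples_split": [6, 10, 20],
    "min_samples_leaf": [1, 3, 4],
}

# Every combination of one value per hyperparameter is one candidate.
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

n_folds = 5  # assumed default; cv is not set in the notebook
n_fits = len(candidates) * n_folds

print(len(candidates), n_fits)
```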

## Random Forest

In [15]:
score_measure = 'accuracy'


rforest = RandomForestClassifier(random_state=random_seed)
_ = rforest.fit(train_X, train_y)

y_pred = rforest.predict(test_X)

param_grid = {
    'max_depth': [2,5,3,4,7,8], 
    'min_samples_split': [20,30, 40,50, 60], 
    'min_impurity_decrease': [ 0.001, 0.0005, 0.0007,0.0003], 
}
gridSearch = GridSearchCV(RandomForestClassifier(random_state=random_seed), param_grid, scoring=score_measure,
                          n_jobs=-1)

gridSearch.fit(train_X, train_y)
print(score_measure, 'score: ', gridSearch.best_score_)
print('parameters: ', gridSearch.best_params_)

RandomForestAccuracy = gridSearch.best_score_

accuracy score:  0.4502015969820244
parameters:  {'max_depth': 8, 'min_impurity_decrease': 0.0003, 'min_samples_split': 60}


#### Analysis:

Random forest gave us an accuracy of 45.02%, which is the highest among all the models that were run. The data quality is questionable, so this is the best that could be done with the default n_estimators and the hyperparameters that were tuned.

## AdaBoost Classifier

In [16]:
score_measure = 'accuracy'


aboost = AdaBoostClassifier(random_state=random_seed)
_ = aboost.fit(train_X, train_y)
y_pred = aboost.predict(test_X)


param_grid = { 
              'n_estimators':[5,20,50,100],
              }

gridSearch = GridSearchCV(AdaBoostClassifier(random_state=random_seed), param_grid, scoring=score_measure,
                          n_jobs=-1)
gridSearch.fit(train_X, train_y)
print(score_measure, 'score: ', gridSearch.best_score_)
print('parameters: ', gridSearch.best_params_)

AdaAccuracy = gridSearch.best_score_

accuracy score:  0.44043491700341375
parameters:  {'n_estimators': 5}


#### Analysis:

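AdaBoost builds an ensemble of weak learners, with each new learner concentrating on the rows the previous ones misclassified. Here it generated an accuracy score of 44.04%, with the best result at only 5 estimators.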

## GradientBoosting Classifier

In [17]:
score_measure = 'accuracy'


gboost = GradientBoostingClassifier(random_state=random_seed)
_ = gboost.fit(train_X, train_y)
y_pred = gboost.predict(test_X)


param_grid = {
    'n_estimators': (3,5,10),
    'learning_rate': (0.1,0.2,0.3,0.4,0.5)
}
gridSearch = GridSearchCV(GradientBoostingClassifier(random_state=random_seed), param_grid, scoring=score_measure,
                          n_jobs=-1)
gridSearch.fit(train_X, train_y)
print(score_measure, 'score: ', gridSearch.best_score_)
print('parameters: ', gridSearch.best_params_)

GBoostAccuracy = gridSearch.best_score_

accuracy score:  0.4423489037952578
parameters:  {'learning_rate': 0.2, 'n_estimators': 10}


#### Analysis:

In Gradient Boosting, a lower learning_rate usually requires a higher n_estimators. In this case the grid search selected a learning_rate of 0.2 together with n_estimators of 10, the largest value in the grid.
With this, it generated an accuracy score of 44.23%.

## XGBoost Classifier

In [18]:
score_measure = 'accuracy'


xgboost = XGBClassifier(random_state=random_seed)
_ = xgboost.fit(train_X, train_y)
y_pred = xgboost.predict(test_X)


param_grid = {
    'max_depth': (1,2,3),
    'max_leaves': (1,2,3,4),
    'learning_rate': (0.1,0.15,0.05),
}
gridSearch = GridSearchCV(XGBClassifier(random_state=random_seed), param_grid, scoring=score_measure,
                          n_jobs=-1)
gridSearch.fit(train_X, train_y)
print(score_measure, 'score: ', gridSearch.best_score_)
print('parameters: ', gridSearch.best_params_)

XGBoostAccuracy = gridSearch.best_score_





accuracy score:  0.44447204604578855
parameters:  {'learning_rate': 0.1, 'max_depth': 2, 'max_leaves': 1}


#### Analysis:

XGBoost Classifier is a powerful gradient boosting technique that often produces effective and accurate models. Its limitation is that it needs at least a reasonable amount of data and variables to provide quality outputs.
The best model was fit with a learning_rate of 0.1, max_depth of 2 and max_leaves of 1.

Here, with the data it was provided, it created a model with an accuracy score of 44.44%.

In [22]:
print(f"KNN Model Accuracy = {KNNAccuracy}")
print(f"Decision Tree Accuracy = {DTreeAccuracy}")
print(f"Random Forest Accuracy = {RandomForestAccuracy}")
print(f"AdaBoost Accuracy = {AdaAccuracy}")
print(f"Gradient Boost Accuracy = {GBoostAccuracy}")
print(f"XG Boost Accuracy = {XGBoostAccuracy}")

KNN Model Accuracy = 0.44022170027368457
Decision Tree Accuracy = 0.43172755188837825
Random Forest Accuracy = 0.4502015969820244
AdaBoost Accuracy = 0.44043491700341375
Gradient Boost Accuracy = 0.4423489037952578
XG Boost Accuracy = 0.44447204604578855


### Reason for choosing 'Accuracy' as the metric to evaluate the models:

In this case, TP & TN are self-explanatory. They are accurate predictions of driving behavior.

With FP & FN, the model incorrectly assumes the driver is driving slowly or normally when they are actually driving aggressively (or makes one of the other possible confusions between the three classes).

Here, FP and FN are equally dangerous; neither is less important than the other. The classes are also roughly balanced (between 1,927 and 2,604 rows each). Hence, "accuracy" is the ideal metric to measure model quality.

# Conclusion

A common trend can be seen throughout our analysis of all the models: the quality of the data in the dataset is not very good. The reason for this is that all of the data was collected from a smartphone present in the car during the drive. No specialized equipment was used to take these readings.

As smartphones are how alerts will be sent to drivers if they have an incident of road rage, this seems like an unavoidable compromise. That said, modern cars are being equipped with more accurate and reliable hardware, which will provide a greater variety of data and should improve the output we generate through each model.

In the end, the Random Forest Classifier was the model that we felt was the best fit for this data, as it has the highest accuracy score of all the models at 45.02%. The average accuracy score across all the models was close to 44%.
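The comparison behind this conclusion can be reproduced from the six cross-validated scores printed earlier. A short sketch in plain Python (the dictionary and its keys are illustrative names, the values are copied from the notebook output):

```python
from statistics import mean

# Best cross-validated accuracy of each model, copied from the summary cell.
scores = {
    "KNN": 0.44022170027368457,
    "Decision Tree": 0.43172755188837825,
    "Random Forest": 0.4502015969820244,
    "AdaBoost": 0.44043491700341375,
    "Gradient Boost": 0.4423489037952578,
    "XGBoost": 0.44447204604578855,
}

best_model = max(scores, key=scores.get)
average = mean(list(scores.values()))
spread = max(scores.values()) - min(scores.values())

print(best_model, round(scores[best_model], 4))
print(round(average, 3), round(spread, 3))
```

The gap between the best and worst model is under two percentage points, which is consistent with the view that the data, rather than the choice of algorithm, limits performance.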