<h1>Introduction/Business Problem</h1> 

The idea of this project is to analyze the severity of vehicle accidents in the United States of America. We are trying to build a model that predicts the severity of vehicle accidents throughout the States. Road accidents kill a large number of people every year, and a model like this could help address the problem.

<b>Problem Statement</b>: How severe is an accident that occurs in the USA likely to be?

<h1> Data </h1>

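This is a car accident dataset. It contains the driving conditions, the number of people and vehicles involved in each crash, and the severity of the crash. The sample rows shown below have coordinates near longitude -122.3 and latitude 47.6, so this particular file appears to cover the Seattle area rather than 49 states of the USA.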

The data can be found at: https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv

Some entries were missing data that the analysis requires, and some columns contained the value "Unknown". Imputing a mean would not have been accurate for this kind of categorical crash data. In the cleaning code below, missing values are filled with a default category rather than dropped, so the row count stays at 194673.

<h1>Code</h1>

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression

In [6]:
df=pd.read_csv(r"https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv")
df.head(3)


Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N


<h1>Data Cleaning</h1>

In [7]:
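# Keep only the columns used in this analysis; .copy() avoids modifying a view of the original frame
features = ['SPEEDING','SEVERITYCODE','ROADCOND']
df = df[features].copy()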
df.shape

(194673, 3)

We assume that a missing SPEEDING value means the driver was not speeding, and we label missing road conditions as 'Unknown'.

In [8]:
df['SPEEDING'] = df['SPEEDING'].fillna('N')
df['ROADCOND'] = df['ROADCOND'].fillna('Unknown')

Let's assume that only 'Dry', 'Unknown', and 'Other' are safe road conditions.

In [9]:
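# Map each road condition to 'Good' or 'Bad'; assign the result back instead of calling replace(inplace=True) on a column
df['ROADCOND'] = df['ROADCOND'].replace(
    to_replace=['Wet','Dry','Unknown','Snow/Slush','Ice','Other','Sand/Mud/Dirt','Standing Water','Oil'],
    value=['Bad','Good','Good','Bad','Bad','Good','Bad','Bad','Bad'])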

In [10]:
# Encode both features as 0/1; assign the result back instead of using inplace=True
df['SPEEDING'] = df['SPEEDING'].replace(to_replace=['N', 'Y'], value=[0, 1])
df['ROADCOND'] = df['ROADCOND'].replace(to_replace=['Good', 'Bad'], value=[0, 1])
test_condition = df[['SPEEDING','ROADCOND']]
test_condition.head()

Unnamed: 0,SPEEDING,ROADCOND
0,0,1
1,0,1
2,0,0
3,0,0
4,0,1
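
The cleaning rules above (a missing SPEEDING value counts as not speeding, a missing ROADCOND value counts as 'Unknown', and only 'Dry', 'Unknown' and 'Other' count as good conditions) can also be written as two small lookup functions. This is a minimal sketch in plain Python, not the notebook's pandas code. The names GOOD_ROAD_CONDITIONS, encode_speeding and encode_roadcond are made up for illustration, and missing values are represented as None here.

```python
# Road conditions the notebook treats as "Good" (encoded as 0).
GOOD_ROAD_CONDITIONS = {"Dry", "Unknown", "Other"}


def encode_speeding(value):
    """Return 1 for 'Y', and 0 for 'N' or a missing value (None)."""
    if value is None:
        value = "N"  # notebook assumption: missing means not speeding
    return 1 if value == "Y" else 0


def encode_roadcond(value):
    """Return 0 for good road conditions and 1 for everything else."""
    if value is None:
        value = "Unknown"  # missing road conditions are labelled 'Unknown'
    return 0 if value in GOOD_ROAD_CONDITIONS else 1


# Hypothetical (ROADCOND, SPEEDING) pairs, not rows taken from the dataset.
rows = [("Wet", None), ("Dry", "Y"), (None, None), ("Ice", "N")]
encoded = [(encode_speeding(s), encode_roadcond(r)) for r, s in rows]
print(encoded)
```

One difference from the notebook cells: this sketch treats any road condition outside the "good" set as bad, while replace() leaves unlisted values unchanged.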


<h1>Data Analysis</h1>

<h5>Speed Test: the share of severity-2 accidents is higher when SPEEDING is 1 (0.378 vs 0.295), which suggests that severity increases with speeding</h5>

In [11]:
speed_analysis = df.groupby(['SPEEDING'])['SEVERITYCODE'].value_counts(normalize=True)
speed_analysis

SPEEDING  SEVERITYCODE
0         1               0.705099
          2               0.294901
1         1               0.621665
          2               0.378335
Name: SEVERITYCODE, dtype: float64

<h5>Road Conditions Test: the share of severity-2 accidents is higher when ROADCOND is 1 (0.326 vs 0.290), which suggests that severity increases with worse road conditions</h5>

In [12]:
road_analysis = df.groupby(['ROADCOND'])['SEVERITYCODE'].value_counts(normalize=True)
road_analysis

ROADCOND  SEVERITYCODE
0         1               0.710389
          2               0.289611
1         1               0.674176
          2               0.325824
Name: SEVERITYCODE, dtype: float64

<h3> Road Conditions and Speeding are both associated with higher accident severity, with Speeding showing the larger difference </h3>
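
To put numbers on the two comparisons, the severity-2 shares from the groupby outputs above can be compared directly. The sketch below only reuses the printed values; the variable names are illustrative.

```python
# Share of severity-2 accidents, copied from the two groupby outputs above.
severe_share = {
    "SPEEDING": {0: 0.294901, 1: 0.378335},
    "ROADCOND": {0: 0.289611, 1: 0.325824},
}

comparison = {}
for feature, share in severe_share.items():
    difference = share[1] - share[0]  # absolute gap in the severity-2 share
    ratio = share[1] / share[0]       # relative increase in the severity-2 share
    comparison[feature] = (difference, ratio)
    print(f"{feature}: +{difference:.3f} absolute, x{ratio:.2f} relative")
```

The gap is roughly 8 percentage points for speeding and under 4 percentage points for road conditions.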

In [13]:
x = test_condition
y = df['SEVERITYCODE'].values.astype(str)
x = preprocessing.StandardScaler().fit(x).transform(x)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1234)

print("Training set: ", x_train.shape, y_train.shape)
print("Testing set: ", x_test.shape, y_test.shape)

Training set:  (155738, 2) (155738,)
Testing set:  (38935, 2) (38935,)
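
The cell above standardizes the features before splitting, so the scaling parameters are computed from all rows, including the test rows. A common alternative is to learn the mean and standard deviation from the training part only and reuse them for the test part. The sketch below shows that idea in plain Python on made-up 0/1 values; it is not what the notebook does, and the names are illustrative.

```python
import statistics

# Hypothetical 0/1 feature values, already split into training and test parts.
train_values = [0, 0, 0, 1]
test_values = [1, 0]

# Learn the scaling parameters from the training part only.
mean = statistics.mean(train_values)
std = statistics.pstdev(train_values)  # population std (divides by n), like StandardScaler


def standardize(values):
    """Scale values with the mean and std learned from the training part."""
    return [(v - mean) / std for v in values]


train_scaled = standardize(train_values)
test_scaled = standardize(test_values)
print(train_scaled, test_scaled)
```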



<h2>KNN Test I</h2> 

In [14]:
KNN_model = KNeighborsClassifier(n_neighbors = 4).fit(x_train, y_train)
predicted = KNN_model.predict(x_test)
KNN_f1 = f1_score(y_test, predicted, average='weighted')
KNN_acc = accuracy_score(y_test, predicted)
KNN_acc

0.696750995248491

<h2>Decision Tree II</h2>

In [19]:
Tree_model = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
Tree_model.fit(x_train, y_train)
predicted = Tree_model.predict(x_test)
Tree_f1 = f1_score(y_test, predicted, average='weighted')
Tree_acc = accuracy_score(y_test, predicted)
Tree_acc


0.6996789520996533

<h2>Logistic Regression III</h2>

In [17]:
LR_model = LogisticRegression(C=0.01, solver='liblinear').fit(x_train, y_train)
predicted = LR_model.predict(x_test)

LR_f1 = f1_score(y_test, predicted, average='weighted')
LR_acc = accuracy_score(y_test, predicted)
LR_acc


0.6996789520996533

<h1> Results </h1>

In [21]:
table = {
    "Algorithm": ["KNN", "Decision Tree", "LogisticRegression"],
    "F1-score": [KNN_f1, Tree_f1, LR_f1],
    "Accuracy": [KNN_acc, Tree_acc, LR_acc]
}

table = pd.DataFrame(table)
table

Unnamed: 0,Algorithm,F1-score,Accuracy
0,KNN,0.591378,0.696751
1,Decision Tree,0.576051,0.699679
2,LogisticRegression,0.576051,0.699679
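
The Decision Tree and Logistic Regression rows are identical, which is worth checking against a simple baseline. One possible explanation, stated here as an assumption and not verified against the models, is that both predict severity 1 for every test row. The sketch below works out what the weighted F1-score would be in that scenario, using only the accuracy from the table.

```python
# Accuracy of the Decision Tree and Logistic Regression models, from the table above.
accuracy = 0.6996789520996533

# Hypothetical scenario: the model predicts severity 1 for every test row.
# In that case accuracy equals the share of severity-1 rows in the test set.
share_class_1 = accuracy

precision_1 = share_class_1  # every prediction is "1"; this fraction is correct
recall_1 = 1.0               # every true "1" is predicted as "1"
f1_class_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
f1_class_2 = 0.0             # class "2" is never predicted

# Weighted F1: per-class F1 weighted by each class's share of the test set.
weighted_f1 = share_class_1 * f1_class_1 + (1 - share_class_1) * f1_class_2
print(round(weighted_f1, 6))
```

Under this assumption the weighted F1-score comes out at about 0.576, the same value the table reports for both models.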


Using the LR model, we find the intercept and the coefficients.

In [22]:
table = {
    "Intercept": LR_model.intercept_,
    "Coef:SPEEDING ": LR_model.coef_[:,0],
    "Coef:ROADCOND ": LR_model.coef_[:,1],
}

table = pd.DataFrame(table)
table

Unnamed: 0,Intercept,Coef:SPEEDING,Coef:ROADCOND
0,-0.853729,0.067702,0.068295


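Both coefficients are positive, which indicates that speeding and bad road conditions are each associated with a higher predicted probability of a severity-2 accident. The coefficients are small (about 0.07 each, on standardized features), so the estimated effect is modest.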
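
To read the intercept and coefficients more concretely, they can be passed through the logistic function. The sketch below assumes that the coefficients apply to the standardized features produced earlier and that the predicted class is severity 2 (the second class in sorted order, which is how scikit-learn reports a binary model). The variable names are illustrative.

```python
import math

# Values copied from the coefficient table above.
intercept = -0.853729
coef_speeding = 0.067702
coef_roadcond = 0.068295


def sigmoid(z):
    """Logistic function: maps a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# With standardized inputs, 0 means "at the dataset mean" for each feature,
# so the intercept alone gives the predicted probability at the mean.
p_at_mean = sigmoid(intercept)

# Odds ratio for a one-standard-deviation increase in each feature.
odds_ratio_speeding = math.exp(coef_speeding)
odds_ratio_roadcond = math.exp(coef_roadcond)

print(f"P(severity 2) at the mean: {p_at_mean:.3f}")
print(f"Odds ratios: {odds_ratio_speeding:.3f}, {odds_ratio_roadcond:.3f}")
```

The probability at the mean is close to the overall share of severity-2 accidents, and each odds ratio is only a few percent above 1.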

<h1>Conclusion</h1>

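The analysis suggests that Road Conditions and Speeding are both associated with more severe car accidents. However, the models reach an accuracy of about 70%, which is close to the share of severity-1 accidents in the data, so these two features alone are not enough to predict severity reliably.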