# IBM Data Analysis Capstone

### This notebook will be used to analyse traffic accident statistics based on previous data

## 1.The Problem

The idea is to create a classification model capable of predicting whether there is a significant probability of a collision happening under a few predetermined conditions.

The model can be used by traffic apps that might want to redirect clients to roads or junctions with a lower probability of an accident.

## 2.Data Description

The dataset was provided by Coursera. It contains all collisions provided by SPD and recorded by Traffic Records, covering all types of collision from 2004 to the present.

It provides many columns with details on each type of collision. We excluded most of them and kept the columns that represent information that would be available to a traffic app, for example in real time. The selected columns were:

SEVERITYCODE - This is our target; it represents the severity of each collision

ADDRTYPE - Whether the location is a block or a junction

LOCATION - The collision address

JUNCTIONTYPE - The junction type

WEATHER - Weather conditions

ROADCOND - Road conditions

LIGHTCOND - Light conditions

INCDTTM - Timestamp of the collision

INCDTTM won't be used directly in the model. Three features were created instead:

TIME - What time of the day it happened ('Late Night', 'Early Morning','Morning','Noon','Eve','Night')

WEEKEND - Whether it happened on a weekend

HOLIDAY - Whether it happened on a holiday

This is more meaningful than using the timestamp itself, since we are binning the time of day and checking whether the accident occurred on days when the roads are usually busier.
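As a small illustration of the binning described above, the sketch below applies the same bin edges and labels to a few timestamps. The timestamps are made up for the example and are not taken from the dataset. The HOLIDAY flag is left out because it depends on the external `holidays` package.

```python
import pandas as pd

# Hypothetical timestamps, chosen only to illustrate the binning
ts = pd.Series(pd.to_datetime([
    "2020-01-04 02:30",  # Saturday
    "2020-01-06 09:15",  # Monday
    "2020-01-05 18:45",  # Sunday
]))

bins = [0, 4, 8, 12, 16, 20, 24]
labels = ['Late Night', 'Early Morning', 'Morning', 'Noon', 'Eve', 'Night']

features = pd.DataFrame({
    # Bins are closed on the right, so hour 4 still counts as 'Late Night'
    'TIME': pd.cut(ts.dt.hour, bins=bins, labels=labels, include_lowest=True),
    # weekday is 0 for Monday, so 5 and 6 are Saturday and Sunday
    'WEEKEND': ts.dt.weekday >= 5,
})
print(features)
```

Because the bins are closed on the right, the first bin covers hours 0 to 4 and the last one covers hours 21 to 23.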

## 3.Methodology




Once the data is processed and the new features are created, the DataFrame (DF) will be separated into X and Y, where:

X = The input feature columns used for training;

Y = The target column

Every X column will be converted into numerical values and normalized afterwards. The data will be divided into train and test groups with a test size of 0.3, which means 70% of the DF will be used for training and 30% for testing.
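The scaling and the 70/30 split can be sketched on a small synthetic matrix as shown below. `X_demo` and `y_demo` are stand-ins invented for this example, not the collision data, and the real implementation appears in section 7.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded feature matrix and target (not the real data)
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 5, size=(100, 3)).astype(float)
y_demo = rng.integers(1, 3, size=100)  # severity codes 1 or 2

# Scale every column to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_demo)

# test_size=0.3 keeps 70% of the rows for training and 30% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y_demo, test_size=0.3, random_state=4)

print(X_train.shape, X_test.shape)
```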

Since this is a classification task, three of the supervised methods covered in this course will be considered:

Decision Tree

Support Vector Machine

Logistic Regression

KNN was also covered in the course, but given the size of the DF it was not considered efficient enough.

After we have results from all three methods, the following metrics will be used to determine the best result:

f1_score

log_loss

jaccard_score
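To make the three metrics concrete, here is a minimal sketch on five toy labels. The labels and probabilities are invented for illustration and use the same two severity codes (1 and 2) as the dataset.

```python
import numpy as np
from sklearn.metrics import f1_score, jaccard_score, log_loss

# Toy labels and predictions (made-up values, not from the dataset)
y_true = np.array([1, 1, 1, 2, 2])
y_pred = np.array([1, 1, 1, 1, 2])
# Hypothetical predicted probabilities for classes 1 and 2, one row per sample
y_prob = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4], [0.2, 0.8]])

# For binary targets jaccard_score scores the positive class, label 1 by default:
# 3 true positives / (3 true positives + 1 false positive + 0 false negatives)
jac = jaccard_score(y_true, y_pred)
# Per-class F1 averaged with weights proportional to each class's support
f1 = f1_score(y_true, y_pred, average='weighted')
# log_loss needs probabilities rather than hard predictions
ll = log_loss(y_true, y_prob)

print(jac, f1, ll)
```

Note that `log_loss` requires class probabilities, so it can only be computed for models that expose them.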

## 4.Results

In [34]:
df_resTb

| | Algorithm | Jaccard | F1-score | LogLoss |
|---|---|---|---|---|
| 0 | Decision Tree | 0.692851 | 0.567141 | NaN |
| 1 | SVM | 0.692851 | 0.567141 | NaN |
| 2 | LogisticRegression | 0.688192 | 0.574677 | 0.592133 |


Based on the table above, Decision Tree and SVM (Support Vector Machine) share the best Jaccard score (0.692851), while Logistic Regression has a slightly higher F1-score (0.574677 against 0.567141).

## 5.Discussion

For the Decision Tree training, a loop was used to determine the best depth. The same method was used to find the best C value for the Logistic Regression training.
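A minimal sketch of such a depth search is shown below. It runs on synthetic data from `make_classification`, which is an assumption made for the example. It also scores each depth on a separate validation split, whereas the notebook in section 7 scores directly on the test set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the collision features (an assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=4)

best_depth, best_score = None, -1.0
for depth in range(1, 11):
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=depth, random_state=0)
    score = tree.fit(X_fit, y_fit).score(X_val, y_val)  # accuracy on held-out rows
    # Strict '>' keeps the shallowest depth when several depths tie
    if score > best_score:
        best_depth, best_score = depth, score

print('Depth:', best_depth, 'Accuracy:', best_score)
```

The same loop structure applies to the C value of Logistic Regression by iterating over candidate values of `C` instead of `max_depth`.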

## 6.Conclusion

If processing time is also a criterion for choosing a model, Decision Tree should be the one, since it was by far the fastest and had one of the best results.
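The notebook does not record training times, so the following is only a sketch of how they could be measured. The helper `fit_seconds` and the synthetic data are hypothetical, and the timings it prints say nothing about the real dataset.

```python
import time

from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the collision features (an assumption)
X_demo, y_demo = make_classification(n_samples=1000, n_features=6, random_state=0)

def fit_seconds(model, X, y):
    """Return the wall-clock time in seconds needed to fit `model` (hypothetical helper)."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start

# Same model settings as the notebook's final classifiers
timings = {
    'Decision Tree': fit_seconds(
        DecisionTreeClassifier(criterion="entropy", max_depth=1), X_demo, y_demo),
    'SVM': fit_seconds(svm.SVC(kernel='rbf'), X_demo, y_demo),
    'LogisticRegression': fit_seconds(
        LogisticRegression(C=0.02, solver='liblinear'), X_demo, y_demo),
}

for name, seconds in timings.items():
    print(f'{name}: {seconds:.4f} s')
```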

## 7.Implementation

The implementation can be found in detail below:

In [23]:
import pandas as pd
import numpy as np

from datetime import date 
import holidays

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn import metrics

from sklearn.metrics import confusion_matrix

from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
from sklearn.metrics import jaccard_score

### Importing the Data

In [4]:
df_data=pd.read_csv('Data-Collisions.csv')
df_data.head()



Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


### Removing unnecessary Fields

In [5]:
df_data.drop(['X', 'Y', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO', 'STATUS', 'INTKEY', 'EXCEPTRSNCODE', 
              'EXCEPTRSNDESC', 'SEVERITYDESC', 'COLLISIONTYPE', 'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 
              'INCDATE', 'SDOT_COLCODE', 'SDOT_COLDESC', 'INATTENTIONIND', 'UNDERINFL', 'PEDROWNOTGRNT', 'SDOTCOLNUM', 
              'SPEEDING', 'ST_COLCODE', 'ST_COLDESC', 'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR','SEVERITYCODE.1'], axis = 1, inplace=True)


### Removing Missing Data

In [6]:
missing_data = df_data.isnull()
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("")   
    
df_data.dropna(axis=0, inplace=True)

SEVERITYCODE
False    194673
Name: SEVERITYCODE, dtype: int64

ADDRTYPE
False    192747
True       1926
Name: ADDRTYPE, dtype: int64

LOCATION
False    191996
True       2677
Name: LOCATION, dtype: int64

INCDTTM
False    194673
Name: INCDTTM, dtype: int64

JUNCTIONTYPE
False    188344
True       6329
Name: JUNCTIONTYPE, dtype: int64

WEATHER
False    189592
True       5081
Name: WEATHER, dtype: int64

ROADCOND
False    189661
True       5012
Name: ROADCOND, dtype: int64

LIGHTCOND
False    189503
True       5170
Name: LIGHTCOND, dtype: int64



### Converting INCDTTM into useful features

In [7]:
df_data['INCDTTM'] = pd.to_datetime(df_data['INCDTTM'])

b = [0,4,8,12,16,20,24]
l = ['Late Night', 'Early Morning','Morning','Noon','Eve','Night']
df_data['TIME'] = pd.cut(df_data['INCDTTM'].dt.hour, bins=b, labels=l, include_lowest=True)
df_data['WEEKEND'] = df_data['INCDTTM'].dt.weekday>=5

us_holidays = holidays.US()  # build the holiday calendar once, not on every call

def findHoliday(ts):
    # True when the timestamp falls on a US holiday
    return ts in us_holidays

df_data['HOLIDAY'] = df_data['INCDTTM'].apply(findHoliday)

df_data.drop(['INCDTTM'], axis = 1, inplace=True)


df_data.head()

Unnamed: 0,SEVERITYCODE,ADDRTYPE,LOCATION,JUNCTIONTYPE,WEATHER,ROADCOND,LIGHTCOND,TIME,WEEKEND,HOLIDAY
0,2,Intersection,5TH AVE NE AND NE 103RD ST,At Intersection (intersection related),Overcast,Wet,Daylight,Noon,False,False
1,1,Block,AURORA BR BETWEEN RAYE ST AND BRIDGE WAY N,Mid-Block (not related to intersection),Raining,Wet,Dark - Street Lights On,Eve,False,False
2,1,Block,4TH AVE BETWEEN SENECA ST AND UNIVERSITY ST,Mid-Block (not related to intersection),Overcast,Dry,Daylight,Morning,False,False
3,1,Block,2ND AVE BETWEEN MARION ST AND MADISON ST,Mid-Block (not related to intersection),Clear,Dry,Daylight,Morning,False,False
4,2,Intersection,SWIFT AVE S AND SWIFT AV OFF RP,At Intersection (intersection related),Raining,Wet,Daylight,Early Morning,False,False


### Preparing data X and Y for training

In [8]:
X = df_data[['ADDRTYPE', 'LOCATION', 'JUNCTIONTYPE', 'WEATHER',
       'ROADCOND', 'LIGHTCOND', 'TIME', 'WEEKEND', 'HOLIDAY']].values

le_ADDRTYPE = preprocessing.LabelEncoder()
le_LOCATION = preprocessing.LabelEncoder()
le_JUNCTIONTYPE = preprocessing.LabelEncoder()
le_WEATHER = preprocessing.LabelEncoder()
le_ROADCOND = preprocessing.LabelEncoder()
le_LIGHTCOND = preprocessing.LabelEncoder()
le_TIME = preprocessing.LabelEncoder()
le_WEEKEND = preprocessing.LabelEncoder()
le_HOLIDAY = preprocessing.LabelEncoder()

le_ADDRTYPE.fit(list(df_data['ADDRTYPE'].unique()))
le_LOCATION.fit(list(df_data['LOCATION'].unique()))
le_JUNCTIONTYPE.fit(list(df_data['JUNCTIONTYPE'].unique()))
le_WEATHER.fit(list(df_data['WEATHER'].unique()))
le_ROADCOND.fit(list(df_data['ROADCOND'].unique()))
le_LIGHTCOND.fit(list(df_data['LIGHTCOND'].unique()))
le_TIME.fit(list(df_data['TIME'].unique()))
le_WEEKEND.fit(list(df_data['WEEKEND'].unique()))
le_HOLIDAY.fit(list(df_data['HOLIDAY'].unique()))

X[:,0] = le_ADDRTYPE.transform(X[:,0]) 
X[:,1] = le_LOCATION.transform(X[:,1]) 
X[:,2] = le_JUNCTIONTYPE.transform(X[:,2]) 
X[:,3] = le_WEATHER.transform(X[:,3]) 
X[:,4] = le_ROADCOND.transform(X[:,4]) 
X[:,5] = le_LIGHTCOND.transform(X[:,5]) 
X[:,6] = le_TIME.transform(X[:,6]) 
X[:,7] = le_WEEKEND.transform(X[:,7]) 
X[:,8] = le_HOLIDAY.transform(X[:,8]) 

X= preprocessing.StandardScaler().fit(X).transform(X)

X[0:5]

array([[ 1.37307378, -0.48635266, -1.24527244,  0.37926868,  1.50508616,
         0.39235517,  1.40238921, -0.58576387, -0.15425967],
       [-0.72829298, -0.20049292,  0.93462954,  1.11215024,  1.50508616,
        -1.4086009 , -0.9817997 , -0.58576387, -0.15425967],
       [-0.72829298, -0.59651989,  0.93462954,  0.37926868, -0.69413864,
         0.39235517,  0.21029476, -0.58576387, -0.15425967],
       [-0.72829298, -1.10271408,  0.93462954, -0.72005365, -0.69413864,
         0.39235517,  0.21029476, -0.58576387, -0.15425967],
       [ 1.37307378,  1.59184311, -1.24527244,  1.11215024,  1.50508616,
         0.39235517, -1.57784692, -0.58576387, -0.15425967]])

In [9]:
y = np.asarray(df_data["SEVERITYCODE"])
y[0:5]

array([2, 1, 1, 1, 2], dtype=int64)

In [10]:
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.3, random_state=4)

### Selecting the best classifier

Decision Tree

In [19]:
dTreeScore = 0
useDp = 1
for dp in range(1,50):
    dTree = DecisionTreeClassifier(criterion="entropy", max_depth = dp)
    dTree.fit(X_train,y_train)
    predTree = dTree.predict(X_test)
    if metrics.accuracy_score(y_test, predTree) > dTreeScore:
        dTreeScore = metrics.accuracy_score(y_test, predTree)
        useDp = dp

dTree = DecisionTreeClassifier(criterion="entropy", max_depth = useDp)
dTree.fit(X_train,y_train)
predTree = dTree.predict(X_test)
    
print('Depth: ', useDp, ' Accuracy=', metrics.accuracy_score(y_test, predTree))

Depth:  1  Accuracy= 0.6928508867965842


Support Vector Machine

In [20]:
svc = svm.SVC(kernel='rbf')
svc.fit(X_train, y_train)
svc_yhat = svc.predict(X_test)

print('Accuracy=', metrics.accuracy_score(y_test, svc_yhat))

Accuracy= 0.6928508867965842


Logistic Regression

In [27]:
regRange = np.arange(0.01,1.01,0.01)
regScore = np.zeros(len(regRange))
for i, reg in enumerate(regRange):
    LR = LogisticRegression(C=reg, solver='liblinear').fit(X_train,y_train)
    LRyhat = LR.predict(X_test)
    regScore[i] = metrics.accuracy_score(y_test, LRyhat)
print('Best Accuracy was C=', regRange[regScore.argmax()])

LR = LogisticRegression(C=regRange[regScore.argmax()], solver='liblinear').fit(X_train,y_train)
LRyhat = LR.predict(X_test)
print('Accuracy=', metrics.accuracy_score(y_test, LRyhat))

Best Accuracy was C= 0.02
Accuracy= 0.6897854171228377


### Testing the results

In [33]:
lrTestProb = LR.predict_proba(X_test)
# DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first
df_resTb = pd.DataFrame([
    {'Algorithm':'Decision Tree','Jaccard':jaccard_score(y_test, predTree),'F1-score':f1_score(y_test, predTree, average='weighted'),'LogLoss':np.nan},
    {'Algorithm':'SVM','Jaccard':jaccard_score(y_test, svc_yhat),'F1-score':f1_score(y_test, svc_yhat, average='weighted'),'LogLoss':np.nan},
    {'Algorithm':'LogisticRegression','Jaccard':jaccard_score(y_test, LRyhat),'F1-score':f1_score(y_test, LRyhat, average='weighted'),'LogLoss':log_loss(y_test, lrTestProb)},
], columns=['Algorithm','Jaccard','F1-score','LogLoss'])
df_resTb

| | Algorithm | Jaccard | F1-score | LogLoss |
|---|---|---|---|---|
| 0 | Decision Tree | 0.692851 | 0.567141 | NaN |
| 1 | SVM | 0.692851 | 0.567141 | NaN |
| 2 | LogisticRegression | 0.688192 | 0.574677 | 0.592133 |
