# Jupyter notebook for the capstone project

## Problem description

Road traffic incidents are extremely common. They cause great harm to families across the US, with losses estimated at $810bn every year. Identifying the factors that lead to more frequent and more severe accidents is therefore important, and there is a strong incentive to do so.

One way to address this problem is with data science tools: machine learning models can forecast where road accidents concentrate, so that people can prepare. The most important factors are likely to include the weather, the day of the week, and the influence of substances, among a few others. The key question is how these variables interact with each other.

Answering it could lead to a far better understanding of traffic accidents and to much more accurate predictions. In this project, we use a dataset describing the severity and conditions of different accidents to build and test a model that predicts the severity of car accidents.

In [1]:
import pandas as pd
import numpy as np

print("Hello Capstone Project Course!")

Hello Capstone Project Course!


In [2]:
data = pd.read_csv('Data-Collisions.csv')
data.head(5)



Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


In [3]:
data.describe()

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,INTKEY,SEVERITYCODE.1,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,SDOT_COLCODE,SDOTCOLNUM,SEGLANEKEY,CROSSWALKKEY
count,194673.0,189339.0,189339.0,194673.0,194673.0,194673.0,65070.0,194673.0,194673.0,194673.0,194673.0,194673.0,194673.0,114936.0,194673.0,194673.0
mean,1.298901,-122.330518,47.619543,108479.36493,141091.45635,141298.811381,37558.450576,1.298901,2.444427,0.037139,0.028391,1.92078,13.867768,7972521.0,269.401114,9782.452
std,0.457778,0.029976,0.056157,62649.722558,86634.402737,86986.54211,51745.990273,0.457778,1.345929,0.19815,0.167413,0.631047,6.868755,2553533.0,3315.776055,72269.26
min,1.0,-122.419091,47.495573,1.0,1001.0,1001.0,23807.0,1.0,0.0,0.0,0.0,0.0,0.0,1007024.0,0.0,0.0
25%,1.0,-122.348673,47.575956,54267.0,70383.0,70383.0,28667.0,1.0,2.0,0.0,0.0,2.0,11.0,6040015.0,0.0,0.0
50%,1.0,-122.330224,47.615369,106912.0,123363.0,123363.0,29973.0,1.0,2.0,0.0,0.0,2.0,13.0,8023022.0,0.0,0.0
75%,2.0,-122.311937,47.663664,162272.0,203319.0,203459.0,33973.0,2.0,3.0,0.0,0.0,2.0,14.0,10155010.0,0.0,0.0
max,2.0,-122.238949,47.734142,219547.0,331454.0,332954.0,757580.0,2.0,81.0,6.0,2.0,12.0,69.0,13072020.0,525241.0,5239700.0


In [4]:
data_usefull = data[['SEVERITYCODE','COLLISIONTYPE','WEATHER','ROADCOND','LIGHTCOND','UNDERINFL']]
data_usefull.head()

Unnamed: 0,SEVERITYCODE,COLLISIONTYPE,WEATHER,ROADCOND,LIGHTCOND,UNDERINFL
0,2,Angles,Overcast,Wet,Daylight,N
1,1,Sideswipe,Raining,Wet,Dark - Street Lights On,0
2,1,Parked Car,Overcast,Dry,Daylight,0
3,1,Other,Clear,Dry,Daylight,N
4,2,Angles,Raining,Wet,Daylight,0


## Data description

The data covers collisions in Seattle and is provided by the SDOT (Seattle Department of Transportation). It contains around 200,000 samples with 37 different attributes of each crash. The goal of this project is to train a machine learning model that predicts the severity of a crash, which takes the value 1 (property damage) or 2 (injury), so this is a binary classification problem. Many values are missing: `INTKEY` alone is empty in roughly two thirds of the rows (65,070 non-null values out of 194,673). However, many attributes probably do not affect the outcome of the model, so before dealing with the missing values it is better to remove the columns that are not relevant. The variables included in the model are:
- Whether the driver was under the influence of alcohol or drugs
- Collision type
- Weather conditions
- Light conditions
- Road conditions

After removing all the other variables, only about 3% of the samples have missing values, so those samples can be dropped without a problem. The data then needs further processing before the model can be trained and tested, such as balancing the dataset and converting the categorical values into numerical ones.
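
The balancing step mentioned above is not carried out in the cells that follow. As a minimal sketch of one common approach, assuming random downsampling of the majority class is acceptable, the toy frame below (made-up values, not the collision data) is reduced to the same number of rows per severity class. The names `toy`, `n_min` and `balanced` are illustrative only.

```python
import pandas as pd

# Toy stand-in for the collision table (values are made up for illustration).
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "WEATHER": ["Clear", "Raining", "Clear", "Overcast", "Clear",
                "Raining", "Clear", "Overcast", "Raining"],
})

# Size of the smallest class; every class is sampled down to this size.
n_min = toy["SEVERITYCODE"].value_counts().min()

# Draw n_min rows from each severity class, without replacement.
balanced = toy.groupby("SEVERITYCODE").sample(n=n_min, random_state=0)

print(balanced["SEVERITYCODE"].value_counts())
```

Downsampling discards data from the larger class; whether that trade-off is worthwhile here would have to be checked against the unbalanced results.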

## Removing NaN

In [5]:
data_usefull.shape

(194673, 6)

In [6]:
data_cut = data_usefull.dropna()
data_cut.shape

(189316, 6)
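
The two shapes above imply that 5,357 rows (about 2.8%) were dropped. To see where missing values come from before dropping anything, a check along the following lines may help. It is only a sketch on a made-up frame; `toy`, `kept`, `share_all`, `share_kept` and `per_column` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic a few columns of the collision data.
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 1],
    "WEATHER": ["Clear", np.nan, "Raining", "Overcast"],
    "ROADCOND": ["Dry", "Wet", np.nan, "Dry"],
    "INTKEY": [np.nan, 37475.0, np.nan, np.nan],
})

# Fraction of rows with at least one missing value, over all columns.
share_all = toy.isna().any(axis=1).mean()

# Same measure restricted to the columns kept for modelling.
kept = ["SEVERITYCODE", "WEATHER", "ROADCOND"]
share_kept = toy[kept].isna().any(axis=1).mean()

# Missing values per column, to see which columns drive the gaps.
per_column = toy.isna().sum()

print(share_all, share_kept)
print(per_column)
```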

## Transforming categorical to numerical

In [7]:
# Work on an explicit copy so the assignments below modify data_cut itself
# rather than a slice of data_usefull (avoids SettingWithCopyWarning).
data_cut = data_cut.copy()

# Map the under-the-influence flag to 0/1. The '1' entry is an assumption:
# it covers the case where positives are also stored as the string '1'.
data_cut['UNDERINFL'] = data_cut['UNDERINFL'].replace(
    to_replace=['N', 'Y', '0', '1'], value=[0, 1, 0, 1])

from sklearn.preprocessing import LabelEncoder

features = data_cut[['COLLISIONTYPE', 'WEATHER', 'ROADCOND', 'LIGHTCOND', 'UNDERINFL']].copy()

# Encode each categorical column as integer codes (one code per distinct label).
for feature in ['COLLISIONTYPE', 'WEATHER', 'ROADCOND', 'LIGHTCOND']:
    features[feature] = LabelEncoder().fit_transform(features[feature].astype(str))

features.head()


Unnamed: 0,COLLISIONTYPE,WEATHER,ROADCOND,LIGHTCOND,UNDERINFL
0,0,4,8,5,0
1,9,6,8,2,0
2,5,4,0,5,0
3,4,1,0,5,0
4,0,6,8,5,0
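
`LabelEncoder` assigns integer codes in sorted label order, so in the table above a wet road becomes 8 and a dry road 0. Distance-based and linear models read those numbers as magnitudes, although the categories have no such order. One-hot encoding is a common alternative. The sketch below uses `pd.get_dummies` on a made-up frame; it is not what the notebook does, and the name `encoded` is illustrative.

```python
import pandas as pd

# Made-up sample with two categorical columns and one numeric flag.
toy = pd.DataFrame({
    "WEATHER": ["Clear", "Raining", "Overcast", "Clear"],
    "ROADCOND": ["Dry", "Wet", "Wet", "Dry"],
    "UNDERINFL": [0, 0, 1, 0],
})

# One indicator column per category; UNDERINFL passes through unchanged.
encoded = pd.get_dummies(toy, columns=["WEATHER", "ROADCOND"], dtype=int)

print(encoded.columns.tolist())
```

Tree-based models are usually less sensitive to this choice, since they split on thresholds rather than distances.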


In [8]:
X = features
y = data_cut['SEVERITYCODE'].values

## Splitting and normalization of the data

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit the scaler on the training split only, then apply the same
# transformation to the test split, so both share one scale and no
# test-set statistics leak into training.
scaler = StandardScaler().fit(X_train.astype(float))
X_train = scaler.transform(X_train.astype(float))
X_test = scaler.transform(X_test.astype(float))
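
Scaling statistics should come from the training split only and then be reused on the test split. One way to make that hard to get wrong is to bundle the scaler and the model in a scikit-learn `Pipeline`. The sketch below runs on synthetic data (`X_demo`, `y_demo` are made up) and also passes `stratify` so both splits keep the same class proportions; neither choice is taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic integer-coded features; the label depends on the first column only.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 10, size=(200, 5)).astype(float)
y_demo = np.where(X_demo[:, 0] > 6, 2, 1)

# stratify keeps the share of each class equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo)

# fit() scales with training statistics; score() reuses them on the test split.
pipe = make_pipeline(StandardScaler(),
                     DecisionTreeClassifier(max_depth=4, random_state=0))
pipe.fit(X_tr, y_tr)
acc = pipe.score(X_te, y_te)

print(round(acc, 3))
```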

## Training the different model types

In [10]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# KNN
model_KNN = KNeighborsClassifier(n_neighbors=2).fit(X_train, y_train)

# Logistic regression
model_LR = LogisticRegression(C=0.0001, solver='liblinear')
model_LR.fit(X_train, y_train)

# Decision tree
model_DT = DecisionTreeClassifier(criterion="entropy", max_depth=4)
model_DT.fit(X_train, y_train)

DecisionTreeClassifier(criterion='entropy', max_depth=4)
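
The hyperparameters above (`n_neighbors=2`, `C=0.0001`, `max_depth=4`) are fixed by hand. A cross-validated search is one way to check such choices. The following is a sketch on synthetic data generated with `make_classification`; the grid values and the `f1_weighted` scoring are assumptions, not settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class problem with a 70/30 class split (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.7, 0.3], random_state=0)

# Try several neighbourhood sizes with 5-fold cross-validation.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [2, 5, 11, 21]},
    scoring="f1_weighted",
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

The same pattern applies to `C` for logistic regression and `max_depth` for the decision tree by swapping the estimator and the grid.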

## Testing the models

In [18]:
from sklearn import metrics
import numpy as np
#from sklearn.metrics import jaccard_similarity_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score

# KNN
yhat = model_KNN.predict(X_test)
yhat_KNN = yhat
#jaccard = jaccard_similarity_score(y_test, yhat)
f1_score_KNN = f1_score(y_test, yhat, average='weighted')
precision_KNN = precision_score(y_test, yhat, average='weighted')
#KNN_report = ['KNN', round(jaccard,2), round(f1_score_KNN,2), round(precision_KNN,2)]
KNN_report = ['KNN', round(f1_score_KNN,2), round(precision_KNN,2)]

# Decision tree
yhat = model_DT.predict(X_test)
yhat_DT = yhat
#jaccard = jaccard_similarity_score(y_test, yhat)
f1_score_DT = f1_score(y_test, yhat, average='weighted')
precision_DT = precision_score(y_test, yhat, average='weighted')
#DT_report = ['Decision Tree', round(jaccard,2), round(f1_score_DT,2), round(precision_DT,2)]
DT_report = ['Decision Tree', round(f1_score_DT,2), round(precision_DT,2)]

# Logistic regression
yhat_proba = model_LR.predict_proba(X_test)
yhat = model_LR.predict(X_test)
yhat_LR = yhat
#jaccard = jaccard_similarity_score(y_test, yhat)
f1_score_LR = f1_score(y_test, yhat, average='weighted')
precision_LR = precision_score(y_test, yhat, average='weighted')
#LR_report = ['Logistic Regression', round(jaccard,2), round(f1_score_LR,2), round(precision_LR,2)]
LR_report = ['Logistic Regression', round(f1_score_LR,2), round(precision_LR,2)]


In [19]:
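# Build the table from a plain list of rows so the scores stay numeric
# (np.array would cast every cell to a string).
report = pd.DataFrame(data=[KNN_report, DT_report, LR_report],
                      columns=['Algorithm', 'F1-score', 'Precision'])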
report

Unnamed: 0,Algorithm,F1-score,Precision
0,KNN,0.67,0.71
1,Decision Tree,0.69,0.78
2,Logistic Regression,0.58,0.68
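
Since `SEVERITYCODE` only takes the values 1 and 2 and its mean is about 1.30, roughly 30% of the samples are injury crashes. With such an imbalance, weighted averages are dominated by the majority class, so it can help to look at a confusion matrix, the recall on class 2, and a majority-class baseline. The sketch below uses made-up label vectors (`y_true`, `y_pred`), not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, recall_score

# Made-up labels: seven property-damage crashes (1) and three injury crashes (2).
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 2, 2, 2])
y_pred = np.array([1, 1, 1, 1, 1, 1, 2, 1, 1, 2])

# Rows are true classes, columns are predicted classes, both ordered [1, 2].
cm = confusion_matrix(y_true, y_pred, labels=[1, 2])

# Share of real injury crashes that the predictions catch.
recall_injury = recall_score(y_true, y_pred, pos_label=2)

# Baseline that always predicts the majority class.
y_base = np.full_like(y_true, 1)
f1_base = f1_score(y_true, y_base, average="weighted", zero_division=0)

print(cm)
print(round(recall_injury, 2), round(f1_base, 2))
```

A model is only useful to the extent that it beats such a baseline, in particular on the minority class.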


## Conclusion 

After applying the content learned in the course specialization, we were able to carry out the data preparation, model training, model testing, and evaluation of the results.

The report shows that, for the data used for training and testing, the best option for predicting the severity of crashes is the decision tree, which has the highest F1-score (0.69) and precision (0.78) of the three models.