In [147]:
## Vehicle crashes occur all over the world, with varying severity, traffic rules, vehicle characteristics,
## driving behavior, weather etc., each of which plays a significant role in the outcome of the crash.
## While most data science applications for crash analysis have focused on advanced visualization, few studies
## develop machine learning algorithms that can help predict future crashes and their severity from several parameters.
## The few existing models are developed for much larger areas, which may lead to overfitting or underfitting for a specific zone.
## Hence, 'one model fits all' is an outdated strategy that will not provide actionable insights into zone-specific crashes, which may be driven by different parameters.

## Through my project at the Data Incubator, I wish to develop an automated system that builds machine learning models
## to predict future fatal or non-fatal crashes for any chosen area. These models would focus on very specific zones/areas/road corridors; I call them micro-crash models.
## The intention is to develop custom models built solely on the data from that zone/area
## rather than from a much larger set. Such area-specific models, delivered through an automated system, would benefit many state or federal
## Departments of Transportation, which would gain a tool that delivers key actionable insights
## and helps direct their crash prevention measures more effectively.
## This can help direct their budgets, initiatives, and efforts to the locations that need the most attention.
## Such models could lead to measures like imposing traffic regulations during peak crash hours or in high-crash zones, restricting certain categories of vehicles etc.

## Data availability:
## Many transportation agencies, city DOTs, and federal DOTs provide traffic- and crash-related open data that can be easily acquired
## to develop this bigger idea. I wish to develop a full-fledged application that responds in real time and updates the
## models with a live feed of additional data through pipelines whenever it is available. To my knowledge, there is currently no such automated model development application in transportation.

# Asset 1:

## 1. To illustrate this idea, I have worked on the following project, which builds a simple machine learning model to classify crashes with certain parameters as fatal or non-fatal.
## 2. This model can be considered a small subset of the larger idea, where micro-models are developed for micro-zones/areas.
## 3. Using crash data for NYC provided by the City of New York at (https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95),
## this project aims to identify the underlying relationship between parameters such as time of day, month, factors contributing to the crash,
## vehicle type etc., and to build a classifier that helps identify whether a future crash with given parameters is likely to be fatal or non-fatal.



In [None]:
# We will import the needed libraries 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [148]:
# Read the data into a pandas data frame
alldf = pd.read_csv("E:\\Data Science\\The_Data_Incubator\\Motor_Vehicle_Collisions_-_Crashes.csv")
print(alldf.shape)

(277620, 34)


In [149]:
# Let's take a look at the columns
alldf.columns

Index(['CRASH DATE', 'CRASH TIME', 'BOROUGH', 'ZIP CODE', 'LATITUDE',
       'LONGITUDE', 'LOCATION', 'ON STREET NAME', 'CROSS STREET NAME',
       'OFF STREET NAME', 'NUMBER OF PERSONS INJURED',
       'NUMBER OF PERSONS KILLED', 'NUMBER OF PEDESTRIANS INJURED',
       'NUMBER OF PEDESTRIANS KILLED', 'NUMBER OF CYCLIST INJURED',
       'NUMBER OF CYCLIST KILLED', 'NUMBER OF MOTORIST INJURED',
       'NUMBER OF MOTORIST KILLED', 'CONTRIBUTING FACTOR VEHICLE 1',
       'CONTRIBUTING FACTOR VEHICLE 2', 'CONTRIBUTING FACTOR VEHICLE 3',
       'CONTRIBUTING FACTOR VEHICLE 4', 'CONTRIBUTING FACTOR VEHICLE 5',
       'COLLISION_ID', 'VEHICLE TYPE CODE 1', 'VEHICLE TYPE CODE 2',
       'VEHICLE TYPE CODE 3', 'VEHICLE TYPE CODE 4', 'VEHICLE TYPE CODE 5'],
      dtype='object')

In [150]:
# describe() provides a quick summary of key statistics for each numeric variable in the dataset
alldf.describe()

Unnamed: 0,LATITUDE,LONGITUDE,NUMBER OF PERSONS INJURED,NUMBER OF PERSONS KILLED,NUMBER OF PEDESTRIANS INJURED,NUMBER OF PEDESTRIANS KILLED,NUMBER OF CYCLIST INJURED,NUMBER OF CYCLIST KILLED,NUMBER OF MOTORIST INJURED,NUMBER OF MOTORIST KILLED,COLLISION_ID
count,1484782.0,1484782.0,1687725.0,1687711.0,1687742.0,1687742.0,1687742.0,1687742.0,1687742.0,1687742.0,1687742.0
mean,40.69174,-73.87087,0.2653833,0.001193332,0.05096454,0.0006422783,0.02129295,8.650611e-05,0.1929537,0.000463341,2847570.0
std,1.149829,2.349715,0.6614285,0.03650929,0.2323322,0.02589028,0.1456992,0.009363957,0.6231047,0.02346951,1503889.0
min,0.0,-201.36,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,22.0
25%,40.6688,-73.97664,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2817793.0
50%,40.72246,-73.92912,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3480008.0
75%,40.76835,-73.86671,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3902179.0
max,43.34444,0.0,43.0,8.0,27.0,6.0,4.0,2.0,43.0,5.0,4324549.0


In [151]:
# Let's check whether the provided data contains null values
alldf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1687742 entries, 0 to 1687741
Data columns (total 29 columns):
CRASH DATE                       1687742 non-null object
CRASH TIME                       1687742 non-null object
BOROUGH                          1172462 non-null object
ZIP CODE                         1172256 non-null object
LATITUDE                         1484782 non-null float64
LONGITUDE                        1484782 non-null float64
LOCATION                         1484782 non-null object
ON STREET NAME                   1352781 non-null object
CROSS STREET NAME                1108481 non-null object
OFF STREET NAME                  242593 non-null object
NUMBER OF PERSONS INJURED        1687725 non-null float64
NUMBER OF PERSONS KILLED         1687711 non-null float64
NUMBER OF PEDESTRIANS INJURED    1687742 non-null int64
NUMBER OF PEDESTRIANS KILLED     1687742 non-null int64
NUMBER OF CYCLIST INJURED        1687742 non-null int64
NUMBER OF CYCLIST KILLED        
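
## The info() output above lists non-null counts per column (for example, BOROUGH has 1172462 non-null values out of 1687742 rows).
## A minimal sketch of turning such counts into a missing-value share per column. The small `sample` frame is made-up
## illustration data standing in for alldf, and `missing_share` is a name introduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for alldf; the values are made up for illustration.
sample = pd.DataFrame({
    'BOROUGH': ['MANHATTAN', None, 'BRONX', None],
    'ZIP CODE': ['10011', None, '10451', '10001'],
    'NUMBER OF PERSONS KILLED': [0.0, 0.0, np.nan, 1.0],
})

# Fraction of missing values in each column, largest first.
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)
```

## Columns with a large missing share (such as BOROUGH in the real data) determine how many rows survive a zone filter.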

In [153]:
## A micro-model can be applied to any user-defined zone/area. It can be taken down to the level of individual streets, provided we have enough detailed data.
## In our micro-modelling strategy, let's choose a zone/area based on borough, and as a sample let's choose Manhattan.
## .copy() gives us an independent DataFrame, so adding columns later does not raise SettingWithCopyWarning.
df = alldf[alldf['BOROUGH'] == 'MANHATTAN'].copy()
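
## Since a micro-model should work for any user-defined zone, the filter can be wrapped in a small helper.
## This is only a sketch: `select_zone` is a hypothetical function introduced here, and the `sample` frame is made-up data.

```python
import pandas as pd

def select_zone(crashes, column, value):
    """Return an independent copy of the rows where `column` equals `value`."""
    return crashes[crashes[column] == value].copy()

# Hypothetical stand-in for alldf; rows with a missing zone never match the filter.
sample = pd.DataFrame({
    'BOROUGH': ['MANHATTAN', 'BROOKLYN', 'MANHATTAN', None],
    'ZIP CODE': ['10011', '11201', '10013', None],
})

manhattan = select_zone(sample, 'BOROUGH', 'MANHATTAN')
zip_10011 = select_zone(sample, 'ZIP CODE', '10011')
print(len(manhattan), len(zip_10011))
```

## The same call works for a borough, a ZIP code, or a street name column, which is what makes the zone choice a parameter of the pipeline.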

In [154]:
## We will try to model the fatalities that occur in a particular incident.
## For modelling we would like strong predictors, and we can derive a few indicators from the date and time columns:
## the month, the day of the week (0 = Monday ... 6 = Sunday) and the hour of the crash, which may provide crucial insights.
df['month'] = pd.DatetimeIndex(df['CRASH DATE']).month
df['pd_datetime'] = pd.to_datetime(df['CRASH DATE'])
df['weekday'] = df['pd_datetime'].dt.dayofweek
df['hour'] = pd.to_datetime(df['CRASH TIME']).dt.hour
df.head()


Unnamed: 0,CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,...,COLLISION_ID,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2,VEHICLE TYPE CODE 3,VEHICLE TYPE CODE 4,VEHICLE TYPE CODE 5,month,pd_datetime,weekday,hour
0,10/13/2016,11:23,MANHATTAN,10011,40.7441,-73.99565,POINT (-73.99565 40.7441),WEST 23 STREET,7 AVENUE,,...,3540699,Box Truck,Bulk Agriculture,,,,10,2016-10-13,3,11
1,11/01/2016,22:10,MANHATTAN,10013,40.722584,-74.00636,POINT (-74.00636 40.722584),VARICK STREET,CANAL STREET,,...,3553200,Sedan,Sedan,,,,11,2016-11-01,1,22
4,10/14/2016,23:15,MANHATTAN,10034,40.86955,-73.91519,POINT (-73.91519 40.86955),WEST 215 STREET,10 AVENUE,,...,3545177,Sedan,Sedan,,,,10,2016-10-14,4,23
5,11/03/2016,7:50,MANHATTAN,10065,40.761486,-73.96062,POINT (-73.96062 40.761486),EAST 62 STREET,1 AVENUE,,...,3553289,Taxi,Taxi,,,,11,2016-11-03,3,7
9,10/27/2016,18:00,MANHATTAN,10012,40.724236,-73.997795,POINT (-73.997795 40.724236),PRINCE STREET,BROADWAY,,...,3551464,Sedan,Sedan,,,,10,2016-10-27,3,18
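
## The date and time features above can also be derived from a single combined timestamp.
## A minimal sketch, assuming the MM/DD/YYYY and HH:MM layouts seen in the rows above; the `sample` frame reuses two of
## those rows, and `is_weekend` is a hypothetical extra column that the original model does not use.

```python
import pandas as pd

sample = pd.DataFrame({
    'CRASH DATE': ['10/13/2016', '11/01/2016'],
    'CRASH TIME': ['11:23', '22:10'],
})

# Parse date and time once, with an explicit format (an assumption based on the sample rows).
stamp = pd.to_datetime(sample['CRASH DATE'] + ' ' + sample['CRASH TIME'],
                       format='%m/%d/%Y %H:%M')

features = pd.DataFrame({
    'month': stamp.dt.month,
    'weekday': stamp.dt.dayofweek,          # Monday = 0 ... Sunday = 6
    'hour': stamp.dt.hour,
    'is_weekend': stamp.dt.dayofweek >= 5,  # Saturday or Sunday
})
print(features)
```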


In [155]:
## The number of fatalities is a count that is zero for most crashes. Hence we will categorize that data into fatal (1) or non-fatal (0).
df['output'] = (df['NUMBER OF PERSONS KILLED'] > 0).astype(int)



In [156]:
## Similarly, we will make the other predictors categorical.
## Based on initial screening, our sample predictors are weekday, hour, month and
## 'CONTRIBUTING FACTOR VEHICLE 1' to 'CONTRIBUTING FACTOR VEHICLE 5'.
## We will convert the non-numeric data into numeric codes using a label encoder.

from sklearn.preprocessing import LabelEncoder

factor_cols = ['CONTRIBUTING FACTOR VEHICLE %d' % i for i in range(1, 6)]

# Fit one encoder on the labels of all five columns so that a factor gets the same code in every column.
# Missing values are cast to the string 'nan' and encoded as their own category.
le_factor = LabelEncoder()
le_factor.fit(df[factor_cols].astype(str).values.ravel())
for col in factor_cols:
    df[col] = le_factor.transform(df[col].astype(str))
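
## A label encoder assigns integer codes in sorted label order, so a linear model such as logistic regression would read
## those codes as an ordered scale. One common alternative is one-hot encoding. The sketch below is not the approach used
## in this notebook; the `sample` frame and the `factor1` prefix are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows; the factor labels are example values only.
sample = pd.DataFrame({
    'CONTRIBUTING FACTOR VEHICLE 1': ['Unsafe Speed',
                                      'Driver Inattention/Distraction',
                                      'Unspecified'],
    'hour': [11, 22, 7],
})

# One indicator column per factor instead of a single integer code.
encoded = pd.get_dummies(sample,
                         columns=['CONTRIBUTING FACTOR VEHICLE 1'],
                         prefix='factor1',
                         dtype=int)
print(encoded)
```

## The trade-off is a wider input table: one column per distinct factor in each of the five factor columns.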





In [157]:
## We want to check whether fatal and non-fatal cases are present in equal proportion in the input dataset; a heavily imbalanced target would bias a classifier towards the majority class
#sns.countplot(df['output'])
df['output'].value_counts()

0    277393
1       227
Name: output, dtype: int64

In [158]:
# We can see that there are only a few fatal cases compared to non-fatal cases.
# Hence we will undersample the non-fatal cases so that both classes have the same number of rows
fatal_df = df[df['output'] == 1]
nonfatal_df = df[df['output'] == 0].sample(n=len(fatal_df))
final_output_df = pd.concat([nonfatal_df, fatal_df])
final_output_df.shape

(454, 34)
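
## The undersampling step draws a different random set of non-fatal rows on every run. A minimal sketch of a reusable,
## seeded version follows. `undersample` and its `seed` parameter are hypothetical names introduced here, and the
## `sample` frame is made-up data with a 5 to 95 class split.

```python
import pandas as pd

def undersample(frame, label_col, seed=0):
    """Downsample every class to the size of the rarest class."""
    n_min = frame[label_col].value_counts().min()
    parts = [group.sample(n=n_min, random_state=seed)
             for _, group in frame.groupby(label_col)]
    return pd.concat(parts)

# Hypothetical imbalanced data: 5 fatal rows and 95 non-fatal rows.
sample = pd.DataFrame({'hour': range(100), 'output': [1] * 5 + [0] * 95})

balanced = undersample(sample, 'output', seed=42)
print(balanced['output'].value_counts())
```

## Fixing the seed makes the selected rows, and therefore any downstream accuracy figure, repeatable between runs.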

In [159]:
## Now we will define our input and output variables
inputdf = final_output_df[['weekday','month','hour','CONTRIBUTING FACTOR VEHICLE 1','CONTRIBUTING FACTOR VEHICLE 2','CONTRIBUTING FACTOR VEHICLE 3','CONTRIBUTING FACTOR VEHICLE 4','CONTRIBUTING FACTOR VEHICLE 5']]
outputdf  = final_output_df['output']
print(inputdf.shape)
print(outputdf.shape)

(454, 8)
(454,)


In [160]:
## We will now create training and testing datasets
from sklearn.model_selection import train_test_split
trainx, testx, trainy, testy = train_test_split(inputdf, outputdf, test_size = 0.25)
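
## With only a few hundred rows, a plain random split can leave the two classes unevenly represented in the test set.
## A sketch of a stratified, seeded split, using synthetic data in place of inputdf and outputdf; the value
## random_state=42 is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for inputdf / outputdf (40 rows, balanced classes).
rng = np.random.default_rng(0)
X = pd.DataFrame({'hour': rng.integers(0, 24, size=40),
                  'month': rng.integers(1, 13, size=40)})
y = pd.Series([0] * 20 + [1] * 20, name='output')

# stratify=y keeps the class ratio identical in the training and test parts.
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
print(test_y.value_counts())
```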

In [161]:
## Now we will fit a simple logistic regression model to see if it can classify the incidents correctly
from sklearn.linear_model import LogisticRegression
lm = LogisticRegression()
model = lm.fit(trainx, trainy)
predictions = model.predict(testx)

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
print(accuracy_score(testy, predictions))

0.6666666666666666
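
## Accuracy alone does not show which class is being misclassified, and missed fatal crashes matter most here.
## A sketch of a per-class breakdown using confusion_matrix and classification_report. The labels below are
## made-up values, not the notebook's predictions.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up true and predicted labels (1 = fatal, 0 = non-fatal).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

# Rows are true classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

# Recall for the fatal class: share of fatal crashes the model caught.
recall_fatal = tp / (tp + fn)
print(cm)
print(classification_report(y_true, y_pred, target_names=['non-fatal', 'fatal']))
```

## In the notebook, the same calls would take testy and predictions as arguments.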




In [162]:
### Trying with an AdaBoost classifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn import metrics as sm

abcmodel  = AdaBoostClassifier(n_estimators = 30)
abcmodel.fit(trainx, trainy)
predictions = abcmodel.predict(testx)
sm.accuracy_score(testy, predictions)

0.8245614035087719
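
## A single hold-out split of this size gives a noisy accuracy estimate. The cross_val_score function imported above can
## average over several splits. A sketch on synthetic data from make_classification (an assumption; the real inputs
## would be inputdf and outputdf), with max_iter=1000 and random_state=0 as arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with the same shape as the balanced sample (454 rows, 8 features).
X, y = make_classification(n_samples=454, n_features=8, random_state=0)

models = {
    'logistic regression': LogisticRegression(max_iter=1000),
    'AdaBoost': AdaBoostClassifier(n_estimators=30, random_state=0),
}

# 5-fold cross-validated accuracy for each model.
scores = {name: cross_val_score(model, X, y, cv=5, scoring='accuracy')
          for name, model in models.items()}
for name, fold_scores in scores.items():
    print('%s: %.3f +/- %.3f' % (name, fold_scores.mean(), fold_scores.std()))
```

## Comparing the mean and spread across folds shows whether a gap between the two models is larger than the split-to-split noise.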

In [163]:
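## As we can see here, on this single 25% hold-out split (114 crashes) the AdaBoost classifier reaches about 82% accuracy, compared to about 67% for the simple logistic regression.
## The test set is small and the non-fatal sample is drawn at random, so these figures will vary from run to run.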
