# MSDS 7331 Mini - Project 1  

## SVM and Logistic Modeling

Professor: Dr. Jake Drew  
Team: Steven Hayden, Josephine MacDaniel, Korey MacVittie, Afreen Siddiqui, Eduardo Cantu

## Models

In [1]:
#load Libraries
import pandas as pd
import numpy as np
import seaborn as seaborn
import matplotlib.pyplot as plt
import matplotlib as mp1
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score




In [23]:
#Load Data
Accident_df_2016 = pd.read_csv('https://raw.githubusercontent.com/ecantu75/DataMining_Lab1/master/Data/accident_2016.csv',low_memory=False)
Accident_df_2015 = pd.read_csv('https://raw.githubusercontent.com/ecantu75/DataMining_Lab1/master/Data/accident_2015.csv',low_memory=False)
Accident_df = pd.concat([Accident_df_2015,Accident_df_2016])






### Data Cleaning

In [45]:
# Displays the record count of non-null Values per attribute and their data type. 
Accident_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16230 entries, 196 to 34438
Data columns (total 71 columns):
Unnamed: 0                                                     16230 non-null int64
state_number                                                   16230 non-null int64
state_name                                                     16230 non-null object
consecutive_number                                             16230 non-null int64
number_of_vehicle_forms_submitted_all                          16230 non-null int64
number_of_motor_vehicles_in_transport_mvit                     16230 non-null int64
number_of_parked_working_vehicles                              16230 non-null int64
number_of_forms_submitted_for_persons_not_in_motor_vehicles    16230 non-null int64
number_of_persons_not_in_motor_vehicles_in_transport_mvit      16230 non-null int64
number_of_persons_in_motor_vehicles_in_transport_mvit          16230 non-null int64
number_of_forms_submitted_for_persons_in_motor_v

In [46]:

#check for duplicate records. It displays the maximum count of a duplicated record. 
#Any value greater than 1 would mean that the data has duplicates
Accident_df['consecutive_number'].value_counts().max()


1

In [47]:
#Drop duplicate records, keeping the most recent one. We assume the most recent record is the most accurate.
Accident_df = Accident_df.drop_duplicates(['consecutive_number'],keep = 'last')

#check for duplicate records. It displays the maximum count of a duplicated record. 
#Any value greater than 1 would mean that the data has duplicates
Accident_df['consecutive_number'].value_counts().max()

1

In [48]:
#Remove any records without time information for arrival and crash time
count_no_rec= Accident_df[(Accident_df['hour_of_crash']>24) | (Accident_df['hour_of_arrival_at_scene']>24)]
Accident_df = Accident_df[(Accident_df['hour_of_crash']<=24) & (Accident_df['hour_of_arrival_at_scene']<=24)]
print(count_no_rec.consecutive_number.count(), 'records were removed due to missing time data.') 


0 records were removed due to missing time data.


In [49]:
# Add a new Feature. This is the ratio between the number of fatalities and the people involved in the accident.
Accident_df['Fatalities_ratio'] = Accident_df.number_of_fatalities/(Accident_df.number_of_persons_not_in_motor_vehicles_in_transport_mvit + Accident_df.number_of_persons_in_motor_vehicles_in_transport_mvit)

#Converts hour and minute to timedelta type
#crash
Accident_df.hour_of_crash = pd.to_timedelta(Accident_df.hour_of_crash,unit ='h')
Accident_df.minute_of_crash= pd.to_timedelta(Accident_df.minute_of_crash,unit ='m')
#arrival
Accident_df.hour_of_arrival_at_scene = pd.to_timedelta(Accident_df.hour_of_arrival_at_scene,unit ='h')
Accident_df.minute_of_arrival_at_scene = pd.to_timedelta(Accident_df.minute_of_arrival_at_scene,unit ='m')

#Adds hours and minutes together to get a single time of day
Accident_df['Crash_Time'] = Accident_df['hour_of_crash'] + Accident_df['minute_of_crash'] 
Accident_df['Arrival_Time'] = Accident_df['hour_of_arrival_at_scene'] + Accident_df['minute_of_arrival_at_scene']
#creates a response_time variable from the two fields above and converts to min
Accident_df['Response_Time'] = Accident_df['Arrival_Time'] - Accident_df['Crash_Time']
total_response_time_in_min = pd.DatetimeIndex(Accident_df['Response_Time'])
Accident_df['Response_Time']= total_response_time_in_min.hour * 60 + total_response_time_in_min.minute


#gut check of calculation 
Accident_df[['hour_of_crash','minute_of_crash','Crash_Time','hour_of_arrival_at_scene','minute_of_arrival_at_scene','Arrival_Time','Response_Time']].head()

| | hour_of_crash | minute_of_crash | Crash_Time | hour_of_arrival_at_scene | minute_of_arrival_at_scene | Arrival_Time | Response_Time |
|---|---|---|---|---|---|---|---|
| 196 | 20:00:00 | 00:47:00 | 20:47:00 | 21:00:00 | 00:01:00 | 21:01:00 | 14 |
| 240 | 19:00:00 | 00:10:00 | 19:10:00 | 19:00:00 | 00:24:00 | 19:24:00 | 14 |
| 274 | 14:00:00 | 00:18:00 | 14:18:00 | 14:00:00 | 00:25:00 | 14:25:00 | 7 |
| 312 | 23:00:00 | 00:23:00 | 23:23:00 | 23:00:00 | 00:29:00 | 23:29:00 | 6 |
| 959 | 16:00:00 | 00:01:00 | 16:01:00 | 16:00:00 | 00:12:00 | 16:12:00 | 11 |
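
The same response time can also be computed with plain minute arithmetic. The sketch below is an illustration under stated assumptions, not the notebook's own method: `sample` is a small hypothetical frame whose column names mirror the fields above, and the difference is taken modulo 1440 (minutes per day) so that an arrival after midnight still yields a positive value. The first row reproduces the 20:47 crash and 21:01 arrival shown in the output above.

```python
import pandas as pd

# Hypothetical sample rows; column names mirror the notebook's fields.
sample = pd.DataFrame({
    "hour_of_crash": [20, 23, 10],
    "minute_of_crash": [47, 50, 30],
    "hour_of_arrival_at_scene": [21, 0, 10],
    "minute_of_arrival_at_scene": [1, 10, 3],
})

# Convert both clock times to minutes since midnight.
crash_min = sample["hour_of_crash"] * 60 + sample["minute_of_crash"]
arrival_min = (sample["hour_of_arrival_at_scene"] * 60
               + sample["minute_of_arrival_at_scene"])

# Modulo 1440 wraps negative differences back into the range 0..1439.
sample["Response_Time"] = (arrival_min - crash_min) % 1440
print(sample["Response_Time"].tolist())
```

Note that a recorded arrival earlier than the crash on the same day (third row) wraps to a value close to 1440, so very large response times may point to data-entry problems rather than genuinely long waits.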


The National Fire Protection Association (NFPA) has established standards for response times and procedures that Emergency Medical Services (EMS) are expected to follow.

From EMSWorld.com: "The NFPA 1710 standard is based upon a combination of accepted practices and more than 30 years of study, research, testing and validation. Members of the 1710 committee that developed the standard include representatives from various fire agencies and the International Association of City/County Managers (ICMA)."

The NFPA 1710 standard allows one minute for call evaluation and preparation and four minutes for the arrival of a unit with a first responder. For situations that require advanced life support equipment, such as an ambulance, the standard is eight minutes after call preparation.

We use these standards as the threshold for deciding whether the paramedics reached the scene of the accident in time: one minute of call preparation plus eight minutes for arrival gives a nine-minute limit. The response is binary: 1 if responders arrived within 9 minutes of the accident and 0 if they did not.

Source: https://www.emsworld.com/article/10324786/ems-response-time-standards
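
The threshold arithmetic can be sketched as follows. The constant names are hypothetical labels for the figures quoted above, and the response times are the ten values that appear in the verification output further down in this notebook.

```python
import numpy as np

# Hypothetical names for the NFPA 1710 figures quoted in the text.
CALL_PREP_MIN = 1      # call evaluation and preparation
ALS_ARRIVAL_MIN = 8    # advanced life support arrival after preparation
THRESHOLD_MIN = CALL_PREP_MIN + ALS_ARRIVAL_MIN  # 9 minutes in total

# Response times (minutes) copied from the notebook's verification output.
response_times = np.array([52, 7, 1413, 1418, 13, 12, 11, 5, 38, 8])

# 1 = arrived within the threshold, 0 = did not.
on_time = np.where(response_times <= THRESHOLD_MIN, 1, 0)
print(THRESHOLD_MIN, on_time.tolist())
```

The comparison is inclusive, so a response of exactly 9 minutes counts as on time.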



In [26]:
#creates the binary variable 'within 9 minutes NFPA standard'
Accident_df['within 9 minutes NFPA standard'] = np.where(Accident_df['Response_Time']<=9,1,0)
#Accident_df[['Response_Time','within 9 minutes NFPA standard']].head(10) # Verify the binary variable

| | Response_Time | within 9 minutes NFPA standard |
|---|---|---|
| 1 | 52 | 0 |
| 12 | 7 | 1 |
| 33 | 1413 | 0 |
| 40 | 1418 | 0 |
| 54 | 13 | 0 |
| 56 | 12 | 0 |
| 59 | 11 | 0 |
| 60 | 5 | 1 |
| 62 | 38 | 0 |
| 63 | 8 | 1 |


### Logistic Regression Model

In [27]:
# Check for columns that might not be needed
#Accident_df.info()
#list(Accident_df)
Accident_df.route_signing_name.unique()


array(['U.S. Highway', 'Local Street – Municipality', 'Other',
       'Interstate', 'Unknown', 'State Highway', 'County Road',
       'Local Street – Township',
       'Local Street – Frontage Road (Since 1994)'], dtype=object)

In [28]:
#Make a copy of the original Data
Accident_forLr=Accident_df.copy()

# 1. Remove attributes that aren't useful for this model
for col in ['Unnamed: 0',
             'state_number',
             'consecutive_number',
             'county',
             'city',
             'day_of_crash',
             'month_of_crash',
             'year_of_crash',
             'day_of_week',
             'hour_of_crash',
             'minute_of_crash',
             'national_highway_system',
             'land_use',
             'land_use_name',
             'functional_system',
             'functional_system_name',
             'ownership',
             'ownership_name',
             'route_signing',
             'trafficway_identifier',
             'trafficway_identifier_2',
             'latitude',
             'longitude',
             'special_jurisdiction',
             'special_jurisdiction_name',
             'first_harmful_event',
             'first_harmful_event_name',
             'manner_of_collision',
             'manner_of_collision_name',
             'relation_to_junction_within_interchange_area',
             'relation_to_junction_specific_location',
             'relation_to_junction_specific_location_name',
             'type_of_intersection',
             'work_zone',
             'relation_to_trafficway',
             'relation_to_trafficway_name',
             'light_condition',
             'light_condition_name',
             'atmospheric_conditions_1',
             'atmospheric_conditions_1_name',
             'atmospheric_conditions_2',
             'atmospheric_conditions_2_name',
             'atmospheric_conditions',
             'school_bus_related',
             'rail_grade_crossing_identifier',
             'hour_of_notification',
             'minute_of_notification',
             'hour_of_arrival_at_scene',
             'minute_of_arrival_at_scene',
             'hour_of_ems_arrival_at_hospital',
             'minute_of_ems_arrival_at_hospital',
             'related_factors_crash_level_1',
             'related_factors_crash_level_1_name',
             'related_factors_crash_level_2',
             'related_factors_crash_level_2_name',
             'related_factors_crash_level_3',
             'related_factors_crash_level_3_name',
             'milepoint',
             'number_of_parked_working_vehicles',
             'number_of_forms_submitted_for_persons_not_in_motor_vehicles',
             'number_of_persons_not_in_motor_vehicles_in_transport_mvit',
             'number_of_persons_in_motor_vehicles_in_transport_mvit',
             'number_of_forms_submitted_for_persons_in_motor_vehicles',
             'timestamp_of_crash',
             'number_of_fatalities',
             'number_of_drunk_drivers',
             'Fatalities_ratio',
             'number_of_vehicle_forms_submitted_all',
             'Crash_Time',
             'Arrival_Time',
             'Response_Time',
             'number_of_motor_vehicles_in_transport_mvit']:
                    if col in Accident_forLr:
                        del Accident_forLr[col]
# List the columns left in the df
list(Accident_forLr)

#Accident_forLr.head(20)

['state_name',
 'route_signing_name',
 'atmospheric_conditions_name',
 'within 9 minutes NFPA standard']
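
Because only four columns survive the deletion loop, the same result could be reached by selecting the columns to keep. This is a minimal sketch on a hypothetical two-row frame (`demo`, with an extra `latitude` column standing in for the dropped attributes), not the notebook's own code.

```python
import pandas as pd

# Hypothetical miniature frame; 'latitude' stands in for the dropped columns.
demo = pd.DataFrame({
    "state_name": ["Texas", "Ohio"],
    "route_signing_name": ["Interstate", "County Road"],
    "atmospheric_conditions_name": ["Clear", "Rain"],
    "latitude": [32.8, 40.0],
    "within 9 minutes NFPA standard": [1, 0],
})

# Columns that remain after the notebook's deletion loop.
keep_cols = [
    "state_name",
    "route_signing_name",
    "atmospheric_conditions_name",
    "within 9 minutes NFPA standard",
]

# Select the kept columns and copy so later edits leave 'demo' untouched.
subset = demo[keep_cols].copy()
print(list(subset))
```

Selecting by name keeps the list of modeling inputs short and explicit, at the cost of raising a `KeyError` if one of the expected columns is missing.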

In [30]:
# Drop rows with unknown, other, or unreported values in the route and atmospheric conditions columns
Accident_forLr= Accident_forLr[(Accident_forLr['route_signing_name'] != 'Unknown')]
Accident_forLr= Accident_forLr[(Accident_forLr['route_signing_name'] != 'Other')]
Accident_forLr= Accident_forLr[(Accident_forLr['atmospheric_conditions_name'] != 'Unknown')]
Accident_forLr= Accident_forLr[(Accident_forLr['atmospheric_conditions_name'] != 'Other')]
Accident_forLr= Accident_forLr[(Accident_forLr['atmospheric_conditions_name'] != 'Not Reported')]
Accident_forLr.count()

# Rearrange columns
Accident_forLr=Accident_forLr[['within 9 minutes NFPA standard','state_name','route_signing_name','atmospheric_conditions_name']]
#list(Accident_forLr)  # Check for the correct column sequence

In [31]:
# perform one-hot encoding of the categorical data "state_name"
tmp_state_df = pd.get_dummies(Accident_forLr.state_name,prefix='state')
#tmp_state_df.head(20)

# perform one-hot encoding of the categorical data "route_signing_name"
tmp_route_df = pd.get_dummies(Accident_forLr.route_signing_name,prefix='route')
#tmp_route_df.head(20)

# perform one-hot encoding of the categorical data "atmospheric_conditions_name"
tmp_atmos_df = pd.get_dummies(Accident_forLr.atmospheric_conditions_name,prefix='atmos')
#tmp_atmos_df.head(20)

Accident_forLr = pd.concat((Accident_forLr,tmp_state_df,tmp_route_df,tmp_atmos_df),axis=1) # add back into the dataframe
list(Accident_forLr)
#delete the categorical variable columns
del Accident_forLr['state_name']
del Accident_forLr['route_signing_name']
del Accident_forLr['atmospheric_conditions_name']

#list(Accident_forLr) # Check for the last columns
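
The three encoding steps above can also be expressed as a single `pd.get_dummies` call, which encodes the listed columns and drops the originals in one pass. The sketch below runs on a hypothetical two-row frame (`demo`) and passes `dtype=int` so the indicators are 0/1 integers regardless of the pandas version's default.

```python
import pandas as pd

# Hypothetical miniature frame with the notebook's column names.
demo = pd.DataFrame({
    "within 9 minutes NFPA standard": [1, 0],
    "state_name": ["Texas", "Ohio"],
    "route_signing_name": ["Interstate", "County Road"],
    "atmospheric_conditions_name": ["Clear", "Rain"],
})

# Encode all three categorical columns at once; the originals are removed
# and the prefixes match the ones used in the notebook.
encoded = pd.get_dummies(
    demo,
    columns=["state_name", "route_signing_name", "atmospheric_conditions_name"],
    prefix=["state", "route", "atmos"],
    dtype=int,
)
print(list(encoded))
```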

In [33]:
#Accident_forLr.count() # Count Records and show columns

within 9 minutes NFPA standard                     14354
state_Alabama                                      14354
state_Alaska                                       14354
state_Arizona                                      14354
state_Arkansas                                     14354
state_California                                   14354
state_Colorado                                     14354
state_Connecticut                                  14354
state_Delaware                                     14354
state_District of Columbia                         14354
state_Florida                                      14354
state_Georgia                                      14354
state_Hawaii                                       14354
state_Idaho                                        14354
state_Illinois                                     14354
state_Iowa                                         14354
state_Kansas                                       14354
state_Kentucky                 

### Data Split
Training and Test

In [34]:
# Code adapted from the Data Mining notebooks (Logistic Regression, Notebook 4).
from sklearn.model_selection import ShuffleSplit

# we want to predict the X and y data as follows:
if 'within 9 minutes NFPA standard' in Accident_forLr:
    y = Accident_forLr['within 9 minutes NFPA standard'].values # get the labels we want
    del Accident_forLr['within 9 minutes NFPA standard'] # get rid of the class label
    X = Accident_forLr.values # use everything else to predict!
    
        ## X and y are now numpy matrices, by calling 'values' on the pandas data frames we
        #    have converted them into simple matrices to use with scikit learn

# to use the cross validation object in scikit learn, we need to grab an instance
#    of the object and set it up. This object will be able to split our data into 
#    training and testing splits
num_cv_iterations = 10
num_instances = len(y)

# Cross Validation Object
cv_object = ShuffleSplit(n_splits=num_cv_iterations,
                         test_size  = 0.2)
                         
print(X)
print(cv_object)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
ShuffleSplit(n_splits=10, random_state=None, test_size=0.2, train_size=None)
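
The printed object shows `random_state=None`, so each run draws different splits. If reproducible splits are wanted, an integer seed can be passed. The sketch below uses a small hypothetical array (`X_demo`) and an arbitrary seed of 42 to show that two passes over the same splitter then return identical test indices.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Hypothetical data: 10 rows, 2 features.
X_demo = np.arange(20).reshape(10, 2)

# An integer random_state makes the shuffled splits repeatable.
cv_seeded = ShuffleSplit(n_splits=3, test_size=0.2, random_state=42)

# Collect the test indices from two independent passes.
first_pass = [test.tolist() for _, test in cv_seeded.split(X_demo)]
second_pass = [test.tolist() for _, test in cv_seeded.split(X_demo)]

print(first_pass == second_pass)
```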


In [41]:
# Code adapted from the Data Mining notebooks (Logistic Regression, Notebook 4).
# run logistic regression and vary some parameters
from sklearn.linear_model import LogisticRegression
from sklearn import metrics as mt

# first we create a reusable logistic regression object
#   here we can setup the object with different learning parameters and constants
OnTime_lr_clf = LogisticRegression(penalty='l2', C=1.0, class_weight=None) # get object

# now we can use the cv_object that we setup before to iterate through the 
#    different training and testing sets. Each time we will reuse the logistic regression 
#    object, but it gets trained on different data each time we use it.

iter_num=0
# the indices are the rows used for training and testing in each iteration
for train_indices, test_indices in cv_object.split(X,y): 
    # I will create new variables here so that it is more obvious what 
    # the code is doing (you can compact this syntax and avoid duplicating memory,
    # but it makes this code less readable)
    X_train = X[train_indices]
    y_train = y[train_indices]
    
    X_test = X[test_indices]
    y_test = y[test_indices]
    
    # train the reusable logistic regression model on the training data
    OnTime_lr_clf.fit(X_train,y_train)  # train object
    y_hat = OnTime_lr_clf.predict(X_test) # get test set predictions

    # now let's get the accuracy and confusion matrix for this iteration of training/testing
    acc = mt.accuracy_score(y_test,y_hat)
    conf = mt.confusion_matrix(y_test,y_hat)
    print("====Iteration",iter_num," ====")
    print("accuracy", acc )
    print("confusion matrix\n",conf)
    iter_num+=1
    
# Also note that every time you run the above code
#   it randomly creates a new training and testing set, 
#   so accuracy will be different each time

====Iteration 0  ====
accuracy 0.6367119470567747
confusion matrix
 [[1202  411]
 [ 632  626]]
====Iteration 1  ====
accuracy 0.6353187042842215
confusion matrix
 [[1194  381]
 [ 666  630]]
====Iteration 2  ====
accuracy 0.6349703935910832
confusion matrix
 [[1232  349]
 [ 699  591]]
====Iteration 3  ====
accuracy 0.6283524904214559
confusion matrix
 [[1202  408]
 [ 659  602]]
====Iteration 4  ====
accuracy 0.6384535005224661
confusion matrix
 [[1240  392]
 [ 646  593]]
====Iteration 5  ====
accuracy 0.6060606060606061
confusion matrix
 [[1162  412]
 [ 719  578]]
====Iteration 6  ====
accuracy 0.6405433646812957
confusion matrix
 [[1205  387]
 [ 645  634]]
====Iteration 7  ====
accuracy 0.634622082897945
confusion matrix
 [[1218  409]
 [ 640  604]]
====Iteration 8  ====
accuracy 0.625914315569488
confusion matrix
 [[1182  385]
 [ 689  615]]
====Iteration 9  ====
accuracy 0.6363636363636364
confusion matrix
 [[1191  379]
 [ 665  636]]
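
Accuracy alone hides how the errors are distributed, so it can help to derive a few more figures from a confusion matrix. The sketch below uses the Iteration 0 matrix printed above and relies on scikit-learn's convention that rows are true classes and columns are predicted classes. The variable names are illustrative, and the majority-class baseline is simply the accuracy of always predicting the larger class in that test split.

```python
import numpy as np

# Confusion matrix from Iteration 0 (rows = true class, columns = predicted).
conf = np.array([[1202, 411],
                 [632, 626]])

# For a 2x2 matrix, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = conf.ravel()

accuracy = (tn + tp) / conf.sum()

# Accuracy of always predicting the most frequent true class.
majority_baseline = conf.sum(axis=1).max() / conf.sum()

# Recall and precision for class 1 (arrived within 9 minutes).
recall_on_time = tp / (tp + fn)
precision_on_time = tp / (tp + fp)

print(accuracy, majority_baseline, recall_on_time, precision_on_time)
```

Comparing `accuracy` with `majority_baseline` shows how much the model gains over a trivial classifier, while the recall value shows what share of on-time arrivals it actually identifies.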


### Support Vector Machine Model

## Model Discussion

## Logistic Regression Weight Interpretation

## Insights based on the Classification Task