# Introduction / Business Problem

Car accidents are a major problem in the United States, causing injury, property damage, and even fatalities. Certain factors raise the risk of being involved in an accident. Using a data set of recorded car accidents, we will try to predict the severity of an accident from the weather, location, visibility, and road conditions. Our goal is to help reduce the frequency and severity of car collisions by identifying these risk factors. The analysis could tell drivers when they are at higher risk, so that they can drive more carefully or choose an alternative route or time for their travel. It could also help the police, government, or car insurance providers gain a deeper understanding of the risk factors.


## Data  

The dataset we will be using covers all road collisions in Seattle from 2004 to the present. It consists of 37 independent fields and 194673 records, with both numerical and categorical data. The dependent field, or label, is SEVERITYCODE, which describes the severity of an accident. Its values are categorised as fatality (3), serious injury (2b), injury (2), property damage (1), and unknown (0). We will use factors such as the location of the accident, junction type, weather, road condition, light condition, and speeding to predict the severity of an accident and to determine which model is most accurate.
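
The label encoding described above can be written down as a small lookup table. This is only a sketch: the names `SEVERITY_LABELS` and `describe_severity` are hypothetical and do not come from the dataset, and the codes are kept as strings because `2b` is not numeric.

```python
# Hypothetical lookup table for the SEVERITYCODE values described above.
# Codes are stored as strings because "2b" cannot be parsed as an integer.
SEVERITY_LABELS = {
    "3": "fatality",
    "2b": "serious injury",
    "2": "injury",
    "1": "property damage",
    "0": "unknown",
}


def describe_severity(code):
    """Return the description for a severity code, or 'unknown' if unrecognised."""
    return SEVERITY_LABELS.get(str(code).strip().lower(), "unknown")


print(describe_severity(2))     # injury
print(describe_severity("2b"))  # serious injury
```

Falling back to "unknown" for unrecognised codes is a design choice of this sketch; raising an error instead may be preferable when silent mislabelling would be costly.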

## Methodology 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy import stats
from pylab import rcParams
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, jaccard_score, log_loss
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split


In [2]:
url ='https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv'
df = pd.read_csv(url)
df.head()


Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


In [3]:
df.dtypes


SEVERITYCODE        int64
X                 float64
Y                 float64
OBJECTID            int64
INCKEY              int64
COLDETKEY           int64
REPORTNO           object
STATUS             object
ADDRTYPE           object
INTKEY            float64
LOCATION           object
EXCEPTRSNCODE      object
EXCEPTRSNDESC      object
SEVERITYCODE.1      int64
SEVERITYDESC       object
COLLISIONTYPE      object
PERSONCOUNT         int64
PEDCOUNT            int64
PEDCYLCOUNT         int64
VEHCOUNT            int64
INCDATE            object
INCDTTM            object
JUNCTIONTYPE       object
SDOT_COLCODE        int64
SDOT_COLDESC       object
INATTENTIONIND     object
UNDERINFL          object
WEATHER            object
ROADCOND           object
LIGHTCOND          object
PEDROWNOTGRNT      object
SDOTCOLNUM        float64
SPEEDING           object
ST_COLCODE         object
ST_COLDESC         object
SEGLANEKEY          int64
CROSSWALKKEY        int64
HITPARKEDCAR       object
dtype: object

In [20]:
# Treat a missing SPEEDING flag as 'N' (no speeding recorded)
df['SPEEDING'] = df['SPEEDING'].fillna('N')
df['SPEEDING'].dtypes

dtype('O')

In [25]:
df['SPEEDING'].value_counts()

N    185340
Y      9333
Name: SPEEDING, dtype: int64
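
The counts above mean that roughly 4.8% of records (9333 of 194673) are flagged as speeding. A minimal sketch of getting such shares directly, using a made-up series rather than the real column:

```python
import pandas as pd

# Made-up flags: one speeding record out of five.
speeding = pd.Series(["N", "N", "Y", "N", "N"], name="SPEEDING")

# normalize=True returns relative frequencies instead of raw counts.
shares = speeding.value_counts(normalize=True)
print(shares)
```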

In [30]:
df_ad = df[['SEVERITYCODE','ADDRTYPE','LOCATION', 'JUNCTIONTYPE','WEATHER','ROADCOND','LIGHTCOND','SPEEDING']]
df_ad.head()

Unnamed: 0,SEVERITYCODE,ADDRTYPE,LOCATION,JUNCTIONTYPE,WEATHER,ROADCOND,LIGHTCOND,SPEEDING
0,2,Intersection,5TH AVE NE AND NE 103RD ST,At Intersection (intersection related),Overcast,Wet,Daylight,N
1,1,Block,AURORA BR BETWEEN RAYE ST AND BRIDGE WAY N,Mid-Block (not related to intersection),Raining,Wet,Dark - Street Lights On,N
2,1,Block,4TH AVE BETWEEN SENECA ST AND UNIVERSITY ST,Mid-Block (not related to intersection),Overcast,Dry,Daylight,N
3,1,Block,2ND AVE BETWEEN MARION ST AND MADISON ST,Mid-Block (not related to intersection),Clear,Dry,Daylight,N
4,2,Intersection,SWIFT AVE S AND SWIFT AV OFF RP,At Intersection (intersection related),Raining,Wet,Daylight,N


In [33]:
df_ad.describe(include='all')


Unnamed: 0,SEVERITYCODE,ADDRTYPE,LOCATION,JUNCTIONTYPE,WEATHER,ROADCOND,LIGHTCOND,SPEEDING
count,194673.0,192747,191996,188344,189592,189661,189503,194673
unique,,3,24102,7,11,9,9,2
top,,Block,BATTERY ST TUNNEL NB BETWEEN ALASKAN WY VI NB ...,Mid-Block (not related to intersection),Clear,Dry,Daylight,N
freq,,126926,276,89800,111135,124510,116137,185340
mean,1.298901,,,,,,,
std,0.457778,,,,,,,
min,1.0,,,,,,,
25%,1.0,,,,,,,
50%,1.0,,,,,,,
75%,2.0,,,,,,,


In [35]:
missing_data = df_ad.isnull()
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("")

SEVERITYCODE
False    194673
Name: SEVERITYCODE, dtype: int64

ADDRTYPE
False    192747
True       1926
Name: ADDRTYPE, dtype: int64

LOCATION
False    191996
True       2677
Name: LOCATION, dtype: int64

JUNCTIONTYPE
False    188344
True       6329
Name: JUNCTIONTYPE, dtype: int64

WEATHER
False    189592
True       5081
Name: WEATHER, dtype: int64

ROADCOND
False    189661
True       5012
Name: ROADCOND, dtype: int64

LIGHTCOND
False    189503
True       5170
Name: LIGHTCOND, dtype: int64

SPEEDING
False    194673
Name: SPEEDING, dtype: int64
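
The largest gap reported above is JUNCTIONTYPE, with 6329 of 194673 rows missing (about 3.3%). The same per-column information can be gathered more compactly. The sketch below uses a made-up toy frame whose column names mirror the notebook; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names mirror the notebook.
toy = pd.DataFrame({
    "WEATHER": ["Clear", np.nan, "Raining", np.nan],
    "ROADCOND": ["Dry", "Wet", np.nan, "Dry"],
    "SPEEDING": ["N", "N", "Y", "N"],
})

# isna() marks missing cells; summing each boolean column counts them.
missing_counts = toy.isna().sum()
# The mean of the same mask gives the fraction missing per column.
missing_share = toy.isna().mean()

summary = pd.DataFrame({"missing": missing_counts, "share": missing_share})
print(summary)
```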



In [36]:
# Work on an independent copy so the assignments below do not raise SettingWithCopyWarning
df_ad = df_ad.copy()
# Label missing categorical values explicitly as 'Unknown'
for col in ['LIGHTCOND', 'JUNCTIONTYPE', 'WEATHER', 'ROADCOND']:
    df_ad[col] = df_ad[col].fillna('Unknown')
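
Several columns can also be filled in a single call by passing a dictionary to `fillna`, after taking an explicit copy of the column subset. The sketch below is one possible form of this step. The frame `raw` and its values are made up, and only the column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names come from the notebook.
raw = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1],
    "WEATHER": ["Clear", np.nan, "Raining"],
    "ROADCOND": [np.nan, "Wet", "Dry"],
    "OBJECTID": [1, 2, 3],
})

# .copy() makes the subset an independent DataFrame, so later writes
# cannot be mistaken for writes to a view of `raw`.
subset = raw[["SEVERITYCODE", "WEATHER", "ROADCOND"]].copy()

# A dict passed to fillna maps column name -> replacement value.
subset = subset.fillna({"WEATHER": "Unknown", "ROADCOND": "Unknown"})

print(subset)
```

Because `subset` is a copy, the original `raw` frame keeps its missing values, which can be useful when the unfilled data is needed later.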


### UNDERSAMPLING

In [43]:
df_ad["SEVERITYCODE"].value_counts()


1    136485
2     58188
Name: SEVERITYCODE, dtype: int64

In [45]:
target = "SEVERITYCODE"
# Size of the minority class (severity 2)
minority_class_len = len(df_ad[df_ad[target] == 2])
# Randomly pick the same number of rows from the majority class (severity 1)
majority_class_indices = df_ad[df_ad[target] == 1].index
random_majority_indices = np.random.choice(majority_class_indices, minority_class_len, replace=False)
minority_class_indices = df_ad[df_ad[target] == 2].index

# Keep every minority row plus the sampled majority rows
under_sample_indices = np.concatenate([minority_class_indices, random_majority_indices])
df_ad = df_ad.loc[under_sample_indices]
df_ad["SEVERITYCODE"].value_counts()

2    58188
1    58188
Name: SEVERITYCODE, dtype: int64
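
Balancing the classes this way discards 78297 majority rows (136485 minus 58188), and the rows kept differ on every run because the draw is unseeded. A reusable, seeded variant might look like the sketch below. The function name `undersample`, its `seed` parameter, and the toy data are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def undersample(frame, target, seed=0):
    """Randomly drop rows so every class in `target` matches the smallest one."""
    rng = np.random.default_rng(seed)
    smallest = frame[target].value_counts().min()
    kept = []
    for _, group in frame.groupby(target):
        # Draw `smallest` row labels from this class without replacement.
        kept.append(rng.choice(group.index.to_numpy(), size=smallest, replace=False))
    return frame.loc[np.concatenate(kept)]


# Made-up imbalanced data: seven rows of class 1, three rows of class 2.
toy = pd.DataFrame({"SEVERITYCODE": [1] * 7 + [2] * 3, "row_id": range(10)})
balanced = undersample(toy, "SEVERITYCODE", seed=42)
print(balanced["SEVERITYCODE"].value_counts())
```

Fixing the seed makes the sampled subset repeatable, so any statistics computed afterwards can be reproduced.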

# EXPLORATORY DATA ANALYSIS


In [47]:
df_ad.describe(include="all")

Unnamed: 0,SEVERITYCODE,ADDRTYPE,LOCATION,JUNCTIONTYPE,WEATHER,ROADCOND,LIGHTCOND,SPEEDING
count,116376.0,115456,115089,116376,116376,116376,116376,116376
unique,,3,19856,7,11,9,9,2
top,,Block,AURORA AVE N BETWEEN N 117TH PL AND N 125TH ST,Mid-Block (not related to intersection),Clear,Dry,Daylight,N
freq,,71397,181,49514,67908,76000,71484,110398
mean,1.5,,,,,,,
std,0.500002,,,,,,,
min,1.0,,,,,,,
25%,1.0,,,,,,,
50%,1.5,,,,,,,
75%,2.0,,,,,,,


In [49]:
df["ADDRTYPE"].value_counts()

Block           126926
Intersection     65070
Alley              751
Name: ADDRTYPE, dtype: int64

In [50]:
df["LOCATION"].value_counts()

BATTERY ST TUNNEL NB BETWEEN ALASKAN WY VI NB AND AURORA AVE N    276
BATTERY ST TUNNEL SB BETWEEN AURORA AVE N AND ALASKAN WY VI SB    271
N NORTHGATE WAY BETWEEN MERIDIAN AVE N AND CORLISS AVE N          265
AURORA AVE N BETWEEN N 117TH PL AND N 125TH ST                    254
6TH AVE AND JAMES ST                                              252
                                                                 ... 
HOWE ST BETWEEN 3RD AVE N AND NOB HILL AVE N                        1
3RD AVE N BETWEEN HAYES ST AND BLAINE ST                            1
ROOSEVELT WAY NE AND NE 114TH N ST                                  1
40TH AVE NE AND NE 51ST ST                                          1
21ST AVE SW AND SW ROXBURY ST                                       1
Name: LOCATION, Length: 24102, dtype: int64

In [51]:
df["JUNCTIONTYPE"].value_counts()

Mid-Block (not related to intersection)              89800
At Intersection (intersection related)               62810
Mid-Block (but intersection related)                 22790
Driveway Junction                                    10671
At Intersection (but not related to intersection)     2098
Ramp Junction                                          166
Unknown                                                  9
Name: JUNCTIONTYPE, dtype: int64

In [52]:
df["WEATHER"].value_counts()

Clear                       111135
Raining                      33145
Overcast                     27714
Unknown                      15091
Snowing                        907
Other                          832
Fog/Smog/Smoke                 569
Sleet/Hail/Freezing Rain       113
Blowing Sand/Dirt               56
Severe Crosswind                25
Partly Cloudy                    5
Name: WEATHER, dtype: int64

In [53]:
df["ROADCOND"].value_counts()

Dry               124510
Wet                47474
Unknown            15078
Ice                 1209
Snow/Slush          1004
Other                132
Standing Water       115
Sand/Mud/Dirt         75
Oil                   64
Name: ROADCOND, dtype: int64

In [54]:
df["LIGHTCOND"].value_counts()

Daylight                    116137
Dark - Street Lights On      48507
Unknown                      13473
Dusk                          5902
Dawn                          2502
Dark - No Street Lights       1537
Dark - Street Lights Off      1199
Other                          235
Dark - Unknown Lighting         11
Name: LIGHTCOND, dtype: int64

In [55]:
df["SPEEDING"].value_counts()

N    185340
Y      9333
Name: SPEEDING, dtype: int64
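
The counts above describe each feature on its own. One possible way to relate a feature to the label is a normalised cross-tabulation, sketched below on made-up records. The frame `toy` and the name `severity_share` are illustrative, and the numbers say nothing about the real data.

```python
import pandas as pd

# Made-up records; real values would come from the collision data.
toy = pd.DataFrame({
    "WEATHER": ["Clear", "Clear", "Clear", "Clear", "Raining", "Raining"],
    "SEVERITYCODE": [1, 1, 1, 2, 1, 2],
})

# normalize="index" divides each row by its total, so every row sums to 1
# and the column for code 2 reads as the injury share for that condition.
severity_share = pd.crosstab(toy["WEATHER"], toy["SEVERITYCODE"], normalize="index")
print(severity_share)
```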