# Data

At a minimum, the following data is required to construct a model to estimate accident severity:
- Collision statistics that include a severity measure
- Location information or road characteristics for each of the collisions to allow extrapolation to other similar sections of road
- Road surface condition and other environmental features that relate to each of the collisions

The viability of producing an accurate collision severity model will be assessed using collision data from the Seattle Police Department, accessible via the following link:
[Seattle Collision Data](https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv).

A description of the dataset can be found via the following link: 
[Seattle Collision Metadata](https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Metadata.pdf).

The remainder of this section contains an assessment of the candidate dataset and an explanation of the data elements that are used to construct the model.

## Initial Assessment
Firstly, the data is loaded for evaluation and some basic analysis is performed to get an overview of the contents of the dataset.

In [1]:
import pandas as pd
import numpy as np
import math

In [2]:
!conda install -c conda-forge folium=0.5.0 --yes
import folium

In [3]:
collisions_data_path = "https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv"
df = pd.read_csv(collisions_data_path, low_memory=False)

Visually inspect a subset of the dataset to confirm that it has loaded and to confirm the amount of data and data types available.

In [4]:
df.head()

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


In [5]:
df.shape

(194673, 38)

In [6]:
print(df.dtypes)

SEVERITYCODE        int64
X                 float64
Y                 float64
OBJECTID            int64
INCKEY              int64
COLDETKEY           int64
REPORTNO           object
STATUS             object
ADDRTYPE           object
INTKEY            float64
LOCATION           object
EXCEPTRSNCODE      object
EXCEPTRSNDESC      object
SEVERITYCODE.1      int64
SEVERITYDESC       object
COLLISIONTYPE      object
PERSONCOUNT         int64
PEDCOUNT            int64
PEDCYLCOUNT         int64
VEHCOUNT            int64
INCDATE            object
INCDTTM            object
JUNCTIONTYPE       object
SDOT_COLCODE        int64
SDOT_COLDESC       object
INATTENTIONIND     object
UNDERINFL          object
WEATHER            object
ROADCOND           object
LIGHTCOND          object
PEDROWNOTGRNT      object
SDOTCOLNUM        float64
SPEEDING           object
ST_COLCODE         object
ST_COLDESC         object
SEGLANEKEY          int64
CROSSWALKKEY        int64
HITPARKEDCAR       object
dtype: object

DECISION: From the Metadata descriptions and inspecting the output of the head function, the following columns containing identifier and key values will not be investigated:
- OBJECTID
- INCKEY
- COLDETKEY
- REPORTNO
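
The decision above can be sketched as a small helper. This is a minimal illustration, not part of the original notebook: the names `IDENTIFIER_COLUMNS` and `drop_identifier_columns` are hypothetical, and the helper returns a new frame rather than modifying `df`, so later cells that still reference these columns are unaffected.

```python
import pandas as pd

# Identifier/key columns named in the decision above.
IDENTIFIER_COLUMNS = ["OBJECTID", "INCKEY", "COLDETKEY", "REPORTNO"]


def drop_identifier_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` without the identifier columns that are present."""
    present = [col for col in IDENTIFIER_COLUMNS if col in frame.columns]
    return frame.drop(columns=present)


# Tiny stand-in frame so the sketch runs without downloading the dataset.
sample = pd.DataFrame(
    {"OBJECTID": [1, 2], "INCKEY": [1307, 52200], "SEVERITYCODE": [2, 1]}
)
trimmed = drop_identifier_columns(sample)
print(list(trimmed.columns))
```

With the real data this would be called as `drop_identifier_columns(df)`.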

Perform some high-level statistical analysis of the data to aid in narrowing down relevant features.

In [7]:
firstSet = ['SEVERITYCODE','X','Y','STATUS','ADDRTYPE','INTKEY','LOCATION','EXCEPTRSNCODE']
secondSet = ['EXCEPTRSNDESC','SEVERITYCODE.1','SEVERITYDESC','COLLISIONTYPE','PERSONCOUNT','PEDCOUNT','PEDCYLCOUNT','VEHCOUNT']
thirdSet = ['INCDATE','INCDTTM','JUNCTIONTYPE','SDOT_COLCODE','SDOT_COLDESC','INATTENTIONIND','UNDERINFL']
forthSet = ['WEATHER','ROADCOND','LIGHTCOND', 'PEDROWNOTGRNT','SDOTCOLNUM', 'SPEEDING','ST_COLCODE','ST_COLDESC','SEGLANEKEY','CROSSWALKKEY','HITPARKEDCAR']

In [8]:
df[firstSet].describe(include='all')

Unnamed: 0,SEVERITYCODE,X,Y,STATUS,ADDRTYPE,INTKEY,LOCATION,EXCEPTRSNCODE
count,194673.0,189339.0,189339.0,194673,192747,65070.0,191996,84811.0
unique,,,,2,3,,24102,2.0
top,,,,Matched,Block,,BATTERY ST TUNNEL NB BETWEEN ALASKAN WY VI NB ...,
freq,,,,189786,126926,,276,79173.0
mean,1.298901,-122.330518,47.619543,,,37558.450576,,
std,0.457778,0.029976,0.056157,,,51745.990273,,
min,1.0,-122.419091,47.495573,,,23807.0,,
25%,1.0,-122.348673,47.575956,,,28667.0,,
50%,1.0,-122.330224,47.615369,,,29973.0,,
75%,2.0,-122.311937,47.663664,,,33973.0,,


In [9]:
df[secondSet].describe(include='all')

Unnamed: 0,EXCEPTRSNDESC,SEVERITYCODE.1,SEVERITYDESC,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT
count,5638,194673.0,194673,189769,194673.0,194673.0,194673.0,194673.0
unique,1,,2,10,,,,
top,"Not Enough Information, or Insufficient Locati...",,Property Damage Only Collision,Parked Car,,,,
freq,5638,,136485,47987,,,,
mean,,1.298901,,,2.444427,0.037139,0.028391,1.92078
std,,0.457778,,,1.345929,0.19815,0.167413,0.631047
min,,1.0,,,0.0,0.0,0.0,0.0
25%,,1.0,,,2.0,0.0,0.0,2.0
50%,,1.0,,,2.0,0.0,0.0,2.0
75%,,2.0,,,3.0,0.0,0.0,2.0


In [10]:
df[thirdSet].describe(include='all')

Unnamed: 0,INCDATE,INCDTTM,JUNCTIONTYPE,SDOT_COLCODE,SDOT_COLDESC,INATTENTIONIND,UNDERINFL
count,194673,194673,188344,194673.0,194673,29805,189789
unique,5985,162058,7,,39,1,4
top,2006/11/02 00:00:00+00,11/2/2006,Mid-Block (not related to intersection),,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END ...",Y,N
freq,96,96,89800,,85209,29805,100274
mean,,,,13.867768,,,
std,,,,6.868755,,,
min,,,,0.0,,,
25%,,,,11.0,,,
50%,,,,13.0,,,
75%,,,,14.0,,,


In [11]:
df[forthSet].describe(include='all')

Unnamed: 0,WEATHER,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
count,189592,189661,189503,4667,114936.0,9333,194655.0,189769,194673.0,194673.0,194673
unique,11,9,9,1,,1,63.0,62,,,2
top,Clear,Dry,Daylight,Y,,Y,32.0,One parked--one moving,,,N
freq,111135,124510,116137,4667,,9333,44421.0,44421,,,187457
mean,,,,,7972521.0,,,,269.401114,9782.452,
std,,,,,2553533.0,,,,3315.776055,72269.26,
min,,,,,1007024.0,,,,0.0,0.0,
25%,,,,,6040015.0,,,,0.0,0.0,
50%,,,,,8023022.0,,,,0.0,0.0,
75%,,,,,10155010.0,,,,0.0,0.0,


In [12]:
df.corr()

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,INTKEY,SEVERITYCODE.1,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,SDOT_COLCODE,SDOTCOLNUM,SEGLANEKEY,CROSSWALKKEY
SEVERITYCODE,1.0,0.010309,0.017737,0.020131,0.022065,0.022079,0.006553,1.0,0.130949,0.246338,0.214218,-0.054686,0.188905,0.004226,0.104276,0.175093
X,0.010309,1.0,-0.160262,0.009956,0.010309,0.0103,0.120754,0.010309,0.012887,0.011304,-0.001752,-0.012168,0.010904,-0.001016,-0.001618,0.013586
Y,0.017737,-0.160262,1.0,-0.023848,-0.027396,-0.027415,-0.114935,0.017737,-0.01385,0.010178,0.026304,0.017058,-0.019694,-0.006958,0.004618,0.009508
OBJECTID,0.020131,0.009956,-0.023848,1.0,0.946383,0.945837,0.046929,0.020131,-0.062333,0.024604,0.034432,-0.09428,-0.037094,0.969276,0.028076,0.056046
INCKEY,0.022065,0.010309,-0.027396,0.946383,1.0,0.999996,0.048524,0.022065,-0.0615,0.024918,0.031342,-0.107528,-0.027617,0.990571,0.019701,0.048179
COLDETKEY,0.022079,0.0103,-0.027415,0.945837,0.999996,1.0,0.048499,0.022079,-0.061403,0.024914,0.031296,-0.107598,-0.027461,0.990571,0.019586,0.048063
INTKEY,0.006553,0.120754,-0.114935,0.046929,0.048524,0.048499,1.0,0.006553,0.001886,-0.004784,0.000531,-0.012929,0.007114,0.032604,-0.01051,0.01842
SEVERITYCODE.1,1.0,0.010309,0.017737,0.020131,0.022065,0.022079,0.006553,1.0,0.130949,0.246338,0.214218,-0.054686,0.188905,0.004226,0.104276,0.175093
PERSONCOUNT,0.130949,0.012887,-0.01385,-0.062333,-0.0615,-0.061403,0.001886,0.130949,1.0,-0.023464,-0.038809,0.380523,-0.12896,0.011784,-0.021383,-0.032258
PEDCOUNT,0.246338,0.011304,0.010178,0.024604,0.024918,0.024914,-0.004784,0.246338,-0.023464,1.0,-0.01692,-0.261285,0.260393,0.021461,0.00181,0.565326


## Geospatial View
A map of Seattle showing property damage (yellow) and injury (red) collisions was produced to see whether location was significant in the outcome of an incident.

In [13]:
limit = 5000
df_collisions = df.iloc[0:limit, :]

collisions = folium.map.FeatureGroup()

for lat, lng, severity in zip(df_collisions.Y, df_collisions.X, df_collisions.SEVERITYCODE):
    # skip collisions that have no coordinates
    if not math.isnan(lat) and not math.isnan(lng):
        # yellow = property damage only (SEVERITYCODE 1), red = injury (SEVERITYCODE 2)
        color = 'yellow' if severity == 1 else 'red'
        collisions.add_child(
            folium.features.CircleMarker(
                [lat, lng],
                radius=5,  # size of the circle markers
                color=color,
                fill=True,
                fill_color=color,  # fill with the severity colour so markers match the description
                fill_opacity=0.6
            )
        )

# define a map centered around Seattle
collision_map = folium.Map(location=[47.6062, -122.3321], zoom_start=12)
collision_map.add_child(collisions)

The overview of the first five thousand collisions does not show an obvious bias based on location, so location (X, Y) will not be used in modelling.
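
The visual check could be complemented with a numeric one. The sketch below is an assumption-laden illustration rather than the method used in this notebook: it bins collisions into a coarse latitude/longitude grid and reports the injury share per cell. The helper name `injury_share_by_grid` and the cell size of 0.02 degrees are hypothetical choices.

```python
import numpy as np
import pandas as pd


def injury_share_by_grid(frame: pd.DataFrame, cell_deg: float = 0.02) -> pd.DataFrame:
    """Count collisions and the share of injury collisions per grid cell."""
    located = frame.dropna(subset=["X", "Y"])  # ignore rows without coordinates
    cells = pd.DataFrame(
        {
            # integer cell indices avoid floating-point rounding issues in the keys
            "lat_cell": np.floor(located["Y"] / cell_deg).astype(int),
            "lon_cell": np.floor(located["X"] / cell_deg).astype(int),
            "injury": (located["SEVERITYCODE"] == 2).astype(int),
        }
    )
    return (
        cells.groupby(["lat_cell", "lon_cell"])["injury"]
        .agg(collisions="count", injury_share="mean")
        .reset_index()
    )


# Stand-in rows (coordinates loosely based on the head() output, one row without location).
sample = pd.DataFrame(
    {
        "X": [-122.3231, -122.3235, -122.3472, np.nan],
        "Y": [47.7031, 47.7035, 47.6471, 47.6000],
        "SEVERITYCODE": [2, 1, 1, 2],
    }
)
grid = injury_share_by_grid(sample)
print(grid)
```

A wide spread of `injury_share` across well-populated cells would suggest that location deserves a second look; a narrow spread would support the decision above.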

## Analysing Discrete Features
A number of the columns contain discrete values which merit further investigation. The value_counts method is used to provide a quick overview of the data.

In [14]:
df['SEVERITYCODE'].value_counts()

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64

In [15]:
df['STATUS'].value_counts()

Matched      189786
Unmatched      4887
Name: STATUS, dtype: int64

DECISION: STATUS will not be used for the prediction model.

In [16]:
df['ADDRTYPE'].value_counts()

Block           126926
Intersection     65070
Alley              751
Name: ADDRTYPE, dtype: int64

DECISION: ADDRTYPE appears useful for generic prediction along routes as Block, Intersection and Alley are relatively easy to determine for other road networks.

In [17]:
df['LOCATION'].value_counts()

BATTERY ST TUNNEL NB BETWEEN ALASKAN WY VI NB AND AURORA AVE N                          276
BATTERY ST TUNNEL SB BETWEEN AURORA AVE N AND ALASKAN WY VI SB                          271
N NORTHGATE WAY BETWEEN MERIDIAN AVE N AND CORLISS AVE N                                265
AURORA AVE N BETWEEN N 117TH PL AND N 125TH ST                                          254
6TH AVE AND JAMES ST                                                                    252
AURORA AVE N BETWEEN N 130TH ST AND N 135TH ST                                          239
ALASKAN WY VI NB BETWEEN S ROYAL BROUGHAM WAY ON RP AND SENECA ST OFF RP                238
RAINIER AVE S BETWEEN S BAYVIEW ST AND S MCCLELLAN ST                                   231
ALASKAN WY VI SB BETWEEN COLUMBIA ST ON RP AND ALASKAN WY VI SB EFR OFF RP              212
WEST SEATTLE BR EB BETWEEN ALASKAN WY VI NB ON RP AND DELRIDGE-W SEATTLE BR EB ON RP    212
AURORA BR BETWEEN RAYE ST AND BRIDGE WAY N                                      

DECISION: LOCATION appears too specific for a general purpose prediction.

In [18]:
df['EXCEPTRSNCODE'].value_counts()

       79173
NEI     5638
Name: EXCEPTRSNCODE, dtype: int64

In [19]:
df['EXCEPTRSNDESC'].value_counts()

Not Enough Information, or Insufficient Location Information    5638
Name: EXCEPTRSNDESC, dtype: int64

DECISION: EXCEPTRSNCODE and EXCEPTRSNDESC may be useful for identifying and dropping records with incomplete information.
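
One way this filtering might look is sketched below. It assumes that the `NEI` code is the only marker of insufficient information, as the value counts above suggest, and that blank or missing codes mean the record is usable. The helper name `drop_insufficient_information` is hypothetical.

```python
import pandas as pd


def drop_insufficient_information(frame: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows whose EXCEPTRSNCODE is not 'NEI'."""
    # Blank strings and missing values are treated as "no exception recorded".
    codes = frame["EXCEPTRSNCODE"].fillna("").astype(str).str.strip()
    return frame[codes != "NEI"]


# Stand-in frame mixing blank, missing and NEI codes.
sample = pd.DataFrame(
    {"EXCEPTRSNCODE": [" ", "NEI", None, "NEI"], "SEVERITYCODE": [1, 1, 2, 2]}
)
complete = drop_insufficient_information(sample)
print(len(complete))
```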

In [20]:
df['SEVERITYCODE.1'].value_counts()

1    136485
2     58188
Name: SEVERITYCODE.1, dtype: int64

In [21]:
df['SEVERITYDESC'].value_counts()

Property Damage Only Collision    136485
Injury Collision                   58188
Name: SEVERITYDESC, dtype: int64

DECISION: SEVERITYCODE.1 and SEVERITYDESC appear to be duplicates of the SEVERITYCODE column and will not be evaluated further.
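
The matching value counts suggest duplication but do not prove it row by row. A minimal sketch of a stricter check follows; the helper name `severity_columns_are_redundant` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def severity_columns_are_redundant(frame: pd.DataFrame) -> bool:
    """True if SEVERITYCODE.1 repeats SEVERITYCODE and each code has one description."""
    same_code = frame["SEVERITYCODE"].equals(frame["SEVERITYCODE.1"])
    descriptions_per_code = frame.groupby("SEVERITYCODE")["SEVERITYDESC"].nunique()
    one_description_each = bool((descriptions_per_code == 1).all())
    return same_code and one_description_each


# Stand-in frame using the two descriptions that appear in the dataset.
sample = pd.DataFrame(
    {
        "SEVERITYCODE": [1, 2, 1],
        "SEVERITYCODE.1": [1, 2, 1],
        "SEVERITYDESC": [
            "Property Damage Only Collision",
            "Injury Collision",
            "Property Damage Only Collision",
        ],
    }
)
redundant = severity_columns_are_redundant(sample)
mismatched = severity_columns_are_redundant(sample.assign(**{"SEVERITYCODE.1": [1, 1, 1]}))
print(redundant, mismatched)
```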

In [22]:
df['COLLISIONTYPE'].value_counts()

Parked Car    47987
Angles        34674
Rear Ended    34090
Other         23703
Sideswipe     18609
Left Turn     13703
Pedestrian     6608
Cycles         5415
Right Turn     2956
Head On        2024
Name: COLLISIONTYPE, dtype: int64

DECISION: COLLISIONTYPE is unlikely to be useful as a direct input because the type of a collision is difficult to predict in advance. It may still be analysed further during modelling, as it may be correlated with other features useful for determining routes (e.g. a Left Turn at an intersection may be more likely to result in an injury, which may require an alternate route).

In [23]:
df['PERSONCOUNT'].value_counts()

2     114231
3      35553
4      14660
1      13154
5       6584
0       5544
6       2702
7       1131
8        533
9        216
10       128
11        56
12        33
13        21
14        19
15        11
17        11
16         8
44         6
18         6
20         6
25         6
19         5
26         4
22         4
27         3
28         3
29         3
47         3
32         3
34         3
37         3
23         2
21         2
24         2
30         2
36         2
57         1
31         1
35         1
39         1
41         1
43         1
48         1
53         1
54         1
81         1
Name: PERSONCOUNT, dtype: int64

In [24]:
df['PEDCOUNT'].value_counts()

0    187734
1      6685
2       226
3        22
4         4
6         1
5         1
Name: PEDCOUNT, dtype: int64

In [25]:
df['PEDCYLCOUNT'].value_counts()

0    189189
1      5441
2        43
Name: PEDCYLCOUNT, dtype: int64

In [26]:
df['PEDROWNOTGRNT'].value_counts()

Y    4667
Name: PEDROWNOTGRNT, dtype: int64

In [27]:
df['VEHCOUNT'].value_counts()

2     147650
1      25748
3      13010
0       5085
4       2426
5        529
6        146
7         46
8         15
9          9
11         6
10         2
12         1
Name: VEHCOUNT, dtype: int64

DECISION: The counts will not be further evaluated as they are a consequence of a collision and are unlikely to predict severity.

In [28]:
df['INCDATE'].value_counts()

2006/11/02 00:00:00+00    96
2008/10/03 00:00:00+00    92
2005/05/18 00:00:00+00    84
2005/11/05 00:00:00+00    83
2006/01/13 00:00:00+00    83
2008/10/31 00:00:00+00    82
2005/04/29 00:00:00+00    76
2005/04/15 00:00:00+00    75
2004/12/04 00:00:00+00    74
2007/10/19 00:00:00+00    74
2006/06/01 00:00:00+00    73
2016/10/13 00:00:00+00    73
2005/10/28 00:00:00+00    73
2007/07/20 00:00:00+00    73
2007/11/15 00:00:00+00    70
2006/11/04 00:00:00+00    70
2010/11/22 00:00:00+00    70
2006/10/18 00:00:00+00    70
2006/11/22 00:00:00+00    69
2005/11/04 00:00:00+00    69
2005/12/10 00:00:00+00    68
2005/11/11 00:00:00+00    68
2010/10/09 00:00:00+00    68
2006/04/08 00:00:00+00    68
2006/11/06 00:00:00+00    68
2006/05/05 00:00:00+00    68
2006/11/10 00:00:00+00    68
2007/01/05 00:00:00+00    68
2006/11/21 00:00:00+00    68
2006/02/24 00:00:00+00    67
                          ..
2020/04/12 00:00:00+00     6
2020/05/04 00:00:00+00     6
2020/04/07 00:00:00+00     6
2020/03/21 00:

DECISION: INCDATE may be evaluated further to determine whether season or month can improve the accuracy of the model beyond just weather, road condition or light. 
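
Should this be pursued, month and season could be derived along the following lines. This is a sketch under stated assumptions: INCDATE is taken to start with a `YYYY/MM/DD` date as in the output above, the season boundaries are the northern-hemisphere meteorological ones (December to February as winter), and the names `add_month_and_season`, `MONTH` and `SEASON` are hypothetical.

```python
import pandas as pd

# Index 0..3 corresponds to (month % 12) // 3, so December, January, February map to winter.
SEASONS = ["winter", "spring", "summer", "autumn"]


def add_month_and_season(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with MONTH and SEASON derived from INCDATE."""
    # Only the first ten characters (the date part) are parsed; the time/offset suffix is ignored.
    dates = pd.to_datetime(frame["INCDATE"].str.slice(0, 10), format="%Y/%m/%d")
    out = frame.copy()
    out["MONTH"] = dates.dt.month
    out["SEASON"] = (dates.dt.month % 12 // 3).map(lambda i: SEASONS[i])
    return out


# Stand-in rows using date strings that appear in the value counts above.
sample = pd.DataFrame(
    {
        "INCDATE": [
            "2006/11/02 00:00:00+00",
            "2005/12/10 00:00:00+00",
            "2006/06/01 00:00:00+00",
        ]
    }
)
dated = add_month_and_season(sample)
print(dated[["MONTH", "SEASON"]])
```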

In [29]:
df['ST_COLDESC'].value_counts()

One parked--one moving                                                                   44421
Entering at angle                                                                        34674
From same direction - both going straight - one stopped - rear-end                       25771
Fixed object                                                                             13554
From same direction - both going straight - both moving - sideswipe                      12777
From opposite direction - one left turn - one straight                                   10324
From same direction - both going straight - both moving - rear-end                        7629
Vehicle - Pedalcyclist                                                                    4701
From same direction - all others                                                          4537
From same direction - one left turn - one straight                                        3093
From same direction - one right turn - one straigh

In [30]:
df['HITPARKEDCAR'].value_counts()

N    187457
Y      7216
Name: HITPARKEDCAR, dtype: int64

In [31]:
df['INCDTTM'].value_counts()

11/2/2006                 96
10/3/2008                 91
11/5/2005                 83
12/4/2004                 74
6/1/2006                  73
11/4/2006                 70
11/4/2005                 69
5/5/2006                  68
11/6/2006                 68
1/5/2007                  68
4/8/2006                  68
11/1/2005                 67
11/1/2008                 67
3/8/2006                  65
10/6/2006                 65
1/9/2006                  64
1/2/2004                  64
11/3/2006                 64
10/6/2005                 62
8/6/2004                  62
7/8/2005                  61
6/9/2005                  61
10/2/2007                 60
4/3/2006                  60
11/6/2008                 60
5/6/2009                  60
2/5/2008                  59
2/2/2006                  59
6/1/2007                  59
12/1/2005                 58
                          ..
2/9/2019 6:51:00 PM        1
12/28/2007 7:03:00 PM      1
1/17/2008 4:00:00 PM       1
3/29/2009 1:59

DECISION: INCDTTM may be used in place of INCDATE if date based improvements are required.
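
One caveat visible in the output above is that many INCDTTM values carry only a date, so any time-of-day feature is available for a subset of rows. The sketch below illustrates one possible way to extract the hour while leaving date-only rows as missing. It assumes timestamps follow the `M/D/YYYY H:MM:SS AM/PM` pattern seen in the output; the helper name `extract_hour` is hypothetical.

```python
import pandas as pd


def extract_hour(incdttm: pd.Series) -> pd.Series:
    """Hour of day (0-23) for values that include a time, NaN for date-only values."""
    text = incdttm.astype(str).str.strip()
    has_time = text.str.contains(":", regex=False)
    # Date-only values are masked out so they are not parsed as midnight.
    parsed = pd.to_datetime(
        text.where(has_time), format="%m/%d/%Y %I:%M:%S %p", errors="coerce"
    )
    return parsed.dt.hour


# Stand-in values copied from the value counts above.
sample = pd.Series(["11/2/2006", "2/9/2019 6:51:00 PM", "1/17/2008 4:00:00 PM"])
hours = extract_hour(sample)
print(hours)
```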

In [32]:
df['JUNCTIONTYPE'].value_counts()

Mid-Block (not related to intersection)              89800
At Intersection (intersection related)               62810
Mid-Block (but intersection related)                 22790
Driveway Junction                                    10671
At Intersection (but not related to intersection)     2098
Ramp Junction                                          166
Unknown                                                  9
Name: JUNCTIONTYPE, dtype: int64

DECISION: JUNCTIONTYPE may be used if there is correlation with severity.

In [33]:
df['SDOT_COLDESC'].value_counts()

MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END AT ANGLE          85209
MOTOR VEHICLE STRUCK MOTOR VEHICLE, REAR END                    54299
MOTOR VEHICLE STRUCK MOTOR VEHICLE, LEFT SIDE SIDESWIPE          9928
NOT ENOUGH INFORMATION / NOT APPLICABLE                          9787
MOTOR VEHICLE RAN OFF ROAD - HIT FIXED OBJECT                    8856
MOTOR VEHCILE STRUCK PEDESTRIAN                                  6518
MOTOR VEHICLE STRUCK MOTOR VEHICLE, LEFT SIDE AT ANGLE           5852
MOTOR VEHICLE STRUCK OBJECT IN ROAD                              4741
MOTOR VEHICLE STRUCK PEDALCYCLIST, FRONT END AT ANGLE            3104
MOTOR VEHICLE STRUCK MOTOR VEHICLE, RIGHT SIDE SIDESWIPE         1604
MOTOR VEHICLE STRUCK MOTOR VEHICLE, RIGHT SIDE AT ANGLE          1440
PEDALCYCLIST STRUCK MOTOR VEHICLE FRONT END AT ANGLE             1312
MOTOR VEHICLE OVERTURNED IN ROAD                                  479
MOTOR VEHICLE STRUCK PEDALCYCLIST, REAR END                       181
PEDALCYCLIST STRUCK 

DECISION: SDOT_COLDESC will not be used for predicting severity.

In [34]:
df['INATTENTIONIND'].value_counts()

Y    29805
Name: INATTENTIONIND, dtype: int64

In [35]:
df['UNDERINFL'].value_counts()

N    100274
0     80394
Y      5126
1      3995
Name: UNDERINFL, dtype: int64

DECISION: INATTENTIONIND and UNDERINFL will not be used for predicting severity as they will not be an input into route planning.
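
Although UNDERINFL is not used here, the value counts show it mixes two encodings (`N`/`Y` and `0`/`1`). If the column were revisited, it would probably need normalising first. The sketch below assumes that `0` means the same as `N` and `1` the same as `Y`, which is not confirmed by the output above; the helper name `normalise_underinfl` is hypothetical.

```python
import pandas as pd

# Assumed equivalence between the two encodings found in the column.
UNDERINFL_MAPPING = {"N": 0, "0": 0, "Y": 1, "1": 1}


def normalise_underinfl(column: pd.Series) -> pd.Series:
    """Map N/0 to 0 and Y/1 to 1; anything else (including missing) becomes NaN."""
    return column.astype(str).str.strip().map(UNDERINFL_MAPPING)


sample = pd.Series(["N", "0", "Y", "1", None])
flags = normalise_underinfl(sample)
print(flags.tolist())
```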

In [36]:
df['WEATHER'].value_counts()

Clear                       111135
Raining                      33145
Overcast                     27714
Unknown                      15091
Snowing                        907
Other                          832
Fog/Smog/Smoke                 569
Sleet/Hail/Freezing Rain       113
Blowing Sand/Dirt               56
Severe Crosswind                25
Partly Cloudy                    5
Name: WEATHER, dtype: int64

In [37]:
df['ROADCOND'].value_counts()

Dry               124510
Wet                47474
Unknown            15078
Ice                 1209
Snow/Slush          1004
Other                132
Standing Water       115
Sand/Mud/Dirt         75
Oil                   64
Name: ROADCOND, dtype: int64

In [38]:
df['LIGHTCOND'].value_counts()

Daylight                    116137
Dark - Street Lights On      48507
Unknown                      13473
Dusk                          5902
Dawn                          2502
Dark - No Street Lights       1537
Dark - Street Lights Off      1199
Other                          235
Dark - Unknown Lighting         11
Name: LIGHTCOND, dtype: int64

DECISION: WEATHER, ROADCOND and LIGHTCOND are likely to be useful and will require further analysis.
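
Part of that further analysis could be measuring how much of each column is uninformative. The following sketch is illustrative only: it treats missing values, `Unknown` and `Other` as uninformative, which is an assumption, and the names `CONDITION_COLUMNS` and `uninformative_share` are hypothetical.

```python
import pandas as pd

CONDITION_COLUMNS = ["WEATHER", "ROADCOND", "LIGHTCOND"]


def uninformative_share(frame: pd.DataFrame, columns=CONDITION_COLUMNS) -> pd.Series:
    """Fraction of rows per column that are missing, 'Unknown' or 'Other'."""
    shares = {}
    for col in columns:
        values = frame[col]
        uninformative = values.isna() | values.isin(["Unknown", "Other"])
        shares[col] = uninformative.mean()
    return pd.Series(shares, name="uninformative_share")


# Stand-in frame with category labels taken from the value counts above.
sample = pd.DataFrame(
    {
        "WEATHER": ["Clear", "Unknown", None, "Raining"],
        "ROADCOND": ["Dry", "Wet", "Wet", "Other"],
        "LIGHTCOND": ["Daylight", "Daylight", "Daylight", "Daylight"],
    }
)
shares = uninformative_share(sample)
print(shares)
```
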
DECISION: The remaining columns below will not be used for model development.

In [39]:
df['PEDROWNOTGRNT'].value_counts()

Y    4667
Name: PEDROWNOTGRNT, dtype: int64

In [40]:
df['SDOTCOLNUM'].value_counts()

4116034.0     2
11200007.0    2
4112025.0     2
4116048.0     2
5036003.0     1
12030005.0    1
5036023.0     1
10161007.0    1
4028036.0     1
7087008.0     1
12004052.0    1
10161018.0    1
12027022.0    1
5036011.0     1
10342027.0    1
11161009.0    1
4028033.0     1
6078022.0     1
10278010.0    1
6078010.0     1
7087039.0     1
7219004.0     1
10209035.0    1
6078007.0     1
5118001.0     1
11210029.0    1
6316024.0     1
8209029.0     1
8161007.0     1
10204033.0    1
             ..
5051014.0     1
11228013.0    1
11213021.0    1
12250006.0    1
12236001.0    1
11358002.0    1
8267010.0     1
8152044.0     1
9212041.0     1
11212026.0    1
9070044.0     1
8152036.0     1
6262034.0     1
9002013.0     1
11172019.0    1
9278028.0     1
8337019.0     1
6176011.0     1
9097020.0     1
8337011.0     1
6277017.0     1
6227002.0     1
8337005.0     1
6277012.0     1
12157032.0    1
6262043.0     1
6162018.0     1
6277003.0     1
8267022.0     1
5071015.0     1
Name: SDOTCOLNUM, Length

In [41]:
df['SPEEDING'].value_counts()

Y    9333
Name: SPEEDING, dtype: int64

In [42]:
df['SDOT_COLCODE'].value_counts()

11    85209
14    54299
16     9928
0      9787
28     8856
24     6518
13     5852
26     4741
18     3104
15     1604
12     1440
51     1312
29      479
21      181
56      180
27      166
54      139
23      124
48      107
31      104
25      102
34       93
64       75
69       69
33       53
55       50
66       23
22       17
32       12
53        9
44        8
61        7
35        6
58        5
68        4
36        4
46        3
52        2
47        1
Name: SDOT_COLCODE, dtype: int64

In [43]:
df['ST_COLCODE'].value_counts()

32    44421
10    34674
14    25771
50    13554
11    12777
28    10324
13     7629
       4886
45     4701
23     4537
15     3093
16     2956
0      2882
20     2846
12     2435
22     2274
2      2178
21     1617
30     1302
1      1201
71     1184
26     1039
81      835
52      815
19      720
24      590
5       416
51      371
74      343
29      286
      ...  
73      167
25      132
4       111
57      108
40      103
84       94
83       86
72       73
41       57
64       50
31       47
82       35
56       34
48       32
53       26
8        23
7        18
66       11
42       11
65       11
17        9
67        9
88        8
54        7
18        5
43        2
87        2
60        1
49        1
85        1
Name: ST_COLCODE, Length: 63, dtype: int64

In [44]:
df['ST_COLDESC'].value_counts()

One parked--one moving                                                                   44421
Entering at angle                                                                        34674
From same direction - both going straight - one stopped - rear-end                       25771
Fixed object                                                                             13554
From same direction - both going straight - both moving - sideswipe                      12777
From opposite direction - one left turn - one straight                                   10324
From same direction - both going straight - both moving - rear-end                        7629
Vehicle - Pedalcyclist                                                                    4701
From same direction - all others                                                          4537
From same direction - one left turn - one straight                                        3093
From same direction - one right turn - one straigh

In [45]:
df['SEGLANEKEY'].value_counts()

0         191907
6532          19
6078          16
12162         15
10336         14
10342         13
8985          12
10354         10
10420         10
8816          10
12179         10
10368          9
10590          8
8995           8
10773          8
42777          7
10566          7
12941          7
10374          7
12649          6
8990           6
8240           6
12035          6
10532          6
42166          6
23507          6
6322           6
9002           6
10408          6
2426           6
           ...  
4467           1
6854           1
9153           1
13251          1
13891          1
35669          1
19149          1
18762          1
32460          1
8647           1
18890          1
6848           1
41943          1
15428          1
7360           1
38097          1
23674          1
11718          1
34771          1
25288          1
6215           1
41040          1
10433          1
6343           1
37987          1
35157          1
10817          1
15043         

In [46]:
df['CROSSWALKKEY'].value_counts()

0         190862
523609        17
520838        15
525567        13
521707        10
523699        10
523148         9
521863         9
521604         9
523735         9
524265         9
522891         9
522264         8
524689         8
525659         8
521040         8
523987         8
520855         8
523109         8
524029         8
522108         8
522377         8
524178         8
525644         8
521845         7
524221         7
523172         7
525079         7
521865         7
523707         7
           ...  
523320         1
525639         1
523704         1
616043         1
523578         1
29899          1
525381         1
521275         1
522811         1
522939         1
31563          1
523195         1
525508         1
521530         1
521658         1
26056          1
522373         1
523963         1
524091         1
524219         1
522298         1
522426         1
37207          1
524997         1
619243         1
521019         1
630862         1
25545         

In [47]:
df['HITPARKEDCAR'].value_counts()

N    187457
Y      7216
Name: HITPARKEDCAR, dtype: int64

## Further Analysis

This section captures a quick analysis of the candidate data for inclusion in the model.

First, the relationship between SEVERITYCODE and each of the candidates is assessed using group counts and some basic statistics (count, mean and standard deviation). The mean and standard deviation are relevant because the severity code is either 1 or 2, so a mean closer to 2 indicates a higher likelihood of an injury.
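
Because SEVERITYCODE only takes the values 1 and 2, the mean of a group equals 1 plus the share of injury collisions in that group. For the whole dataset, 58188 / 194673 ≈ 0.2989, which agrees with the overall mean of 1.298901 reported earlier. The sketch below makes that share explicit; the helper name `severity_summary` and the `injury_share` column are illustrative additions rather than part of the original notebook.

```python
import pandas as pd


def severity_summary(frame: pd.DataFrame, feature: str) -> pd.DataFrame:
    """Count, mean and std of SEVERITYCODE per category, plus the injury share."""
    summary = frame.groupby(feature)["SEVERITYCODE"].agg(["count", "mean", "std"])
    # With codes 1 and 2 only: mean = 1 * (1 - p) + 2 * p = 1 + p
    summary["injury_share"] = summary["mean"] - 1
    return summary


# Stand-in frame: category A has 1 injury in 4 collisions, category B has 1 in 2.
sample = pd.DataFrame(
    {
        "FEATURE": ["A", "A", "A", "A", "B", "B"],
        "SEVERITYCODE": [1, 1, 1, 2, 1, 2],
    }
)
summary = severity_summary(sample, "FEATURE")
print(summary)
```

With the real data this would be called as, for example, `severity_summary(df, 'ADDRTYPE')`.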

Further analysis of EXCEPTRSNCODE, EXCEPTRSNDESC and INCDATE may be performed in subsequent phases of the project.

### ADDRTYPE

In [48]:
df.groupby(['ADDRTYPE','SEVERITYCODE']).size()

ADDRTYPE      SEVERITYCODE
Alley         1                 669
              2                  82
Block         1               96830
              2               30096
Intersection  1               37251
              2               27819
dtype: int64

In [49]:
df.groupby('ADDRTYPE').agg({'SEVERITYCODE': ['count', 'mean', 'std']})

ADDRTYPE,SEVERITYCODE count,SEVERITYCODE mean,SEVERITYCODE std
Alley,751,1.109188,0.312082
Block,126926,1.237115,0.425315
Intersection,65070,1.427524,0.494723


DECISION: ADDRTYPE will be used in the model as there is a significant difference in the severity ratio between Alley, Block and Intersection.
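
The word "significant" can be backed by a chi-square test of independence on the counts from the groupby output above. The sketch below computes the statistic by hand with NumPy; it is an illustrative check, not an analysis performed in the original notebook, and the helper name `chi_square_statistic` is hypothetical.

```python
import numpy as np

# Rows: Alley, Block, Intersection. Columns: SEVERITYCODE 1, SEVERITYCODE 2.
# Counts copied from the ADDRTYPE/SEVERITYCODE groupby output above.
observed = np.array(
    [[669, 82], [96830, 30096], [37251, 27819]], dtype=float
)


def chi_square_statistic(table: np.ndarray):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = table.sum(axis=1, keepdims=True)
    col_totals = table.sum(axis=0, keepdims=True)
    # Expected counts if severity were independent of address type.
    expected = row_totals * col_totals / table.sum()
    statistic = ((table - expected) ** 2 / expected).sum()
    dof = (table.shape[0] - 1) * (table.shape[1] - 1)
    return statistic, dof


statistic, dof = chi_square_statistic(observed)
print(f"chi-square = {statistic:.1f}, degrees of freedom = {dof}")
```

For 2 degrees of freedom the critical value at the 0.001 level is about 13.82, so a statistic far above that indicates the difference in severity ratio is very unlikely to be due to chance.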

### JUNCTIONTYPE

In [50]:
df.groupby(['JUNCTIONTYPE','SEVERITYCODE']).size()

JUNCTIONTYPE                                       SEVERITYCODE
At Intersection (but not related to intersection)  1                1475
                                                   2                 623
At Intersection (intersection related)             1               35636
                                                   2               27174
Driveway Junction                                  1                7437
                                                   2                3234
Mid-Block (but intersection related)               1               15493
                                                   2                7297
Mid-Block (not related to intersection)            1               70396
                                                   2               19404
Ramp Junction                                      1                 112
                                                   2                  54
Unknown                                            1        

In [51]:
df.groupby('JUNCTIONTYPE').agg({'SEVERITYCODE': ['count', 'mean', 'std']})

JUNCTIONTYPE,SEVERITYCODE count,SEVERITYCODE mean,SEVERITYCODE std
At Intersection (but not related to intersection),2098,1.296949,0.457023
At Intersection (intersection related),62810,1.432638,0.495446
Driveway Junction,10671,1.303064,0.459604
Mid-Block (but intersection related),22790,1.320184,0.466557
Mid-Block (not related to intersection),89800,1.21608,0.411572
Ramp Junction,166,1.325301,0.469905
Unknown,9,1.222222,0.440959


DECISION: JUNCTIONTYPE appears to overlap with ADDRTYPE, so it will only be added to the model after the first iteration if accuracy needs to be improved.
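
The suspected overlap can be inspected with a cross-tabulation. The sketch below is a hedged illustration, with the helper name `overlap_table` being hypothetical: each row shows how the collisions of one ADDRTYPE are distributed across JUNCTIONTYPE values, so rows dominated by a single column indicate that the two features carry largely the same information.

```python
import pandas as pd


def overlap_table(frame: pd.DataFrame) -> pd.DataFrame:
    """Row-normalised cross-tabulation of ADDRTYPE against JUNCTIONTYPE."""
    return pd.crosstab(frame["ADDRTYPE"], frame["JUNCTIONTYPE"], normalize="index")


# Stand-in frame using category labels from the value counts above.
sample = pd.DataFrame(
    {
        "ADDRTYPE": ["Block", "Block", "Intersection", "Intersection"],
        "JUNCTIONTYPE": [
            "Mid-Block (not related to intersection)",
            "Driveway Junction",
            "At Intersection (intersection related)",
            "At Intersection (intersection related)",
        ],
    }
)
table = overlap_table(sample)
print(table)
```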

### WEATHER

In [52]:
df.groupby(['WEATHER','SEVERITYCODE']).size()

WEATHER                   SEVERITYCODE
Blowing Sand/Dirt         1                  41
                          2                  15
Clear                     1               75295
                          2               35840
Fog/Smog/Smoke            1                 382
                          2                 187
Other                     1                 716
                          2                 116
Overcast                  1               18969
                          2                8745
Partly Cloudy             1                   2
                          2                   3
Raining                   1               21969
                          2               11176
Severe Crosswind          1                  18
                          2                   7
Sleet/Hail/Freezing Rain  1                  85
                          2                  28
Snowing                   1                 736
                          2                 171
U

In [53]:
df.groupby('WEATHER').agg({'SEVERITYCODE': ['count', 'mean', 'std']})

WEATHER,SEVERITYCODE count,SEVERITYCODE mean,SEVERITYCODE std
Blowing Sand/Dirt,56,1.267857,0.44685
Clear,111135,1.322491,0.467432
Fog/Smog/Smoke,569,1.328647,0.470135
Other,832,1.139423,0.346596
Overcast,27714,1.315544,0.464741
Partly Cloudy,5,1.6,0.547723
Raining,33145,1.337185,0.472756
Severe Crosswind,25,1.28,0.458258
Sleet/Hail/Freezing Rain,113,1.247788,0.433651
Snowing,907,1.188534,0.391353


DECISION: WEATHER will be used in the model as there appears to be enough variation across the different weather conditions that it may be useful.
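Some of the weather categories above have very few rows (for example, Partly Cloudy has a count of 5), so their mean severity is unlikely to be reliable. The sketch below shows one possible way to flag such categories before comparing them; the toy data and the `MIN_COUNT` threshold are assumptions made for illustration.

```python
import pandas as pd

# Toy data: severity code 1 or 2 per collision (illustrative values only)
toy = pd.DataFrame({
    "WEATHER": ["Clear"] * 6 + ["Raining"] * 4 + ["Partly Cloudy"] * 1,
    "SEVERITYCODE": [1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 2],
})

MIN_COUNT = 3  # hypothetical minimum number of rows for a category to be trusted

# With codes 1 and 2, the share of severity-2 rows equals (mean severity - 1)
toy["is_sev2"] = (toy["SEVERITYCODE"] == 2).astype(int)
summary = toy.groupby("WEATHER")["is_sev2"].agg(["count", "mean"])

# Keep only categories with enough observations
reliable = summary[summary["count"] >= MIN_COUNT]
print(reliable)
```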

### ROADCOND

In [54]:
df.groupby(['ROADCOND','SEVERITYCODE']).size()

ROADCOND        SEVERITYCODE
Dry             1               84446
                2               40064
Ice             1                 936
                2                 273
Oil             1                  40
                2                  24
Other           1                  89
                2                  43
Sand/Mud/Dirt   1                  52
                2                  23
Snow/Slush      1                 837
                2                 167
Standing Water  1                  85
                2                  30
Unknown         1               14329
                2                 749
Wet             1               31719
                2               15755
dtype: int64

In [55]:
df.groupby('ROADCOND').agg({'SEVERITYCODE': ['count', 'mean', 'std']})

ROADCOND,SEVERITYCODE count,SEVERITYCODE mean,SEVERITYCODE std
Dry,124510,1.321773,0.467158
Ice,1209,1.225806,0.418285
Oil,64,1.375,0.48795
Other,132,1.325758,0.470443
Sand/Mud/Dirt,75,1.306667,0.464215
Snow/Slush,1004,1.166335,0.372566
Standing Water,115,1.26087,0.441031
Unknown,15078,1.049675,0.21728
Wet,47474,1.331866,0.470888


DECISION: ROADCOND will be used in the model as there appears to be enough variation across the different road conditions that it may be useful.

### LIGHTCOND

In [56]:
df.groupby(['LIGHTCOND','SEVERITYCODE']).size()

LIGHTCOND                 SEVERITYCODE
Dark - No Street Lights   1                1203
                          2                 334
Dark - Street Lights Off  1                 883
                          2                 316
Dark - Street Lights On   1               34032
                          2               14475
Dark - Unknown Lighting   1                   7
                          2                   4
Dawn                      1                1678
                          2                 824
Daylight                  1               77593
                          2               38544
Dusk                      1                3958
                          2                1944
Other                     1                 183
                          2                  52
Unknown                   1               12868
                          2                 605
dtype: int64

In [57]:
df.groupby('LIGHTCOND').agg({'SEVERITYCODE': ['count', 'mean', 'std']})

LIGHTCOND,SEVERITYCODE count,SEVERITYCODE mean,SEVERITYCODE std
Dark - No Street Lights,1537,1.217306,0.412547
Dark - Street Lights Off,1199,1.263553,0.440743
Dark - Street Lights On,48507,1.298411,0.457565
Dark - Unknown Lighting,11,1.363636,0.504525
Dawn,2502,1.329337,0.470066
Daylight,116137,1.331884,0.470892
Dusk,5902,1.32938,0.470028
Other,235,1.221277,0.415992
Unknown,13473,1.044905,0.207102


DECISION: LIGHTCOND will be used in the model as there appears to be enough variation across the different light conditions that it may be useful.

### LOCATION

In [58]:
df.groupby('LOCATION').agg({'SEVERITYCODE': ['count', 'mean', 'std']})

LOCATION,SEVERITYCODE count,SEVERITYCODE mean,SEVERITYCODE std
10TH AVE AND E ALDER ST,1,1.000000,
10TH AVE AND E JEFFERSON ST,10,1.100000,0.316228
10TH AVE AND E MADISON ST,10,1.100000,0.316228
10TH AVE AND E PIKE ST,23,1.391304,0.499011
10TH AVE AND E PINE ST,21,1.523810,0.511766
10TH AVE AND E SENECA ST,29,1.586207,0.501230
10TH AVE AND E SPRUCE ST,1,1.000000,
10TH AVE AND E TERRACE ST,7,1.285714,0.487950
10TH AVE AND E UNION ST,24,1.291667,0.464306
10TH AVE AND E YESLER WAY,10,1.500000,0.527046


# Data Preparation

In [3]:
collision_df = df[['SEVERITYCODE','ADDRTYPE','WEATHER','ROADCOND','LIGHTCOND']]

In [4]:
collision_df.head()

Unnamed: 0,SEVERITYCODE,ADDRTYPE,WEATHER,ROADCOND,LIGHTCOND
0,2,Intersection,Overcast,Wet,Daylight
1,1,Block,Raining,Wet,Dark - Street Lights On
2,1,Block,Overcast,Dry,Daylight
3,1,Block,Clear,Dry,Daylight
4,2,Intersection,Raining,Wet,Daylight


In [5]:
missing_data = collision_df.isnull()

In [6]:
missing_data.head()

Unnamed: 0,SEVERITYCODE,ADDRTYPE,WEATHER,ROADCOND,LIGHTCOND
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False


In [7]:
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("")  

SEVERITYCODE
False    194673
Name: SEVERITYCODE, dtype: int64

ADDRTYPE
False    192747
True       1926
Name: ADDRTYPE, dtype: int64

WEATHER
False    189592
True       5081
Name: WEATHER, dtype: int64

ROADCOND
False    189661
True       5012
Name: ROADCOND, dtype: int64

LIGHTCOND
False    189503
True       5170
Name: LIGHTCOND, dtype: int64
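The per-column loop above can also be condensed into a single summary table. This is a minimal sketch on toy data (not the collision dataset); the column name `missing_pct` is introduced here for illustration.

```python
import pandas as pd

# Toy frame with a few missing entries (illustrative values only)
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 1],
    "ADDRTYPE": ["Block", None, "Intersection", "Block"],
    "WEATHER": ["Clear", "Raining", None, None],
})

# One row per column: number and percentage of missing values
missing_summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "missing_pct": toy.isnull().mean() * 100,
})
print(missing_summary)
```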



In [8]:
# Reassign instead of using inplace=True: collision_df is a slice of df,
# and modifying it in place triggers pandas' SettingWithCopyWarning
collision_df = collision_df.dropna(axis=0)


In [9]:
collision_df.shape

(187525, 5)

In [10]:
dummy_variable_1 = pd.get_dummies(collision_df["ADDRTYPE"])
dummy_variable_1.head()

Unnamed: 0,Alley,Block,Intersection
0,0,0,1
1,0,1,0
2,0,1,0
3,0,1,0
4,0,0,1


In [11]:
dummy_df = pd.concat([collision_df, pd.get_dummies(collision_df["ADDRTYPE"])], axis=1)
dummy_df.drop("ADDRTYPE", axis = 1, inplace=True)
dummy_df = pd.concat([dummy_df, pd.get_dummies(collision_df["WEATHER"])], axis=1)
dummy_df.drop("WEATHER", axis = 1, inplace=True)
dummy_df.rename(columns = {'Other':'Other Weather'}, inplace = True)
dummy_df.rename(columns = {'Unknown':'Unknown Weather'}, inplace = True)
dummy_df = pd.concat([dummy_df, pd.get_dummies(collision_df["ROADCOND"])], axis=1)
dummy_df.drop("ROADCOND", axis = 1, inplace=True)
dummy_df.rename(columns = {'Other':'Other Road'}, inplace = True)
dummy_df.rename(columns = {'Unknown':'Unknown Road'}, inplace = True)
dummy_df = pd.concat([dummy_df, pd.get_dummies(collision_df["LIGHTCOND"])], axis=1)
dummy_df.drop("LIGHTCOND", axis = 1, inplace=True)
dummy_df.rename(columns = {'Other':'Other Light'}, inplace = True)
dummy_df.rename(columns = {'Unknown':'Unknown Light'}, inplace = True)

In [12]:
dummy_df.head()

Unnamed: 0,SEVERITYCODE,Alley,Block,Intersection,Blowing Sand/Dirt,Clear,Fog/Smog/Smoke,Other Weather,Overcast,Partly Cloudy,...,Wet,Dark - No Street Lights,Dark - Street Lights Off,Dark - Street Lights On,Dark - Unknown Lighting,Dawn,Daylight,Dusk,Other Light,Unknown Light
0,2,0,0,1,0,0,0,0,1,0,...,1,0,0,0,0,0,1,0,0,0
1,1,0,1,0,0,0,0,0,0,0,...,1,0,0,1,0,0,0,0,0,0
2,1,0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,1,0,0,0
3,1,0,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,2,0,0,1,0,0,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
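The concat/drop/rename sequence in the cell above exists mainly to keep the shared category names (Other, Unknown) apart. As an alternative sketch, `pd.get_dummies` accepts `columns` and `prefix` arguments that achieve this in one call. Note that the resulting column names (for example `Weather_Other`) differ from the names used elsewhere in this notebook, so later cells would need adjusting if this approach were adopted. The toy frame is illustrative only.

```python
import pandas as pd

# Toy frame with category names that clash across columns (illustrative only)
toy = pd.DataFrame({
    "SEVERITYCODE": [2, 1, 1],
    "WEATHER": ["Other", "Clear", "Unknown"],
    "ROADCOND": ["Wet", "Other", "Unknown"],
})

# Encode both columns at once; the prefix keeps 'Other' and 'Unknown' distinct
encoded = pd.get_dummies(
    toy,
    columns=["WEATHER", "ROADCOND"],
    prefix=["Weather", "Road"],
)
print(encoded.columns.tolist())
```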


In [13]:
dummy_df.corr()

Unnamed: 0,SEVERITYCODE,Alley,Block,Intersection,Blowing Sand/Dirt,Clear,Fog/Smog/Smoke,Other Weather,Overcast,Partly Cloudy,...,Wet,Dark - No Street Lights,Dark - Street Lights Off,Dark - Street Lights On,Dark - Unknown Lighting,Dawn,Daylight,Dusk,Other Light,Unknown Light
SEVERITYCODE,1.0,-0.026807,-0.195055,0.199161,-0.001337,0.051938,0.003228,-0.022501,0.011724,0.003332,...,0.036763,-0.016642,-0.006466,-0.005618,0.001005,0.006828,0.081135,0.010924,-0.005624,-0.149796
Alley,-0.026807,1.0,-0.087352,-0.045078,-0.001019,-0.000902,0.001199,0.010327,-0.002643,-0.000325,...,-0.010539,0.041565,0.011062,-0.00522,-0.000483,-0.001377,-0.012859,-0.002014,0.000249,0.017694
Block,-0.195055,-0.087352,1.0,-0.991227,0.004712,-0.037437,0.001183,0.021945,-0.014228,-0.002804,...,-0.045801,0.019322,0.004877,0.005395,-0.000343,-0.005904,-0.074202,-0.0087,0.012838,0.133265
Intersection,0.199161,-0.045078,-0.991227,1.0,-0.00459,0.037662,-0.001345,-0.023377,0.014619,0.002855,...,0.047328,-0.024891,-0.006358,-0.004718,0.000408,0.006103,0.076117,0.008991,-0.012907,-0.135988
Blowing Sand/Dirt,-0.001337,-0.001019,0.004712,-0.00459,1.0,-0.019364,-0.000887,-0.001052,-0.006709,-8.3e-05,...,-0.003298,0.002208,-0.001289,0.001054,-0.000124,0.003888,-0.004175,0.000899,-0.000563,0.003569
Clear,0.051938,-0.000902,-0.037437,0.037662,-0.019364,1.0,-0.065726,-0.077904,-0.497055,-0.006185,...,-0.605398,-0.01233,-0.016103,-0.075139,-0.004928,-0.04202,0.219193,-0.01441,-0.019876,-0.253117
Fog/Smog/Smoke,0.003228,0.001199,0.001183,-0.001345,-0.000887,-0.065726,1.0,-0.003569,-0.022773,-0.000283,...,0.004317,0.006964,0.016545,0.031259,-0.00042,0.023431,-0.029349,-0.006475,0.000893,-0.011612
Other Weather,-0.022501,0.010327,0.021945,-0.023377,-0.001052,-0.077904,-0.003569,1.0,-0.026993,-0.000336,...,-0.025406,0.002356,1.3e-05,-0.012471,0.021002,-0.001794,-0.033541,-0.001713,0.037989,0.082195
Overcast,0.011724,-0.002643,-0.014228,0.014619,-0.006709,-0.497055,-0.022773,-0.026993,1.0,-0.002143,...,0.130655,-0.000872,0.004383,0.014653,-0.001212,0.0375,0.014842,0.022762,-0.002318,-0.088089
Partly Cloudy,0.003332,-0.000325,-0.002804,0.002855,-8.3e-05,-0.006185,-0.000283,-0.000336,-0.002143,1.0,...,-0.000616,0.011028,-0.000412,-0.000676,-4e-05,0.008422,-0.002287,-0.000926,-0.00018,-0.001386
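In the correlation matrix above, Block and Intersection have a correlation of -0.991, which suggests the two columns carry almost the same information. The following sketch shows one way such pairs could be listed automatically; the toy data and the 0.9 cut-off are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy dummy columns; Block and Intersection are exact opposites here
toy = pd.DataFrame({
    "Block":        [1, 1, 0, 0, 1, 0],
    "Intersection": [0, 0, 1, 1, 0, 1],
    "Raining":      [1, 0, 0, 1, 1, 0],
})

corr = toy.corr().abs()

# Keep only the upper triangle (excluding the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# Pairs whose absolute correlation exceeds the (hypothetical) cut-off
strong = pairs[pairs > 0.9]
print(strong)
```

One column of each strongly correlated pair is a candidate for removal before fitting linear models.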


In [14]:
clean_df = dummy_df.copy()

In [15]:
clean_df.shape

(187525, 33)

# Model Development
## Decision Tree
This section investigates whether an alternative encoding (integer label codes instead of dummy variables) results in different performance.

In [16]:
labelled_df = collision_df.copy()
labelled_df.dtypes

SEVERITYCODE     int64
ADDRTYPE        object
WEATHER         object
ROADCOND        object
LIGHTCOND       object
dtype: object

In [17]:
labelled_df = collision_df.copy()
# Replace each categorical column with its integer category codes
for col in ["ADDRTYPE", "WEATHER", "ROADCOND", "LIGHTCOND"]:
    labelled_df[col] = labelled_df[col].astype('category').cat.codes
labelled_df.head()

Unnamed: 0,SEVERITYCODE,ADDRTYPE,WEATHER,ROADCOND,LIGHTCOND
0,2,2,4,8,5
1,1,1,6,8,2
2,1,1,4,0,5
3,1,1,1,0,5
4,2,2,6,8,5
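The integer codes in the output above are assigned in sorted category order, so it can be useful to keep a mapping back to the original labels. A minimal sketch on a toy series (the variable names `as_cat` and `mapping` are introduced here for illustration):

```python
import pandas as pd

# Toy column mirroring the ADDRTYPE values seen earlier
toy = pd.Series(["Intersection", "Block", "Block", "Alley"], name="ADDRTYPE")

as_cat = toy.astype("category")
codes = as_cat.cat.codes

# Map each integer code back to its category label
mapping = dict(enumerate(as_cat.cat.categories))
print(mapping)
print(codes.tolist())
```

Integer codes impose an ordering (Alley < Block < Intersection) that has no real meaning, which is one reason the dummy encoding is also evaluated.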


In [18]:
labelledX = labelled_df.drop('SEVERITYCODE', axis=1)

In [75]:
labelledY = labelled_df['SEVERITYCODE']

In [76]:
from sklearn.model_selection import train_test_split

x_train_coll, x_test_coll, y_train_coll, y_test_coll = train_test_split(labelledX, labelledY, test_size=0.10, random_state=1)

print("number of test samples :", x_test_coll.shape[0])
print("number of training samples:",x_train_coll.shape[0])

number of test samples : 18753
number of training samples: 168772
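Because the two severity classes are not equally frequent, a stratified split may be worth considering so that the training and test sets keep the same class proportions. This is a sketch on toy arrays, not a change to the split used in this notebook; `stratify` is a standard argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 14 samples of class 1 and 6 of class 2 (illustrative only)
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([1] * 14 + [2] * 6)

# stratify=y_toy keeps the 70/30 class ratio in both halves
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=1, stratify=y_toy
)
print((y_tr == 2).sum(), (y_te == 2).sum())
```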


In [77]:
from sklearn.tree import DecisionTreeClassifier

In [78]:
tree = DecisionTreeClassifier(criterion="entropy", max_depth = 16)

In [79]:
tree.fit(x_train_coll,y_train_coll)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=16,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [80]:
predTree = tree.predict(x_test_coll)

In [81]:
predTree.sum()

18805

In [82]:
from sklearn import metrics
import matplotlib.pyplot as plt
print("DecisionTrees's Accuracy: ", metrics.accuracy_score(y_test_coll, predTree))

DecisionTrees's Accuracy:  0.6887431344318242


## Training/Test split

In [19]:
X = clean_df.drop('SEVERITYCODE', axis=1)

In [20]:
Y = clean_df['SEVERITYCODE']

In [21]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.10, random_state=1)

print("number of test samples :", x_test.shape[0])
print("number of training samples:",x_train.shape[0])

number of test samples : 18753
number of training samples: 168772


## Classification

In [86]:
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.ticker import NullFormatter
from sklearn import preprocessing
%matplotlib inline

In [87]:
from sklearn.neighbors import KNeighborsClassifier

In [88]:
k = 4
neigh = KNeighborsClassifier(n_neighbors = k).fit(x_train,y_train)
neigh

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=4, p=2,
           weights='uniform')

In [89]:
yhat = neigh.predict(x_test)
yhat[0:5]

array([1, 1, 1, 1, 1])

In [90]:
from sklearn import metrics
print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh.predict(x_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat))

Train set Accuracy:  0.669933401275093
Test set Accuracy:  0.6630939049752039


NOTE: Training time is relatively slow for this model.

## Decision Tree
This tree uses the dummy-variable encoding instead of the explicit label encoding used in the earlier decision tree.

In [91]:
from sklearn.tree import DecisionTreeClassifier

In [92]:
tree = DecisionTreeClassifier(criterion="entropy", max_depth = 32)

In [93]:
tree.fit(x_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=32,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [94]:
predTree = tree.predict(x_test)

In [95]:
predTree.sum()

18803

In [96]:
from sklearn import metrics
import matplotlib.pyplot as plt
print("DecisionTrees's Accuracy: ", metrics.accuracy_score(y_test, predTree))

DecisionTrees's Accuracy:  0.6888497840345544


## Support Vector Machine

In [48]:
from sklearn import svm
# Note: fitting an RBF-kernel SVC on the full training set is slow
clf = svm.SVC(kernel='rbf')
clf.fit(x_train, y_train)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [49]:
yhatSVC = clf.predict(x_test)

In [50]:
print("SVC's Accuracy: ", metrics.accuracy_score(y_test, yhatSVC))

SVC's Accuracy:  0.6889564336372846


## Logistic Regression

In [24]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn import metrics

In [25]:
LR = LogisticRegression(C=0.01, solver='liblinear').fit(x_train,y_train)
yhatLR = LR.predict(x_test)
print("LR's Accuracy: ", metrics.accuracy_score(y_test, yhatLR))

LR's Accuracy:  0.6889564336372846


In [26]:
yhatLR.sum()

18753
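The sum of 18753 equals the number of test samples, and since the labels are 1 and 2 this means every prediction is severity code 1. The accuracy of 0.6889 is therefore just the share of code 1 in the test set. A majority-class baseline makes this comparison explicit; the sketch below uses scikit-learn's `DummyClassifier` on toy data (illustrative values only).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels: 7 of class 1 and 3 of class 2; features are irrelevant here
X_toy = np.zeros((10, 1))
y_toy = np.array([1] * 7 + [2] * 3)

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
pred = baseline.predict(X_toy)
baseline_acc = accuracy_score(y_toy, pred)
print(baseline_acc)
```

A model is only informative if it beats this baseline on the same test set.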

In [27]:
yhat_prob = LR.predict_proba(x_test)

In [28]:
yhat_prob[0:10]

array([[0.71828522, 0.28171478],
       [0.73286936, 0.26713064],
       [0.74318961, 0.25681039],
       [0.72912429, 0.27087571],
       [0.72912429, 0.27087571],
       [0.71517946, 0.28482054],
       [0.73286936, 0.26713064],
       [0.75623049, 0.24376951],
       [0.73286936, 0.26713064],
       [0.71517946, 0.28482054]])

In [29]:
yhat_prob[:,1].max()

0.4637682642887758
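The maximum predicted probability for severity code 2 is about 0.46, which is below the default decision threshold of 0.5, so `predict` never returns code 2. One possible response is to lower the threshold applied to `predict_proba`. The sketch below uses hand-written probabilities as a stand-in for `yhat_prob[:, 1]`, and the `THRESHOLD` value is a hypothetical choice, not a tuned one.

```python
import numpy as np

# Stand-in for the second column of LR.predict_proba(x_test) (illustrative values)
proba_sev2 = np.array([0.28, 0.27, 0.46, 0.24, 0.41])

# Default behaviour: class 2 only when its probability reaches 0.5
default_pred = np.where(proba_sev2 >= 0.5, 2, 1)

THRESHOLD = 0.4  # hypothetical lower cut-off for predicting severity code 2
adjusted_pred = np.where(proba_sev2 >= THRESHOLD, 2, 1)

print(default_pred)
print(adjusted_pred)
```

Lowering the threshold trades some accuracy on code 1 for the ability to detect code 2 at all, so it should be evaluated with per-class metrics rather than accuracy alone.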

In [30]:
from sklearn.metrics import jaccard_similarity_score
jaccard_similarity_score(y_test, yhatLR)

0.6889564336372846
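The score above is identical to the accuracy, so it adds no new information for this problem. `confusion_matrix` is imported earlier but not used; together with per-class recall it shows how each severity code is handled. The sketch below uses toy labels. It also uses `jaccard_score`, which, to the best of my knowledge, replaces `jaccard_similarity_score` in more recent scikit-learn releases; treat that as an assumption to verify against the installed version.

```python
from sklearn.metrics import confusion_matrix, jaccard_score, recall_score

# Toy labels (illustrative only): the model rarely predicts code 2
y_true = [1, 1, 1, 2, 2, 1, 2, 1]
y_pred = [1, 1, 1, 1, 1, 1, 2, 1]

# Rows are true classes, columns are predicted classes, in the order [1, 2]
cm = confusion_matrix(y_true, y_pred, labels=[1, 2])

# Recall and Jaccard index for severity code 2 specifically
recall_sev2 = recall_score(y_true, y_pred, pos_label=2)
jaccard_sev2 = jaccard_score(y_true, y_pred, pos_label=2)

print(cm)
print(recall_sev2, jaccard_sev2)
```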

In [31]:
for column in x_train.columns.values.tolist():
    print(column)
    print (x_train[column].value_counts())
    print("")  

Alley
0    168107
1       665
Name: Alley, dtype: int64

Block
1    110974
0     57798
Name: Block, dtype: int64

Intersection
0    111639
1     57133
Name: Intersection, dtype: int64

Blowing Sand/Dirt
0    168730
1        42
Name: Blowing Sand/Dirt, dtype: int64

Clear
1    99476
0    69296
Name: Clear, dtype: int64

Fog/Smog/Smoke
0    168272
1       500
Name: Fog/Smog/Smoke, dtype: int64

Other Weather
0    168065
1       707
Name: Other Weather, dtype: int64

Overcast
0    144022
1     24750
Name: Overcast, dtype: int64

Partly Cloudy
0    168767
1         5
Name: Partly Cloudy, dtype: int64

Raining
0    139073
1     29699
Name: Raining, dtype: int64

Severe Crosswind
0    168748
1        24
Name: Severe Crosswind, dtype: int64

Sleet/Hail/Freezing Rain
0    168673
1        99
Name: Sleet/Hail/Freezing Rain, dtype: int64

Snowing
0    167968
1       804
Name: Snowing, dtype: int64

Unknown Weather
0    156106
1     12666
Name: Unknown Weather, dtype: int64

Dry
1    111357
0     

In [32]:
for column in x_train.columns.values.tolist():
    print(column)
    LR2 = LogisticRegression(C=0.01, solver='liblinear').fit(x_train[[column]],y_train)
    yhatLR2 = LR2.predict(x_test[[column]])
    print("LR's Accuracy for column: ", metrics.accuracy_score(y_test, yhatLR2))
    print("")  



Alley
LR's Accuracy for column:  0.6889564336372846

Block
LR's Accuracy for column:  0.6889564336372846

Intersection
LR's Accuracy for column:  0.6889564336372846

Blowing Sand/Dirt
LR's Accuracy for column:  0.6889564336372846

Clear
LR's Accuracy for column:  0.6889564336372846

Fog/Smog/Smoke
LR's Accuracy for column:  0.6889564336372846

Other Weather
LR's Accuracy for column:  0.6889564336372846

Overcast
LR's Accuracy for column:  0.6889564336372846

Partly Cloudy
LR's Accuracy for column:  0.6889564336372846

Raining
LR's Accuracy for column:  0.6889564336372846

Severe Crosswind
LR's Accuracy for column:  0.6889564336372846

Sleet/Hail/Freezing Rain
LR's Accuracy for column:  0.6889564336372846

Snowing
LR's Accuracy for column:  0.6889564336372846

Unknown Weather
LR's Accuracy for column:  0.6889564336372846

Dry
LR's Accuracy for column:  0.6889564336372846

Ice
LR's Accuracy for column:  0.6889564336372846

Oil
LR's Accuracy for column:  0.6889564336372846

Other Road
LR'

## Scaling

In [33]:
from sklearn import preprocessing

In [34]:
scaler = preprocessing.StandardScaler().fit(x_train)

In [35]:
scaled_x_train = scaler.transform(x_train)

In [36]:
scaledLR = LogisticRegression(C=0.01, solver='liblinear').fit(scaled_x_train,y_train)

In [37]:
yhatScaledLR = scaledLR.predict(scaler.transform(x_test))

In [38]:
print("Scaled LR's Accuracy: ", metrics.accuracy_score(y_test, yhatScaledLR))

Scaled LR's Accuracy:  0.6888497840345544
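The fit/transform/predict steps above can be bundled so that the scaler is always applied consistently to training and test data. This is a minimal sketch using scikit-learn's `make_pipeline` on synthetic data; the data and the rule that generates the labels are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary features; the label depends only on the first column
rng = np.random.RandomState(1)
X_toy = rng.randint(0, 2, size=(200, 3)).astype(float)
y_toy = np.where(X_toy[:, 0] == 1, 2, 1)

# The scaler is fitted inside the pipeline, then applied before the classifier
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=0.01, solver="liblinear"),
)
model.fit(X_toy, y_toy)
acc = model.score(X_toy, y_toy)
print(acc)
```

With a pipeline, `model.predict(x_test)` scales the test data using the statistics learned from the training data, so no separate `scaler.transform` call is needed.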


In [39]:
y_test.head()

112600    1
165482    2
150055    1
78951     1
129993    1
Name: SEVERITYCODE, dtype: int64

In [40]:
yhatScaledLR[0:5]

array([1, 1, 1, 1, 1])

## Analysis feedback

In [41]:
selected_df = clean_df[['Intersection','Raining','Fog/Smog/Smoke','Oil','Wet','Daylight']]

In [42]:
selected_df

Unnamed: 0,Intersection,Raining,Fog/Smog/Smoke,Oil,Wet,Daylight
0,1,0,0,0,1,1
1,0,1,0,0,1,0
2,0,0,0,0,0,1
3,0,0,0,0,0,1
4,1,1,0,0,1,1
5,1,0,0,0,0,1
6,1,1,0,0,1,1
7,1,0,0,0,0,1
8,0,0,0,0,0,1
9,1,0,0,0,0,1


In [43]:
selected_df.head()

Unnamed: 0,Intersection,Raining,Fog/Smog/Smoke,Oil,Wet,Daylight
0,1,0,0,0,1,1
1,0,1,0,0,1,0
2,0,0,0,0,0,1
3,0,0,0,0,0,1
4,1,1,0,0,1,1


In [44]:
x_train_sel, x_test_sel, y_train_sel, y_test_sel = train_test_split(selected_df, Y, test_size=0.10, random_state=1)

In [45]:
selectedLR = LogisticRegression(C=0.01, solver='liblinear').fit(x_train_sel,y_train_sel)

In [46]:
yhatSelectedLR = selectedLR.predict(x_test_sel)

In [47]:
print("Selected LR's Training Accuracy: ", metrics.accuracy_score(y_train_sel, selectedLR.predict(x_train_sel)))
print("Selected LR's Test Accuracy: ", metrics.accuracy_score(y_test_sel, yhatSelectedLR))

Selected LR's Training Accuracy:  0.697520915791719
Selected LR's Test Accuracy:  0.6889564336372846
