# Capstone Project - Car Accident Prediction
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

In modern societies, car accidents are responsible for millions of deaths and injuries worldwide every year. The World Health Organization describes the road traffic system as the most complex and most dangerous system that people have to deal with every day.

Reducing car accidents is an important public safety challenge, and big data analytics offers powerful techniques for identifying the factors that increase accident risk. Individual drivers can use these insights to be more aware of potential accident risk when planning trips. More importantly, the insights can inform prevention operations and public traffic policies that reduce accidents overall.


## Data <a name="data"></a>

### Data Source and Feature Selection
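This project uses the shared example dataset (Data-Collisions.csv). Based on the problem definition, the following factors are chosen for the analysis and prediction of accident severity:
* Accident location (`LOCATION`)
* Address type (`ADDRTYPE`)
* Number of people involved in the accident (`PERSONCOUNT`)
* Number of vehicles involved in the accident (`VEHCOUNT`)
* Weather (`WEATHER`)
* Road condition (`ROADCOND`)
* Light condition (`LIGHTCOND`)
* Speeding (`SPEEDING`)
* Whether inattention was a factor (`INATTENTIONIND`)
* Whether the driver(s) were under the influence (`UNDERINFL`)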

### Data Cleaning 
* Records with status "Unmatched" are removed
* 10 features are selected for a new data frame
* Categorical features are converted to numerical values
* Records with missing feature values are removed
* The dataset is imbalanced (more records with severity 1 than severity 2), so it will be re-sampled to reduce bias
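
The last bullet does not say which re-sampling method is used. One common option is to downsample the majority class until every class has as many rows as the rarest one. The sketch below illustrates that idea on toy data. The helper name `downsample_majority` and the `random_state` value are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

def downsample_majority(frame, label_col, random_state=0):
    """Hypothetical helper: sample every class down to the size of the rarest class."""
    n_min = frame[label_col].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in frame.groupby(label_col)
    ]
    # Shuffle so the classes are not stored in contiguous blocks.
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)

# Toy data (not the collision dataset): six rows of severity 1, two of severity 2.
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 1, 1, 1, 1, 1, 2, 2],
    "VEHCOUNT": [2, 2, 3, 1, 2, 4, 2, 3],
})
balanced = downsample_majority(toy, "SEVERITYCODE")
print(balanced["SEVERITYCODE"].value_counts().to_dict())
```

Downsampling discards majority-class rows, so it trades data volume for balance. Upsampling the minority class is the usual alternative when rows are scarce.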

In [346]:
import pandas as pd
import numpy as np

In [331]:
print("Hello Capstone Project Course!")

Hello Capstone Project Course!


In [332]:
df = pd.read_csv('https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv')
df.head()



Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


In [333]:
df["STATUS"].value_counts()

Matched      189786
Unmatched      4887
Name: STATUS, dtype: int64

In [334]:
#Remove unmatched data 
df=df[df["STATUS"]=="Matched"]
df.shape

(189786, 38)

In [335]:
df["SEVERITYCODE"].value_counts()

1    132627
2     57159
Name: SEVERITYCODE, dtype: int64
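
To put the imbalance in numbers, the counts above can be turned into class shares and a majority-to-minority ratio. This is a small arithmetic sketch. The two counts are copied from the output above and the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
severity_counts = pd.Series({1: 132627, 2: 57159}, name="SEVERITYCODE")

# Share of each class among the matched records (roughly 0.70 vs 0.30).
shares = severity_counts / severity_counts.sum()

# How many severity-1 records there are for each severity-2 record.
imbalance_ratio = severity_counts.loc[1] / severity_counts.loc[2]

print(shares.round(3).to_dict())
print(round(imbalance_ratio, 2))
```

On the full frame, `df["SEVERITYCODE"].value_counts(normalize=True)` gives the same shares directly.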

In [340]:
# Keep the target plus the ten selected features; copy() makes df1 independent of df
df1=df[['SEVERITYCODE','LOCATION','ADDRTYPE','PERSONCOUNT','VEHCOUNT','WEATHER','ROADCOND','LIGHTCOND','SPEEDING','INATTENTIONIND','UNDERINFL']].copy()
df1.head()

Unnamed: 0,SEVERITYCODE,LOCATION,ADDRTYPE,PERSONCOUNT,VEHCOUNT,WEATHER,ROADCOND,LIGHTCOND,SPEEDING,INATTENTIONIND,UNDERINFL
0,2,5TH AVE NE AND NE 103RD ST,Intersection,2,2,Overcast,Wet,Daylight,,,N
1,1,AURORA BR BETWEEN RAYE ST AND BRIDGE WAY N,Block,2,2,Raining,Wet,Dark - Street Lights On,,,0
2,1,4TH AVE BETWEEN SENECA ST AND UNIVERSITY ST,Block,4,3,Overcast,Dry,Daylight,,,0
3,1,2ND AVE BETWEEN MARION ST AND MADISON ST,Block,3,3,Clear,Dry,Daylight,,,N
4,2,SWIFT AVE S AND SWIFT AV OFF RP,Intersection,2,2,Raining,Wet,Daylight,,,0


In [341]:
df1['LOCATION'].value_counts()

BATTERY ST TUNNEL NB BETWEEN ALASKAN WY VI NB AND AURORA AVE N    274
BATTERY ST TUNNEL SB BETWEEN AURORA AVE N AND ALASKAN WY VI SB    268
N NORTHGATE WAY BETWEEN MERIDIAN AVE N AND CORLISS AVE N          260
AURORA AVE N BETWEEN N 117TH PL AND N 125TH ST                    247
6TH AVE AND JAMES ST                                              242
                                                                 ... 
29TH AVE W AND W RAYE ST                                            1
1ST AVE N BETWEEN PROSPECT N ST AND HIGHLAND S DR                   1
24TH AVE W BETWEEN 24 UPPER AVE W AND W RUFFNER ST                  1
8TH AVE NE BETWEEN NE 123RD ST AND NE 125TH ST                      1
DEARBORN OFF RP BETWEEN I5 SB COLLECTOR AND S DEARBORN ST           1
Name: LOCATION, Length: 23956, dtype: int64

In [342]:
df1['ADDRTYPE'].value_counts()

Block           123663
Intersection     63559
Alley              747
Name: ADDRTYPE, dtype: int64

In [343]:
df1.groupby(['ADDRTYPE'])['SEVERITYCODE'].value_counts(normalize=True)

ADDRTYPE      SEVERITYCODE
Alley         1               0.891566
              2               0.108434
Block         1               0.761473
              2               0.238527
Intersection  1               0.568731
              2               0.431269
Name: SEVERITYCODE, dtype: float64
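
The grouped `value_counts(normalize=True)` output above lists one row per (address type, severity) pair. The same per-group shares can be laid out as a wide table with `pd.crosstab`, which may be easier to compare across groups. The sketch below uses toy rows, not the collision data.

```python
import pandas as pd

# Toy rows (not the real data) with the same two columns as the notebook.
toy = pd.DataFrame({
    "ADDRTYPE": ["Block", "Block", "Block", "Block", "Intersection", "Intersection", "Alley"],
    "SEVERITYCODE": [1, 1, 1, 2, 1, 2, 1],
})

# Row-normalised table: each row sums to 1, with one column per severity code.
severity_share = pd.crosstab(toy["ADDRTYPE"], toy["SEVERITYCODE"], normalize="index")
print(severity_share)
```

With `normalize="index"` each address type is normalised separately, which matches what the grouped call computes.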

In [344]:
# Encode address type as integer codes (Alley=0, Block=1, Intersection=2) and assign the result back
df1['ADDRTYPE'] = df1['ADDRTYPE'].replace(to_replace=['Alley','Block','Intersection'],value=[0,1,2])
df1.head()


Unnamed: 0,SEVERITYCODE,LOCATION,ADDRTYPE,PERSONCOUNT,VEHCOUNT,WEATHER,ROADCOND,LIGHTCOND,SPEEDING,INATTENTIONIND,UNDERINFL
0,2,5TH AVE NE AND NE 103RD ST,2.0,2,2,Overcast,Wet,Daylight,,,N
1,1,AURORA BR BETWEEN RAYE ST AND BRIDGE WAY N,1.0,2,2,Raining,Wet,Dark - Street Lights On,,,0
2,1,4TH AVE BETWEEN SENECA ST AND UNIVERSITY ST,1.0,4,3,Overcast,Dry,Daylight,,,0
3,1,2ND AVE BETWEEN MARION ST AND MADISON ST,1.0,3,3,Clear,Dry,Daylight,,,N
4,2,SWIFT AVE S AND SWIFT AV OFF RP,2.0,2,2,Raining,Wet,Daylight,,,0
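
The encoded `ADDRTYPE` values print as `2.0` and `1.0` rather than `2` and `1`, which is what pandas does when a numeric column also holds missing values. The sketch below shows the same encoding written as an explicit lookup table with `Series.map`, on a toy series with one missing entry. The name `ADDRTYPE_CODES` is illustrative.

```python
import pandas as pd

# Same codes as the replace() call above, written as an explicit lookup table.
ADDRTYPE_CODES = {"Alley": 0, "Block": 1, "Intersection": 2}

# Toy values (not the real column), including one missing entry.
toy = pd.Series(["Intersection", "Block", None, "Alley", "Block"], name="ADDRTYPE")

# map() leaves missing values, and any label absent from the dictionary, as NaN,
# so the result is a float column.
encoded = toy.map(ADDRTYPE_CODES)
print(encoded.tolist())
```

Unlike `replace`, `map` turns unexpected labels into NaN instead of passing them through, so a later `dropna` catches categories that were never assigned a code.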


In [313]:
df1['WEATHER'].value_counts()

Clear                       111134
Raining                      33144
Overcast                     27713
Unknown                      15091
Snowing                        907
Other                          832
Fog/Smog/Smoke                 569
Sleet/Hail/Freezing Rain       113
Blowing Sand/Dirt               56
Severe Crosswind                25
Partly Cloudy                    5
Name: WEATHER, dtype: int64

In [314]:
df1.groupby(['WEATHER'])['SEVERITYCODE'].value_counts(normalize=True)

WEATHER                   SEVERITYCODE
Blowing Sand/Dirt         1               0.732143
                          2               0.267857
Clear                     1               0.677506
                          2               0.322494
Fog/Smog/Smoke            1               0.671353
                          2               0.328647
Other                     1               0.860577
                          2               0.139423
Overcast                  1               0.684444
                          2               0.315556
Partly Cloudy             2               0.600000
                          1               0.400000
Raining                   1               0.662805
                          2               0.337195
Severe Crosswind          1               0.720000
                          2               0.280000
Sleet/Hail/Freezing Rain  1               0.752212
                          2               0.247788
Snowing                   1               0

In [315]:
# Encode weather as integer codes; 'Other' and 'Unknown' share code 0
df1['WEATHER'] = df1['WEATHER'].replace(to_replace=['Other','Unknown','Clear','Raining','Overcast','Snowing','Fog/Smog/Smoke','Sleet/Hail/Freezing Rain','Blowing Sand/Dirt','Severe Crosswind','Partly Cloudy'],value=[0,0,1,2,3,4,5,6,7,8,9])
df1.head()


Unnamed: 0,SEVERITYCODE,LOCATION,ADDRTYPE,PERSONCOUNT,VEHCOUNT,WEATHER,ROADCOND,LIGHTCOND,SPEEDING,INATTENTIONIND,UNDERINFL
0,2,5TH AVE NE AND NE 103RD ST,2.0,2,2,3.0,Wet,Daylight,,,N
1,1,AURORA BR BETWEEN RAYE ST AND BRIDGE WAY N,1.0,2,2,2.0,Wet,Dark - Street Lights On,,,0
2,1,4TH AVE BETWEEN SENECA ST AND UNIVERSITY ST,1.0,4,3,3.0,Dry,Daylight,,,0
3,1,2ND AVE BETWEEN MARION ST AND MADISON ST,1.0,3,3,1.0,Dry,Daylight,,,N
4,2,SWIFT AVE S AND SWIFT AV OFF RP,2.0,2,2,2.0,Wet,Daylight,,,0


In [316]:
df1['ROADCOND'].value_counts()

Dry               124508
Wet                47473
Unknown            15078
Ice                 1209
Snow/Slush          1004
Other                132
Standing Water       115
Sand/Mud/Dirt         75
Oil                   64
Name: ROADCOND, dtype: int64

In [317]:
df1.groupby(['ROADCOND'])['SEVERITYCODE'].value_counts(normalize=True)

ROADCOND        SEVERITYCODE
Dry             1               0.678221
                2               0.321779
Ice             1               0.774194
                2               0.225806
Oil             1               0.625000
                2               0.375000
Other           1               0.674242
                2               0.325758
Sand/Mud/Dirt   1               0.693333
                2               0.306667
Snow/Slush      1               0.833665
                2               0.166335
Standing Water  1               0.739130
                2               0.260870
Unknown         1               0.950325
                2               0.049675
Wet             1               0.668127
                2               0.331873
Name: SEVERITYCODE, dtype: float64

In [318]:
# Encode road condition as integer codes; 'Other' and 'Unknown' share code 0
df1['ROADCOND'] = df1['ROADCOND'].replace(to_replace=['Other','Unknown','Dry','Wet','Ice','Snow/Slush','Standing Water','Sand/Mud/Dirt','Oil'],value=[0,0,1,2,3,4,5,6,7])
df1.head()


Unnamed: 0,SEVERITYCODE,LOCATION,ADDRTYPE,PERSONCOUNT,VEHCOUNT,WEATHER,ROADCOND,LIGHTCOND,SPEEDING,INATTENTIONIND,UNDERINFL
0,2,5TH AVE NE AND NE 103RD ST,2.0,2,2,3.0,2.0,Daylight,,,N
1,1,AURORA BR BETWEEN RAYE ST AND BRIDGE WAY N,1.0,2,2,2.0,2.0,Dark - Street Lights On,,,0
2,1,4TH AVE BETWEEN SENECA ST AND UNIVERSITY ST,1.0,4,3,3.0,1.0,Daylight,,,0
3,1,2ND AVE BETWEEN MARION ST AND MADISON ST,1.0,3,3,1.0,1.0,Daylight,,,N
4,2,SWIFT AVE S AND SWIFT AV OFF RP,2.0,2,2,2.0,2.0,Daylight,,,0


In [319]:
df1['LIGHTCOND'].value_counts()

Daylight                    116135
Dark - Street Lights On      48506
Unknown                      13473
Dusk                          5902
Dawn                          2502
Dark - No Street Lights       1537
Dark - Street Lights Off      1199
Other                          235
Dark - Unknown Lighting         11
Name: LIGHTCOND, dtype: int64

In [320]:
df1.groupby(['LIGHTCOND'])['SEVERITYCODE'].value_counts(normalize=True)

LIGHTCOND                 SEVERITYCODE
Dark - No Street Lights   1               0.782694
                          2               0.217306
Dark - Street Lights Off  1               0.736447
                          2               0.263553
Dark - Street Lights On   1               0.701583
                          2               0.298417
Dark - Unknown Lighting   1               0.636364
                          2               0.363636
Dawn                      1               0.670663
                          2               0.329337
Daylight                  1               0.668110
                          2               0.331890
Dusk                      1               0.670620
                          2               0.329380
Other                     1               0.778723
                          2               0.221277
Unknown                   1               0.955095
                          2               0.044905
Name: SEVERITYCODE, dtype: float64

In [321]:
# Encode light condition as integer codes; 'Other' and 'Unknown' share code 0
df1['LIGHTCOND'] = df1['LIGHTCOND'].replace(to_replace=['Other','Unknown','Daylight','Dark - Street Lights On','Dusk','Dawn','Dark - No Street Lights','Dark - Street Lights Off','Dark - Unknown Lighting'],value=[0,0,1,2,3,4,5,6,7])
df1.head()


Unnamed: 0,SEVERITYCODE,LOCATION,ADDRTYPE,PERSONCOUNT,VEHCOUNT,WEATHER,ROADCOND,LIGHTCOND,SPEEDING,INATTENTIONIND,UNDERINFL
0,2,5TH AVE NE AND NE 103RD ST,2.0,2,2,3.0,2.0,1.0,,,N
1,1,AURORA BR BETWEEN RAYE ST AND BRIDGE WAY N,1.0,2,2,2.0,2.0,2.0,,,0
2,1,4TH AVE BETWEEN SENECA ST AND UNIVERSITY ST,1.0,4,3,3.0,1.0,1.0,,,0
3,1,2ND AVE BETWEEN MARION ST AND MADISON ST,1.0,3,3,1.0,1.0,1.0,,,N
4,2,SWIFT AVE S AND SWIFT AV OFF RP,2.0,2,2,2.0,2.0,1.0,,,0


In [322]:
df1['SPEEDING'].value_counts()

Y    9333
Name: SPEEDING, dtype: int64

In [323]:
values = {'SPEEDING': 0, 'INATTENTIONIND': 0}
df1=df1.fillna(value=values)
df1

Unnamed: 0,SEVERITYCODE,LOCATION,ADDRTYPE,PERSONCOUNT,VEHCOUNT,WEATHER,ROADCOND,LIGHTCOND,SPEEDING,INATTENTIONIND,UNDERINFL
0,2,5TH AVE NE AND NE 103RD ST,2.0,2,2,3.0,2.0,1.0,0,0,N
1,1,AURORA BR BETWEEN RAYE ST AND BRIDGE WAY N,1.0,2,2,2.0,2.0,2.0,0,0,0
2,1,4TH AVE BETWEEN SENECA ST AND UNIVERSITY ST,1.0,4,3,3.0,1.0,1.0,0,0,0
3,1,2ND AVE BETWEEN MARION ST AND MADISON ST,1.0,3,3,1.0,1.0,1.0,0,0,N
4,2,SWIFT AVE S AND SWIFT AV OFF RP,2.0,2,2,2.0,2.0,1.0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
194668,2,34TH AVE S BETWEEN S DAKOTA ST AND S GENESEE ST,1.0,3,2,1.0,1.0,1.0,0,0,N
194669,1,AURORA AVE N BETWEEN N 85TH ST AND N 86TH ST,1.0,2,2,2.0,2.0,1.0,0,Y,N
194670,2,20TH AVE NE AND NE 75TH ST,2.0,3,2,1.0,1.0,1.0,0,0,N
194671,2,GREENWOOD AVE N AND N 68TH ST,2.0,2,1,1.0,1.0,3.0,0,0,N


In [324]:
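# Map the 'Y' flag to 1 (missing values were already filled with 0) and assign the result back
df1['INATTENTIONIND'] = df1['INATTENTIONIND'].replace(to_replace=['Y'],value=[1])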
df1['INATTENTIONIND'].value_counts()

0    159981
1     29805
Name: INATTENTIONIND, dtype: int64
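
The cells above fill missing `SPEEDING` and `INATTENTIONIND` values with 0 and then map `Y` to 1 for `INATTENTIONIND`. The cells shown here do not apply the same `Y` to 1 step to `SPEEDING`. For columns that hold only `Y` or a missing value, both steps can be folded into a single comparison. The sketch below shows this on toy data. The helper `encode_flag` is a hypothetical name, not something the notebook defines.

```python
import numpy as np
import pandas as pd

def encode_flag(column):
    """Hypothetical helper: 'Y' -> 1, anything else (including NaN) -> 0."""
    return column.eq("Y").astype(int)

# Toy data (not the collision dataset): flags are either 'Y' or missing.
toy = pd.DataFrame({
    "SPEEDING": ["Y", np.nan, np.nan, "Y"],
    "INATTENTIONIND": [np.nan, "Y", np.nan, np.nan],
})
for name in ["SPEEDING", "INATTENTIONIND"]:
    toy[name] = encode_flag(toy[name])

print(toy)
```

Treating a missing flag as 0 assumes that an empty cell means "not reported as a factor", which is the same assumption the `fillna` call makes.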

In [325]:
df1['UNDERINFL'].value_counts()

N    100274
0     80391
Y      5126
1      3995
Name: UNDERINFL, dtype: int64

In [326]:
# Collapse the mixed codes: 'N' and '0' become 0, 'Y' and '1' become 1
df1['UNDERINFL'] = df1['UNDERINFL'].replace(to_replace=['N','0','Y','1'],value=[0,0,1,1])
df1['UNDERINFL'].value_counts()

0    180665
1      9121
Name: UNDERINFL, dtype: int64
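
`UNDERINFL` mixes two coding schemes (`N`/`Y` and `0`/`1`). A column like this can also end up holding the integers 0 and 1 next to the strings `"0"` and `"1"`, depending on how the file is parsed. The sketch below is one defensive way to normalise it, under that assumption: cast everything to string first, then map. The name `UNDERINFL_CODES` is illustrative.

```python
import numpy as np
import pandas as pd

# Same pairing as the replace() call above, as an explicit lookup table.
UNDERINFL_CODES = {"N": 0, "0": 0, "Y": 1, "1": 1}

# Toy values mixing strings, integers and a missing entry.
toy = pd.Series(["N", "0", 0, "Y", 1, np.nan], name="UNDERINFL")

# Cast to string first so the integer 0 and the string "0" are treated alike.
# A missing value has no entry in the mapping, so it stays NaN.
encoded = toy.astype(str).map(UNDERINFL_CODES)
print(encoded.tolist())
```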

In [337]:
df1=df1.dropna(how='any')
df1

Unnamed: 0,SEVERITYCODE,LOCATION,ADDRTYPE,PERSONCOUNT,VEHCOUNT,WEATHER,ROADCOND,LIGHTCOND,SPEEDING,INATTENTIONIND,UNDERINFL
0,2,5TH AVE NE AND NE 103RD ST,2.0,2,2,3.0,2.0,1.0,0,0,0
1,1,AURORA BR BETWEEN RAYE ST AND BRIDGE WAY N,1.0,2,2,2.0,2.0,2.0,0,0,0
2,1,4TH AVE BETWEEN SENECA ST AND UNIVERSITY ST,1.0,4,3,3.0,1.0,1.0,0,0,0
3,1,2ND AVE BETWEEN MARION ST AND MADISON ST,1.0,3,3,1.0,1.0,1.0,0,0,0
4,2,SWIFT AVE S AND SWIFT AV OFF RP,2.0,2,2,2.0,2.0,1.0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
194668,2,34TH AVE S BETWEEN S DAKOTA ST AND S GENESEE ST,1.0,3,2,1.0,1.0,1.0,0,0,0
194669,1,AURORA AVE N BETWEEN N 85TH ST AND N 86TH ST,1.0,2,2,2.0,2.0,1.0,0,1,0
194670,2,20TH AVE NE AND NE 75TH ST,2.0,3,2,1.0,1.0,1.0,0,0,0
194671,2,GREENWOOD AVE N AND N 68TH ST,2.0,2,1,1.0,1.0,3.0,0,0,0


In [338]:
df1['SEVERITYCODE'].value_counts()

1    129979
2     56803
Name: SEVERITYCODE, dtype: int64
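
The counts above sum to 186782 rows, so `dropna(how='any')` removed 189786 - 186782 = 3004 records. Before dropping, it can help to check which columns the missing values come from. The sketch below shows that check on toy data. The column values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (not the collision dataset) with a few missing entries.
toy = pd.DataFrame({
    "ADDRTYPE": [2.0, 1.0, np.nan, 1.0],
    "WEATHER": [3.0, np.nan, 1.0, 1.0],
    "SPEEDING": [0, 0, 1, 0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Drop every row that has at least one missing value and count the loss.
cleaned = toy.dropna(how="any")
rows_dropped = len(toy) - len(cleaned)

print(missing_per_column.to_dict())
print(rows_dropped)
```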