## Coursera Capstone Part 1: Description of the problem and data  
 - **Creator: Wenzhuo Song**
 - **Email: wenzhuosong1996@outlook.com**

### 1. A description of the problem and a discussion of the background. 

**Video: [Introduction to the capstone](https://www.coursera.org/learn/applied-data-science-capstone/lecture/vQGoA/introduction-to-the-capstone)**  
  
A car accident can have many causes, such as extreme weather and poor road conditions, and in some cases it results in casualties. If people could estimate the probability or severity of an accident in advance from available information, they could drive more carefully, reducing both the likelihood of accidents and the resulting losses.  
  
The main audience for this project is the traffic police, who need to manage traffic flow according to the forecast in order to reduce accidents and losses. Hospitals could also use such a system to prepare for accident rescue in advance, and drivers could drive more carefully given the prediction.
  
In this project, the goal is to **build a model which can predict the severity of an accident**.  
  
From personal experience, car accidents have several common causes.  
 - **Road conditions**. Poor road conditions make driving difficult, while very good conditions can make drivers careless.
 - **Light conditions**. With poor visibility, such as at night, the driver may not judge the situation accurately and in time, which can cause an accident.
 - **Extreme weather**. Driving in extreme weather is very dangerous.
 - **Bad driving habits**. Habits such as using a mobile phone while driving or speeding may cause an accident.
 - **Drunk/drugged driving**. A driver in an impaired state is extremely prone to accidents.
 - **Bicycles/pedestrians**. Accidents involving bicycles or pedestrians tend to cause greater losses and even casualties.  
  
The training dataset contains many features, some of which relate to the causes discussed above. Therefore, by analyzing the important features and applying a supervised learning algorithm, it should be possible to build a model that predicts the severity of an accident to some degree.

### 2. A description of the data and how it will be used to solve the problem. 

The [dataset](https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv) used in this project contains all collisions provided by SPD and recorded by Traffic Records. Each record includes the collision's severity and other conditions; more details are available [here](https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Metadata.pdf).

This is a supervised learning problem with three main steps: data preprocessing, model building, and evaluation and improvement.  
  
In data preprocessing, drop features that are not useful, fix missing and invalid values, analyze feature importance, extract more information from the dataset if needed, and then consider which models are likely to perform well.  
  
In model building, the cleaned data is split into training, validation, and test sets, and several models are built, including baselines and stronger models.  
  
In evaluation and improvement, the performance of the models is analyzed so that a better one can be built.
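
As a rough illustration of the splitting step, the sketch below shuffles row positions with a fixed seed, cuts them into train, validation, and test parts, and computes a majority-class baseline. The toy frame, the helper name `split_frame`, and the 60/20/20 ratios are assumptions made for illustration, not the split used in this project.

```python
import numpy as np
import pandas as pd


def split_frame(frame, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle rows and cut them into train/validation/test frames (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(frame))
    n_train = int(round(train_frac * len(frame)))
    n_val = int(round(val_frac * len(frame)))
    train = frame.iloc[order[:n_train]]
    val = frame.iloc[order[n_train:n_train + n_val]]
    test = frame.iloc[order[n_train + n_val:]]
    return train, val, test


# Toy stand-in for the cleaned collision data (values are made up).
toy = pd.DataFrame({
    "VEHCOUNT": [2, 2, 3, 3, 2, 1, 2, 4, 2, 2],
    "SEVERITYCODE": [2, 1, 1, 1, 2, 1, 1, 2, 1, 1],
})
train, val, test = split_frame(toy)

# Majority-class baseline: always predict the most frequent training label.
majority = train["SEVERITYCODE"].mode()[0]
baseline_acc = (val["SEVERITYCODE"] == majority).mean()
```

Any later model should beat `baseline_acc` on the validation part before it is worth evaluating on the test part.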

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#### 2.1 Read the data

In [61]:
df = pd.read_csv("D:/Coursera_capstone/Data-Collisions.csv")
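
When pandas infers column types on a large file, a column that mixes numbers and text can trigger a `DtypeWarning`. One possible way to avoid this is to state the type of such columns up front. The sketch below assumes that `UNDERINFL` and `ST_COLCODE` are the columns to read as strings, and it uses a small inline CSV with made-up values instead of the real file.

```python
import io
import pandas as pd

# Tiny inline stand-in for Data-Collisions.csv (values are made up).
csv_text = """SEVERITYCODE,UNDERINFL,ST_COLCODE
2,N,10
1,0,11
1,Y,
"""

# Read code-like columns as strings so mixed values keep a single type.
sample = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"UNDERINFL": str, "ST_COLCODE": str},
)
print(sample.dtypes)
```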



In [62]:
df.head()

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


In [63]:
df.shape

(194673, 38)

In [64]:
df.columns

Index(['SEVERITYCODE', 'X', 'Y', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO',
       'STATUS', 'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE',
       'EXCEPTRSNDESC', 'SEVERITYCODE.1', 'SEVERITYDESC', 'COLLISIONTYPE',
       'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'INCDATE',
       'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE', 'SDOT_COLDESC',
       'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND',
       'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE', 'ST_COLDESC',
       'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR'],
      dtype='object')

**This dataset has 194673 instances and 38 columns: the target SEVERITYCODE and 37 other columns. Some of these columns are clearly not useful as features and some values are missing or invalid, so the data needs to be cleaned before model building.**
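
The column list above contains both `SEVERITYCODE` and `SEVERITYCODE.1`, plus a text column `SEVERITYDESC`. A minimal sketch of how one might check whether such columns simply repeat the target is shown below. The toy frame is made up for illustration; whether the columns match on the full dataset has to be verified on `df` itself.

```python
import pandas as pd

# Toy values for illustration; run the same checks on the real df.
toy = pd.DataFrame({
    "SEVERITYCODE": [2, 1, 1, 1, 2],
    "SEVERITYCODE.1": [2, 1, 1, 1, 2],
    "SEVERITYDESC": [
        "Injury Collision",
        "Property Damage Only Collision",
        "Property Damage Only Collision",
        "Property Damage Only Collision",
        "Injury Collision",
    ],
})

# True only if every row agrees between the two code columns.
is_duplicate = bool((toy["SEVERITYCODE"] == toy["SEVERITYCODE.1"]).all())

# If SEVERITYDESC is just a label, each code maps to exactly one description.
desc_per_code = toy.groupby("SEVERITYCODE")["SEVERITYDESC"].nunique()
```

Columns that repeat the target should not be used as features, because they would leak the answer to the model.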

#### 2.2 More description of data

**A. Drop features that are not useful**  
 - **Spatial features**: 'X' (longitude), 'Y' (latitude), 'LOCATION'.  
 - **Identifiers**: 'OBJECTID', 'INCKEY', 'COLDETKEY', 'INTKEY', 'REPORTNO', 'SDOTCOLNUM', 'STATUS'.

In [65]:
df.drop(['X', 'Y', 'LOCATION', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'INTKEY', 'REPORTNO', 'SDOTCOLNUM', 'STATUS'], axis=1, inplace=True)

In [66]:
df.head()

Unnamed: 0,SEVERITYCODE,ADDRTYPE,EXCEPTRSNCODE,EXCEPTRSNDESC,SEVERITYCODE.1,SEVERITYDESC,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,...,WEATHER,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,Intersection,,,2,Injury Collision,Angles,2,0,0,...,Overcast,Wet,Daylight,,,10,Entering at angle,0,0,N
1,1,Block,,,1,Property Damage Only Collision,Sideswipe,2,0,0,...,Raining,Wet,Dark - Street Lights On,,,11,From same direction - both going straight - bo...,0,0,N
2,1,Block,,,1,Property Damage Only Collision,Parked Car,4,0,0,...,Overcast,Dry,Daylight,,,32,One parked--one moving,0,0,N
3,1,Block,,,1,Property Damage Only Collision,Other,3,0,0,...,Clear,Dry,Daylight,,,23,From same direction - all others,0,0,N
4,2,Intersection,,,2,Injury Collision,Angles,2,0,0,...,Raining,Wet,Daylight,,,10,Entering at angle,0,0,N


**The proportion of missing values in each feature**

In [67]:
df.isna().sum()/len(df)

SEVERITYCODE      0.000000
ADDRTYPE          0.009894
EXCEPTRSNCODE     0.564341
EXCEPTRSNDESC     0.971039
SEVERITYCODE.1    0.000000
SEVERITYDESC      0.000000
COLLISIONTYPE     0.025191
PERSONCOUNT       0.000000
PEDCOUNT          0.000000
PEDCYLCOUNT       0.000000
VEHCOUNT          0.000000
INCDATE           0.000000
INCDTTM           0.000000
JUNCTIONTYPE      0.032511
SDOT_COLCODE      0.000000
SDOT_COLDESC      0.000000
INATTENTIONIND    0.846897
UNDERINFL         0.025088
WEATHER           0.026100
ROADCOND          0.025746
LIGHTCOND         0.026557
PEDROWNOTGRNT     0.976026
SPEEDING          0.952058
ST_COLCODE        0.000092
ST_COLDESC        0.025191
SEGLANEKEY        0.000000
CROSSWALKKEY      0.000000
HITPARKEDCAR      0.000000
dtype: float64

There are 194673 instances, but many values are missing, as shown above. If the proportion of missing values in a feature is too high, as for EXCEPTRSNDESC (0.97), PEDROWNOTGRNT (0.98) and SPEEDING (0.95), the feature should be dropped directly, while the remaining gaps can be filled with summary statistics. Here, I drop every feature with more than half of its values missing.

In [68]:
df.drop(['EXCEPTRSNCODE', 'EXCEPTRSNDESC', 'INATTENTIONIND', 'PEDROWNOTGRNT', 'SPEEDING'], axis=1, inplace=True)
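
The columns dropped above can also be selected programmatically from the missing-value proportions, which avoids typing the names by hand. This is a minimal sketch on a made-up frame; the `threshold` variable and the toy values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame: SPEEDING is mostly missing, WEATHER is mostly present.
toy = pd.DataFrame({
    "SEVERITYCODE": [2, 1, 1, 1],
    "SPEEDING": [np.nan, np.nan, "Y", np.nan],
    "WEATHER": ["Overcast", "Raining", np.nan, "Clear"],
})

threshold = 0.5  # assumed cutoff, matching "more than half"
missing_share = toy.isna().mean()  # same as isna().sum() / len(toy)
to_drop = missing_share[missing_share > threshold].index.tolist()
reduced = toy.drop(columns=to_drop)
print(to_drop)  # ['SPEEDING']
```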

In [69]:
df.isna().sum()/len(df)

SEVERITYCODE      0.000000
ADDRTYPE          0.009894
SEVERITYCODE.1    0.000000
SEVERITYDESC      0.000000
COLLISIONTYPE     0.025191
PERSONCOUNT       0.000000
PEDCOUNT          0.000000
PEDCYLCOUNT       0.000000
VEHCOUNT          0.000000
INCDATE           0.000000
INCDTTM           0.000000
JUNCTIONTYPE      0.032511
SDOT_COLCODE      0.000000
SDOT_COLDESC      0.000000
UNDERINFL         0.025088
WEATHER           0.026100
ROADCOND          0.025746
LIGHTCOND         0.026557
ST_COLCODE        0.000092
ST_COLDESC        0.025191
SEGLANEKEY        0.000000
CROSSWALKKEY      0.000000
HITPARKEDCAR      0.000000
dtype: float64

**Then, fill the missing values of each feature with its mode**

In [70]:
# Fill each remaining gap with the column's most frequent value.
# Assigning the result back avoids chained in-place calls, which recent pandas versions warn about.
mode_fill_cols = ['ADDRTYPE', 'COLLISIONTYPE', 'JUNCTIONTYPE', 'UNDERINFL', 'WEATHER',
                  'ROADCOND', 'LIGHTCOND', 'ST_COLCODE', 'ST_COLDESC']
for col in mode_fill_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

In [71]:
df.head()

Unnamed: 0,SEVERITYCODE,ADDRTYPE,SEVERITYCODE.1,SEVERITYDESC,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INCDATE,...,SDOT_COLDESC,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,Intersection,2,Injury Collision,Angles,2,0,0,2,2013/03/27 00:00:00+00,...,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END ...",N,Overcast,Wet,Daylight,10,Entering at angle,0,0,N
1,1,Block,1,Property Damage Only Collision,Sideswipe,2,0,0,2,2006/12/20 00:00:00+00,...,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, LEFT SIDE ...",0,Raining,Wet,Dark - Street Lights On,11,From same direction - both going straight - bo...,0,0,N
2,1,Block,1,Property Damage Only Collision,Parked Car,4,0,0,3,2004/11/18 00:00:00+00,...,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, REAR END",0,Overcast,Dry,Daylight,32,One parked--one moving,0,0,N
3,1,Block,1,Property Damage Only Collision,Other,3,0,0,3,2013/03/29 00:00:00+00,...,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END ...",N,Clear,Dry,Daylight,23,From same direction - all others,0,0,N
4,2,Intersection,2,Injury Collision,Angles,2,0,0,2,2004/01/28 00:00:00+00,...,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END ...",0,Raining,Wet,Daylight,10,Entering at angle,0,0,N


**'UNDERINFL' mixes N/0 and Y/1, so convert all values to 0 and 1**

In [79]:
# Map the letter codes to digit strings, then cast the whole column to integers.
df['UNDERINFL'] = df['UNDERINFL'].replace({'N': '0', 'Y': '1'}).astype(int)

In [81]:
df.UNDERINFL.value_counts()

0    185552
1      9121
Name: UNDERINFL, dtype: int64
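
To make the conversion easier to follow, the sketch below applies the same idea to a small hand-made Series containing all four spellings. The sample values are made up for illustration.

```python
import pandas as pd

# Hand-made examples of the four spellings found in UNDERINFL.
raw = pd.Series(["N", "0", "Y", "1", "N"])

# Map the letters to digit strings first, then cast everything to integers;
# the strings "0" and "1" are handled by the cast itself.
flags = raw.replace({"N": "0", "Y": "1"}).astype(int)
print(flags.tolist())  # [0, 0, 1, 1, 0]
```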

**Make sure each feature has the right data type**

In [82]:
df.dtypes

SEVERITYCODE       int64
ADDRTYPE          object
SEVERITYCODE.1     int64
SEVERITYDESC      object
COLLISIONTYPE     object
PERSONCOUNT        int64
PEDCOUNT           int64
PEDCYLCOUNT        int64
VEHCOUNT           int64
INCDATE           object
INCDTTM           object
JUNCTIONTYPE      object
SDOT_COLCODE       int64
SDOT_COLDESC      object
UNDERINFL          int32
WEATHER           object
ROADCOND          object
LIGHTCOND         object
ST_COLCODE        object
ST_COLDESC        object
SEGLANEKEY         int64
CROSSWALKKEY       int64
HITPARKEDCAR      object
dtype: object
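
In the listing above, `ST_COLCODE` is stored as `object` although the values shown for it look numeric. One possible way to coerce such a column is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into NaN so they can be inspected. The sample values below, including the unparseable entry, are assumptions.

```python
import pandas as pd

# Made-up sample: numeric codes stored as text, plus one unparseable entry.
codes = pd.Series(["10", "11", "32", "n/a", "23"], dtype="object")

# Unparseable entries become NaN instead of raising an error.
numeric_codes = pd.to_numeric(codes, errors="coerce")
n_bad = int(numeric_codes.isna().sum())
```

Counting `n_bad` before and after the conversion shows how many entries would be lost, which helps decide whether the column is worth converting at all.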

In [95]:
# Collect the time-of-day part of every INCDATE string and count the distinct values.
time = [x[1] for x in df.INCDATE.str.split(' ')]
pd.Series(time).value_counts()

00:00:00+00    194673
dtype: int64

**Every value in INCDATE ends with 00:00:00+00, so drop the time part and keep only the year/month/day information.**

In [105]:
# Use bracket assignment: `df.DATE = ...` only sets an attribute and does not create a column.
df['DATE'] = df['INCDATE'].str.split(' ').str[0]
df.drop(['INCDATE'], axis=1, inplace=True)

In [106]:
df.head()

Unnamed: 0,SEVERITYCODE,ADDRTYPE,SEVERITYCODE.1,SEVERITYDESC,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INCDTTM,...,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR,DATE
0,2,Intersection,2,Injury Collision,Angles,2,0,0,2,3/27/2013 2:54:00 PM,...,0,Overcast,Wet,Daylight,10,Entering at angle,0,0,N,2013/03/27
1,1,Block,1,Property Damage Only Collision,Sideswipe,2,0,0,2,12/20/2006 6:55:00 PM,...,0,Raining,Wet,Dark - Street Lights On,11,From same direction - both going straight - bo...,0,0,N,2006/12/20
2,1,Block,1,Property Damage Only Collision,Parked Car,4,0,0,3,11/18/2004 10:20:00 AM,...,0,Overcast,Dry,Daylight,32,One parked--one moving,0,0,N,2004/11/18
3,1,Block,1,Property Damage Only Collision,Other,3,0,0,3,3/29/2013 9:26:00 AM,...,0,Clear,Dry,Daylight,23,From same direction - all others,0,0,N,2013/03/29
4,2,Intersection,2,Injury Collision,Angles,2,0,0,2,1/28/2004 8:04:00 AM,...,0,Raining,Wet,Daylight,10,Entering at angle,0,0,N,2004/01/28
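
Since `INCDATE` carries no time of day, `INCDTTM` is the column that could provide hour-level information. The sketch below parses a few `INCDTTM` strings copied from the table above and derives simple calendar features. The explicit format string and the feature names `HOUR`, `WEEKDAY`, and `MONTH` are assumptions, and rows that do not match the format would become NaT.

```python
import pandas as pd

# Timestamps copied from the INCDTTM column shown above.
stamps = pd.Series([
    "3/27/2013 2:54:00 PM",
    "12/20/2006 6:55:00 PM",
    "11/18/2004 10:20:00 AM",
])

# Explicit format; entries that do not match become NaT rather than raising.
parsed = pd.to_datetime(stamps, format="%m/%d/%Y %I:%M:%S %p", errors="coerce")

features = pd.DataFrame({
    "HOUR": parsed.dt.hour,          # 0-23
    "WEEKDAY": parsed.dt.dayofweek,  # Monday=0 ... Sunday=6
    "MONTH": parsed.dt.month,
})
```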


**Now, the string values need to be translated to numbers, which will be finished in**
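
One common way to turn string columns into numbers is to map two-valued columns to 0/1 and to one-hot encode columns with several categories. The sketch below shows both on a made-up frame; the choice of columns and of `pd.get_dummies` is an assumption, not necessarily the encoding used later in this project.

```python
import pandas as pd

# Made-up sample with one binary and one multi-valued string column.
toy = pd.DataFrame({
    "ROADCOND": ["Wet", "Wet", "Dry", "Dry", "Wet"],
    "HITPARKEDCAR": ["N", "N", "N", "N", "Y"],
})

# Binary flag: map the two spellings directly to 0/1.
toy["HITPARKEDCAR"] = toy["HITPARKEDCAR"].map({"N": 0, "Y": 1})

# Multi-valued category: one indicator column per distinct value.
encoded = pd.get_dummies(toy, columns=["ROADCOND"], dtype=int)
print(encoded.columns.tolist())
```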