## Coursera Capstone Part 1: Description of the problem and data  
 - **Creator: Wenzhuo Song**
 - **Email: wenzhuosong1996@outlook.com**

### 1. A description of the problem and a discussion of the background. 

**Video: [Introduction to the capstone](https://www.coursera.org/learn/applied-data-science-capstone/lecture/vQGoA/introduction-to-the-capstone)**  
  
A car accident can have many causes, such as extreme weather and poor road conditions, and in some cases it results in casualties. If people could estimate the probability or severity of an accident in advance from the information available to them, they could drive more carefully, reducing both the likelihood of accidents and the losses they cause.  
  
The main audience for this project is the traffic police, who need to manage traffic flow according to the forecast in order to reduce accidents and losses. Hospitals could also use such a system to prepare for accident rescue in advance, and drivers could drive more carefully given the prediction.
  
In this project, the goal is to **build a model which can predict the severity of an accident**.  
  
From personal experience, several factors contribute to car accidents.  
 - **Road conditions**. Poor road conditions make driving difficult, while very good conditions can make drivers careless.
 - **Light conditions**. With poor visibility, for example at night, the driver may not be able to judge the situation accurately and in time, which can cause an accident.
 - **Extreme weather**. Driving in extreme weather is very dangerous.
 - **Bad driving habits**. Habits such as using a mobile phone while driving or speeding may cause an accident.
 - **Drunk/drug driving**. A driver who is impaired is far more likely to have an accident.
 - **Bicycles/pedestrians**. Accidents that involve bicycles or pedestrians tend to cause greater losses and even casualties.  
  
The training dataset has many features, some of which correspond to the factors discussed above. Therefore, by analyzing the important features and applying a supervised learning algorithm, it should be possible to build a model that predicts the severity of an accident to some degree.

### 2. A description of the data and how it will be used to solve the problem. 

The [dataset](https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv) used in this project contains all collisions provided by SPD and recorded by Traffic Records. Each record describes one collision with its severity and other conditions. A fuller description of the columns is [here](https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Metadata.pdf).

This is a supervised learning problem, and the work has three main steps: data preprocessing, model building, and evaluation and improvement.  
  
In data preprocessing, the plan is to drop features that are not useful, fix missing and wrong values, analyze the importance of the features, extract more information from the dataset if needed, and then consider which models are likely to perform well.  
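
The preprocessing steps above can be sketched as follows. This is only a minimal illustration on a small made-up frame that borrows a few column names from the dataset. The helper name `preprocess`, the choice of `OBJECTID` as a column to drop, and the fill values `"N"` and `"Unknown"` are assumptions made for the example, not decisions already taken in this project.

```python
import numpy as np
import pandas as pd

# Toy rows that reuse a few column names from the collisions data (values are made up).
toy = pd.DataFrame({
    "SEVERITYCODE": [2, 1, 1, 2],
    "OBJECTID": [1, 2, 3, 4],
    "ROADCOND": ["Wet", "Dry", np.nan, "Wet"],
    "LIGHTCOND": ["Daylight", np.nan, "Daylight", "Daylight"],
    "SPEEDING": [np.nan, "Y", np.nan, np.nan],
})

def preprocess(frame):
    """Drop an identifier column and fill missing categorical values (illustrative choices)."""
    # Assumption: OBJECTID is a pure identifier and carries no predictive signal.
    cleaned = frame.drop(columns=["OBJECTID"])
    # Assumption: a missing SPEEDING flag means speeding was not recorded, so treat it as "N".
    cleaned["SPEEDING"] = cleaned["SPEEDING"].fillna("N")
    # Assumption: missing condition labels are grouped under a separate "Unknown" category.
    for col in ["ROADCOND", "LIGHTCOND"]:
        cleaned[col] = cleaned[col].fillna("Unknown")
    return cleaned

cleaned_toy = preprocess(toy)
print(cleaned_toy.isna().sum())
```

Whether a missing value should be filled, treated as its own category, or cause the row to be dropped depends on the column, so each of these choices would need to be checked against the metadata.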
  
In model building, the cleaned data needs to be split into training, validation and test parts, and then several models will be built, including baselines and stronger models.  
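
One way the three-way split could look is sketched below, using only pandas and NumPy. The helper `split_frame`, the 70/15/15 proportions and the `seed` value are assumptions for illustration, and the `demo` frame is synthetic.

```python
import numpy as np
import pandas as pd

def split_frame(frame, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle the rows, then cut them into train, validation and test parts."""
    shuffled = frame.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_test = int(round(len(shuffled) * test_frac))
    n_val = int(round(len(shuffled) * val_frac))
    test = shuffled.iloc[:n_test]
    val = shuffled.iloc[n_test:n_test + n_val]
    train = shuffled.iloc[n_test + n_val:]
    return train, val, test

# Synthetic stand-in for the cleaned data: 100 rows, two severity classes.
demo = pd.DataFrame({"SEVERITYCODE": np.tile([1, 2], 50), "feature": np.arange(100)})
train_part, val_part, test_part = split_frame(demo)
print(len(train_part), len(val_part), len(test_part))
```

This simple shuffle does not preserve the class proportions in each part. If the severity classes turn out to be imbalanced, a stratified split (for example with scikit-learn's `train_test_split` and its `stratify` argument) may be a better fit.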
  
In evaluation and improvement, the performance of the models needs to be analyzed, and the results are used to build a better model.
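
As a rough sketch of what comparing a model against a baseline might look like, the example below scores a majority-class baseline and a set of made-up model predictions by accuracy. The helper names `accuracy` and `majority_baseline` are hypothetical, all labels are invented, and accuracy is only one possible metric; it is not necessarily the one this project will settle on.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def majority_baseline(y_train, n):
    """Predict the most frequent training label for each of n samples."""
    values, counts = np.unique(y_train, return_counts=True)
    return np.full(n, values[np.argmax(counts)])

# Made-up labels, for illustration only.
y_train = np.array([1, 1, 1, 2, 2])
y_val = np.array([1, 2, 1, 1, 2])

baseline_pred = majority_baseline(y_train, len(y_val))
model_pred = np.array([1, 2, 1, 2, 2])  # hypothetical output of some trained model

print("baseline:", accuracy(y_val, baseline_pred))
print("model:   ", accuracy(y_val, model_pred))
```

A model is only worth keeping if it clearly beats such a baseline on the validation part.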

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

**Read the data**

In [2]:
df = pd.read_csv("D:/Coursera_capstone/Data-Collisions.csv")



In [3]:
df.head()

| | SEVERITYCODE | X | Y | OBJECTID | INCKEY | COLDETKEY | REPORTNO | STATUS | ADDRTYPE | INTKEY | ... | ROADCOND | LIGHTCOND | PEDROWNOTGRNT | SDOTCOLNUM | SPEEDING | ST_COLCODE | ST_COLDESC | SEGLANEKEY | CROSSWALKKEY | HITPARKEDCAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | -122.323148 | 47.70314 | 1 | 1307 | 1307 | 3502005 | Matched | Intersection | 37475.0 | ... | Wet | Daylight | | | | 10 | Entering at angle | 0 | 0 | N |
| 1 | 1 | -122.347294 | 47.647172 | 2 | 52200 | 52200 | 2607959 | Matched | Block | | ... | Wet | Dark - Street Lights On | | 6354039.0 | | 11 | From same direction - both going straight - bo... | 0 | 0 | N |
| 2 | 1 | -122.33454 | 47.607871 | 3 | 26700 | 26700 | 1482393 | Matched | Block | | ... | Dry | Daylight | | 4323031.0 | | 32 | One parked--one moving | 0 | 0 | N |
| 3 | 1 | -122.334803 | 47.604803 | 4 | 1144 | 1144 | 3503937 | Matched | Block | | ... | Dry | Daylight | | | | 23 | From same direction - all others | 0 | 0 | N |
| 4 | 2 | -122.306426 | 47.545739 | 5 | 17700 | 17700 | 1807429 | Matched | Intersection | 34387.0 | ... | Wet | Daylight | | 4028032.0 | | 10 | Entering at angle | 0 | 0 | N |


In [4]:
df.shape

(194673, 38)

In [5]:
df.columns

Index(['SEVERITYCODE', 'X', 'Y', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO',
       'STATUS', 'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE',
       'EXCEPTRSNDESC', 'SEVERITYCODE.1', 'SEVERITYDESC', 'COLLISIONTYPE',
       'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'INCDATE',
       'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE', 'SDOT_COLDESC',
       'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND',
       'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE', 'ST_COLDESC',
       'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR'],
      dtype='object')

**This data set has 194673 instances and 38 columns: the target `SEVERITYCODE` and 37 other columns, some of which (such as `SEVERITYCODE.1` and `SEVERITYDESC`) appear to repeat the target rather than act as features. Some features are clearly unimportant and some values are missing or invalid, so the data needs to be cleaned before model building.**
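
A quick way to see how much cleaning is needed is to look at the share of missing values per column and at the balance of the target classes. The sketch below does this on a tiny made-up frame that reuses three column names from the data set. The same two calls could be run on `df`, but no results for the real data are shown or implied here.

```python
import numpy as np
import pandas as pd

# Made-up rows that reuse three column names from the collisions data.
toy = pd.DataFrame({
    "SEVERITYCODE": [2, 1, 1, 1],
    "ROADCOND": ["Wet", "Wet", "Dry", np.nan],
    "SPEEDING": [np.nan, np.nan, np.nan, "Y"],
})

# Fraction of missing values in each column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Relative frequency of each severity class.
class_share = toy["SEVERITYCODE"].value_counts(normalize=True)

print(missing_share)
print(class_share)
```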