# <center>Predicting the Severity of Car Accidents</center>

# 1. Introduction
This report is part of the __Applied Data Science Capstone__ project of the IBM Data Science Course. The aim of this project is to review what we have learned by solving a data analytics problem.

## 1.1 Background
Traffic accidents are an increasingly serious threat to public safety. When information lags, rescue organizations cannot learn the details of an accident in time, more and more vehicles become stuck on the road, and the situation gets worse.
Therefore, if we can predict the severity of an accident from existing relevant data (such as road conditions and weather), responders can prepare accordingly when an accident occurs and gain time for the next action. Real-time severity predictions can also help relieve traffic pressure.

## 1.2 Problem
In this project we will predict the severity of an accident from relevant data such as weather and road conditions.

## 1.3 Interest
The stakeholders of this project are drivers, who can use the predictions to drive more carefully, and traffic emergency departments, which can use them to be better prepared for possible accidents.

# 2. Data

## 2.1 Data Acquisition
I use the [National Collision Database](https://open.canada.ca/data/en/dataset/1eb9eba7-71d1-4b30-9fb1-30cbdab7e63a), a database containing all police-reported motor vehicle collisions on public roads in Canada. It provides selected variables (data elements) relating to fatal and injury collisions from 1999 to 2017. The data can be found [here](https://opendatatc.blob.core.windows.net/opendatatc/NCDB_1999_to_2017.csv) and the data dictionary [here](https://opendatatc.blob.core.windows.net/opendatatc/NCDB_Data_Dictionary.docx).

Our objective is to predict the severity of a collision from public data such as road conditions, weather, and traffic control. We will use the following fields:

| Field          | Description                              |
|----------------|:-----------------------------------------|
| C_WDAY         | Day of week                              |
| C_SEV          | Collision severity                       |
| C_VEHS         | Number of vehicles involved in collision |
| C_CONF         | Collision configuration                  |
| C_RCFG         | Roadway configuration                    |
| C_WTHR         | Weather condition                        |
| C_RSUR         | Road surface                             |
| C_RALN         | Road alignment                           |
| C_TRAF         | Traffic control                          |

The dependent variable is __C_SEV__; the remaining fields are the independent variables.
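
As a rough illustration of selecting only these fields, the sketch below reads a CSV with pandas. The use of pandas, the helper name `load_collisions`, and the two sample rows (including the extra `P_SEX` column) are assumptions made for the example; the report does not show its loading code.

```python
import io

import pandas as pd

# Fields listed in the table above.
FIELDS = ["C_WDAY", "C_SEV", "C_VEHS", "C_CONF", "C_RCFG",
          "C_WTHR", "C_RSUR", "C_RALN", "C_TRAF"]


def load_collisions(source):
    """Read only the selected fields, keeping every code as text."""
    # dtype=str preserves zero-padded codes such as "01" and markers such as "UU".
    return pd.read_csv(source, usecols=FIELDS, dtype=str)


# Tiny in-memory stand-in for the real CSV (hypothetical rows).
sample_csv = io.StringIO(
    "C_WDAY,C_SEV,C_VEHS,C_CONF,C_RCFG,C_WTHR,C_RSUR,C_RALN,C_TRAF,P_SEX\n"
    "1,2,02,21,01,1,1,1,18,M\n"
    "5,1,01,UU,02,3,2,Q,01,F\n"
)
collisions = load_collisions(sample_csv)
print(collisions.shape)
```

Reading every column as text is a deliberate choice here: it keeps the coded values intact until the cleaning steps decide how each one should be typed.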


### 2.1.1 Data Cleaning
#### Identify and handle missing values
In the dataset, missing values are coded as "Q", "U", "X", "QQ", "UU", or "XX". We replace them with NaN (Not a Number), the missing-value marker used by pandas and NumPy, for reasons of computational speed and convenience.
Because the number of rows with missing values is not significant, we can simply drop these rows.
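
A minimal sketch of this step, assuming the data is held in a pandas DataFrame whose codes are stored as text. The helper name `drop_missing` and the toy rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Missing-value codes named in the text above.
MISSING_MARKERS = ["Q", "U", "X", "QQ", "UU", "XX"]


def drop_missing(df):
    """Replace the dataset's missing-value codes with NaN, then drop those rows."""
    cleaned = df.replace(MISSING_MARKERS, np.nan)
    return cleaned.dropna().reset_index(drop=True)


# Toy rows: the second and third each contain one missing-value code.
raw = pd.DataFrame({
    "C_SEV": ["2", "1", "2", "2"],
    "C_WTHR": ["1", "U", "3", "1"],
    "C_CONF": ["21", "02", "QQ", "35"],
})
clean = drop_missing(raw)
print(len(raw), "->", len(clean))
```
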
#### Resampling
We can see that this is an imbalanced dataset, so we need to resample it first.

C_SEV  | Counts
---- | ----
2 | 4518716
1 | 79445

A widely adopted technique for dealing with highly unbalanced datasets is called __resampling__. It consists of removing samples from the majority class (under-sampling) and / or adding more examples from the minority class (over-sampling).

<img src="images/resampling.png" alt="Drawing" style="width: 550px;"/>

Because each class still contains a large number of samples, we use the __random under-sampling__ method.

After resampling, the dataset is balanced.

<img src="images/balanced-class.png" alt="Drawing" style="width: 300px;"/>
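
Random under-sampling can be sketched as follows, assuming a pandas DataFrame with the label in `C_SEV`. The helper name `random_undersample`, the seed, and the toy frame are assumptions for illustration, not the report's actual code.

```python
import pandas as pd


def random_undersample(df, target="C_SEV", seed=42):
    """Randomly keep as many rows of each class as the smallest class has."""
    n_minority = df[target].value_counts().min()
    parts = [
        group.sample(n=n_minority, random_state=seed)
        for _, group in df.groupby(target)
    ]
    # Concatenate the per-class samples and shuffle so the classes are interleaved.
    balanced = pd.concat(parts).sample(frac=1, random_state=seed)
    return balanced.reset_index(drop=True)


# Toy imbalanced frame: 8 rows of class 2 and 2 rows of class 1.
toy = pd.DataFrame({"C_SEV": [2] * 8 + [1] * 2, "C_VEHS": range(10)})
balanced = random_undersample(toy)
print(balanced["C_SEV"].value_counts().to_dict())
```

Under-sampling discards majority-class rows, which is affordable here only because the minority class is itself large.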

#### Correct data format
Only the variable *C_VEHS* is a continuous numerical variable; the others are all categorical. We therefore convert each column to its proper type to avoid problems later.
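
One possible way to apply these types with pandas is sketched below. The helper name `fix_types` and the choice of the pandas `category` dtype are assumptions; the report does not state which types it used.

```python
import pandas as pd

# Categorical fields from the field table; C_VEHS is the only numerical one.
CATEGORICAL = ["C_WDAY", "C_CONF", "C_RCFG", "C_WTHR", "C_RSUR", "C_RALN", "C_TRAF"]


def fix_types(df):
    """Return a copy with C_VEHS as an integer and the coded fields as categories."""
    out = df.copy()
    out["C_VEHS"] = pd.to_numeric(out["C_VEHS"]).astype("int64")
    for col in CATEGORICAL:
        if col in out.columns:
            out[col] = out[col].astype("category")
    return out


# Toy frame with text codes, as they would appear after reading the CSV as text.
frame = pd.DataFrame({"C_VEHS": ["02", "01", "03"], "C_WTHR": ["1", "3", "1"]})
typed = fix_types(frame)
print(typed.dtypes)
```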

#### Binning Data
The field __C_CONF__ has too many distinct values, so we bin them into 4 groups according to their descriptions in the data dictionary.

|value|description|
|-----|:---------|
|01-06| Single Vehicle in Motion|
|21-25| Two Vehicles in Motion - Same Direction of Travel|
|31-36| Two Vehicles in Motion - Different Direction of Travel|
|41| Two Vehicles - Hit a Parked Motor Vehicle|
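
The binning in the table above can be sketched as a small lookup function. The function name `bin_conf` is hypothetical, and the group labels reuse the feature names that this report lists for the encoded collision configuration.

```python
def bin_conf(code):
    """Map a C_CONF code such as "21" to one of the four groups in the table."""
    value = int(code)
    if 1 <= value <= 6:
        return "1-Vehicle"
    if 21 <= value <= 25:
        return "2-Vehicles-Same"
    if 31 <= value <= 36:
        return "2-Vehicles-Different"
    if value == 41:
        return "2-Vehicles-Parked"
    # Codes outside the table are rejected rather than silently grouped.
    raise ValueError(f"unexpected C_CONF code: {code!r}")


for code in ["04", "23", "35", "41"]:
    print(code, "->", bin_conf(code))
```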



## 3. Methodology
### 3.1 Exploratory Data Analysis

Let's look at the relationship between each individual feature and the severity using visualization.

#### Number of vehicles involved VS severity of collision
![image](images/01-vehs.png)

#### Day of week VS severity of collision
![image](images/02-wday.png)

As we can see, the day of the week does not affect the severity much.

#### Type of Collision VS severity of collision
![image](images/03-conf.png)

#### Type of Roadway Configuration VS severity of collision
![image](images/04-rcfg.png)

#### Weather condition VS severity of collision
![image](images/05-wthr.png)

#### Road surface VS severity of collision
![image](images/06-rsur.png)

#### Road alignment VS severity of collision
![image](images/08-raln.png)

#### Traffic control VS severity of collision
![image](images/07-traf.png)


## 3.2 Feature Selection

<h3>Conclusion: Important Variables</h3>
<p>We now have a better idea of what our data looks like and which variables are important to take into account when predicting the severity of an accident. We have narrowed it down to the following variables:</p>

Continuous numerical variables:
<ul>
    <li>C_VEHS</li>  
</ul>
    
Categorical variables:
<ul>
    <li>C_WDAY</li>
    <li>C_CONF</li>
    <li>C_RCFG</li>
    <li>C_WTHR</li>
    <li>C_RSUR</li>
    <li>C_RALN</li>
    <li>C_TRAF</li>
</ul>

### 3.3 Convert Categorical features to numerical values
#### One Hot Encoding
In this part we use one-hot encoding to convert categorical data to numerical values.

Features after one hot encoding:

- C_VEHS: 'C_VEHS'

- C_CONF: '1-Vehicle', '2-Vehicles-Same', '2-Vehicles-Different', '2-Vehicles-Parked'

- C_WDAY: 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'

- C_RCFG: 'Non-intersection', 'intersection-2public', 'intersection-parking', 'railroad-crossing', 'Bridge-overpass-viaduct', 'Tunnel-underpass', 'Passing-climbing-lane', 'Ramp', 'Traffic-circle', 'Express-lane'

- C_WTHR: 'Clear-sunny', 'Overcast-cloudy-no-precipitation', 'Raining', 'Snowing-not-drifting-snow', 'Freezing-rain-sleet-hail', 'Visibility-limitation', 'Strong wind'

- C_RSUR: 'Dry-normal', 'Wet', 'Snow', 'Slush-wet-snow', 'Icy', 'Sand-gravel-dirt', 'Muddy', 'Oil', 'Flooded'

- C_RALN: 'Straight-level', 'Straight-gradient', 'Curved-level', 'Curved-gradient', 'Top-hill-gradient', 'Bottom-hill-gradient'

- C_TRAF: 'fully-operational', 'flashing-mode', 'Stop-sign', 'Yield-sign', 'Warning-sign', 'Pedestrian-crosswalk', 'Police-officer', 'School-guard', 'school-crossing', 'Reduced-speed-zone', 'No-passing-sign', 'Markings-on-the-road', 'School-bus-stopped-flashing', 'Railway-crossing-signals-or-gate', 'Railway-crossing-signs', 'Control-device-not-specified', 'No-control-present'
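
As a small illustration of one-hot encoding, the sketch below uses `pandas.get_dummies` on a toy frame. Using pandas for this step is an assumption, and the generated column names keep the raw codes (for example `C_WTHR_1`); renaming them to descriptive labels like those listed above would be a separate step that is not shown.

```python
import pandas as pd

# Toy frame: one numerical column and two coded categorical columns.
frame = pd.DataFrame({
    "C_VEHS": [2, 1, 3],
    "C_WTHR": ["1", "3", "1"],
    "C_RSUR": ["1", "2", "2"],
})

# Each listed column is replaced by one 0/1 indicator column per distinct value.
encoded = pd.get_dummies(frame, columns=["C_WTHR", "C_RSUR"], dtype=int)
print(list(encoded.columns))
print(encoded)
```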

### 3.4 Classification Modeling and Evaluation
The dependent variable is labeled and binary, so classification models are a natural choice.

We split the processed dataset into training and test sets, build the models on the training set, and then use the test set to report the accuracy of each model. We use the following algorithms:

- K-Nearest Neighbors (KNN), using only the training set to find the best k value
- Decision Tree
- Support Vector Machine
- Logistic Regression
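
The workflow above might look roughly like the following sketch. It assumes scikit-learn, which the report does not name, and it runs on synthetic data from `make_classification` rather than the collision data. The 80/20 split, the k range of 1 to 15, the 5-fold cross-validation, and the otherwise default hyperparameters are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded feature matrix and the binary severity label.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Choose k by cross-validation on the training set only; the test set is untouched.
knn_search = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": list(range(1, 16))}, cv=5
)
knn_search.fit(X_train, y_train)
best_k = knn_search.best_params_["n_neighbors"]

models = {
    "KNN": KNeighborsClassifier(n_neighbors=best_k),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

# Fit each model on the training set and report accuracy on the held-out test set.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)
    print(f"{name}: {test_accuracy[name]:.3f}")
```

Selecting k inside the training set keeps the test set as an honest estimate of how the chosen model performs on unseen data.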

## Results 
I observed that the amount of data used makes little difference to the accuracy.
For example, when 20% of the rebalanced set is used, the evaluation metrics are:

| Algorithm          | Jaccard | F1-score | LogLoss |
|--------------------|---------|----------|---------|
| KNN                | 0.6645       | 0.6644        | NA      |
| Decision Tree      | 0.6916       | 0.6905        | NA      |
| SVM                | __0.7081__       | 0.7080        | NA      |
| LogisticRegression | 0.7047       | 0.7047        | 0.57    |

When 30% of the rebalanced set is used, the evaluation metrics are:

| Algorithm          | Jaccard | F1-score | LogLoss |
|--------------------|---------|----------|---------|
| KNN                | 0.6605       | 0.6604        | NA      |
| Decision Tree      | 0.6932       | 0.6911        | NA      |
| SVM                | __0.7063__       | 0.7063        | NA      |
| LogisticRegression | 0.7057       | 0.7057        | 0.58    |
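
For reference, the three metrics in the tables can be sketched from their standard binary definitions, as below. This is not necessarily how the reported numbers were computed: the report does not say which library or averaging variant it used, and the function names and toy labels here are hypothetical. Log loss needs predicted probabilities, which is presumably why it is reported only for logistic regression.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, fp, fn


def jaccard(y_true, y_pred, positive=1):
    """Jaccard index of the positive class: TP / (TP + FP + FN)."""
    tp, fp, fn = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fp + fn)


def f1_score(y_true, y_pred, positive=1):
    """F1 of the positive class: 2TP / (2TP + FP + FN)."""
    tp, fp, fn = confusion_counts(y_true, y_pred, positive)
    return 2 * tp / (2 * tp + fp + fn)


def log_loss(y_true, prob_positive, positive=1, eps=1e-15):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for t, p in zip(y_true, prob_positive):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += math.log(p) if t == positive else math.log(1 - p)
    return -total / len(y_true)


# Toy labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
probs = [0.9, 0.4, 0.2, 0.6, 0.8]  # hypothetical predicted P(class 1)
print(jaccard(y_true, y_pred), f1_score(y_true, y_pred), log_loss(y_true, probs))
```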


## Discussion
In this project I've only tried the most commonly used classification models.
However, more work remains on feature selection. It could be improved in several ways, such as:
- [Feature importance scores](https://machinelearningmastery.com/calculate-feature-importance-with-python/) can provide insight into the dataset. The relative scores can highlight which features may be most relevant to the target, and the converse, which features are the least relevant. This may be interpreted by a domain expert and could be used as the basis for gathering more or different data.
- [Search for Categorical Correlation](https://towardsdatascience.com/the-search-for-categorical-correlation-a1cf7f1888c9)

- [Find Correlation among multiple categorical variables](https://stackoverflow.com/questions/48035381/correlation-among-multiple-categorical-variables-pandas)
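
As one concrete example of measuring association between two categorical variables, the sketch below computes Cramér's V from a chi-squared statistic. Choosing Cramér's V is an assumption made for illustration; it was not used in this project, and the function name and toy data are hypothetical.

```python
import math
from collections import Counter


def cramers_v(x, y):
    """Cramér's V between two equal-length sequences of categorical values."""
    n = len(x)
    pair_counts = Counter(zip(x, y))
    x_counts = Counter(x)
    y_counts = Counter(y)

    # Chi-squared statistic: compare observed cell counts with the counts
    # expected if the two variables were independent.
    chi2 = 0.0
    for x_value, x_total in x_counts.items():
        for y_value, y_total in y_counts.items():
            expected = x_total * y_total / n
            observed = pair_counts.get((x_value, y_value), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(x_counts), len(y_counts)) - 1
    if k == 0:
        return 0.0  # a constant variable carries no association
    return math.sqrt(chi2 / (n * k))


weather = ["clear", "clear", "rain", "rain"]
print(cramers_v(weather, [2, 2, 1, 1]))  # severity fully determined by weather
print(cramers_v(weather, [2, 1, 2, 1]))  # severity unrelated to weather
```

Values close to 0 suggest that a feature tells us little about the other variable, while values close to 1 suggest a strong association.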

In addition, more features, such as visibility, should be included; they would contribute more information to the prediction and help build a better model.

## Conclusion 

In this project I applied what I learned in this course, following the standard methodology of data analysis to find a suitable model for predicting the severity of a traffic accident.
I carried out data acquisition, data preprocessing, and exploratory data visualization, and finally compared classifiers to find the best one.
However, I encountered problems with selecting and reducing the categorical features. In the future, we can try the methods mentioned in the __Discussion__ section to build a better model.