# Predicting Future Traffic Accident Severity Using a Countrywide Traffic Accident Dataset

## Prerequisites
* Dataset: https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents
* Jupyter Notebook environment
* Python version: `>=3.9,<3.13`
* Python packages: `pandas`, `numpy`, `matplotlib`, `seaborn`


Download the dataset from Kaggle into the `./data` folder (the path the loading code below reads from) and rename the file to `US_Accidents_March23.csv`.

In [1]:
import pandas as pd

df = pd.read_csv('./data/US_Accidents_March23.csv')
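
If the full CSV is too large to load comfortably, one option is to read only a subset of rows and columns while exploring. The sketch below is an illustration and not part of the original notebook: the helper name `load_accidents` is hypothetical, and a small in-memory CSV stands in for the real file so the example runs anywhere. It relies only on the `nrows` and `usecols` parameters of `pd.read_csv`.

```python
import io

import pandas as pd


def load_accidents(source, nrows=None, usecols=None):
    """Read the accidents CSV, optionally limiting rows and columns.

    `source` can be a file path such as './data/US_Accidents_March23.csv'
    or any file-like object. This helper is hypothetical.
    """
    return pd.read_csv(source, nrows=nrows, usecols=usecols)


# Tiny in-memory stand-in for the real file (assumed data, for illustration).
sample_csv = io.StringIO(
    "ID,Severity,Start_Time\n"
    "A-1,3,2016-02-08 05:46:00\n"
    "A-2,2,2016-02-08 06:07:59\n"
    "A-3,2,2016-02-08 06:49:27\n"
)

# Load only the first two rows and two of the three columns.
preview = load_accidents(sample_csv, nrows=2, usecols=["ID", "Severity"])
print(preview.shape)
```

With the real dataset, the same call would take the file path in place of `sample_csv`.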

Preview the first 5 rows of the data:

In [2]:
df.head()

| | ID | Source | Severity | Start_Time | End_Time | Start_Lat | Start_Lng | End_Lat | End_Lng | Distance(mi) | ... | Roundabout | Station | Stop | Traffic_Calming | Traffic_Signal | Turning_Loop | Sunrise_Sunset | Civil_Twilight | Nautical_Twilight | Astronomical_Twilight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A-1 | Source2 | 3 | 2016-02-08 05:46:00 | 2016-02-08 11:00:00 | 39.865147 | -84.058723 | NaN | NaN | 0.01 | ... | False | False | False | False | False | False | Night | Night | Night | Night |
| 1 | A-2 | Source2 | 2 | 2016-02-08 06:07:59 | 2016-02-08 06:37:59 | 39.928059 | -82.831184 | NaN | NaN | 0.01 | ... | False | False | False | False | False | False | Night | Night | Night | Day |
| 2 | A-3 | Source2 | 2 | 2016-02-08 06:49:27 | 2016-02-08 07:19:27 | 39.063148 | -84.032608 | NaN | NaN | 0.01 | ... | False | False | False | False | True | False | Night | Night | Day | Day |
| 3 | A-4 | Source2 | 3 | 2016-02-08 07:23:34 | 2016-02-08 07:53:34 | 39.747753 | -84.205582 | NaN | NaN | 0.01 | ... | False | False | False | False | False | False | Night | Day | Day | Day |
| 4 | A-5 | Source2 | 2 | 2016-02-08 07:39:07 | 2016-02-08 08:09:07 | 39.627781 | -84.188354 | NaN | NaN | 0.01 | ... | False | False | False | False | True | False | Day | Day | Day | Day |


### Handle Missing Values

First, we inspect the percentage of missing values for each feature:

In [3]:
missing = pd.DataFrame(df.isnull().sum()).reset_index()
missing.columns = ['Feature', 'Missing_Percent(%)']
missing['Missing_Percent(%)'] = missing['Missing_Percent(%)'].apply(lambda x: x / df.shape[0] * 100)
missing.loc[missing['Missing_Percent(%)']>0,:]

| | Feature | Missing_Percent(%) |
|---|---|---|
| 7 | End_Lat | 44.029355 |
| 8 | End_Lng | 44.029355 |
| 10 | Description | 6.5e-05 |
| 11 | Street | 0.140637 |
| 12 | City | 0.003274 |
| 15 | Zipcode | 0.024779 |
| 17 | Timezone | 0.10103 |
| 18 | Airport_Code | 0.292881 |
| 19 | Weather_Timestamp | 1.555666 |
| 20 | Temperature(F) | 2.120143 |


Several features have missing values, and we should handle them before moving on to the next step.
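
The missing-value percentages can also be computed more compactly. The sketch below is an alternative to the cell above, not the notebook's own method: because `isna()` returns booleans, taking the column-wise mean gives the fraction of missing entries directly. The helper name `missing_percent` and the toy DataFrame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def missing_percent(frame):
    """Return the percentage of missing values per column.

    Only columns with at least one missing value are kept,
    sorted from most to least missing.
    """
    pct = frame.isna().mean() * 100
    return pct[pct > 0].sort_values(ascending=False)


# Toy data standing in for the accidents DataFrame (assumed values).
toy = pd.DataFrame({
    "End_Lat": [np.nan, np.nan, 39.1, 40.2],
    "City": ["Dayton", None, "Columbus", "Akron"],
    "Severity": [3, 2, 2, 3],
})

result = missing_percent(toy)
print(result)
```

On the real data, `missing_percent(df)` should report the same percentages as the table above, indexed by feature name instead of by position.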

### Drop Features

Some features provide little or no useful information for accident prediction. First, we drop the features that are obviously uninformative: `ID`, `Source`, `Description`, `Country`, and `Turning_Loop` (`Country` and `Turning_Loop` each contain only one class).

In [None]:
df.drop(['ID', 'Source', 'Description', 'Country', 'Turning_Loop'], axis=1, inplace=True)
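
The claim that a column contains only one class can be checked with `DataFrame.nunique()`. The sketch below uses a hypothetical helper, `single_value_columns`, on a toy DataFrame whose values are assumed for illustration. On the real data the check would have to run before the `drop` call above, since the columns are gone afterwards.

```python
import pandas as pd


def single_value_columns(frame):
    """List columns that contain at most one distinct non-null value."""
    counts = frame.nunique(dropna=True)
    return counts[counts <= 1].index.tolist()


# Toy data standing in for the accidents DataFrame (assumed values).
toy = pd.DataFrame({
    "Country": ["US", "US", "US"],
    "Turning_Loop": [False, False, False],
    "Severity": [3, 2, 2],
})

constant_cols = single_value_columns(toy)
print(constant_cols)
```

A column that this check flags carries no signal for separating severity levels, which is the reason given above for dropping `Country` and `Turning_Loop`.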