In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import statistics 
from datetime import datetime
sns.set(style="darkgrid")
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



# Project Overview # 
There were 33,654 fatal motor vehicle crashes in the United States in 2018, in which 36,560 people died. That is roughly one fatal crash every 15 minutes. Every traffic accident is an individual event with personal consequences. Nevertheless, certain risk factors increase the probability of an accident. As the NHTSA data show, the risk factors differ from state to state and lead to different risk levels. For example, the share of SUVs or the proportion of urban areas influences accident risk and severity.

# Problem Statement # 

As shown, different risk factors drive the accident probability in each state. Combining this fact with the idea behind crime prediction leads to predicting the state in which a traffic accident occurred. Knowing which risk factors characterise each state, targeted measures can be defined to reduce the local risk. For a national or state agency this can be a daily tool to spot trends early and help prevent individual tragedies.

# EDA #

For memory and training-time reasons, a sample of 500,000 records will be used.

In [None]:
df_org=pd.read_csv('/kaggle/input/us-accidents/US_Accidents_Dec19.csv')
df=df_org.sample(500000, random_state =21)

### Feature Overview ###
A first overview of the features:

In [None]:
df.info()

The dataset contains one entry per accident, with a range of features. The features can be grouped into geography, accident-focused, weather and time.

**Geo-Features:**
* Start_Lat                float64
* Start_Lng                float64
* End_Lat                  float64
* End_Lng                  float64
* Number                   float64
* Street                   object
* County                   object
* State                    object
* Zipcode                  object
* Country                  object
* Airport_Code             object

**Accident-Features:**
* Severity                 int64
* Distance(mi)             float64
* Description              object
* Amenity                  bool
* Bump                     bool
* Crossing                 bool
* Give_Way                 bool
* Junction                 bool
* No_Exit                  bool
* Railway                  bool
* Roundabout               bool
* Station                  bool
* Stop                     bool
* Traffic_Calming          bool
* Traffic_Signal           bool
* Turning_Loop             bool

**Weather-Condition:**
* Weather_Timestamp        object
* Temperature(F)           float64
* Wind_Chill(F)            float64
* Humidity(%)              float64
* Pressure(in)             float64
* Visibility(mi)           float64
* Wind_Direction           object
* Wind_Speed(mph)          float64
* Precipitation(in)        float64
* Weather_Condition        object

**Time-Features:**
* Start_Time               object
* End_Time                 object
* Timezone                 object
* Sunrise_Sunset           object
* Civil_Twilight           object
* Nautical_Twilight        object
* Astronomical_Twilight    object

### Detail View

**Geography features (input variables)**

* Start_Lat/Start_Lng: starting point of accident, floating values of coordinates
* End_Lat/End_Lng: end point of accident, floating values of coordinates
* Number/State/Street/County/Zipcode/Country: address of the accident’s location
* Airport_Code: string value of nearest airport code

These features will not be included in the prediction model, because the state should not be predicted from address features.


**Geography features (target variable)**

The target variable for the multiclass classification is imbalanced. This has to be considered later in the prediction.

In [None]:
plt.figure(figsize=(20, 5))
plt.title('Distribution of recordings per state')
sns.countplot(x=df['State'], data=df)
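
To make the imbalance concrete, the class shares and matching class weights can be computed directly. The following is a minimal sketch on a hypothetical toy target (the series `states` and its counts are invented for illustration and are not taken from the dataset); it uses scikit-learn's `compute_class_weight` with the `'balanced'` heuristic.

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy target: three states with very different record counts
states = pd.Series(['CA'] * 70 + ['TX'] * 25 + ['MT'] * 5, name='State')

# Share of each class in the data
shares = states.value_counts(normalize=True)
print(shares)

# 'balanced' weights are n_samples / (n_classes * class_count),
# so rare classes receive larger weights
classes = np.array(sorted(states.unique()))
weights = compute_class_weight(class_weight='balanced',
                               classes=classes,
                               y=states.to_numpy())
class_weight = dict(zip(classes, weights))
print(class_weight)
```

One possible use, as an assumption rather than part of this notebook's pipeline: `RandomForestClassifier` accepts `class_weight='balanced'` or a dictionary like the one above, provided the keys match the labels used in `y`.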

**Accident-focused features**

* Severity: The severity of an accident is given as a number from 1 to 4, with a mean value of 2.5 across all records

In [None]:
plt.figure(figsize=(10, 4))
plt.subplot(111)
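# histplot replaces the deprecated distplot; discrete=True gives one bar per severity level
sns.histplot(df['Severity'], discrete=True)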
plt.title('Severity distribution as histogram')

* Distance(mi): The distance between the start and end point of the accident. Most records are close to zero miles, but some accidents are outliers

In [None]:
plt.figure(figsize=(10, 4))
plt.subplot(111)
# histplot replaces the deprecated distplot
sns.histplot(df['Distance(mi)'], bins=50)
plt.title('Distance distribution as histogram')

* Description: Free-text description of the accident, for example “Two right lane blocked and right hand shoulder blocked due to accident on I-270 Northbound after I-55”
* Multiple features such as "Junction", "Stop", ...: Boundary conditions of the accident, for example whether there was a junction nearby (0 for negative / 1 for positive)

In [None]:
bool_features=['Amenity',
              'Bump',
              'Crossing',
              'Give_Way',
              'Junction',
              'No_Exit',
              'Railway',
              'Roundabout',
              'Station',
              'Stop',
              'Traffic_Calming',
              'Traffic_Signal',
              'Turning_Loop']

# Convert the boolean flags to 0/1 integers in one step
df[bool_features] = df[bool_features].astype(int)
    

In [None]:
plt.figure(figsize=(20, 10))
# One count plot per road-feature flag (first nine features, 4x3 grid)
for pos, feature in enumerate(bool_features[:9], start=1):
    plt.subplot(4, 3, pos)
    # Pass the column as x=...; recent seaborn versions no longer accept it positionally
    sns.countplot(x=df[feature])
    plt.title(feature)
    plt.xlabel('')

In [None]:
plt.figure(figsize=(20, 10))
# Count plots for the remaining four road-feature flags
for pos, feature in enumerate(bool_features[9:], start=1):
    plt.subplot(4, 3, pos)
    sns.countplot(x=df[feature])
    plt.title(feature)
    plt.xlabel('')

**Weather features**

* Weather_Timestamp: Date/time of the weather observation, used for API-supported matching of weather data
* Temperature: Temperature in °F as float value
* Wind_Chill: Wind chill temperature in °F as float value
* Humidity: Humidity in % as float value
* Pressure: Pressure in inches as float value
* Visibility: Visibility in miles as float value
* Wind_Direction: String value of the wind direction, cleaned and converted to int
* Wind_Speed: Wind speed in mph as float value
* Precipitation: Amount of precipitation in inches as float value
* Weather_Condition: Description as string like “mostly cloudy”, converted to int

Data-Handling:
* For missing values, the median of the feature will be used
* Outliers: Records beyond the following limits will not be considered for prediction
* Temperature > 134°F: 134°F is the highest temperature ever recorded in the US
* Wind speed > 253 mph: 253 mph is the highest surface wind gust ever recorded worldwide (excluding tornadoes)
* Pressure: The lowest barometric pressure ever recorded was 25.69 inHg. Many records in this dataset have values below that, so the pressure feature will not be used in the prediction model
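
The two handling rules above can be tried out on a small frame first. This is a minimal sketch on hypothetical values (the frame `toy` is invented for illustration); it fills gaps with the column median and then drops rows beyond the physical limits.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records with gaps and two implausible values
toy = pd.DataFrame({
    'Temperature(F)': [70.0, np.nan, 55.0, 170.0, 90.0],
    'Wind_Speed(mph)': [5.0, 8.0, np.nan, 12.0, 300.0],
})

# Fill each gap with the median of its column (NaN values are ignored by median)
toy = toy.fillna(toy.median())

# Keep only rows within the physical limits named above
mask = (toy['Temperature(F)'] <= 134) & (toy['Wind_Speed(mph)'] <= 253)
clean = toy[mask]
print(clean)
```

Because the median is robust, the extreme values barely shift the fill value even though they are still present when it is computed.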


In [None]:
weather_features=['Weather_Timestamp',
                  'Temperature(F)',
                  'Wind_Chill(F)',
                  'Humidity(%)',
                  'Pressure(in)',
                  'Visibility(mi)',
                  'Wind_Direction',
                  'Wind_Speed(mph)',
                  'Precipitation(in)',
                  'Weather_Condition']
print('Types')  
print(df[weather_features].info())
print('\n')
print('Count nan-values')
print(df[weather_features].isna().sum())

**wind direction**

Cleaning data and factorising wind direction to numbers

In [None]:
df['Wind_Direction'].unique()

In [None]:
# Harmonise abbreviated and upper-case spellings in a single pass
df['Wind_Direction']=df['Wind_Direction'].replace({'E': 'East',
                                                   'N': 'North',
                                                   'W': 'West',
                                                   'S': 'South',
                                                   'CALM': 'Calm'})

In [None]:
df['Wind_Direction'].unique()

In [None]:
factor_wd = pd.factorize(df['Wind_Direction'])
df['Wind_Direction'] = factor_wd[0]
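
For readers unfamiliar with `pd.factorize`, the sketch below shows what it returns on a hypothetical toy series (the values in `wind` are invented for illustration). Note that missing values receive the code -1 rather than a category of their own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value
wind = pd.Series(['Calm', 'North', np.nan, 'Calm', 'West'])

# codes: one integer per row; uniques: the category behind each code
codes, uniques = pd.factorize(wind)
print(codes)
print(uniques)

# Map the codes back to names; -1 marks a missing value
decoded = [uniques[c] if c >= 0 else None for c in codes]
print(decoded)
```

Since the integer codes carry no natural order, a tree-based model treats them as arbitrary thresholds; this is a property of the encoding, not a claim about how it affects the results here.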

**weather condition**

Cleaning data and factorising weather condition to numbers

In [None]:
df['Weather_Condition'].unique()

In [None]:
factor_wc = pd.factorize(df['Weather_Condition'])
df['Weather_Condition'] = factor_wc[0]

**other weather features**

The continuous values will be taken to the ML-model. The null-values will be filled with median values

In [None]:
acc_features=['Temperature(F)',
              'Wind_Chill(F)',
              'Humidity(%)',
              'Pressure(in)',
              'Visibility(mi)',
              'Wind_Speed(mph)',
              'Precipitation(in)']

for feature in acc_features:
    df[feature]=df[feature].fillna(df[feature].median())

In [None]:
plt.figure(figsize=(20, 12))
plt.subplot(341)
sns.boxplot(x=df['Temperature(F)'])
plt.title('Temperature(F)')
plt.xlabel('')
plt.subplot(342)
sns.boxplot(x=df['Wind_Chill(F)'])
plt.title('Wind_Chill(F)')
plt.xlabel('')
plt.subplot(343)
sns.boxplot(x=df['Humidity(%)'])
plt.title('Humidity(%)')
plt.xlabel('')
plt.subplot(344)
sns.boxplot(x=df['Pressure(in)'])
plt.title('Pressure(in)')
plt.xlabel('')
plt.subplot(345)
sns.boxplot(x=df['Visibility(mi)'])
plt.title('Visibility(mi)')
plt.xlabel('')
plt.subplot(346)
sns.boxplot(x=df['Wind_Speed(mph)'])
plt.title('Wind_Speed(mph)')
plt.xlabel('')
plt.subplot(347)
sns.boxplot(x=df['Precipitation(in)'])
plt.title('Precipitation(in)')
plt.xlabel('')


In [None]:
df=df[df['Wind_Speed(mph)'] <= 253]
df=df[df['Temperature(F)'] <= 134]

In [None]:
plt.figure(figsize=(20, 12))
plt.subplot(211)
g=sns.countplot(x=df['Wind_Direction'])
plt.title('Wind_Direction')
plt.xlabel('')
# Label the integer codes with the original category names.
# Note: the labels only line up if every code occurs and no -1 (missing) code is present.
g.set_xticklabels(factor_wd[1].tolist())
plt.subplot(212)
o=sns.countplot(x=df['Weather_Condition'])
plt.title('Weather_Condition')
plt.xlabel('')
o.set_xticklabels(factor_wc[1].tolist(), size=10, rotation=90)
plt.show()

**Time features**

* Start_Time/End_Time: Date and time at which the traffic accident started/ended
* Timezone: US timezone, one of Pacific/Mountain/Central/Eastern. The timezone is used to normalize the additional feature Start_Time_hour
* Start_Time_hour: Hour of the start time, extracted as an additional feature and normalized to the Central timezone (Pacific Time +2 / Mountain Time +1 / Eastern Time -1)
* Start_Time_weekday: Weekday of the start time, extracted as an additional feature
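
As an alternative to fixed hour offsets, the normalization to Central time can also be sketched with pandas' timezone support, which wraps around midnight automatically. The frame `toy` and the column name `Start_Time_hour_central` below are hypothetical and only illustrate the idea; the sketch assumes the timestamps are local wall-clock times in the timezone given per row.

```python
import pandas as pd

# Hypothetical toy records: local start time plus the timezone it was recorded in
toy = pd.DataFrame({
    'Start_Time': pd.to_datetime(['2019-01-15 23:30',
                                  '2019-01-15 00:15',
                                  '2019-01-15 08:00',
                                  '2019-01-15 12:00']),
    'Timezone': ['US/Pacific', 'US/Eastern', 'US/Mountain', 'US/Central'],
})

# Convert each timezone group to Central time and keep the hour
central_hour = pd.Series(index=toy.index, dtype='float64')
for tz, part in toy.groupby('Timezone'):
    local = part['Start_Time'].dt.tz_localize(tz)
    central_hour.loc[part.index] = local.dt.tz_convert('US/Central').dt.hour

toy['Start_Time_hour_central'] = central_hour.astype(int)
print(toy)
```

In this sketch 23:30 Pacific becomes hour 1 and 00:15 Eastern becomes hour 23, so all hours stay in the range 0 to 23.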


In [None]:
df['Start_Time']=pd.to_datetime(df['Start_Time'])
df['End_Time']=pd.to_datetime(df['End_Time'])
df['Start_Time_weekday']=df['Start_Time'].dt.dayofweek
df['Start_Time_hour']=df['Start_Time'].dt.hour

In [None]:
df['Timezone'].unique()

In [None]:
# Shift the local start hour to Central time in one vectorized step.
# Rows with a missing or Central timezone are not shifted; modulo 24 keeps
# the hour in the range 0-23 when the shift crosses midnight.
tz_offset={'US/Eastern': -1, 'US/Central': 0, 'US/Mountain': 1, 'US/Pacific': 2}
offset=df['Timezone'].map(tz_offset).fillna(0).astype(int)
df['Start_Time_hour']=(df['Start_Time_hour']+offset) % 24

In [None]:
plt.figure(figsize=(20,8))
plt.subplot(211)
sns.histplot(df['Start_Time_weekday'], discrete=True)  # histplot replaces the deprecated distplot
plt.title('Start_Time_weekday')
plt.xlabel('')
plt.subplot(212)
sns.histplot(df['Start_Time_hour'], discrete=True)  # histplot replaces the deprecated distplot
plt.title('Start_Time_hour')
plt.xlabel('')

**Sunrise / Twilight data**

The Sunrise/Twilight columns are categorical features with some NaN values. The four columns carry nearly the same information, so only the Sunrise_Sunset column is kept. Its NaN values are filled with the most frequent value (Day), and the categories are then converted to numbers: Day to 0 and Night to 1.

In [None]:
print('Number of records with day')
print(df[df['Sunrise_Sunset']=='Day']['ID'].count())
print('Number of records with night')
print(df[df['Sunrise_Sunset']=='Night']['ID'].count())

# Filling the nans with day

df['Sunrise_Sunset']=df['Sunrise_Sunset'].fillna('Day')
print('nans left')
print(df['Sunrise_Sunset'].isna().sum())

In [None]:
df['Sunrise_Sunset']=df['Sunrise_Sunset'].replace('Day',0)
df['Sunrise_Sunset']=df['Sunrise_Sunset'].replace('Night',1)

In [None]:
plt.figure(figsize=(10, 6))
plt.subplot(211)
g=sns.countplot(x=df['Sunrise_Sunset'])
plt.title('Sunrise_Sunset')
plt.xlabel('')
g.set_xticklabels(['Day','Night'])

# Dataset preparation #

The problem statement describes predicting the state from accident data. Using all available features may lead to the state being predicted from, for example, temperature (because California is hotter than Michigan), although a higher temperature may also be a genuine risk factor for accidents. 
For this reason, two separate datasets will be created for the prediction model: one with all relevant features and one with accident-focused features only. 


In [None]:
feat_columns=['State',
              'Severity',
              'Distance(mi)', 
              'Temperature(F)',
              'Wind_Chill(F)',
              'Humidity(%)',
              'Wind_Direction',
              'Weather_Condition',
              'Visibility(mi)',
              'Wind_Speed(mph)',
              'Precipitation(in)',
              'Start_Time_hour',
              'Start_Time_weekday',
              'Sunrise_Sunset',
              'Amenity',
              'Bump',
              'Crossing',
              'Give_Way',
              'Junction',
              'No_Exit',
              'Railway',
              'Roundabout',
              'Station',
              'Stop',
              'Traffic_Calming',
              'Traffic_Signal',
              'Turning_Loop']

# accident focused features 
featO_columns=['State',
              'Severity',
              'Distance(mi)',
              'Start_Time_hour',
              'Start_Time_weekday',
              'Sunrise_Sunset',
              'Weather_Condition',
              'Amenity',
              'Bump',
              'Crossing',
              'Give_Way',
              'Junction',
              'No_Exit',
              'Railway',
              'Roundabout',
              'Station',
              'Stop',
              'Traffic_Calming',
              'Traffic_Signal',
              'Turning_Loop']




# Last check    
df_feat=df[feat_columns]    
print(df_feat.isna().sum())
print(df_feat.info())


# Prediction with whole dataset #

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
import joblib  # sklearn.externals.joblib was removed from scikit-learn; joblib is a standalone package
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

In [None]:
df_feat=df[feat_columns].copy()  # copy so that the State column can be overwritten safely

In [None]:
# Convert state to numbers
factor = pd.factorize(df_feat['State'])
df_feat['State'] = factor[0]

In [None]:
df_feat['State'].unique()

In [None]:
#Splitting the data into independent and dependent variables
target='State'
y = df_feat[target]
X = df_feat.drop(columns=target)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 21)

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

**Evaluation functions**

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import f1_score
from sklearn import preprocessing
from sklearn.metrics import roc_auc_score

def multiclass_roc_auc_score(y_test, y_pred, average="weighted"):
    # One-hot encode true and predicted labels, then average the per-class AUC.
    # Note: this scores hard class predictions, not predicted probabilities.
    lb = preprocessing.LabelBinarizer()
    lb.fit(y_test)
    y_test = lb.transform(y_test)
    y_pred = lb.transform(y_pred)
    return roc_auc_score(y_test, y_pred, average=average)

def multiclass_f1_score(y_test, y_pred, average="weighted"):
    # F1 score averaged over all classes
    return f1_score(y_test, y_pred, average=average)
    
def multiclass_classification_report(y_test, y_pred):
    # Per-state precision, recall and F1, labelled with the original state names
    print(classification_report(y_test, y_pred, target_names=factor[1].tolist()))
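
The ROC AUC helper above works on hard class predictions. A common alternative is to score the predicted class probabilities instead, which `roc_auc_score` supports for multiclass targets via `multi_class='ovr'`. The following is a minimal sketch on synthetic data; `X_toy`, `y_toy` and the model settings are assumptions for illustration and not part of this notebook's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic three-class problem standing in for the state target
X_toy, y_toy = make_classification(n_samples=600, n_features=8,
                                   n_informative=5, n_classes=3,
                                   random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25,
                                          stratify=y_toy, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# One probability column per class; each row sums to 1
proba = model.predict_proba(X_te)

# One-vs-rest AUC per class, averaged with class-frequency weights
auc = roc_auc_score(y_te, proba, multi_class='ovr', average='weighted')
print(auc)
```

Scoring probabilities uses the full ranking information of the classifier, whereas hard labels reduce each ROC curve to a single operating point.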

**RandomForestClassifier**

In [None]:
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 42)
classifier.fit(X_train, y_train)

In [None]:
definitions=factor[1]
state_num= len(definitions)
y_pred = classifier.predict(X_test)

In [None]:
feature_imp = pd.Series(classifier.feature_importances_,index=X.columns).sort_values(ascending=False)

k=10
# Plot the k most important features
sns.barplot(x=feature_imp[:k], y=feature_imp.index[:k])
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()

In [None]:
print('ROC AUC Score')
print(multiclass_roc_auc_score(y_test, y_pred))
print('F1 Score')
print(multiclass_f1_score(y_test, y_pred))

In [None]:
# State names in the order of their integer codes (avoids shadowing the built-in list)
state_names=factor[1].tolist()

report = classification_report(y_test, y_pred, output_dict=True, target_names=state_names)
df_report = pd.DataFrame(report).transpose()
df_report.to_csv('dataset_whole_report.csv')

# Prediction with accident focused dataset #

In [None]:
df_feat=df[featO_columns].copy()  # copy so that the State column can be overwritten safely

In [None]:
factor = pd.factorize(df_feat['State'])
df_feat['State'] = factor[0]
target='State'
y = df_feat[target]
X = df_feat.drop(columns=target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 21)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

**RandomForestClassifier**

In [None]:
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 42)
classifier.fit(X_train, y_train)

In [None]:
definitions=factor[1]
state_num= len(definitions)
y_pred = classifier.predict(X_test)

In [None]:
feature_imp = pd.Series(classifier.feature_importances_,index=X.columns).sort_values(ascending=False)

k=10
# Plot the k most important features
sns.barplot(x=feature_imp[:k], y=feature_imp.index[:k])
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()

In [None]:
print('ROC AUC Score')
print(multiclass_roc_auc_score(y_test, y_pred))
print('F1 Score')
print(multiclass_f1_score(y_test, y_pred))

In [None]:
# State names in the order of their integer codes (avoids shadowing the built-in list)
state_names=factor[1].tolist()

report = classification_report(y_test, y_pred, output_dict=True, target_names=state_names)
df_report = pd.DataFrame(report).transpose()
df_report.to_csv('dataset_acc_report.csv')

# Benchmark model #

As a benchmark, the proportion of each state in the dataset will be used. 

In [None]:
df_prop_state=pd.DataFrame(columns=['State','Count','Prop'])
df_prop_state

In [None]:
list_state=df['State'].unique()
list_state

In [None]:
# Count the records per state and derive each state's share of all records
counts=df['State'].value_counts()
df_prop_state=pd.DataFrame({'State': counts.index,
                            'Count': counts.values,
                            'Prop': counts.values/len(df.index)})

In [None]:
df_prop_state=df_prop_state.set_index('State')

In [None]:
df_prop_state
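
The same kind of proportion-based benchmark can also be expressed with scikit-learn's `DummyClassifier`. The sketch below uses a hypothetical toy target (`y_toy` and its class counts are invented for illustration); with `strategy='prior'` the fitted `class_prior_` attribute holds the share of each class, and a guesser that ignores the features is expected to reach a per-class precision equal to that share.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy target with three imbalanced classes
y_toy = np.array([0] * 70 + [1] * 25 + [2] * 5)
# The dummy model ignores the features, so a placeholder column is enough
X_dummy = np.zeros((len(y_toy), 1))

baseline = DummyClassifier(strategy='prior').fit(X_dummy, y_toy)

# Share of each class in the training data
print(baseline.class_prior_)
```

This is a sketch of the idea only; the comparison in this notebook uses the proportions computed in the table above.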

# Results discussion # 

## Random forest result: whole vs. accident-focused dataset ##

The best performance was achieved with the Random Forest classifier. Below, the top features of the prediction with the whole dataset and with the accident-focused dataset are compared. When all possible features are used, the good result rests on features such as “Pressure”, which is probably not relevant as a risk factor for accidents. On closer inspection, the time at which a traffic accident happens is one state-dependent risk feature. Together with distance, weather condition and weekday of the accident, it forms a state-dependent feature set for predicting accidents.    

## Results to benchmark model ##

The table below uses the classification reports of the random forest prediction with the whole and the accident-focused dataset to compare the results with the benchmark model. 
As the F1 scores show, prediction quality is not directly related to a state's proportion in the dataset. For example, in the accident-focused dataset Oregon achieved an F1 score of 0.255 (2.4% proportion) compared to Illinois with 0.127 (2.9% proportion).
This effect is also visible when comparing the prediction results with the benchmark model. Using the whole dataset, the prediction beats the benchmark by about 39 percentage points. With the accident-focused dataset the gain shrinks to about 8.9 percentage points. In this case some states performed more than 25 percentage points better than the benchmark model, and no state has a negative delta between the accident-focused dataset and the benchmark. 


In [None]:
df_report_whole=pd.read_csv('/kaggle/working/dataset_whole_report.csv', index_col=0, header=0)
df_report_acc=pd.read_csv('/kaggle/working/dataset_acc_report.csv', index_col=0, header=0)

In [None]:
df_report_whole=df_report_whole.rename(columns={'precision':'whole_df_precision'})
df_report_acc=df_report_acc.rename(columns={'precision':'acc_df_precision'})
df_compare = pd.concat([df_report_whole, df_report_acc,df_prop_state], axis=1, join='inner')
df_compare=df_compare[['whole_df_precision','acc_df_precision','Prop']]
df_compare['DELTA whole_df vs prop']=df_compare['whole_df_precision']-df_compare['Prop']
df_compare['DELTA acc vs prop']=df_compare['acc_df_precision']-df_compare['Prop']
df_compare

Average improvement to benchmark model

In [None]:
df_compare['DELTA acc vs prop'].mean()

# Conclusion

Starting from the historical data of 3 million traffic accidents, each one an individual fate, I tried to predict the state of each traffic accident. After data exploration, 26 features per record remained to capture the combination of risk factors per state. On two datasets, one with all relevant features and one with accident-focused features only, I tried different algorithms to predict the state. Compared to the benchmark model, an improvement of 8.9 percentage points was achieved, and in some states over 25. 
With this prediction it is now possible to learn from the historical data in order to identify and avoid traffic accident risks. Taking the numbers for Oregon, with over 71,000 traffic accidents in this dataset, it is now possible to understand the risk factors for 18,000 of them, and to try to prevent more than 18,000 individual fates.   